CC: kbuild-all(a)lists.01.org
TO: Dennis Li <Dennis.Li(a)amd.com>
CC: Alex Deucher <alexander.deucher(a)amd.com>
CC: Hawking Zhang <Hawking.Zhang(a)amd.com>
tree: https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git master
head: 7b287a5c6ac518c415a258f2aa7b1ebb25c263d2
commit: df9c8d1aa278c435c30a69b8f2418b4a52fcb929 [11519/13035] drm/amdgpu: fix system hang issue during GPU reset
:::::: branch date: 6 hours ago
:::::: commit date: 3 days ago
config: i386-randconfig-c001-20200730 (attached as .config)
compiler: gcc-9 (Debian 9.3.0-14) 9.3.0
If you fix the issue, kindly add the following tags as appropriate
Reported-by: kernel test robot <lkp(a)intel.com>
Reported-by: Julia Lawall <julia.lawall(a)lip6.fr>
coccinelle warnings: (new ones prefixed by >>)
>> drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:4304:3-9: preceding lock on line 4293
   drivers/gpu/drm/amd/amdgpu/amdgpu_device.c:4460:1-7: preceding lock on line 4293
--
>> drivers/gpu/drm/amd/amdgpu/amdgpu_xgmi.c:447:1-7: preceding lock on line 410
# https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git/commi...
git remote add linux-next https://git.kernel.org/pub/scm/linux/kernel/git/next/linux-next.git
git remote update linux-next
git checkout df9c8d1aa278c435c30a69b8f2418b4a52fcb929
vim +4304 drivers/gpu/drm/amd/amdgpu/amdgpu_device.c
3f12acc8d6d4b2 Evan Quan 2020-04-21 4236
26bc534094ed45 Andrey Grodzovsky 2018-11-22 4237 /**
26bc534094ed45 Andrey Grodzovsky 2018-11-22 4238 * amdgpu_device_gpu_recover - reset the asic and recover scheduler
26bc534094ed45 Andrey Grodzovsky 2018-11-22 4239 *
26bc534094ed45 Andrey Grodzovsky 2018-11-22 4240 * @adev: amdgpu device pointer
26bc534094ed45 Andrey Grodzovsky 2018-11-22 4241 * @job: which job trigger hang
26bc534094ed45 Andrey Grodzovsky 2018-11-22 4242 *
26bc534094ed45 Andrey Grodzovsky 2018-11-22 4243 * Attempt to reset the GPU if it has hung (all asics).
26bc534094ed45 Andrey Grodzovsky 2018-11-22 4244 * Attempt to do soft-reset or full-reset and reinitialize Asic
26bc534094ed45 Andrey Grodzovsky 2018-11-22 4245 * Returns 0 for success or an error on failure.
26bc534094ed45 Andrey Grodzovsky 2018-11-22 4246 */
26bc534094ed45 Andrey Grodzovsky 2018-11-22 4247
26bc534094ed45 Andrey Grodzovsky 2018-11-22 4248 int amdgpu_device_gpu_recover(struct amdgpu_device *adev,
26bc534094ed45 Andrey Grodzovsky 2018-11-22 4249 struct amdgpu_job *job)
26bc534094ed45 Andrey Grodzovsky 2018-11-22 4250 {
1d721ed679db18 Andrey Grodzovsky 2019-04-18 4251 struct list_head device_list, *device_list_handle = NULL;
7dd8c205eaedfa Evan Quan 2020-04-16 4252 bool need_full_reset = false;
7dd8c205eaedfa Evan Quan 2020-04-16 4253 bool job_signaled = false;
26bc534094ed45 Andrey Grodzovsky 2018-11-22 4254 struct amdgpu_hive_info *hive = NULL;
26bc534094ed45 Andrey Grodzovsky 2018-11-22 4255 struct amdgpu_device *tmp_adev = NULL;
1d721ed679db18 Andrey Grodzovsky 2019-04-18 4256 int i, r = 0;
bb5c7235eaafb4 Wenhui Sheng 2020-07-13 4257 bool need_emergency_restart = false;
3f12acc8d6d4b2 Evan Quan 2020-04-21 4258 bool audio_suspended = false;
26bc534094ed45 Andrey Grodzovsky 2018-11-22 4259
bb5c7235eaafb4 Wenhui Sheng 2020-07-13 4260 /**
bb5c7235eaafb4 Wenhui Sheng 2020-07-13 4261 * Special case: RAS triggered and full reset isn't supported
bb5c7235eaafb4 Wenhui Sheng 2020-07-13 4262 */
bb5c7235eaafb4 Wenhui Sheng 2020-07-13 4263 need_emergency_restart = amdgpu_ras_need_emergency_restart(adev);
bb5c7235eaafb4 Wenhui Sheng 2020-07-13 4264
d5ea093eebf022 Andrey Grodzovsky 2019-08-22 4265 /*
d5ea093eebf022 Andrey Grodzovsky 2019-08-22 4266 * Flush RAM to disk so that after reboot
d5ea093eebf022 Andrey Grodzovsky 2019-08-22 4267 * the user can read log and see why the system rebooted.
d5ea093eebf022 Andrey Grodzovsky 2019-08-22 4268 */
bb5c7235eaafb4 Wenhui Sheng 2020-07-13 4269 if (need_emergency_restart && amdgpu_ras_get_context(adev)->reboot) {
d5ea093eebf022 Andrey Grodzovsky 2019-08-22 4270 DRM_WARN("Emergency reboot.");
d5ea093eebf022 Andrey Grodzovsky 2019-08-22 4271
d5ea093eebf022 Andrey Grodzovsky 2019-08-22 4272 ksys_sync_helper();
d5ea093eebf022 Andrey Grodzovsky 2019-08-22 4273 emergency_restart();
d5ea093eebf022 Andrey Grodzovsky 2019-08-22 4274 }
d5ea093eebf022 Andrey Grodzovsky 2019-08-22 4275
b823821f2244ad Le Ma 2019-11-27 4276 dev_info(adev->dev, "GPU %s begin!\n",
bb5c7235eaafb4 Wenhui Sheng 2020-07-13 4277 need_emergency_restart ? "jobs stop":"reset");
26bc534094ed45 Andrey Grodzovsky 2018-11-22 4278
26bc534094ed45 Andrey Grodzovsky 2018-11-22 4279 /*
1d721ed679db18 Andrey Grodzovsky 2019-04-18 4280 * Here we trylock to avoid chain of resets executing from
1d721ed679db18 Andrey Grodzovsky 2019-04-18 4281 * either trigger by jobs on different adevs in XGMI hive or jobs on
1d721ed679db18 Andrey Grodzovsky 2019-04-18 4282 * different schedulers for same device while this TO handler is running.
1d721ed679db18 Andrey Grodzovsky 2019-04-18 4283 * We always reset all schedulers for device and all devices for XGMI
1d721ed679db18 Andrey Grodzovsky 2019-04-18 4284 * hive so that should take care of them too.
26bc534094ed45 Andrey Grodzovsky 2018-11-22 4285 */
df9c8d1aa278c4 Dennis Li 2020-07-08 4286 hive = amdgpu_get_xgmi_hive(adev, false);
df9c8d1aa278c4 Dennis Li 2020-07-08 4287 if (hive) {
df9c8d1aa278c4 Dennis Li 2020-07-08 4288 if (atomic_cmpxchg(&hive->in_reset, 0, 1) != 0) {
1d721ed679db18 Andrey Grodzovsky 2019-04-18 4289 DRM_INFO("Bailing on TDR for s_job:%llx, hive: %llx as another already in progress",
0b2d2c2eecf27f Andrey Grodzovsky 2019-08-27 4290 job ? job->base.id : -1, hive->hive_id);
26bc534094ed45 Andrey Grodzovsky 2018-11-22 4291 return 0;
1d721ed679db18 Andrey Grodzovsky 2019-04-18 4292 }
df9c8d1aa278c4 Dennis Li 2020-07-08 @4293 mutex_lock(&hive->hive_lock);
df9c8d1aa278c4 Dennis Li 2020-07-08 4294 }
26bc534094ed45 Andrey Grodzovsky 2018-11-22 4295
26bc534094ed45 Andrey Grodzovsky 2018-11-22 4296 /*
9e94d22c008585 Evan Quan 2020-04-16 4297 * Build list of devices to reset.
9e94d22c008585 Evan Quan 2020-04-16 4298 * In case we are in XGMI hive mode, resort the device list
9e94d22c008585 Evan Quan 2020-04-16 4299 * to put adev in the 1st position.
26bc534094ed45 Andrey Grodzovsky 2018-11-22 4300 */
9e94d22c008585 Evan Quan 2020-04-16 4301 INIT_LIST_HEAD(&device_list);
9e94d22c008585 Evan Quan 2020-04-16 4302 if (adev->gmc.xgmi.num_physical_nodes > 1) {
9e94d22c008585 Evan Quan 2020-04-16 4303 if (!hive)
9e94d22c008585 Evan Quan 2020-04-16 @4304 return -ENODEV;
9e94d22c008585 Evan Quan 2020-04-16 4305 if (!list_is_first(&adev->gmc.xgmi.head, &hive->device_list))
9e94d22c008585 Evan Quan 2020-04-16 4306 list_rotate_to_front(&adev->gmc.xgmi.head, &hive->device_list);
26bc534094ed45 Andrey Grodzovsky 2018-11-22 4307 device_list_handle = &hive->device_list;
26bc534094ed45 Andrey Grodzovsky 2018-11-22 4308 } else {
26bc534094ed45 Andrey Grodzovsky 2018-11-22 4309 list_add_tail(&adev->gmc.xgmi.head, &device_list);
26bc534094ed45 Andrey Grodzovsky 2018-11-22 4310 device_list_handle = &device_list;
26bc534094ed45 Andrey Grodzovsky 2018-11-22 4311 }
26bc534094ed45 Andrey Grodzovsky 2018-11-22 4312
12ffa55da60f83 Andrey Grodzovsky 2019-08-30 4313 /* block all schedulers and reset given job's ring */
12ffa55da60f83 Andrey Grodzovsky 2019-08-30 4314 list_for_each_entry(tmp_adev, device_list_handle, gmc.xgmi.head) {
df9c8d1aa278c4 Dennis Li 2020-07-08 4315 if (!amdgpu_device_lock_adev(tmp_adev)) {
9e94d22c008585 Evan Quan 2020-04-16 4316 DRM_INFO("Bailing on TDR for s_job:%llx, as another already in progress",
9e94d22c008585 Evan Quan 2020-04-16 4317 job ? job->base.id : -1);
df9c8d1aa278c4 Dennis Li 2020-07-08 4318 r = 0;
df9c8d1aa278c4 Dennis Li 2020-07-08 4319 goto skip_recovery;
7c6e68c777f109 Andrey Grodzovsky 2019-09-13 4320 }
7c6e68c777f109 Andrey Grodzovsky 2019-09-13 4321
3f12acc8d6d4b2 Evan Quan 2020-04-21 4322 /*
3f12acc8d6d4b2 Evan Quan 2020-04-21 4323 * Try to put the audio codec into suspend state
3f12acc8d6d4b2 Evan Quan 2020-04-21 4324 * before gpu reset started.
3f12acc8d6d4b2 Evan Quan 2020-04-21 4325 *
3f12acc8d6d4b2 Evan Quan 2020-04-21 4326 * Due to the power domain of the graphics device
3f12acc8d6d4b2 Evan Quan 2020-04-21 4327 * is shared with AZ power domain. Without this,
3f12acc8d6d4b2 Evan Quan 2020-04-21 4328 * we may change the audio hardware from behind
3f12acc8d6d4b2 Evan Quan 2020-04-21 4329 * the audio driver's back. That will trigger
3f12acc8d6d4b2 Evan Quan 2020-04-21 4330 * some audio codec errors.
3f12acc8d6d4b2 Evan Quan 2020-04-21 4331 */
3f12acc8d6d4b2 Evan Quan 2020-04-21 4332 if (!amdgpu_device_suspend_display_audio(tmp_adev))
3f12acc8d6d4b2 Evan Quan 2020-04-21 4333 audio_suspended = true;
3f12acc8d6d4b2 Evan Quan 2020-04-21 4334
9e94d22c008585 Evan Quan 2020-04-16 4335 amdgpu_ras_set_error_query_ready(tmp_adev, false);
9e94d22c008585 Evan Quan 2020-04-16 4336
52fb44cf30fc6b Evan Quan 2020-04-16 4337 cancel_delayed_work_sync(&tmp_adev->delayed_init_work);
52fb44cf30fc6b Evan Quan 2020-04-16 4338
9e94d22c008585 Evan Quan 2020-04-16 4339 if (!amdgpu_sriov_vf(tmp_adev))
9e94d22c008585 Evan Quan 2020-04-16 4340 amdgpu_amdkfd_pre_reset(tmp_adev);
9e94d22c008585 Evan Quan 2020-04-16 4341
fdafb3597a2cc4 Evan Quan 2019-06-26 4342 /*
fdafb3597a2cc4 Evan Quan 2019-06-26 4343 * Mark these ASICs to be reseted as untracked first
fdafb3597a2cc4 Evan Quan 2019-06-26 4344 * And add them back after reset completed
fdafb3597a2cc4 Evan Quan 2019-06-26 4345 */
fdafb3597a2cc4 Evan Quan 2019-06-26 4346 amdgpu_unregister_gpu_instance(tmp_adev);
fdafb3597a2cc4 Evan Quan 2019-06-26 4347
a2f63ee8b5eabd Evan Quan 2020-04-16 4348 amdgpu_fbdev_set_suspend(tmp_adev, 1);
565d1941557756 Evan Quan 2020-03-11 4349
f1c1314be42971 xinhui pan 2019-07-04 4350 /* disable ras on ALL IPs */
bb5c7235eaafb4 Wenhui Sheng 2020-07-13 4351 if (!need_emergency_restart &&
b823821f2244ad Le Ma 2019-11-27 4352 amdgpu_device_ip_need_full_reset(tmp_adev))
f1c1314be42971 xinhui pan 2019-07-04 4353 amdgpu_ras_suspend(tmp_adev);
f1c1314be42971 xinhui pan 2019-07-04 4354
1d721ed679db18 Andrey Grodzovsky 2019-04-18 4355 for (i = 0; i < AMDGPU_MAX_RINGS; ++i) {
1d721ed679db18 Andrey Grodzovsky 2019-04-18 4356 struct amdgpu_ring *ring = tmp_adev->rings[i];
1d721ed679db18 Andrey Grodzovsky 2019-04-18 4357
1d721ed679db18 Andrey Grodzovsky 2019-04-18 4358 if (!ring || !ring->sched.thread)
1d721ed679db18 Andrey Grodzovsky 2019-04-18 4359 continue;
1d721ed679db18 Andrey Grodzovsky 2019-04-18 4360
0b2d2c2eecf27f Andrey Grodzovsky 2019-08-27 4361 drm_sched_stop(&ring->sched, job ? &job->base : NULL);
7c6e68c777f109 Andrey Grodzovsky 2019-09-13 4362
bb5c7235eaafb4 Wenhui Sheng 2020-07-13 4363 if (need_emergency_restart)
7c6e68c777f109 Andrey Grodzovsky 2019-09-13 4364 amdgpu_job_stop_all_jobs_on_sched(&ring->sched);
1d721ed679db18 Andrey Grodzovsky 2019-04-18 4365 }
1d721ed679db18 Andrey Grodzovsky 2019-04-18 4366 }
1d721ed679db18 Andrey Grodzovsky 2019-04-18 4367
bb5c7235eaafb4 Wenhui Sheng 2020-07-13 4368 if (need_emergency_restart)
7c6e68c777f109 Andrey Grodzovsky 2019-09-13 4369 goto skip_sched_resume;
7c6e68c777f109 Andrey Grodzovsky 2019-09-13 4370
1d721ed679db18 Andrey Grodzovsky 2019-04-18 4371 /*
1d721ed679db18 Andrey Grodzovsky 2019-04-18 4372 * Must check guilty signal here since after this point all old
1d721ed679db18 Andrey Grodzovsky 2019-04-18 4373 * HW fences are force signaled.
1d721ed679db18 Andrey Grodzovsky 2019-04-18 4374 *
1d721ed679db18 Andrey Grodzovsky 2019-04-18 4375 * job->base holds a reference to parent fence
1d721ed679db18 Andrey Grodzovsky 2019-04-18 4376 */
1d721ed679db18 Andrey Grodzovsky 2019-04-18 4377 if (job && job->base.s_fence->parent &&
7dd8c205eaedfa Evan Quan 2020-04-16 4378 dma_fence_is_signaled(job->base.s_fence->parent)) {
1d721ed679db18 Andrey Grodzovsky 2019-04-18 4379 job_signaled = true;
1d721ed679db18 Andrey Grodzovsky 2019-04-18 4380 dev_info(adev->dev, "Guilty job already signaled, skipping HW reset");
1d721ed679db18 Andrey Grodzovsky 2019-04-18 4381 goto skip_hw_reset;
1d721ed679db18 Andrey Grodzovsky 2019-04-18 4382 }
1d721ed679db18 Andrey Grodzovsky 2019-04-18 4383
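
Note for triage: the check reports a return that is reachable after a mutex_lock() with no unlock in between. It follows statement order and does not appear to reason about the value of hive. In the excerpt above, the lock on line 4293 is taken only when hive is non-NULL, while the return on line 4304 is taken only when hive is NULL, so whether each report is a real leak needs review against the full function (line 4460 is not shown here). The sketch below is a user-space illustration only, not the driver code and not a proposed patch. toy_mutex, toy_hive, and recover_sketch() are hypothetical names. It shows the common single-exit goto idiom, where every path after the lock passes through one label that drops it.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Hypothetical stand-in for a kernel mutex; tracks whether it is held. */
struct toy_mutex { int held; };

static void toy_lock(struct toy_mutex *m)   { assert(m->held == 0); m->held = 1; }
static void toy_unlock(struct toy_mutex *m) { assert(m->held == 1); m->held = 0; }

/* Hypothetical stand-in for the hive object. */
struct toy_hive { struct toy_mutex lock; };

/*
 * Single-exit shape: every path after the lock goes through "out",
 * and the unlock is guarded by the same condition as the lock.
 */
static int recover_sketch(struct toy_hive *hive, int multi_node)
{
	int r = 0;

	if (hive)
		toy_lock(&hive->lock);

	if (multi_node && !hive) {
		r = -ENODEV;	/* jump to the common exit instead of returning here */
		goto out;
	}

	/* ... reset work would happen here ... */

out:
	if (hive)
		toy_unlock(&hive->lock);
	return r;
}
```

With this shape there is no early return between the lock and the unlock, so a statement-order check has nothing to flag and the lock balance is easy to audit by eye.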
:::::: The code at line 4304 was first introduced by commit
:::::: 9e94d22c008585815f32630ee7d0d28c4ec12bb7 drm/amdgpu: optimize the gpu reset for XGMI setup V2
:::::: TO: Evan Quan <evan.quan(a)amd.com>
:::::: CC: Alex Deucher <alexander.deucher(a)amd.com>
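
For context on the guard that precedes the flagged lock: the comment at lines 4280-4284 and the atomic_cmpxchg() call on line 4288 describe letting only one reset run per hive, with later callers bailing out. A minimal user-space sketch of that idea follows, using C11 atomics rather than the kernel API. toy_hive_state, reset_try_begin(), and reset_end() are hypothetical names. Where the real driver clears in_reset is not shown in this excerpt, so reset_end() is an assumption. Note that the kernel's atomic_cmpxchg() returns the old value, whereas the C11 call used here returns a success flag.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-hive reset flag. */
struct toy_hive_state { atomic_int in_reset; };

/* Returns true only for the caller that flips the flag from 0 to 1. */
static bool reset_try_begin(struct toy_hive_state *h)
{
	int expected = 0;

	return atomic_compare_exchange_strong(&h->in_reset, &expected, 1);
}

/* Assumed counterpart: clears the flag once the reset has finished. */
static void reset_end(struct toy_hive_state *h)
{
	assert(atomic_load(&h->in_reset) == 1);	/* must pair with a successful begin */
	atomic_store(&h->in_reset, 0);
}
```

A caller that gets false from reset_try_begin() would log and return without touching the lock, which mirrors the early return on line 4291.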
---
0-DAY CI Kernel Test Service, Intel Corporation
https://lists.01.org/hyperkitty/list/kbuild-all@lists.01.org