drm/msm: GPU recovery, IRQ, and state capture fixes #1365
@veereshbagale please refer to previous PRs for the commit message format, and please correct the FROMLIST, signoff, and Link.

Also, the kernel checker is complaining about "CHECK: Alignment should match open parenthesis". Please cross-check whether that checker failure is valid for this code.
During recovery, it is not safe to retire the hung submit before we recover the GPU. Retiring the submit triggers BO free and that can result in GPU pagefaults since the GPU may be actively accessing those BOs. To fix this, retire the submits after gpu recovery is complete in recover_worker().

Fixes: 1a370be ("drm/msm: restart queued submits after hang")
Signed-off-by: Veeresh Bagale <vbagale@qti.qualcomm.com>
Signed-off-by: Jie Zhang <jie.zhang@oss.qualcomm.com>
Signed-off-by: Akhil P Oommen <akhilpo@oss.qualcomm.com>
Link: https://lore.kernel.org/linux-arm-msm/20260605-assorted-fixes-june-v1-2-2caa04f7287c@oss.qualcomm.com

The GPUCC register list for A663 is incorrect, which can cause out-of-bounds register access during GPU state capture. Update it to use the correct register ranges.

Fixes: 5773cce ("drm/msm/a6xx: Add support for A663")
Signed-off-by: Veeresh Bagale <vbagale@qti.qualcomm.com>
Signed-off-by: Jie Zhang <jie.zhang@oss.qualcomm.com>
Signed-off-by: Akhil P Oommen <akhilpo@oss.qualcomm.com>
Link: https://lore.kernel.org/linux-arm-msm/20260605-assorted-fixes-june-v1-3-2caa04f7287c@oss.qualcomm.com

A621 uses an incorrect GPUCC register list during state capture. The existing list matches A623/A663. Rename it accordingly and add a dedicated A621 GPUCC register list.

Fixes: 11cdb81 ("drm/msm/a6xx: Fix gpucc register block for A621")
Signed-off-by: Veeresh Bagale <vbagale@qti.qualcomm.com>
Signed-off-by: Jie Zhang <jie.zhang@oss.qualcomm.com>
Signed-off-by: Akhil P Oommen <akhilpo@oss.qualcomm.com>
Link: https://lore.kernel.org/linux-arm-msm/20260605-assorted-fixes-june-v1-4-2caa04f7287c@oss.qualcomm.com
4c7696b to d359d29
Once a hang is triggered by the msm_recovery test, the gpu error irq remains asserted and triggers an interrupt storm. In the worst case, this IRQ storm lands on the CPU core where the hangcheck timer is scheduled, blocking it from running. This eventually leads to CPU watchdog timeouts. To fix this, mask the gpu error irqs during msm_recovery test and enable them back during the recovery.

Fixes: 5edf275 ("drm/msm: Add debugfs to disable hw err handling")
Signed-off-by: Veeresh Bagale <vbagale@qti.qualcomm.com>
Signed-off-by: Jie Zhang <jie.zhang@oss.qualcomm.com>
Signed-off-by: Akhil P Oommen <akhilpo@oss.qualcomm.com>
Link: https://lore.kernel.org/linux-arm-msm/20260605-assorted-fixes-june-v1-5-2caa04f7287c@oss.qualcomm.com

get_pid_task() increments the task reference count, but the corresponding put_task_struct() was missing in the else branch, leaking a reference on every GPU hang recovery.

Fixes: 25654a1 ("drm/msm: Update global fault counter when faulty process has already ended")
Signed-off-by: Veeresh Bagale <vbagale@qti.qualcomm.com>
Signed-off-by: Jie Zhang <jie.zhang@oss.qualcomm.com>
Signed-off-by: Akhil P Oommen <akhilpo@oss.qualcomm.com>
Link: https://lore.kernel.org/linux-arm-msm/20260605-assorted-fixes-june-v1-6-2caa04f7287c@oss.qualcomm.com
d359d29 to f79b7b1
Bug fixes for the Adreno GPU driver covering recovery correctness, IRQ handling, and state capture.
drm/msm: Recover HW before retire hung submit
Retiring a hung submit before GPU recovery could trigger page faults since the GPU may still be accessing those BOs. Reorder to recover first, then retire.
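The ordering argument can be sketched with a small userspace model. This is not the driver code: `toy_gpu`, `toy_retire_submit()`, `toy_recover_hw()` and the fault counter are hypothetical names invented for illustration, and the model simply assumes that a hung GPU may still touch the submit's BO until the hardware has been recovered.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model; not the msm driver's data structures. */
struct toy_gpu {
	bool hung;       /* GPU is stuck but may still access memory */
	bool bo_mapped;  /* BO referenced by the hung submit is still mapped */
	int pagefaults;  /* faults the GPU would have raised */
};

/* Model one opportunity for the GPU to touch the BO. */
static void toy_tick(struct toy_gpu *gpu)
{
	if (gpu->hung && !gpu->bo_mapped)
		gpu->pagefaults++;
}

/* Retiring the submit frees (unmaps) its BO. */
static void toy_retire_submit(struct toy_gpu *gpu)
{
	gpu->bo_mapped = false;
	toy_tick(gpu);
}

/* Recovery resets the hardware, so it stops accessing memory. */
static void toy_recover_hw(struct toy_gpu *gpu)
{
	gpu->hung = false;
	toy_tick(gpu);
}

/* Old order: retire, then recover. */
static int toy_faults_retire_first(void)
{
	struct toy_gpu gpu = { .hung = true, .bo_mapped = true, .pagefaults = 0 };

	toy_retire_submit(&gpu);
	toy_recover_hw(&gpu);
	return gpu.pagefaults;
}

/* New order: recover, then retire. */
static int toy_faults_recover_first(void)
{
	struct toy_gpu gpu = { .hung = true, .bo_mapped = true, .pagefaults = 0 };

	toy_recover_hw(&gpu);
	toy_retire_submit(&gpu);
	return gpu.pagefaults;
}
```

In this model the retire-first order records a fault because the BO disappears while the GPU is still hung, while the recover-first order records none.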
drm/msm/a6xx: Fix A663 GPUCC register list for state capture
A663 was incorrectly using A621's GPUCC register list, risking out-of-bounds register access during GPU state capture.
drm/msm/a6xx: Fix A621 GPUCC register list for state capture
A621 was sharing an incorrect register list with A623/A663. Adds a dedicated A621 GPUCC register list and renames the existing one to a623_gpucc_reg.
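To illustrate why a wrong register list can lead to out-of-bounds access, here is a minimal sketch that assumes a register list is stored as a flat array of inclusive (start, end) pairs. The array contents, `toy_reg_count()` and `toy_regs_in_bounds()` are hypothetical and the offsets are made up; they are not the real GPUCC ranges for any Adreno part.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical offsets for illustration only; not real GPUCC registers. */
static const uint32_t toy_gpucc_regs[] = {
	0x0000, 0x0003,  /* 4 registers */
	0x0010, 0x0010,  /* 1 register  */
	0x0040, 0x0047,  /* 8 registers */
};

#define TOY_ARRAY_SIZE(a) (sizeof(a) / sizeof((a)[0]))

/* Total number of registers a capture would read from the list. */
static size_t toy_reg_count(const uint32_t *regs, size_t n)
{
	size_t count = 0;

	if (n % 2)
		return 0; /* malformed list: pairs must be complete */

	for (size_t i = 0; i < n; i += 2)
		count += regs[i + 1] - regs[i] + 1;

	return count;
}

/* True if every range lies inside a block of block_size registers. */
static bool toy_regs_in_bounds(const uint32_t *regs, size_t n,
			       uint32_t block_size)
{
	if (n % 2)
		return false;

	for (size_t i = 0; i < n; i += 2) {
		if (regs[i] > regs[i + 1] || regs[i + 1] >= block_size)
			return false;
	}

	return true;
}
```

A list written for a larger block passes the check against that block's size but fails against a smaller one, which is the kind of mismatch that arises when one GPU reuses another GPU's list.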
drm/msm/a6xx: Fix IRQ storm during msm_recovery test
When a hang is triggered via msm_recovery debugfs, the GPU error IRQ stays asserted causing an interrupt storm that can block the hangcheck timer and trigger CPU watchdog timeouts. Mask error IRQs during the test and re-enable them at recovery time.
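The effect of masking can be sketched with a toy model of a level-triggered interrupt line. Everything here (`toy_irq`, `TOY_ERR_IRQ`, `toy_poll()`, `toy_storm()`) is hypothetical; the model assumes the handler does not clear the error source while the test has hardware error handling disabled, so the line stays asserted until recovery resets the GPU.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_ERR_IRQ (1u << 0) /* hypothetical error interrupt bit */

/* Hypothetical toy model of an interrupt status/mask pair. */
struct toy_irq {
	uint32_t status;   /* asserted sources */
	uint32_t mask;     /* enabled sources */
	unsigned handled;  /* number of handler invocations */
};

/* A level-triggered line fires whenever a source is asserted and enabled. */
static void toy_poll(struct toy_irq *irq)
{
	if (irq->status & irq->mask)
		irq->handled++;
}

/* Returns how many times the handler ran across the test and recovery. */
static unsigned toy_storm(bool mask_during_test, unsigned polls)
{
	struct toy_irq irq = { .status = 0, .mask = TOY_ERR_IRQ, .handled = 0 };

	if (mask_during_test)
		irq.mask &= ~TOY_ERR_IRQ;  /* mask error IRQs for the test */

	irq.status |= TOY_ERR_IRQ;         /* the hang asserts the error source */

	for (unsigned i = 0; i < polls; i++)
		toy_poll(&irq);

	irq.status = 0;                    /* recovery resets the hardware */
	irq.mask |= TOY_ERR_IRQ;           /* re-enable error IRQs */
	toy_poll(&irq);

	return irq.handled;
}
```

Without masking, the handler count grows with every poll for as long as the hang lasts; with masking, it stays at zero and the source is enabled again once recovery has cleared it.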
drm/msm: Fix task_struct reference leak in recover_worker (0006)
get_pid_task() increments the refcount but put_task_struct() was missing in the else branch, leaking a reference on every GPU hang.
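The leak pattern, a get that is balanced on only one branch, can be sketched as follows. The names (`toy_task`, `toy_get_task()`, `toy_put_task()`, `toy_handle_hang()`) and the `exiting` branch condition are hypothetical stand-ins; the real condition in recover_worker() is not shown in this conversation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical toy model of a reference-counted task. */
struct toy_task {
	int refcount;
	bool exiting; /* stand-in for whatever selects the else branch */
};

/* Lookup that takes a reference, in the spirit of get_pid_task(). */
static struct toy_task *toy_get_task(struct toy_task *t)
{
	if (!t)
		return NULL;
	t->refcount++;
	return t;
}

/* Drops a reference, in the spirit of put_task_struct(). */
static void toy_put_task(struct toy_task *t)
{
	t->refcount--;
}

/* One hang-handling pass; 'fixed' selects whether the else branch puts. */
static void toy_handle_hang(struct toy_task *t, bool fixed)
{
	struct toy_task *task = toy_get_task(t);

	if (!task)
		return;

	if (!task->exiting) {
		/* ... use the task ... */
		toy_put_task(task);
	} else {
		/* ... alternate path ... */
		if (fixed)
			toy_put_task(task); /* the put that was missing */
	}
}

/* Reference count left on a task after a number of hangs. */
static int toy_refs_after_hangs(bool exiting, bool fixed, int hangs)
{
	struct toy_task t = { .refcount = 1, .exiting = exiting };

	for (int i = 0; i < hangs; i++)
		toy_handle_hang(&t, fixed);

	return t.refcount;
}
```

With the put missing on the else branch, the count grows by one per hang; with the put on both branches, it returns to its starting value.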