drm/msm: GPU recovery, IRQ, and state capture fixes #1365

Open
veereshbagale wants to merge 5 commits into qualcomm-linux:tech/mm/gpu from veereshbagale:tech/mm/gpu

@veereshbagale

Bug fixes for Adreno GPU driver covering recovery correctness, IRQ handling, and state capture.

  • drm/msm: Recover HW before retire hung submit
    Retiring a hung submit before GPU recovery could trigger page faults since the GPU may still be accessing those BOs. Reorder to recover first, then retire.

  • drm/msm/a6xx: Fix A663 GPUCC register list for state capture
    A663 was incorrectly using A621's GPUCC register list, risking out-of-bounds register access during GPU state capture.

  • drm/msm/a6xx: Fix A621 GPUCC register list for state capture
    A621 was sharing an incorrect register list with A623/A663. Adds a dedicated A621 GPUCC register list and renames the existing one to a623_gpucc_reg.

  • drm/msm/a6xx: Fix IRQ storm during msm_recovery test
    When a hang is triggered via msm_recovery debugfs, the GPU error IRQ stays asserted causing an interrupt storm that can block the hangcheck timer and trigger CPU watchdog timeouts. Mask error IRQs during the test and re-enable them at recovery time.

  • drm/msm: Fix task_struct reference leak in recover_worker (0006)
    get_pid_task() increments the refcount but put_task_struct() was missing in the else branch, leaking a reference on every GPU hang.

@qcomlnxci requested review from a team and quic-rajeshk, and removed the request for a team, June 16, 2026 06:26
@qlijarvis

PR #1365 — validate-patch

PR: #1365

Verdict | Issues | Detailed Report
⚠️ | 2 | Full report

Final Summary

  1. Lore link present: Not provided in agent output
  2. Lore link matches PR commits: Not provided in agent output
  3. Upstream patch status: Not provided in agent output
  4. PR present in qcom-next: Not provided in agent output

🔍 Patch Validation

PR: #1365 (5-patch series: drm/msm fixes)
Upstream commit: N/A (series marked "Submitted" but not found on lore.kernel.org)
Verdict: ⚠️ PARTIAL


Patch 1/5: drm/msm: Recover HW before retire hung submit

Commit Message

Check | Status | Note
Subject matches upstream | N/A | Not found on lore; subject follows kernel style
Body preserves rationale | | Clear explanation of GPU pagefault issue
Fixes tag present/correct | | Fixes: 1a370be ("drm/msm: restart queued submits after hang")
Authorship preserved | | Jie Zhang + Akhil P Oommen sign-offs
Backport note (if applicable) | ⚠️ | Upstream-Status: Submitted [https://lore.kernel.org/linux-arm-msm/]; generic list URL, no message-ID

Diff

File | Status | Notes
drivers/gpu/drm/msm/msm_gpu.c | | Moves gpu->funcs->recover() before retire_submits(); logic correct

Issues

  • Upstream-Status tag: Uses generic mailing list URL instead of specific message-ID or patchwork link
  • Cannot verify upstream: Patch not found on lore.kernel.org/linux-arm-msm/ as of 2026-06-16
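
The ordering change in patch 1 can be illustrated with a small standalone model. This is only a sketch, not the driver code: `toy_gpu`, `toy_recover`, `toy_retire` and both worker functions are hypothetical names, and the GPU pagefault is modeled as a counter that increments when a buffer is freed while the modeled hardware is still active.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model: all names here are hypothetical, not msm driver symbols. */
struct toy_gpu {
	bool hw_active;  /* hung hardware may still touch the submit's BOs */
	bool bo_freed;   /* set once the hung submit has been retired */
	int faults;      /* counts frees that raced with active hardware */
};

/* Models GPU recovery: afterwards the hardware no longer uses old BOs. */
static void toy_recover(struct toy_gpu *gpu)
{
	gpu->hw_active = false;
}

/* Models retiring the hung submit, which frees its buffer objects. */
static void toy_retire(struct toy_gpu *gpu)
{
	if (gpu->hw_active)
		gpu->faults++;  /* BO freed while the hardware may still use it */
	gpu->bo_freed = true;
}

/* Order described by the patch: recover first, then retire. */
static int toy_recover_worker(struct toy_gpu *gpu)
{
	assert(gpu != NULL);
	toy_recover(gpu);
	toy_retire(gpu);
	return gpu->faults;
}

/* The previous order, kept only for comparison. */
static int toy_recover_worker_old(struct toy_gpu *gpu)
{
	assert(gpu != NULL);
	toy_retire(gpu);
	toy_recover(gpu);
	return gpu->faults;
}
```

Starting from a hung GPU (`hw_active` set), the new order reports no modeled faults, while the old order reports one.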

Patch 2/5: drm/msm/a6xx: Fix A663 GPUCC register list for state capture

Commit Message

Check | Status | Note
Subject matches upstream | N/A | Not found on lore; subject follows kernel style
Body preserves rationale | | Explains out-of-bounds register access issue
Fixes tag present/correct | | Fixes: 5773cce ("drm/msm/a6xx: Add support for A663")
Authorship preserved | | Jie Zhang + Akhil P Oommen sign-offs
Backport note (if applicable) | ⚠️ | Upstream-Status: Submitted [https://lore.kernel.org/linux-arm-msm/]; generic list URL

Diff

File | Status | Notes
drivers/gpu/drm/msm/adreno/a6xx_gpu_state.c | | Adds adreno_is_a663() to condition; logic correct

Issues

  • Upstream-Status tag: Uses generic mailing list URL instead of specific message-ID
  • Cannot verify upstream: Patch not found on lore.kernel.org

Patch 3/5: drm/msm/a6xx: Fix A621 GPUCC register list for state capture

Commit Message

Check | Status | Note
Subject matches upstream | N/A | Not found on lore; subject follows kernel style
Body preserves rationale | | Explains A621 using incorrect register list
Fixes tag present/correct | | Fixes: 11cdb81 ("drm/msm/a6xx: Fix gpucc register block for A621")
Authorship preserved | | Jie Zhang + Akhil P Oommen sign-offs
Backport note (if applicable) | ⚠️ | Upstream-Status: Submitted [https://lore.kernel.org/linux-arm-msm/]; generic list URL

Diff

File | Status | Notes
drivers/gpu/drm/msm/adreno/a6xx_gpu_state.c | | Splits A621 from A623/A663 register list; logic correct
drivers/gpu/drm/msm/adreno/a6xx_gpu_state.h | | Adds a621_gmu_gpucc_registers[] and a623_gmu_gpucc_registers[]

Issues

  • Upstream-Status tag: Uses generic mailing list URL instead of specific message-ID
  • Cannot verify upstream: Patch not found on lore.kernel.org
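
The per-variant list selection that patches 2 and 3 describe can be sketched as below. Everything here is an illustrative assumption: the `toy_` names are hypothetical, the register ranges are placeholder values rather than real GPUCC offsets, and the driver itself selects lists through helpers such as adreno_is_a663() instead of an enum.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical variant IDs used only by this sketch. */
enum toy_variant { TOY_A621, TOY_A623, TOY_A663, TOY_OTHER };

/* A register list stored as (start, end) pairs. */
struct toy_reg_list {
	const char *name;
	const unsigned int *pairs;
	size_t count;  /* number of values, two per range */
};

/* Placeholder ranges, NOT real GPUCC register offsets. */
static const unsigned int toy_a621_pairs[] = { 0x100, 0x10f };
static const unsigned int toy_a623_pairs[] = { 0x200, 0x20f, 0x300, 0x30f };

static const struct toy_reg_list toy_a621_gpucc = { "a621", toy_a621_pairs, 2 };
static const struct toy_reg_list toy_a623_gpucc = { "a623", toy_a623_pairs, 4 };

/* Returns the list for a variant, or NULL for variants this sketch does not model. */
static const struct toy_reg_list *toy_pick_gpucc(enum toy_variant v)
{
	switch (v) {
	case TOY_A621:
		return &toy_a621_gpucc;  /* dedicated list, as in patch 3 */
	case TOY_A623:
	case TOY_A663:
		return &toy_a623_gpucc;  /* A663 uses the A623 list, as in patch 2 */
	default:
		return NULL;
	}
}

/* Number of (start, end) ranges in a list. */
static size_t toy_range_count(const struct toy_reg_list *list)
{
	assert(list != NULL && list->count % 2 == 0);
	return list->count / 2;
}
```

Keeping one list per hardware layout means a capture loop only ever walks ranges that exist on the selected variant, which is the property the two patches restore.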

Patch 4/5: drm/msm/a6xx: Fix IRQ storm during msm_recovery test

Commit Message

Check | Status | Note
Subject matches upstream | N/A | Not found on lore; subject follows kernel style
Body preserves rationale | | Explains IRQ storm blocking hangcheck timer
Fixes tag present/correct | | Fixes: 5edf275 ("drm/msm: Add debugfs to disable hw err handling")
Authorship preserved | | Jie Zhang + Akhil P Oommen sign-offs
Backport note (if applicable) | ⚠️ | Upstream-Status: Submitted [https://lore.kernel.org/linux-arm-msm/]; generic list URL

Diff

File | Status | Notes
drivers/gpu/drm/msm/adreno/a5xx_gpu.c | | Masks interrupts when disable_err_irq set
drivers/gpu/drm/msm/adreno/a6xx_gpu.c | | Masks interrupts when disable_err_irq set
drivers/gpu/drm/msm/adreno/a8xx_gpu.c | | Masks interrupts when disable_err_irq set
drivers/gpu/drm/msm/msm_gpu.c | | Re-enables error IRQs in recover_worker()

Issues

  • Upstream-Status tag: Uses generic mailing list URL instead of specific message-ID
  • Cannot verify upstream: Patch not found on lore.kernel.org
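
The masking idea in patch 4 can be modeled in a few lines. This is a sketch under stated assumptions: the interrupt bit names and values are hypothetical, the mask register is a plain struct field, and only the `disable_err_irq` name is taken from the report above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical interrupt bits; real values live in the driver headers. */
#define TOY_INT_CACHE_FLUSH_TS (1u << 0)  /* completion interrupt */
#define TOY_INT_HANG_DETECT    (1u << 1)  /* error interrupt */
#define TOY_INT_BUS_ERROR      (1u << 2)  /* error interrupt */

#define TOY_INT_ERRORS (TOY_INT_HANG_DETECT | TOY_INT_BUS_ERROR)
#define TOY_INT_ALL    (TOY_INT_CACHE_FLUSH_TS | TOY_INT_ERRORS)

struct toy_gpu_irq {
	bool disable_err_irq;  /* set while the recovery test is running */
	uint32_t int_mask;     /* models the hardware interrupt mask register */
};

/* Program the mask, dropping error sources while the test flag is set. */
static void toy_update_irq_mask(struct toy_gpu_irq *gpu)
{
	assert(gpu != NULL);
	gpu->int_mask = gpu->disable_err_irq ? (TOY_INT_ALL & ~TOY_INT_ERRORS)
					     : TOY_INT_ALL;
}

/* An asserted source reaches the CPU only if its bit is unmasked. */
static bool toy_irq_delivered(const struct toy_gpu_irq *gpu, uint32_t source)
{
	assert(gpu != NULL);
	return (gpu->int_mask & source) != 0;
}

/* At recovery time, error sources are enabled again. */
static void toy_recover_irqs(struct toy_gpu_irq *gpu)
{
	assert(gpu != NULL);
	gpu->int_mask = TOY_INT_ALL;
}
```

With the error bits masked, a source that stays asserted no longer reaches the CPU, so it cannot re-enter the handler in a loop. The completion interrupt stays enabled throughout.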

Patch 5/5: drm/msm: Fix task_struct reference leak in recover_worker

Commit Message

Check | Status | Note
Subject matches upstream | N/A | Not found on lore; subject follows kernel style
Body preserves rationale | | Explains missing put_task_struct() causing leak
Fixes tag present/correct | | Fixes: 25654a1 ("drm/msm: Update global fault counter when faulty process has already ended")
Authorship preserved | | Jie Zhang + Akhil P Oommen sign-offs
Backport note (if applicable) | ⚠️ | Upstream-Status: Submitted [https://lore.kernel.org/linux-arm-msm/]; generic list URL

Diff

File | Status | Notes
drivers/gpu/drm/msm/msm_gpu.c | | Adds put_task_struct(task) in else branch; fixes reference leak

Issues

  • Upstream-Status tag: Uses generic mailing list URL instead of specific message-ID
  • Cannot verify upstream: Patch not found on lore.kernel.org
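
The leak pattern fixed by patch 5 can be reproduced with a toy reference counter. This is a sketch, not the kernel code: `toy_task`, `toy_get_task` and `toy_put_task` are hypothetical stand-ins for task_struct, get_pid_task() and put_task_struct(), and the `first_branch` flag is an assumed placeholder for whatever condition the driver actually tests.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a refcounted task; not the kernel's task_struct. */
struct toy_task {
	int refcount;
};

/* Models get_pid_task(): returns the task with one extra reference held. */
static struct toy_task *toy_get_task(struct toy_task *t)
{
	assert(t != NULL);
	t->refcount++;
	return t;
}

/* Models put_task_struct(): drops one reference. */
static void toy_put_task(struct toy_task *t)
{
	assert(t != NULL && t->refcount > 0);
	t->refcount--;
}

/* Leaky shape: only one branch drops the reference it took. */
static void toy_report_hang_leaky(struct toy_task *t, bool first_branch)
{
	struct toy_task *task = toy_get_task(t);

	if (first_branch)
		toy_put_task(task);
	/* else: the reference is never dropped, so each call leaks one */
}

/* Fixed shape: every path drops the reference before returning. */
static void toy_report_hang_fixed(struct toy_task *t, bool first_branch)
{
	struct toy_task *task = toy_get_task(t);

	if (first_branch) {
		/* ... use the task ... */
		toy_put_task(task);
	} else {
		/* ... other handling ... */
		toy_put_task(task);
	}
}
```

In the leaky shape the count grows by one on every call that takes the else path, which matches the "leak on every GPU hang" description; in the fixed shape the count returns to its starting value on both paths.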

Series-Wide Issues

  1. Upstream-Status tags incomplete: All 5 patches use Upstream-Status: Submitted [https://lore.kernel.org/linux-arm-msm/] with only the mailing list URL. Best practice is to include the specific message-ID or patchwork link once posted.

  2. Patches not found on lore: Searched lore.kernel.org/linux-arm-msm/ and lore.kernel.org/all/ — no matches found for any of the 5 patch subjects or author. This suggests:

    • Patches may not have been posted yet (despite "Submitted" status)
    • Patches may be under review on a different list
    • Lore indexing may be delayed
  3. Commit message quality: All commit messages follow kernel style with clear problem statements and rationale.

  4. Fixes tags: All 5 patches include proper Fixes: tags with commit SHA and subject.

  5. Sign-off chain: All patches have proper sign-off chain (author + co-maintainer).

  6. Code changes: All diffs appear technically correct based on the stated problems:

    • Patch 1: Ordering fix to prevent GPU pagefaults
    • Patch 2: Adds missing A663 to register list
    • Patch 3: Splits A621 from A623/A663 register lists
    • Patch 4: Masks IRQs to prevent interrupt storm
    • Patch 5: Fixes task_struct reference leak

Verdict

⚠️ PARTIAL PASS — Patches are technically sound and follow kernel coding standards, but upstream verification is blocked because:

  1. Patches are not yet indexed on lore.kernel.org (despite "Submitted" status)
  2. Upstream-Status tags use generic mailing list URLs instead of specific message-IDs
  3. Cannot verify if patches match upstream versions (if they exist)

Recommendation:

  • If patches have been posted upstream, update Upstream-Status tags with specific message-IDs or patchwork links
  • If patches have NOT been posted yet, change Upstream-Status to "Pending" or "To be submitted"
  • Once posted, verify that PR commits match the upstream versions exactly

Final Summary

  1. Lore link present: ⚠️ Partial — Generic mailing list URL provided, but no specific message-IDs for the 5 patches
  2. Lore link matches PR commits: N/A — Cannot verify; patches not found on lore.kernel.org as of 2026-06-16
  3. Upstream patch status: ⚠️ Unknown — Marked "Submitted" but not indexed on lore; may be pending or under review
  4. PR present in qcom-next: Not checked — No kernel tree access per constraints

@qlijarvis

PR #1365 — checker-log-analyzer

PR: #1365
Checker run: https://github.com/qualcomm-linux/kernel-config/actions/runs/27598598211

Checker | Result | Summary
checkpatch | ⚠️ | 2 commits have alignment style issues (non-blocking CHECK warnings)
dt-binding-check | | No device tree binding issues
dtb-check | | No device tree blob issues
sparse-check | | No sparse warnings
check-uapi-headers | | No UAPI header issues
check-patch-compliance | ❌ | All 5 commits missing required subject prefix
tag-check | N/A | Not applicable for topic branches
qcom-next-check | ⚠️ | Patches have Upstream-Status: Submitted but missing FROMLIST: prefix

🤖 CI Checker Analysis (checker-log-analyzer)

PR: drm/msm GPU recovery and register fixes (#1365)
Source: https://github.com/qualcomm-linux/kernel-config/actions/runs/27598598211

❌ check-patch-compliance

Root cause: All 5 commits are missing the required FROMLIST: subject prefix despite having Upstream-Status: Submitted tags.

Failure details:

Checking commit: drm/msm: Recover HW before retire hung submit
Commit summary does not start with a required prefix

Checking commit: drm/msm/a6xx: Fix A663 GPUCC register list for state capture
Commit summary does not start with a required prefix

Checking commit: drm/msm/a6xx: Fix A621 GPUCC register list for state capture
Commit summary does not start with a required prefix

Checking commit: drm/msm/a6xx: Fix IRQ storm during msm_recovery test
Commit summary does not start with a required prefix

Checking commit: drm/msm: Fix task_struct reference leak in recover_worker
Commit summary does not start with a required prefix

Fix: Add FROMLIST: prefix to all commit subjects since they have Upstream-Status: Submitted tags. The subjects should be:

  1. FROMLIST: drm/msm: Recover HW before retire hung submit
  2. FROMLIST: drm/msm/a6xx: Fix A663 GPUCC register list for state capture
  3. FROMLIST: drm/msm/a6xx: Fix A621 GPUCC register list for state capture
  4. FROMLIST: drm/msm/a6xx: Fix IRQ storm during msm_recovery test
  5. FROMLIST: drm/msm: Fix task_struct reference leak in recover_worker

Reproduce locally:

git log --oneline <base>..<head>
# Verify each commit subject starts with FROMLIST:, UPSTREAM:, or BACKPORT:
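
The subject check itself is a simple prefix match. A minimal sketch in C, assuming the accepted prefixes are exactly the three named in the comment above (the actual checker may accept others); the function name is hypothetical:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Prefixes named in the checker comment; assumed to be the full set. */
static const char *const required_prefixes[] = {
	"FROMLIST:", "UPSTREAM:", "BACKPORT:",
};

/* Returns true if a commit subject starts with one of the prefixes (case-sensitive). */
static bool subject_has_required_prefix(const char *subject)
{
	size_t n = sizeof(required_prefixes) / sizeof(required_prefixes[0]);

	assert(subject != NULL);
	for (size_t i = 0; i < n; i++) {
		const char *prefix = required_prefixes[i];

		if (strncmp(subject, prefix, strlen(prefix)) == 0)
			return true;
	}
	return false;
}
```

Each subject printed by the git log command above could be passed through such a function; any subject for which it returns false corresponds to a "Commit summary does not start with a required prefix" line.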

⚠️ checkpatch (non-blocking)

Root cause: Two commits have minor alignment style issues that checkpatch flags as CHECK warnings (not errors).

Failure details:

Commit 3/5 (5cbab28782d0):

CHECK: Alignment should match open parenthesis
#36: FILE: drivers/gpu/drm/msm/adreno/a6xx_gpu_state.c:1252:
+		_a6xx_get_gmu_registers(gpu, a6xx_state, &a623_gpucc_reg,
+			&a6xx_state->gmu_registers[2], false);

Commit 4/5 (b4631ef4a65d):

CHECK: Alignment should match open parenthesis
#36: FILE: drivers/gpu/drm/msm/adreno/a5xx_gpu.c:1286:
+		gpu_write(gpu, REG_A5XX_RBBM_INT_0_MASK,
+			       A5XX_RBBM_INT_0_MASK_CP_CACHE_FLUSH_TS |

Fix: Align continuation lines with the opening parenthesis:

For commit 3/5:

_a6xx_get_gmu_registers(gpu, a6xx_state, &a623_gpucc_reg,
                        &a6xx_state->gmu_registers[2], false);

For commit 4/5:

gpu_write(gpu, REG_A5XX_RBBM_INT_0_MASK,
          A5XX_RBBM_INT_0_MASK_CP_CACHE_FLUSH_TS |

Reproduce locally:

./scripts/checkpatch.pl --strict --git <base>..<head>
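
The column that a continuation line is expected to start at can be computed from the first line of the call. The helper below is a simplified illustration of that rule, not checkpatch's implementation: the function name is hypothetical, tabs are assumed to be 8 columns wide, and parentheses inside string literals or comments are not handled.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Returns the 0-based visual column just after the innermost parenthesis
 * still open at the end of `line`, or -1 if none is open. Tabs advance to
 * the next multiple of 8. Nesting deeper than 32 is not tracked.
 */
static int continuation_column(const char *line)
{
	int open_cols[32];  /* column after each currently open '(' */
	int depth = 0;
	int col = 0;

	assert(line != NULL);
	for (const char *p = line; *p != '\0'; p++) {
		if (*p == '\t') {
			col = (col / 8 + 1) * 8;
			continue;
		}
		col++;
		if (*p == '(' && depth < 32)
			open_cols[depth++] = col;
		else if (*p == ')' && depth > 0)
			depth--;
	}
	return depth > 0 ? open_cols[depth - 1] : -1;
}
```

For the two-tab-indented `gpu_write(` line from commit 4/5 this gives column 26, so the continuation would be two tabs followed by ten spaces.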

Verdict

1 blocker must be fixed; 1 further fix is optional:

  1. BLOCKER: Add FROMLIST: prefix to all 5 commit subjects (check-patch-compliance failure)
  2. OPTIONAL: Fix alignment issues in commits 3/5 and 4/5 (checkpatch CHECK warnings are non-blocking but recommended to fix)

The PR cannot be merged until the subject prefix issue is resolved. The alignment issues are style preferences and can be addressed optionally.

@shivrawa

@veereshbagale please refer to previous PRs.

In the commit message, please correct the FROMLIST prefix, sign-off, and Link.

f67b888

@shivrawa

Also, the kernel checker is complaining about "CHECK: Alignment should match open parenthesis".

Please cross-check whether the checker failure is valid for the code:
https://github.com/qualcomm-linux/kernel-config/actions/runs/27598598211/job/81594164692

Jie Zhang added 3 commits June 16, 2026 18:03

During recovery, it is not safe to retire the hung submit before we
recover the GPU. Retiring the submit triggers BO free and that can
result in GPU pagefaults since the GPU may be actively accessing those
BOs.

To fix this, retire the submits after gpu recovery is complete in
recover_worker().

Fixes: 1a370be ("drm/msm: restart queued submits after hang")
Signed-off-by: Veeresh Bagale <vbagale@qti.qualcomm.com>
Signed-off-by: Jie Zhang <jie.zhang@oss.qualcomm.com>
Signed-off-by: Akhil P Oommen <akhilpo@oss.qualcomm.com>
Link: https://lore.kernel.org/linux-arm-msm/20260605-assorted-fixes-june-v1-2-2caa04f7287c@oss.qualcomm.com

The GPUCC register list for A663 is incorrect, which can cause
out-of-bounds register access during GPU state capture.

Update it to use the correct register ranges.

Fixes: 5773cce ("drm/msm/a6xx: Add support for A663")
Signed-off-by: Veeresh Bagale <vbagale@qti.qualcomm.com>
Signed-off-by: Jie Zhang <jie.zhang@oss.qualcomm.com>
Signed-off-by: Akhil P Oommen <akhilpo@oss.qualcomm.com>
Link: https://lore.kernel.org/linux-arm-msm/20260605-assorted-fixes-june-v1-3-2caa04f7287c@oss.qualcomm.com

A621 uses an incorrect GPUCC register list during state capture.

The existing list matches A623/A663. Rename it accordingly and add a
dedicated A621 GPUCC register list.

Fixes: 11cdb81 ("drm/msm/a6xx: Fix gpucc register block for A621")
Signed-off-by: Veeresh Bagale <vbagale@qti.qualcomm.com>
Signed-off-by: Jie Zhang <jie.zhang@oss.qualcomm.com>
Signed-off-by: Akhil P Oommen <akhilpo@oss.qualcomm.com>
Link: https://lore.kernel.org/linux-arm-msm/20260605-assorted-fixes-june-v1-4-2caa04f7287c@oss.qualcomm.com

@qcomlnxci requested a review from a team June 16, 2026 12:38

Jie Zhang added 2 commits June 16, 2026 19:06

Once a hang is triggered by the msm_recovery test, the gpu error irq
remains asserted and triggers an interrupt storm. In the worst case,
this IRQ storm lands on the CPU core where the hangcheck timer is
scheduled, blocking it from running. This eventually leads to CPU
watchdog timeouts.

To fix this, mask the gpu error irqs during msm_recovery test and
enable them back during the recovery.

Fixes: 5edf275 ("drm/msm: Add debugfs to disable hw err handling")
Signed-off-by: Veeresh Bagale <vbagale@qti.qualcomm.com>
Signed-off-by: Jie Zhang <jie.zhang@oss.qualcomm.com>
Signed-off-by: Akhil P Oommen <akhilpo@oss.qualcomm.com>
Link: https://lore.kernel.org/linux-arm-msm/20260605-assorted-fixes-june-v1-5-2caa04f7287c@oss.qualcomm.com

get_pid_task() increments the task reference count, but the
corresponding put_task_struct() was missing in the else branch,
leaking a reference on every GPU hang recovery.

Fixes: 25654a1 ("drm/msm: Update global fault counter when faulty process has already ended")
Signed-off-by: Veeresh Bagale <vbagale@qti.qualcomm.com>
Signed-off-by: Jie Zhang <jie.zhang@oss.qualcomm.com>
Signed-off-by: Akhil P Oommen <akhilpo@oss.qualcomm.com>
Link: https://lore.kernel.org/linux-arm-msm/20260605-assorted-fixes-june-v1-6-2caa04f7287c@oss.qualcomm.com