Triangular Inverse Kernel (continuation of Zouzias' impl)#830
Triangular Inverse Kernel (continuation of Zouzias' impl)#830MirkoDeVita98 wants to merge 1 commit into
Conversation
There was a problem hiding this comment.
Code Review
This pull request introduces a new triangular matrix inverse example, including a recursive unrolled AICore kernel, orchestration logic, and a Python test suite. It also updates the benchmark_bgemm test configuration and adds debug logging to the scene test framework. Review feedback suggests removing leftover debug print statements, correcting documentation comments about the configuration tensor layout, and simplifying the kernel dispatch function by removing redundant template parameters.
fb54569 to
9bbf06e
Compare
|
Important Review skippedAuto incremental reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the ⚙️ Run configurationConfiguration used: Organization UI Review profile: CHILL Plan: Pro Run ID: You can disable this status message by setting the Use the checkbox below for a quick retry:
📝 WalkthroughWalkthroughThis PR introduces a complete triangular matrix inversion feature: an AICORE kernel implementing a recursive unrolled algorithm for inverting upper/lower-triangular fp16/bf16 matrices, paired with orchestration dispatch and a test suite for validation. ChangesTriangular Inverse Feature
Estimated Code Review Effort🎯 3 (Moderate) | ⏱️ ~25 minutes Poem
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
d9154fd to
244882d
Compare
There was a problem hiding this comment.
Actionable comments posted: 5
🧹 Nitpick comments (1)
examples/a2a3/tensormap_and_ringbuffer/triangular_inverse_example/test_triangular_inverse.py (1)
91-93: ⚡ Quick winFix the stale matrix-construction comment.
Line 91–Line 93 say the diagonal is set to
[0.5, 1.5], but this code path does not do that. Please update the comment to match the actual behavior (strictly triangular input, unit diagonal handled via+ Iin reference inversion).🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@examples/a2a3/tensormap_and_ringbuffer/triangular_inverse_example/test_triangular_inverse.py` around lines 91 - 93, Update the stale comment describing matrix construction to reflect the actual behavior: note that the code builds strictly triangular fp16 matrices (zeros out the off-triangle and does not set diagonal values to [0.5,1.5]) and that the unit diagonal is introduced only when computing the reference inverse via "+ I" (i.e., the test uses strictly triangular input and adds the identity in the reference inversion). Mention the fp16/strictly-triangular inputs and the "+ I" reference inversion so future readers understand how invertibility is handled.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In
`@examples/a2a3/tensormap_and_ringbuffer/triangular_inverse_example/kernels/aic/kernel_tri_inv_rec_unroll.cpp`:
- Around line 765-786: The switch on matrix_size in the block that calls
runKernelTriInvRecUnroll (cases 16/32/64/128) lacks a default and thus silently
does nothing for unsupported sizes; either add a guarded default branch that
logs an error/throws/returns a failure status (ensuring M_inv is not left stale)
or validate matrix_size earlier in aicpu_orchestration_entry and fail fast with
a clear error; locate the switch using the symbol matrix_size and the
runKernelTriInvRecUnroll instantiations and implement one of these checks so
callers receive a loud, deterministic error for unsupported sizes.
- Line 23: Update the header/comment for args[3] to match the actual config
layout read by the kernel (where config values are extracted into matrix_size,
num_matrices, is_lower, block_dim); change the doc from "int64[3]: [matrix_size,
num_matrices, is_lower]" to "int64[4]: [matrix_size, num_matrices, is_lower,
block_dim]" so it accurately documents the values consumed by the code that
reads config into those four int64 variables.
- Around line 789-830: The wrapper run_tri_inv_rec_unroll_per_num_matrices
currently has unused template parameters NumTilesPerCubeIter and IsBSND; remove
them from its template parameter list so it becomes template<typename InputT,
typename OutputT> AICORE void run_tri_inv_rec_unroll_per_num_matrices(...),
leaving the internal forwards to run_tri_inv_rec_unroll(...) unchanged (those
calls still specify NumTilesPerCubeIter and IsBSND as before). Update any call
sites that instantiate run_tri_inv_rec_unroll_per_num_matrices (e.g., places
that passed <half, half, 1, false>) to drop the unused template arguments so the
wrapper is instantiated only with InputT and OutputT. Ensure symbols to edit:
run_tri_inv_rec_unroll_per_num_matrices and its callers; do not change
run_tri_inv_rec_unroll signature.
- Around line 77-89: The loop over seq_idx is unbounded and can read past
cu_seqlens via cu_seqlens[seq_idx + 1]; change the loop to be bounded by the
number of sequences (or cu_seqlens length) and handle the “not found” case:
accept an additional parameter like seq_count (or cu_seqlens_len) and iterate
using for (uint32_t seq_idx = 0; seq_idx + 1 < seq_count; ++seq_idx) (or check
seq_idx + 1 < cu_seqlens_len before indexing), keep the existing logic that
computes seq_end/seq_len/seq_num_chunks and returns when chunk_idx falls into
the range, and if the loop finishes without finding the chunk return a safe
default or error (e.g., an invalid tile info or throw/assert) to avoid
out-of-bounds reads; update all call sites of GetBSNDVarlenTileInfoFromCuSeqlens
accordingly.
In
`@examples/a2a3/tensormap_and_ringbuffer/triangular_inverse_example/kernels/orchestration/triangular_inverse_orch.cpp`:
- Line 20: The argument-layout comment is using the wrong tensor index: the
config is read via orch_args.tensor(3) in this file (see uses of
orch_args.tensor(3) around the config handling), but the doc comment currently
labels it as tensor(4) and omits tensor(3); update the comment so the config
line reads "tensor(3) = config (INPUT) int64[4]: [matrix_size, num_matrices,
is_lower, block_dim]" (and ensure other tensor indices in that comment are
sequential and accurate to the orch_args.tensor(N) usages).
---
Nitpick comments:
In
`@examples/a2a3/tensormap_and_ringbuffer/triangular_inverse_example/test_triangular_inverse.py`:
- Around line 91-93: Update the stale comment describing matrix construction to
reflect the actual behavior: note that the code builds strictly triangular fp16
matrices (zeros out the off-triangle and does not set diagonal values to
[0.5,1.5]) and that the unit diagonal is introduced only when computing the
reference inverse via "+ I" (i.e., the test uses strictly triangular input and
adds the identity in the reference inversion). Mention the
fp16/strictly-triangular inputs and the "+ I" reference inversion so future
readers understand how invertibility is handled.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Organization UI
Review profile: CHILL
Plan: Pro
Run ID: 42bed645-9f72-456c-a3a0-2d1c29509b75
📒 Files selected for processing (3)
examples/a2a3/tensormap_and_ringbuffer/triangular_inverse_example/kernels/aic/kernel_tri_inv_rec_unroll.cppexamples/a2a3/tensormap_and_ringbuffer/triangular_inverse_example/kernels/orchestration/triangular_inverse_orch.cppexamples/a2a3/tensormap_and_ringbuffer/triangular_inverse_example/test_triangular_inverse.py
Fixes hw-native-sys#900 The AICore kernel loader (`simpler_setup/elf_parser.py`) silently dropped `.text._Z*` group sections (out-of-line template instantiations) and `.rela.text*` relocations when extracting a `.text` payload from a `.o`. The unresolved `BL`/`B` targets in `.text` then branched to garbage on device, manifesting as CANN 507018 watchdog timeouts (issue hw-native-sys#831 / PR hw-native-sys#830) or silently-wrong partial output (issue hw-native-sys#900). Both symptoms are extremely hard to root-cause from the runtime error alone. This change is the minimum to keep the next person from repeating that diagnostic loop: a pre-flight scan that fails loud if the `.o` contains `.text._Z*` or any `.rela.text*` entries. The error names the offending sections and points at the `always_inline` kernel-side workaround. The literal-`.text` extraction path is otherwise unchanged — working kernels stay byte-identical (verified against the PA highperf `.o` from PR hw-native-sys#899 and the existing fully-inlined kernels). Loader-side relocation application is a separable follow-up; this PR just closes the silent-failure mode. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fixes #900 The AICore kernel loader (`simpler_setup/elf_parser.py`) silently dropped `.text._Z*` group sections (out-of-line template instantiations) and `.rela.text*` relocations when extracting a `.text` payload from a `.o`. The unresolved `BL`/`B` targets in `.text` then branched to garbage on device, manifesting as CANN 507018 watchdog timeouts (issue #831 / PR #830) or silently-wrong partial output (issue #900). Both symptoms are extremely hard to root-cause from the runtime error alone. This change is the minimum to keep the next person from repeating that diagnostic loop: a pre-flight scan that fails loud if the `.o` contains `.text._Z*` or any `.rela.text*` entries. The error names the offending sections and points at the `always_inline` kernel-side workaround. The literal-`.text` extraction path is otherwise unchanged — working kernels stay byte-identical (verified against the PA highperf `.o` from PR #899 and the existing fully-inlined kernels). Loader-side relocation application is a separable follow-up; this PR just closes the silent-failure mode. Co-authored-by: Chao Wang <26245345+ChaoWao@users.noreply.github.com>
a2a7b86 to
c3d4ade
Compare
|
@MirkoDeVita98 , can you please update the PR description with Added a few more tests. Simulation |
4e0a2db to
1145957
Compare
1145957 to
92971af
Compare
92971af to
495663d
Compare
…unmodified The qwen3_14b_decode SceneTestCase failed with output NaN (simpler#1102). Root cause: the fused-attention kernels read their AIV sub-block id via a [[block_local]] static (pypto_runtime_subblock_id) populated from the runtime dispatch context, which emits a benign *data* relocation against .text. elf_parser.extract_text_section rejected *any* .rela.text entry, so the verbatim codegen was rejected — and the hw-native-sys#900 hand-edit that worked around it (rewriting to the native get_subblockid()) changed the sub-block id semantics, mis-writing the current-token K/V and poisoning attention -> output NaN. - elf_parser: reject only genuine branch relocations (R_AARCH64_CALL26/JUMP26) — the BL/B-to-garbage case hw-native-sys#830/hw-native-sys#831 guards against — and let benign non-branch relocations and unreferenced .text._Z* COMDAT duplicates through, matching pypto's loader. - Re-harvest the example from current pypto codegen (35 incores: 8 AIC + 27 AIV, was 39) and drop the hw-native-sys#900 hand-edit; kernels are now used verbatim. - Verified deterministic PASS on a2a3 (out + both KV pools match the torch reference), matching pypto execute_compiled on the identical artifacts.
…unmodified The qwen3_14b_decode SceneTestCase failed with output NaN (simpler#1102). Root cause: the fused-attention kernels read their AIV sub-block id via a [[block_local]] static (pypto_runtime_subblock_id) populated from the runtime dispatch context, which emits a benign *data* relocation against .text. elf_parser.extract_text_section rejected *any* .rela.text entry, so the verbatim codegen was rejected — and the hw-native-sys#900 hand-edit that worked around it (rewriting to the native get_subblockid()) changed the sub-block id semantics, mis-writing the current-token K/V and poisoning attention -> output NaN. - elf_parser: reject only genuine branch relocations (R_AARCH64_CALL26/JUMP26) — the BL/B-to-garbage case hw-native-sys#830/hw-native-sys#831 guards against — and let benign non-branch relocations and unreferenced .text._Z* COMDAT duplicates through, matching pypto's loader. Update test_elf_parser.py to pin the new contract. - Re-harvest the example from current pypto codegen (35 incores: 8 AIC + 27 AIV, was 39) and drop the hw-native-sys#900 hand-edit; kernels are now used verbatim (clang-format applied). - Verified deterministic PASS on a2a3 (out + both KV pools match the torch reference), matching pypto execute_compiled on the identical artifacts.
…TestCase Self-contained SceneTestCase port of pypto-lib decode_fwd_layers (N=2): a fused 2-layer Qwen3-14B decode chunk (hidden -> hidden, no LM head), harvested verbatim from pypto codegen (36 sources: orchestration + 35 incores, 8 AIC + 27 AIV) with a 2-layer torch golden. Parameter regime matches stress_profile.py: BATCH=16, MAX_SEQ=5500, decode seq_len=3500. Replaces the earlier single-layer example. Loader fix (elf_parser) so the harvested kernels run unmodified. The fused-attention kernels read their AIV sub-block id via a [[block_local]] static (pypto_runtime_subblock_id) populated from the dispatch context, which emits a benign data relocation against .text. extract_text_section rejected ANY .rela.text entry, forcing a hw-native-sys#900 hand-edit (native get_subblockid()) that changed the sub-block id semantics and produced output NaN (simpler#1102). elf_parser now rejects only genuine branch relocations (R_AARCH64_CALL26/JUMP26 -- the BL/B-to-garbage case hw-native-sys#830/hw-native-sys#831) and lets benign non-branch relocations and unreferenced .text._Z* COMDAT duplicates through, matching pypto's loader. test_elf_parser.py pins the new contract. Verified deterministic PASS on a2a3 (out + both KV pools match the torch reference), matching pypto execute_compiled on the identical artifacts.
…TestCase Self-contained SceneTestCase port of pypto-lib decode_fwd_layers (N=2): a fused 2-layer Qwen3-14B decode chunk (hidden -> hidden, no LM head), harvested verbatim from pypto codegen (36 sources: orchestration + 35 incores, 8 AIC + 27 AIV) with a 2-layer torch golden. Parameter regime matches stress_profile.py: BATCH=16, MAX_SEQ=5500, decode seq_len=3500. Replaces the earlier single-layer example. The orchestration is migrated to main's TaskArgs ABI (hw-native-sys#1093/hw-native-sys#1104): entry takes const L2TaskArgs&, externals via orch_args.tensor(N).ref(), per-task args use L0TaskArgs. The incores are used unmodified. Loader fix (elf_parser) so the harvested kernels run unmodified. The fused-attention kernels read their AIV sub-block id via a [[block_local]] static (pypto_runtime_subblock_id) populated from the dispatch context, which emits a benign data relocation against .text. extract_text_section rejected ANY .rela.text entry, forcing a hw-native-sys#900 hand-edit (native get_subblockid()) that changed the sub-block id semantics and produced output NaN (simpler#1102). elf_parser now rejects only genuine branch relocations (R_AARCH64_CALL26/JUMP26 -- the BL/B-to-garbage case hw-native-sys#830/hw-native-sys#831) and lets benign non-branch relocations and unreferenced .text._Z* COMDAT duplicates through, matching pypto's loader. test_elf_parser.py pins the new contract. Verified deterministic PASS on a2a3 (out + both KV pools match the torch reference), matching pypto execute_compiled on the identical artifacts.
Self-contained SceneTestCase port of pypto-lib decode_fwd_layers (N=2): a fused 2-layer Qwen3-14B decode chunk (hidden -> hidden, no LM head), harvested verbatim from pypto codegen (36 sources: orchestration + 35 incores, 8 AIC + 27 AIV) with a 2-layer torch golden. Parameter regime matches stress_profile.py: BATCH=16, MAX_SEQ=5500, decode seq_len=3500. Replaces the earlier single-layer example. The orchestration is migrated to main's TaskArgs ABI (hw-native-sys#1093/hw-native-sys#1104): entry takes const L2TaskArgs&, externals via orch_args.tensor(N).ref(), per-task args use L0TaskArgs. The incores are used unmodified. The test is SKIPPED on device. fa_fused_aiv emits a [[block_local]] static (pypto_runtime_subblock_id) for the AIV sub-block id, whose benign non-branch .text relocation is rejected by simpler's strict .text-only loader. The loader is intentionally left unchanged -- the hw-native-sys#900/hw-native-sys#830/hw-native-sys#831 guard against unapplied branch relocations is NOT relaxed. Un-skipping requires removing the [[block_local]] dependency: pto-isa accepting an explicit sub_block_id (so the kernel passes get_sub_block_id(args)), or the runtime programming the FFTS sub-block register so native get_subblockid() returns 0/1 (today 0 for both AIVs, confirmed by tensor dump -- both lanes collide -> attention NaN). The math is correct on a loader that accepts the benign relocation. See README.md.
Self-contained SceneTestCase port of pypto-lib decode_fwd_layers (N=2): a fused 2-layer Qwen3-14B decode chunk (hidden -> hidden, no LM head), harvested verbatim from pypto codegen (36 sources: orchestration + 35 incores, 8 AIC + 27 AIV) with a 2-layer torch golden. Parameter regime matches stress_profile.py: BATCH=16, MAX_SEQ=5500, decode seq_len=3500. Replaces the earlier single-layer example. The orchestration is migrated to main's TaskArgs ABI (hw-native-sys#1093/hw-native-sys#1104): entry takes const L2TaskArgs&, externals via orch_args.tensor(N).ref(), per-task args use L0TaskArgs. The incores are used unmodified. The test is SKIPPED on device. fa_fused_aiv emits a [[block_local]] static (pypto_runtime_subblock_id) for the AIV sub-block id, whose benign non-branch .text relocation is rejected by simpler's strict .text-only loader. The loader is intentionally left unchanged -- the hw-native-sys#900/hw-native-sys#830/hw-native-sys#831 guard against unapplied branch relocations is NOT relaxed. Un-skipping requires removing the [[block_local]] dependency: pto-isa accepting an explicit sub_block_id (so the kernel passes get_sub_block_id(args)), or the runtime programming the FFTS sub-block register so native get_subblockid() returns 0/1 (today 0 for both AIVs, confirmed by tensor dump -- both lanes collide -> attention NaN). The math is correct on a loader that accepts the benign relocation. See README.md.
Self-contained SceneTestCase port of pypto-lib decode_fwd_layers (N=2): a fused 2-layer Qwen3-14B decode chunk (hidden -> hidden, no LM head), harvested verbatim from pypto codegen (36 sources: orchestration + 35 incores, 8 AIC + 27 AIV) with a 2-layer torch golden. Parameter regime matches stress_profile.py: BATCH=16, MAX_SEQ=5500, decode seq_len=3500. Replaces the earlier single-layer example. The orchestration is migrated to main's TaskArgs ABI (hw-native-sys#1093/hw-native-sys#1104): entry takes const L2TaskArgs&, externals via orch_args.tensor(N).ref(), per-task args use L0TaskArgs. The incores are used unmodified. The test is SKIPPED on device. fa_fused_aiv emits a [[block_local]] static (pypto_runtime_subblock_id) for the AIV sub-block id, whose benign non-branch .text relocation is rejected by simpler's strict .text-only loader. The loader is intentionally left unchanged -- the hw-native-sys#900/hw-native-sys#830/hw-native-sys#831 guard against unapplied branch relocations is NOT relaxed. Un-skipping requires removing the [[block_local]] dependency: pto-isa accepting an explicit sub_block_id (so the kernel passes get_sub_block_id(args)), or the runtime programming the FFTS sub-block register so native get_subblockid() returns 0/1 (today 0 for both AIVs, confirmed by tensor dump -- both lanes collide -> attention NaN). The math is correct on a loader that accepts the benign relocation. See README.md.
…1088) Self-contained SceneTestCase port of pypto-lib decode_fwd_layers (N=2): a fused 2-layer Qwen3-14B decode chunk (hidden -> hidden, no LM head), harvested verbatim from pypto codegen (36 sources: orchestration + 35 incores, 8 AIC + 27 AIV) with a 2-layer torch golden. Parameter regime matches stress_profile.py: BATCH=16, MAX_SEQ=5500, decode seq_len=3500. Replaces the earlier single-layer example. The orchestration is migrated to main's TaskArgs ABI (#1093/#1104): entry takes const L2TaskArgs&, externals via orch_args.tensor(N).ref(), per-task args use L0TaskArgs. The incores are used unmodified. The test is SKIPPED on device. fa_fused_aiv emits a [[block_local]] static (pypto_runtime_subblock_id) for the AIV sub-block id, whose benign non-branch .text relocation is rejected by simpler's strict .text-only loader. The loader is intentionally left unchanged -- the #900/#830/#831 guard against unapplied branch relocations is NOT relaxed. Un-skipping requires removing the [[block_local]] dependency: pto-isa accepting an explicit sub_block_id (so the kernel passes get_sub_block_id(args)), or the runtime programming the FFTS sub-block register so native get_subblockid() returns 0/1 (today 0 for both AIVs, confirmed by tensor dump -- both lanes collide -> attention NaN). The math is correct on a loader that accepts the benign relocation. See README.md.
Added a few more tests.
Simulation