Specialize non-reducing kernel: drop reduction-only iszero hoist #69
Open
lkdvos wants to merge 1 commit into
The dim-1 inner core in `_mapreduce_kernel_expr` carried an `iszero(stride_1_1)` branch that hoists `A1[I1]` out of the `@simd` loop. That hoist only matters for reductions, where the destination stride along the inner dim is zero. For the non-reducing path (`op === nothing`: map!/permute/copy!/fill!/...) every destination element is written exactly once, so the branch is dead and the loop body was needlessly generated twice. Emit the plain `@simd` loop (one body, no runtime branch) for `op === nothing`, keeping the hoist for reductions. Runtime-neutral (no permute regression) and trims a bit of non-reducing compile time. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
TL;DR: there was a branch that is never hit for non-reductions. Eliminating it speeds up compilation and might slightly improve runtimes, since the generated code gets a bit smaller. I ran some benchmarks that show slight improvements; nothing major, but every little bit helps!
What
The dimension-1 inner core in `_mapreduce_kernel_expr` was generated with an `iszero(stride_1_1)` branch that hoists `A1[I1]` out of the `@simd` loop.

This hoist only earns its keep for reductions, where the destination has stride 0 along the inner loop dim (the same element is accumulated into repeatedly). For the non-reducing path (`op === nothing`, i.e. `map!`/`permutedims!`/`copy!`/`fill!`/`conj!`, all of which reach the kernel via `_mapreduce_fuse!(f, nothing, nothing, ...)`), every destination element is written exactly once, so `stride_1_1` is always nonzero and the hoist branch is dead. Yet it was still emitted, generating the inner `@simd` body twice and leaving a runtime branch in the hot `map!`/permute loop.

This PR keys the inner-core construction on `op`: the non-reducing case emits a single direct `@simd` loop (one body, no branch); the reducing case is unchanged.

It is correct even in the (non-occurring) zero-stride case for `op === nothing`: the plain loop simply overwrites `A1[I1]` each iteration (last write wins), which matches the hoisted form.
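The last-write-wins argument can be checked with a small standalone sketch. The two functions below are hypothetical stand-ins for the plain and hoisted inner loops in the non-reducing case, not the code the package generates; their names and signatures are assumptions made for illustration, and `@simd`/`@inbounds` are left out to keep the sketch simple.

```julia
# Hypothetical stand-in for the plain inner loop (non-reducing case):
# the destination is written on every iteration.
function plain_inner!(A1, A2, f, n, I1, I2, stride_1_1, stride_2_1)
    for _ in 1:n
        A1[I1] = f(A2[I2])
        I1 += stride_1_1
        I2 += stride_2_1
    end
    return A1
end

# Hypothetical stand-in for the hoisted form with a zero destination stride:
# the destination element lives in a local and is stored once after the loop.
function hoisted_inner!(A1, A2, f, n, I1, I2, stride_2_1)
    a = A1[I1]
    for _ in 1:n
        a = f(A2[I2])   # no `op`, so each iteration overwrites the local
        I2 += stride_2_1
    end
    A1[I1] = a
    return A1
end

src = [1.0, 2.0, 3.0, 4.0]
double(x) = 2x

# Zero destination stride: both forms leave f(src[end]) in the destination.
plain_inner!([0.0], src, double, 4, 1, 1, 0, 1) == hoisted_inner!([0.0], src, double, 4, 1, 1, 1)
```

With a nonzero destination stride, `plain_inner!` is an ordinary element-wise map, which is the only case the non-reducing path is expected to hit.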
Why it's safe
This is a pure generation-time specialization of the staged `Expr`: it does not introduce any function-call boundary inside the loop nest, so it avoids the whole-nest-optimization pitfalls that make the bandwidth-bound permute path fragile.
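As a rough illustration of what "generation-time specialization" means here, the sketch below builds the inner-core expression differently depending on whether a reduction is requested. `inner_core_expr` and `count_head` are hypothetical helpers written for this example, not functions from the package, and the loop bodies are simplified assumptions about the shape of the generated code. The point is only that the choice is made while building the `Expr`, so the resulting loop nest contains no extra call or branch.

```julia
# Hypothetical generator: NOT the package's `_mapreduce_kernel_expr`.
# Returns the expression for the innermost loop, keyed on `reducing`.
function inner_core_expr(reducing::Bool)
    if !reducing
        # Non-reducing: one plain loop, no runtime branch.
        return quote
            @simd for i in 1:n
                A1[I1] = f(A2[I2])
                I1 += stride_1_1
                I2 += stride_2_1
            end
        end
    end
    # Reducing: keep the zero-stride hoist next to the general loop.
    return quote
        if iszero(stride_1_1)
            a = A1[I1]
            @simd for i in 1:n
                a = op(a, f(A2[I2]))
                I2 += stride_2_1
            end
            A1[I1] = a
        else
            @simd for i in 1:n
                A1[I1] = op(A1[I1], f(A2[I2]))
                I1 += stride_1_1
                I2 += stride_2_1
            end
        end
    end
end

# Count the nodes with a given head in an expression tree.
function count_head(ex, head::Symbol)
    ex isa Expr || return 0
    return (ex.head === head) + sum(a -> count_head(a, head), ex.args; init = 0)
end

count_head(inner_core_expr(false), :for)  # 1 loop body
count_head(inner_core_expr(false), :if)   # 0 branches
count_head(inner_core_expr(true), :for)   # 2 loop bodies
count_head(inner_core_expr(true), :if)    # 1 branch
```

Because the `quote` blocks are only spliced into a larger expression, the compiler still sees the whole loop nest as one function body in either case.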
Benchmarks
Measured with `benchmark/runtime_bench.jl` and `benchmark/compile_bench.jl`, baseline vs branch run back-to-back on the same machine, using the unchanged `reduce_*` cases as a drift control (this workstation drifts a few % between runs).

So: runtime-neutral, a small compile-time win for `map!`/permute specializations, and one fewer dead branch + half the inner-body codegen on the non-reducing path.
Full test suite passes (including GPU/`JLArray`/`CuArray` paths).

🤖 Generated with Claude Code