Specialize non-reducing kernel: drop reduction-only iszero hoist #69

Open
lkdvos wants to merge 1 commit into main from ld-nonreduce-kernel
lkdvos commented Jun 24, 2026

Member

What

The dimension-1 inner core in _mapreduce_kernel_expr was generated with an
iszero(stride_1_1) branch:

```julia
if iszero(stride_1_1)          # hoist A1[I1] out of the loop
    a = A1[I1]; @simd ...; A1[I1] = a
else
    @simd ... A1[I1] = op(A1[I1], f(...)) ...
end
```

This hoist only earns its keep for reductions, where the destination has
stride 0 along the inner loop dim (the same element is accumulated into
repeatedly). For the non-reducing path — op === nothing, i.e.
map!/permutedims!/copy!/fill!/conj!, all of which reach the kernel via
_mapreduce_fuse!(f, nothing, nothing, ...) — every destination element is
written exactly once, so stride_1_1 is always nonzero and the hoist branch is
dead. Yet it was still emitted, generating the inner @simd body twice and
leaving a runtime branch in the hot map!/permute loop.
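
To make the stride argument concrete, here is a minimal standalone sketch of the two loop shapes. It is not the package's kernel: `accumulate_direct!`, `accumulate_hoisted!`, and the `dst_stride` argument are hypothetical names, and plain vectors stand in for the strided views the real code operates on.

```julia
# Hypothetical stand-ins for the two inner-loop shapes; not the package's code.
# `dst_stride` plays the role of stride_1_1: the destination step per iteration.

# Direct form: read-modify-write the destination on every iteration.
function accumulate_direct!(op, f, dst::Vector, src::Vector, dst_stride::Int)
    I1 = 1
    @inbounds for i in eachindex(src)
        dst[I1] = op(dst[I1], f(src[i]))
        I1 += dst_stride
    end
    return dst
end

# Hoisted form: only meaningful when the destination stride is zero,
# i.e. every iteration accumulates into the same element dst[1].
function accumulate_hoisted!(op, f, dst::Vector, src::Vector)
    a = dst[1]
    @inbounds @simd for i in eachindex(src)
        a = op(a, f(src[i]))
    end
    dst[1] = a
    return dst
end

src = [1.0, 2.0, 3.0, 4.0]

# Reduction: stride 0, every iteration hits dst[1]; both forms agree.
r_direct  = accumulate_direct!(+, abs2, [0.0], src, 0)
r_hoisted = accumulate_hoisted!(+, abs2, [0.0], src)

# Map-like use: stride 1, each destination element is touched exactly once,
# so there is no repeated element to hoist.
m_direct = accumulate_direct!((_, y) -> y, abs2, zeros(4), src, 1)
```

With a stride of 1 the hoisted form has nothing to offer, which is the situation the non-reducing callers are always in.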

This PR keys the inner-core construction on op: the non-reducing case emits a
single direct @simd loop (one body, no branch); the reducing case is unchanged.

It is correct even in the (non-occurring) zero-stride case for op === nothing:
the plain loop simply overwrites A1[I1] each iteration (last write wins),
which matches the hoisted form.
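
As a rough illustration of what keying the generated expression on `op` can look like, the sketch below builds two different inner-loop `Expr`s from a flag and splices each into a function. Everything here is an assumption made for illustration: `inner_core_expr`, `reduce_kernel!`, `map_kernel!`, and `count_ifs` are hypothetical names, and the real generator in `_mapreduce_kernel_expr` works on strides and offsets rather than plain vectors.

```julia
# Hypothetical, simplified stand-in for keying the generated inner loop on `op`.
function inner_core_expr(reducing::Bool)
    if reducing
        # Reducing case: keep the runtime zero-stride check and the hoisted body.
        return quote
            if iszero(dst_stride)
                a = dst[1]
                @simd for i in eachindex(src)
                    a = op(a, f(src[i]))
                end
                dst[1] = a
            else
                @simd for i in eachindex(src)
                    dst[i] = op(dst[i], f(src[i]))
                end
            end
        end
    else
        # Non-reducing case: one loop body, no runtime branch.
        return quote
            @simd for i in eachindex(src)
                dst[i] = f(src[i])
            end
        end
    end
end

# Splice each variant into a function definition at top level.
@eval function reduce_kernel!(op, f, dst, src, dst_stride)
    $(inner_core_expr(true))
    return dst
end

@eval function map_kernel!(f, dst, src)
    $(inner_core_expr(false))
    return dst
end

# Count `if` nodes to confirm the non-reducing expression carries no branch.
count_ifs(ex) = ex isa Expr ? (ex.head === :if) + sum(count_ifs, ex.args; init = 0) : 0
```

The choice between the two shapes happens while the expression is being built, so the non-reducing function never contains the branch or the second loop body.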

Why it's safe

This is a pure generation-time specialization of the staged Expr — it does
not introduce any function-call boundary inside the loop nest, so it avoids the
whole-nest-optimization pitfalls that make the bandwidth-bound permute path
fragile.

Benchmarks

Measured with benchmark/runtime_bench.jl and benchmark/compile_bench.jl,
baseline vs branch run back-to-back on the same machine, using the unchanged
reduce_* cases as a drift control (this workstation drifts a few % between
runs).

| | changed (permute/add) | control (reductions, unchanged) | read as |
|---|---|---|---|
| Runtime (single-thread) | −0.9% mean | −2.6% mean (drift) | neutral, no permute regression |
| Compile | −5.2% (add −8.4%, permute −2.7%) | −3.1% (drift) | ~2% real reduction for non-reducing |

So: runtime-neutral, a small compile-time win for map!/permute specializations,
and one fewer dead branch + half the inner-body codegen on the non-reducing path.
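
One plausible reading of the "~2% real reduction" figure is the changed-case delta minus the control delta. The PR does not state the formula, so the helper below (`drift_corrected`, a hypothetical name) is an assumption; the numbers are copied from the benchmark results above.

```julia
# Numbers copied from the benchmark results; percentages relative to baseline.
runtime_changed, runtime_control = -0.9, -2.6   # mean, single-thread
compile_changed, compile_control = -5.2, -3.1

# Assumed correction: subtract the drift seen on the unchanged control cases.
drift_corrected(changed, control) = changed - control

runtime_net = drift_corrected(runtime_changed, runtime_control)  # about +1.7
compile_net = drift_corrected(compile_changed, compile_control)  # about -2.1
```

Under that assumption the compile-time figure matches the roughly 2% quoted, and the runtime difference stays inside the few-percent drift reported for this workstation.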

Full test suite passes (including GPU/JLArray/CuArray paths).

🤖 Generated with Claude Code

The dim-1 inner core in `_mapreduce_kernel_expr` carried an
`iszero(stride_1_1)` branch that hoists `A1[I1]` out of the `@simd`
loop. That hoist only matters for reductions, where the destination
stride along the inner dim is zero. For the non-reducing path
(`op === nothing`: map!/permute/copy!/fill!/...) every destination
element is written exactly once, so the branch is dead and the loop
body was needlessly generated twice.

Emit the plain `@simd` loop (one body, no runtime branch) for
`op === nothing`, keeping the hoist for reductions. Runtime-neutral
(no permute regression) and trims a bit of non-reducing compile time.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

lkdvos commented Jun 24, 2026

Member Author

TL;DR: there was a branch that is never hit for non-reductions. Eliminating it speeds up compilation and might also slightly improve runtimes, since the generated code gets a bit smaller.

I ran some benchmarks that show slight improvements: nothing major, but every little bit helps!

lkdvos requested a review from Jutho June 24, 2026 01:09

codecov Bot commented Jun 24, 2026


Codecov Report

❌ Patch coverage is 90.90909% with 2 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| src/mapreduce.jl | 90.90% | 2 Missing ⚠️ |

| Files with missing lines | Coverage Δ |
|---|---|
| src/mapreduce.jl | 80.76% <90.90%> (+0.15%) ⬆️ |