perf(moe): fused decode-MoE kernel foundation (#268) #274
Merged
Starts the fused decode-MoE kernel work (option A from the #268 investigation). The MoE decode gap is GPU-bound idle between small kernels on the expert path, so the real win needs a single Metal kernel for the single-token expert computation (gather + 4/6-bit dequant + gate/up/down + swiglu + weighted-sum). The cheap small fusions do not help: fusing the combine or router post-processing saves ~1 dispatch/layer (~0.18% by the dispatch-count math), and per-dispatch overhead is not the bottleneck (graph build is 0.8 ms/tok vs 20.3 ms GPU).

This PR lays the groundwork the kernel needs:
- docs/benchmark_results/fused-moe-decode-kernel-design.md: the kernel design (fast::metal_kernel based, same JIT path as ssm_update_kernel), the seq_len==1 dispatch guard, the per-step validation gate (RMS < 5e-3, greedy parity, decode bench, trace check), and a one-PR-per-step roadmap.
- scripts/capture_moe_decode_trace.sh: capture one warm MoE decode token as a Metal trace (gputrace or xctrace) to localize the inter-kernel GPU idle before and after the kernel lands.

No model behavior change. The fused kernel itself lands in the next PR, trace-directed.
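The dispatch-count math above can be reproduced in a few lines. This is only a sketch that recomputes the percentages from the figures quoted in this PR (48 dispatches/token, 38 µs, 21 ms, 0.8 ms graph build, 20.3 ms GPU); the variable and function names are illustrative, and nothing here is a new measurement.

```python
# Figures as quoted in the PR text; nothing here is newly measured.
DISPATCHES_SAVED_PER_TOKEN = 48   # ~1 dispatch/layer
TIME_SAVED_US = 38.0              # quoted saving per token
TOKEN_TIME_MS = 21.0              # quoted decode time per token
GRAPH_BUILD_MS = 0.8              # quoted graph build time per token
GPU_MS = 20.3                     # quoted GPU time per token


def fraction_saved(saved_us, token_ms):
    """Share of the per-token time removed by a saving given in microseconds."""
    return saved_us / (token_ms * 1000.0)


saving = fraction_saved(TIME_SAVED_US, TOKEN_TIME_MS)
per_dispatch_us = TIME_SAVED_US / DISPATCHES_SAVED_PER_TOKEN
graph_build_share = GRAPH_BUILD_MS / (GRAPH_BUILD_MS + GPU_MS)

print(f"saving per token: {saving:.2%}")
print(f"implied cost per saved dispatch: {per_dispatch_us:.2f} us")
print(f"graph build share of build + GPU time: {graph_build_share:.1%}")
```

The saving is well under 1% of the token, and graph build is a small share of the total, which is why the small fusions are dropped in favor of a single kernel.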
Summary
Starts the fused decode-MoE kernel effort (option A from the #268 investigation), as a new PR off latest main. Foundation only: design + trace harness, no model behavior change.
Why not the cheap fusions
The MoE decode gap is GPU-bound idle between small kernels on the expert path (Step 0/1 of the investigation: graph build 0.8 ms/tok vs 20.3 ms GPU, ~16-20% bandwidth). Fusing the combine (`moe_weighted_sum`) or the router post-processing saves ~1 dispatch/layer ≈ 48/token ≈ 38 µs against a 21 ms token (~0.18%). Negligible. The win has to come from a single Metal kernel over the expert path (gather + 4/6-bit dequant + gate/up/down + swiglu + weighted-sum) so the GPU stops idling between the `gather_qmm` calls.
What's here
- `docs/benchmark_results/fused-moe-decode-kernel-design.md`: the kernel design built on the `fast::metal_kernel` JIT path (same as `ssm_update_kernel`): inputs/outputs/template args, the `seq_len == 1` dispatch guard with `SwitchGLU` fallback, the in-kernel affine dequant-GEMV as the hard part, risks (4/6-bit + mixed bits), the validation gate (RMS < 5e-3, greedy parity, decode bench, trace check), and a one-PR-per-step roadmap.
- `scripts/capture_moe_decode_trace.sh`: capture one warm MoE decode token as a Metal trace (`gputrace` or `xctrace`) to localize the expert-path GPU idle before/after.
- `docs/benchmarks.md`.
Scope
Documentation + tooling. Issue #268 stays open; the fused kernel lands next, trace-directed per the roadmap. Builds on the merged investigation report.
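For orientation, the single-token expert path the fused kernel has to reproduce (gather + affine dequant + gate/up/down + swiglu + weighted-sum) can be written as a slow pure-Python reference, together with an RMS check of the kind the validation gate describes. This is a sketch under stated assumptions, not the design in the doc:

- Weights are unpacked integer codes with one (scale, bias) pair per group, dequantized as `w = scale * q + bias`. The real 4/6-bit packing is not modeled.
- The expert dictionary layout and every function name are hypothetical.
- The gate is read as an absolute RMS difference against a reference output.

```python
import math

RMS_GATE = 5e-3  # threshold quoted in the PR; absolute RMS is an assumption


def dequant_row(codes, scales, biases, group_size):
    """Affine dequant: w = scale * q + bias, one (scale, bias) pair per group."""
    return [scales[i // group_size] * q + biases[i // group_size]
            for i, q in enumerate(codes)]


def qgemv(qmat, x, group_size):
    """y = W @ x, where each row of W is stored as (codes, scales, biases)."""
    out = []
    for codes, scales, biases in qmat:
        w = dequant_row(codes, scales, biases, group_size)
        out.append(sum(wi * xi for wi, xi in zip(w, x)))
    return out


def silu(v):
    return v / (1.0 + math.exp(-v))


def moe_decode_reference(x, experts, expert_ids, router_weights, group_size):
    """Single token: gather -> gate/up -> SwiGLU -> down -> weighted sum."""
    y = [0.0] * len(experts[expert_ids[0]]["down"])
    for eid, rw in zip(expert_ids, router_weights):
        e = experts[eid]  # gather the selected expert's weights
        gate = qgemv(e["gate"], x, group_size)
        up = qgemv(e["up"], x, group_size)
        hidden = [silu(g) * u for g, u in zip(gate, up)]
        down = qgemv(e["down"], hidden, group_size)
        y = [acc + rw * d for acc, d in zip(y, down)]
    return y


def rms_error(a, b):
    """Root-mean-square difference between two equal-length vectors."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)) / len(a))


def passes_rms_gate(candidate, reference, gate=RMS_GATE):
    return rms_error(candidate, reference) < gate
```

A fused kernel's output for one decode token would be passed as `candidate` and the unfused path's output as `reference`; greedy parity, the decode bench, and the trace check are separate gates that this sketch does not cover.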