[AMD] Enable AITER MoE for MiniMax-M3 MI355X vLLM MTP benchmarks #1955
Fangzhou-Ai wants to merge 4 commits into
Mirror the STP AITER MoE and shared-expert fusion knobs on the MXFP4/MXFP8 EAGLE3 launchers, including INT6 quick-reduce on FP8 MTP. Co-authored-by: Cursor <cursoragent@cursor.com>
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook. If they are not, please create a PR first before we can merge your single-node PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.
PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. Failures are often just flakes, and re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow
As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.
If additional help is needed, PR authors can reach out to core maintainers over Slack.
See the unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28403385639
LGTM — straightforward AITER MoE enablement for the MTP benchmark scripts, mirroring the STP knobs.
Overview
This PR enables AITER MoE on the MiniMax-M3 MI355X EAGLE3 MTP benchmark launchers (MXFP4 and MXFP8 variants) by exporting three VLLM_ROCM_USE_AITER* env vars and passing --moe-backend aiter to vllm serve. The MXFP8 path additionally sets VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=INT6. A corresponding perf-changelog.yaml entry is appended. Total surface: two bash benchmark recipes and one YAML changelog entry.
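As a rough sketch of the change summarized above, a launcher fragment might look like the following. The env var names, the INT6 setting, and the --moe-backend aiter flag come from this PR; the MODEL and PRECISION variables and the dry-run echo are illustrative assumptions, not the actual recipe contents.

```shell
#!/usr/bin/env bash
# Sketch only: MODEL and PRECISION are hypothetical placeholders,
# not variables taken from the real recipes.
set -eu

MODEL="${MODEL:-placeholder/model-path}"
PRECISION="${PRECISION:-fp8}"

# AITER knobs named in the PR description.
export VLLM_ROCM_USE_AITER=1
export VLLM_ROCM_USE_AITER_MOE=1
export VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=1

# Per the PR, only the MXFP8 MTP path adds INT6 quick-reduce.
if [ "$PRECISION" = "fp8" ]; then
    export VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=INT6
fi

SERVE_CMD="vllm serve $MODEL --moe-backend aiter"

# Dry run: print the command instead of launching a server.
echo "$SERVE_CMD"
```

Printing the command rather than running it keeps the sketch usable on a machine without ROCm or vLLM installed.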
Security risks
None. This is benchmark configuration touching env vars and a CLI flag for a vLLM serve invocation — no auth, crypto, permissions, or data handling logic.
Level of scrutiny
Low. These are sandbox benchmark scripts (not production-critical code paths), the changes are mechanical env-var exports plus a single new flag, and the PR description confirms manual validation showed EAGLE3 MTP works with the same AITER MoE stack as STP. Both files pass bash -n per the test plan.
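For readers unfamiliar with the bash -n check mentioned here, a minimal sketch follows. The two temporary files are stand-ins invented for illustration; in the PR the check would be pointed at the two MTP benchmark scripts.

```shell
#!/usr/bin/env bash
# Sketch only: the temp files stand in for the real benchmark scripts.
set -eu

WORKDIR="$(mktemp -d)"
printf 'export VLLM_ROCM_USE_AITER=1\necho ok\n' > "$WORKDIR/good.sh"
# The "if" below is never closed, so parsing fails at end of file.
printf 'if true; then\necho missing fi\n' > "$WORKDIR/bad.sh"

FAILED=0
for script in "$WORKDIR/good.sh" "$WORKDIR/bad.sh"; do
    # bash -n parses the script without executing any command.
    if bash -n "$script" 2>/dev/null; then
        echo "syntax ok: $(basename "$script")"
    else
        echo "syntax error: $(basename "$script")"
        FAILED=$((FAILED + 1))
    fi
done
rm -rf "$WORKDIR"
echo "scripts with syntax errors: $FAILED"
```

Note that bash -n only catches parse errors; it says nothing about whether an env var name or CLI flag is accepted at runtime.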
Other factors
The single inline finding is a nit about a forward-reference comment pointing at the STP recipe in minimaxm3_fp4_mi355x_vllm.sh which, on current main, does not yet contain the AITER knobs (those land in #1954). It's a doc-only concern with no runtime impact and the PR description openly acknowledges the #1954 dependency. The change otherwise follows the established pattern visible in the existing perf-changelog entries and recipe scripts.
# minimaxm3_fp4_mi355x_vllm.sh and uses three speculative tokens from
# Inferact/MiniMax-M3-EAGLE3. The pinned nightly includes upstream AMD
# MiniMax-M3 SupportsEagle3 support, so no runtime model patch is needed.
# MoE serving mirrors minimaxm3_fp4_mi355x_vllm.sh (AITER MoE, vllm#46419).
🟡 Nit: line 8 says "MoE serving mirrors minimaxm3_fp4_mi355x_vllm.sh (AITER MoE, vllm#46419)." but on current main that STP recipe sets no AITER env vars and no --moe-backend — its own header comment says it "lets vLLM select the MoE backend." Those STP knobs come from #1954, which hasn't landed. If #1955 merges first, this cross-reference is wrong. Consider pointing at #1954 directly, or just describing the AITER setup (with the vllm#46419 credit) without claiming the STP file already does the same.
What the comment claims vs. what is true on main
The new line 8 in benchmarks/single_node/fixed_seq_len/minimaxm3_fp4_mi355x_vllm_mtp.sh reads:
# MoE serving mirrors minimaxm3_fp4_mi355x_vllm.sh (AITER MoE, vllm#46419).
But benchmarks/single_node/fixed_seq_len/minimaxm3_fp4_mi355x_vllm.sh on current main does not yet set VLLM_ROCM_USE_AITER, VLLM_ROCM_USE_AITER_MOE, VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS, or pass --moe-backend aiter. Its line 6 explicitly states the opposite: "…lets vLLM select the MoE backend." A git grep for --moe-backend confirms the only occurrences of --moe-backend aiter in the tree are the two MTP files introduced by this PR.
Step-by-step proof
- cat benchmarks/single_node/fixed_seq_len/minimaxm3_fp4_mi355x_vllm.sh on the current branch's view of main shows lines 5–6 saying the recipe "uses the text-only language-model path and lets vLLM select the MoE backend."
- grep -E 'VLLM_ROCM_USE_AITER|moe-backend' benchmarks/single_node/fixed_seq_len/minimaxm3_fp4_mi355x_vllm.sh returns nothing.
- git log --all --oneline | grep -i 1954 returns nothing: #1954 ([AMD] Enable AITER MoE for MiniMax-M3 FP4 MI355X vLLM STP) has not landed.
- The PR description's own test plan says: "Apply full-sweep-enabled (or full-sweep-fail-fast) after #1954 lands." That acknowledges the ordering dependency exists.
- Therefore, at the moment this PR's diff is merged (absent #1954), a reader opening minimaxm3_fp4_mi355x_vllm_mtp.sh and following the line-8 pointer to minimaxm3_fp4_mi355x_vllm.sh will find a file whose header comment says vLLM chooses the backend, directly contradicting the "mirrors … AITER MoE" claim.
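The grep step in the proof can be reproduced in isolation. The sketch below runs the same pattern against a temporary file that imitates the STP recipe header quoted above; the file contents are an assumption for illustration, not a copy of the real script.

```shell
#!/usr/bin/env bash
# Sketch only: the temp file imitates a recipe header; it is not the real STP script.
set -eu

RECIPE="$(mktemp)"
cat > "$RECIPE" <<'EOF'
# Uses the text-only language-model path and lets vLLM select the MoE backend.
vllm serve "$MODEL"
EOF

# grep exits 0 on a match and 1 on no match, so it can drive the branch directly.
# The "--" ends option parsing, which matters once a pattern starts with a dash.
if grep -qE -- 'VLLM_ROCM_USE_AITER|moe-backend' "$RECIPE"; then
    HAS_AITER=yes
else
    HAS_AITER=no
fi
rm -f "$RECIPE"
echo "recipe sets AITER knobs: $HAS_AITER"
```

The match is case-sensitive, so the prose phrase "MoE backend" in the header comment does not count as a hit for moe-backend.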
Addressing the refutation
The refutation argues this is (a) a known dependency, (b) flagged in the test plan, (c) a nit, and (d) forward-looking design language is normal for paired PRs. Points (a)–(c) are accurate and are exactly why this is filed at nit severity, not blocking. The remaining concern is narrow: the comment's surface reading is a present-tense factual claim ("mirrors"), and merge order in this repo is not actually pinned — #1955 can land before, after, or instead of #1954. If #1954 is rebased, re-scoped, or abandoned, this comment ships indefinitely as a dangling cross-reference. The refutation's framing of "a comment that's true after the next-PR-in-the-stack merges" assumes a merge order that the PR description hopes for but does not enforce.
Impact
Documentation-only; no runtime effect. The AITER env vars and --moe-backend aiter flag in this file are self-contained and correct on their own. The risk is purely reader confusion if/when they trace the cross-reference and find the STP recipe in a state that contradicts the comment.
How to fix
One of three trivial options:
- Reword to credit only the upstream design reference:
  # MoE serving uses AITER MoE (vllm#46419).
- Forward-reference #1954 explicitly:
  # MoE serving mirrors the STP AITER MoE setup from #1954.
- Land #1954 first, then this comment becomes accurate as written.
This is in-scope to flag because the comment is newly added by this PR, the fix is one-line, and option 1 makes the comment robust to any merge order without coupling the two PRs.
Use the emulation linear backend for MXFP8 EAGLE3 serving on MI355X. Co-authored-by: Cursor <cursoragent@cursor.com>
Pin nightly-4559c43a for AITER MoE, shared-expert fusion, and FP8 linear-backend emulation support on all four single-node configs. Co-authored-by: Cursor <cursoragent@cursor.com>
# Run with CUDA graphs (no --enforce-eager): VLLM_USE_BREAKABLE_CUDAGRAPH=0
# avoids the M3-decode breakable-cudagraph path that previously forced eager.
export VLLM_USE_BREAKABLE_CUDAGRAPH=0
export VLLM_ROCM_USE_AITER=1
This will need to set VLLM_ROCM_USE_AITER=0 when EP (expert parallelism) is enabled.
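One way the reviewer's reminder could be wired into a launcher is sketched below. The aiter_flag_for_ep helper and the EP_ENABLED toggle are hypothetical names introduced for illustration; the recipes in this PR do not define them, and the only fact taken from the thread is that VLLM_ROCM_USE_AITER should be 0 when EP is on.

```shell
#!/usr/bin/env bash
# Sketch only: aiter_flag_for_ep and EP_ENABLED are hypothetical, not from the recipes.
set -eu

# Prints the value VLLM_ROCM_USE_AITER should take for a given EP setting.
aiter_flag_for_ep() {
    if [ "$1" = "1" ]; then
        # Per the review comment, AITER is switched off when EP is enabled.
        echo 0
    else
        echo 1
    fi
}

EP_ENABLED="${EP_ENABLED:-0}"
VLLM_ROCM_USE_AITER="$(aiter_flag_for_ep "$EP_ENABLED")"
export VLLM_ROCM_USE_AITER
echo "EP_ENABLED=$EP_ENABLED VLLM_ROCM_USE_AITER=$VLLM_ROCM_USE_AITER"
```

Keeping the decision in one small function makes it easy to reuse the same rule across the FP4 and FP8 launchers if both end up supporting EP.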
Same reminder here: #1954 (comment)
Splitting this into two smaller PRs to make them easier to review and merge independently — one for FP4 MTP and one for FP8 MTP:
(The original was conflicting because |
Summary
- Enable AITER MoE on the MiniMax-M3 MI355X EAGLE3 MTP launchers (minimaxm3_fp4_mi355x_vllm_mtp.sh, minimaxm3_fp8_mi355x_mtp.sh), mirroring the STP knobs from #1954.
- Export VLLM_ROCM_USE_AITER=1, VLLM_ROCM_USE_AITER_MOE=1, and VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=1; pass --moe-backend aiter.
- Set VLLM_ROCM_QUICK_REDUCE_QUANTIZATION=INT6 on the MXFP8 MTP path.
- Add perf-changelog.yaml triggers for minimaxm3-fp4-mi355x-vllm-mtp and minimaxm3-fp8-mi355x-vllm-mtp.
Why
Manual validation showed EAGLE3 MTP serving works with the same AITER MoE stack as STP. This lands the benchmark coverage Andy asked to defer from #1954.
Test plan
- bash -n on both MTP benchmark scripts
- python utils/matrix_logic/generate_sweep_configs.py test-config --config-keys minimaxm3-fp4-mi355x-vllm-mtp minimaxm3-fp8-mi355x-vllm-mtp --config-files .github/configs/amd-master.yaml --no-evals
- Apply full-sweep-enabled (or full-sweep-fail-fast) after #1954 lands, or dispatch e2e against this branch
Made with Cursor