[Klaud Cold] [AMD] Enable AITER MoE for MiniMax-M3 MI355X FP4 vLLM MTP benchmark #1958
functionstackx wants to merge 3 commits into
Split the FP4 MTP half out of #1955, rebased on current main.

- Gate AITER MoE on non-EP configs only: export VLLM_ROCM_USE_AITER=1, VLLM_ROCM_USE_AITER_MOE=1, VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=1 and pass --moe-backend aiter when EP is off; set VLLM_ROCM_USE_AITER=0 when EP is enabled (DP attention or EP > 1), since AITER MoE is incompatible with expert parallelism. Addresses review #1955 (discussion_r3495386866).
- Bump minimaxm3-fp4-mi355x-vllm-mtp to the AITER MoE nightly (nightly-4559c43a9526597c00cbcc4f59979496500268d1).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook. If they are not, please create a PR first before we can merge your single-node PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- "Enable AITER MoE on the MiniMax-M3 MI355X single-node vLLM EAGLE3 MTP MXFP4 benchmark for non-EP configs: export VLLM_ROCM_USE_AITER=1, VLLM_ROCM_USE_AITER_MOE=1, and VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=1, and pass --moe-backend aiter."
- "EP and DP-attention configs keep VLLM_ROCM_USE_AITER=0 since AITER MoE is incompatible with expert parallelism (vLLM #46419)."
- "Pin vllm/vllm-openai-rocm:nightly-4559c43a9526597c00cbcc4f59979496500268d1 (from nightly-3f5a1e1733200760169ff31ebe60a271072b199e)."
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1958
🔴 The new changelog entry's pr-link is https://github.com/SemiAnalysisAI/InferenceX/pull/PENDING instead of /pull/1958. After merge this leaves a 404 link in the changelog (every other entry in the file uses a real PR number, e.g. 1954/1952 immediately above). The GitHub PR-diff renderer may substitute the actual PR number on display, so this is best verified by reading the file on disk — tail -1 perf-changelog.yaml shows the literal string PENDING. Fix is a one-line replace of PENDING with 1958 before merge.
Extended reasoning...
What the bug is. The final line of perf-changelog.yaml (line 4333) is:
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/PENDING
The author wrote this entry using a placeholder (PENDING) intended to be swapped for the real PR number before merge, but the swap never happened. Note that GitHub's PR-diff renderer (and the harness's diff_context) often substitute the actual PR number for display in the rendered diff, which can hide this on review; direct disk/git inspection is authoritative.
Verification. Running tail -1 perf-changelog.yaml against the working tree shows pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/PENDING. The same string appears in git show HEAD:perf-changelog.yaml (i.e. the committed bytes for f59dca0 on this branch). Every adjacent entry uses a real PR number — the immediately preceding entry uses 1954 on line 4325, and the prior entries use 1952 (4318), 1942 (4312), and 1941 (4305). So the convention in the file is to write the real PR number, and this row visibly diverges.
Impact / proof of broken link. After merge, anyone following the changelog link https://github.com/SemiAnalysisAI/InferenceX/pull/PENDING will hit a GitHub 404: there is no PR numbered "PENDING". Concretely, walking through it:
- A user (or downstream tool that ingests perf-changelog.yaml) sees the new entry for minimaxm3-fp4-mi355x-vllm-mtp.
- They click pr-link: .../pull/PENDING.
- GitHub parses PENDING as the PR number, fails to find a matching numeric ID, and returns 404.
- The user has no way to navigate to the originating PR #1958 from the changelog.
Why existing tooling doesn't catch it. perf-changelog.yaml is a plain YAML data file — there is no schema validator in this PR's CI that asserts pr-link matches a numeric PR id, so the lint/build is happy with the literal string PENDING. The reviewer eye can also slide past it because GitHub's PR-diff renderer substitutes the actual PR number into the displayed diff when reviewing this PR. (That substitution is presentational only; the committed bytes are unchanged.)
Fix. Replace PENDING with 1958 on line 4333:
- pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/PENDING
+ pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1958
This is a one-line change matching the format used by every other entry in the file. Severity is nit/normal: a broken metadata link in a tracked changelog, with no runtime/benchmark impact, but it should be fixed before merge so the link resolves correctly.
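Since the comment notes that nothing in CI asserts a numeric pr-link, a small check along the following lines could catch a leftover placeholder. This is only a sketch: the function name check_pr_links is hypothetical, and it assumes each link sits on its own `pr-link:` line ending in `/pull/<number>`, as in the entries quoted above.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the repository): report every pr-link
# line whose final path segment is not a plain PR number.
# Returns 0 when all links are numeric, 1 otherwise.
check_pr_links() {
    local file="$1"
    local bad
    # First grep selects pr-link lines (with line numbers); the second
    # drops the ones that end in /pull/<digits>, leaving only offenders.
    bad=$(grep -nE '^[[:space:]]*pr-link:' "$file" \
        | grep -vE '/pull/[0-9]+[[:space:]]*$' || true)
    if [ -n "$bad" ]; then
        printf 'non-numeric pr-link found:\n%s\n' "$bad" >&2
        return 1
    fi
    return 0
}
```

Calling `check_pr_links perf-changelog.yaml` in a pre-merge step would fail on the literal PENDING string regardless of what the diff renderer displays.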
# AITER MoE accelerates the dense (non-EP) MoE path but is incompatible with
# expert parallelism, so disable it when EP is enabled (DP attention or EP > 1).
# https://github.com/SemiAnalysisAI/InferenceX/pull/1955#discussion_r3495386866
MOE_ARGS=()
if [ "${DP_ATTENTION}" = "true" ] || [ "$EP_SIZE" -gt 1 ]; then
    export VLLM_ROCM_USE_AITER=0
else
    export VLLM_ROCM_USE_AITER=1
    export VLLM_ROCM_USE_AITER_MOE=1
    export VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=1
    MOE_ARGS=(--moe-backend aiter)
fi
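The gate only takes effect if MOE_ARGS is expanded into the serve command. The diff does not show that command, so the following is a sketch under assumptions: build_serve_args is a hypothetical helper and example/model is a placeholder name. It illustrates how an empty array contributes no arguments, so `--moe-backend aiter` disappears when the gate leaves MOE_ARGS empty.

```shell
#!/usr/bin/env bash
# Hypothetical helper: assemble the argument list for a serve command and
# print one argument per line. Only the array handling is the point here.
build_serve_args() {
    local model="$1"; shift
    local args=(serve "$model")
    # Any remaining arguments are optional extras, e.g. the MOE_ARGS contents.
    args+=("$@")
    printf '%s\n' "${args[@]}"
}

# Empty array: nothing is appended after the model name.
MOE_ARGS=()
build_serve_args example/model ${MOE_ARGS[@]+"${MOE_ARGS[@]}"}

# Populated array: both words are passed through as separate arguments.
MOE_ARGS=(--moe-backend aiter)
build_serve_args example/model ${MOE_ARGS[@]+"${MOE_ARGS[@]}"}
```

The `${MOE_ARGS[@]+"${MOE_ARGS[@]}"}` form is used because a plain `"${MOE_ARGS[@]}"` on an empty array is treated as an unbound variable under `set -u` in bash versions before 4.4; if the recipe does not use `set -u`, the plain form is sufficient.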
🟣 Pre-existing follow-up (not blocking this PR): the STP sibling benchmarks/single_node/fixed_seq_len/minimaxm3_fp4_mi355x_vllm.sh (introduced by #1954, not touched by this PR) unconditionally exports VLLM_ROCM_USE_AITER=1 / VLLM_ROCM_USE_AITER_MOE=1 / VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=1 and passes --moe-backend aiter, with no DP-attention / EP_SIZE gate. Its search space in .github/configs/amd-master.yaml (minimaxm3-fp4-mi355x-vllm) sweeps several EP / DP-attn points (tp:8 ep:8, tp:4 ep:4, tp:2 ep:2, tp:8 ep:8 dp-attn:true), which will hit the very same AITER-MoE-incompatible-with-EP issue the new MTP gate (lines 42-54) was written to avoid — please backport the same if/else block to the STP recipe in a follow-up.
Extended reasoning...
What the bug is
This PR correctly adds an EP/DP-attention gate around AITER MoE in minimaxm3_fp4_mi355x_vllm_mtp.sh (new lines 42-54), so that VLLM_ROCM_USE_AITER is set to 0 (and --moe-backend aiter is dropped from the serve command) whenever DP_ATTENTION=true or EP_SIZE>1. The PR description and the file header explicitly justify this: "MoE serving mirrors minimaxm3_fp4_mi355x_vllm.sh ... except AITER MoE is gated off when expert parallelism is enabled", which is exactly what @hongxiayang's review on #1955 (discussion_r3495386866 — "will need to set VLLM_ROCM_USE_AITER=0 when enable ep") asked for.
However, the STP sibling benchmarks/single_node/fixed_seq_len/minimaxm3_fp4_mi355x_vllm.sh — last touched by #1954, and not in this PR's diff — still unconditionally exports the three AITER vars (its lines 35-37) and always passes --moe-backend aiter (its line 65), with no DP/EP guard. Its existing PARALLEL_ARGS block (lines 44-52) still handles DP_ATTENTION=true and EP_SIZE>1, proving the recipe IS invoked under EP — the AITER vars are live during those runs.
The code path that triggers it
The STP search space lives in .github/configs/amd-master.yaml under minimaxm3-fp4-mi355x-vllm (lines visible in the preloaded modified-files dump):
- { tp: 8, ep: 8, conc-start: 1, conc-end: 512 }
- { tp: 4, ep: 4, conc-start: 64, conc-end: 512 }
- { tp: 2, ep: 2, conc-start: 16, conc-end: 128 }
- { tp: 8, ep: 8, dp-attn: true, conc-start: 256, conc-end: 1024 }
Every one of those rows runs minimaxm3_fp4_mi355x_vllm.sh with EP_SIZE>1 or DP_ATTENTION=true, which is exactly the configuration the new MTP gate was added to avoid.
Step-by-step proof
- The sweep dispatcher picks the row { tp: 8, ep: 8, conc-start: 1, conc-end: 512 } and invokes minimaxm3_fp4_mi355x_vllm.sh with TP=8 and EP_SIZE=8.
- Lines 35-37 of that script unconditionally export VLLM_ROCM_USE_AITER=1, VLLM_ROCM_USE_AITER_MOE=1, and VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=1. There is no gating if like the one this PR adds at lines 42-54 of the MTP recipe.
- The script's PARALLEL_ARGS block (lines 44-52) reaches the elif [ "$EP_SIZE" -gt 1 ] branch and adds --enable-expert-parallel to the vLLM serve args.
- Line 65 then passes --moe-backend aiter to vllm serve (also unconditionally).
- vLLM starts with AITER MoE on and expert parallelism on: the exact incompatibility flagged by @hongxiayang and addressed by this PR for the MTP recipe ([ROCm] Enable AITER MoE backend for MiniMax-M3-MXFP4, vllm-project/vllm#46419). The run will crash, or silently produce incorrect MoE outputs, depending on the ROCm/vLLM stack.
The same path is hit by the tp:4 ep:4, tp:2 ep:2, and tp:8 ep:8 dp-attn:true rows.
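The failing combination traced in these steps can be expressed as a pre-flight check. The sketch below is hypothetical (aiter_ep_preflight is not part of either recipe) and inspects only the environment variables named in this review. It assumes an unset VLLM_ROCM_USE_AITER means off and an unset VLLM_ROCM_USE_AITER_MOE means on, and it compares against the literal values "1" and "0" only.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight check: return 0 when the AITER MoE settings and
# the parallelism settings can coexist, 1 when AITER MoE would be active
# together with expert parallelism or DP attention.
# Arguments: DP_ATTENTION ("true"/"false"), EP_SIZE (integer).
aiter_ep_preflight() {
    local dp_attention="$1" ep_size="$2"
    local ep_on=0
    if [ "$dp_attention" = "true" ] || [ "$ep_size" -gt 1 ]; then
        ep_on=1
    fi
    # Assumed defaults: AITER off when unset, AITER MoE on when unset.
    local aiter="${VLLM_ROCM_USE_AITER:-0}"
    local aiter_moe="${VLLM_ROCM_USE_AITER_MOE:-1}"
    if [ "$ep_on" -eq 1 ] && [ "$aiter" = "1" ] && [ "$aiter_moe" != "0" ]; then
        echo "AITER MoE is enabled together with EP/DP attention" >&2
        return 1
    fi
    return 0
}
```

With the STP recipe's unconditional exports in place, `aiter_ep_preflight "$DP_ATTENTION" "$EP_SIZE"` would fail for each of the four EP/DP-attn rows listed above and pass for the plain tp:8 and tp:4 rows.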
Why existing code doesn't prevent it
There simply is no guard on the STP side — the AITER vars and --moe-backend aiter are set at the top of the script, before PARALLEL_ARGS is computed, and never reset based on DP_ATTENTION / EP_SIZE. The MTP recipe in this PR introduces such a guard; the STP recipe was last touched in #1954, which added the unconditional AITER exports without any EP-incompatibility consideration.
Impact
All four EP/DP-attn STP sweep points (tp:8 ep:8 across both ISL/OSL pairs, tp:4 ep:4, tp:2 ep:2, and tp:8 ep:8 dp-attn:true) will hit the AITER+EP incompatibility — the MoE serving will either crash on startup or silently misbehave, leaving the EP portions of the STP sweep with no usable data. The non-EP STP rows (tp:8, tp:4) are unaffected because AITER MoE is the intended path there.
How to fix it
Backport the same if/else block from the new MTP recipe (lines 42-54 of minimaxm3_fp4_mi355x_vllm_mtp.sh) into the STP recipe, replacing the unconditional exports at lines 35-37 and gating --moe-backend aiter behind an array (MOE_ARGS) interpolated into the vLLM serve command:
MOE_ARGS=()
if [ "${DP_ATTENTION}" = "true" ] || [ "$EP_SIZE" -gt 1 ]; then
    export VLLM_ROCM_USE_AITER=0
else
    export VLLM_ROCM_USE_AITER=1
    export VLLM_ROCM_USE_AITER_MOE=1
    export VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=1
    MOE_ARGS=(--moe-backend aiter)
fi
Why this is pre-existing severity (not blocking)
The broken pattern was introduced by #1954 and lives in a file this PR does not modify, call, or extend. This PR scopes itself to the MTP file (description: "Splits the FP4 MTP half out of #1955") and explicitly acknowledges the divergence from STP in the new file header comment. The fix belongs in a separate follow-up PR so this MTP gate can land cleanly on its own.
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28422097175
The AITER-MoE nightly ships a torch build without torch.ao.quantization.pt2e, breaking Quark's MXFP4 dequant (mxfp4_utils._dequant_mxfp4) that EP/DP-attn configs fall back to when AITER fused MoE is disabled, crashing engine-core startup.

- EP/DP-attn: keep VLLM_ROCM_USE_AITER=1 (AITER dequant, avoids Quark) and set VLLM_ROCM_USE_AITER_MOE=0, instead of disabling AITER entirely.
- Drop EP/DP-attn search-space entries for 8k1k; keep them for 1k1k.
- Update perf-changelog accordingly.
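Read against the earlier if/else gate, this commit message implies a change in the EP branch only. The following is a sketch of what that revised gate might look like, not the committed code: configure_aiter is a hypothetical wrapper, and because the message does not say what happens to VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS under EP, the EP branch leaves that variable untouched.

```shell
#!/usr/bin/env bash
# Sketch of the revised gate described in the commit message.
# configure_aiter is a hypothetical wrapper; the recipe may inline this logic.
# Arguments: DP_ATTENTION ("true"/"false"), EP_SIZE (integer).
configure_aiter() {
    local dp_attention="$1" ep_size="$2"
    MOE_ARGS=()
    # AITER stays on in both branches so MXFP4 dequant does not fall back
    # to the Quark path that the commit message reports as broken.
    export VLLM_ROCM_USE_AITER=1
    if [ "$dp_attention" = "true" ] || [ "$ep_size" -gt 1 ]; then
        # EP or DP attention: turn off only the AITER fused-MoE path.
        export VLLM_ROCM_USE_AITER_MOE=0
    else
        export VLLM_ROCM_USE_AITER_MOE=1
        export VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=1
        MOE_ARGS=(--moe-backend aiter)
    fi
}
```

The difference from the first version of the gate is that VLLM_ROCM_USE_AITER is never set to 0; only the MoE-specific switch and the `--moe-backend aiter` flag depend on the EP condition.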
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=28443025904
Splits the FP4 MTP half out of #1955, rebased on current main (the original PR was conflicting because main already moved minimaxm3-fp4-mi355x-vllm STP to the new nightly via #1954). The FP8 MTP half will be a separate PR for easier merging.

Changes
- benchmarks/single_node/fixed_seq_len/minimaxm3_fp4_mi355x_vllm_mtp.sh: enable AITER MoE only on non-EP configs.
  - Non-EP configs: VLLM_ROCM_USE_AITER=1, VLLM_ROCM_USE_AITER_MOE=1, VLLM_ROCM_USE_AITER_FUSION_SHARED_EXPERTS=1, and --moe-backend aiter.
  - EP configs (DP_ATTENTION=true or EP_SIZE > 1): VLLM_ROCM_USE_AITER=0 and no --moe-backend aiter, since AITER MoE is incompatible with expert parallelism.
- .github/configs/amd-master.yaml: bump the minimaxm3-fp4-mi355x-vllm-mtp image to nightly-4559c43a9526597c00cbcc4f59979496500268d1.
- perf-changelog.yaml: changelog entry for the above.

Review addressed
Implements @hongxiayang's review on #1955 (#1955 (comment)), "will need to set VLLM_ROCM_USE_AITER=0 when enable ep", via an if/else on the EP condition. This matters here because the FP4 MTP sweep includes several ep: 8, ep: 4, and dp-attn: true points.

🤖 Generated with Claude Code