Update Qwen3.5 FP4 MI355X MTP recipe with tuned env/flags #1957
amd-fuyuajin wants to merge 1 commit into
Apply the validated MI355X TP=2 decode recipe to the single-node MTP benchmark (env/flags only; tuned aiter CSVs and sglang kernel patches are upstreamed separately):
- speculative-algorithm NEXTN (built-in MTP head) + spec-v1 chain path
- INT4 quick-reduce + single-stage AITER AllReduce
- enable dp-attention, mamba-ssm-dtype bf16, disable shared-experts fusion
- drop allreduce-fusion / unified-attn / flydsl-force / page-size 16
- guard hf download for local model paths
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR against them first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.
PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. Failures are often just flakes, and re-running the failed jobs will fix them. If you re-run failed jobs, you are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow
As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.
If additional help is needed, PR authors can reach out to core maintainers over Slack.
LGTM — env/flags-only tuning of a single benchmark recipe with thorough A/B results showing throughput and latency wins and no accuracy regression.
Extended reasoning...
Overview
This PR modifies a single benchmark launch script (benchmarks/single_node/fixed_seq_len/qwen3.5_fp4_mi355x_mtp.sh) for the Qwen3.5-397B-A17B-MXFP4 MI355X MTP recipe. Changes are limited to environment variables, SGLang CLI flags, and a guard around hf download for absolute model paths. No production code, library code, or shared infrastructure is touched.
Security risks
None. This is a benchmark launch shell script run by operators in controlled environments. The new hf download guard ([[ "$MODEL" != /* ]]) is a safe path-shape check that avoids re-downloading when an absolute local path is provided.
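The path-shape check quoted above can be illustrated with a minimal sketch. The helper name `needs_hf_download` and the sample model values are hypothetical and only serve the illustration; the script itself presumably inlines the test around `hf download`.

```shell
#!/usr/bin/env bash
# Sketch only: needs_hf_download is a hypothetical helper, not part of the script.
# Returns 0 (true) when the value does not start with "/", i.e. it is not an
# absolute local path and is therefore treated as a hub repo id.
needs_hf_download() {
  local model="$1"
  [[ "$model" != /* ]]
}

# Placeholder values, chosen only to exercise both branches.
for m in "org/some-model" "/data/models/some-model"; do
  if needs_hf_download "$m"; then
    echo "$m: would run hf download"
  else
    echo "$m: absolute local path, skip download"
  fi
done
```

Note that the pattern only recognizes absolute paths: a relative local directory such as `./models/x` would still take the download branch.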
Level of scrutiny
Low. Benchmark recipe scripts are not production-critical code paths — they are tunable artifacts for performance characterization. The change is mechanical (flag and env updates), self-contained, and accompanied by detailed A/B data (3 interleaved passes per concurrency, run-to-run CV ≤ ~1%) demonstrating +4–11% throughput, lower TPOT/p99 TTFT, and unlocking concurrency 256 that previously OOMed. gsm8k accuracy is unchanged (78.62% → 78.70%).
Other factors
The bug hunting system found no issues. The PR scope is explicitly bounded to env/flags (tuned AITER CSVs and SGLang source patches are deferred to their respective upstreams). The auto-posted recipe-reminder is a generic checklist, not a substantive review comment. No CODEOWNER-owned code is touched.
Summary
Updates the Qwen3.5-397B-A17B-MXFP4 MI355X MTP launch recipe
(benchmarks/single_node/fixed_seq_len/qwen3.5_fp4_mi355x_mtp.sh) with a tuned,
env/flags-only configuration. The change yields +4–11% throughput across the
concurrency sweep, with lower TPOT and p99 TTFT, and it unlocks concurrency 256
(where the current script OOMs), at no accuracy cost.
Scope: environment variables and server flags only. Tuned AITER kernel CSVs and
SGLang source patches are intentionally out of scope here and will be upstreamed
to their respective repos (aiter / sglang).
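As a rough sketch of what this env/flags layer looks like in a launch script, the fragment below uses only the variable and flag names that appear in this PR. The `SERVER_ARGS` array and the explicit `unset` are assumptions made for illustration; the real script may pass flags differently and may simply omit the dropped variables.

```shell
#!/usr/bin/env bash
# Sketch of the env/flags layer; SERVER_ARGS is a hypothetical array name.
export SGLANG_ENABLE_SPEC_V2=0                # stay on the spec-v1 chain path
export ROCM_QUICK_REDUCE_QUANTIZATION=INT4    # INT4 quick-reduce
export AITER_AR_1STAGE=1                      # single-stage AITER AllReduce

# Variables dropped by this PR (unset here is illustrative only).
unset SGLANG_USE_AITER_UNIFIED_ATTN AITER_FLYDSL_FORCE

# Server flags added or changed by this PR; the dropped flags
# (--enable-aiter-allreduce-fusion, --page-size 16) are simply absent.
SERVER_ARGS=(
  --speculative-algorithm NEXTN
  --enable-dp-attention
  --mamba-ssm-dtype bfloat16
  --disable-shared-experts-fusion
  --mem-fraction-static 0.85
)

# Print the flags instead of launching a server.
printf '%s\n' "${SERVER_ARGS[*]}"
```

Keeping the flags in an array makes an A/B comparison against the previous recipe a matter of diffing two short lists.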
What changed
qwen3.5_fp4_mi355x_mtp.sh (single file):
- Speculative decoding: speculative-algorithm NEXTN (built-in MTP head, spec-v1 chain path); add SGLANG_ENABLE_SPEC_V2=0.
- Set ROCM_QUICK_REDUCE_QUANTIZATION=INT4 and AITER_AR_1STAGE=1 (INT4 quick-reduce + single-stage AITER AllReduce; reduces TP=2 AllReduce latency).
- Add --enable-dp-attention, --mamba-ssm-dtype bfloat16, --disable-shared-experts-fusion.
- Raise --mem-fraction-static 0.8 → 0.85.
- Drop SGLANG_USE_AITER_UNIFIED_ATTN, AITER_FLYDSL_FORCE, --enable-aiter-allreduce-fusion, --page-size 16.
- Guard hf download so local (absolute-path) model dirs are not re-downloaded.
A/B results
Qwen3.5-397B-A17B-MXFP4, MI355X, TP=2, ISL=OSL=1024. Arm A = current main
script, Arm B = this recipe. Stock image sglang-hyperloom-env:v1 (no tuned CSVs
present, which isolates the env/flags layer). 3 interleaved passes per point,
server torn down between runs; values are the per-point median (run-to-run CV ≤ ~1%).
(total tok/s = input + output.)
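For readers who want to reproduce the aggregation, here is a minimal sketch of how a per-point median and run-to-run CV could be computed from three passes. The function name `median_and_cv` is hypothetical, the CV uses the sample standard deviation (n − 1), and the three input values are made-up placeholders, not measurements from this PR.

```shell
#!/usr/bin/env bash
# Sketch: median and coefficient of variation (CV) for one measurement point.
# The inputs in the call below are placeholders, not results from this PR.
median_and_cv() {
  printf '%s\n' "$@" | sort -n | awk '
    { v[NR] = $1; sum += $1 }
    END {
      n = NR
      # Median of the sorted values (average of the two middle ones if n is even).
      med = (n % 2) ? v[(n + 1) / 2] : (v[n / 2] + v[n / 2 + 1]) / 2
      mean = sum / n
      for (i = 1; i <= n; i++) ss += (v[i] - mean) ^ 2
      # Sample standard deviation; defined as 0 for a single value.
      sd = (n > 1) ? sqrt(ss / (n - 1)) : 0
      printf "median=%.1f cv=%.2f%%\n", med, 100 * sd / mean
    }'
}

median_and_cv 1000 1010 990   # prints: median=1000.0 cv=1.00%
```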
Highlights
- Throughput and latency improve together (TPOT 38.3 → 34.9 ms, p99 TTFT 4138 → 3337 ms ≈ −19%); not a throughput/latency trade.
- Concurrency 256: Arm A OOMs (no KV pool); Arm B fits and serves it at ~8.9k tok/s (NEXTN has no separate draft model, mem-frac 0.85, shared-experts fusion disabled → more KV headroom on 2 GPUs).
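The relative changes quoted in the highlights can be checked with a few lines of shell. This is only a sketch; `pct_change` is a hypothetical helper name, and the inputs are the TPOT and p99 TTFT figures given above.

```shell
#!/usr/bin/env bash
# Relative change from an Arm A value to an Arm B value, in percent.
pct_change() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%+.1f%%\n", 100 * (b - a) / a }'
}

pct_change 38.3 34.9    # TPOT (ms): prints -8.9%
pct_change 4138 3337    # p99 TTFT (ms): prints -19.4%
```

Both metrics are lower-is-better, so negative values are improvements.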
Accuracy
gsm8k 5-shot, strict exact-match, served via the launch script in
EVAL_ONLY mode: 78.62% (Arm A) → 78.70% (Arm B).
No regression (+0.08 pts, within noise).
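To put "within noise" in rough numbers, one can estimate the binomial standard error of the accuracy score. The sketch below assumes n = 1319 evaluated questions (the size of the standard gsm8k test split); the PR does not state n, so treat that value as an assumption. The helper name `accuracy_se_pts` is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: binomial standard error of an accuracy score, in percentage points.
# n=1319 is an assumption (standard gsm8k test split); the PR does not state it.
accuracy_se_pts() {
  awk -v p="$1" -v n="$2" 'BEGIN { printf "%.2f\n", 100 * sqrt(p * (1 - p) / n) }'
}

accuracy_se_pts 0.7862 1319   # prints: 1.13
```

Under that assumption, a +0.08 pt difference is far smaller than one standard error (about 1.13 pts) and corresponds to roughly one question out of 1319.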