Update Qwen3.5 FP4 MI355X MTP recipe with tuned env/flags by amd-fuyuajin · Pull Request #1957 · SemiAnalysisAI/InferenceX

amd-fuyuajin · 2026-06-29T23:41:11Z

Summary

Updates the Qwen3.5-397B-A17B-MXFP4 MI355X MTP launch recipe
(benchmarks/single_node/fixed_seq_len/qwen3.5_fp4_mi355x_mtp.sh) with a tuned,
env/flags-only configuration. Throughput improves by 4–11% across the
concurrency sweep, with equal or lower median TPOT and lower p99 TTFT, and
concurrency 256 (where the current script OOMs) now runs, at no accuracy cost.

Scope: environment variables and server flags only. Tuned AITER kernel CSVs and
SGLang source patches are intentionally out of scope here and will be upstreamed
to their respective repos (aiter / sglang).

What changed

qwen3.5_fp4_mi355x_mtp.sh (single file):

- Speculative algorithm EAGLE → NEXTN (built-in MTP head on the ROCm-hardened spec-v1 chain path); add SGLANG_ENABLE_SPEC_V2=0.
- Add ROCM_QUICK_REDUCE_QUANTIZATION=INT4 and AITER_AR_1STAGE=1 (INT4 quick-reduce + single-stage AITER AllReduce; reduces TP=2 AllReduce latency).
- Add --enable-dp-attention, --mamba-ssm-dtype bfloat16, --disable-shared-experts-fusion.
- --mem-fraction-static 0.8 → 0.85.
- Drop SGLANG_USE_AITER_UNIFIED_ATTN, AITER_FLYDSL_FORCE, --enable-aiter-allreduce-fusion, --page-size 16.
- Guard hf download so local (absolute-path) model dirs are not re-downloaded.
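
The download guard in the last item can be sketched roughly as follows. This is an illustration only, not the script's actual code: the helper name `should_download` and the example path are hypothetical, and the check is written as a POSIX `case` so it runs under any `sh`. It is intended to behave like a bash test of the form `[[ "$MODEL" != /* ]]`.

```shell
#!/bin/sh
# Hypothetical helper: succeeds (exit 0) when MODEL looks like a hub repo id,
# fails (exit 1) when MODEL is an absolute local path (starts with "/").
should_download() {
  case "$1" in
    /*) return 1 ;;   # absolute path: treat as a local model dir
    *)  return 0 ;;   # anything else: treat as a repo id to fetch
  esac
}

# Example value only; the real script receives MODEL from its caller.
MODEL="${MODEL:-/data/models/example-model}"

if should_download "$MODEL"; then
  # The real script would invoke the download here; this sketch only prints.
  echo "would run: hf download $MODEL"
else
  echo "local model dir, skipping download: $MODEL"
fi
```

With the default example value this takes the skip branch, since the path starts with `/`. A relative path such as `models/foo` would still be treated as a repo id, which is a limit of a pure path-shape check.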

A/B results

Qwen3.5-397B-A17B-MXFP4, MI355X, TP=2, ISL=OSL=1024. Arm A = current main script,
Arm B = this recipe. Both arms use the stock image sglang-hyperloom-env:v1, which
has no tuned CSVs present, so the comparison isolates the env/flags layer. Each
point was run in 3 interleaved passes with the server torn down between runs;
values are the per-point median (run-to-run CV ≤ ~1%).
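
For readers who want to reproduce the aggregation, here is a minimal sketch of computing a per-point median and run-to-run CV from a handful of passes. The helper name `median_and_cv` and the sample inputs are hypothetical. It assumes CV means sample standard deviation divided by the mean; the PR does not state which estimator was used.

```shell
#!/bin/sh
# Hypothetical helper: prints "<median> <cv_percent>" for the given values.
# Assumption: CV = sample standard deviation (n - 1) / mean, in percent.
median_and_cv() {
  printf '%s\n' "$@" | sort -n | awk '
    { v[NR] = $1; sum += $1 }
    END {
      n = NR
      mean = sum / n
      for (i = 1; i <= n; i++) ss += (v[i] - mean) ^ 2
      sd = (n > 1) ? sqrt(ss / (n - 1)) : 0
      # Input is sorted, so the median is the middle element (or the
      # average of the two middle elements for an even count).
      median = (n % 2) ? v[(n + 1) / 2] : (v[n / 2] + v[n / 2 + 1]) / 2
      printf "%.1f %.2f\n", median, 100 * sd / mean
    }'
}

# Made-up values for illustration, not benchmark data.
median_and_cv 100 102 101   # prints "101.0 0.99"
```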

conc	A total tok/s	B total tok/s	Δtotal	A median TPOT (ms)	B median TPOT (ms)	A p99 TTFT (ms)	B p99 TTFT (ms)
4	1168.4	1218.1	+4.3%	6.5	6.5	289	256
8	1810.9	1936.9	+7.0%	8.5	8.0	435	400
16	2705.1	2994.8	+10.7%	11.3	10.5	565	498
32	3782.4	4208.1	+11.3%	16.4	14.9	1161	919
64	5223.0	5712.5	+9.4%	23.8	22.2	2143	1800
128	6565.2	7301.2	+11.2%	38.3	34.9	4138	3337
256	OOM	8925.5	n/a	—	57.4	—	6554

(total tok/s = input + output.)
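
The Δtotal column can be rechecked from the A and B totals with a small helper. This is a sketch, assuming the delta is the relative change of B over A rounded to one decimal; `pct_delta` is a hypothetical name, not something from the benchmark tooling.

```shell
#!/bin/sh
# Hypothetical helper: relative change from A to B, as a signed percentage.
pct_delta() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%+.1f%%\n", (b - a) / a * 100 }'
}

pct_delta 1168.4 1218.1   # conc 4 total tok/s   -> +4.3%
pct_delta 6565.2 7301.2   # conc 128 total tok/s -> +11.2%
pct_delta 4138 3337       # conc 128 p99 TTFT    -> -19.4%
```

The same helper works for the latency columns, where a negative result is the improvement.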

Highlights

- +4–11% throughput at every concurrency where Arm A runs (conc 4–128), largest in the conc 16–128 mid/high-batch regime.
- Lower latency at the same time: p99 TTFT drops at every point, and median TPOT drops everywhere except conc 4, where it is unchanged at 6.5 ms (e.g. conc 128: TPOT 38.3 → 34.9 ms, p99 TTFT 4138 → 3337 ms ≈ −19%). This is not a throughput/latency trade.
- Unlocks conc 256: Arm A OOMs (EAGLE's separate draft model + mem-frac 0.8 leaves no KV pool); Arm B fits and serves it at ~8.9k tok/s (NEXTN has no separate draft model, mem-frac 0.85, shared-experts fusion disabled → more KV headroom on 2 GPUs).

Accuracy

gsm8k 5-shot, strict exact-match, served via the launch script in EVAL_ONLY mode:

Arm	strict exact-match
A (EAGLE, main)	78.62%
B (NEXTN, recipe)	78.70%

No regression (+0.08 pts, within noise).

Apply the validated MI355X TP=2 decode recipe to the single-node MTP benchmark (env/flags only; tuned aiter CSVs and sglang kernel patches are upstreamed separately):
- speculative-algorithm NEXTN (built-in MTP head) + spec-v1 chain path
- INT4 quick-reduce + single-stage AITER AllReduce
- enable dp-attention, mamba-ssm-dtype bf16, disable shared-experts fusion
- drop allreduce-fusion / unified-attn / flydsl-force / page-size 16
- guard hf download for local model paths

github-actions · 2026-06-29T23:41:19Z

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from their respective company's CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

claude

LGTM — env/flags-only tuning of a single benchmark recipe with thorough A/B results showing throughput and latency wins and no accuracy regression.

Extended reasoning...

Overview

This PR modifies a single benchmark launch script (benchmarks/single_node/fixed_seq_len/qwen3.5_fp4_mi355x_mtp.sh) for the Qwen3.5-397B-A17B-MXFP4 MI355X MTP recipe. Changes are limited to environment variables, SGLang CLI flags, and a guard around hf download for absolute model paths. No production code, library code, or shared infrastructure is touched.

Security risks

None. This is a benchmark launch shell script run by operators in controlled environments. The new hf download guard ([[ "$MODEL" != /* ]]) is a safe path-shape check that avoids re-downloading when an absolute local path is provided.

Level of scrutiny

Low. Benchmark recipe scripts are not production-critical code paths — they are tunable artifacts for performance characterization. The change is mechanical (flag and env updates), self-contained, and accompanied by detailed A/B data (3 interleaved passes per concurrency, run-to-run CV ≤ ~1%) demonstrating +4–11% throughput, lower TPOT/p99 TTFT, and unlocking concurrency 256 that previously OOMed. gsm8k accuracy is unchanged (78.62% → 78.70%).

Other factors

The bug hunting system found no issues. The PR scope is explicitly bounded to env/flags (tuned AITER CSVs and SGLang source patches are deferred to their respective upstreams). The auto-posted recipe-reminder is a generic checklist, not a substantive review comment. No CODEOWNER-owned code is touched.

amd-fuyuajin requested a review from a team June 29, 2026 23:41

claude Bot reviewed Jun 29, 2026

github-project-automation Bot added this to InferenceMAX Board Jun 29, 2026

amd-fuyuajin wants to merge 1 commit into main from perf/qwen3.5-fp4-mi355x-mtp-recipe
