Update Qwen3.5 FP4 MI355X MTP recipe with tuned env/flags #1957

Open
amd-fuyuajin wants to merge 1 commit into main from perf/qwen3.5-fp4-mi355x-mtp-recipe

@amd-fuyuajin

Collaborator

Summary

Updates the Qwen3.5-397B-A17B-MXFP4 MI355X MTP launch recipe
(benchmarks/single_node/fixed_seq_len/qwen3.5_fp4_mi355x_mtp.sh) with a tuned,
env/flags-only configuration. The change yields +4–11% total throughput at
concurrency 4–128, with equal or lower median TPOT and lower p99 TTFT, and it
unlocks concurrency 256 (where the current script OOMs), at no accuracy cost.

Scope: environment variables and server flags only. Tuned AITER kernel CSVs and
SGLang source patches are intentionally out of scope here and will be upstreamed
to their respective repos (aiter / sglang).

What changed

qwen3.5_fp4_mi355x_mtp.sh (single file):

  • Speculative algorithm EAGLE → NEXTN (built-in MTP head on the ROCm-hardened
    spec-v1 chain path); add SGLANG_ENABLE_SPEC_V2=0.
  • Add ROCM_QUICK_REDUCE_QUANTIZATION=INT4 and AITER_AR_1STAGE=1
    (INT4 quick-reduce + single-stage AITER AllReduce — reduces TP=2 AllReduce latency).
  • Add --enable-dp-attention, --mamba-ssm-dtype bfloat16,
    --disable-shared-experts-fusion.
  • --mem-fraction-static 0.8 → 0.85.
  • Drop SGLANG_USE_AITER_UNIFIED_ATTN, AITER_FLYDSL_FORCE,
    --enable-aiter-allreduce-fusion, --page-size 16.
  • Guard hf download so local (absolute-path) model dirs are not re-downloaded.
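The environment variables and flags listed above can be collected into a short sketch. It is not the script from this PR: it only restates the values named in the list, the `MODEL` default is a hypothetical placeholder path, `--tp-size 2` is taken from the TP=2 setup in the A/B section, and any flags the PR leaves unchanged are omitted. It prints the launch command instead of running it.

```bash
#!/usr/bin/env bash
# Sketch only: restates the tuned env/flags from the PR description.
# The real script contains other, unchanged settings that are not shown here.

# Speculative decoding on the spec-v1 chain path.
export SGLANG_ENABLE_SPEC_V2=0
# INT4 quick-reduce plus single-stage AITER AllReduce.
export ROCM_QUICK_REDUCE_QUANTIZATION=INT4
export AITER_AR_1STAGE=1

# Hypothetical placeholder; the real script receives MODEL from its caller.
MODEL="${MODEL:-/path/to/Qwen3.5-397B-A17B-MXFP4}"

SERVER_ARGS=(
  --model-path "$MODEL"
  --tp-size 2
  --speculative-algorithm NEXTN
  --enable-dp-attention
  --mamba-ssm-dtype bfloat16
  --disable-shared-experts-fusion
  --mem-fraction-static 0.85
)

# Print the command rather than launching a server.
printf 'python3 -m sglang.launch_server'
printf ' %q' "${SERVER_ARGS[@]}"
printf '\n'
```

Keeping the flags in an array makes each addition or removal a one-line change, which suits an env/flags-only diff.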

A/B results

Qwen3.5-397B-A17B-MXFP4, MI355X, TP=2, ISL=OSL=1024. Arm A = current main script,
Arm B = this recipe. Stock image sglang-hyperloom-env:v1 (no tuned CSVs present —
isolates the env/flags layer). 3 interleaved passes per point, server torn down
between runs; values are the per-point median (run-to-run CV ≤ ~1%).
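The aggregation described above (median of three passes, with the run-to-run coefficient of variation as a stability check) can be sketched as follows. The three sample values are hypothetical, not measurements from this PR, and the use of the sample standard deviation (n − 1) is an assumption, since the description does not say which estimator was used.

```bash
#!/usr/bin/env bash
# Median of the values passed as arguments.
median() {
  printf '%s\n' "$@" | sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR % 2) print v[(NR + 1) / 2]
      else        print (v[NR / 2] + v[NR / 2 + 1]) / 2
    }'
}

# Coefficient of variation in percent: sample standard deviation / mean.
cv_percent() {
  printf '%s\n' "$@" | awk '
    { s += $1; ss += $1 * $1 }
    END {
      m = s / NR
      var = (ss - NR * m * m) / (NR - 1)
      if (var < 0) var = 0          # guard against rounding below zero
      printf "%.2f\n", 100 * sqrt(var) / m
    }'
}

# Hypothetical throughput readings from three passes at one concurrency.
median 100 101 99
cv_percent 100 101 99
```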

| conc | A total tok/s | B total tok/s | Δ total | A median TPOT (ms) | B median TPOT (ms) | A p99 TTFT (ms) | B p99 TTFT (ms) |
|---|---|---|---|---|---|---|---|
| 4 | 1168.4 | 1218.1 | +4.3% | 6.5 | 6.5 | 289 | 256 |
| 8 | 1810.9 | 1936.9 | +7.0% | 8.5 | 8.0 | 435 | 400 |
| 16 | 2705.1 | 2994.8 | +10.7% | 11.3 | 10.5 | 565 | 498 |
| 32 | 3782.4 | 4208.1 | +11.3% | 16.4 | 14.9 | 1161 | 919 |
| 64 | 5223.0 | 5712.5 | +9.4% | 23.8 | 22.2 | 2143 | 1800 |
| 128 | 6565.2 | 7301.2 | +11.2% | 38.3 | 34.9 | 4138 | 3337 |
| 256 | OOM | 8925.5 | n/a | n/a | 57.4 | n/a | 6554 |

(total tok/s = input + output.)
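The Δ total column is the relative change from Arm A to Arm B. A minimal sketch for recomputing it from the table values, assuming only `awk` is available (the helper name `pct_delta` is made up for this example):

```bash
#!/usr/bin/env bash
# Relative change from a to b, as a signed percentage with one decimal.
pct_delta() {
  awk -v a="$1" -v b="$2" 'BEGIN { printf "%+.1f%%\n", (b - a) / a * 100 }'
}

# conc, A total tok/s, B total tok/s (rows copied from the table above).
while read -r conc a b; do
  printf 'conc %-4s %s\n' "$conc" "$(pct_delta "$a" "$b")"
done <<'EOF'
4 1168.4 1218.1
8 1810.9 1936.9
16 2705.1 2994.8
32 3782.4 4208.1
64 5223.0 5712.5
128 6565.2 7301.2
EOF
```

The same helper applies to the latency columns, where a negative result is the improvement (for example `pct_delta 4138 3337` for p99 TTFT at conc 128).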

Highlights

  • +4–11% throughput at every concurrency where both arms run (conc 4–128), largest in the
    conc 16–128 mid/high-batch regime.
  • Lower latency at the same time: p99 TTFT drops at every point, and median TPOT is flat at
    conc 4 and lower from conc 8 up (e.g. conc 128: TPOT 38.3 → 34.9 ms, p99 TTFT
    4138 → 3337 ms ≈ −19%), so this is not a throughput/latency trade.
  • Unlocks conc 256 — Arm A OOMs (EAGLE's separate draft model + mem-frac 0.8 leaves
    no KV pool); Arm B fits and serves it at ~8.9k tok/s (NEXTN has no separate draft model,
    mem-frac 0.85, shared-experts fusion disabled → more KV headroom on 2 GPUs).
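As a rough illustration of the mem-frac part of that headroom, the extra static budget per GPU is the capacity times the change in fraction. The 288 GB per-GPU capacity below is an assumption, not a figure from this PR, and the calculation ignores the other two effects (no separate draft model, shared-experts fusion disabled).

```bash
#!/usr/bin/env bash
# Extra static memory budget (GB) when --mem-fraction-static is raised.
# Arguments: per-GPU capacity in GB, old fraction, new fraction.
static_pool_gain_gb() {
  awk -v cap="$1" -v old_frac="$2" -v new_frac="$3" \
    'BEGIN { printf "%.1f\n", cap * (new_frac - old_frac) }'
}

# Assumed capacity of 288 GB per GPU; 0.80 -> 0.85 as in this recipe.
static_pool_gain_gb 288 0.80 0.85
```

Multiply by the TP degree (2 here) for the node-level figure.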

Accuracy

gsm8k 5-shot, strict exact-match, served via the launch script in EVAL_ONLY mode:

| Arm | strict exact-match |
|---|---|
| A (EAGLE, main) | 78.62% |
| B (NEXTN, recipe) | 78.70% |

No regression (+0.08 pts, within noise).
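One way to sanity-check "within noise" is the binomial standard error of an accuracy score. The sketch below assumes the full GSM8K test split of 1,319 problems was scored; the PR does not state the sample count, although 78.62% and 78.70% are consistent with 1037 and 1038 correct answers out of 1,319.

```bash
#!/usr/bin/env bash
# Binomial standard error of an accuracy score, in percentage points.
# Arguments: accuracy in percent, number of scored problems.
binom_se_pts() {
  awk -v acc="$1" -v n="$2" \
    'BEGIN { p = acc / 100; printf "%.2f\n", 100 * sqrt(p * (1 - p) / n) }'
}

# Assumed n = 1319 (full GSM8K test split).
binom_se_pts 78.62 1319
```

Under that assumption the standard error is a little over one percentage point, so a 0.08-point gap is far inside it.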

Apply the validated MI355X TP=2 decode recipe to the single-node MTP
benchmark (env/flags only; tuned aiter CSVs and sglang kernel patches are
upstreamed separately):
- speculative-algorithm NEXTN (built-in MTP head) + spec-v1 chain path
- INT4 quick-reduce + single-stage AITER AllReduce
- enable dp-attention, mamba-ssm-dtype bf16, disable shared-experts fusion
- drop allreduce-fusion / unified-attn / flydsl-force / page-size 16
- guard hf download for local model paths
amd-fuyuajin requested a review from a team on June 29, 2026, 23:41
@github-actions

Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.



@claude claude Bot left a comment

Contributor

LGTM — env/flags-only tuning of a single benchmark recipe with thorough A/B results showing throughput and latency wins and no accuracy regression.

Extended reasoning...

Overview

This PR modifies a single benchmark launch script (benchmarks/single_node/fixed_seq_len/qwen3.5_fp4_mi355x_mtp.sh) for the Qwen3.5-397B-A17B-MXFP4 MI355X MTP recipe. Changes are limited to environment variables, SGLang CLI flags, and a guard around hf download for absolute model paths. No production code, library code, or shared infrastructure is touched.

Security risks

None. This is a benchmark launch shell script run by operators in controlled environments. The new hf download guard ([[ "$MODEL" != /* ]]) is a safe path-shape check that avoids re-downloading when an absolute local path is provided.
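The path-shape check quoted above can be sketched as a small helper. Only the `[[ "$MODEL" != /* ]]` test and the `hf download` command come from this PR; the function names and the repo id in the usage comment are hypothetical, and the actual script may structure the guard differently.

```bash
#!/usr/bin/env bash
# Succeeds (exit 0) when the argument is not an absolute path, i.e. it is
# treated as a Hub repo id that still has to be fetched.
needs_download() {
  local model="$1"
  [[ "$model" != /* ]]
}

# Hypothetical wrapper: download only when the guard says so.
fetch_model() {
  local model="$1"
  if needs_download "$model"; then
    hf download "$model"
  else
    echo "skip download: $model is a local path"
  fi
}

# Usage (not executed here):
#   fetch_model "some-org/some-model"   # runs hf download
#   fetch_model "/data/models/local"    # prints the skip message
```

Because the test only looks for a leading `/`, a relative local path such as `./models/x` would still be passed to `hf download`.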

Level of scrutiny

Low. Benchmark recipe scripts are not production-critical code paths — they are tunable artifacts for performance characterization. The change is mechanical (flag and env updates), self-contained, and accompanied by detailed A/B data (3 interleaved passes per concurrency, run-to-run CV ≤ ~1%) demonstrating +4–11% throughput, lower TPOT/p99 TTFT, and unlocking concurrency 256 that previously OOMed. gsm8k accuracy is unchanged (78.62% → 78.70%).

Other factors

The bug hunting system found no issues. The PR scope is explicitly bounded to env/flags (tuned AITER CSVs and SGLang source patches are deferred to their respective upstreams). The auto-posted recipe-reminder is a generic checklist, not a substantive review comment. No CODEOWNER-owned code is touched.
