Skip to content

Add MTP evaluation#1953

Open
hjjq wants to merge 31 commits into
mainfrom
codex/m1-speedbench-evals
Open

Add MTP evaluation#1953
hjjq wants to merge 31 commits into
mainfrom
codex/m1-speedbench-evals

Conversation

@hjjq

@hjjq hjjq commented Jun 29, 2026

Copy link
Copy Markdown
Collaborator

M1 of #1651
Currently tested against DSV4 vLLM & SGL. Still needs some iterations before merging.

qiching and others added 30 commits June 2, 2026 14:23
Push-button (workflow_dispatch) collection of the DeepSeek-V4-Pro
SPEED-Bench acceptance-length matrix (thinking on/off x MTP 1-8) on
self-hosted B300 runners, optionally opening a PR that updates
benchmarks/speedbench-reference-al.yaml.

- benchmarks/single_node/dsv4_fp4_b300_vllm_speedbench_matrix.sh:
  per (thinking, MTP) cell, serve vLLM, run SPEED-Bench, derive AL from
  /metrics, and emit the YAML matrix. Serves from MODEL_PATH (the local
  pre-staged weights resolved by the launcher), falling back to MODEL for
  a standalone local run. Carries a temporary --chat-template-kwargs shim
  until vllm-project/vllm#44244 lands in the benchmark image (idempotent,
  applied only for thinking-on cells).
- runners/launch_b300-nv.sh: add opt-in BENCH_SCRIPT_OVERRIDE and
  SALLOC_TIME_LIMIT hooks; both default to the prior behavior.
- .github/workflows/speedbench-al.yml: workflow_dispatch entry point;
  MODEL is the HF id so the launcher resolves the staged MODEL_PATH.
Make the workflow default to Option 1 (upload the AL matrix as an
artifact for manual review/paste) rather than auto-opening a PR. The
auto-PR path stays available as an opt-in (open-pr: true), but keeping
it off by default avoids exposing a write-scoped PAT on the self-hosted
runner and matches the repo's artifact-collection convention.
Address review:
- Model is now a workflow input (model + model-prefix, default
  deepseek-ai/DeepSeek-V4-Pro / dsv4). MODEL, MODEL_PREFIX, EXP_NAME,
  BENCH_SCRIPT_OVERRIDE, artifact names and the Create-PR branch/title/body
  are all derived from those inputs. The emitted YAML top-level key is now
  derived from the model (MODEL_KEY, defaults to the model basename lowercased).
- Move the collector to benchmarks/single_node/speedbench/dsv4_fp4_b300_vllm.sh
  and fix its benchmark_lib.sh source path (../ -> ../../) for the deeper dir.
…dbench-evals

# Conflicts:
#	.github/workflows/speedbench-al.yml
#	benchmarks/single_node/fixed_seq_len/dsv4_fp4_b200_trt_mtp.sh
#	benchmarks/single_node/fixed_seq_len/dsv4_fp4_b300_trt_mtp.sh
#	benchmarks/single_node/speedbench/dsv4_fp4_b300_vllm.sh
#	runners/launch_b300-nv.sh
#	runners/launch_gb200-nv.sh
#	runners/launch_gb300-cw.sh
#	utils/collect_eval_results.py
#	utils/evals/EVALS.md
@hjjq hjjq requested a review from a team June 29, 2026 19:54
@github-actions

Copy link
Copy Markdown
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.


感谢你的贡献!对于 vLLM 与 SGLang,请确保你的 recipe 与官方 vLLM recipes 和/或 SGLang cookbook 保持一致

如果不一致,请先创建一个 PR,之后我们才能将你的单节点 PR 合并到 master 分支。让我们确保文档保持一流水准,使整个 ML 社区都能从你的辛勤工作中受益!谢谢

PR 作者有责任确保合并后所有 GitHub Action 任务完全通过。 很多时候失败只是偶发抖动(flake),重新运行失败的任务即可解决。如果选择重新运行失败的任务,PR 作者有责任确保其最终通过。参见 GitHub 关于重新运行失败任务的文档:https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

一般而言,PR 作者应先向相应公司的 CODEOWNERS 请求审阅并获得 PR 批准,然后再请求核心维护者审阅。

如需更多帮助,PR 作者可通过 Slack 联系核心维护者。

Comment on lines 347 to 360
conc_label = f"[conc={match.group(1)}] " if match else ""
with open(f) as fh:
data = json.load(fh)

speedbench_ok, speedbench_checked = validate_speedbench_al(data, f)
if speedbench_checked:
checked += speedbench_checked
if not speedbench_ok:
failed = True
continue

for task, metrics in data.get("results", {}).items():
min_score, source = resolve_threshold(config, prefix, task, args.min_score)
for name, val in metrics.items():

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 validate_scores.py runs validate_batch_manifest on all globbed results*.json before the per-file SpeedBench skip in the main loop, so SpeedBench AL result files (results_speedbench_al_{mode}_mtp{mtp}.json) trip the manifest check fatally. With --expected-concs (multi-node eval-only), the single-conc path hard-fails with 'eval must produce exactly one result file' because the lm-eval + SpeedBench JSONs make len==2; in batched mode (eval_concs in meta_env.json), every SpeedBench file lacks a _concN suffix and triggers 'batched eval result lacks a concurrency suffix'. Partition SpeedBench files out of result_files before calling validate_batch_manifest (or have validate_batch_manifest skip files containing 'speedbench_al_eval_version').

Extended reasoning...

What the bug is

The PR adds SpeedBench AL eval result files alongside lm-eval results. run_speedbench_al_eval (in benchmark_lib.sh) writes results_speedbench_al_${mode}_mtp${mtp}.json into EVAL_RESULT_DIR (or PWD in the batched branch at benchmark_lib.sh:1713-1714), and write_dynamo_speedbench_al_from_logs.sh writes the same file shape into $GITHUB_WORKSPACE for multi-node Dynamo MTP runs.

validate_scores.py uses --results-glob 'results*.json' by default, which matches both the lm-eval JSONs and the new SpeedBench files. The PR teaches the per-file loop to skip SpeedBench files via the new validate_speedbench_al helper (utils/evals/validate_scores.py:347-360), but validate_batch_manifest is invoked at line 325 — before that loop — and it has no awareness of SpeedBench files.

Code path that triggers the failure

Two fatal paths in validate_batch_manifest:

  1. Non-batched multi-node MTP (--expected-concs passed, eval_concs absent from meta_env.json — exactly the configuration emitted by benchmark-multinode-tmpl.yml:321 and write_dynamo_speedbench_al_from_logs.sh, which does not write meta_env.json):

    • Line 117-118: if len(result_files) != 1: errors.append('eval must produce exactly one result file').
    • result_files contains the lm-eval JSON (copied from $LOGS_DIR/eval_results) and the SpeedBench JSON (written by write_dynamo_speedbench_al_from_logs.sh), so len == 2 and the run hard-fails.
  2. Batched single-node lm-eval + MTP (eval_concs in meta_env.json):

    • Lines 168-176 iterate every result file and require each to match CONC_SUFFIX_RE = r'_conc(\\d+)(?:_\\d+)?\.json$'. SpeedBench files are named results_speedbench_al_${mode}_mtp${mtp}.jsonrun_speedbench_al_eval is called once per engine, not per concurrency — so they never match. Each SpeedBench file triggers 'batched eval result lacks a concurrency suffix'.

Why existing code doesn't prevent it

The fix for SpeedBench files lives in the per-file loop (validate_speedbench_al at line 350, with continue at line 356), but validate_batch_manifest runs at line 325 — before the loop. By the time the loop's continue fires, failed = True has already been set by the manifest check, and main() returns 1.

Impact

Every multi-node MTP eval-only run after this PR merges will fail the 'Verify eval scores' step (benchmark-multinode-tmpl.yml:321). Every batched single-node lm-eval + MTP run will also fail. This is precisely the headline feature this PR adds (M1 of #1651, 'tested against DSV4 vLLM & SGL'), so the breakage is on the PR's own surface, not on unrelated configs.

Step-by-step proof (Scenario A — batched single-node MTP)

  1. User runs a DSV4 vLLM MTP single-node eval with multiple EVAL_CONCURRENT_REQUESTS (e.g. '1 32').
  2. run_eval (benchmark_lib.sh:1703-1714) takes the batched branch, exports EVAL_RESULT_DIR=$(pwd), and calls run_speedbench_al_eval once — writing results_speedbench_al_on_mtp2.json to PWD.
  3. The same branch then runs lm-eval per conc, producing results_*_conc1.json and results_*_conc32.json. append_lm_eval_summary writes meta_env.json containing eval_concs: [1, 32].
  4. benchmark-tmpl.yml:306 invokes python3 utils/evals/validate_scores.py in PWD (without --expected-concs).
  5. main() globs results*.json → 3 files (2 lm-eval + 1 SpeedBench).
  6. validate_batch_manifest is called at line 325. Meta has eval_concs, so we fall through to lines 168-176.
  7. For each result_file, CONC_SUFFIX_RE.search is called. results_*_conc1.json matches, results_*_conc32.json matches, but results_speedbench_al_on_mtp2.json does not (no _concN suffix).
  8. errors.append('batched eval result lacks a concurrency suffix: results_speedbench_al_on_mtp2.json').
  9. main() prints FAIL: batched eval result lacks a concurrency suffix: … and returns 1, even though the lm-eval and SpeedBench results are individually valid.

How to fix

Filter SpeedBench files out of result_files before invoking validate_batch_manifest — e.g. partition by reading each candidate file once and excluding any whose contents contain the speedbench_al_eval_version key, or by filename pattern results_speedbench_al_*.json. Alternatively, teach validate_batch_manifest itself to ignore SpeedBench files.

Comment on lines +1189 to +1199
_speedbench_prepare_dataset() {
local speedbench_dir="$1"
if [[ -f "$speedbench_dir/qualitative.jsonl" ]]; then
return 0
fi
mkdir -p "$speedbench_dir"
python3 -m pip install -q datasets tiktoken
curl -LsSf https://raw.githubusercontent.com/NVIDIA-NeMo/Skills/refs/heads/main/nemo_skills/dataset/speed-bench/prepare.py \
| python3 - --config qualitative --output_dir "$speedbench_dir"
[[ -f "$speedbench_dir/qualitative.jsonl" ]]
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟣 SpeedBench dataset prep in _speedbench_prepare_dataset carries two pre-existing repo hygiene issues that this PR now centralizes into a shared helper: (1) prepare.py is fetched from NVIDIA-NeMo/Skills@refs/heads/main and piped directly into python3 — unpinned and non-auditable; (2) the default speedbench_dir=$(pwd)/speed_bench_data resolves to /workspace/speed_bench_data in single-node MTP scripts (PWD is /workspace), creating a new directory under /workspace which AGENTS.md explicitly forbids during a benchmark. Both patterns already exist in 7+ benchmarks/single_node/speedbench/*_b300_vllm.sh scripts, but since this PR moves them into a shared helper that more recipes will use, it's a good moment to (a) pin the curl URL to a commit SHA and download-then-execute, and (b) default to /tmp/speed_bench_data (or any non-/workspace path).

Extended reasoning...

Where this lives

benchmarks/benchmark_lib.sh:1189-1199 — the new _speedbench_prepare_dataset helper introduced by this PR:

_speedbench_prepare_dataset() {
    local speedbench_dir="$1"
    if [[ -f "$speedbench_dir/qualitative.jsonl" ]]; then
        return 0
    fi
    mkdir -p "$speedbench_dir"
    python3 -m pip install -q datasets tiktoken
    curl -LsSf https://raw.githubusercontent.com/NVIDIA-NeMo/Skills/refs/heads/main/nemo_skills/dataset/speed-bench/prepare.py \
      | python3 - --config qualitative --output_dir "$speedbench_dir"
    [[ -f "$speedbench_dir/qualitative.jsonl" ]]
}

It is invoked from run_speedbench_al_eval (benchmark_lib.sh:1255) which defaults speedbench_dir="${SPEEDBENCH_DIR:-$(pwd)/speed_bench_data}".

Issue 1 — unpinned curl-pipe-python

refs/heads/main resolves to whatever NVIDIA-NeMo/Skills HEAD is at curl time. There is no commit SHA, no tag, and no checksum. Consequences:

  1. Supply chain: any force-push, account compromise, or accidental breaking change to that upstream file silently executes arbitrary Python inside the benchmark CI container, which has cluster credentials, model weights, and CI tokens accessible to it.
  2. Reproducibility: every CI invocation can see a different prepare.py. That defeats the point of the golden-AL comparison this PR introduces — if upstream changes how the SpeedBench corpus is constructed, the AL reference stops being meaningful.
  3. Auditability: piping straight into python3 means the executed bytes never touch disk, so CI logs cannot show what was executed for a post-hoc diff between a known-good and a regressed run.

The fix is mechanical: replace refs/heads/main with a specific commit SHA in the URL and use the download-then-execute idiom (curl -L … -o /tmp/prepare.py && python3 /tmp/prepare.py …) so CI logs can attest to what ran.

Issue 2 — new directory under /workspace

AGENTS.md (line 149) states explicitly:

No new directories in /workspace during a benchmark (files are fine).

Walking through the call chain for the new MTP recipes touched by this PR:

  1. benchmarks/single_node/fixed_seq_len/dsv4_fp4_b200_vllm_mtp.sh and …_b300_vllm_mtp.sh use absolute /workspace/server.log and --result-dir /workspace/ paths.
  2. runners/launch_b200-dgxc.sh sets --container-workdir=$CONTAINER_MOUNT_DIR with default /workspace for every image except the deepseek-v4-blackwell sglang one (/ix).
  3. So when _speedbench_prepare_dataset runs and SPEEDBENCH_DIR is unset, $(pwd)/speed_bench_data expands to /workspace/speed_bench_data, and mkdir -p creates it on first invocation.

This directory persists across benchmark runs on the same self-hosted runner (the helper is cache-aware: it short-circuits once qualitative.jsonl exists, but the directory itself stays). Self-hosted runners that switch users between jobs can hit checkout failures from root-owned residue under /workspace.

Fix: local speedbench_dir="${SPEEDBENCH_DIR:-/tmp/speed_bench_data}" (or any non-/workspace path). The SPEEDBENCH_DIR override is already plumbed through.

Why pre_existing, not normal

Both patterns are repo conventions today. The unpinned curl … prepare.py | python3 line appears verbatim in benchmarks/single_node/speedbench/{dsr1,dsv4,glm5,glm52,kimik2.5,minimaxm3,qwen3.5}_fp4_b300_vllm.sh. The SPEEDBENCH_DIR=${SPEEDBENCH_DIR:-/workspace/speed_bench_data} pattern likewise pre-dates this PR. Verifiers split between pre_existing (because the diff didn't introduce either anti-pattern) and normal (because the shared helper increases blast radius — the new MTP flow goes through this path on more recipes than the original 7).

I'm filing this as pre_existing: the diff consolidates existing behavior and the right fix is a uniform update across all callers (the new helper plus the 7 existing scripts) rather than a one-off in this PR. It is worth flagging in review now because (a) centralizing the bad pattern is the natural moment to fix it cluster-wide, and (b) the PR description explicitly says it 'needs some iterations before merging'.

Proof — concrete trace

Starting from dsv4_fp4_b200_vllm_mtp.sh with RUN_EVAL=true and SPEC_DECODING=mtp:

  1. The recipe is launched by runners/launch_b200-dgxc.sh with --container-workdir=/workspace (default CONTAINER_MOUNT_DIR).
  2. Inside the recipe, run_eval --framework lm-eval --port "$PORT" is called.
  3. run_eval (benchmark_lib.sh:1764) calls run_speedbench_al_eval "${forwarded[@]}" since the single concurrency branch is taken.
  4. run_speedbench_al_eval (benchmark_lib.sh:1255) sets speedbench_dir="${SPEEDBENCH_DIR:-$(pwd)/speed_bench_data}". With SPEEDBENCH_DIR unset and PWD=/workspace, this is /workspace/speed_bench_data.
  5. _speedbench_prepare_dataset "/workspace/speed_bench_data" runs: qualitative.jsonl doesn't yet exist, so mkdir -p /workspace/speed_bench_data creates a new directory under /workspace — violating AGENTS.md line 149.
  6. The next line, curl -LsSf https://raw.githubusercontent.com/NVIDIA-NeMo/Skills/refs/heads/main/nemo_skills/dataset/speed-bench/prepare.py | python3 - …, fetches whatever is at main and runs it without saving the script to disk. A determined-attacker scenario aside, the practical risk is reproducibility: a CI run last week and one today can see different generation logic for the very dataset that gates the golden-AL pass/fail check.

Both fixes are one-line each. Recommend applying them in this PR (or a precursor) and back-porting to the 7 existing speedbench scripts.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

3 participants