Multi-GPU FP8 (accelerate.dispatch_model + DelayedScaling HYBRID) — illegal memory access on H200 sm_90 #3124

@roir

Summary

Multi-GPU FP8 forward through accelerate.dispatch_model + te.fp8_autocast(DelayedScaling(HYBRID)) fails with CUDA error: an illegal memory access was encountered on H200 NVL (sm_90). The same code path with single-GPU placement succeeds; the same code path on B200 (sm_100) also succeeds.

The harness that originally surfaced this reports the failure as the more generic cuBLAS Error: the function failed to launch on the GPU at transformer_engine/common/gemm/cublaslt_gemm.cu:549. The minrepro below catches the underlying illegal memory access earlier. The most plausible reconciliation is that both are the same asynchronous CUDA failure surfacing at different sync points: the illegal memory access is the root cause and corrupts CUDA state, and a later kernel launch then fails with the generic "launch failed" error.

Environment

torch:           2.8.0a0+34c6371d24.nv25.08
CUDA:            13.0
cuBLAS Lt:       libcublasLt.so.13 (version 130000)
TE:              2.5.0+f05f12c
accelerate:      1.14.0
GPUs:            2 × NVIDIA H200 NVL (sm_90, 139.8 GB each)
NGC container:   nvcr.io/nvidia/pytorch:25.08-py3

Also reproduced on nvcr.io/nvidia/pytorch:25.05-py3 (TE 2.3.0), so the failure is not specific to one TE release (tested: 2.3.0 and 2.5.0).

Reproduction

Self-contained script — no model download, no eval harness, no third-party model code. ~250 lines, runnable on any 2-GPU node:

https://github.com/NVIDIA/voltek/blob/main/scripts/te_h200_fp8_multigpu_minrepro.py

python te_h200_fp8_multigpu_minrepro.py --json > result.json

What the script does:

  1. Builds a 7-layer Linear stack at Llama-3.3-70B canonical shapes — q/k/v/o_proj (8192×{8192,1024}), gate/up_proj (8192×28672), down_proj (28672×8192). All te.pytorch.Linear, params_dtype=torch.bfloat16, bias=False.
  2. accelerate.dispatch_model(model, device_map={q/k/v/o_proj: 0, gate/up/down_proj: 1}) — splits parameters across cuda:0 / cuda:1 and attaches AlignDevicesHook.
  3. Forward on a bf16 input of shape (tokens=12000, hidden=8192) on cuda:0 (12000 = batch 4 × seq 3000, matching a real first MMLU forward).
  4. Wrapped in torch.autocast("cuda", dtype=torch.bfloat16) + te.fp8_autocast(enabled=True, fp8_recipe=DelayedScaling(margin=0, fp8_format=Format.HYBRID)).
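
The four steps above can be sketched roughly as follows. This is not the linked minrepro script: the shapes, device map, and recipe come from the description, while the names `SHAPES`, `DEVICE_MAP`, `Stack`, and `run_repro`, and the simplified forward (which chains the projections and skips attention itself), are assumptions made for illustration.

```python
# (in_features, out_features) per layer, at the Llama-3.3-70B sizes listed above.
SHAPES = {
    "q_proj": (8192, 8192),
    "k_proj": (8192, 1024),
    "v_proj": (8192, 1024),
    "o_proj": (8192, 8192),
    "gate_proj": (8192, 28672),
    "up_proj": (8192, 28672),
    "down_proj": (28672, 8192),
}
# Attention projections on cuda:0, MLP projections on cuda:1.
DEVICE_MAP = {
    name: (1 if name in ("gate_proj", "up_proj", "down_proj") else 0)
    for name in SHAPES
}
TOKENS, HIDDEN = 4 * 3000, 8192  # batch=4, seq=3000 flattened to 12000 rows


def run_repro():
    """Hypothetical driver; needs 2 GPUs, transformer_engine, and accelerate."""
    import torch
    import transformer_engine.pytorch as te
    from accelerate import dispatch_model
    from transformer_engine.common.recipe import DelayedScaling, Format

    class Stack(torch.nn.Module):  # illustrative name, not from the script
        def __init__(self):
            super().__init__()
            for name, (n_in, n_out) in SHAPES.items():
                layer = te.Linear(n_in, n_out, bias=False,
                                  params_dtype=torch.bfloat16)
                self.add_module(name, layer)

        def forward(self, x):
            # Simplified: k/v are computed but attention itself is skipped.
            q, _k, _v = self.q_proj(x), self.k_proj(x), self.v_proj(x)
            h = self.o_proj(q)        # output stays on cuda:0
            gate = self.gate_proj(h)  # AlignDevicesHook moves h to cuda:1
            up = self.up_proj(h)
            ff = gate * up            # element-wise on cuda:1
            return self.down_proj(ff)

    model = dispatch_model(Stack(), device_map=DEVICE_MAP)
    x = torch.randn(TOKENS, HIDDEN, dtype=torch.bfloat16, device="cuda:0")
    recipe = DelayedScaling(margin=0, fp8_format=Format.HYBRID)
    with torch.autocast("cuda", dtype=torch.bfloat16), \
            te.fp8_autocast(enabled=True, fp8_recipe=recipe):
        return model(x)
```

Nothing touches the GPU until `run_repro()` is called, so the shape table and device map can be inspected on any machine.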

Failure trace

The forward fails at the cuda:0 → cuda:1 device boundary, specifically at the element-wise gate * up operation (both tensors on cuda:1) after accelerate.AlignDevicesHook has moved the attention output across devices:

File "te_h200_fp8_multigpu_minrepro.py", line 204, in forward
    ff = gate * up   # element-wise, both cuda:1
         ~~~~~^~~~
torch.AcceleratorError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,
so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1

The fp8_autocast context-manager exit then raises a second time at the FP8 amax reduction:

File "transformer_engine/pytorch/fp8.py", line 477, in fp8_autocast_exit
    cls.reduce_and_update_fp8_tensors(forward=True)
File "transformer_engine/pytorch/fp8.py", line 383, in reduce_and_update_fp8_tensors
    contiguous_amax = torch.cat(amax_buffer)
                      ^^^^^^^^^^^^^^^^^^^^^^
torch.AcceleratorError: CUDA error: an illegal memory access was encountered

(Same async error, surfacing at the next CUDA sync point.)
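
Because the error is reported asynchronously, the Python line in a traceback is not necessarily the kernel that faulted. Besides CUDA_LAUNCH_BLOCKING=1, one way to narrow it down is to synchronize after every step and stop at the first failure. A minimal sketch follows; `first_failing_step` and its `sync` parameter are hypothetical names, and the default sync assumes PyTorch with CUDA is available.

```python
def first_failing_step(steps, sync=None):
    """Run (name, fn) steps in order, synchronizing after each one.

    Returns the name of the first step whose work (or the sync right
    after it) raises RuntimeError, or None if every step completes.
    """
    if sync is None:
        import torch  # lazy import so the helper can be tested without a GPU

        def sync():
            # Wait for queued work on every visible device so an async
            # fault is raised here rather than at a later, unrelated call.
            for dev in range(torch.cuda.device_count()):
                torch.cuda.synchronize(dev)

    for name, fn in steps:
        try:
            fn()
            sync()
        except RuntimeError:
            # Assumption: torch.AcceleratorError (seen in the traces above)
            # derives from RuntimeError, as CUDA errors in PyTorch usually do.
            return name
    return None
```

Only the first reported step is meaningful: after an illegal memory access the CUDA context is unusable, so the helper returns immediately instead of continuing.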

Working baselines

Both of these use the same TE version, same DelayedScaling(HYBRID) recipe, and same Linear shapes — the only thing that differs is single-GPU vs multi-GPU, or H200 vs B200:

  • Single-GPU H200 NVL: a sibling script (scripts/te_h200_fp8_minrepro.py) builds the same 7 Linear shapes on a single GPU (no accelerate.dispatch_model) and runs the same fp8_autocast forwards. It passed 21/21 cells (7 shapes × batch ∈ {1, 4, 16}) on the same H200 NVL hardware, so raw TE FP8 Linear GEMM on Hopper is healthy.
  • Multi-GPU B200: the same accelerate.dispatch_model + te.fp8_autocast pattern, wrapped around a real Llama-3.1-70B model with batch=4 seq~3000 MMLU prompts on 2× B200 (sm_100), succeeds end-to-end and produces measured accuracy + energy + wall-time deltas (-0.11 pp MMLU, -24.2% energy, -18.5% wall-time vs BF16). See https://github.com/NVIDIA/voltek/blob/main/analysis/B200_FP8_LLAMA31_2026-06-11.md.

So the failure axis is multi-GPU TE FP8 on Hopper sm_90 specifically — not TE FP8 in general, not Hopper FP8 in general, not multi-GPU FP8 in general.

What's eliminated

| Hypothesis | Test | Result |
|---|---|---|
| TE-version bug in 2.3 | Bumped to NGC 25.08 with TE 2.5.0 | Same failure |
| Insufficient cuBLAS Lt workspace | CUBLASLT_WORKSPACE_SIZE=536870912 (512 MB) | Same failure |
| OOM / memory pressure | 70B BF16 fits in 2 × 141 GB with ~50 GB headroom per GPU; no OOM signals | Not the cause |
| Raw TE FP8 Linear GEMM broken on Hopper | Single-GPU minrepro 21/21 PASS | Not the cause |
| Cross-device input movement broken on Hopper | The AlignDevicesHook move itself completes; the failure is on the next kernel after the move | Not the cause (alone) |

Hypothesis

Some interaction between (a) accelerate.AlignDevicesHook's cross-device tensor movement and (b) TE's per-Linear lazy fp8_meta initialization (set_device + amax tensor allocation on first forward) produces a corrupted CUDA allocation or stream-binding on Hopper sm_90 specifically. On B200 sm_100 the same sequence succeeds; on H200 NVL it produces the illegal memory access on the next kernel that touches a cross-device tensor.

Workarounds requested

Any of these would unblock H200 multi-GPU FP8 deployments:

  1. A documented env var or recipe knob that disables the lazy fp8_meta init path on Hopper multi-GPU (or forces eager init at module-construction time).
  2. A documented "use device_map={...all on one device...} + manual tensor moves" pattern that bypasses accelerate.dispatch_model for FP8 multi-GPU on Hopper.
  3. A TE patch that handles the accelerate.AlignDevicesHook hook stack correctly on sm_90.
  4. A confirmation that this is fixed in a later TE version (we have not yet tested TE 2.12+; happy to test if you'd like).
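
For reference, the pattern asked for in option 2 might look roughly like the sketch below: no accelerate.dispatch_model and no AlignDevicesHook, with each stage placed on one device and the activation moved by hand. Whether this avoids the fault on H200 has not been tested. `boundaries` and `run_stages` are hypothetical helper names, and the sketch assumes every stage is a module already moved to its device and that the input starts on a CUDA device.

```python
def boundaries(order, device_map):
    """Return (prev, next, src, dst) for each point where the device changes."""
    out = []
    for prev, nxt in zip(order, order[1:]):
        src, dst = device_map[prev], device_map[nxt]
        if src != dst:
            out.append((prev, nxt, src, dst))
    return out


def run_stages(stages, x):
    """stages: list of (module, device_index); x is moved by hand between devices."""
    import torch  # lazy import: only needed when actually running on GPUs

    for module, dev in stages:
        target = torch.device("cuda", dev)
        if x.device != target:
            # Conservative: drain the producer device before the copy.
            torch.cuda.synchronize(x.device)
            x = x.to(target)
        with torch.cuda.device(target):  # make `dev` current for this stage
            x = module(x)
    return x
```

The caller would wrap `run_stages(...)` in the same torch.autocast and te.fp8_autocast contexts as the failing run; `boundaries` only reports where a manual move is needed for a given layer order and device map.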

Additional context

This came up via a convert_to_fp8 developer tool (https://github.com/NVIDIA/voltek/) that does an in-process structural swap nn.Linear → te.pytorch.Linear and serves the converted model directly from a Python process. The Voltek conversion path is not in the failure surface: the minrepro builds the te.pytorch.Linear modules directly and reproduces the same illegal memory access. Voltek's 70B FP8 demo on H200 currently inherits this issue and is held back until an upstream fix lands; the B200 demo path works fully end-to-end and is what https://github.com/NVIDIA/voltek/blob/main/analysis/B200_FP8_LLAMA31_2026-06-11.md documents.

The minrepro script is committed at scripts/te_h200_fp8_multigpu_minrepro.py in the Voltek repo (link above). The full result.json from the run that produced this issue is at demos/k8s/voltek-native-accuracy/recorded-runs/2026-06-12-krusty-h200/peek-h200-mgpu/h200_multigpu_minrepro.json.
