Summary
Multi-GPU FP8 forward through accelerate.dispatch_model + te.fp8_autocast(DelayedScaling(HYBRID)) fails with CUDA error: an illegal memory access was encountered on H200 NVL (sm_90). The same code path with single-GPU placement succeeds; the same code path on B200 (sm_100) also succeeds.
The harness that originally surfaced this reports the failure as the more generic cuBLAS Error: the function failed to launch on the GPU at transformer_engine/common/gemm/cublaslt_gemm.cu:549. The minrepro below catches the underlying illegal memory access earlier. The most plausible reconciliation is that both are the same asynchronous CUDA failure surfacing at different sync points: the illegal memory access is the root cause, and it corrupts CUDA state before a later kernel reports the generic "launch failed" error.
Environment
torch: 2.8.0a0+34c6371d24.nv25.08
CUDA: 13.0
cuBLAS Lt: libcublasLt.so.13 (version 130000)
TE: 2.5.0+f05f12c
accelerate: 1.14.0
GPUs: 2 × NVIDIA H200 NVL (sm_90, 139.8 GB each)
NGC container: nvcr.io/nvidia/pytorch:25.08-py3
Also reproduced on nvcr.io/nvidia/pytorch:25.05-py3 (TE 2.3.0), so the failure is not specific to one TE version.
Reproduction
Self-contained script — no model download, no eval harness, no third-party model code. ~250 lines, runnable on any 2-GPU node:
https://github.com/NVIDIA/voltek/blob/main/scripts/te_h200_fp8_multigpu_minrepro.py
python te_h200_fp8_multigpu_minrepro.py --json > result.json
What the script does:
- Builds a 7-layer Linear stack at Llama-3.3-70B canonical shapes: q/k/v/o_proj (8192×{8192,1024}), gate/up_proj (8192×28672), down_proj (28672×8192). All te.pytorch.Linear, params_dtype=torch.bfloat16, bias=False.
- accelerate.dispatch_model(model, device_map={q/k/v/o_proj: 0, gate/up/down_proj: 1}): splits parameters across cuda:0 / cuda:1 and attaches AlignDevicesHook.
- Forward at input (height=12000, hidden=8192) bf16 on cuda:0 (matches batch=4, seq=3000 of a real first MMLU forward).
- Wrapped in torch.autocast("cuda", dtype=torch.bfloat16) + te.fp8_autocast(enabled=True, fp8_recipe=DelayedScaling(margin=0, fp8_format=Format.HYBRID)).
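To make the layout above concrete, here is a small pure-Python sketch of the shapes, the two-device placement, and the resulting BF16 weight footprint per GPU. It is not taken from the minrepro script: the names LAYER_SHAPES, DEVICE_MAP, weight_bytes_per_device, and input_rows are hypothetical, and assigning the 1024-wide outputs to k_proj and v_proj is an assumption based on Llama's grouped-query attention.

```python
# (in_features, out_features) per projection, as listed above.
# Assumption: the 1024-wide outputs belong to k_proj and v_proj (GQA).
LAYER_SHAPES = {
    "q_proj": (8192, 8192),
    "k_proj": (8192, 1024),
    "v_proj": (8192, 1024),
    "o_proj": (8192, 8192),
    "gate_proj": (8192, 28672),
    "up_proj": (8192, 28672),
    "down_proj": (28672, 8192),
}

ATTENTION = {"q_proj", "k_proj", "v_proj", "o_proj"}

# Placement described above: attention projections on GPU 0, MLP on GPU 1.
DEVICE_MAP = {name: (0 if name in ATTENTION else 1) for name in LAYER_SHAPES}

BYTES_PER_BF16 = 2


def weight_bytes_per_device(shapes, device_map):
    """Sum BF16 weight bytes for the layers placed on each device index."""
    totals = {}
    for name, (fan_in, fan_out) in shapes.items():
        device = device_map[name]
        totals[device] = totals.get(device, 0) + fan_in * fan_out * BYTES_PER_BF16
    return totals


def input_rows(batch, seq):
    """Rows of the flattened (batch * seq, hidden) activation tensor."""
    return batch * seq


if __name__ == "__main__":
    for device, nbytes in sorted(weight_bytes_per_device(LAYER_SHAPES, DEVICE_MAP).items()):
        print(f"cuda:{device}: {nbytes / 2**20:.0f} MiB of BF16 weights")
    print("input rows:", input_rows(batch=4, seq=3000))
```

The weights of this 7-layer stack come to well under 2 GiB in total, which is consistent with memory pressure not being a factor in the minrepro.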
Failure trace
The forward fails at the cuda:0 → cuda:1 device boundary, specifically at the element-wise gate * up operation (both tensors on cuda:1) after accelerate.AlignDevicesHook has moved the attention output across devices:
  File "te_h200_fp8_multigpu_minrepro.py", line 204, in forward
    ff = gate * up  # element-wise, both cuda:1
         ~~~~~^~~~
torch.AcceleratorError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,
so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
The fp8_autocast context-manager exit then raises a second time at the FP8 amax reduction:
  File "transformer_engine/pytorch/fp8.py", line 477, in fp8_autocast_exit
    cls.reduce_and_update_fp8_tensors(forward=True)
  File "transformer_engine/pytorch/fp8.py", line 383, in reduce_and_update_fp8_tensors
    contiguous_amax = torch.cat(amax_buffer)
                      ^^^^^^^^^^^^^^^^^^^^^^
torch.AcceleratorError: CUDA error: an illegal memory access was encountered
(Same async error, surfacing at the next CUDA sync point.)
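Because the error is reported asynchronously, the line named in the traceback is only the first point at which the host noticed the fault. One way to narrow down the faulting step is to force a synchronization after every step and stop at the first failure. The sketch below illustrates that idea with a stand-in for the device. The helper run_with_sync and the fake sync function are hypothetical and are not part of the minrepro script.

```python
def run_with_sync(steps, sync):
    """Run (label, fn) steps in order, calling sync() after each one.

    Returns the label of the first step whose call, or the sync that
    follows it, raises. Returns None if every step completes.
    """
    for label, fn in steps:
        try:
            fn()
            sync()  # surface any pending asynchronous error right here
        except Exception as exc:
            print(f"first failing step: {label}: {exc}")
            return label
    return None


if __name__ == "__main__":
    # Stand-in for a device with deferred error reporting: step_b "launches"
    # a fault that is only raised when sync() is next called.
    pending = []

    def fake_sync():
        if pending:
            raise RuntimeError(pending.pop())

    steps = [
        ("step_a", lambda: None),
        ("step_b", lambda: pending.append("illegal memory access")),
        ("step_c", lambda: None),
    ]
    print(run_with_sync(steps, fake_sync))
```

In a real run, sync would call torch.cuda.synchronize on both devices, and each step would be one Linear call or element-wise op from the forward. Setting CUDA_LAUNCH_BLOCKING=1 before CUDA is initialized, as the error message suggests, achieves a similar effect without code changes.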
Working baselines
Both baselines use the same TE version, the same DelayedScaling(HYBRID) recipe, and the same Linear shapes. The only difference is single-GPU vs multi-GPU placement, or H200 vs B200:
- Single-GPU H200 NVL: a sibling script (scripts/te_h200_fp8_minrepro.py) builds the same 7 Linear shapes on a single GPU (no accelerate.dispatch_model) and runs the same fp8_autocast forwards. It passed 21/21 cells (every shape × batch ∈ {1, 4, 16}) on the same H200 NVL hardware, so raw TE FP8 Linear GEMM on Hopper is healthy.
- Multi-GPU B200: the same accelerate.dispatch_model + te.fp8_autocast pattern, wrapped around a real Llama-3.1-70B model with batch=4 seq~3000 MMLU prompts on 2× B200 (sm_100), succeeds end-to-end and produces measured accuracy, energy, and wall-time deltas (-0.11 pp MMLU, -24.2% energy, -18.5% wall-time vs BF16). See https://github.com/NVIDIA/voltek/blob/main/analysis/B200_FP8_LLAMA31_2026-06-11.md.
So the failure axis is multi-GPU TE FP8 on Hopper sm_90 specifically — not TE FP8 in general, not Hopper FP8 in general, not multi-GPU FP8 in general.
What's eliminated
| Hypothesis | Test | Result |
|---|---|---|
| TE-version bug in 2.3 | Bumped to NGC 25.08 with TE 2.5.0 | Same failure |
| Insufficient cuBLAS Lt workspace | CUBLASLT_WORKSPACE_SIZE=536870912 (512 MB) | Same failure |
| OOM / memory pressure | 70B BF16 fits in 2 × 141 GB with ~50 GB headroom per GPU; no OOM signals | Not the cause |
| Raw TE FP8 Linear GEMM broken on Hopper | Single-GPU minrepro 21/21 PASS | Not the cause |
| Cross-device input movement broken on Hopper | The AlignDevicesHook move itself completes; the failure is on the next kernel after the move | Not the cause (alone) |
Hypothesis
Some interaction between (a) accelerate.AlignDevicesHook's cross-device tensor movement and (b) TE's per-Linear lazy fp8_meta initialization (set_device + amax tensor allocation on first forward) produces a corrupted CUDA allocation or stream-binding on Hopper sm_90 specifically. On B200 sm_100 the same sequence succeeds; on H200 NVL it produces the illegal memory access on the next kernel that touches a cross-device tensor.
Workarounds requested
Any of these would unblock H200 multi-GPU FP8 deployments:
- A documented env var or recipe knob that disables the lazy fp8_meta init path on Hopper multi-GPU (or forces eager init at module-construction time).
- A documented "use device_map={...all on one device...} + manual tensor moves" pattern that bypasses accelerate.dispatch_model for FP8 multi-GPU on Hopper.
- A TE patch that handles the accelerate.AlignDevicesHook hook stack correctly on sm_90.
- A confirmation that this is fixed in a later TE version (we have not yet tested TE 2.12+; happy to test if you'd like).
Additional context
This came up via a convert_to_fp8 developer tool (https://github.com/NVIDIA/voltek/) that does an in-process structural swap nn.Linear → te.pytorch.Linear and serves the converted model directly from a Python process. The Voltek conversion path is not part of the failure surface: the minrepro builds the te.pytorch.Linear modules directly and reproduces the same illegal memory access. Voltek's 70B FP8 demo on H200 currently inherits this issue and is held back until an upstream fix lands. The B200 demo path works fully end-to-end and is what https://github.com/NVIDIA/voltek/blob/main/analysis/B200_FP8_LLAMA31_2026-06-11.md documents.
The minrepro script is committed at scripts/te_h200_fp8_multigpu_minrepro.py in the Voltek repo (link above). The full result.json from the run that produced this issue is at demos/k8s/voltek-native-accuracy/recorded-runs/2026-06-12-krusty-h200/peek-h200-mgpu/h200_multigpu_minrepro.json.