Multi-GPU FP8 (accelerate.dispatch_model + DelayedScaling HYBRID) — illegal memory access on H200 sm_90 #3124

## Summary

Multi-GPU FP8 forward through `accelerate.dispatch_model` + `te.fp8_autocast(DelayedScaling(HYBRID))` fails with `CUDA error: an illegal memory access was encountered` on H200 NVL (sm_90). The same code path with single-GPU placement succeeds; the same code path on B200 (sm_100) also succeeds.

The harness that originally surfaced this reports the failure as the more generic `cuBLAS Error: the function failed to launch on the GPU` at `transformer_engine/common/gemm/cublaslt_gemm.cu:549`. The minrepro below catches the underlying illegal memory access earlier. The most plausible reconciliation is that both are the same asynchronous CUDA failure surfacing at different sync points: the illegal memory access is the root cause, and it corrupts CUDA state before a later kernel reports the generic "launch failed" error.

## Environment

```
torch:           2.8.0a0+34c6371d24.nv25.08
CUDA:            13.0
cuBLAS Lt:       libcublasLt.so.13 (version 130000)
TE:              2.5.0+f05f12c
accelerate:      1.14.0
GPUs:            2 × NVIDIA H200 NVL (sm_90, 139.8 GB each)
NGC container:   nvcr.io/nvidia/pytorch:25.08-py3
```

Also reproduced on `nvcr.io/nvidia/pytorch:25.05-py3` (TE 2.3.0), so the failure does not depend on the TE version, at least across 2.3.0 and 2.5.0.

## Reproduction

Self-contained script — no model download, no eval harness, no third-party model code. ~250 lines, runnable on any 2-GPU node:

https://github.com/NVIDIA/voltek/blob/main/scripts/te_h200_fp8_multigpu_minrepro.py

```bash
python te_h200_fp8_multigpu_minrepro.py --json > result.json
```

What the script does:

1. Builds a 7-layer Linear stack at Llama-3.3-70B canonical shapes — `q/k/v/o_proj` (8192×{8192,1024}), `gate/up_proj` (8192×28672), `down_proj` (28672×8192). All `te.pytorch.Linear`, `params_dtype=torch.bfloat16`, `bias=False`.
2. `accelerate.dispatch_model(model, device_map={q/k/v/o_proj: 0, gate/up/down_proj: 1})` — splits parameters across cuda:0 / cuda:1 and attaches `AlignDevicesHook`.
3. Runs a forward pass on a bf16 input of shape `(height=12000, hidden=8192)` on cuda:0; the 12000 rows match `batch=4` × `seq=3000` of a real first MMLU forward.
4. Wraps that forward in `torch.autocast("cuda", dtype=torch.bfloat16)` + `te.fp8_autocast(enabled=True, fp8_recipe=DelayedScaling(margin=0, fp8_format=Format.HYBRID))`.
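
Steps 1 and 2 above can be sketched as follows. This is a hedged outline, not the linked script: the layer names follow the Hugging Face Llama convention, assigning the 1024-wide outputs to `k_proj`/`v_proj` is an assumption based on grouped-query attention, and `build_and_dispatch` and `bf16_weight_bytes_per_device` are hypothetical helpers. Only the shape table and device map run without GPUs.

```python
# (in_features, out_features) per layer.
# Assumption: the 1024-wide outputs belong to k_proj/v_proj (GQA).
LAYER_SHAPES = {
    "q_proj": (8192, 8192),
    "k_proj": (8192, 1024),
    "v_proj": (8192, 1024),
    "o_proj": (8192, 8192),
    "gate_proj": (8192, 28672),
    "up_proj": (8192, 28672),
    "down_proj": (28672, 8192),
}

# Attention projections on cuda:0, MLP projections on cuda:1 (step 2).
DEVICE_MAP = {
    name: 0 if name in ("q_proj", "k_proj", "v_proj", "o_proj") else 1
    for name in LAYER_SHAPES
}


def bf16_weight_bytes_per_device():
    """Sum bf16 weight bytes (2 bytes per element) for each device index."""
    totals = {}
    for name, (fan_in, fan_out) in LAYER_SHAPES.items():
        dev = DEVICE_MAP[name]
        totals[dev] = totals.get(dev, 0) + 2 * fan_in * fan_out
    return totals


def build_and_dispatch():
    """Hypothetical helper; needs 2 GPUs, torch, transformer_engine, accelerate."""
    import torch
    import transformer_engine.pytorch as te
    from accelerate import dispatch_model

    model = torch.nn.ModuleDict({
        name: te.Linear(fan_in, fan_out, bias=False, params_dtype=torch.bfloat16)
        for name, (fan_in, fan_out) in LAYER_SHAPES.items()
    })
    # dispatch_model places each child on its mapped device and attaches hooks.
    return dispatch_model(model, device_map=DEVICE_MAP)
```

The weights here total well under 2 GB per device, so the minrepro itself is nowhere near the memory limits of either GPU.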

## Failure trace

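The forward fails at the cuda:0 → cuda:1 device boundary. The error is first reported at the element-wise `gate * up` operation (both tensors on cuda:1), after `accelerate.AlignDevicesHook` has moved the attention output across devices. Because CUDA errors are reported asynchronously, the faulting kernel may be an earlier one on that path: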

```
File "te_h200_fp8_multigpu_minrepro.py", line 204, in forward
    ff = gate * up   # element-wise, both cuda:1
         ~~~~~^~~~
torch.AcceleratorError: CUDA error: an illegal memory access was encountered
CUDA kernel errors might be asynchronously reported at some other API call,
so the stacktrace below might be incorrect.
For debugging consider passing CUDA_LAUNCH_BLOCKING=1
```

The `fp8_autocast` context-manager exit then raises a second time at the FP8 amax reduction:

```
File "transformer_engine/pytorch/fp8.py", line 477, in fp8_autocast_exit
    cls.reduce_and_update_fp8_tensors(forward=True)
File "transformer_engine/pytorch/fp8.py", line 383, in reduce_and_update_fp8_tensors
    contiguous_amax = torch.cat(amax_buffer)
                      ^^^^^^^^^^^^^^^^^^^^^^
torch.AcceleratorError: CUDA error: an illegal memory access was encountered
```

(Same async error, surfacing at the next CUDA sync point.)
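
Since the traceback line is only where the host first noticed the fault, one way to narrow it down (besides `CUDA_LAUNCH_BLOCKING=1`) is to synchronize after every step and record the first step whose sync raises. The sketch below is illustrative only: `first_failing_step`, `cuda_sync_all`, and the step names in the usage line are hypothetical, and `sync` is injected so that on a GPU node it can wrap `torch.cuda.synchronize`.

```python
def first_failing_step(steps, sync):
    """Run (name, fn) steps in order, calling sync() after each one.

    Returns (name, exception) for the first step where either the call or
    the following sync raises, or None if every step completes.
    """
    for name, fn in steps:
        try:
            fn()
            sync()  # forces any pending asynchronous error to surface here
        except Exception as exc:
            # Stop immediately: later steps would run on a broken context.
            return name, exc
    return None


def cuda_sync_all():
    """Synchronize every visible CUDA device (requires torch and GPUs)."""
    import torch

    for idx in range(torch.cuda.device_count()):
        torch.cuda.synchronize(idx)
```

A call such as `first_failing_step([("attn", run_attn), ("gate_up", run_gate_up), ("mul", run_mul)], cuda_sync_all)` would then name the earliest step that faults, rather than the later one where the error happens to be reported.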

## Working baselines

Both of these use the same TE version, same `DelayedScaling(HYBRID)` recipe, and same Linear shapes — the only thing that differs is single-GPU vs multi-GPU, or H200 vs B200:

- **Single-GPU H200 NVL**: a sibling script (`scripts/te_h200_fp8_minrepro.py`) builds the same 7 Linear shapes on a single GPU (no `accelerate.dispatch_model`) and runs the same `fp8_autocast` forwards. It passed **21/21 cells** (7 shapes × batch ∈ {1, 4, 16}) on the same H200 NVL hardware, so raw TE FP8 Linear GEMM on Hopper is healthy.
- **Multi-GPU B200**: the same `accelerate.dispatch_model` + `te.fp8_autocast` pattern, wrapped around a real Llama-3.1-70B model with batch=4 seq~3000 MMLU prompts on 2× B200 (sm_100), succeeds end-to-end and produces measured accuracy + energy + wall-time deltas (-0.11 pp MMLU, -24.2% energy, -18.5% wall-time vs BF16). See https://github.com/NVIDIA/voltek/blob/main/analysis/B200_FP8_LLAMA31_2026-06-11.md.

So the failure axis is **multi-GPU TE FP8 on Hopper sm_90 specifically** — not TE FP8 in general, not Hopper FP8 in general, not multi-GPU FP8 in general.

## What's eliminated

| Hypothesis | Test | Result |
|---|---|---|
| TE-version bug in 2.3 | Bumped to NGC 25.08 with TE 2.5.0 | Same failure |
| Insufficient cuBLAS Lt workspace | `CUBLASLT_WORKSPACE_SIZE=536870912` (512 MB) | Same failure |
| OOM / memory pressure | 70B BF16 fits in 2 × 141 GB with ~50 GB headroom per GPU; no OOM signals | Not the cause |
| Raw TE FP8 Linear GEMM broken on Hopper | Single-GPU minrepro 21/21 PASS | Not the cause |
| Cross-device input movement broken on Hopper | The `AlignDevicesHook` move itself completes; the failure is on the next kernel after the move | Not the cause (alone) |

## Hypothesis

Some interaction between (a) `accelerate.AlignDevicesHook`'s cross-device tensor movement and (b) TE's per-Linear lazy `fp8_meta` initialization (`set_device` + amax tensor allocation on first forward) produces a corrupted CUDA allocation or stream-binding on Hopper sm_90 specifically. On B200 sm_100 the same sequence succeeds; on H200 NVL it produces the illegal memory access on the next kernel that touches a cross-device tensor.

## Workarounds requested

Any of these would unblock H200 multi-GPU FP8 deployments:

1. A documented env var or recipe knob that disables the lazy `fp8_meta` init path on Hopper multi-GPU (or forces eager init at module-construction time).
2. A documented "use `device_map={...all on one device...}` + manual tensor moves" pattern that bypasses `accelerate.dispatch_model` for FP8 multi-GPU on Hopper.
3. A TE patch that handles the `accelerate.AlignDevicesHook` hook stack correctly on sm_90.
4. A confirmation that this is fixed in a later TE version (we have not yet tested TE 2.12+; happy to test if you'd like).
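
For item 2, the pattern we have in mind looks roughly like the sketch below: each stage runs on its own device and activations are moved explicitly between stages, with no `AlignDevicesHook` involved. This is an untested assumption, not a confirmed workaround for the H200 failure; `run_stages`, `torch_move`, and the stage callables in the usage line are hypothetical names.

```python
def run_stages(x, stages, move):
    """Run x through (device, fn) stages, moving it explicitly before each.

    move(x, device) must return x placed on device; fn then runs there.
    """
    for device, fn in stages:
        x = move(x, device)
        x = fn(x)
    return x


def torch_move(x, device):
    """Blocking copy to device, then sync that device (requires torch)."""
    import torch

    y = x.to(device, non_blocking=False)
    # Make sure the copy has finished before any kernel reads y.
    torch.cuda.synchronize(device)
    return y
```

Usage would be along the lines of `run_stages(x, [("cuda:0", attn_block), ("cuda:1", mlp_block)], torch_move)` inside the same `te.fp8_autocast` context. The explicit sync after each move trades throughput for a well-defined ordering, which is the point while debugging.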

## Additional context

This came up via a `convert_to_fp8` developer tool (`https://github.com/NVIDIA/voltek/`) that does an in-process structural swap `nn.Linear` → `te.pytorch.Linear` and serves the converted model directly from a Python process. The Voltek conversion path is not part of the failure surface: the minrepro builds the `te.pytorch.Linear` modules directly and reproduces the same illegal memory access. Voltek's 70B FP8 demo on H200 currently inherits this issue and is held back until an upstream fix lands. The B200 demo path works end-to-end and is what `https://github.com/NVIDIA/voltek/blob/main/analysis/B200_FP8_LLAMA31_2026-06-11.md` documents.

The minrepro script is committed at `scripts/te_h200_fp8_multigpu_minrepro.py` in the Voltek repo (link above). The full result.json from the run that produced this issue is at `demos/k8s/voltek-native-accuracy/recorded-runs/2026-06-12-krusty-h200/peek-h200-mgpu/h200_multigpu_minrepro.json`.

