NPROC=2 OOMs on 2x12GB consumer GPUs where NPROC=1 succeeds (Qwen3.6-27B dense, gsq_bits=2); expandable_segments:True also masks OOM as "device not ready" on WSL2

## Environment
- OS: WSL2 (Ubuntu), mirrored networking
- GPUs: 2x RTX 5070, 12GB each (physically separate, no NVLink)
- Model: Qwen3.6-27B (dense), config: `configs/local/qwen36_27b_dense.yaml`-style (gsq_bits=2, groupsize=128, self_attn=false, batch_size=4, max_length up to 1024, device_microbatch_size=1)
- Launch: `NPROC=2 bash scripts/run.sh` (torchrun --standalone --nproc-per-node=2)

## Symptom 1: misleading "device not ready" instead of OOM

With the default `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True` set in `scripts/_common.sh`, training fails with:

```
RuntimeError: CUDA driver error: device not ready
```

This occurs at various call sites (Gumbel quantizer init, `torch.randn`, later inside `GumbelSoftmaxFunction.backward`), all during layer 0's `mlp.down_proj` processing. It looked like GPU/driver flakiness at first.

Setting `PYTORCH_CUDA_ALLOC_CONF=expandable_segments:False` turns this into a normal, correctly-reported OOM at the exact same point:

```
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 680.00 MiB.
GPU 0 has a total capacity of 11.94 GiB of which 0 bytes is free.
Of the allocated memory 9.81 GiB is allocated by PyTorch, and 1009.30 MiB
is reserved by PyTorch but unallocated.
```

This matches a known PyTorch issue (pytorch/pytorch#173049) rather than anything GSQ-specific. Flagging it here because `expandable_segments:True` is the repo's own default and it hides the real error on this setup.
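
For anyone reproducing this, here is a minimal sketch of overriding the allocator setting for a single launch without editing `scripts/_common.sh`. The helper name `with_alloc_conf` is hypothetical (not part of the repo), and it assumes the launch script respects a `PYTORCH_CUDA_ALLOC_CONF` value already present in the environment, which may not hold if `_common.sh` sets it unconditionally.

```python
import os


def with_alloc_conf(env, **options):
    """Return a copy of `env` with PYTORCH_CUDA_ALLOC_CONF updated.

    Existing comma-separated `key:value` options are kept; any option
    passed as a keyword argument overrides the existing value.
    """
    merged = {}
    existing = env.get("PYTORCH_CUDA_ALLOC_CONF", "")
    for item in existing.split(","):
        if ":" in item:
            key, value = item.split(":", 1)
            merged[key.strip()] = value.strip()
    for key, value in options.items():
        merged[key] = str(value)
    new_env = dict(env)
    new_env["PYTORCH_CUDA_ALLOC_CONF"] = ",".join(
        f"{key}:{value}" for key, value in merged.items()
    )
    return new_env


# Build the environment for a debugging run that reports OOM directly.
debug_env = with_alloc_conf(os.environ, expandable_segments=False)
print(debug_env["PYTORCH_CUDA_ALLOC_CONF"])
```

The resulting dict can be passed as `env=` to `subprocess.run(["bash", "scripts/run.sh"], env=debug_env)`, or the same string can simply be exported in the shell before launching.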

(Side note, unrelated to GSQ: the same OOM message sometimes shows `Process X has 17179869184.00 GiB memory in use` — that's a known PyTorch display bug, see pytorch/pytorch#116928, not a real number.)

## Symptom 2: NPROC=2 OOMs where NPROC=1 succeeds, by a small margin

- `NPROC=1`: completes layer 0 fully (all 10 GSQ epochs), peak GPU usage ~11.75GB/11.94GB.
- `NPROC=2`: consistently OOMs on the very first `train_step` backward call for `mlp.down_proj` (the largest of the three MLP sub-layers here, intermediate_size=17408). The failing allocation is always "680.00 MiB", with ~9.8–10.3GB already allocated depending on config. That 680 MiB request was essentially constant across every change listed below; total allocated/reserved shifted a bit per attempt, but the allocation never fit.

Things that had **no effect** on this specific OOM:
- `data.batch_size` (4 → 2)
- `data.max_length` (1024 → 512 → 384)
- `NCCL_BUFFSIZE` / `NCCL_MAX_NCHANNELS` reduction

Things that helped a little but not enough:
- Making `GumbelSoftmaxFunction.backward` (in `src/quantization/gumbel_quantizer_2bit.py`) use in-place ops (`sub_`/`mul_`) instead of allocating fresh `(4, out, in)` tensors at each step, and generating `torch.randn(...)` directly in `logits_dtype` instead of `Q.dtype` (fp32) in `GumbelQuantizer2Bit.__init__` (avoids an unnecessary fp32 buffer that's immediately cast to bf16 anyway).
- `PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:256,garbage_collection_threshold:0.8` (reduced "reserved but unallocated" from ~1GB down to ~300MB, but total allocated crept up correspondingly and the 680MB gap remained).
- Adding `torch.cuda.empty_cache()` between train_step calls (confirmed via `torch.cuda.memory_allocated()`/`memory_reserved()` logging that `reserved` was ratcheting up step-over-step without this; adding it reclaimed ~1.6GB between steps, but the next step's backward still grew back to the same wall).
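
The allocated/reserved logging mentioned in the last item can be sketched roughly as below. The function names are hypothetical and this is not the exact code used for the numbers above. `torch.cuda.memory_allocated()` and `torch.cuda.memory_reserved()` are the standard PyTorch calls, and `torch` is imported lazily so the formatter can be exercised on a machine without a GPU.

```python
MIB = 1024 ** 2


def format_cuda_memory(tag, allocated_bytes, reserved_bytes):
    """Format allocator counters as one log line, in MiB."""
    unallocated = reserved_bytes - allocated_bytes
    return (
        f"[{tag}] allocated={allocated_bytes / MIB:.1f} MiB "
        f"reserved={reserved_bytes / MIB:.1f} MiB "
        f"reserved-but-unallocated={unallocated / MIB:.1f} MiB"
    )


def log_cuda_memory(tag, device=None):
    """Print current allocator counters for `device` (default: current device)."""
    import torch  # lazy import: only needed when actually running on a GPU

    line = format_cuda_memory(
        tag,
        torch.cuda.memory_allocated(device),
        torch.cuda.memory_reserved(device),
    )
    print(line)
    return line
```

Calling `log_cuda_memory("before empty_cache")`, then `torch.cuda.empty_cache()`, then `log_cuda_memory("after empty_cache")` between `train_step` calls is enough to see whether `reserved` ratchets up from step to step.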

Working hypothesis: NCCL's per-rank communicator-group memory overhead (independently reported as ~544–688MB for a 2-device group by NVIDIA, see NVIDIA/nccl#964) accounts for most of the gap — `NPROC=1` doesn't pay this cost and fits with ~200MB to spare; `NPROC=2` needs that plus the same per-rank working set, which no longer fits on a 12GB card. This is a data-parallel setup (`dist.all_reduce` on gradients/GPTQ stats), not tensor parallelism, so each rank still holds the full layer's weights/quantizer state — splitting calibration data across ranks doesn't reduce the per-rank memory floor.
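
As a rough sanity check of that hypothesis, the arithmetic can be written out as below. This is a back-of-the-envelope sketch only: it assumes the ~11.75GB peak and the 11.94 total are both in GiB, treats the quoted NCCL overhead range as MiB, and ignores allocator fragmentation, so the result is an order-of-magnitude estimate rather than a measurement.

```python
MIB_PER_GIB = 1024


def nproc2_shortfall_mib(total_gib, single_rank_peak_gib, nccl_overhead_mib):
    """Estimate how far NPROC=2 overshoots the card, in MiB.

    Takes the headroom left over by the NPROC=1 run and subtracts it from
    the extra per-rank NCCL overhead that only the NPROC=2 run pays.
    A positive result means the working set no longer fits.
    """
    spare_mib = (total_gib - single_rank_peak_gib) * MIB_PER_GIB
    return nccl_overhead_mib - spare_mib


# Numbers from this report: 11.94 total, ~11.75 peak at NPROC=1,
# and the ~544-688 NCCL overhead range quoted in NVIDIA/nccl#964.
for overhead in (544, 688):
    shortfall = nproc2_shortfall_mib(11.94, 11.75, overhead)
    print(f"NCCL overhead {overhead} MiB -> short by about {shortfall:.0f} MiB")
```

Under those assumptions the card comes up a few hundred MiB short at `NPROC=2`, which is the same order of magnitude as the 680 MiB allocation that fails, though it does not by itself prove NCCL is the only contributor.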

## Ask / possible actions

- Maybe worth a README/config note that `NPROC>1` on dense models needs meaningfully more per-GPU headroom than `NPROC=1` (not just "N x same requirement"), so 12GB-class cards may need to drop to `NPROC=1` even when the single-GPU config just barely fits.
- Consider defaulting `expandable_segments:False` (or documenting the tradeoff) since it produces actionable OOM errors instead of a misleading driver error, at least until PyTorch's own reporting improves.
- True tensor parallelism for dense models (sharding the weight/quant_logits tensors themselves across ranks) would presumably fix this properly, but that's a bigger change than we attempted here.

Happy to share the modified `gumbel_quantizer_2bit.py` in-place patch if useful, or more logs.

