NPROC=2 OOMs on 2x12GB consumer GPUs where NPROC=1 succeeds (Qwen3.6-27B dense, gsq_bits=2); expandable_segments:True also masks OOM as "device not ready" on WSL2 #6

Description

@nakaken3013-code

Environment

  • OS: WSL2 (Ubuntu), mirrored networking
  • GPUs: 2x RTX 5070, 12GB each (physically separate, no NVLink)
  • Model: Qwen3.6-27B (dense), config: configs/local/qwen36_27b_dense.yaml-style (gsq_bits=2, groupsize=128, self_attn=false, batch_size=4, max_length up to 1024, device_microbatch_size=1)
  • Launch: NPROC=2 bash scripts/run.sh (torchrun --standalone --nproc-per-node=2)

Symptom 1: misleading "device not ready" instead of OOM

With the default PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True set in scripts/_common.sh, training fails with:

RuntimeError: CUDA driver error: device not ready

This occurs at various call sites (Gumbel quantizer init, torch.randn, later inside GumbelSoftmaxFunction.backward), all during layer 0's mlp.down_proj processing. It looked like GPU/driver flakiness at first.

Setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:False turns this into a normal, correctly-reported OOM at the exact same point:

torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 680.00 MiB.
GPU 0 has a total capacity of 11.94 GiB of which 0 bytes is free.
Of the allocated memory 9.81 GiB is allocated by PyTorch, and 1009.30 MiB
is reserved by PyTorch but unallocated.

This matches a known, general PyTorch pain point (pytorch/pytorch#173049) rather than something GSQ-specific, but I'm flagging it here because expandable_segments:True is the repo's own default and it hides the real error on this setup.
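
For anyone reproducing this, the allocator setting can also be forced from Python instead of editing scripts/_common.sh. The following is a minimal sketch, not code from the repo: build_alloc_conf and apply_alloc_conf are hypothetical helper names, and it assumes the variable is set before the first CUDA allocation (setting it before importing torch is the safe choice).

```python
import os


def build_alloc_conf(**options):
    """Join allocator options into the 'key:value,key:value' string format."""
    # str() already renders booleans as True/False, which is the spelling
    # used in this report (expandable_segments:True / :False).
    return ",".join(f"{key}:{value}" for key, value in options.items())


def apply_alloc_conf(conf, env=os.environ):
    """Write the setting into the environment and return what was stored."""
    env["PYTORCH_CUDA_ALLOC_CONF"] = conf
    return env["PYTORCH_CUDA_ALLOC_CONF"]


# Debug setting from this report: surfaces the real OOM on this setup.
debug_conf = build_alloc_conf(expandable_segments=False)
apply_alloc_conf(debug_conf)

# Only import torch after the variable is in place.
# import torch
```

The same helper can produce the other setting tried in this report, e.g. build_alloc_conf(max_split_size_mb=256, garbage_collection_threshold=0.8).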

(Side note, unrelated to GSQ: the same OOM message sometimes shows Process X has 17179869184.00 GiB memory in use — that's a known PyTorch display bug, see pytorch/pytorch#116928, not a real number.)

Symptom 2: NPROC=2 OOMs where NPROC=1 succeeds, by a small margin

  • NPROC=1: completes layer 0 fully (all 10 GSQ epochs), peak GPU usage ~11.75GB/11.94GB.
  • NPROC=2: consistently OOMs on the very first train_step backward call for mlp.down_proj (the largest of the three MLP sub-layers here, intermediate_size=17408). The failing request is always "680.00 MiB", with ~9.8–10.3GB already allocated depending on config. That 680 MiB request size was essentially constant across config changes; total allocated/reserved shifted a bit per attempt, but the allocation never fit.
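
The constant 680.00 MiB figure is at least consistent with a single weight-shaped (4, out, in) tensor in bf16 for down_proj, whose size would not depend on batch size or sequence length. The arithmetic below is a sketch under an assumption: intermediate_size=17408 comes from this report, but hidden_size=5120 is a guess that is not stated anywhere in this issue and should be checked against the model config.

```python
MIB = 1024 ** 2


def tensor_mib(shape, bytes_per_element):
    """Size in MiB of a dense tensor with the given shape and element width."""
    num_elements = 1
    for dim in shape:
        num_elements *= dim
    return num_elements * bytes_per_element / MIB


INTERMEDIATE_SIZE = 17408  # from this report
HIDDEN_SIZE = 5120         # ASSUMPTION: not stated in this report
LEADING_DIM = 4            # leading dimension of the (4, out, in) tensors

shape = (LEADING_DIM, HIDDEN_SIZE, INTERMEDIATE_SIZE)
print(f"bf16: {tensor_mib(shape, 2):.2f} MiB")  # 2 bytes per element
print(f"fp32: {tensor_mib(shape, 4):.2f} MiB")  # 4 bytes per element
```

Under that assumed hidden size, the bf16 tensor comes out at exactly 680.00 MiB and the fp32 one at 1360.00 MiB, so the failing request would be one such temporary rather than an activation buffer.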

Things that had no effect on this specific OOM:

  • data.batch_size (4 → 2)
  • data.max_length (1024 → 512 → 384)
  • NCCL_BUFFSIZE / NCCL_MAX_NCHANNELS reduction

Things that helped a little but not enough:

  • Making GumbelSoftmaxFunction.backward (in src/quantization/gumbel_quantizer_2bit.py) use in-place ops (sub_/mul_) instead of allocating fresh (4, out, in) tensors at each step, and generating torch.randn(...) directly in logits_dtype instead of Q.dtype (fp32) in GumbelQuantizer2Bit.__init__ (avoids an unnecessary fp32 buffer that's immediately cast to bf16 anyway).
  • PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:256,garbage_collection_threshold:0.8 (reduced "reserved but unallocated" from ~1GB down to ~300MB, but total allocated crept up correspondingly and the 680MB gap remained).
  • Adding torch.cuda.empty_cache() between train_step calls (confirmed via torch.cuda.memory_allocated()/memory_reserved() logging that reserved was ratcheting up step-over-step without this; adding it reclaimed ~1.6GB between steps, but the next step's backward still grew back to the same wall).
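
The per-step logging and cache release described in the last bullet can be sketched roughly as follows. This is not the code that was actually run: run_steps, cuda_mem_mib, and step_fn are hypothetical names standing in for the real training loop, and the helper returns zeros when CUDA is unavailable so the snippet also runs on a CPU-only machine.

```python
import torch

MIB = 1024 ** 2


def cuda_mem_mib(device=0):
    """Return (allocated, reserved) in MiB, or (0.0, 0.0) without CUDA."""
    if not torch.cuda.is_available():
        return 0.0, 0.0
    return (
        torch.cuda.memory_allocated(device) / MIB,
        torch.cuda.memory_reserved(device) / MIB,
    )


def run_steps(step_fn, num_steps, device=0, log=print):
    """Call step_fn(i) repeatedly, logging memory and freeing cached blocks.

    step_fn is a hypothetical stand-in for one train_step call.
    """
    history = []
    for i in range(num_steps):
        step_fn(i)
        allocated, reserved = cuda_mem_mib(device)
        if torch.cuda.is_available():
            # Return unused cached blocks to the driver between steps.
            torch.cuda.empty_cache()
        _, reserved_after = cuda_mem_mib(device)
        log(
            f"step {i}: allocated={allocated:.0f} MiB "
            f"reserved={reserved:.0f} MiB "
            f"reserved_after_empty_cache={reserved_after:.0f} MiB"
        )
        history.append((allocated, reserved, reserved_after))
    return history
```

Comparing reserved before and after empty_cache() across steps is what shows whether the cache is ratcheting up; it does not lower the peak reached inside a single backward call.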

Working hypothesis: NCCL's per-rank communicator-group memory overhead (independently reported as ~544–688MB for a 2-device group by NVIDIA, see NVIDIA/nccl#964) accounts for most of the gap. NPROC=1 doesn't pay this cost and fits with ~200MB to spare; NPROC=2 needs that overhead plus the same per-rank working set, which no longer fits on a 12GB card. This is a data-parallel setup (dist.all_reduce on gradients/GPTQ stats), not tensor parallelism, so each rank still holds the full layer's weights and quantizer state, and splitting calibration data across ranks doesn't reduce the per-rank memory floor.
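
The budget behind this hypothesis can be written out as a back-of-the-envelope calculation. It is only a sketch: it assumes the GB/MB figures quoted in this issue are binary units (GiB/MiB), that the NCCL overhead is purely additive on top of the NPROC=1 peak, and that nothing else changes between the two runs.

```python
GIB_TO_MIB = 1024


def headroom_mib(capacity_gib, peak_gib):
    """Free memory left at peak, in MiB."""
    return (capacity_gib - peak_gib) * GIB_TO_MIB


def shortfall_range_mib(headroom, overhead_low, overhead_high):
    """How far over budget a rank lands once the extra overhead is added."""
    return overhead_low - headroom, overhead_high - headroom


# Figures quoted in this issue (units assumed to be GiB / MiB).
single_gpu_headroom = headroom_mib(capacity_gib=11.94, peak_gib=11.75)
low, high = shortfall_range_mib(single_gpu_headroom, 544, 688)

print(f"NPROC=1 headroom: {single_gpu_headroom:.0f} MiB")
print(f"NPROC=2 estimated shortfall: {low:.0f} to {high:.0f} MiB")
```

Under those assumptions the single-GPU run has roughly 195 MiB free and the two-GPU run lands roughly 350 to 490 MiB over budget per rank. The size of the failing request in the OOM message is the size of the tensor being allocated, not this shortfall, so the two numbers are not expected to match.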

Ask / possible actions

  • Maybe worth a README/config note that NPROC>1 on dense models needs meaningfully more per-GPU headroom than NPROC=1 (not just "N x same requirement"), so 12GB-class cards may need to drop to NPROC=1 even when the single-GPU config just barely fits.
  • Consider defaulting expandable_segments:False (or documenting the tradeoff) since it produces actionable OOM errors instead of a misleading driver error, at least until PyTorch's own reporting improves.
  • True tensor parallelism for dense models (sharding the weight/quant_logits tensors themselves across ranks) would presumably fix this properly, but that's a bigger change than we attempted here.

Happy to share the modified gumbel_quantizer_2bit.py in-place patch if useful, or more logs.
