Title: NPROC=2 OOMs on 2x12GB consumer GPUs where NPROC=1 succeeds (Qwen3.6-27B dense, gsq_bits=2); expandable_segments:True also masks OOM as "device not ready" on WSL2
Environment
- OS: WSL2 (Ubuntu), mirrored networking
- GPUs: 2x RTX 5070, 12GB each (physically separate, no NVLink)
- Model: Qwen3.6-27B (dense), config: configs/local/qwen36_27b_dense.yaml-style (gsq_bits=2, groupsize=128, self_attn=false, batch_size=4, max_length up to 1024, device_microbatch_size=1)
- Launch: NPROC=2 bash scripts/run.sh (torchrun --standalone --nproc-per-node=2)
Symptom 1: misleading "device not ready" instead of OOM
With the default PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True set in scripts/_common.sh, training fails with:
RuntimeError: CUDA driver error: device not ready
This occurs at various call sites (Gumbel quantizer init, torch.randn, later inside GumbelSoftmaxFunction.backward), all during layer 0's mlp.down_proj processing. It looked like GPU/driver flakiness at first.
Setting PYTORCH_CUDA_ALLOC_CONF=expandable_segments:False turns this into a normal, correctly-reported OOM at the exact same point:
torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 680.00 MiB.
GPU 0 has a total capacity of 11.94 GiB of which 0 bytes is free.
Of the allocated memory 9.81 GiB is allocated by PyTorch, and 1009.30 MiB
is reserved by PyTorch but unallocated.
This matches a general known PyTorch pain point (pytorch/pytorch#173049) rather than something GSQ-specific, but flagging here since expandable_segments:True is the repo's own default and it hides the real error on this setup.
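For anyone who wants to reproduce the clearer error without editing scripts/_common.sh, here is a minimal sketch of overriding just that one allocator option from Python. The helper name with_alloc_option is hypothetical (not part of the repo), and the parser assumes simple comma-separated key:value options, so it would not handle bracketed list values such as roundup_power2_divisions.

```python
import os


def with_alloc_option(conf: str, key: str, value: str) -> str:
    """Return `conf` with `key` set to `value`, keeping the other options.

    Hypothetical helper: assumes plain comma-separated key:value pairs.
    """
    options = [item for item in conf.split(",") if item.strip()]
    # Drop any existing setting for this key, then append the new one.
    kept = [item for item in options if item.split(":", 1)[0].strip() != key]
    return ",".join(kept + [f"{key}:{value}"])


# This has to happen before the first CUDA allocation; doing it before
# `import torch` is the safe choice.
os.environ["PYTORCH_CUDA_ALLOC_CONF"] = with_alloc_option(
    os.environ.get("PYTORCH_CUDA_ALLOC_CONF", ""),
    "expandable_segments",
    "False",
)
```

Exporting the variable in the shell before calling scripts/run.sh should be equivalent, provided the script does not overwrite it afterwards.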
(Side note, unrelated to GSQ: the same OOM message sometimes shows Process X has 17179869184.00 GiB memory in use — that's a known PyTorch display bug, see pytorch/pytorch#116928, not a real number.)
Symptom 2: NPROC=2 OOMs where NPROC=1 succeeds, by a small margin
NPROC=1: completes layer 0 fully (all 10 GSQ epochs), peak GPU usage ~11.75GB/11.94GB.
NPROC=2: consistently OOMs on the very first train_step backward call for mlp.down_proj (the largest of the three MLP sub-layers here, intermediate_size=17408), needing "680.00 MiB" more, with ~9.8–10.3GB already allocated depending on config. The 680MB requested-but-missing amount was essentially constant across changes; total allocated/reserved shifted a bit per attempt but never fit.
Things that had no effect on this specific OOM:
- data.batch_size (4 → 2)
- data.max_length (1024 → 512 → 384)
- NCCL_BUFFSIZE / NCCL_MAX_NCHANNELS reduction
Things that helped a little but not enough:
- Making GumbelSoftmaxFunction.backward (in src/quantization/gumbel_quantizer_2bit.py) use in-place ops (sub_/mul_) instead of allocating fresh (4, out, in) tensors at each step, and generating torch.randn(...) directly in logits_dtype instead of Q.dtype (fp32) in GumbelQuantizer2Bit.__init__ (avoids an unnecessary fp32 buffer that's immediately cast to bf16 anyway).
- PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:256,garbage_collection_threshold:0.8 (reduced "reserved but unallocated" from ~1GB down to ~300MB, but total allocated crept up correspondingly and the 680MB gap remained).
- Adding torch.cuda.empty_cache() between train_step calls (confirmed via torch.cuda.memory_allocated()/memory_reserved() logging that reserved was ratcheting up step-over-step without this; adding it reclaimed ~1.6GB between steps, but the next step's backward still grew back to the same wall).
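The allocated/reserved logging mentioned in the last item was along these lines. This is a rough sketch rather than the exact code used: format_cuda_memory and log_cuda_memory are hypothetical names, and the commented loop stands in for wherever train_step is actually called in the repo.

```python
def format_cuda_memory(tag: str, allocated: int, reserved: int) -> str:
    """Format allocator counters (in bytes) as MiB, plus the cached-but-unused gap."""
    mib = 1024 ** 2
    return (
        f"[{tag}] allocated={allocated / mib:.1f} MiB "
        f"reserved={reserved / mib:.1f} MiB "
        f"cached_unused={(reserved - allocated) / mib:.1f} MiB"
    )


def log_cuda_memory(tag: str, device: int = 0) -> str:
    """Print and return the current caching-allocator counters for one device."""
    import torch  # imported lazily so the formatter works without torch installed

    line = format_cuda_memory(
        tag,
        torch.cuda.memory_allocated(device),
        torch.cuda.memory_reserved(device),
    )
    print(line)
    return line


# Assumed usage around each step (train_step is the repo's function, the
# loop itself is illustrative):
#
#     train_step(...)
#     log_cuda_memory("after step")
#     torch.cuda.empty_cache()
#     log_cuda_memory("after empty_cache")
```

Note that these counters only cover PyTorch's caching allocator. Memory held by NCCL or the CUDA context does not show up in them, which is part of why the totals in the OOM message are the more useful number here.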
Working hypothesis: NCCL's per-rank communicator-group memory overhead (independently reported by NVIDIA as ~544–688MB for a 2-device group, see NVIDIA/nccl#964) accounts for most of the gap. NPROC=1 doesn't pay this cost and fits with ~200MB to spare; NPROC=2 needs that overhead plus the same per-rank working set, which no longer fits on a 12GB card. This is a data-parallel setup (dist.all_reduce on gradients/GPTQ stats), not tensor parallelism, so each rank still holds the full layer's weights/quantizer state, and splitting calibration data across ranks doesn't reduce the per-rank memory floor.
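A back-of-the-envelope version of that budget, using only the numbers quoted above. It rests on two assumptions that the data here does not prove: the "GB" figures are treated as GiB (matching the 11.94 GiB in the OOM message), and the NPROC=2 per-rank working set is taken to be the same as the NPROC=1 peak. The function name shortfall_mib is just for illustration.

```python
MIB_PER_GIB = 1024


def shortfall_mib(total_gib: float, peak_gib: float, overhead_mib: float) -> float:
    """MiB by which an extra fixed overhead exceeds the single-GPU headroom.

    A positive result means the overhead does not fit.
    """
    headroom_mib = (total_gib - peak_gib) * MIB_PER_GIB
    return overhead_mib - headroom_mib


total_gib = 11.94        # card capacity from the OOM message
peak_single_gib = 11.75  # NPROC=1 peak usage, assumed to be GiB
for nccl_overhead in (544, 688):  # range quoted from NVIDIA/nccl#964
    gap = shortfall_mib(total_gib, peak_single_gib, nccl_overhead)
    print(f"NCCL overhead {nccl_overhead} MiB -> short by about {gap:.0f} MiB")
```

Under those assumptions the single-GPU headroom is roughly 195 MiB, so even the low end of the reported NCCL overhead leaves a shortfall of a few hundred MiB. That is the same order of magnitude as the failed 680 MiB allocation, though it does not pin it down exactly.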
Ask / possible actions
- Maybe worth a README/config note that NPROC>1 on dense models needs meaningfully more per-GPU headroom than NPROC=1 (not just "N x same requirement"), so 12GB-class cards may need to drop to NPROC=1 even when the single-GPU config just barely fits.
- Consider defaulting expandable_segments:False (or documenting the tradeoff) since it produces actionable OOM errors instead of a misleading driver error, at least until PyTorch's own reporting improves.
- True tensor parallelism for dense models (sharding the weight/quant_logits tensors themselves across ranks) would presumably fix this properly, but that's a bigger change than we attempted here.
Happy to share the modified gumbel_quantizer_2bit.py in-place patch if useful, or more logs.