fix: mask all stop_token_ids in sampling_ids + multi-GPU device support #1916

Open
lonrencn wants to merge 1 commit into FunAudioLLM:main from lonrencn:fix/cv3-sampling-and-multigpu


lonrencn commented Jul 3, 2026

Summary

Two bug fixes for CosyVoice3 inference:

  1. sampling_ids() masks only speech_token_size (sos) when ignore_eos=True, leaving eos and the other stop tokens unmasked. On certain transformers versions the LLM can predict EOS with high confidence as the first token; the unmasked EOS then leaks through the ignore_eos window and produces garbage output.

  2. Hardcoded torch.device('cuda') defaults to cuda:0. On multi-GPU systems, torch.cuda.set_device() has no effect and models load on the wrong GPU.

Changes

cosyvoice/llm/llm.py — sampling_ids fix

CosyVoice3LM.stop_token_ids = [speech_token_size + i for i in range(200)] covers sos(6561), eos(6562), task_id(6563), fill_token(6564), etc. The original code only masked index speech_token_size (sos=6561), leaving eos(6562) unmasked during the minimum-length window.

This is usually harmless because a well-functioning LLM does not predict EOS early. But slight numerical differences across transformers versions (e.g., SDPA mask handling changed between 4.x and 5.x) can shift the logits enough (~0.012) to move EOS from the second-ranked to the top-ranked token, causing the model to output garbage.

The fix uses hasattr to maintain backward compatibility with CosyVoice1/2.
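
The masking change can be illustrated with a small framework-free sketch. This is not the code from the diff: it uses plain Python lists and a greedy argmax in place of the repository's tensor-based sampler, a toy vocabulary of 4 speech tokens plus 3 stop tokens, and hypothetical helper names (`stop_ids_for`, `mask_stop_tokens`, `pick_token`). It assumes the hasattr check falls back to masking only speech_token_size when a model has no stop_token_ids attribute.

```python
import math
from types import SimpleNamespace


def stop_ids_for(lm):
    """Use lm.stop_token_ids when present; otherwise fall back to the old behaviour."""
    if hasattr(lm, "stop_token_ids"):
        return list(lm.stop_token_ids)
    return [lm.speech_token_size]


def mask_stop_tokens(scores, stop_token_ids):
    """Return a copy of the scores with every stop token set to -inf."""
    masked = list(scores)
    for token_id in stop_token_ids:
        if 0 <= token_id < len(masked):
            masked[token_id] = -math.inf
    return masked


def pick_token(scores, stop_token_ids, ignore_eos):
    """Greedy stand-in for sampling: pick the highest-scoring allowed token."""
    if ignore_eos:
        scores = mask_stop_tokens(scores, stop_token_ids)
    return max(range(len(scores)), key=scores.__getitem__)


# Toy setup: ids 0-3 are speech tokens, ids 4-6 play the role of sos, eos, task_id.
speech_token_size = 4
legacy_lm = SimpleNamespace(speech_token_size=speech_token_size)
cv3_like_lm = SimpleNamespace(
    speech_token_size=speech_token_size,
    stop_token_ids=[speech_token_size + i for i in range(3)],
)

# The "eos" entry (id 5) has the highest score, as in the failure described above.
scores = [-2.0, -1.5, -3.0, -2.5, -4.0, -1.0, -5.0]

old_choice = pick_token(scores, stop_ids_for(legacy_lm), ignore_eos=True)
new_choice = pick_token(scores, stop_ids_for(cv3_like_lm), ignore_eos=True)
```

With only the sos index masked, the top-ranked eos token is still selectable inside the ignore_eos window; masking the full stop-token list forces the choice back onto a speech token.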

cosyvoice/cli/model.py + cosyvoice/cli/frontend.py — multi-GPU fix

torch.device('cuda') always defaults to cuda:0, ignoring torch.cuda.set_device(). Changed to torch.device(f'cuda:{torch.cuda.current_device()}') in all model classes.
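
A minimal sketch of the pinning idea, kept free of a PyTorch dependency so it runs anywhere. The helper name `pinned_device_name` is hypothetical and not part of the diff; in the real code its arguments would come from torch.cuda.is_available() and torch.cuda.current_device(), and its result would be passed to torch.device(). Note that the PyTorch documentation describes an index-less 'cuda' device as resolving to the current device at the time of use, so the practical effect of pinning is to capture the index once, when the model object is constructed.

```python
def pinned_device_name(cuda_available, current_index):
    """Build an explicit device string such as 'cuda:1', or 'cpu' without CUDA."""
    if not cuda_available:
        return "cpu"
    return f"cuda:{current_index}"


# Assumed scenario: the caller selected GPU 1 before constructing the model,
# so the current device index reported at construction time is 1.
device_after_set_device_1 = pinned_device_name(True, 1)

# Without CUDA the helper falls back to the CPU, regardless of the index argument.
cpu_fallback = pinned_device_name(False, 0)
```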

Testing

Verified with Fun-CosyVoice3-0.5B-2512:

  • Chinese: "你好世界" → ASR: "你好,世界。" ✅
  • English: "Hello, this is a test." → ASR correct ✅
  • Environment: Python 3.11, torch 2.11+cu128, transformers 5.3.0

sampling_ids: mask ALL stop_token_ids (sos+eos+task_id+fill_token) during
ignore_eos window, not just speech_token_size (sos only). Prevents premature
EOS prediction with transformers >= 5.0.

model.py/frontend.py: use torch.cuda.current_device() instead of hardcoded
'cuda' (defaults to cuda:0), enabling correct multi-GPU operation.