fix: mask all stop_token_ids in sampling_ids + multi-GPU device support by lonrencn · Pull Request #1916 · FunAudioLLM/CosyVoice

lonrencn · 2026-07-03T12:39:37Z

Summary

Two bug fixes for CosyVoice3 inference:

`sampling_ids()` masks only `speech_token_size` (sos) when `ignore_eos=True`, not eos or the other stop tokens. On certain transformers versions, the LLM can predict EOS with high confidence as the first token; the unmasked EOS then leaks through the `ignore_eos` window and produces garbage output.
Hardcoded `torch.device('cuda')` defaults to cuda:0. On multi-GPU systems, `torch.cuda.set_device()` has no effect and models load on the wrong GPU.

Changes

`cosyvoice/llm/llm.py` — sampling_ids fix

`CosyVoice3LM.stop_token_ids = [speech_token_size + i for i in range(200)]` covers sos (6561), eos (6562), task_id (6563), fill_token (6564), etc. The original code masked only index `speech_token_size` (sos = 6561), leaving eos (6562) unmasked during the minimum-length window.

This is usually harmless because a well-functioning LLM does not predict EOS early. But slight numerical differences across transformers versions (e.g., SDPA mask handling changed between 4.x and 5.x) can shift the logits enough (~0.012) to move EOS from the second-ranked to the top-ranked token, causing the model to output garbage.
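To make the rank flip concrete, here is a toy illustration in Python. The two logit values are invented for the example; only the ~0.012 shift comes from the description above.

```python
from typing import Dict


def top_token(logits: Dict[str, float]) -> str:
    """Return the name of the highest-scoring token (greedy pick)."""
    return max(logits, key=logits.get)


# Hypothetical logits: EOS sits just below the best speech token.
baseline = {"speech_token": 5.000, "eos": 4.990}

# A small numerical drift of the size mentioned in the description.
shift = 0.012
shifted = dict(baseline, eos=baseline["eos"] + shift)

print(top_token(baseline))  # speech token still wins
print(top_token(shifted))   # EOS now wins if it is left unmasked
```

When the gap between the top two logits is smaller than the drift, the ranking changes even though neither value moved much, which is why masking matters more than the size of the shift.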

The fix uses hasattr to maintain backward compatibility with CosyVoice1/2.
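A minimal sketch of the masking idea, written in plain Python rather than the repository's torch code. The names `mask_stop_tokens`, `stop_ids_for`, `ToyCosyVoice3LM`, and `ToyLegacyLM` are hypothetical, the vocabulary is shrunk to seven entries, and the `hasattr` fallback shown here is an assumption about how backward compatibility could look, not a copy of the patch.

```python
import math
from typing import List, Sequence


class ToyCosyVoice3LM:
    """Hypothetical stand-in: exposes a list of stop tokens."""
    speech_token_size = 4
    stop_token_ids = [4, 5, 6]  # sos, eos, task_id in this toy vocabulary


class ToyLegacyLM:
    """Hypothetical stand-in for an older model without stop_token_ids."""
    speech_token_size = 4


def stop_ids_for(model) -> List[int]:
    # Assumed fallback: models without stop_token_ids mask only
    # speech_token_size, which matches the pre-fix behaviour.
    if hasattr(model, "stop_token_ids"):
        return list(model.stop_token_ids)
    return [model.speech_token_size]


def mask_stop_tokens(logits: Sequence[float], stop_ids: Sequence[int]) -> List[float]:
    """Return a copy of logits with every stop token set to -inf."""
    masked = list(logits)
    for idx in stop_ids:
        if 0 <= idx < len(masked):
            masked[idx] = -math.inf
    return masked


def greedy_pick(logits: Sequence[float]) -> int:
    """Index of the largest logit."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Toy logits where eos (index 5) has the highest score.
logits = [0.1, 0.5, 0.2, 0.3, 1.0, 2.0, 0.9]

fixed = mask_stop_tokens(logits, stop_ids_for(ToyCosyVoice3LM()))
legacy = mask_stop_tokens(logits, stop_ids_for(ToyLegacyLM()))

print(greedy_pick(fixed))   # a regular speech token
print(greedy_pick(legacy))  # eos leaks through when only sos is masked
```

Masking every stop token removes the dependence on EOS happening to rank below a speech token during the minimum-length window.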

`cosyvoice/cli/model.py` + `cosyvoice/cli/frontend.py` — multi-GPU fix

`torch.device('cuda')` always defaults to cuda:0, ignoring `torch.cuda.set_device()`. Changed to `torch.device(f'cuda:{torch.cuda.current_device()}')` in all model classes.
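The device change can be sketched as a small helper that builds an explicit device string. This is an illustration only: `explicit_device` is a hypothetical name, the CPU fallback is an assumption not described in this PR, and the helper is kept free of torch imports so it runs anywhere.

```python
from typing import Optional


def explicit_device(current_index: Optional[int]) -> str:
    """Build a device string that names the GPU index explicitly.

    current_index is the active CUDA device index, or None when no GPU
    is available (assumed CPU fallback).
    """
    if current_index is None:
        return "cpu"
    if current_index < 0:
        raise ValueError("device index must be non-negative")
    return f"cuda:{current_index}"


print(explicit_device(1))
print(explicit_device(None))
```

In a torch program the index would come from `torch.cuda.current_device()`, guarded by `torch.cuda.is_available()`, and the resulting string would be passed to `torch.device(...)`. Pinning the index in the string means later code does not depend on which device happens to be current when the tensor or module is moved.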

Testing

Verified with Fun-CosyVoice3-0.5B-2512:

Chinese: "你好世界" → ASR: "你好，世界。" ✅
English: "Hello, this is a test." → ASR correct ✅
Environment: Python 3.11, torch 2.11+cu128, transformers 5.3.0

sampling_ids: mask ALL stop_token_ids (sos+eos+task_id+fill_token) during ignore_eos window, not just speech_token_size (sos only). Prevents premature EOS prediction with transformers >= 5.0.

model.py/frontend.py: use torch.cuda.current_device() instead of hardcoded 'cuda' (defaults to cuda:0), enabling correct multi-GPU operation.

lonrencn wants to merge 1 commit into FunAudioLLM:main from lonrencn:fix/cv3-sampling-and-multigpu
