fix: mask all stop_token_ids in sampling_ids + multi-GPU device support #1916
Open
lonrencn wants to merge 1 commit into
sampling_ids: mask ALL stop_token_ids (sos+eos+task_id+fill_token) during ignore_eos window, not just speech_token_size (sos only). Prevents premature EOS prediction with transformers >= 5.0. model.py/frontend.py: use torch.cuda.current_device() instead of hardcoded 'cuda' (defaults to cuda:0), enabling correct multi-GPU operation.
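The device half of this change amounts to building an explicit `cuda:N` device name from the current device index rather than passing a bare `'cuda'`. A minimal sketch of that idea, assuming PyTorch may or may not be installed; `resolve_device_name` and `current_device_name` are hypothetical helpers for illustration, not functions from the CosyVoice code base:

```python
def resolve_device_name(cuda_available: bool, current_index: int) -> str:
    """Build an explicit device string such as 'cuda:1', or 'cpu' without CUDA."""
    return f"cuda:{current_index}" if cuda_available else "cpu"


def current_device_name() -> str:
    """Query PyTorch for the current CUDA device index, falling back to CPU."""
    try:
        import torch  # imported lazily so the sketch also runs without PyTorch
    except ImportError:
        return "cpu"
    if not torch.cuda.is_available():
        return "cpu"
    # torch.cuda.current_device() returns the integer index that
    # torch.cuda.set_device() last selected for this process.
    return resolve_device_name(True, torch.cuda.current_device())


if __name__ == "__main__":
    print(current_device_name())
```

With PyTorch present, the returned string can be passed to `torch.device(...)`, which makes the chosen GPU index visible in logs and in every `.to(device)` call.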
Summary
Two bug fixes for CosyVoice3 inference:
1. `sampling_ids()` only masks `speech_token_size` (sos) during `ignore_eos=True`, not `eos` or other stop tokens. On certain transformers versions, the LLM can predict EOS with high confidence as the first token, and the unmasked EOS leaks through the `ignore_eos` window, producing garbage output.
2. Hardcoded `torch.device('cuda')` defaults to `cuda:0`. On multi-GPU systems, `torch.cuda.set_device()` has no effect and models load on the wrong GPU.

Changes
`cosyvoice/llm/llm.py`: sampling_ids fix

`CosyVoice3LM.stop_token_ids = [speech_token_size + i for i in range(200)]` covers sos (6561), eos (6562), task_id (6563), fill_token (6564), etc. The original code only masked index `speech_token_size` (sos=6561), leaving eos (6562) unmasked during the minimum-length window.

This is usually harmless because a well-functioning LLM doesn't predict EOS early. But with slight numerical differences across transformers versions (e.g., SDPA mask handling changed between 4.x and 5.x), the logits can shift enough (~0.012) to move EOS from the second-ranked to the top-ranked token, causing the model to output garbage.
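The difference between masking only the sos index and masking every stop token can be sketched in plain Python. This is an illustrative toy, not the code in `llm.py`: `mask_stop_tokens` and `argmax` are hypothetical helpers, and the vocabulary is shrunk so that `speech_token_size` is 4 instead of 6561.

```python
import math


def mask_stop_tokens(scores, speech_token_size, stop_token_ids=None):
    """Return a copy of `scores` with stop-token entries set to -inf.

    When no stop_token_ids are supplied, only index `speech_token_size`
    is masked, which mirrors the old behaviour described above.
    """
    if stop_token_ids is None:
        stop_token_ids = [speech_token_size]
    masked = list(scores)  # copy so the caller's scores are not mutated
    for idx in stop_token_ids:
        if 0 <= idx < len(masked):
            masked[idx] = -math.inf
    return masked


def argmax(values):
    """Index of the largest value (first one wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)


# Toy vocabulary: indices 0..3 are speech tokens, 4..7 are special tokens
# (4 plays the role of sos, 5 the role of eos).
scores = [0.1, 0.5, 0.2, 0.3, 0.4, 0.9, 0.0, 0.0]

legacy = mask_stop_tokens(scores, 4)
fixed = mask_stop_tokens(scores, 4, stop_token_ids=[4 + i for i in range(4)])

print(argmax(legacy))  # 5: the eos-like token still wins
print(argmax(fixed))   # 1: a real speech token wins
```

In the toy run, the eos-like index 5 has the highest score, so masking only index 4 still selects it, while masking the full stop-token range forces a speech token.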
The fix uses `hasattr` to maintain backward compatibility with CosyVoice1/2.

`cosyvoice/cli/model.py` + `cosyvoice/cli/frontend.py`: multi-GPU fix

`torch.device('cuda')` always defaults to `cuda:0`, ignoring `torch.cuda.set_device()`. Changed to `torch.device(f'cuda:{torch.cuda.current_device()}')` in all model classes.

Testing
Verified with `Fun-CosyVoice3-0.5B-2512`: