perf(prompt): high-capacity local models get the compressed prompt (lower TTFT) #26
Merged
…ower TTFT)

The capable/compressed-prompt path was gated on the model FAMILY (claude/gpt/gemini), so a frontier-class LOCAL model (e.g. qwen3-235b on a Mac Studio, the exact id in this file's own comment) was treated as a weak model and handed the verbose, example-laden prompt. That just inflates turn-1 prefill (and TTFT) for a model that follows terse guidance perfectly well.

Add isHighCapacityModel(modelId): it reads the largest `<n>b` param marker (MoE active-param markers like a22b are ignored, since the total signals capability) and returns true at ≥70B. A high-capacity model now also counts as `capable`, so big qwen/llama/nemotron get the compressed prompt; small/mid local models (≤32B) keep the full guidance they depend on, so there is no quality regression for weak models.

Measured: ~730 fewer prefill tokens per turn for a 235B model compared with the verbose prompt a 7B model receives (≈ several seconds off TTFT on a slow-prefill local model). The TTFT telemetry from the earlier work lets you confirm the delta on your hardware.

Tests: 4 (≥70B detection incl. largest-marker-wins, small models stay verbose, no misfire on version digits, and the compressed prompt is measurably shorter). typecheck + full suite (1216) + build green.
A surgical TTFT fix for local runs. (The Auto-Evaluation Loop the message also mentioned is already shipped: `qodex skill eval` replays in an isolated `git worktree` + real verifier, with `learning.autoEval` to run it after capture. So this PR is just the TTFT half.)

The leak
The compressed-prompt path was gated on model family (`claude || gpt || gemini`), so a frontier-class local model (e.g. `qwen3-235b` on a Mac Studio, the exact id in `system.ts`'s own comment) was treated as a weak model and handed the verbose, example-laden prompt. That inflates turn-1 prefill (and TTFT) for a model that follows terse guidance fine.

The fix
`isHighCapacityModel(modelId)` reads the largest `<n>b` param marker (MoE active-param markers like `a22b` are ignored, since the total signals capability) and returns true at ≥70B. High-capacity models now count as `capable` and get the compressed prompt:

- `qwen3-235b-a22b`, `llama-3.1-405b`, `nemotron-120b`: compressed prompt
- `qwen2.5-coder:32b`, `:7b`, `llama3.1:8b`: verbose prompt, unchanged

No quality regression for weak models: only ≥70B models change.
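For reviewers who want the rule in code form, here is a minimal sketch of the detection described above. It is not the code from `system.ts`. Only the name `isHighCapacityModel(modelId)`, the ≥70B threshold, the family gate, and the "ignore `a22b`-style markers" rule come from this PR; the regex and the helpers `largestParamMarker` and `isCapable` are assumptions made for illustration.

```typescript
// Hypothetical sketch; the real implementation in system.ts may differ.
const HIGH_CAPACITY_MIN_B = 70; // threshold stated in this PR: >= 70B total params

// Returns the largest `<n>b` marker in the id, or undefined if there is none.
// A marker only counts when it is NOT preceded by a letter, digit, or dot, so:
//   - MoE active-param markers such as "a22b" are skipped (preceded by "a"),
//   - version digits such as "3.1" or "qwen3" never match (no trailing "b").
function largestParamMarker(modelId: string): number | undefined {
  const marker = /(^|[^a-z0-9.])(\d+(?:\.\d+)?)b(?![a-z0-9])/gi;
  let largest: number | undefined;
  let m: RegExpExecArray | null;
  while ((m = marker.exec(modelId)) !== null) {
    const size = Number(m[2]);
    if (largest === undefined || size > largest) largest = size;
  }
  return largest;
}

function isHighCapacityModel(modelId: string): boolean {
  const largest = largestParamMarker(modelId);
  return largest !== undefined && largest >= HIGH_CAPACITY_MIN_B;
}

// Assumed shape of the gate: the existing family check OR the new size check.
function isCapable(modelId: string): boolean {
  const id = modelId.toLowerCase();
  const knownFamily =
    id.includes("claude") || id.includes("gpt") || id.includes("gemini");
  return knownFamily || isHighCapacityModel(id);
}
```

With this sketch, `isHighCapacityModel("qwen3-235b-a22b")` is true (235 wins, `a22b` is ignored), while `isHighCapacityModel("qwen2.5-coder:32b")` and `isHighCapacityModel("llama3.1:8b")` are false, so those ids keep the verbose prompt.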
Measured
~730 fewer prefill tokens per turn for a 235B model compared with the verbose prompt a 7B model receives (≈ several seconds off TTFT on a slow-prefill local model). The per-turn `LLM turn timing` telemetry (shipped earlier) lets you confirm the delta on your Mac Studio.

Tests
4 (≥70B detection incl. largest-marker-wins, small models stay verbose, no misfire on version digits, compressed prompt measurably shorter). ✅ typecheck · ✅ full suite (1216) · ✅ build.
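
As a rough sanity check on the "several seconds" figure under Measured, the saving is just the token delta divided by the prefill rate. The sketch below uses the ~730-token delta from this PR. The prefill rates are hypothetical placeholders, not measurements, and `extraPrefillSeconds` is an illustrative helper that does not exist in the codebase.

```typescript
// Seconds of prefill (and therefore TTFT) attributable to `extraTokens`
// prompt tokens at a given prefill throughput.
function extraPrefillSeconds(
  extraTokens: number,
  prefillTokensPerSecond: number,
): number {
  if (prefillTokensPerSecond <= 0) {
    throw new RangeError("prefill rate must be positive");
  }
  return extraTokens / prefillTokensPerSecond;
}

const SAVED_TOKENS = 730; // per-turn delta reported in this PR

// Hypothetical prefill rates in tokens/second; substitute the rate you
// observe in the per-turn timing telemetry on your own hardware.
const ASSUMED_RATES = [100, 250, 500];

for (const rate of ASSUMED_RATES) {
  const saved = extraPrefillSeconds(SAVED_TOKENS, rate);
  console.log(`${rate} tok/s prefill: ~${saved.toFixed(2)} s saved per turn`);
}
```

The slower the prefill, the larger the saving, which is why the change matters most for large local models.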