perf(prompt): high-capacity local models get the compressed prompt (lower TTFT)#26

Merged: QodeXcli merged 1 commit into main from perf/local-prefill-ttft on Jun 25, 2026
@QodeXcli

Owner

A surgical TTFT fix for local runs. (The Auto-Evaluation Loop the message also mentioned has already shipped: qodex skill eval replays in an isolated git worktree with a real verifier, and learning.autoEval runs it after capture. So this PR is just the TTFT half.)

The leak

The compressed-prompt path was gated on model family (claude || gpt || gemini), so a frontier-class local model — e.g. qwen3-235b on a Mac Studio (the exact id in system.ts's own comment) — was treated as a weak model and handed the verbose, example-laden prompt. That inflates turn-1 prefill (and TTFT) for a model that follows terse guidance fine.

The fix

isHighCapacityModel(modelId) reads the largest <n>b param marker (MoE active-param markers like a22b are ignored; the total signals capability) and returns true at ≥70B. High-capacity models now count as capable and therefore get the compressed prompt:

| model | before | after |
| --- | --- | --- |
| qwen3-235b-a22b | verbose | compressed |
| llama-3.1-405b, nemotron-120b | verbose | compressed |
| qwen2.5-coder:32b, :7b, llama3.1:8b | verbose | verbose (unchanged; weak models keep their guidance) |

No quality regression for weak models — only ≥70B models change.
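
For readers who want to see the shape of the check, here is a minimal sketch of how such a detector and gate could look. It is not the code from system.ts: the regex, the helper `largestParamMarkerB`, the constant `HIGH_CAPACITY_THRESHOLD_B`, and the `isCapable` wrapper are all assumptions made for illustration; only the name isHighCapacityModel, the ≥70B cutoff, and the claude/gpt/gemini family gate come from this PR.

```typescript
// Hypothetical sketch; the real isHighCapacityModel in system.ts may differ.
const HIGH_CAPACITY_THRESHOLD_B = 70; // cutoff stated in this PR

// Returns the largest "<n>b" parameter marker in the id, or null if none.
// A marker must not be preceded by a letter, digit, or dot, so MoE
// active-param markers ("a22b") and version digits ("qwen2.5", "3.1")
// are never read as parameter counts.
function largestParamMarkerB(modelId: string): number | null {
  const pattern = /(^|[^a-z0-9.])(\d+(?:\.\d+)?)b(?![a-z0-9])/g;
  const id = modelId.toLowerCase();
  let largest: number | null = null;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(id)) !== null) {
    const size = Number(match[2]);
    if (largest === null || size > largest) largest = size;
  }
  return largest;
}

function isHighCapacityModel(modelId: string): boolean {
  const largest = largestParamMarkerB(modelId);
  return largest !== null && largest >= HIGH_CAPACITY_THRESHOLD_B;
}

// Assumed gate: known frontier family OR high-capacity local model.
function isCapable(modelId: string): boolean {
  const id = modelId.toLowerCase();
  const knownFamily = ["claude", "gpt", "gemini"].some((f) => id.includes(f));
  return knownFamily || isHighCapacityModel(id);
}
```

Under these assumptions, `isCapable("qwen3-235b-a22b")` selects the compressed prompt, while `isCapable("qwen2.5-coder:32b")` keeps the verbose one. The sketch is deliberately conservative: an id with no recognizable marker is treated as not high-capacity.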

Measured

The compressed prompt a 235B model now gets is ~730 prefill tokens per turn shorter than the verbose prompt a 7B model gets (≈ several seconds off TTFT on a slow-prefill local model). The per-turn LLM timing telemetry (shipped earlier) lets you confirm the delta on your Mac Studio.
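
To turn the token delta into a rough time estimate, divide the tokens saved by the prefill throughput. The sketch below does that arithmetic; the throughput values are assumed placeholders rather than measurements from this PR, and `prefillSecondsSaved` is a hypothetical helper name.

```typescript
// Rough estimate only: seconds saved = tokens saved / prefill tokens per second.
function prefillSecondsSaved(
  tokensSaved: number,
  prefillTokensPerSecond: number,
): number {
  if (prefillTokensPerSecond <= 0) {
    throw new RangeError("prefill rate must be positive");
  }
  return tokensSaved / prefillTokensPerSecond;
}

const TOKENS_SAVED = 730; // delta reported in this PR

// Assumed (hypothetical) prefill rates; substitute the rate you measure.
for (const rate of [100, 200, 500]) {
  const seconds = prefillSecondsSaved(TOKENS_SAVED, rate);
  console.log(`${rate} tok/s -> ${seconds.toFixed(2)} s saved`);
}
```

At an assumed 100 tokens/s of prefill, 730 tokens is about 7.3 s; at faster prefill rates the saving shrinks proportionally, which is why the telemetry is the better source of truth for a given machine.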

Tests

4 (≥70B detection incl. largest-marker-wins, small models stay verbose, no misfire on version digits, compressed prompt measurably shorter). ✅ typecheck · ✅ full suite (1216) · ✅ build.

…ower TTFT)

The capable/compressed-prompt path was gated on the model FAMILY (claude/gpt/
gemini), so a frontier-class LOCAL model — e.g. qwen3-235b on a Mac Studio, the
exact id in this file's own comment — was treated as a weak model and handed the
verbose, example-laden prompt. That just inflates turn-1 prefill (and TTFT) for a
model that follows terse guidance perfectly well.

Add isHighCapacityModel(modelId): reads the largest `<n>b` param marker (MoE
active-param markers like a22b are ignored — the total signals capability) and
returns true at ≥70B. A high-capacity model now also counts as `capable`, so big
qwen/llama/nemotron get the compressed prompt; small/mid local models (≤32B) keep
the full guidance they depend on — no quality regression for weak models.

Measured: ~730 fewer prefill tokens per turn for a 235B vs the 7B verbose prompt
(≈ several seconds off TTFT on a slow-prefill local model). The TTFT telemetry
from the earlier work lets you confirm the delta on your hardware.

Tests: 4 (≥70B detection incl. largest-marker-wins, small models stay verbose, no
misfire on version digits, and the compressed prompt is measurably shorter).
typecheck + full suite (1216) + build green.
@QodeXcli QodeXcli merged commit 9dfe156 into main Jun 25, 2026
2 checks passed
@QodeXcli QodeXcli deleted the perf/local-prefill-ttft branch June 25, 2026 15:19