perf(prompt): high-capacity local models get the compressed prompt (lower TTFT) by QodeXcli · Pull Request #26 · QodeXcli/QodeX

QodeXcli · 2026-06-25T15:18:36Z

A surgical TTFT fix for local runs. (The Auto-Evaluation Loop the message also mentioned is already shipped — qodex skill eval replays in an isolated git worktree + real verifier, with learning.autoEval to run it after capture — so this PR is just the TTFT half.)

The leak

The compressed-prompt path was gated on model family (claude || gpt || gemini), so a frontier-class local model — e.g. qwen3-235b on a Mac Studio (the exact id in system.ts's own comment) — was treated as a weak model and handed the verbose, example-laden prompt. That inflates turn-1 prefill (and TTFT) for a model that follows terse guidance fine.
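To make the leak concrete, here is a minimal sketch of a family-only gate. Every name in it (`CAPABLE_FAMILIES`, `isCapableByFamily`, `selectPromptByFamily`) is hypothetical and is not taken from `system.ts`; only the family list comes from the description above.

```typescript
// Illustrative sketch only: all names here are assumptions, not the project's code.
type PromptVariant = "compressed" | "verbose";

// The three families the description says the gate recognized.
const CAPABLE_FAMILIES = ["claude", "gpt", "gemini"];

// Family-only gate: capability is inferred from the family name alone.
function isCapableByFamily(modelId: string): boolean {
  const id = modelId.toLowerCase();
  return CAPABLE_FAMILIES.some((family) => id.includes(family));
}

// Anything outside the known families falls through to the verbose prompt,
// regardless of how large the model is.
function selectPromptByFamily(modelId: string): PromptVariant {
  return isCapableByFamily(modelId) ? "compressed" : "verbose";
}
```

Under this sketch a 235B local model and a 7B local model land in the same bucket, because neither name contains a recognized family.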

The fix

`isHighCapacityModel(modelId)` reads the largest `<n>b` param marker (MoE active-param markers like `a22b` are ignored; the total signals capability) and returns true at ≥70B. High-capacity models now count as capable, so they get the compressed prompt:

| model | before | after |
| --- | --- | --- |
| `qwen3-235b-a22b` | verbose | compressed |
| `llama-3.1-405b`, `nemotron-120b` | verbose | compressed |
| `qwen2.5-coder:32b`, `:7b`, `llama3.1:8b` | verbose | verbose (unchanged; weak models keep their guidance) |

No quality regression for weak models — only ≥70B models change.
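One way the size check could work is sketched below. This is an assumption-laden illustration, not the merged implementation: the regex, the helper `largestParamMarkerB`, and the constant name are invented here, and only the behavior (largest `<n>b` marker wins, `a22b`-style markers ignored, threshold at 70B) comes from the description.

```typescript
// Illustrative sketch only: the regex and helper names are assumptions.
const HIGH_CAPACITY_THRESHOLD_B = 70;

// Returns the largest "<n>b" size marker in the id, or null if there is none.
function largestParamMarkerB(modelId: string): number | null {
  // A marker must not be directly preceded by a letter, digit, or dot, so
  // "a22b" (MoE active params) is skipped, and it must end at "b", so version
  // digits such as "3.1" in "llama3.1:8b" never count as a size.
  const marker = /(?:^|[^a-z0-9.])(\d+(?:\.\d+)?)b(?![a-z0-9])/g;
  const id = modelId.toLowerCase();
  let largest: number | null = null;
  let match: RegExpExecArray | null;
  while ((match = marker.exec(id)) !== null) {
    const size = Number(match[1]);
    if (largest === null || size > largest) largest = size;
  }
  return largest;
}

// True when the largest size marker is at or above the 70B threshold.
function isHighCapacityModel(modelId: string): boolean {
  const largest = largestParamMarkerB(modelId);
  return largest !== null && largest >= HIGH_CAPACITY_THRESHOLD_B;
}
```

With this sketch, `qwen3-235b-a22b`, `llama-3.1-405b`, and `nemotron-120b` are classified as high-capacity, while `qwen2.5-coder:32b` and `llama3.1:8b` are not, which matches the table above.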

Measured

~730 fewer prefill tokens per turn for a 235B model compared with the verbose prompt a 7B model gets (≈ several seconds off TTFT on a slow-prefill local model). The per-turn LLM timing telemetry (shipped earlier) lets you confirm the delta on your Mac Studio.
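As a rough sanity check on the "several seconds" estimate, the saving can be approximated as tokens saved divided by prefill throughput. The sketch below assumes throughput is roughly constant over that range. The rates in it are hypothetical placeholders, not measurements from this PR; substitute the figure your own telemetry reports.

```typescript
// Illustrative arithmetic only: the prefill rates below are assumptions.
function prefillSecondsSaved(tokensSaved: number, prefillTokensPerSecond: number): number {
  if (prefillTokensPerSecond <= 0) {
    throw new RangeError("prefill rate must be positive");
  }
  return tokensSaved / prefillTokensPerSecond;
}

const TOKENS_SAVED = 730; // the per-turn figure quoted above

// Hypothetical slow-prefill rates in tokens per second.
for (const rate of [100, 250, 500]) {
  const seconds = prefillSecondsSaved(TOKENS_SAVED, rate);
  console.log(`${rate} tok/s -> ${seconds.toFixed(1)} s saved`);
}
```

The slower the prefill, the larger the saving, which is why the change matters most for very large models on local hardware.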

Tests

4 (≥70B detection incl. largest-marker-wins, small models stay verbose, no misfire on version digits, compressed prompt measurably shorter). ✅ typecheck · ✅ full suite (1216) · ✅ build.

…ower TTFT)

The capable/compressed-prompt path was gated on the model FAMILY (claude/gpt/gemini), so a frontier-class LOCAL model (e.g. qwen3-235b on a Mac Studio, the exact id in this file's own comment) was treated as a weak model and handed the verbose, example-laden prompt. That just inflates turn-1 prefill (and TTFT) for a model that follows terse guidance perfectly well.

Add isHighCapacityModel(modelId): reads the largest `<n>b` param marker (MoE active-param markers like a22b are ignored; the total signals capability) and returns true at ≥70B. A high-capacity model now also counts as `capable`, so big qwen/llama/nemotron get the compressed prompt; small/mid local models (≤32B) keep the full guidance they depend on, with no quality regression for weak models.

Measured: ~730 fewer prefill tokens per turn for a 235B vs the 7B verbose prompt (≈ several seconds off TTFT on a slow-prefill local model). The TTFT telemetry from the earlier work lets you confirm the delta on your hardware.

Tests: 4 (≥70B detection incl. largest-marker-wins, small models stay verbose, no misfire on version digits, and the compressed prompt is measurably shorter). typecheck + full suite (1216) + build green.

QodeXcli merged commit 9dfe156 into main Jun 25, 2026
2 checks passed

QodeXcli deleted the perf/local-prefill-ttft branch June 25, 2026 15:19

QodeXcli merged 1 commit into main from perf/local-prefill-ttft
