feat(skills): strengthen UCB1 — composite reward, tunable c, trial floor, scores, off-switch by QodeXcli · Pull Request #27 · QodeXcli/QodeX

QodeXcli · 2026-06-26T01:17:10Z

All five requested upgrades to the Phase-4 UCB1 bandit (#24), in the pure, tested core.

1. Tunable exploration factor

RouteOptions.explorationFactor (config learning.versioning.ucbExplorationFactor, default √2) sets UCB1's exploration constant c. A higher c keeps a barely-tested challenger in play longer.
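To make the effect of c concrete, here is a minimal sketch of the textbook UCB1 score with a tunable exploration factor. The `ArmStats` shape and the `ucb1Score` name are hypothetical and used only for illustration; the PR's actual implementation is not shown here and may differ.

```typescript
// Hypothetical per-version stats; not the PR's real types.
interface ArmStats {
  pulls: number;      // how many times this version has been routed
  meanReward: number; // average reward so far, assumed to lie in [0, 1]
}

// Textbook UCB1: mean reward plus c * sqrt(ln(totalPulls) / pulls).
// With c = sqrt(2) this is the classic sqrt(2 * ln(n) / n_i) bonus.
function ucb1Score(arm: ArmStats, totalPulls: number, c: number = Math.SQRT2): number {
  if (arm.pulls === 0) {
    // An arm that has never been tried is always explored first.
    return Infinity;
  }
  const bonus = c * Math.sqrt(Math.log(totalPulls) / arm.pulls);
  return arm.meanReward + bonus;
}
```

For example, with a champion at 50 pulls and mean reward 0.9 and a challenger at 3 pulls and mean reward 0.6, the default c scores the challenger higher (about 2.23 versus 1.30), so it keeps getting traffic. At c = 0.1 the champion scores higher (about 0.93 versus 0.72).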

2. Composite reward (success + tokens + time)

UCB1's per-arm value is no longer the raw success rate but a weighted blend: default 0.7 success / 0.15 token / 0.15 time, configurable. Success dominates (a cheap failure never beats an expensive success), but between two equally successful versions the cheaper and faster one wins. Added totalDurationMs to VersionStats (optional, for backward compatibility); recordVersionExecution now takes durationMs; decideChampion converges on the composite reward too.
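The weighted blend can be sketched as below. This assumes each component has already been normalized to [0, 1]; how this PR normalizes tokens and time is not described here, so `tokenEfficiency` and `timeEfficiency` are placeholders, and the `RewardWeights` field names are hypothetical.

```typescript
// Hypothetical weight shape; only the 0.7 / 0.15 / 0.15 defaults come from the PR.
interface RewardWeights {
  success: number;
  token: number;
  time: number;
}

const DEFAULT_WEIGHTS: RewardWeights = { success: 0.7, token: 0.15, time: 0.15 };

// Clamp a value into [0, 1] so a stray input cannot skew the blend.
function clamp01(x: number): number {
  return Math.min(1, Math.max(0, x));
}

// Weighted blend of success and the two efficiency terms.
// Assumption: all three inputs are already normalized to [0, 1].
function compositeReward(
  successRate: number,
  tokenEfficiency: number,
  timeEfficiency: number,
  w: RewardWeights = DEFAULT_WEIGHTS,
): number {
  return (
    w.success * clamp01(successRate) +
    w.token * clamp01(tokenEfficiency) +
    w.time * clamp01(timeEfficiency)
  );
}
```

With the default weights, a single failed run with perfect efficiency scores at most 0.3, while a successful run with zero efficiency scores 0.7. That gap is why success dominates under this sketch.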

3. Minimum challenger trials

RouteOptions.minChallengerTrials (default 5): a challenger is force-routed until it clears the floor, so UCB1 never starves it, and no champion decision is made, on too little signal.
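A minimal sketch of how a trial floor could sit in front of the UCB1 comparison. The `Arm` and `Route` shapes and the `routeWithFloor` function are hypothetical names for illustration, not the PR's API.

```typescript
// Hypothetical shapes; the PR's real manifest and route types are not shown.
interface Arm {
  version: string;
  pulls: number;
  meanReward: number; // assumed to lie in [0, 1]
}

interface Route {
  version: string;
  reason: "trial-floor" | "ucb1";
}

function routeWithFloor(
  champion: Arm,
  challenger: Arm,
  minChallengerTrials: number = 5,
  c: number = Math.SQRT2,
): Route {
  // Below the floor the challenger is always routed, whatever its reward so far.
  if (challenger.pulls < minChallengerTrials) {
    return { version: challenger.version, reason: "trial-floor" };
  }
  // At or above the floor, fall back to a plain UCB1 comparison.
  const total = champion.pulls + challenger.pulls;
  const score = (a: Arm): number =>
    a.pulls === 0 ? Infinity : a.meanReward + c * Math.sqrt(Math.log(total) / a.pulls);
  const pick = score(challenger) > score(champion) ? challenger : champion;
  return { version: pick.version, reason: "ucb1" };
}
```

In this sketch the floor is checked before any score is computed, so a challenger whose first few runs all fail still gets its minimum number of trials.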

4. UCB score history / analysis

ucbScores(manifest) returns reward + exploration bonus + ucb per arm; qodex skill versions <name> now prints the breakdown (and avg ms/run) for debugging.

5. Off-switch

routingStrategy: 'champion-only' (config strategy) disables UCB entirely and always routes the stable version, for sensitive skills you don't want experimented on.

config: learning.versioning { ucbExplorationFactor, minChallengerTrials, rewardWeights, strategy }, threaded through versioned-store + the CLI.
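For orientation, the four keys might be filled in as follows. This is a sketch written as a typed TypeScript object: the project's actual config file format is not shown in this PR, and the field names inside `rewardWeights` are assumptions.

```typescript
// Hypothetical typing for the learning.versioning block.
interface VersioningConfig {
  ucbExplorationFactor: number;
  minChallengerTrials: number;
  // Assumed field names; the PR only states the 0.7 / 0.15 / 0.15 defaults.
  rewardWeights: { success: number; token: number; time: number };
  strategy: string;
}

const versioning: VersioningConfig = {
  ucbExplorationFactor: Math.SQRT2, // default per the PR description (√2)
  minChallengerTrials: 5,           // default per the PR description
  rewardWeights: { success: 0.7, token: 0.15, time: 0.15 },
  strategy: "champion-only",        // the off-switch: never route a challenger
};
```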

Live-verified

Between two 100%-success versions, the cheaper/faster one won on composite reward and was promoted to champion.

Tests

+6 (champion-only, trial floor, exploration-factor explore↔exploit, composite reward incl. success-dominates, ucbScores snapshot) + existing 12. ✅ typecheck · ✅ full suite (1222) · ✅ build.

…oor, scores, off-switch

All five requested upgrades to the Phase-4 bandit, in the pure tested core:

1. Tunable exploration factor: RouteOptions.explorationFactor (config learning.versioning.ucbExplorationFactor; default √2). Higher c keeps a barely-tested challenger in play longer.
2. Composite reward: UCB1's per-arm value is no longer raw success rate but a weighted blend of success + token-efficiency + time-efficiency (default 0.7/0.15/0.15, configurable). Success dominates (a cheap failure never beats an expensive success), but between two equally-successful versions the cheaper + faster one wins. Added totalDurationMs to VersionStats (back-compat optional); recordVersionExecution now takes durationMs; decideChampion converges on the composite reward too.
3. Minimum challenger trials: RouteOptions.minChallengerTrials (default 5): a challenger is force-routed until it clears the floor, so UCB1 never starves it (or a decision is never made) on too little signal.
4. UCB score history/analysis: ucbScores(manifest) returns reward + exploration bonus + ucb per arm; `qodex skill versions <name>` now prints the breakdown (and avg ms/run) for debugging.
5. Off-switch: routingStrategy 'champion-only' (config strategy) disables UCB and always routes the stable version, for sensitive skills you don't want experimented on.

config: learning.versioning { ucbExplorationFactor, minChallengerTrials, rewardWeights, strategy }, threaded through versioned-store + the CLI.

Live-verified: between two 100%-success versions, the cheaper/faster one won on composite reward and was promoted to champion.

Tests: +6 (champion-only, trial floor, exploration-factor explore/exploit, composite reward incl. success-dominates, ucbScores snapshot) + the existing 12. typecheck + full suite (1222) + build green.

… + UCB README (#28)

Follow-ups to the UCB1 work (rewardWeights config already shipped in #27):

- Composite-reward efficiency is now normalized RELATIVE TO THE CHAMPION (the stable version is the baseline). championRef() takes the champion's per-exec tokens/ms; a version at champion cost scores the 0.5 efficiency baseline, free → 1.0, twice the cost → 0.0. So a challenger is rewarded for being cheaper/faster than the champion it's trying to unseat (not just relative to the other arm). ucbScores + decideChampion both use the champion reference.
- `qodex skill rollback <name> <version>` + rollbackToVersion(): snap a versioned skill's champion back to any earlier version (un-retires it, drops the challenger). The manual safety lever alongside the bandit's auto-convergence.
- README: a "Skill versioning & A/B testing (UCB1)" section covering flat manifest storage, the bandit, composite/champion-relative reward, the full config block, and a real `skill versions` / `skill rollback` example.

Live-verified: rollback to a version makes it champion + clears the challenger; champion-relative reward gives the champion a 0.5 efficiency baseline and ranks a 2× costlier challenger below it.

Tests updated to the champion-ref shape + 1 new (over-budget challenger penalized). typecheck + full suite (1223) + build green.

Co-authored-by: Louise Lau <QodeXcli@users.noreply.github.com>
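The three anchor points in the #28 message (free → 1.0, champion cost → 0.5, twice the cost → 0.0) are consistent with a clamped linear map. The sketch below assumes that linear form; the message does not spell out the exact formula, and `championRelativeEfficiency` is a hypothetical name rather than the project's championRef().

```typescript
// Assumed linear interpolation through the three stated points:
//   cost 0              -> 1.0
//   cost == champion    -> 0.5
//   cost == 2x champion -> 0.0 (clamped at 0 beyond that)
function championRelativeEfficiency(cost: number, championCost: number): number {
  if (championCost <= 0) {
    // Assumption: with no usable baseline, fall back to the neutral 0.5.
    return 0.5;
  }
  const raw = 1 - cost / (2 * championCost);
  return Math.min(1, Math.max(0, raw));
}
```

Under this sketch the same function would be applied separately to per-execution tokens and per-execution milliseconds, with the two results feeding the token and time terms of the composite reward.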

QodeXcli merged commit fc4b9e6 into main Jun 26, 2026
2 checks passed

QodeXcli deleted the feat/ucb-stronger branch June 26, 2026 01:18

QodeXcli mentioned this pull request Jun 26, 2026

feat(skills): champion-relative reward + skill rollback + UCB README #28

Merged
