feat(skills): strengthen UCB1 — composite reward, tunable c, trial floor, scores, off-switch#27

Merged
QodeXcli merged 1 commit into main from feat/ucb-stronger
Jun 26, 2026

@QodeXcli

Owner

All five requested upgrades to the Phase-4 UCB1 bandit (#24), in the pure, tested core.

1. Tunable exploration factor

RouteOptions.explorationFactor (config learning.versioning.ucbExplorationFactor, default √2) sets UCB1's exploration constant c. A higher c keeps a barely-tested challenger in play longer.
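To make the effect of c concrete, here is a minimal sketch of the textbook UCB1 score. It assumes the exploration bonus has the form c · √(ln N / n), which matches the √2 default but is not confirmed by this PR. `ArmStats`, `ucb1Score`, and `pick` are hypothetical names, not the repository's API.

```typescript
// Hypothetical arm shape; the real stats type is not shown in this PR.
interface ArmStats {
  trials: number;     // how many times this version was routed
  meanReward: number; // average reward in [0, 1]
}

// Textbook UCB1: mean reward plus c * sqrt(ln(N) / n).
function ucb1Score(arm: ArmStats, totalTrials: number, c: number = Math.SQRT2): number {
  if (arm.trials === 0) return Number.POSITIVE_INFINITY; // untried arms are picked first
  const bonus = c * Math.sqrt(Math.log(totalTrials) / arm.trials);
  return arm.meanReward + bonus;
}

const champion: ArmStats = { trials: 100, meanReward: 0.9 };
const challenger: ArmStats = { trials: 5, meanReward: 0.6 };
const total = champion.trials + challenger.trials;

// Route to whichever arm has the higher UCB score for a given c.
function pick(c: number): "champion" | "challenger" {
  return ucb1Score(challenger, total, c) > ucb1Score(champion, total, c)
    ? "challenger"
    : "champion";
}

console.log(pick(Math.SQRT2)); // challenger: the bonus outweighs its lower mean
console.log(pick(0.1));        // champion: a small c mostly exploits
```

With only 5 trials the challenger's bonus is large, so the default c still routes to it; shrinking c hands the traffic back to the champion.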

2. Composite reward (success + tokens + time)

UCB1's per-arm value is no longer raw success rate but a weighted blend — default 0.7 success / 0.15 token / 0.15 time, configurable. Success dominates (a cheap failure never beats an expensive success), but between two equally-successful versions the cheaper + faster one wins. Added totalDurationMs to stats (back-compat optional); recordVersionExecution takes durationMs; decideChampion converges on the composite reward too.
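The blend can be sketched as a plain weighted sum. This assumes the token and time terms arrive as efficiency scores already normalized to [0, 1]; how the PR derives them is not shown here. `RewardWeights`, `DEFAULT_WEIGHTS`, and `compositeReward` are illustrative names only.

```typescript
// Hypothetical weight shape; the real rewardWeights keys may differ.
interface RewardWeights {
  success: number;
  token: number;
  time: number;
}

// Defaults as stated in the PR description.
const DEFAULT_WEIGHTS: RewardWeights = { success: 0.7, token: 0.15, time: 0.15 };

// Weighted blend of success rate and two efficiency scores, all in [0, 1].
function compositeReward(
  successRate: number,
  tokenEfficiency: number,
  timeEfficiency: number,
  w: RewardWeights = DEFAULT_WEIGHTS,
): number {
  return w.success * successRate + w.token * tokenEfficiency + w.time * timeEfficiency;
}

// Success dominates: a perfectly cheap failure still loses to a maximally costly success.
const cheapFailure = compositeReward(0, 1, 1);
const expensiveSuccess = compositeReward(1, 0, 0);

// Tie-break: with equal success, the more efficient version scores higher.
const lean = compositeReward(1, 0.8, 0.8);
const heavy = compositeReward(1, 0.4, 0.4);

console.log({ cheapFailure, expensiveSuccess, lean, heavy });
```

Under the default weights the two efficiency terms can contribute at most 0.3 combined, which is why they can only break ties between versions with similar success rates.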

3. Minimum challenger trials

RouteOptions.minChallengerTrials (default 5): a challenger is force-routed until it clears the floor, so UCB1 never starves it and no champion decision is made on too little signal.
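A rough sketch of the floor, under the assumption that it is checked before the bandit is consulted at all. `routeWithFloor`, `TrialCounts`, and the `ucbPick` callback are hypothetical and stand in for the real routing code.

```typescript
type Arm = "champion" | "challenger";

// Hypothetical per-arm trial counters.
interface TrialCounts {
  champion: number;
  challenger: number;
}

// Force-route the challenger until it has enough trials, then defer to UCB1.
function routeWithFloor(
  trials: TrialCounts,
  ucbPick: () => Arm,      // stand-in for the normal UCB1 decision
  minChallengerTrials = 5, // default stated in the PR
): Arm {
  if (trials.challenger < minChallengerTrials) return "challenger";
  return ucbPick();
}

// Below the floor the bandit's preference is ignored.
console.log(routeWithFloor({ champion: 50, challenger: 3 }, () => "champion")); // challenger
// At the floor, UCB1 decides again.
console.log(routeWithFloor({ champion: 50, challenger: 5 }, () => "champion")); // champion
```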

4. UCB score history / analysis

ucbScores(manifest) returns reward + exploration bonus + ucb per arm; qodex skill versions <name> now prints the breakdown (and avg ms/run) for debugging.

5. Off-switch

routingStrategy: 'champion-only' (config strategy) disables UCB entirely and always routes the stable version. Use it for sensitive skills you don't want experimented on.

config: learning.versioning { ucbExplorationFactor, minChallengerTrials, rewardWeights, strategy }, threaded through versioned-store + the CLI.
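For orientation, the config block might look roughly like the following when written as a typed object. Only the four top-level keys, the stated defaults, and the 'champion-only' value come from this PR; the `rewardWeights` key names and the `"ucb1"` strategy name are placeholders, and the real config file format is not shown here.

```typescript
// "ucb1" is a placeholder for whatever the default strategy is actually called.
type RoutingStrategy = "ucb1" | "champion-only";

// Hypothetical shape for learning.versioning.
interface VersioningConfig {
  ucbExplorationFactor: number;
  minChallengerTrials: number;
  rewardWeights: { success: number; token: number; time: number }; // key names assumed
  strategy: RoutingStrategy;
}

// Defaults as described in the PR text.
const defaults: VersioningConfig = {
  ucbExplorationFactor: Math.SQRT2,
  minChallengerTrials: 5,
  rewardWeights: { success: 0.7, token: 0.15, time: 0.15 },
  strategy: "ucb1",
};

// A sensitive skill opts out of experimentation entirely.
const sensitiveSkill: VersioningConfig = { ...defaults, strategy: "champion-only" };

console.log(sensitiveSkill.strategy);
```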

Live-verified

Between two 100%-success versions, the cheaper/faster one won on composite reward and was promoted to champion.

Tests

+6 (champion-only, trial floor, exploration-factor explore↔exploit, composite reward incl. success-dominates, ucbScores snapshot) + existing 12. ✅ typecheck · ✅ full suite (1222) · ✅ build.

…oor, scores, off-switch

All five requested upgrades to the Phase-4 bandit, in the pure tested core:

1. Tunable exploration factor — RouteOptions.explorationFactor (config
   learning.versioning.ucbExplorationFactor; default √2). Higher c keeps a
   barely-tested challenger in play longer.
2. Composite reward — UCB1's per-arm value is no longer raw success rate but a
   weighted blend of success + token-efficiency + time-efficiency (default
   0.7/0.15/0.15, configurable). Success dominates (a cheap failure never beats an
   expensive success), but between two equally-successful versions the cheaper +
   faster one wins. Added totalDurationMs to VersionStats (back-compat optional);
   recordVersionExecution now takes durationMs; decideChampion converges on the
   composite reward too.
3. Minimum challenger trials: RouteOptions.minChallengerTrials (default 5). A
   challenger is force-routed until it clears the floor, so UCB1 never starves it
   and no champion decision is made on too little signal.
4. UCB score history/analysis — ucbScores(manifest) returns reward + exploration
   bonus + ucb per arm; `qodex skill versions <name>` now prints the breakdown
   (and avg ms/run) for debugging.
5. Off-switch — routingStrategy 'champion-only' (config strategy) disables UCB and
   always routes the stable version — for sensitive skills you don't want
   experimented on.

config: learning.versioning { ucbExplorationFactor, minChallengerTrials,
rewardWeights, strategy }, threaded through versioned-store + the CLI.

Live-verified: between two 100%-success versions, the cheaper/faster one won on
composite reward and was promoted to champion. Tests: +6 (champion-only, trial
floor, exploration-factor explore/exploit, composite reward incl. success-
dominates, ucbScores snapshot) + the existing 12. typecheck + full suite (1222) +
build green.
@QodeXcli QodeXcli merged commit fc4b9e6 into main Jun 26, 2026
2 checks passed
@QodeXcli QodeXcli deleted the feat/ucb-stronger branch June 26, 2026 01:18
QodeXcli added a commit that referenced this pull request Jun 26, 2026
… + UCB README (#28)

Follow-ups to the UCB1 work (rewardWeights config already shipped in #27):

- Composite-reward efficiency is now normalized RELATIVE TO THE CHAMPION (the
  stable version is the baseline). championRef() takes the champion's per-exec
  tokens/ms; a version at champion cost scores the 0.5 efficiency baseline, free →
  1.0, twice the cost → 0.0. So a challenger is rewarded for being cheaper/faster
  than the champion it's trying to unseat (not just relative to the other arm).
  ucbScores + decideChampion both use the champion reference.
- `qodex skill rollback <name> <version>` + rollbackToVersion(): snap a versioned
  skill's champion back to any earlier version (un-retires it, drops the
  challenger). The manual safety lever alongside the bandit's auto-convergence.
- README: a "Skill versioning & A/B testing (UCB1)" section — flat manifest
  storage, the bandit, composite/champion-relative reward, the full config block,
  and a real `skill versions` / `skill rollback` example.
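
The champion-relative normalization in the first bullet fixes three points: champion cost scores 0.5, free scores 1.0, and twice the cost scores 0.0. A minimal sketch, assuming the score is linear between those points and clamped outside them (the actual curve is not stated here); `relativeEfficiency` and its zero-baseline fallback are hypothetical:

```typescript
// Efficiency of a version's per-exec cost (tokens or ms) relative to the champion's.
// Assumption: linear through (0, 1.0), (champion, 0.5), (2 * champion, 0.0), clamped to [0, 1].
function relativeEfficiency(cost: number, championCost: number): number {
  if (championCost <= 0) return 0.5; // assumed fallback when there is no usable baseline
  const raw = 1 - cost / (2 * championCost);
  return Math.min(1, Math.max(0, raw));
}

console.log(relativeEfficiency(100, 100)); // 0.5: same cost as the champion
console.log(relativeEfficiency(0, 100));   // 1: free
console.log(relativeEfficiency(200, 100)); // 0: twice the champion's cost
console.log(relativeEfficiency(300, 100)); // 0: clamped, never negative
```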

Live-verified: rollback to a version makes it champion + clears the challenger;
champion-relative reward gives the champion a 0.5 efficiency baseline and ranks a
2× costlier challenger below it. Tests updated to the champion-ref shape + 1 new
(over-budget challenger penalized). typecheck + full suite (1223) + build green.

Co-authored-by: Louise Lau <QodeXcli@users.noreply.github.com>