feat(skills): strengthen UCB1 (composite reward, tunable c, trial floor, scores, off-switch) #27
Merged
…oor, scores, off-switch
All five requested upgrades to the Phase-4 bandit, in the pure tested core:
1. Tunable exploration factor — RouteOptions.explorationFactor (config
learning.versioning.ucbExplorationFactor; default √2). Higher c keeps a
barely-tested challenger in play longer.
2. Composite reward — UCB1's per-arm value is no longer raw success rate but a
weighted blend of success + token-efficiency + time-efficiency (default
0.7/0.15/0.15, configurable). Success dominates (a cheap failure never beats an
expensive success), but between two equally-successful versions the cheaper +
faster one wins. Added totalDurationMs to VersionStats (back-compat optional);
recordVersionExecution now takes durationMs; decideChampion converges on the
composite reward too.
3. Minimum challenger trials — RouteOptions.minChallengerTrials (default 5): a
challenger is force-routed until it clears the floor, so UCB1 never starves it
and no decision is made on too little signal.
4. UCB score history/analysis — ucbScores(manifest) returns reward + exploration
bonus + ucb per arm; `qodex skill versions <name>` now prints the breakdown
(and avg ms/run) for debugging.
5. Off-switch — routingStrategy 'champion-only' (config strategy) disables UCB and
always routes the stable version — for sensitive skills you don't want
experimented on.
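The routing rules in items 1, 3, and 5 can be sketched together. This is a minimal illustration, not the project's code: `ArmStats`, `SketchRouteOptions`, `routeSketch`, and the `'ucb1'` strategy name are hypothetical, and placing c as a multiplier on sqrt(ln N / n) is an assumption chosen so that the stated default of √2 reproduces textbook UCB1.

```typescript
// Hypothetical shapes for illustration; the real RouteOptions / VersionStats may differ.
interface ArmStats {
  executions: number; // how many times this version has run
  meanReward: number; // average reward in [0, 1]
}

interface SketchRouteOptions {
  explorationFactor?: number; // c; the PR states a default of sqrt(2)
  minChallengerTrials?: number; // the PR states a default of 5
  routingStrategy?: "ucb1" | "champion-only"; // "ucb1" is an assumed name
}

// UCB1 score: mean reward plus an exploration bonus that shrinks as the arm is tried more.
function ucb1Score(arm: ArmStats, totalExecutions: number, c: number): number {
  if (arm.executions === 0) return Number.POSITIVE_INFINITY; // untried arms go first
  return arm.meanReward + c * Math.sqrt(Math.log(totalExecutions) / arm.executions);
}

function routeSketch(
  champion: ArmStats,
  challenger: ArmStats,
  opts: SketchRouteOptions = {},
): "champion" | "challenger" {
  const c = opts.explorationFactor ?? Math.SQRT2;
  const floor = opts.minChallengerTrials ?? 5;

  // Off-switch: never experiment.
  if (opts.routingStrategy === "champion-only") return "champion";

  // Trial floor: force-route the challenger until it has enough runs.
  if (challenger.executions < floor) return "challenger";

  // Otherwise pick the arm with the higher UCB1 score (ties go to the champion).
  const total = champion.executions + challenger.executions;
  return ucb1Score(challenger, total, c) > ucb1Score(champion, total, c)
    ? "challenger"
    : "champion";
}
```

With a champion at 100 runs and 0.9 mean reward and a challenger at 6 runs and 0.6 mean reward, the default c still routes to the challenger, while a small c such as 0.1 routes to the champion. That is the "higher c keeps a barely-tested challenger in play longer" behavior.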
config: learning.versioning { ucbExplorationFactor, minChallengerTrials,
rewardWeights, strategy }, threaded through versioned-store + the CLI.
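As a rough picture of that config block, the sketch below fills in the defaults stated in this PR. Only the four top-level keys come from the text above; the interface name, the `resolveVersioningConfig` helper, and the key names inside `rewardWeights` are assumptions for illustration, and the on-disk config format is not specified here.

```typescript
// Hypothetical shape: only the four top-level keys are named in the PR.
interface VersioningConfigSketch {
  ucbExplorationFactor: number;
  minChallengerTrials: number;
  rewardWeights: { success: number; tokens: number; time: number }; // inner key names assumed
  strategy?: string; // the PR names 'champion-only'; other values are not specified
}

// Defaults as stated in the PR description.
const VERSIONING_DEFAULTS: VersioningConfigSketch = {
  ucbExplorationFactor: Math.SQRT2,
  minChallengerTrials: 5,
  rewardWeights: { success: 0.7, tokens: 0.15, time: 0.15 },
};

// Merge user overrides over the defaults, merging rewardWeights key by key.
function resolveVersioningConfig(
  user: Partial<VersioningConfigSketch> = {},
): VersioningConfigSketch {
  return {
    ...VERSIONING_DEFAULTS,
    ...user,
    rewardWeights: { ...VERSIONING_DEFAULTS.rewardWeights, ...(user.rewardWeights ?? {}) },
  };
}

// Example: pin a sensitive skill to its stable version and raise the trial floor.
const sensitiveSkillConfig = resolveVersioningConfig({
  strategy: "champion-only",
  minChallengerTrials: 10,
});
```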
Live-verified: between two 100%-success versions, the cheaper/faster one won on
composite reward and was promoted to champion. Tests: +6 (champion-only, trial
floor, exploration-factor explore/exploit, composite reward incl. success-
dominates, ucbScores snapshot) + the existing 12. typecheck + full suite (1222) +
build green.
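To make the composite reward concrete, here is a small sketch of the weighted blend with the default 0.7/0.15/0.15 weights. The function and type names are hypothetical, and the efficiency inputs are simply taken as numbers in [0, 1]; how the real code derives them from token and duration stats is not shown in this message.

```typescript
// Hypothetical names; the weights are the defaults stated in the PR.
interface BlendWeights {
  success: number;
  tokens: number;
  time: number;
}

const DEFAULT_BLEND: BlendWeights = { success: 0.7, tokens: 0.15, time: 0.15 };

// All three inputs are expected to lie in [0, 1].
function compositeReward(
  successRate: number,
  tokenEfficiency: number,
  timeEfficiency: number,
  w: BlendWeights = DEFAULT_BLEND,
): number {
  return w.success * successRate + w.tokens * tokenEfficiency + w.time * timeEfficiency;
}

// A failure with perfect efficiency can reach at most 0.15 + 0.15 ...
const cheapFailure = compositeReward(0, 1, 1);
// ... while a success with the worst efficiency still earns the full success weight.
const expensiveSuccess = compositeReward(1, 0, 0);

// Between two equally successful versions, efficiency breaks the tie.
const leanSuccess = compositeReward(1, 0.8, 0.8);
const heavySuccess = compositeReward(1, 0.3, 0.3);

console.log(cheapFailure < expensiveSuccess, leanSuccess > heavySuccess); // prints: true true
```

The "success dominates" property follows from the success weight (0.7) exceeding the two efficiency weights combined (0.3), so it holds for a single outcome under any weights with that relationship.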
QodeXcli added a commit that referenced this pull request on Jun 26, 2026
… + UCB README (#28)
Follow-ups to the UCB1 work (rewardWeights config already shipped in #27):
- Composite-reward efficiency is now normalized RELATIVE TO THE CHAMPION (the stable version is the baseline). championRef() takes the champion's per-exec tokens/ms; a version at champion cost scores the 0.5 efficiency baseline, free → 1.0, twice the cost → 0.0. So a challenger is rewarded for being cheaper/faster than the champion it's trying to unseat (not just relative to the other arm). ucbScores + decideChampion both use the champion reference.
- `qodex skill rollback <name> <version>` + rollbackToVersion(): snap a versioned skill's champion back to any earlier version (un-retires it, drops the challenger). The manual safety lever alongside the bandit's auto-convergence.
- README: a "Skill versioning & A/B testing (UCB1)" section covering flat manifest storage, the bandit, composite/champion-relative reward, the full config block, and a real `skill versions` / `skill rollback` example.
Live-verified: rollback to a version makes it champion + clears the challenger; champion-relative reward gives the champion a 0.5 efficiency baseline and ranks a 2× costlier challenger below it.
Tests updated to the champion-ref shape + 1 new (over-budget challenger penalized). typecheck + full suite (1223) + build green.
Co-authored-by: Louise Lau <QodeXcli@users.noreply.github.com>
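The champion-relative normalization in the follow-up commit can be illustrated with a linear map through the three stated points (free → 1.0, champion cost → 0.5, twice the cost → 0.0). Linearity between those points, the clamping, the neutral fallback for a missing baseline, and every name below are assumptions; the actual championRef() signature is not shown in the commit message.

```typescript
// Assumed linear map through the three points given in the commit message:
// cost 0 -> 1.0, champion cost -> 0.5, 2x champion cost -> 0.0 (clamped to [0, 1]).
function championRelativeEfficiency(costPerExec: number, championCostPerExec: number): number {
  if (championCostPerExec <= 0) return 0.5; // assumption: no usable baseline means a neutral score
  const raw = 1 - costPerExec / (2 * championCostPerExec);
  return Math.min(1, Math.max(0, raw));
}

// Hypothetical per-execution averages for one version.
interface PerExecCost {
  tokens: number;
  ms: number;
}

// Composite reward with both efficiency terms measured against the champion.
function rewardVsChampion(successRate: number, cost: PerExecCost, champion: PerExecCost): number {
  const tokenEff = championRelativeEfficiency(cost.tokens, champion.tokens);
  const timeEff = championRelativeEfficiency(cost.ms, champion.ms);
  return 0.7 * successRate + 0.15 * tokenEff + 0.15 * timeEff; // default weights from #27
}

const championCost: PerExecCost = { tokens: 1000, ms: 2000 };
const costlyChallenger: PerExecCost = { tokens: 2000, ms: 4000 }; // twice the champion's cost

// Both succeed every time, yet the 2x costlier challenger ranks below the champion.
console.log(
  rewardVsChampion(1, costlyChallenger, championCost) < rewardVsChampion(1, championCost, championCost),
); // prints: true
```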
All five requested upgrades to the Phase-4 UCB1 bandit (#24), in the pure, tested core.
1. Tunable exploration factor
`RouteOptions.explorationFactor` (config `learning.versioning.ucbExplorationFactor`, default √2). Higher `c` keeps a barely-tested challenger in play longer.
2. Composite reward (success + tokens + time)
UCB1's per-arm value is no longer raw success rate but a weighted blend: default 0.7 success / 0.15 token / 0.15 time, configurable. Success dominates (a cheap failure never beats an expensive success), but between two equally-successful versions the cheaper + faster one wins. Added `totalDurationMs` to stats (back-compat optional); `recordVersionExecution` takes `durationMs`; `decideChampion` converges on the composite reward too.
3. Minimum challenger trials
`RouteOptions.minChallengerTrials` (default 5): a challenger is force-routed until it clears the floor, so UCB1 never starves it, and no decision is made, on too little signal.
4. UCB score history / analysis
`ucbScores(manifest)` returns reward + exploration bonus + ucb per arm; `qodex skill versions <name>` now prints the breakdown (and avg ms/run) for debugging.
5. Off-switch
`routingStrategy: 'champion-only'` (config `strategy`) disables UCB entirely: always the stable version, for sensitive skills you don't want experimented on.
config:
`learning.versioning { ucbExplorationFactor, minChallengerTrials, rewardWeights, strategy }`, threaded through `versioned-store` + the CLI.
Live-verified
Between two 100%-success versions, the cheaper/faster one won on composite reward and was promoted to champion.
Tests
+6 (champion-only, trial floor, exploration-factor explore↔exploit, composite reward incl. success-dominates, ucbScores snapshot) + existing 12. ✅ typecheck · ✅ full suite (1222) · ✅ build.
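For readers who want a feel for the per-arm breakdown described under "UCB score history / analysis", a sketch follows. It is not the project's `ucbScores`: the `SketchVersion` and `ScoreRow` shapes, the `rewardSum` field, and the function name are hypothetical. Only the reported quantities (reward, exploration bonus, ucb, avg ms/run) and the optional `totalDurationMs` come from the description above.

```typescript
// Hypothetical manifest entry; only totalDurationMs (optional) is named in the PR.
interface SketchVersion {
  version: string;
  executions: number;
  rewardSum: number; // assumed: sum of per-execution composite rewards
  totalDurationMs?: number; // optional for back-compat with older stats
}

interface ScoreRow {
  version: string;
  reward: number;
  explorationBonus: number;
  ucb: number;
  avgMsPerRun: number | null; // null when no duration data was recorded
}

function ucbScoresSketch(versions: SketchVersion[], c: number = Math.SQRT2): ScoreRow[] {
  const total = versions.reduce((n, v) => n + v.executions, 0);
  return versions.map((v) => {
    const tried = v.executions > 0;
    const reward = tried ? v.rewardSum / v.executions : 0;
    // Untried arms get an infinite bonus so they are always explored first.
    const explorationBonus = tried
      ? c * Math.sqrt(Math.log(total) / v.executions)
      : Number.POSITIVE_INFINITY;
    const avgMsPerRun =
      tried && v.totalDurationMs !== undefined ? v.totalDurationMs / v.executions : null;
    return { version: v.version, reward, explorationBonus, ucb: reward + explorationBonus, avgMsPerRun };
  });
}

const scoreRows = ucbScoresSketch([
  { version: "v1", executions: 40, rewardSum: 34, totalDurationMs: 80000 },
  { version: "v2", executions: 10, rewardSum: 9 }, // older stats without duration
]);
console.log(scoreRows);
```

Separating reward from the exploration bonus in the output makes it easy to see whether an arm is being routed because it performs well or because it is still under-sampled.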