feat(skills): champion-relative reward + skill rollback + UCB README #28

Merged
QodeXcli merged 1 commit into main from feat/ucb-reward-rollback
Jun 26, 2026

Conversation

@QodeXcli

Owner

Follow-ups to the UCB1 work. Note: item #1 (rewardWeights in config) already shipped in #27 — verified learning.versioning.rewardWeights is present. This PR does the other three.

#2 — Composite reward normalized RELATIVE TO THE CHAMPION

Efficiency is now measured against the champion rather than the max across arms: the stable version is the baseline a challenger must beat. championRef() = the champion's per-exec tokens/ms. A version at champion cost scores the 0.5 efficiency baseline, a zero-cost version scores 1.0, and a version at 2× the champion's cost scores 0.0. Both ucbScores and decideChampion use the champion reference. Result: a challenger is rewarded specifically for being cheaper/faster than the champion it's challenging.
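
For reviewers who want the scoring shape at a glance, here is a minimal sketch of a champion-relative efficiency score. It is not the code in this PR: the `Cost` shape, `relativeScore`, and `efficiency` are hypothetical names, the linear mapping is just the simplest curve through the three points stated above (0× → 1.0, 1× → 0.5, 2× → 0.0), and the equal weighting of tokens and latency is an assumption.

```typescript
// Hypothetical shape; the PR names championRef() but does not show its signature.
interface Cost {
  tokensPerExec: number;
  msPerExec: number;
}

// Map a cost ratio (version / champion) onto [0, 1]:
// ratio 0 -> 1.0, ratio 1 -> 0.5, ratio >= 2 -> 0.0 (clamped).
function relativeScore(value: number, championValue: number): number {
  if (championValue <= 0) return 0.5; // assumption: no usable baseline means neutral
  const ratio = value / championValue;
  return Math.min(1, Math.max(0, 1 - ratio / 2));
}

// Assumption: tokens and latency contribute equally to efficiency.
function efficiency(version: Cost, champion: Cost): number {
  const tokens = relativeScore(version.tokensPerExec, champion.tokensPerExec);
  const latency = relativeScore(version.msPerExec, champion.msPerExec);
  return (tokens + latency) / 2;
}

const champion: Cost = { tokensPerExec: 1000, msPerExec: 400 };
const freeVersion: Cost = { tokensPerExec: 0, msPerExec: 0 };
const costlyVersion: Cost = { tokensPerExec: 2000, msPerExec: 800 };

console.log(efficiency(champion, champion));      // 0.5
console.log(efficiency(freeVersion, champion));   // 1
console.log(efficiency(costlyVersion, champion)); // 0
```

Under this sketch the champion always sits at 0.5, so any challenger scoring above 0.5 on efficiency is cheaper or faster than the version it is trying to replace.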

#3 — qodex skill rollback <name> <version>

rollbackToVersion() + CLI: snap a versioned skill's champion back to any earlier version (un-retires it, drops the challenger) — the manual safety lever alongside the bandit's auto-convergence.
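
The rollback semantics described above can be sketched as follows. This is an illustration only: the `SkillManifest` and `SkillVersion` shapes and their field names are hypothetical, and the real rollbackToVersion() signature in this repo may differ. The sketch only mirrors the stated behavior: the target is un-retired, becomes champion, the challenger is dropped, and a missing version returns false.

```typescript
// Hypothetical manifest shape; field names are illustrative, not taken from the repo.
interface SkillVersion {
  version: string;
  retired: boolean;
}

interface SkillManifest {
  champion: string;
  challenger: string | null;
  versions: SkillVersion[];
}

// Returns false and leaves the manifest untouched when the version is unknown.
function rollbackToVersion(manifest: SkillManifest, version: string): boolean {
  const target = manifest.versions.find((v) => v.version === version);
  if (!target) return false;
  target.retired = false;      // un-retire the rollback target
  manifest.champion = version; // it becomes the champion again
  manifest.challenger = null;  // drop any in-flight challenger
  return true;
}

const manifest: SkillManifest = {
  champion: "v3",
  challenger: "v4",
  versions: [
    { version: "v1", retired: true },
    { version: "v2", retired: true },
    { version: "v3", retired: false },
    { version: "v4", retired: false },
  ],
};

console.log(rollbackToVersion(manifest, "v1"));      // true
console.log(manifest.champion, manifest.challenger); // v1 null
console.log(rollbackToVersion(manifest, "v9"));      // false
```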

#4 — README UCB1 docs

A new "Skill versioning & A/B testing (UCB1)" section: flat-manifest storage, the bandit, composite/champion-relative reward, the full config block, and a real skill versions / skill rollback example.
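
As background for the bandit mentioned in that section, below is a sketch of the textbook UCB1 selection rule (mean reward plus an exploration bonus of sqrt(2 ln N / n)). It is the standard formula, not necessarily what ucbScores computes here; the `Arm` shape, `ucb1`, and `pickArm` are hypothetical names, and the pull counts and rewards are made-up illustration values.

```typescript
// Hypothetical arm record: one entry per skill version under test.
interface Arm {
  name: string;
  pulls: number;      // how many times this version has been executed
  meanReward: number; // average composite reward in [0, 1]
}

// Textbook UCB1: exploit the mean, explore arms with few pulls.
function ucb1(arm: Arm, totalPulls: number): number {
  if (arm.pulls === 0) return Infinity; // untried arms are always tried first
  return arm.meanReward + Math.sqrt((2 * Math.log(totalPulls)) / arm.pulls);
}

// Pick the arm with the highest UCB1 score (first arm wins ties).
function pickArm(arms: Arm[]): Arm {
  const totalPulls = arms.reduce((sum, arm) => sum + arm.pulls, 0);
  return arms.reduce((best, arm) =>
    ucb1(arm, totalPulls) > ucb1(best, totalPulls) ? arm : best
  );
}

const arms: Arm[] = [
  { name: "champion", pulls: 40, meanReward: 0.6 },
  { name: "challenger", pulls: 5, meanReward: 0.55 },
];

// The challenger has a slightly lower mean but far fewer pulls,
// so its exploration bonus is larger and it is selected next.
console.log(pickArm(arms).name); // challenger
```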

Live-verified

Rollback → that version becomes champion + challenger cleared (and a missing version returns false). Champion-relative reward gives the champion a 0.5 efficiency baseline and ranks a 2× costlier challenger below it.

Tests

Updated to the champion-ref shape + 1 new (over-budget challenger penalized). ✅ typecheck · ✅ full suite (1223) · ✅ build.

… + UCB README

Follow-ups to the UCB1 work (rewardWeights config already shipped in #27):

- Composite-reward efficiency is now normalized RELATIVE TO THE CHAMPION (the
  stable version is the baseline). championRef() takes the champion's per-exec
  tokens/ms; a version at champion cost scores the 0.5 efficiency baseline, free →
  1.0, twice the cost → 0.0. So a challenger is rewarded for being cheaper/faster
  than the champion it's trying to unseat (not just relative to the other arm).
  ucbScores + decideChampion both use the champion reference.
- `qodex skill rollback <name> <version>` + rollbackToVersion(): snap a versioned
  skill's champion back to any earlier version (un-retires it, drops the
  challenger). The manual safety lever alongside the bandit's auto-convergence.
- README: a "Skill versioning & A/B testing (UCB1)" section — flat manifest
  storage, the bandit, composite/champion-relative reward, the full config block,
  and a real `skill versions` / `skill rollback` example.

Live-verified: rollback to a version makes it champion + clears the challenger;
champion-relative reward gives the champion a 0.5 efficiency baseline and ranks a
2× costlier challenger below it. Tests updated to the champion-ref shape + 1 new
(over-budget challenger penalized). typecheck + full suite (1223) + build green.
@QodeXcli QodeXcli merged commit 2520925 into main Jun 26, 2026
2 checks passed
@QodeXcli QodeXcli deleted the feat/ucb-reward-rollback branch June 26, 2026 01:28