Gap
examples/eval-demo/score-baseline.txt and score-after-drift.txt are pre-committed snapshots of iec eval output (9/9 and 5/9). Nothing in CI re-runs the command and diffs against them:
- .github/workflows/ci.yml runs lint, test (pytest), and self-check (iec check). None invoke iec eval against the demo.
- tests/integration/test_eval.py and tests/test_eval.py do not reference the examples/eval-demo fixtures or the score-*.txt files.
So if the iec eval output format, the checks, or the demo fixtures (baseline/, after-drift/, eval/*/checks.yaml) change, the committed .txt snapshots silently go stale and nobody finds out from CI.
Why it matters
The book chapter content/quality/agent-evaluation.md (Agent Evaluation and Regression) prints these exact scores and named failures as proof that the eval workflow works. A repo that teaches "tests are proof, not ritual" should not ship example output that CI never re-derives. I verified today that the committed numbers still match the binary:
iec eval --path baseline --eval-dir eval # Score: 9/9 (100%)
iec eval --path after-drift --eval-dir eval # Score: 5/9 (55%)
But that check was manual and one-off.
Proposed fix
Add a CI step (or an integration test) that runs both commands from examples/eval-demo and diffs stdout against the committed score-*.txt files, failing on any difference. Either:
- a small job in ci.yml that runs the two commands and diffs captured output against the committed files, or
- a regeneration script plus git diff --exit-code examples/eval-demo/score-*.txt so the snapshots are guaranteed reproducible.
This keeps the demo (and the book chapter that cites it) honest as the CLI evolves.
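
A minimal sketch of the integration-test variant, under a few assumptions that are not confirmed by the repo: the snapshot files contain exactly the stdout of iec eval, the check runs from the repo root, and iec eval may exit non-zero on a failing score (so the return code is ignored here). The helper names snapshot_diff, run_eval, and check_demo_snapshots are hypothetical.

```python
import difflib
import subprocess
from pathlib import Path

# Assumption: the check is run from the repository root.
DEMO_DIR = Path("examples/eval-demo")

# (fixture directory passed to --path, committed snapshot file)
CASES = [
    ("baseline", "score-baseline.txt"),
    ("after-drift", "score-after-drift.txt"),
]


def snapshot_diff(expected: str, actual: str, name: str) -> str:
    """Return a unified diff, or an empty string when the texts match."""
    return "".join(
        difflib.unified_diff(
            expected.splitlines(keepends=True),
            actual.splitlines(keepends=True),
            fromfile=f"{name} (committed)",
            tofile=f"{name} (regenerated)",
        )
    )


def run_eval(fixture: str, demo_dir: Path = DEMO_DIR) -> str:
    """Run `iec eval` from the demo directory and return its stdout."""
    result = subprocess.run(
        ["iec", "eval", "--path", fixture, "--eval-dir", "eval"],
        cwd=demo_dir,
        capture_output=True,
        text=True,
        check=False,  # assumption: a failing score may exit non-zero
    )
    return result.stdout


def check_demo_snapshots(demo_dir: Path = DEMO_DIR) -> list[str]:
    """Return one diff per stale snapshot; an empty list means all match."""
    stale = []
    for fixture, snapshot in CASES:
        expected = (demo_dir / snapshot).read_text()
        diff = snapshot_diff(expected, run_eval(fixture, demo_dir), snapshot)
        if diff:
            stale.append(diff)
    return stale
```

A pytest test could then assert that check_demo_snapshots() returns an empty list and print the diffs on failure, so a stale snapshot shows exactly which lines changed. If the output turns out to contain timing or absolute paths, those would need normalizing before the comparison.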