CI does not verify examples/eval-demo scores against committed score-*.txt snapshots #1

Description

@flemming-n-larsen

Gap

examples/eval-demo/score-baseline.txt and score-after-drift.txt are pre-committed snapshots of iec eval output (9/9 and 5/9). Nothing in CI re-runs the command and diffs against them:

  • .github/workflows/ci.yml runs lint, test (pytest), and self-check (iec check). None invoke iec eval against the demo.
  • tests/integration/test_eval.py and tests/test_eval.py do not reference the examples/eval-demo fixtures or the score-*.txt files.

So if the iec eval output format, the checks, or the demo fixtures (baseline/, after-drift/, eval/*/checks.yaml) change, the committed .txt snapshots silently go stale and nobody finds out from CI.

Why it matters

The book chapter content/quality/agent-evaluation.md (Agent Evaluation and Regression) prints these exact scores and named failures as proof that the eval workflow works. A repo that teaches "tests are proof, not ritual" should not ship example output that CI never re-derives. I verified today that the committed numbers still match the binary's output:

iec eval --path baseline --eval-dir eval      # Score: 9/9 (100%)
iec eval --path after-drift --eval-dir eval   # Score: 5/9 (55%)

But that check was manual and one-off.

Proposed fix

Add a CI step (or an integration test) that runs both commands from examples/eval-demo and diffs stdout against the committed score-*.txt files, failing on any difference. Either:

  • a small job in ci.yml that runs the two commands and diffs captured output against the committed files, or
  • a regeneration script plus git diff --exit-code examples/eval-demo/score-*.txt so the snapshots are guaranteed reproducible.

This keeps the demo (and the book chapter that cites it) honest as the CLI evolves.
