Skip to content

CI stress-testing: guarantees, coverage, 3-tier CI + Windows speedup#5

Open
bentoner wants to merge 76 commits into
mainfrom
ci-stress
Open

CI stress-testing: guarantees, coverage, 3-tier CI + Windows speedup#5
bentoner wants to merge 76 commits into
mainfrom
ci-stress

Conversation

@bentoner

Copy link
Copy Markdown
Owner

Summary

Hardening pass on git-commit-lock: CI stress-testing guarantees + coverage, test-harness ergonomics, and a Windows CI speedup. No behavioural change to the lock protocol itself beyond the mtime-floor / read-ladder robustness already on main; this branch is mostly tests, CI, and the normative docs that pin what the tool guarantees.

What's in it

Guarantees & failure-mode docs (new, normative)

  • docs/guarantees.md — the normative contract (safety / recovery / interop guarantees, failure semantics, out-of-scope, the operating envelope, and a §7 verification map: each guarantee → its witnessing test).
  • docs/failure-modes.md — the failure-mode map (§A–§K) + the accepted scope-decision log.
  • docs/load-testing-strategy.md — the rationale for why CI is shaped the way it is.

Coverage (closes the reasoned-but-untested lanes)

  • Fault-injection tests: F4 unwritable lock dir (Test 48), F2/J1 failing-log path (Test 49), F1 ENOSPC (Test 50, Linux+sudo), E3 unreadable-mtime fail-safe (Test 42), socket/device-node wrong-type (Test 44).
  • kcov line-coverage gate in the nightly tier (floor tracks achieved ~0.83, ratchetable).

3-tier CI

  • tests.yml (Tier R) — required, no-load, strict, per-PR; 7-cell matrix incl. the canary as its own parallel cell per arch.
  • nightly.yml (Tier N) — background-load matrix (tests/with-load.sh, oversubscription ratio), kcov gate, idempotent issue triage (nightly-triage.sh).
  • deep-sweep.yml (Tier D) — on-demand deep flake hunt (repeat × load).

Harness ergonomics

  • GCL_TEST_ONLY single-test selector; GCL_ENVELOPE_TIER strict/relax; GCL_TEST_SWEEP Axis-A waiter-count sweep.
  • Shared tests/_harness.sh extracted (all four suites source it).

Windows CI speedup

  • The concurrency canary (Test 1, the ~½-of-the-unit-suite 8×25 mutual-exclusion check) extracted into its own file tests/git-commit-lock.canary.test.sh and run as a parallel cell → ~50% faster Windows leg, simpler than the sharding approaches it replaced.

Status

  • Local: unit 313 passed / 0 failed, canary 2 / 0 (reduced fan-out).
  • Phase-4 review converged over 7 rounds (reviewers partitioned by responsibility — tests / CI / docs; foreign-model Codex + one Claude tests reviewer); final round clean.
  • bash -n / shellcheck / actionlint clean on all edited shell + workflow files.

Merge notes

  • Git history preserved (75 commits).
  • Branch-local agent-workflow artifacts removed in the final commit (AGENTS.md flake-hunt runbook, .plans/, .agent review-queue scaffolding) — main carries only product / docs / tests / CI.

🤖 Generated with Claude Code

bentoner and others added 30 commits June 16, 2026 21:32
…atch

Stress-test-only change so many workflow_dispatch runs execute in parallel on this
single branch without cancel-in-progress killing each other. Do NOT merge to main.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Diagnosis+Codex review: windows-2025 unit flake is a timing-fragile self-validation
assertion, not a product bug. Plan replaces got97>=1 with rc-in-{0,97,98} + a WAITING-
based anti-vacuity canary; keeps the warn17d TOCTOU regression guard untouched.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…G canary)

Round-1 review (Claude + Codex, verified in code): rc-set must include 1 (lock_run
demotes a clean command to 1 on unverifiable-empty release); WAITING canary must read
per-waiter logs, not the shared churn.log (concurrent appends drop lines). Secondary
hardenings dropped. See reviewer notes at top of plan.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The got97>=1 self-validation assertion was timing-fragile: on a loaded 2-core Windows
runner all 12 waiters won create-races in the churner absent windows (rc 0) instead of
timing out, so got97=0 and the test failed though the product was correct (run
27616343269; 29 prior identical runs were green).

Replace it (diagnosis + Codex review confirmed test-flake, not a product bug):
  - per-waiter AGENT_LOCK_LOG (single-writer => drop-free; shared logs drop lines
    under concurrent appends), rebuilt into churn.log for the artifact;
  - assert all 12 waiters reach a DESIGNED terminal rc in {0,1,97,98} (rc 1 = clean
    command demoted on unverifiable-empty release) — catches real product regressions;
  - anti-vacuity: require >=1 WAITING line (proves the churn produced real contention
    and the guarded per-poll type-guard path ran);
  - unconditional note: rc distribution + WAITING count for future triage.
The warn17d==0 TOCTOU regression guard is unchanged.

Local: unit suite 214/0; shellcheck -S style (v0.11.0) + bash -n clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
STRESS-BRANCH ONLY (do not merge). tests/with-load.sh runs each suite while N CPU
spin-loops and/or N disk create/write+fsync/delete loops saturate the runner, to widen
the timing windows that latency/race flakes depend on (Test 17d absent window is driven
by both CPU descheduling and slow file IO). Selected via new workflow_dispatch inputs
stress_kind (none|cpu|disk|both, default both) and stress_load (blank=core count); empty
on push/schedule => none. Step/job timeouts raised so load slowness does not trip a
timeout and look like a flake. Hogs reaped by exact PID (never by name).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
So the process (dispatch -> on failure: subagent diagnose -> Codex review -> plan ->
review/fix rounds -> implement -> review/fix rounds -> commit -> reset streak -> resume),
the mechanics, the process-hygiene lessons, and the progress log survive context compaction.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Under CPU load the kill -9 of the pwsh holder missed the native pwsh.exe (MSYS dollar-bang
is a shim), so pwsh ran its full Start-Sleep 60 and exited gracefully — its
PowerShell.Exiting backstop DELETED the lock, so the precondition read got an empty/gone
file and the 3 steal asserts were vacuous (stole a backdate-recreated empty file).
Diagnosis + Codex agreed: test bug, product correct.

Fix (Option D): the holder now does
  if (-not (Lock-Acquire)) { Exit 3 }; write READY; [Environment]::Exit(0)
Environment.Exit bypasses BOTH release and the backstop, leaving a deterministic
token-bearing orphan with no external kill. bash drops the kill and just reaps. The
tok.ps token assertion is now genuine every run, not vacuous.

Local interop suite 141/0. Reviewed clean by fresh Claude reviewer + Codex. shellcheck
-S style + bash -n clean. Found via the CI load stress test.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…n T31a)

Third flake found by the load hunt (ubuntu, both/load=2): Test 31(a) leaked-token-memory
DISCOVERY-HOLD assertion races the external mv install vs the leaver _lock_discover; under
load the direct-discover path (sh:822) adopts the claim instead of the memory path (sh:1382)
the assertion pins. Product correct; test-orchestration race; 31(b) covers the memory path
deterministically. Diagnosed, fix pending the formal loop. Hunt at 16/50 clean, halted here.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Ben clarified: there are no credits to worry about; public repo => unlimited CI capped only
by GitHub concurrency. Corrects the earlier misleading agent-credits framing that prompted a
premature budget pause. Resume guidance: dispatch freely, keep going to 50 clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ind)

Sub-leg (a) installs a recheck-unreadable leaked claim at the lock path via an
external mv, then asserted adoption went through the per-poll leaked-token-memory
route ("DISCOVERY-HOLD (leaked-token memory)"). But the product can adopt the
claim via EITHER of two correct routes: the inline ownership-discovery read
(git-commit-lock.sh:822) if the mv lands before it, or the per-poll memory check
(git-commit-lock.sh:1382) on a later poll if it lands after. Which fires is a pure
scheduling race -- the external mv vs the leaver's inline discover one statement
after the leak-add (sh:1112 -> sh:1114). Under both/load=2 on ubuntu the mv won and
the direct route fired, so the memory-pinned assertion failed spuriously
(run 27626826865).

The product behaved correctly in both cases (token remembered, same token observed
installed, adopted, rc 0, clean release, no residue). Fix is test-only: sub-leg (a)
now accepts EITHER DISCOVERY-HOLD route and records which fired, failing only if
neither adopted the claim. No coverage is lost -- the memory route stays pinned
deterministically by sub-leg (b), and the direct route by Test 25's 7-position
discovery-position matrix.

Diagnosis converged across four independent reviews (my code-read + the verbatim
leak.log, a fresh-context Claude subagent that did not read the prior diagnosis,
and a Codex foreign-model review). Implementation reviewed clean by a fresh Claude
reviewer and by Codex. Static checks (bash -n + shellcheck -S style v0.11.0) clean;
local unit suite 207 passed / 0 failed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ckdump

Fill the real fix SHA into the Test 31(a) progress entry, update the hunt status
(clean_count reset; both/load=2 hunt resumed toward 50 clean on the fixed tree;
three flakes fixed this session). Add *.stackdump to .gitignore so the suite's
transient Cygwin crash dumps stop cluttering git status during the hunt.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A coverage audit (subagent + my own verification against the code) found the
product's two acquire read-back-verification failure lanes were asymmetrically
covered. The create-path lane (O_EXCL create wins, path reads back the wrong token,
git-commit-lock.sh:1354-1360) is covered by Test 32. Its steal-path twin "F2"
(git-commit-lock.sh:1168-1179) was NOT: the stealer wins the claim race AND wins the
rename-over (STOLE-BY-CLAIM logged, ghost destroyed), but the mandatory post-rename
read-back at :1171 reads back the wrong token, so the product must clear its claim
token and re-enter the wait loop rather than take the hold. After a STOLE-BY-CLAIM a
silent false-hold there would be a mis-attributed hold of a destroyed-ghost path, so
this is the higher-stakes twin — and nothing exercised it.

Test 32b closes the gap. It mirrors Test 32 with the INVERSE token gate: a one-shot
_lock_cur_token shadow gated on [ -n "$_LOCK_CLAIM_TOKEN" ] lands the read-back fault
at the STEAL read-back (:1171), not the create one (:1353, where the claim token is
empty). On firing it backdates the just-installed abandoned lock stale so the re-steal
is immediate (same trick as Test 32 — keeps it fast and deterministic); the second
attempt (shadow spent) reads back the real token and acquires, releasing rc 0. The
test asserts the F2-specific log line (not the shared "acquire verification FAILED"
prefix), STOLE-BY-CLAIM x2, the WARNING preceding the eventual ACQUIRED (no
false-hold), and no leftovers. The stale closing NOTE that called the read-back lanes
"not suite-covered" is corrected (create by Test 32, F2 by Test 32b).

Product code is unchanged; F2 reads correct today — this is a missing-test
(regression exposure), not a present bug.

Diagnosis from a coverage-audit subagent, verified by me against the code. Test-only;
no product change. Static checks clean; local suite 0 failed; Test 32b verified to
exercise the F2 lane (standalone + full suite). Implementation reviewed clean by a
fresh Claude reviewer and by Codex.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A decision-support document classifying every failure mode into three
robustness tiers (correctness / best-effort-in-envelope / out-of-scope),
each grounded in product code + tests with file:line citations, with a
recommendation on whether it should be an in-scope guarantee.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ck dir; add E3/H4

Address findings from a foreign-model (Codex) + fresh-Claude review of the
failure-modes map, each verified against the code:

- Core-guarantee precision (the doc's central thesis): the unconditional safety
  property is "no silent lost update, given cooperative wrapper unwind", NOT
  unconditional mutual exclusion. Strict mutual exclusion holds only within the
  staleness window; beyond it the lease is fail-open-but-detectable. Split Tier 1
  into safety (unconditional) vs recovery (lock-shaped orphans only, under a
  readable-clock / supported-FS envelope; foreign objects at the path are
  deliberately never auto-removed).
- Add H4: hard kill (SIGKILL) or a wrapped command's [Environment]::Exit() while
  holding bypasses release-time detection -> the explicit boundary of the
  no-silent-loss guarantee.
- Add E3: mtime probe entirely unreadable -> staleness detection disabled; fails
  SAFE (never steals a lock whose age it cannot establish), recovery lost, loudly
  announced once per process (both ports).
- Fix E2 clock-jump direction (age = now - mtime: a FORWARD jump makes a live lock
  look stale -> premature steal -> detected-98; a BACKWARD jump delays recovery).
- D1: separate the atomic-overwrite engines (mv -T / 3-arg File.Move) from the
  non-atomic Windows PowerShell 5.1 unlink-then-Move fallback (claim-guarded;
  fairness loss, never a clobber).
- Note the leaked-token memory is process-local (ties the "no unowned lock"
  framing to residual 5); correct the README-location claim for the mixed-version
  note (it is in the design doc only); minor citation fixes (README quote/line,
  Test 22a over-attribution).

Reviewers confirmed the central thesis (correctness load-independent; only latency
degrades) holds against every interleaving attacked on a local FS.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ypass class

Address the round-2 reviews (Codex + fresh Claude), verified against the code:

- Codex (blocking): §H4's bypass list was incomplete for bash. lock_run runs the
  wrapped command in the wrapper shell itself (git-commit-lock.sh:1733), so a
  wrapped `exec` replaces that shell and skips BOTH lock_release and the EXIT trap
  — the same silent-loss boundary as the pwsh [Environment]::Exit() case.
  Generalize §H4 and the §1 bullet from an enumeration to the class
  "termination/replacement without wrapper unwind" (external SIGKILL / bash exec /
  [Environment]::Exit()), and add the contrast that a plain `exit` is safe
  (it unwinds: bash EXIT trap, pwsh finally).
- Claude (nit): §H4 had attributed try/finally to the bash run path; corrected to
  bash EXIT trap vs pwsh try/finally.

Both rounds confirmed the central thesis holds (no two believed-legitimate holders;
no UNdetected lost update on a local FS within the envelope) and that round 1's
revisions 1-7 are factually correct and internally consistent.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Applied Ben's review comments (frozen in 534a007) to the failure-modes
scope-decisions doc:
- §4 status: the recommendations are reviewed and accepted, except two overrides.
- Network FS (§4.3 / §E1): document-only — surface the boundary in README, do NOT
  build the FS-type startup probe ("don't do the polish, just document").
- Untested-but-robust lanes (§4.5 / F1-F4, J1): OVERRIDE the prior "accept untested"
  -> add test coverage. Rationale (Ben): actually-tested edge cases make the project
  easier to maintain and give future users confidence vs reasoned-correct-but-untested.
  Propagated to the per-mode F1/F2/F3/F4/J1 entries and their summary-table rows.
Disposition check passed (fresh verifier): every comment dispositioned, propagation
consistent, no leaked comment text, §4 numbering coherent.

comment-commit: 534a007

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…; max-parallel/paths-ignore caveats; billing vs scarcity; FUSE/SIP hedges; reconcile non-Linux disk cells)
Phase 1a (Bucket 1 / D-a) — docs/guarantees.md: the normative contract
derived from failure-modes.md. Operating envelope (E1-E7), guarantees
(safety G-S1..5, recovery G-R1..4, interop G-I1), failure semantics,
best-effort tier (BE-1..5), out-of-scope (OOS-1..7 incl. the non-unwinding-
exit no-silent-loss boundary), operating rules, and a verification map.

Phase 1c (Bucket 7) — docs/steering-coverage.md: the deterministic-steering
coverage audit + prioritized gap list, synthesized from two manual window
audits and an objective kcov pass (83.1% line coverage, 451/543; ~30 lines
platform-gated, ~62 Linux-reachable). kcov corrected three manual over-
credits (step-3.3 CLAIM-ABORT, the foreign claim-recheck branch, the EXIT-
trap no-hold twin). Gaps ranked: Tier A portable steering (A1 rename-refused
wrong-type-mid-steal = headline; A2 step-3.3 abort lane; A4 the exec/H4
boundary), Tier B fault-injection (failure-modes 4.5), Tier C platform-only,
Tier D document-not-test.

Both reviewed to convergence: a fresh-context Claude reviewer plus two Codex
rounds. The foreign Codex check plus a 4-line empirical test corrected the
exec-bypass characterization across all three docs: "run -- bash -c 'exec'"
does NOT skip release (the child shell is replaced, the wrapper releases
normally); only an exec in the lock-holding shell itself (a sourced
lock_acquire+exec, or "run -- exec") bypasses. Propagated to guarantees.md
OOS-5, steering-coverage.md A4, and the failure-modes.md H4 precision fix in
this commit.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Concrete build plan derived from the Phase 1 outputs (guarantees.md,
steering-coverage.md) + the accepted failure-modes §4 and load-strategy §9
decisions. Sections:
- 2A Tier-A steering tests (A1-A11; per-test mechanism / assertion / platform /
  priority — the audit gave each steering technique).
- 2B Tier-B fault-injection (F4 + F2/J1 first cut; F1 gated-or-doc, F3 doc-only)
  — each injection empirically prototyped; refines the original D-b (F3 was not
  deterministically injectable; ulimit -f is a SIGXFSZ trap).
- 3 doc edits (exact text: envelope + single-clock in the design doc; network-FS
  boundary + upgrade-both in the README).
- 4 GCL_ENVELOPE_TIER=relax mechanism + the 3 downgrade sites (D-c).
- 6 three-tier CI (Required / Nightly / Deep), event-conditional concurrency
  (keeps the deep-sweep group off the required gate), kcov coverage job, nightly
  auto-triage, paths-ignore-on-required fix, and a refined do-not-merge
  disposition (with-load.sh graduates calibrated).
- 8 TAP + 1..N + the silent-undercount sentinel fix, GCL_TEST_ONLY selector
  (integration excluded by design), tests/_harness.sh extraction.

Bucket 2B/6/8 designs feasibility-validated by parallel agents (prototypes in
.agent-testing/, gitignored). Awaiting Ben's Phase 2 gate.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…to-gate

CI (Codex): split into THREE workflows — tests.yml (required) + a stable
tests-passed aggregator as the ONLY required context; nightly.yml; deep-sweep.yml
with distinct job names. This fixes the workflow_dispatch-publishes-check-contexts
gating risk AND the paths-ignore-on-required gotcha, and drops the event-conditional
concurrency expression. Made ok_envelope/bad_envelope TAP-aware (Bucket 8 item 1
lands first, so TAPN/GCL_TAP exist). Added a GCL_TEST_ONLY zero-match guard.

Tests (Claude): A6 must shadow the INNER _lock_stat_mtime (:606), NOT _lock_path_mtime
(:639-643), which is the function that emits the warn-once the test asserts — verified
against the code. Test 22a downgrade refined to only the warning-fired-at-all
assertion (keep the warn-once dedup n<=1 and names-type strict). A reviewer's Test-22a
line numbers were a mislocation — the plan mapping (:1167/:1168/:1170) is verified
correct against the tree. F3 reclassified document-only supersedes steering-coverage
B4's "portable POSIX" rating (ulimit -n can't fail the ~1-FD create without starving
bash startup) — steering-coverage.md B4 corrected to match.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
All three suites: ok/bad now emit TAP13 (`ok`/`not ok N - desc`) gated by
GCL_TAP=1 (dev runs byte-unchanged); a trailing `1..N` plan line lets a consumer
fail on a short count; and a DONE sentinel + a finish() EXIT-trap wrapper turn any
early exit/crash into a loud `Bail out!` + exit 1 — closing the silent-undercount
gap (a stray `exit 0` after a recorded FAIL no longer reports green). A bare trap
`return` is ignored by bash, so the guard uses an explicit `exit 1`.

Validated: unit suite REDUCED + GCL_TAP=1 -> 220/220, the `1..220` plan line
matches the assertion count, exit 0, sentinel does not false-fire. interop +
integration syntax-checked here; full runs verify via CI.

Phase 3, step 1 of the Phase 2 build plan (Bucket 8 item 1).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…grade-both)

docs/git-commit-lock.md: add the operating-envelope statement (correctness is
load-independent; the wall-clock bounds are best-effort and scale with scheduling)
and the single-time-source note (a local clock jump is correctness-safe, degrading
to the detected exit-98 lane), both beside the mtime-clock caveat; cross-ref
guarantees.md.

README.md: surface the network/sync-FS boundary in "How it works" (exclusion may
silently fail off a local FS), and add the "upgrade both implementations together"
deployment note (a mixed-version tree degrades prevention to detection).

C-misc (the optional case-insensitive-FS / mixed-version one-liners) skipped as
low-value — the design doc already covers mixed-version.

Plan changelog updated: Phase 3 step 1 done; the Bucket 8 item-2 selector is
deferred to bundle with item 3 (revised phasing recorded).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add TAP-aware ok_envelope/bad_envelope to the unit suite: default 'strict' is
identical to ok/bad; GCL_ENVELOPE_TIER=relax downgrades an envelope FAIL to a WARN
that never reds the run (ENV_WARN counted + reported in the summary). Reassign the
three load-sensitive wall-clock/poll-count assertions to the envelope tier, keeping
every neighbouring correctness assertion strict:
- Test 21: recovery <=20s
- Test 22a: "warning fired at all" (n>=1) -> envelope; the warn-once dedup (n<=1)
  and the type-naming stay STRICT (names-type guarded on a warning having fired)
- Test 29: >=2 CLAIM lines

Validated: strict (default) -> 220/220, 0 envelope warnings, 1..220 consistent, the
3 sites report PASS[env]. Downgrade logic checked deterministically (relax->WARN,
strict->FAIL). No product (git-commit-lock.sh) change.

Phase 3, step 3 (Bucket 4). Required CI will set strict; nightly/deep set relax.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
bentoner and others added 30 commits June 18, 2026 06:24
Subplan for splitting the windows-2025 unit leg (the ~2x CI bottleneck) into two
parallel shards via a GCL_TEST_SHARD=i/n round-robin gate in section(). Records
the mechanism, the by-construction partition guarantee + a self-contained
per-shard expected-count guard, alternatives rejected, CI wiring (windows-unit
only; not kcov/nightly/interop), edge cases, phasing, and a logging design.
Awaiting review convergence + Ben's go.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Round 1 = 2 fresh Claude reviewers + independent Codex. Two blocking defects in
the original snippet + two Codex caught the Claude pair missed, all folded:
- malformed input (empty component / leading-zero octal trap) now rejected by a
  single regex ^([1-9][0-9]*)/([1-9][0-9]*)$;
- GCL_TEST_ONLY and GCL_TEST_SHARD made MUTUALLY EXCLUSIVE (simpler than AND +
  guard-fallback; no real use case combines them);
- GCL_TEST_SHARD parsed LAZILY (first section() call) so the integration suite
  (no section() blocks) never parses/bails — it note-and-ignores;
- union proof + per-shard logging use run-only PASS/FAIL signals, not the
  == Test N == headers (which section() prints before gating, so skipped too);
- guard asserts expected>=1 (catches n>section-count) and its rationale reworded
  (catches a section()-coverage regression, not a correlated modulo bug).
Confirm round (fresh reviewer + Codex) pending before declaring converged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Round 2 = fresh Claude (sound) + independent Codex (2 accuracy defects). Folded:
- union-proof run-line set corrected to ^(PASS:|FAIL:|PASS[env]:|WARN[env-relaxed]:)
  (ok_envelope/bad_envelope emit the bracketed forms; a bare PASS:/FAIL: grep
  undercounts the 315) — or use GCL_TAP=1;
- the per-shard guard reframed HONESTLY: it does NOT catch a test added outside a
  section() gate (that bumps neither counter); its value is the empty-shard
  (expected>=1) check + a cheap modulo cross-check. The union proof's no-duplicate
  check is what catches an ungated test (runs in both shards);
- explicit selector_report shard-guard snippet (gated on GCL_TEST_SHARD set);
- RAN: attribution marker gated on GCL_TEST_SHARD set (unsharded stays identical);
- kcov-interaction note (Ben asked): coverage job runs the full suite UNSHARDED;
  sharding code is inert when GCL_TEST_SHARD unset -> no interaction.
Mechanism verified sound by 4 reviewers across 2 rounds; final Codex spot-confirm running.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…implement)

Final independent Codex spot-confirm returned no findings. Mechanism verified
sound across 3 review rounds (Claude x3 + Codex x3). Status -> converged; ready
for implementation on Ben's go.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Implements the converged windows-unit-shard subplan (Phase 1 — mechanism). Opt-in
GCL_TEST_SHARD=<i>/<n> splits a section()-gated suite into n round-robin shards by
file-order section index:
- tests/_harness.sh: lazy _shard_init (parsed on first section() call, so suites
  that never call section() neither parse nor bail) — mutually exclusive with
  GCL_TEST_ONLY, validated by ^([1-9][0-9]*)/([1-9][0-9]*)$ (rejects empty /
  non-digit / leading-zero octal trap / extra slash), then i<=n. section() bumps
  SECTION_IDX for every test before gating (stable shard key), adds the residue
  gate, and emits a run-only RAN: marker ONLY in shard mode. selector_report gains
  a gated guard: recompute expected from the same residue mapping, log one
  greppable verdict line, bail if SECTIONS_RUN != expected or expected < 1 (catches
  an empty shard, e.g. n>section-count). Unset/empty => no-op, so unsharded runs
  are byte-identical.
- integration suite note-and-ignores GCL_TEST_SHARD (one indivisible scenario, no
  section() blocks).

Partition is guaranteed by construction (round-robin over one stable ordering) +
a one-time local union proof. Validated locally: unsharded unit 315/0 + interop
141/0 byte-identical; shard 1/2 + 2/2 disjoint, union == unsharded (no dup),
148+167=315 run-lines / 29+28=57 sections; all malformed inputs bail; integration
prints the note + runs all 12; shellcheck -S style clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The windows-2025 unit leg is the CI wall-clock bottleneck (~2x the others;
process-spawn overhead on the 2-core runner). Split its one matrix cell into two
(shard 1/2) running in parallel to ~halve that leg:
- Unit-suite step sets GCL_TEST_SHARD: matrix.shard && '{0}/2' || '' — only the two
  windows-unit cells carry matrix.shard, so ubuntu/macos (leg:all) and windows
  interop-integration get '' and run the FULL unit suite unchanged.
- artifact name + job-name template gain a shard suffix (v4+ rejects duplicate
  artifact names). No other cell / SHA-pin / step touched.
Coverage (kcov, nightly, Linux) stays whole + unsharded — orthogonal.
actionlint clean; cross-platform CI verification follows.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…red)

CI run 27723744798 all green, both shards pass. Overall CI 360s->242s (~33%);
windows-unit no longer the 2x outlier. But round-robin balance was 242 vs 99
(~2.4x), not the planned ~10% — the estimate used reduced-mode timings while CI
runs full mode (the 8x25 canary clusters in shard 1). Recorded the data + the
lesson (estimate balance from the run mode) + the decision rationale: accept
as-is (macos 210s is the floor, so re-balancing gains only ~32s for a cost-table
the plan rejected). Mechanism is correct + green regardless.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Round-robin balanced poorly (242 vs 99) because ONE test dominates: Test 1 (the
8x25 FULL-width canary) is ~151s, ~half the 309s suite; the other 56 tests sum to
~158s (measured from run 27723744798's job-log timestamps). So the balanced fixed
split is simply Test 1 alone on one shard, the rest on the other (~151 vs ~158).
Windows-unit -> ~167s, so macOS's 210s becomes the overall floor (242->210).

Replaces the round-robin assignment with a static _shard_of() in _harness.sh
(case "Test 1:"* -> shard 1, else shard 2) — a fixed split derived once from a
measurement, NOT a maintained cost table. New tests default to shard 2; the
per-shard wall-clock log surfaces drift for occasional re-tune. n stays 2 (Test
1's 151s is an irreducible floor). Ben endorsed the Test-1-vs-rest split.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…rsede sharding

After unwinding the GCL_TEST_SHARD sharding (reverts 89de803 + 143e280), plan the
simpler approach: move Test 1 (the ~151s full-width concurrency canary, ~half the
Windows unit-suite runtime) into tests/git-commit-lock.canary.test.sh, so it runs
as a naturally-parallel CI job — zero sharding machinery, same wall-clock win.

The canary runs as its own cell on ALL arches (Ben's call: uniform, cheap extra
POSIX job). New 7-cell matrix (canary leg on ubuntu/macos/windows). Test 1 moves
verbatim (sources _harness.sh; tiny preamble: LIB, WORK/cleanup/trap, T1_* knobs,
INCR). Coverage-safe (union across cells == the original 57; counts reconcile).

Marked the two shard plans (windows-unit-shard, shard-balance) SUPERSEDED.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Codex (independent single review) caught two gaps, both folded:
- ENV_WARN under set -u: the unit RESULT line expands $ENV_WARN (defined in the
  envelope section we're not copying) -> the canary file must define ENV_WARN=0.
- The other workflow callers: nightly.yml (stress cells + the kcov coverage job)
  and deep-sweep.yml reach Test 1 only via git-commit-lock.test.sh; after
  extraction they must also run the new canary file (preserve canary-under-load
  stress + the canary's kcov coverage contribution, ~3pp from the 0.80 floor).
  Principle: the canary is a 4th suite file — every suite-runner must include it.
Updated phasing to rewire all three workflows + the kcov merged-coverage run.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Test 1 (the full-width 25x8 concurrency canary) is ~half the Windows unit-suite
wall-clock and is a different KIND of test (a statistical concurrency canary;
repetition-at-width is its coverage) from the targeted unit/steering tests. Move
it verbatim into tests/git-commit-lock.canary.test.sh so it can run as a
naturally-parallel CI job — replacing the (now-reverted) GCL_TEST_SHARD sharding
with a far simpler file split.

- New tests/git-commit-lock.canary.test.sh: sources _harness.sh; minimal preamble
  (LIB, the GCL_TEST_FULL->T1_* width knobs, WORK/cleanup/trap finish, INCR), the
  Test 1 block verbatim, standard RESULT/TAP tail. Defines ENV_WARN=0 (the RESULT
  line expands it but it lives in the envelope tier we don't copy).
- tests/git-commit-lock.test.sh: Test 1 block + the now-orphaned INCR removed
  (no other unit test uses INCR); count self-adjusts (TAPN is dynamic).
- _harness.sh: "three suites" -> "four".

Validated: canary standalone 2/0; unit-minus-canary 313/0; 313+2=315 (the
pre-extraction count) with disjoint, no-duplicate, union==original; interop
141/0, integration 12/0; shellcheck -S style clean (incl the new file).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ly/deep-sweep/kcov

The canary is now a 4th suite file; every workflow that ran the suites must run it.
- tests.yml: a `canary` cell per arch (ubuntu/macos/windows) running
  canary.test.sh in parallel (the per-PR wall-clock win — Windows unit ~360s ->
  ~174s, overall macOS-gated). Canary step gates on `leg == 'canary'` only, so
  `leg: all` still runs unit+interop+integration but NOT canary (no double-run).
  Canary file added to the shellcheck lint list.
- nightly.yml: canary-under-load step in the stress cells; the kcov job runs the
  unit suite AND the canary under kcov into one merged --include-path outdir, so
  the canary's coverage of git-commit-lock.sh is preserved (the 0.80 floor can't
  regress from the split). Canary kcov log uploaded.
- deep-sweep.yml: canary step with the same repeat-loop / PIPESTATUS fail-fast /
  under-load wrapping as the unit suite.

actionlint clean on all three workflows.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…than sharding)

CI run 27728088150 all 8 jobs green. Overall CI 360s -> ~181s (~50%), gated by
the windows unit-minus-canary cell (181s) with the windows canary (165s)
balanced beside it; macOS 194->167s (canary peeled into a cheap 33s cell). Beats
the sharding (242s, imbalanced) and is far simpler — zero GCL_TEST_SHARD
machinery; the canary is just its own suite file.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The nightly kcov job runs `kcov … tests/git-commit-lock.canary.test.sh` directly
(not via `bash`), exactly as it does the unit suite — which works because the
unit suite is mode 100755. The canary was added 100644, so the direct kcov exec
would fail permission-denied. Set it executable to match the unit suite (the
precedent for a directly-invoked suite). Caught by Codex's implementation review.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ov run

Round-2 review fixes (Codex + Claude, independent) for the canary split's nightly
kcov job:
- Floor-parse picked the cobertura with the highest lines-VALID. Since the unit
  + canary runs into one outdir produce per-suite reports plus a merged union at
  coverage/kcov-out/kcov-merged/cobertura.xml — ALL covering the same source, so
  all with identical lines-valid — the tie-break could grab a single-suite report
  and measure partial coverage (spurious floor failure). Select by highest
  lines-COVERED instead (the merged union has the most), which robustly picks the
  union. Comment corrected (kcov writes no top-level cobertura.xml).
- The kcov-run step lacked pipefail (the kcov job has no shell:bash default), so a
  failing `kcov ... | tee` returned tee's 0. Added `set -euo pipefail`.
- Dropped an inaccurate "+ sweep" from the canary step comment (the canary reads
  only GCL_TEST_FULL, not the Axis-A sweep).

The kcov job dispatches only from the default branch, so this path is exercised
post-merge; verified statically against the kcov v43 output layout.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ribution)

Round-2 review caught the canary extraction didn't update the user-facing docs:
- docs/git-commit-lock.md: add the canary row to the suite table + run-commands;
  "three"->"four" (temp dirs / CI / log-copy); and move the "mutual exclusion
  under many concurrent workers" description from the unit suite to a new canary
  paragraph (only Test 1 moved — crash-recovery/claim-contention stay in the unit
  suite and remain in its list). Also note the canary's counts are exact.
- README.md: "Three suites" -> "Four suites" (+ the bash concurrency canary).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ash-recovery)

Round-3 review catch: the canary suite TABLE ROW + RUN-COMMAND comment claimed
the canary also covers "crash-recovery / claim-serialization", but those (unit
Tests 2b/20) stayed in the unit suite — the canary file contains only Test 1
(mutual exclusion over repeated rounds + its mtime-probe-warning guard). Align
the table row + run comment with the (already-correct) detailed paragraph.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…aim the canary

Round-4 review catches (all split-introduced, swept the whole class):
- tests/git-commit-lock.test.sh: Test 1 moved to the canary file, but the suite
  still computed orphaned T1_ROUNDS/T1_N and its `fan-out mode:` echo + header
  advertised "the full-strength 8x25 canary". Removed the dead T1_* vars; the
  mode-echo now reports the suite's actual heavy fan-out (Test 2b rounds, Test 20
  workers); header/echo say "full-strength fan-out", not "canary".
- deep-sweep.yml: "Mirrors the tests.yml 4-cell set" -> tests.yml now has canary
  cells; reworded to "per-OS legs" + noted the canary runs as a step here.
- nightly.yml: clarified the leg comment (all/unit/interop-integration as in
  tests.yml; the canary is a step here, not a leg).

bash -n + shellcheck + actionlint clean; mode-echo verified.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Round-5 review catches (comment-only, no behaviour change):
- nightly.yml kcov-run comment said kcov "merges into the top-level cobertura.xml"
  — contradicts the round-2-corrected enforcement comment. kcov writes the union
  to coverage/kcov-out/kcov-merged/cobertura.xml (no top-level file); the
  enforcement step selects it by highest lines-covered. Comment made consistent.
- deep-sweep.yml loop comment claimed "set -e is NOT in effect (default bash here)"
  — but deep-sweep sets shell: bash (-eo pipefail), so set -e + pipefail ARE on; a
  failing pipeline already trips the step. Reworded: the explicit PIPESTATUS check
  is a defensive backstop that names the failing iteration. (Pre-existing Bucket-6d
  wording the canary loop shares; behaviour was always correct.)

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…; Test 42 xref)

Comprehensive sweep of all touched files for stale post-split references (one had
surfaced per review round); fixed the two genuine remainders (comment-only):
- nightly.yml kcov section header said "unit suite at FULL" — the job now runs
  unit + canary under kcov (the step name already says so). -> "unit + canary at FULL".
- tests/git-commit-lock.test.sh Test 42 comment cross-referenced "Test 1" (which a
  reader would seek in the unit suite) — Test 1 moved to the canary file. Clarified:
  "the concurrency canary (formerly Test 1, now git-commit-lock.canary.test.sh)".

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
From the Phase-4 area reviews (A unit+canary, C CI):
- A1: Test 42 `grep -c … || echo 0` appended a 2nd "0" on zero matches ("0\n0"),
  garbling the warn-once assertion — drop `|| echo 0` (grep -c prints 0 itself;
  `${:-0}` covers the missing-file case). Only bit the zero-firings case (not a
  false-green), now a clean integer. Suite still 313/0.
- B8: reflow an awkward comment wrap in tests/_harness.sh (cosmetic).
- C1: nightly kcov job inherited the workflow-level GCL_ENVELOPE_TIER=relax despite
  its "strict" intent — set GCL_ENVELOPE_TIER=strict on the kcov step (a true clean
  coverage run; the floor enforces the envelope assertions).
- C2: nightly-triage.sh classified ANY job failure as `nightly-correctness`, so a
  step TIMEOUT (no ^FAIL: line) misclassified as correctness — correctness now
  requires a real ^FAIL:; failure/timeout/cancel without one falls to `nightly-infra`.
- C3: add tests/with-load.sh to the required tests.yml shellcheck gate (nightly/
  deep-sweep depend on it; it was unlinted per-PR).
- Drop a stale `steering-coverage.md` reference in a Test 46 comment.

shellcheck -S style + actionlint clean; unit suite 313/0; triage dry-run confirms
timeout→infra.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
From the Phase-4 docs review (D) + Ben's merge-scope decision:
- D1: guarantees.md still said "coverage planned" for BE-3/BE-4/G-S2 resource lanes
  that are now TESTED + passing — flip to cite Tests 42/48/49/50 (F3 stays
  document-only), reconciling with failure-modes.md §2 (flipped earlier in 309cf39).
- D2: re-attribute the concurrency canary from "unit Test 1" to
  tests/git-commit-lock.canary.test.sh (the branch's central change) in guarantees.md
  (G-S3) + failure-modes.md (A1); add a `C:` witness legend. Crash-recovery/
  claim-contention stay unit Tests 2b/20.
- guarantees.md OOS-5: the bash `exec` §H4 lane is no longer "a coverage gap" — cite
  unit Test 40 (the exec-bypass boundary test); drop the steering-coverage.md ref.
- load-testing-strategy.md: rewritten from a stale "RECOMMENDATION / not-built-yet"
  plan into a present-tense rationale for why the CI is shaped as it is (Ben's call:
  must describe current state, not read as a stale plan). 346 -> 236 lines.
- steering-coverage.md: DELETED. It was a point-in-time gap list whose every gap is
  filled (Tests 37-50) + had a broken .plans/ link — a finished working artifact, not
  main-worthy (Ben's decision). The kcov baseline lives in the nightly kcov gate.
- Fold in 2 parallel stale README-notes in guarantees/failure-modes (the notes the
  README now carries).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…er disclaimer

Round-2 re-review (tests + CI clean; docs Codex found 3):
- deep-sweep.yml referenced docs/load-testing-strategy.md "§9", a section the
  rewrite removed (Tier D is now under §3) — point to the doc generically.
- guarantees.md lacked the line-number disclaimer its sibling failure-modes.md
  carries; the branch's interop/unit reformatting drifted some `file:line` anchors
  (the test NAMES/NUMBERS stay correct). Added the "anchors, not exact addresses"
  disclaimer rather than re-derive every anchor (a transient, re-drifting fix the
  disclaimer exists to avoid; reviewer D round-1 judged the drift non-blocking).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…es.md

Round-3 docs re-review (Codex) found the E2 row + §4 items 1-2 still framing the
load/timing envelope and single-clock docs as future recommendations, though both
are implemented (git-commit-lock.md "operating envelope" / "One time source";
load-testing-strategy.md §1). Fixed the whole class so the doc describes current
state — verified each target exists before de-flagging:

- E2 + K1 table rows: drop "not addressed in docs" / "UNDER-documented" /
  "Define the envelope" -> state it's documented.
- §E2 detail and §K "Net K": "Recommend: document X" -> present-tense
  "documented (where)".
- §4 items 1/2/3: add *Status (done):* markers (matching item 5's style) and
  neutralize the two imperative headings.
- §B1 latency-bound, §E3, §H4 "Recommend: document" -> "documented (where)"
  (targets verified: guarantees.md BE-3 for E3, OOS-5/G-S1 for H4, §K for B1).

Tests + CI legs were clean in round 2 and are untouched here (docs-only round).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…tion anchors

Round-4 docs re-review (Codex) flagged a cross-repo ref and stale plan vocabulary;
swept the whole class so the merged tree is self-contained (.plans/ + AGENTS.md are
deleted at merge, so any reference to their content would dangle on main):

- failure-modes.md: drop the cross-repo `agents/600-claude.md` citation; the
  "§4.5"/"§4.1" anchors meant "§4 item 5"/"§4 item 1" (no such subsections exist) —
  fixed across the doc.
- guarantees.md: same "§4.5"/"§4.1" fix; drop "Bucket 4 / D-c" plan vocabulary.
- deep-sweep.yml / nightly-triage.sh / nightly.yml: drop "Phase-2 build plan",
  "Bucket 6 spec/design", "Bucket-2 Tier-A" — the decisions they cited are stated
  inline.
- tests (test.sh / interop.test.sh / _harness.sh): "Bucket 4 / D-c" -> "see
  failure-modes.md §K / §4 item 1"; "Bucket 6" -> "see load-testing-strategy.md";
  "failure-modes.md §4.5" -> "§4 item 5". Comment-only; bash -n clean.

Validated: actionlint (deep-sweep, nightly), shellcheck (nightly-triage), bash -n
(all edited shell files).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… triage header

Round-5 docs re-review (Codex), 4 findings, all verified against the code:
1. G-S4 / D3 claim socket+device coverage but their witness lists cited only Tests
   17/17d/18/22 — added Test 44 (the socket & device-node arms of the wrong-type
   classifier) to the G-S4 prose witness, the §7 verification-map row, and the D3
   table row + detail.
2. load-testing-strategy.md cited `guarantees.md §E`, which does not exist
   (guarantees.md is §1-§7); the correctness-rests-on-structure point is §2A. Fixed.
3. failure-modes.md "Sources of truth" listed only three test suites — added the
   canary suite (the doc later uses `C Test 1` as a witness).
4. nightly-triage.sh header was stale: it claimed a CONCLUSIONS JSON file (the code
   reads per-cell cell-conclusion.txt) and that a `failure` conclusion is
   correctness (the code classifies failure-without-^FAIL: as infra). Header now
   matches the implementation.

Validated: shellcheck + bash -n (nightly-triage.sh). Docs are markdown.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Round-6 docs re-review (Codex): all named docs clean; one finding — the GENERATED
correctness-issue body still said "a FAIL: assertion and/or a cell job concluded
failure", contradicting the classifier (a failed/timed-out cell WITHOUT ^FAIL: is
infra, not correctness). The correctness body now cites only a ^FAIL: line; the
infra body's list is completed to name the failure-without-FAIL case (same class,
fixed pre-emptively).

Validated: shellcheck + bash -n.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…on main)

ci-stress was the CI flakiness stress-hunt campaign; its agent-workflow scaffolding
is branch-local, not part of the product:
- AGENTS.md — the flake-hunt runbook for this campaign.
- .plans/ (9 files) — the zformal plan + subplans (full history preserved in git).
- .agent/.gitkeep + the /.agent/* .gitignore entries — the per-worktree
  comment-review queue scaffolding.

Kept: the product (git-commit-lock.sh/.ps1), install.sh, README, docs/, the four
test suites + harness/load wrapper, the CI workflows, and the .agent-testing/
ignore (general testing convention).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant