CI stress-testing: guarantees, coverage, 3-tier CI + Windows speedup by bentoner · Pull Request #5 · bentoner/git-commit-lock

bentoner · 2026-06-18T05:14:56Z

Summary

Hardening pass on git-commit-lock: CI stress-testing guarantees + coverage, test-harness ergonomics, and a Windows CI speedup. No behavioural change to the lock protocol itself beyond the mtime-floor / read-ladder robustness already on main; this branch is mostly tests, CI, and the normative docs that pin what the tool guarantees.

What's in it

Guarantees & failure-mode docs (new, normative)

docs/guarantees.md — the normative contract (safety / recovery / interop guarantees, failure semantics, out-of-scope, the operating envelope, and a §7 verification map: each guarantee → its witnessing test).
docs/failure-modes.md — the failure-mode map (§A–§K) + the accepted scope-decision log.
docs/load-testing-strategy.md — the rationale for why CI is shaped the way it is.

Coverage (closes the reasoned-but-untested lanes)

Fault-injection tests: F4 unwritable lock dir (Test 48), F2/J1 failing-log path (Test 49), F1 ENOSPC (Test 50, Linux+sudo), E3 unreadable-mtime fail-safe (Test 42), socket/device-node wrong-type (Test 44).
kcov line-coverage gate in the nightly tier (floor tracks achieved ~0.83, ratchetable).

3-tier CI

tests.yml (Tier R) — required, no-load, strict, per-PR; 7-cell matrix incl. the canary as its own parallel cell per arch.
nightly.yml (Tier N) — background-load matrix (tests/with-load.sh, oversubscription ratio), kcov gate, idempotent issue triage (nightly-triage.sh).
deep-sweep.yml (Tier D) — on-demand deep flake hunt (repeat × load).

Harness ergonomics

GCL_TEST_ONLY single-test selector; GCL_ENVELOPE_TIER strict/relax; GCL_TEST_SWEEP Axis-A waiter-count sweep.
Shared tests/_harness.sh extracted (all four suites source it).

Windows CI speedup

The concurrency canary (Test 1, the ~½-of-the-unit-suite 8×25 mutual-exclusion check) extracted into its own file tests/git-commit-lock.canary.test.sh and run as a parallel cell → ~50% faster Windows leg, simpler than the sharding approaches it replaced.

Status

Local: unit 313 passed / 0 failed, canary 2 / 0 (reduced fan-out).
Phase-4 review converged over 7 rounds (reviewers partitioned by responsibility — tests / CI / docs; foreign-model Codex + one Claude tests reviewer); final round clean.
bash -n / shellcheck / actionlint clean on all edited shell + workflow files.

Merge notes

Git history preserved (75 commits).
Branch-local agent-workflow artifacts removed in the final commit (AGENTS.md flake-hunt runbook, .plans/, .agent review-queue scaffolding) — main carries only product / docs / tests / CI.

🤖 Generated with Claude Code

…atch Stress-test-only change so many workflow_dispatch runs execute in parallel on this single branch without cancel-in-progress killing each other. Do NOT merge to main. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Diagnosis+Codex review: windows-2025 unit flake is a timing-fragile self-validation assertion, not a product bug. Plan replaces got97>=1 with rc-in-{0,97,98} + a WAITING- based anti-vacuity canary; keeps the warn17d TOCTOU regression guard untouched. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…G canary) Round-1 review (Claude + Codex, verified in code): rc-set must include 1 (lock_run demotes a clean command to 1 on unverifiable-empty release); WAITING canary must read per-waiter logs, not the shared churn.log (concurrent appends drop lines). Secondary hardenings dropped. See reviewer notes at top of plan. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The got97>=1 self-validation assertion was timing-fragile: on a loaded 2-core Windows runner all 12 waiters won create-races in the churner absent windows (rc 0) instead of timing out, so got97=0 and the test failed though the product was correct (run 27616343269; 29 prior identical runs were green). Replace it (diagnosis + Codex review confirmed test-flake, not a product bug): - per-waiter AGENT_LOCK_LOG (single-writer => drop-free; shared logs drop lines under concurrent appends), rebuilt into churn.log for the artifact; - assert all 12 waiters reach a DESIGNED terminal rc in {0,1,97,98} (rc 1 = clean command demoted on unverifiable-empty release) — catches real product regressions; - anti-vacuity: require >=1 WAITING line (proves the churn produced real contention and the guarded per-poll type-guard path ran); - unconditional note: rc distribution + WAITING count for future triage. The warn17d==0 TOCTOU regression guard is unchanged. Local: unit suite 214/0; shellcheck -S style (v0.11.0) + bash -n clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

STRESS-BRANCH ONLY (do not merge). tests/with-load.sh runs each suite while N CPU spin-loops and/or N disk create/write+fsync/delete loops saturate the runner, to widen the timing windows that latency/race flakes depend on (Test 17d absent window is driven by both CPU descheduling and slow file IO). Selected via new workflow_dispatch inputs stress_kind (none|cpu|disk|both, default both) and stress_load (blank=core count); empty on push/schedule => none. Step/job timeouts raised so load slowness does not trip a timeout and look like a flake. Hogs reaped by exact PID (never by name). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

So the process (dispatch -> on failure: subagent diagnose -> Codex review -> plan -> review/fix rounds -> implement -> review/fix rounds -> commit -> reset streak -> resume), the mechanics, the process-hygiene lessons, and the progress log survive context compaction. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Under CPU load the kill -9 of the pwsh holder missed the native pwsh.exe (MSYS dollar-bang is a shim), so pwsh ran its full Start-Sleep 60 and exited gracefully — its PowerShell.Exiting backstop DELETED the lock, so the precondition read got an empty/gone file and the 3 steal asserts were vacuous (stole a backdate-recreated empty file). Diagnosis + Codex agreed: test bug, product correct. Fix (Option D): the holder now does if (-not (Lock-Acquire)) { Exit 3 }; write READY; [Environment]::Exit(0) Environment.Exit bypasses BOTH release and the backstop, leaving a deterministic token-bearing orphan with no external kill. bash drops the kill and just reaps. The tok.ps token assertion is now genuine every run, not vacuous. Local interop suite 141/0. Reviewed clean by fresh Claude reviewer + Codex. shellcheck -S style + bash -n clean. Found via the CI load stress test. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…n T31a) Third flake found by the load hunt (ubuntu, both/load=2): Test 31(a) leaked-token-memory DISCOVERY-HOLD assertion races the external mv install vs the leaver _lock_discover; under load the direct-discover path (sh:822) adopts the claim instead of the memory path (sh:1382) the assertion pins. Product correct; test-orchestration race; 31(b) covers the memory path deterministically. Diagnosed, fix pending the formal loop. Hunt at 16/50 clean, halted here. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Ben clarified: there are no credits to worry about; public repo => unlimited CI capped only by GitHub concurrency. Corrects the earlier misleading agent-credits framing that prompted a premature budget pause. Resume guidance: dispatch freely, keep going to 50 clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ind) Sub-leg (a) installs a recheck-unreadable leaked claim at the lock path via an external mv, then asserted adoption went through the per-poll leaked-token-memory route ("DISCOVERY-HOLD (leaked-token memory)"). But the product can adopt the claim via EITHER of two correct routes: the inline ownership-discovery read (git-commit-lock.sh:822) if the mv lands before it, or the per-poll memory check (git-commit-lock.sh:1382) on a later poll if it lands after. Which fires is a pure scheduling race -- the external mv vs the leaver's inline discover one statement after the leak-add (sh:1112 -> sh:1114). Under both/load=2 on ubuntu the mv won and the direct route fired, so the memory-pinned assertion failed spuriously (run 27626826865). The product behaved correctly in both cases (token remembered, same token observed installed, adopted, rc 0, clean release, no residue). Fix is test-only: sub-leg (a) now accepts EITHER DISCOVERY-HOLD route and records which fired, failing only if neither adopted the claim. No coverage is lost -- the memory route stays pinned deterministically by sub-leg (b), and the direct route by Test 25's 7-position discovery-position matrix. Diagnosis converged across four independent reviews (my code-read + the verbatim leak.log, a fresh-context Claude subagent that did not read the prior diagnosis, and a Codex foreign-model review). Implementation reviewed clean by a fresh Claude reviewer and by Codex. Static checks (bash -n + shellcheck -S style v0.11.0) clean; local unit suite 207 passed / 0 failed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ckdump Fill the real fix SHA into the Test 31(a) progress entry, update the hunt status (clean_count reset; both/load=2 hunt resumed toward 50 clean on the fixed tree; three flakes fixed this session). Add *.stackdump to .gitignore so the suite's transient Cygwin crash dumps stop cluttering git status during the hunt. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

A coverage audit (subagent + my own verification against the code) found the product's two acquire read-back-verification failure lanes were asymmetrically covered. The create-path lane (O_EXCL create wins, path reads back the wrong token, git-commit-lock.sh:1354-1360) is covered by Test 32. Its steal-path twin "F2" (git-commit-lock.sh:1168-1179) was NOT: the stealer wins the claim race AND wins the rename-over (STOLE-BY-CLAIM logged, ghost destroyed), but the mandatory post-rename read-back at :1171 reads back the wrong token, so the product must clear its claim token and re-enter the wait loop rather than take the hold. After a STOLE-BY-CLAIM a silent false-hold there would be a mis-attributed hold of a destroyed-ghost path, so this is the higher-stakes twin — and nothing exercised it. Test 32b closes the gap. It mirrors Test 32 with the INVERSE token gate: a one-shot _lock_cur_token shadow gated on [ -n "$_LOCK_CLAIM_TOKEN" ] lands the read-back fault at the STEAL read-back (:1171), not the create one (:1353, where the claim token is empty). On firing it backdates the just-installed abandoned lock stale so the re-steal is immediate (same trick as Test 32 — keeps it fast and deterministic); the second attempt (shadow spent) reads back the real token and acquires, releasing rc 0. The test asserts the F2-specific log line (not the shared "acquire verification FAILED" prefix), STOLE-BY-CLAIM x2, the WARNING preceding the eventual ACQUIRED (no false-hold), and no leftovers. The stale closing NOTE that called the read-back lanes "not suite-covered" is corrected (create by Test 32, F2 by Test 32b). Product code is unchanged; F2 reads correct today — this is a missing-test (regression exposure), not a present bug. Diagnosis from a coverage-audit subagent, verified by me against the code. Test-only; no product change. Static checks clean; local suite 0 failed; Test 32b verified to exercise the F2 lane (standalone + full suite). Implementation reviewed clean by a fresh Claude reviewer and by Codex. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…estart on final tree

A decision-support document classifying every failure mode into three robustness tiers (correctness / best-effort-in-envelope / out-of-scope), each grounded in product code + tests with file:line citations, with a recommendation on whether it should be an in-scope guarantee. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ck dir; add E3/H4 Address findings from a foreign-model (Codex) + fresh-Claude review of the failure-modes map, each verified against the code: - Core-guarantee precision (the doc's central thesis): the unconditional safety property is "no silent lost update, given cooperative wrapper unwind", NOT unconditional mutual exclusion. Strict mutual exclusion holds only within the staleness window; beyond it the lease is fail-open-but-detectable. Split Tier 1 into safety (unconditional) vs recovery (lock-shaped orphans only, under a readable-clock / supported-FS envelope; foreign objects at the path are deliberately never auto-removed). - Add H4: hard kill (SIGKILL) or a wrapped command's [Environment]::Exit() while holding bypasses release-time detection -> the explicit boundary of the no-silent-loss guarantee. - Add E3: mtime probe entirely unreadable -> staleness detection disabled; fails SAFE (never steals a lock whose age it cannot establish), recovery lost, loudly announced once per process (both ports). - Fix E2 clock-jump direction (age = now - mtime: a FORWARD jump makes a live lock look stale -> premature steal -> detected-98; a BACKWARD jump delays recovery). - D1: separate the atomic-overwrite engines (mv -T / 3-arg File.Move) from the non-atomic Windows PowerShell 5.1 unlink-then-Move fallback (claim-guarded; fairness loss, never a clobber). - Note the leaked-token memory is process-local (ties the "no unowned lock" framing to residual 5); correct the README-location claim for the mixed-version note (it is in the design doc only); minor citation fixes (README quote/line, Test 22a over-attribution). Reviewers confirmed the central thesis (correctness load-independent; only latency degrades) holds against every interleaving attacked on a local FS. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ypass class Address the round-2 reviews (Codex + fresh Claude), verified against the code: - Codex (blocking): §H4's bypass list was incomplete for bash. lock_run runs the wrapped command in the wrapper shell itself (git-commit-lock.sh:1733), so a wrapped `exec` replaces that shell and skips BOTH lock_release and the EXIT trap — the same silent-loss boundary as the pwsh [Environment]::Exit() case. Generalize §H4 and the §1 bullet from an enumeration to the class "termination/replacement without wrapper unwind" (external SIGKILL / bash exec / [Environment]::Exit()), and add the contrast that a plain `exit` is safe (it unwinds: bash EXIT trap, pwsh finally). - Claude (nit): §H4 had attributed try/finally to the bash run path; corrected to bash EXIT trap vs pwsh try/finally. Both rounds confirmed the central thesis holds (no two believed-legitimate holders; no UNdetected lost update on a local FS within the envelope) and that round 1's revisions 1-7 are factually correct and internally consistent. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

This reverts commit 534a007.

Applied Ben's review comments (frozen in 534a007) to the failure-modes scope-decisions doc: - §4 status: the recommendations are reviewed and accepted, except two overrides. - Network FS (§4.3 / §E1): document-only — surface the boundary in README, do NOT build the FS-type startup probe ("don't do the polish, just document"). - Untested-but-robust lanes (§4.5 / F1-F4, J1): OVERRIDE the prior "accept untested" -> add test coverage. Rationale (Ben): actually-tested edge cases make the project easier to maintain and give future users confidence vs reasoned-correct-but-untested. Propagated to the per-mode F1/F2/F3/F4/J1 entries and their summary-table rows. Disposition check passed (fresh verifier): every comment dispositioned, propagation consistent, no leaked comment text, §4 numbering coherent. comment-commit: 534a007 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…review)

…rategy recommendation, Ben f)

…draft)

…; max-parallel/paths-ignore caveats; billing vs scarcity; FUSE/SIP hedges; reconcile non-Linux disk cells)

…et 8 (harness ergonomics research)

Phase 1a (Bucket 1 / D-a) — docs/guarantees.md: the normative contract derived from failure-modes.md. Operating envelope (E1-E7), guarantees (safety G-S1..5, recovery G-R1..4, interop G-I1), failure semantics, best-effort tier (BE-1..5), out-of-scope (OOS-1..7 incl. the non-unwinding- exit no-silent-loss boundary), operating rules, and a verification map. Phase 1c (Bucket 7) — docs/steering-coverage.md: the deterministic-steering coverage audit + prioritized gap list, synthesized from two manual window audits and an objective kcov pass (83.1% line coverage, 451/543; ~30 lines platform-gated, ~62 Linux-reachable). kcov corrected three manual over- credits (step-3.3 CLAIM-ABORT, the foreign claim-recheck branch, the EXIT- trap no-hold twin). Gaps ranked: Tier A portable steering (A1 rename-refused wrong-type-mid-steal = headline; A2 step-3.3 abort lane; A4 the exec/H4 boundary), Tier B fault-injection (failure-modes 4.5), Tier C platform-only, Tier D document-not-test. Both reviewed to convergence: a fresh-context Claude reviewer plus two Codex rounds. The foreign Codex check plus a 4-line empirical test corrected the exec-bypass characterization across all three docs: "run -- bash -c 'exec'" does NOT skip release (the child shell is replaced, the wrapper releases normally); only an exec in the lock-holding shell itself (a sourced lock_acquire+exec, or "run -- exec") bypasses. Propagated to guarantees.md OOS-5, steering-coverage.md A4, and the failure-modes.md H4 precision fix in this commit. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Concrete build plan derived from the Phase 1 outputs (guarantees.md, steering-coverage.md) + the accepted failure-modes §4 and load-strategy §9 decisions. Sections: - 2A Tier-A steering tests (A1-A11; per-test mechanism / assertion / platform / priority — the audit gave each steering technique). - 2B Tier-B fault-injection (F4 + F2/J1 first cut; F1 gated-or-doc, F3 doc-only) — each injection empirically prototyped; refines the original D-b (F3 was not deterministically injectable; ulimit -f is a SIGXFSZ trap). - 3 doc edits (exact text: envelope + single-clock in the design doc; network-FS boundary + upgrade-both in the README). - 4 GCL_ENVELOPE_TIER=relax mechanism + the 3 downgrade sites (D-c). - 6 three-tier CI (Required / Nightly / Deep), event-conditional concurrency (keeps the deep-sweep group off the required gate), kcov coverage job, nightly auto-triage, paths-ignore-on-required fix, and a refined do-not-merge disposition (with-load.sh graduates calibrated). - 8 TAP + 1..N + the silent-undercount sentinel fix, GCL_TEST_ONLY selector (integration excluded by design), tests/_harness.sh extraction. Bucket 2B/6/8 designs feasibility-validated by parallel agents (prototypes in .agent-testing/, gitignored). Awaiting Ben's Phase 2 gate. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…to-gate CI (Codex): split into THREE workflows — tests.yml (required) + a stable tests-passed aggregator as the ONLY required context; nightly.yml; deep-sweep.yml with distinct job names. This fixes the workflow_dispatch-publishes-check-contexts gating risk AND the paths-ignore-on-required gotcha, and drops the event-conditional concurrency expression. Made ok_envelope/bad_envelope TAP-aware (Bucket 8 item 1 lands first, so TAPN/GCL_TAP exist). Added a GCL_TEST_ONLY zero-match guard. Tests (Claude): A6 must shadow the INNER _lock_stat_mtime (:606), NOT _lock_path_mtime (:639-643), which is the function that emits the warn-once the test asserts — verified against the code. Test 22a downgrade refined to only the warning-fired-at-all assertion (keep the warn-once dedup n<=1 and names-type strict). A reviewer's Test-22a line numbers were a mislocation — the plan mapping (:1167/:1168/:1170) is verified correct against the tree. F3 reclassified document-only supersedes steering-coverage B4's "portable POSIX" rating (ulimit -n can't fail the ~1-FD create without starving bash startup) — steering-coverage.md B4 corrected to match. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

All three suites: ok/bad now emit TAP13 (`ok`/`not ok N - desc`) gated by GCL_TAP=1 (dev runs byte-unchanged); a trailing `1..N` plan line lets a consumer fail on a short count; and a DONE sentinel + a finish() EXIT-trap wrapper turn any early exit/crash into a loud `Bail out!` + exit 1 — closing the silent-undercount gap (a stray `exit 0` after a recorded FAIL no longer reports green). A bare trap `return` is ignored by bash, so the guard uses an explicit `exit 1`. Validated: unit suite REDUCED + GCL_TAP=1 -> 220/220, the `1..220` plan line matches the assertion count, exit 0, sentinel does not false-fire. interop + integration syntax-checked here; full runs verify via CI. Phase 3, step 1 of the Phase 2 build plan (Bucket 8 item 1). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…grade-both) docs/git-commit-lock.md: add the operating-envelope statement (correctness is load-independent; the wall-clock bounds are best-effort and scale with scheduling) and the single-time-source note (a local clock jump is correctness-safe, degrading to the detected exit-98 lane), both beside the mtime-clock caveat; cross-ref guarantees.md. README.md: surface the network/sync-FS boundary in "How it works" (exclusion may silently fail off a local FS), and add the "upgrade both implementations together" deployment note (a mixed-version tree degrades prevention to detection). C-misc (the optional case-insensitive-FS / mixed-version one-liners) skipped as low-value — the design doc already covers mixed-version. Plan changelog updated: Phase 3 step 1 done; the Bucket 8 item-2 selector is deferred to bundle with item 3 (revised phasing recorded). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Add TAP-aware ok_envelope/bad_envelope to the unit suite: default 'strict' is identical to ok/bad; GCL_ENVELOPE_TIER=relax downgrades an envelope FAIL to a WARN that never reds the run (ENV_WARN counted + reported in the summary). Reassign the three load-sensitive wall-clock/poll-count assertions to the envelope tier, keeping every neighbouring correctness assertion strict: - Test 21: recovery <=20s - Test 22a: "warning fired at all" (n>=1) -> envelope; the warn-once dedup (n<=1) and the type-naming stay STRICT (names-type guarded on a warning having fired) - Test 29: >=2 CLAIM lines Validated: strict (default) -> 220/220, 0 envelope warnings, 1..220 consistent, the 3 sites report PASS[env]. Downgrade logic checked deterministically (relax->WARN, strict->FAIL). No product (git-commit-lock.sh) change. Phase 3, step 3 (Bucket 4). Required CI will set strict; nightly/deep set relax. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Subplan for splitting the windows-2025 unit leg (the ~2x CI bottleneck) into two parallel shards via a GCL_TEST_SHARD=i/n round-robin gate in section(). Records the mechanism, the by-construction partition guarantee + a self-contained per-shard expected-count guard, alternatives rejected, CI wiring (windows-unit only; not kcov/nightly/interop), edge cases, phasing, and a logging design. Awaiting review convergence + Ben's go. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Round 1 = 2 fresh Claude reviewers + independent Codex. Two blocking defects in the original snippet + two Codex caught the Claude pair missed, all folded: - malformed input (empty component / leading-zero octal trap) now rejected by a single regex ^([1-9][0-9]*)/([1-9][0-9]*)$; - GCL_TEST_ONLY and GCL_TEST_SHARD made MUTUALLY EXCLUSIVE (simpler than AND + guard-fallback; no real use case combines them); - GCL_TEST_SHARD parsed LAZILY (first section() call) so the integration suite (no section() blocks) never parses/bails — it note-and-ignores; - union proof + per-shard logging use run-only PASS/FAIL signals, not the == Test N == headers (which section() prints before gating, so skipped too); - guard asserts expected>=1 (catches n>section-count) and its rationale reworded (catches a section()-coverage regression, not a correlated modulo bug). Confirm round (fresh reviewer + Codex) pending before declaring converged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Round 2 = fresh Claude (sound) + independent Codex (2 accuracy defects). Folded: - union-proof run-line set corrected to ^(PASS:|FAIL:|PASS[env]:|WARN[env-relaxed]:) (ok_envelope/bad_envelope emit the bracketed forms; a bare PASS:/FAIL: grep undercounts the 315) — or use GCL_TAP=1; - the per-shard guard reframed HONESTLY: it does NOT catch a test added outside a section() gate (that bumps neither counter); its value is the empty-shard (expected>=1) check + a cheap modulo cross-check. The union proof's no-duplicate check is what catches an ungated test (runs in both shards); - explicit selector_report shard-guard snippet (gated on GCL_TEST_SHARD set); - RAN: attribution marker gated on GCL_TEST_SHARD set (unsharded stays identical); - kcov-interaction note (Ben asked): coverage job runs the full suite UNSHARDED; sharding code is inert when GCL_TEST_SHARD unset -> no interaction. Mechanism verified sound by 4 reviewers across 2 rounds; final Codex spot-confirm running. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…implement) Final independent Codex spot-confirm returned no findings. Mechanism verified sound across 3 review rounds (Claude x3 + Codex x3). Status -> converged; ready for implementation on Ben's go. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Implements the converged windows-unit-shard subplan (Phase 1 — mechanism). Opt-in GCL_TEST_SHARD=<i>/<n> splits a section()-gated suite into n round-robin shards by file-order section index: - tests/_harness.sh: lazy _shard_init (parsed on first section() call, so suites that never call section() neither parse nor bail) — mutually exclusive with GCL_TEST_ONLY, validated by ^([1-9][0-9]*)/([1-9][0-9]*)$ (rejects empty / non-digit / leading-zero octal trap / extra slash), then i<=n. section() bumps SECTION_IDX for every test before gating (stable shard key), adds the residue gate, and emits a run-only RAN: marker ONLY in shard mode. selector_report gains a gated guard: recompute expected from the same residue mapping, log one greppable verdict line, bail if SECTIONS_RUN != expected or expected < 1 (catches an empty shard, e.g. n>section-count). Unset/empty => no-op, so unsharded runs are byte-identical. - integration suite note-and-ignores GCL_TEST_SHARD (one indivisible scenario, no section() blocks). Partition is guaranteed by construction (round-robin over one stable ordering) + a one-time local union proof. Validated locally: unsharded unit 315/0 + interop 141/0 byte-identical; shard 1/2 + 2/2 disjoint, union == unsharded (no dup), 148+167=315 run-lines / 29+28=57 sections; all malformed inputs bail; integration prints the note + runs all 12; shellcheck -S style clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The windows-2025 unit leg is the CI wall-clock bottleneck (~2x the others; process-spawn overhead on the 2-core runner). Split its one matrix cell into two (shard 1/2) running in parallel to ~halve that leg: - Unit-suite step sets GCL_TEST_SHARD: matrix.shard && '{0}/2' || '' — only the two windows-unit cells carry matrix.shard, so ubuntu/macos (leg:all) and windows interop-integration get '' and run the FULL unit suite unchanged. - artifact name + job-name template gain a shard suffix (v4+ rejects duplicate artifact names). No other cell / SHA-pin / step touched. Coverage (kcov, nightly, Linux) stays whole + unsharded — orthogonal. actionlint clean; cross-platform CI verification follows. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…red) CI run 27723744798 all green, both shards pass. Overall CI 360s->242s (~33%); windows-unit no longer the 2x outlier. But round-robin balance was 242 vs 99 (~2.4x), not the planned ~10% — the estimate used reduced-mode timings while CI runs full mode (the 8x25 canary clusters in shard 1). Recorded the data + the lesson (estimate balance from the run mode) + the decision rationale: accept as-is (macos 210s is the floor, so re-balancing gains only ~32s for a cost-table the plan rejected). Mechanism is correct + green regardless. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Round-robin balanced poorly (242 vs 99) because ONE test dominates: Test 1 (the 8x25 FULL-width canary) is ~151s, ~half the 309s suite; the other 56 tests sum to ~158s (measured from run 27723744798's job-log timestamps). So the balanced fixed split is simply Test 1 alone on one shard, the rest on the other (~151 vs ~158). Windows-unit -> ~167s, so macOS's 210s becomes the overall floor (242->210). Replaces the round-robin assignment with a static _shard_of() in _harness.sh (case "Test 1:"* -> shard 1, else shard 2) — a fixed split derived once from a measurement, NOT a maintained cost table. New tests default to shard 2; the per-shard wall-clock log surfaces drift for occasional re-tune. n stays 2 (Test 1's 151s is an irreducible floor). Ben endorsed the Test-1-vs-rest split. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ts.yml)" This reverts commit 2de66ff.

…ness" This reverts commit a01a8e3.

…rsede sharding After unwinding the GCL_TEST_SHARD sharding (reverts 89de803 + 143e280), plan the simpler approach: move Test 1 (the ~151s full-width concurrency canary, ~half the Windows unit-suite runtime) into tests/git-commit-lock.canary.test.sh, so it runs as a naturally-parallel CI job — zero sharding machinery, same wall-clock win. The canary runs as its own cell on ALL arches (Ben's call: uniform, cheap extra POSIX job). New 7-cell matrix (canary leg on ubuntu/macos/windows). Test 1 moves verbatim (sources _harness.sh; tiny preamble: LIB, WORK/cleanup/trap, T1_* knobs, INCR). Coverage-safe (union across cells == the original 57; counts reconcile). Marked the two shard plans (windows-unit-shard, shard-balance) SUPERSEDED. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Codex (independent single review) caught two gaps, both folded: - ENV_WARN under set -u: the unit RESULT line expands $ENV_WARN (defined in the envelope section we're not copying) -> the canary file must define ENV_WARN=0. - The other workflow callers: nightly.yml (stress cells + the kcov coverage job) and deep-sweep.yml reach Test 1 only via git-commit-lock.test.sh; after extraction they must also run the new canary file (preserve canary-under-load stress + the canary's kcov coverage contribution, ~3pp from the 0.80 floor). Principle: the canary is a 4th suite file — every suite-runner must include it. Updated phasing to rewire all three workflows + the kcov merged-coverage run. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Test 1 (the full-width 25x8 concurrency canary) is ~half the Windows unit-suite wall-clock and is a different KIND of test (a statistical concurrency canary; repetition-at-width is its coverage) from the targeted unit/steering tests. Move it verbatim into tests/git-commit-lock.canary.test.sh so it can run as a naturally-parallel CI job — replacing the (now-reverted) GCL_TEST_SHARD sharding with a far simpler file split. - New tests/git-commit-lock.canary.test.sh: sources _harness.sh; minimal preamble (LIB, the GCL_TEST_FULL->T1_* width knobs, WORK/cleanup/trap finish, INCR), the Test 1 block verbatim, standard RESULT/TAP tail. Defines ENV_WARN=0 (the RESULT line expands it but it lives in the envelope tier we don't copy). - tests/git-commit-lock.test.sh: Test 1 block + the now-orphaned INCR removed (no other unit test uses INCR); count self-adjusts (TAPN is dynamic). - _harness.sh: "three suites" -> "four". Validated: canary standalone 2/0; unit-minus-canary 313/0; 313+2=315 (the pre-extraction count) with disjoint, no-duplicate, union==original; interop 141/0, integration 12/0; shellcheck -S style clean (incl the new file). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ly/deep-sweep/kcov The canary is now a 4th suite file; every workflow that ran the suites must run it. - tests.yml: a `canary` cell per arch (ubuntu/macos/windows) running canary.test.sh in parallel (the per-PR wall-clock win — Windows unit ~360s -> ~174s, overall macOS-gated). Canary step gates on `leg == 'canary'` only, so `leg: all` still runs unit+interop+integration but NOT canary (no double-run). Canary file added to the shellcheck lint list. - nightly.yml: canary-under-load step in the stress cells; the kcov job runs the unit suite AND the canary under kcov into one merged --include-path outdir, so the canary's coverage of git-commit-lock.sh is preserved (the 0.80 floor can't regress from the split). Canary kcov log uploaded. - deep-sweep.yml: canary step with the same repeat-loop / PIPESTATUS fail-fast / under-load wrapping as the unit suite. actionlint clean on all three workflows. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…than sharding) CI run 27728088150 all 8 jobs green. Overall CI 360s -> ~181s (~50%), gated by the windows unit-minus-canary cell (181s) with the windows canary (165s) balanced beside it; macOS 194->167s (canary peeled into a cheap 33s cell). Beats the sharding (242s, imbalanced) and is far simpler — zero GCL_TEST_SHARD machinery; the canary is just its own suite file. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The nightly kcov job runs `kcov … tests/git-commit-lock.canary.test.sh` directly (not via `bash`), exactly as it does the unit suite — which works because the unit suite is mode 100755. The canary was added 100644, so the direct kcov exec would fail permission-denied. Set it executable to match the unit suite (the precedent for a directly-invoked suite). Caught by Codex's implementation review. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ov run Round-2 review fixes (Codex + Claude, independent) for the canary split's nightly kcov job: - Floor-parse picked the cobertura with the highest lines-VALID. Since the unit + canary runs into one outdir produce per-suite reports plus a merged union at coverage/kcov-out/kcov-merged/cobertura.xml — ALL covering the same source, so all with identical lines-valid — the tie-break could grab a single-suite report and measure partial coverage (spurious floor failure). Select by highest lines-COVERED instead (the merged union has the most), which robustly picks the union. Comment corrected (kcov writes no top-level cobertura.xml). - The kcov-run step lacked pipefail (the kcov job has no shell:bash default), so a failing `kcov ... | tee` returned tee's 0. Added `set -euo pipefail`. - Dropped an inaccurate "+ sweep" from the canary step comment (the canary reads only GCL_TEST_FULL, not the Axis-A sweep). The kcov job dispatches only from the default branch, so this path is exercised post-merge; verified statically against the kcov v43 output layout. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ribution) Round-2 review caught the canary extraction didn't update the user-facing docs: - docs/git-commit-lock.md: add the canary row to the suite table + run-commands; "three"->"four" (temp dirs / CI / log-copy); and move the "mutual exclusion under many concurrent workers" description from the unit suite to a new canary paragraph (only Test 1 moved — crash-recovery/claim-contention stay in the unit suite and remain in its list). Also note the canary's counts are exact. - README.md: "Three suites" -> "Four suites" (+ the bash concurrency canary). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ash-recovery) Round-3 review catch: the canary suite TABLE ROW + RUN-COMMAND comment claimed the canary also covers "crash-recovery / claim-serialization", but those (unit Tests 2b/20) stayed in the unit suite — the canary file contains only Test 1 (mutual exclusion over repeated rounds + its mtime-probe-warning guard). Align the table row + run comment with the (already-correct) detailed paragraph. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…aim the canary Round-4 review catches (all split-introduced, swept the whole class): - tests/git-commit-lock.test.sh: Test 1 moved to the canary file, but the suite still computed orphaned T1_ROUNDS/T1_N and its `fan-out mode:` echo + header advertised "the full-strength 8x25 canary". Removed the dead T1_* vars; the mode-echo now reports the suite's actual heavy fan-out (Test 2b rounds, Test 20 workers); header/echo say "full-strength fan-out", not "canary". - deep-sweep.yml: "Mirrors the tests.yml 4-cell set" -> tests.yml now has canary cells; reworded to "per-OS legs" + noted the canary runs as a step here. - nightly.yml: clarified the leg comment (all/unit/interop-integration as in tests.yml; the canary is a step here, not a leg). bash -n + shellcheck + actionlint clean; mode-echo verified. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Round-5 review catches (comment-only, no behaviour change): - nightly.yml kcov-run comment said kcov "merges into the top-level cobertura.xml" — contradicts the round-2-corrected enforcement comment. kcov writes the union to coverage/kcov-out/kcov-merged/cobertura.xml (no top-level file); the enforcement step selects it by highest lines-covered. Comment made consistent. - deep-sweep.yml loop comment claimed "set -e is NOT in effect (default bash here)" — but deep-sweep sets shell: bash (-eo pipefail), so set -e + pipefail ARE on; a failing pipeline already trips the step. Reworded: the explicit PIPESTATUS check is a defensive backstop that names the failing iteration. (Pre-existing Bucket-6d wording the canary loop shares; behaviour was always correct.) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…; Test 42 xref) Comprehensive sweep of all touched files for stale post-split references (one had surfaced per review round); fixed the two genuine remainders (comment-only): - nightly.yml kcov section header said "unit suite at FULL" — the job now runs unit + canary under kcov (the step name already says so). -> "unit + canary at FULL". - tests/git-commit-lock.test.sh Test 42 comment cross-referenced "Test 1" (which a reader would seek in the unit suite) — Test 1 moved to the canary file. Clarified: "the concurrency canary (formerly Test 1, now git-commit-lock.canary.test.sh)". Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

From the Phase-4 area reviews (A unit+canary, C CI): - A1: Test 42 `grep -c … || echo 0` appended a 2nd "0" on zero matches ("0\n0"), garbling the warn-once assertion — drop `|| echo 0` (grep -c prints 0 itself; `${:-0}` covers the missing-file case). Only bit the zero-firings case (not a false-green), now a clean integer. Suite still 313/0. - B8: reflow an awkward comment wrap in tests/_harness.sh (cosmetic). - C1: nightly kcov job inherited the workflow-level GCL_ENVELOPE_TIER=relax despite its "strict" intent — set GCL_ENVELOPE_TIER=strict on the kcov step (a true clean coverage run; the floor enforces the envelope assertions). - C2: nightly-triage.sh classified ANY job failure as `nightly-correctness`, so a step TIMEOUT (no ^FAIL: line) misclassified as correctness — correctness now requires a real ^FAIL:; failure/timeout/cancel without one falls to `nightly-infra`. - C3: add tests/with-load.sh to the required tests.yml shellcheck gate (nightly/ deep-sweep depend on it; it was unlinted per-PR). - Drop a stale `steering-coverage.md` reference in a Test 46 comment. shellcheck -S style + actionlint clean; unit suite 313/0; triage dry-run confirms timeout→infra. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

From the Phase-4 docs review (D) + Ben's merge-scope decision: - D1: guarantees.md still said "coverage planned" for BE-3/BE-4/G-S2 resource lanes that are now TESTED + passing — flip to cite Tests 42/48/49/50 (F3 stays document-only), reconciling with failure-modes.md §2 (flipped earlier in 309cf39). - D2: re-attribute the concurrency canary from "unit Test 1" to tests/git-commit-lock.canary.test.sh (the branch's central change) in guarantees.md (G-S3) + failure-modes.md (A1); add a `C:` witness legend. Crash-recovery/ claim-contention stay unit Tests 2b/20. - guarantees.md OOS-5: the bash `exec` §H4 lane is no longer "a coverage gap" — cite unit Test 40 (the exec-bypass boundary test); drop the steering-coverage.md ref. - load-testing-strategy.md: rewritten from a stale "RECOMMENDATION / not-built-yet" plan into a present-tense rationale for why the CI is shaped as it is (Ben's call: must describe current state, not read as a stale plan). 346 -> 236 lines. - steering-coverage.md: DELETED. It was a point-in-time gap list whose every gap is filled (Tests 37-50) + had a broken .plans/ link — a finished working artifact, not main-worthy (Ben's decision). The kcov baseline lives in the nightly kcov gate. - Fold in 2 parallel stale README-notes in guarantees/failure-modes (the notes the README now carries). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…er disclaimer Round-2 re-review (tests + CI clean; docs Codex found 3): - deep-sweep.yml referenced docs/load-testing-strategy.md "§9", a section the rewrite removed (Tier D is now under §3) — point to the doc generically. - guarantees.md lacked the line-number disclaimer its sibling failure-modes.md carries; the branch's interop/unit reformatting drifted some `file:line` anchors (the test NAMES/NUMBERS stay correct). Added the "anchors, not exact addresses" disclaimer rather than re-derive every anchor (a transient, re-drifting fix the disclaimer exists to avoid; reviewer D round-1 judged the drift non-blocking). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…es.md Round-3 docs re-review (Codex) found the E2 row + §4 items 1-2 still framing the load/timing envelope and single-clock docs as future recommendations, though both are implemented (git-commit-lock.md "operating envelope" / "One time source"; load-testing-strategy.md §1). Fixed the whole class so the doc describes current state — verified each target exists before de-flagging: - E2 + K1 table rows: drop "not addressed in docs" / "UNDER-documented" / "Define the envelope" -> state it's documented. - §E2 detail and §K "Net K": "Recommend: document X" -> present-tense "documented (where)". - §4 items 1/2/3: add *Status (done):* markers (matching item 5's style) and neutralize the two imperative headings. - §B1 latency-bound, §E3, §H4 "Recommend: document" -> "documented (where)" (targets verified: guarantees.md BE-3 for E3, OOS-5/G-S1 for H4, §K for B1). Tests + CI legs were clean in round 2 and are untouched here (docs-only round). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…tion anchors Round-4 docs re-review (Codex) flagged a cross-repo ref and stale plan vocabulary; swept the whole class so the merged tree is self-contained (.plans/ + AGENTS.md are deleted at merge, so any reference to their content would dangle on main): - failure-modes.md: drop the cross-repo `agents/600-claude.md` citation; the "§4.5"/"§4.1" anchors meant "§4 item 5"/"§4 item 1" (no such subsections exist) — fixed across the doc. - guarantees.md: same "§4.5"/"§4.1" fix; drop "Bucket 4 / D-c" plan vocabulary. - deep-sweep.yml / nightly-triage.sh / nightly.yml: drop "Phase-2 build plan", "Bucket 6 spec/design", "Bucket-2 Tier-A" — the decisions they cited are stated inline. - tests (test.sh / interop.test.sh / _harness.sh): "Bucket 4 / D-c" -> "see failure-modes.md §K / §4 item 1"; "Bucket 6" -> "see load-testing-strategy.md"; "failure-modes.md §4.5" -> "§4 item 5". Comment-only; bash -n clean. Validated: actionlint (deep-sweep, nightly), shellcheck (nightly-triage), bash -n (all edited shell files). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… triage header Round-5 docs re-review (Codex), 4 findings, all verified against the code: 1. G-S4 / D3 claim socket+device coverage but their witness lists cited only Tests 17/17d/18/22 — added Test 44 (the socket & device-node arms of the wrong-type classifier) to the G-S4 prose witness, the §7 verification-map row, and the D3 table row + detail. 2. load-testing-strategy.md cited `guarantees.md §E`, which does not exist (guarantees.md is §1-§7); the correctness-rests-on-structure point is §2A. Fixed. 3. failure-modes.md "Sources of truth" listed only three test suites — added the canary suite (the doc later uses `C Test 1` as a witness). 4. nightly-triage.sh header was stale: it claimed a CONCLUSIONS JSON file (the code reads per-cell cell-conclusion.txt) and that a `failure` conclusion is correctness (the code classifies failure-without-^FAIL: as infra). Header now matches the implementation. Validated: shellcheck + bash -n (nightly-triage.sh). Docs are markdown. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Round-6 docs re-review (Codex): all named docs clean; one finding — the GENERATED correctness-issue body still said "a FAIL: assertion and/or a cell job concluded failure", contradicting the classifier (a failed/timed-out cell WITHOUT ^FAIL: is infra, not correctness). The correctness body now cites only a ^FAIL: line; the infra body's list is completed to name the failure-without-FAIL case (same class, fixed pre-emptively). Validated: shellcheck + bash -n. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…on main) ci-stress was the CI flakiness stress-hunt campaign; its agent-workflow scaffolding is branch-local, not part of the product: - AGENTS.md — the flake-hunt runbook for this campaign. - .plans/ (9 files) — the zformal plan + subplans (full history preserved in git). - .agent/.gitkeep + the /.agent/* .gitignore entries — the per-worktree comment-review queue scaffolding. Kept: the product (git-commit-lock.sh/.ps1), install.sh, README, docs/, the four test suites + harness/load wrapper, the CI workflows, and the .agent-testing/ ignore (general testing convention). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

bentoner and others added 30 commits June 16, 2026 21:32

AGENTS.md: record F2 coverage addition (Test 32b, 19a28fd) and hunt r…

c762899

…estart on final tree

ben comments: docs/failure-modes.md

534a007

Revert "ben comments: docs/failure-modes.md"

959cca9

This reverts commit 534a007.

Plan proposal: guarantees spec + failure-modes follow-ups (await Ben …

9048400

…review)

Plan: lock decisions D-a..e; add Bucket 6 / Phase 1b (load-testing st…

2617449

…rategy recommendation, Ben f)

docs: load-&-matrix testing strategy recommendation (Ben f, Phase 1b …

0397aaa

…draft)

docs(load-testing): apply Codex factual review (cgroup probe-required…

aeba95c

…; max-parallel/paths-ignore caveats; billing vs scarcity; FUSE/SIP hedges; reconcile non-Linux disk cells)

Plan: §9 accepted; add Bucket 7 (steering coverage / Phase 1c) + Buck…

8ba6341

…et 8 (harness ergonomics research)

bentoner and others added 30 commits June 18, 2026 06:24

Revert "CI: split the windows-unit leg into 2 round-robin shards (tes…

89de803

…ts.yml)" This reverts commit 2de66ff.

Revert "Test sharding: GCL_TEST_SHARD=i/n round-robin gate in the har…

143e280

…ness" This reverts commit a01a8e3.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

CI stress-testing: guarantees, coverage, 3-tier CI + Windows speedup#5

CI stress-testing: guarantees, coverage, 3-tier CI + Windows speedup#5
bentoner wants to merge 76 commits into
mainfrom
ci-stress

bentoner commented Jun 18, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

bentoner commented Jun 18, 2026

Summary

What's in it

Status

Merge notes

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant