Conversation
…atch Stress-test-only change so many workflow_dispatch runs execute in parallel on this single branch without cancel-in-progress killing each other. Do NOT merge to main. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Diagnosis+Codex review: windows-2025 unit flake is a timing-fragile self-validation
assertion, not a product bug. Plan replaces got97>=1 with rc-in-{0,97,98} + a WAITING-
based anti-vacuity canary; keeps the warn17d TOCTOU regression guard untouched.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…G canary) Round-1 review (Claude + Codex, verified in code): rc-set must include 1 (lock_run demotes a clean command to 1 on unverifiable-empty release); WAITING canary must read per-waiter logs, not the shared churn.log (concurrent appends drop lines). Secondary hardenings dropped. See reviewer notes at top of plan. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The got97>=1 self-validation assertion was timing-fragile: on a loaded 2-core Windows
runner all 12 waiters won create-races in the churner absent windows (rc 0) instead of
timing out, so got97=0 and the test failed though the product was correct (run
27616343269; 29 prior identical runs were green).
Replace it (diagnosis + Codex review confirmed test-flake, not a product bug):
- per-waiter AGENT_LOCK_LOG (single-writer => drop-free; shared logs drop lines
under concurrent appends), rebuilt into churn.log for the artifact;
- assert all 12 waiters reach a DESIGNED terminal rc in {0,1,97,98} (rc 1 = clean
command demoted on unverifiable-empty release) — catches real product regressions;
- anti-vacuity: require >=1 WAITING line (proves the churn produced real contention
and the guarded per-poll type-guard path ran);
- unconditional note: rc distribution + WAITING count for future triage.
The warn17d==0 TOCTOU regression guard is unchanged.
Local: unit suite 214/0; shellcheck -S style (v0.11.0) + bash -n clean.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
STRESS-BRANCH ONLY (do not merge). tests/with-load.sh runs each suite while N CPU spin-loops and/or N disk create/write+fsync/delete loops saturate the runner, to widen the timing windows that latency/race flakes depend on (Test 17d absent window is driven by both CPU descheduling and slow file IO). Selected via new workflow_dispatch inputs stress_kind (none|cpu|disk|both, default both) and stress_load (blank=core count); empty on push/schedule => none. Step/job timeouts raised so load slowness does not trip a timeout and look like a flake. Hogs reaped by exact PID (never by name). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
So the process (dispatch -> on failure: subagent diagnose -> Codex review -> plan -> review/fix rounds -> implement -> review/fix rounds -> commit -> reset streak -> resume), the mechanics, the process-hygiene lessons, and the progress log survive context compaction. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Under CPU load the kill -9 of the pwsh holder missed the native pwsh.exe (MSYS dollar-bang
is a shim), so pwsh ran its full Start-Sleep 60 and exited gracefully — its
PowerShell.Exiting backstop DELETED the lock, so the precondition read got an empty/gone
file and the 3 steal asserts were vacuous (stole a backdate-recreated empty file).
Diagnosis + Codex agreed: test bug, product correct.
Fix (Option D): the holder now does
if (-not (Lock-Acquire)) { Exit 3 }; write READY; [Environment]::Exit(0)
Environment.Exit bypasses BOTH release and the backstop, leaving a deterministic
token-bearing orphan with no external kill. bash drops the kill and just reaps. The
tok.ps token assertion is now genuine every run, not vacuous.
Local interop suite 141/0. Reviewed clean by fresh Claude reviewer + Codex. shellcheck
-S style + bash -n clean. Found via the CI load stress test.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…n T31a) Third flake found by the load hunt (ubuntu, both/load=2): Test 31(a) leaked-token-memory DISCOVERY-HOLD assertion races the external mv install vs the leaver _lock_discover; under load the direct-discover path (sh:822) adopts the claim instead of the memory path (sh:1382) the assertion pins. Product correct; test-orchestration race; 31(b) covers the memory path deterministically. Diagnosed, fix pending the formal loop. Hunt at 16/50 clean, halted here. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Ben clarified: there are no credits to worry about; public repo => unlimited CI capped only by GitHub concurrency. Corrects the earlier misleading agent-credits framing that prompted a premature budget pause. Resume guidance: dispatch freely, keep going to 50 clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ind)
Sub-leg (a) installs a recheck-unreadable leaked claim at the lock path via an
external mv, then asserted adoption went through the per-poll leaked-token-memory
route ("DISCOVERY-HOLD (leaked-token memory)"). But the product can adopt the
claim via EITHER of two correct routes: the inline ownership-discovery read
(git-commit-lock.sh:822) if the mv lands before it, or the per-poll memory check
(git-commit-lock.sh:1382) on a later poll if it lands after. Which fires is a pure
scheduling race -- the external mv vs the leaver's inline discover one statement
after the leak-add (sh:1112 -> sh:1114). Under both/load=2 on ubuntu the mv won and
the direct route fired, so the memory-pinned assertion failed spuriously
(run 27626826865).
The product behaved correctly in both cases (token remembered, same token observed
installed, adopted, rc 0, clean release, no residue). Fix is test-only: sub-leg (a)
now accepts EITHER DISCOVERY-HOLD route and records which fired, failing only if
neither adopted the claim. No coverage is lost -- the memory route stays pinned
deterministically by sub-leg (b), and the direct route by Test 25's 7-position
discovery-position matrix.
Diagnosis converged across four independent reviews (my code-read + the verbatim
leak.log, a fresh-context Claude subagent that did not read the prior diagnosis,
and a Codex foreign-model review). Implementation reviewed clean by a fresh Claude
reviewer and by Codex. Static checks (bash -n + shellcheck -S style v0.11.0) clean;
local unit suite 207 passed / 0 failed.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ckdump Fill the real fix SHA into the Test 31(a) progress entry, update the hunt status (clean_count reset; both/load=2 hunt resumed toward 50 clean on the fixed tree; three flakes fixed this session). Add *.stackdump to .gitignore so the suite's transient Cygwin crash dumps stop cluttering git status during the hunt. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
A coverage audit (subagent + my own verification against the code) found the product's two acquire read-back-verification failure lanes were asymmetrically covered. The create-path lane (O_EXCL create wins, path reads back the wrong token, git-commit-lock.sh:1354-1360) is covered by Test 32. Its steal-path twin "F2" (git-commit-lock.sh:1168-1179) was NOT: the stealer wins the claim race AND wins the rename-over (STOLE-BY-CLAIM logged, ghost destroyed), but the mandatory post-rename read-back at :1171 reads back the wrong token, so the product must clear its claim token and re-enter the wait loop rather than take the hold. After a STOLE-BY-CLAIM a silent false-hold there would be a mis-attributed hold of a destroyed-ghost path, so this is the higher-stakes twin — and nothing exercised it. Test 32b closes the gap. It mirrors Test 32 with the INVERSE token gate: a one-shot _lock_cur_token shadow gated on [ -n "$_LOCK_CLAIM_TOKEN" ] lands the read-back fault at the STEAL read-back (:1171), not the create one (:1353, where the claim token is empty). On firing it backdates the just-installed abandoned lock stale so the re-steal is immediate (same trick as Test 32 — keeps it fast and deterministic); the second attempt (shadow spent) reads back the real token and acquires, releasing rc 0. The test asserts the F2-specific log line (not the shared "acquire verification FAILED" prefix), STOLE-BY-CLAIM x2, the WARNING preceding the eventual ACQUIRED (no false-hold), and no leftovers. The stale closing NOTE that called the read-back lanes "not suite-covered" is corrected (create by Test 32, F2 by Test 32b). Product code is unchanged; F2 reads correct today — this is a missing-test (regression exposure), not a present bug. Diagnosis from a coverage-audit subagent, verified by me against the code. Test-only; no product change. Static checks clean; local suite 0 failed; Test 32b verified to exercise the F2 lane (standalone + full suite). Implementation reviewed clean by a fresh Claude reviewer and by Codex. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…estart on final tree
A decision-support document classifying every failure mode into three robustness tiers (correctness / best-effort-in-envelope / out-of-scope), each grounded in product code + tests with file:line citations, with a recommendation on whether it should be an in-scope guarantee. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ck dir; add E3/H4 Address findings from a foreign-model (Codex) + fresh-Claude review of the failure-modes map, each verified against the code: - Core-guarantee precision (the doc's central thesis): the unconditional safety property is "no silent lost update, given cooperative wrapper unwind", NOT unconditional mutual exclusion. Strict mutual exclusion holds only within the staleness window; beyond it the lease is fail-open-but-detectable. Split Tier 1 into safety (unconditional) vs recovery (lock-shaped orphans only, under a readable-clock / supported-FS envelope; foreign objects at the path are deliberately never auto-removed). - Add H4: hard kill (SIGKILL) or a wrapped command's [Environment]::Exit() while holding bypasses release-time detection -> the explicit boundary of the no-silent-loss guarantee. - Add E3: mtime probe entirely unreadable -> staleness detection disabled; fails SAFE (never steals a lock whose age it cannot establish), recovery lost, loudly announced once per process (both ports). - Fix E2 clock-jump direction (age = now - mtime: a FORWARD jump makes a live lock look stale -> premature steal -> detected-98; a BACKWARD jump delays recovery). - D1: separate the atomic-overwrite engines (mv -T / 3-arg File.Move) from the non-atomic Windows PowerShell 5.1 unlink-then-Move fallback (claim-guarded; fairness loss, never a clobber). - Note the leaked-token memory is process-local (ties the "no unowned lock" framing to residual 5); correct the README-location claim for the mixed-version note (it is in the design doc only); minor citation fixes (README quote/line, Test 22a over-attribution). Reviewers confirmed the central thesis (correctness load-independent; only latency degrades) holds against every interleaving attacked on a local FS. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ypass class Address the round-2 reviews (Codex + fresh Claude), verified against the code: - Codex (blocking): §H4's bypass list was incomplete for bash. lock_run runs the wrapped command in the wrapper shell itself (git-commit-lock.sh:1733), so a wrapped `exec` replaces that shell and skips BOTH lock_release and the EXIT trap — the same silent-loss boundary as the pwsh [Environment]::Exit() case. Generalize §H4 and the §1 bullet from an enumeration to the class "termination/replacement without wrapper unwind" (external SIGKILL / bash exec / [Environment]::Exit()), and add the contrast that a plain `exit` is safe (it unwinds: bash EXIT trap, pwsh finally). - Claude (nit): §H4 had attributed try/finally to the bash run path; corrected to bash EXIT trap vs pwsh try/finally. Both rounds confirmed the central thesis holds (no two believed-legitimate holders; no UNdetected lost update on a local FS within the envelope) and that round 1's revisions 1-7 are factually correct and internally consistent. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This reverts commit 534a007.
Applied Ben's review comments (frozen in 534a007) to the failure-modes scope-decisions doc: - §4 status: the recommendations are reviewed and accepted, except two overrides. - Network FS (§4.3 / §E1): document-only — surface the boundary in README, do NOT build the FS-type startup probe ("don't do the polish, just document"). - Untested-but-robust lanes (§4.5 / F1-F4, J1): OVERRIDE the prior "accept untested" -> add test coverage. Rationale (Ben): actually-tested edge cases make the project easier to maintain and give future users confidence vs reasoned-correct-but-untested. Propagated to the per-mode F1/F2/F3/F4/J1 entries and their summary-table rows. Disposition check passed (fresh verifier): every comment dispositioned, propagation consistent, no leaked comment text, §4 numbering coherent. comment-commit: 534a007 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…rategy recommendation, Ben f)
…; max-parallel/paths-ignore caveats; billing vs scarcity; FUSE/SIP hedges; reconcile non-Linux disk cells)
…et 8 (harness ergonomics research)
Phase 1a (Bucket 1 / D-a) — docs/guarantees.md: the normative contract derived from failure-modes.md. Operating envelope (E1-E7), guarantees (safety G-S1..5, recovery G-R1..4, interop G-I1), failure semantics, best-effort tier (BE-1..5), out-of-scope (OOS-1..7 incl. the non-unwinding- exit no-silent-loss boundary), operating rules, and a verification map. Phase 1c (Bucket 7) — docs/steering-coverage.md: the deterministic-steering coverage audit + prioritized gap list, synthesized from two manual window audits and an objective kcov pass (83.1% line coverage, 451/543; ~30 lines platform-gated, ~62 Linux-reachable). kcov corrected three manual over- credits (step-3.3 CLAIM-ABORT, the foreign claim-recheck branch, the EXIT- trap no-hold twin). Gaps ranked: Tier A portable steering (A1 rename-refused wrong-type-mid-steal = headline; A2 step-3.3 abort lane; A4 the exec/H4 boundary), Tier B fault-injection (failure-modes 4.5), Tier C platform-only, Tier D document-not-test. Both reviewed to convergence: a fresh-context Claude reviewer plus two Codex rounds. The foreign Codex check plus a 4-line empirical test corrected the exec-bypass characterization across all three docs: "run -- bash -c 'exec'" does NOT skip release (the child shell is replaced, the wrapper releases normally); only an exec in the lock-holding shell itself (a sourced lock_acquire+exec, or "run -- exec") bypasses. Propagated to guarantees.md OOS-5, steering-coverage.md A4, and the failure-modes.md H4 precision fix in this commit. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Concrete build plan derived from the Phase 1 outputs (guarantees.md, steering-coverage.md) + the accepted failure-modes §4 and load-strategy §9 decisions. Sections: - 2A Tier-A steering tests (A1-A11; per-test mechanism / assertion / platform / priority — the audit gave each steering technique). - 2B Tier-B fault-injection (F4 + F2/J1 first cut; F1 gated-or-doc, F3 doc-only) — each injection empirically prototyped; refines the original D-b (F3 was not deterministically injectable; ulimit -f is a SIGXFSZ trap). - 3 doc edits (exact text: envelope + single-clock in the design doc; network-FS boundary + upgrade-both in the README). - 4 GCL_ENVELOPE_TIER=relax mechanism + the 3 downgrade sites (D-c). - 6 three-tier CI (Required / Nightly / Deep), event-conditional concurrency (keeps the deep-sweep group off the required gate), kcov coverage job, nightly auto-triage, paths-ignore-on-required fix, and a refined do-not-merge disposition (with-load.sh graduates calibrated). - 8 TAP + 1..N + the silent-undercount sentinel fix, GCL_TEST_ONLY selector (integration excluded by design), tests/_harness.sh extraction. Bucket 2B/6/8 designs feasibility-validated by parallel agents (prototypes in .agent-testing/, gitignored). Awaiting Ben's Phase 2 gate. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…to-gate CI (Codex): split into THREE workflows — tests.yml (required) + a stable tests-passed aggregator as the ONLY required context; nightly.yml; deep-sweep.yml with distinct job names. This fixes the workflow_dispatch-publishes-check-contexts gating risk AND the paths-ignore-on-required gotcha, and drops the event-conditional concurrency expression. Made ok_envelope/bad_envelope TAP-aware (Bucket 8 item 1 lands first, so TAPN/GCL_TAP exist). Added a GCL_TEST_ONLY zero-match guard. Tests (Claude): A6 must shadow the INNER _lock_stat_mtime (:606), NOT _lock_path_mtime (:639-643), which is the function that emits the warn-once the test asserts — verified against the code. Test 22a downgrade refined to only the warning-fired-at-all assertion (keep the warn-once dedup n<=1 and names-type strict). A reviewer's Test-22a line numbers were a mislocation — the plan mapping (:1167/:1168/:1170) is verified correct against the tree. F3 reclassified document-only supersedes steering-coverage B4's "portable POSIX" rating (ulimit -n can't fail the ~1-FD create without starving bash startup) — steering-coverage.md B4 corrected to match. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
All three suites: ok/bad now emit TAP13 (`ok`/`not ok N - desc`) gated by GCL_TAP=1 (dev runs byte-unchanged); a trailing `1..N` plan line lets a consumer fail on a short count; and a DONE sentinel + a finish() EXIT-trap wrapper turn any early exit/crash into a loud `Bail out!` + exit 1 — closing the silent-undercount gap (a stray `exit 0` after a recorded FAIL no longer reports green). A bare trap `return` is ignored by bash, so the guard uses an explicit `exit 1`. Validated: unit suite REDUCED + GCL_TAP=1 -> 220/220, the `1..220` plan line matches the assertion count, exit 0, sentinel does not false-fire. interop + integration syntax-checked here; full runs verify via CI. Phase 3, step 1 of the Phase 2 build plan (Bucket 8 item 1). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…grade-both) docs/git-commit-lock.md: add the operating-envelope statement (correctness is load-independent; the wall-clock bounds are best-effort and scale with scheduling) and the single-time-source note (a local clock jump is correctness-safe, degrading to the detected exit-98 lane), both beside the mtime-clock caveat; cross-ref guarantees.md. README.md: surface the network/sync-FS boundary in "How it works" (exclusion may silently fail off a local FS), and add the "upgrade both implementations together" deployment note (a mixed-version tree degrades prevention to detection). C-misc (the optional case-insensitive-FS / mixed-version one-liners) skipped as low-value — the design doc already covers mixed-version. Plan changelog updated: Phase 3 step 1 done; the Bucket 8 item-2 selector is deferred to bundle with item 3 (revised phasing recorded). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add TAP-aware ok_envelope/bad_envelope to the unit suite: default 'strict' is identical to ok/bad; GCL_ENVELOPE_TIER=relax downgrades an envelope FAIL to a WARN that never reds the run (ENV_WARN counted + reported in the summary). Reassign the three load-sensitive wall-clock/poll-count assertions to the envelope tier, keeping every neighbouring correctness assertion strict: - Test 21: recovery <=20s - Test 22a: "warning fired at all" (n>=1) -> envelope; the warn-once dedup (n<=1) and the type-naming stay STRICT (names-type guarded on a warning having fired) - Test 29: >=2 CLAIM lines Validated: strict (default) -> 220/220, 0 envelope warnings, 1..220 consistent, the 3 sites report PASS[env]. Downgrade logic checked deterministically (relax->WARN, strict->FAIL). No product (git-commit-lock.sh) change. Phase 3, step 3 (Bucket 4). Required CI will set strict; nightly/deep set relax. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Subplan for splitting the windows-2025 unit leg (the ~2x CI bottleneck) into two parallel shards via a GCL_TEST_SHARD=i/n round-robin gate in section(). Records the mechanism, the by-construction partition guarantee + a self-contained per-shard expected-count guard, alternatives rejected, CI wiring (windows-unit only; not kcov/nightly/interop), edge cases, phasing, and a logging design. Awaiting review convergence + Ben's go. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Round 1 = 2 fresh Claude reviewers + independent Codex. Two blocking defects in the original snippet + two Codex caught the Claude pair missed, all folded: - malformed input (empty component / leading-zero octal trap) now rejected by a single regex ^([1-9][0-9]*)/([1-9][0-9]*)$; - GCL_TEST_ONLY and GCL_TEST_SHARD made MUTUALLY EXCLUSIVE (simpler than AND + guard-fallback; no real use case combines them); - GCL_TEST_SHARD parsed LAZILY (first section() call) so the integration suite (no section() blocks) never parses/bails — it note-and-ignores; - union proof + per-shard logging use run-only PASS/FAIL signals, not the == Test N == headers (which section() prints before gating, so skipped too); - guard asserts expected>=1 (catches n>section-count) and its rationale reworded (catches a section()-coverage regression, not a correlated modulo bug). Confirm round (fresh reviewer + Codex) pending before declaring converged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Round 2 = fresh Claude (sound) + independent Codex (2 accuracy defects). Folded: - union-proof run-line set corrected to ^(PASS:|FAIL:|PASS[env]:|WARN[env-relaxed]:) (ok_envelope/bad_envelope emit the bracketed forms; a bare PASS:/FAIL: grep undercounts the 315) — or use GCL_TAP=1; - the per-shard guard reframed HONESTLY: it does NOT catch a test added outside a section() gate (that bumps neither counter); its value is the empty-shard (expected>=1) check + a cheap modulo cross-check. The union proof's no-duplicate check is what catches an ungated test (runs in both shards); - explicit selector_report shard-guard snippet (gated on GCL_TEST_SHARD set); - RAN: attribution marker gated on GCL_TEST_SHARD set (unsharded stays identical); - kcov-interaction note (Ben asked): coverage job runs the full suite UNSHARDED; sharding code is inert when GCL_TEST_SHARD unset -> no interaction. Mechanism verified sound by 4 reviewers across 2 rounds; final Codex spot-confirm running. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…implement) Final independent Codex spot-confirm returned no findings. Mechanism verified sound across 3 review rounds (Claude x3 + Codex x3). Status -> converged; ready for implementation on Ben's go. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Implements the converged windows-unit-shard subplan (Phase 1 — mechanism). Opt-in GCL_TEST_SHARD=<i>/<n> splits a section()-gated suite into n round-robin shards by file-order section index: - tests/_harness.sh: lazy _shard_init (parsed on first section() call, so suites that never call section() neither parse nor bail) — mutually exclusive with GCL_TEST_ONLY, validated by ^([1-9][0-9]*)/([1-9][0-9]*)$ (rejects empty / non-digit / leading-zero octal trap / extra slash), then i<=n. section() bumps SECTION_IDX for every test before gating (stable shard key), adds the residue gate, and emits a run-only RAN: marker ONLY in shard mode. selector_report gains a gated guard: recompute expected from the same residue mapping, log one greppable verdict line, bail if SECTIONS_RUN != expected or expected < 1 (catches an empty shard, e.g. n>section-count). Unset/empty => no-op, so unsharded runs are byte-identical. - integration suite note-and-ignores GCL_TEST_SHARD (one indivisible scenario, no section() blocks). Partition is guaranteed by construction (round-robin over one stable ordering) + a one-time local union proof. Validated locally: unsharded unit 315/0 + interop 141/0 byte-identical; shard 1/2 + 2/2 disjoint, union == unsharded (no dup), 148+167=315 run-lines / 29+28=57 sections; all malformed inputs bail; integration prints the note + runs all 12; shellcheck -S style clean. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The windows-2025 unit leg is the CI wall-clock bottleneck (~2x the others;
process-spawn overhead on the 2-core runner). Split its one matrix cell into two
(shard 1/2) running in parallel to ~halve that leg:
- Unit-suite step sets GCL_TEST_SHARD: matrix.shard && '{0}/2' || '' — only the two
windows-unit cells carry matrix.shard, so ubuntu/macos (leg:all) and windows
interop-integration get '' and run the FULL unit suite unchanged.
- artifact name + job-name template gain a shard suffix (v4+ rejects duplicate
artifact names). No other cell / SHA-pin / step touched.
Coverage (kcov, nightly, Linux) stays whole + unsharded — orthogonal.
actionlint clean; cross-platform CI verification follows.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…red) CI run 27723744798 all green, both shards pass. Overall CI 360s->242s (~33%); windows-unit no longer the 2x outlier. But round-robin balance was 242 vs 99 (~2.4x), not the planned ~10% — the estimate used reduced-mode timings while CI runs full mode (the 8x25 canary clusters in shard 1). Recorded the data + the lesson (estimate balance from the run mode) + the decision rationale: accept as-is (macos 210s is the floor, so re-balancing gains only ~32s for a cost-table the plan rejected). Mechanism is correct + green regardless. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Round-robin balanced poorly (242 vs 99) because ONE test dominates: Test 1 (the 8x25 FULL-width canary) is ~151s, ~half the 309s suite; the other 56 tests sum to ~158s (measured from run 27723744798's job-log timestamps). So the balanced fixed split is simply Test 1 alone on one shard, the rest on the other (~151 vs ~158). Windows-unit -> ~167s, so macOS's 210s becomes the overall floor (242->210). Replaces the round-robin assignment with a static _shard_of() in _harness.sh (case "Test 1:"* -> shard 1, else shard 2) — a fixed split derived once from a measurement, NOT a maintained cost table. New tests default to shard 2; the per-shard wall-clock log surfaces drift for occasional re-tune. n stays 2 (Test 1's 151s is an irreducible floor). Ben endorsed the Test-1-vs-rest split. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ts.yml)" This reverts commit 2de66ff.
…ness" This reverts commit a01a8e3.
…rsede sharding After unwinding the GCL_TEST_SHARD sharding (reverts 89de803 + 143e280), plan the simpler approach: move Test 1 (the ~151s full-width concurrency canary, ~half the Windows unit-suite runtime) into tests/git-commit-lock.canary.test.sh, so it runs as a naturally-parallel CI job — zero sharding machinery, same wall-clock win. The canary runs as its own cell on ALL arches (Ben's call: uniform, cheap extra POSIX job). New 7-cell matrix (canary leg on ubuntu/macos/windows). Test 1 moves verbatim (sources _harness.sh; tiny preamble: LIB, WORK/cleanup/trap, T1_* knobs, INCR). Coverage-safe (union across cells == the original 57; counts reconcile). Marked the two shard plans (windows-unit-shard, shard-balance) SUPERSEDED. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Codex (independent single review) caught two gaps, both folded: - ENV_WARN under set -u: the unit RESULT line expands $ENV_WARN (defined in the envelope section we're not copying) -> the canary file must define ENV_WARN=0. - The other workflow callers: nightly.yml (stress cells + the kcov coverage job) and deep-sweep.yml reach Test 1 only via git-commit-lock.test.sh; after extraction they must also run the new canary file (preserve canary-under-load stress + the canary's kcov coverage contribution, ~3pp from the 0.80 floor). Principle: the canary is a 4th suite file — every suite-runner must include it. Updated phasing to rewire all three workflows + the kcov merged-coverage run. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Test 1 (the full-width 25x8 concurrency canary) is ~half the Windows unit-suite wall-clock and is a different KIND of test (a statistical concurrency canary; repetition-at-width is its coverage) from the targeted unit/steering tests. Move it verbatim into tests/git-commit-lock.canary.test.sh so it can run as a naturally-parallel CI job — replacing the (now-reverted) GCL_TEST_SHARD sharding with a far simpler file split. - New tests/git-commit-lock.canary.test.sh: sources _harness.sh; minimal preamble (LIB, the GCL_TEST_FULL->T1_* width knobs, WORK/cleanup/trap finish, INCR), the Test 1 block verbatim, standard RESULT/TAP tail. Defines ENV_WARN=0 (the RESULT line expands it but it lives in the envelope tier we don't copy). - tests/git-commit-lock.test.sh: Test 1 block + the now-orphaned INCR removed (no other unit test uses INCR); count self-adjusts (TAPN is dynamic). - _harness.sh: "three suites" -> "four". Validated: canary standalone 2/0; unit-minus-canary 313/0; 313+2=315 (the pre-extraction count) with disjoint, no-duplicate, union==original; interop 141/0, integration 12/0; shellcheck -S style clean (incl the new file). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ly/deep-sweep/kcov The canary is now a 4th suite file; every workflow that ran the suites must run it. - tests.yml: a `canary` cell per arch (ubuntu/macos/windows) running canary.test.sh in parallel (the per-PR wall-clock win — Windows unit ~360s -> ~174s, overall macOS-gated). Canary step gates on `leg == 'canary'` only, so `leg: all` still runs unit+interop+integration but NOT canary (no double-run). Canary file added to the shellcheck lint list. - nightly.yml: canary-under-load step in the stress cells; the kcov job runs the unit suite AND the canary under kcov into one merged --include-path outdir, so the canary's coverage of git-commit-lock.sh is preserved (the 0.80 floor can't regress from the split). Canary kcov log uploaded. - deep-sweep.yml: canary step with the same repeat-loop / PIPESTATUS fail-fast / under-load wrapping as the unit suite. actionlint clean on all three workflows. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…than sharding) CI run 27728088150 all 8 jobs green. Overall CI 360s -> ~181s (~50%), gated by the windows unit-minus-canary cell (181s) with the windows canary (165s) balanced beside it; macOS 194->167s (canary peeled into a cheap 33s cell). Beats the sharding (242s, imbalanced) and is far simpler — zero GCL_TEST_SHARD machinery; the canary is just its own suite file. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The nightly kcov job runs `kcov … tests/git-commit-lock.canary.test.sh` directly (not via `bash`), exactly as it does the unit suite — which works because the unit suite is mode 100755. The canary was added 100644, so the direct kcov exec would fail permission-denied. Set it executable to match the unit suite (the precedent for a directly-invoked suite). Caught by Codex's implementation review. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ov run Round-2 review fixes (Codex + Claude, independent) for the canary split's nightly kcov job: - Floor-parse picked the cobertura with the highest lines-VALID. Since the unit + canary runs into one outdir produce per-suite reports plus a merged union at coverage/kcov-out/kcov-merged/cobertura.xml — ALL covering the same source, so all with identical lines-valid — the tie-break could grab a single-suite report and measure partial coverage (spurious floor failure). Select by highest lines-COVERED instead (the merged union has the most), which robustly picks the union. Comment corrected (kcov writes no top-level cobertura.xml). - The kcov-run step lacked pipefail (the kcov job has no shell:bash default), so a failing `kcov ... | tee` returned tee's 0. Added `set -euo pipefail`. - Dropped an inaccurate "+ sweep" from the canary step comment (the canary reads only GCL_TEST_FULL, not the Axis-A sweep). The kcov job dispatches only from the default branch, so this path is exercised post-merge; verified statically against the kcov v43 output layout. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ribution) Round-2 review caught the canary extraction didn't update the user-facing docs: - docs/git-commit-lock.md: add the canary row to the suite table + run-commands; "three"->"four" (temp dirs / CI / log-copy); and move the "mutual exclusion under many concurrent workers" description from the unit suite to a new canary paragraph (only Test 1 moved — crash-recovery/claim-contention stay in the unit suite and remain in its list). Also note the canary's counts are exact. - README.md: "Three suites" -> "Four suites" (+ the bash concurrency canary). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ash-recovery) Round-3 review catch: the canary suite TABLE ROW + RUN-COMMAND comment claimed the canary also covers "crash-recovery / claim-serialization", but those (unit Tests 2b/20) stayed in the unit suite — the canary file contains only Test 1 (mutual exclusion over repeated rounds + its mtime-probe-warning guard). Align the table row + run comment with the (already-correct) detailed paragraph. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…aim the canary Round-4 review catches (all split-introduced, swept the whole class): - tests/git-commit-lock.test.sh: Test 1 moved to the canary file, but the suite still computed orphaned T1_ROUNDS/T1_N and its `fan-out mode:` echo + header advertised "the full-strength 8x25 canary". Removed the dead T1_* vars; the mode-echo now reports the suite's actual heavy fan-out (Test 2b rounds, Test 20 workers); header/echo say "full-strength fan-out", not "canary". - deep-sweep.yml: "Mirrors the tests.yml 4-cell set" -> tests.yml now has canary cells; reworded to "per-OS legs" + noted the canary runs as a step here. - nightly.yml: clarified the leg comment (all/unit/interop-integration as in tests.yml; the canary is a step here, not a leg). bash -n + shellcheck + actionlint clean; mode-echo verified. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Round-5 review catches (comment-only, no behaviour change): - nightly.yml kcov-run comment said kcov "merges into the top-level cobertura.xml" — contradicts the round-2-corrected enforcement comment. kcov writes the union to coverage/kcov-out/kcov-merged/cobertura.xml (no top-level file); the enforcement step selects it by highest lines-covered. Comment made consistent. - deep-sweep.yml loop comment claimed "set -e is NOT in effect (default bash here)" — but deep-sweep sets shell: bash (-eo pipefail), so set -e + pipefail ARE on; a failing pipeline already trips the step. Reworded: the explicit PIPESTATUS check is a defensive backstop that names the failing iteration. (Pre-existing Bucket-6d wording the canary loop shares; behaviour was always correct.) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…; Test 42 xref) Comprehensive sweep of all touched files for stale post-split references (one had surfaced per review round); fixed the two genuine remainders (comment-only): - nightly.yml kcov section header said "unit suite at FULL" — the job now runs unit + canary under kcov (the step name already says so). -> "unit + canary at FULL". - tests/git-commit-lock.test.sh Test 42 comment cross-referenced "Test 1" (which a reader would seek in the unit suite) — Test 1 moved to the canary file. Clarified: "the concurrency canary (formerly Test 1, now git-commit-lock.canary.test.sh)". Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
From the Phase-4 area reviews (A unit+canary, C CI):
- A1: Test 42 `grep -c … || echo 0` appended a 2nd "0" on zero matches ("0\n0"),
garbling the warn-once assertion — drop `|| echo 0` (grep -c prints 0 itself;
`${:-0}` covers the missing-file case). Only bit the zero-firings case (not a
false-green), now a clean integer. Suite still 313/0.
- B8: reflow an awkward comment wrap in tests/_harness.sh (cosmetic).
- C1: nightly kcov job inherited the workflow-level GCL_ENVELOPE_TIER=relax despite
its "strict" intent — set GCL_ENVELOPE_TIER=strict on the kcov step (a true clean
coverage run; the floor enforces the envelope assertions).
- C2: nightly-triage.sh classified ANY job failure as `nightly-correctness`, so a
step TIMEOUT (no ^FAIL: line) misclassified as correctness — correctness now
requires a real ^FAIL:; failure/timeout/cancel without one falls to `nightly-infra`.
- C3: add tests/with-load.sh to the required tests.yml shellcheck gate (nightly/
deep-sweep depend on it; it was unlinted per-PR).
- Drop a stale `steering-coverage.md` reference in a Test 46 comment.
shellcheck -S style + actionlint clean; unit suite 313/0; triage dry-run confirms
timeout→infra.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
From the Phase-4 docs review (D) + Ben's merge-scope decision: - D1: guarantees.md still said "coverage planned" for BE-3/BE-4/G-S2 resource lanes that are now TESTED + passing — flip to cite Tests 42/48/49/50 (F3 stays document-only), reconciling with failure-modes.md §2 (flipped earlier in 309cf39). - D2: re-attribute the concurrency canary from "unit Test 1" to tests/git-commit-lock.canary.test.sh (the branch's central change) in guarantees.md (G-S3) + failure-modes.md (A1); add a `C:` witness legend. Crash-recovery/ claim-contention stay unit Tests 2b/20. - guarantees.md OOS-5: the bash `exec` §H4 lane is no longer "a coverage gap" — cite unit Test 40 (the exec-bypass boundary test); drop the steering-coverage.md ref. - load-testing-strategy.md: rewritten from a stale "RECOMMENDATION / not-built-yet" plan into a present-tense rationale for why the CI is shaped as it is (Ben's call: must describe current state, not read as a stale plan). 346 -> 236 lines. - steering-coverage.md: DELETED. It was a point-in-time gap list whose every gap is filled (Tests 37-50) + had a broken .plans/ link — a finished working artifact, not main-worthy (Ben's decision). The kcov baseline lives in the nightly kcov gate. - Fold in 2 parallel stale README-notes in guarantees/failure-modes (the notes the README now carries). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…er disclaimer Round-2 re-review (tests + CI clean; docs Codex found 3): - deep-sweep.yml referenced docs/load-testing-strategy.md "§9", a section the rewrite removed (Tier D is now under §3) — point to the doc generically. - guarantees.md lacked the line-number disclaimer its sibling failure-modes.md carries; the branch's interop/unit reformatting drifted some `file:line` anchors (the test NAMES/NUMBERS stay correct). Added the "anchors, not exact addresses" disclaimer rather than re-derive every anchor (a transient, re-drifting fix the disclaimer exists to avoid; reviewer D round-1 judged the drift non-blocking). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…es.md Round-3 docs re-review (Codex) found the E2 row + §4 items 1-2 still framing the load/timing envelope and single-clock docs as future recommendations, though both are implemented (git-commit-lock.md "operating envelope" / "One time source"; load-testing-strategy.md §1). Fixed the whole class so the doc describes current state — verified each target exists before de-flagging: - E2 + K1 table rows: drop "not addressed in docs" / "UNDER-documented" / "Define the envelope" -> state it's documented. - §E2 detail and §K "Net K": "Recommend: document X" -> present-tense "documented (where)". - §4 items 1/2/3: add *Status (done):* markers (matching item 5's style) and neutralize the two imperative headings. - §B1 latency-bound, §E3, §H4 "Recommend: document" -> "documented (where)" (targets verified: guarantees.md BE-3 for E3, OOS-5/G-S1 for H4, §K for B1). Tests + CI legs were clean in round 2 and are untouched here (docs-only round). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…tion anchors Round-4 docs re-review (Codex) flagged a cross-repo ref and stale plan vocabulary; swept the whole class so the merged tree is self-contained (.plans/ + AGENTS.md are deleted at merge, so any reference to their content would dangle on main): - failure-modes.md: drop the cross-repo `agents/600-claude.md` citation; the "§4.5"/"§4.1" anchors meant "§4 item 5"/"§4 item 1" (no such subsections exist) — fixed across the doc. - guarantees.md: same "§4.5"/"§4.1" fix; drop "Bucket 4 / D-c" plan vocabulary. - deep-sweep.yml / nightly-triage.sh / nightly.yml: drop "Phase-2 build plan", "Bucket 6 spec/design", "Bucket-2 Tier-A" — the decisions they cited are stated inline. - tests (test.sh / interop.test.sh / _harness.sh): "Bucket 4 / D-c" -> "see failure-modes.md §K / §4 item 1"; "Bucket 6" -> "see load-testing-strategy.md"; "failure-modes.md §4.5" -> "§4 item 5". Comment-only; bash -n clean. Validated: actionlint (deep-sweep, nightly), shellcheck (nightly-triage), bash -n (all edited shell files). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… triage header Round-5 docs re-review (Codex), 4 findings, all verified against the code: 1. G-S4 / D3 claim socket+device coverage but their witness lists cited only Tests 17/17d/18/22 — added Test 44 (the socket & device-node arms of the wrong-type classifier) to the G-S4 prose witness, the §7 verification-map row, and the D3 table row + detail. 2. load-testing-strategy.md cited `guarantees.md §E`, which does not exist (guarantees.md is §1-§7); the correctness-rests-on-structure point is §2A. Fixed. 3. failure-modes.md "Sources of truth" listed only three test suites — added the canary suite (the doc later uses `C Test 1` as a witness). 4. nightly-triage.sh header was stale: it claimed a CONCLUSIONS JSON file (the code reads per-cell cell-conclusion.txt) and that a `failure` conclusion is correctness (the code classifies failure-without-^FAIL: as infra). Header now matches the implementation. Validated: shellcheck + bash -n (nightly-triage.sh). Docs are markdown. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Round-6 docs re-review (Codex): all named docs clean; one finding — the GENERATED correctness-issue body still said "a FAIL: assertion and/or a cell job concluded failure", contradicting the classifier (a failed/timed-out cell WITHOUT ^FAIL: is infra, not correctness). The correctness body now cites only a ^FAIL: line; the infra body's list is completed to name the failure-without-FAIL case (same class, fixed pre-emptively). Validated: shellcheck + bash -n. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…on main) ci-stress was the CI flakiness stress-hunt campaign; its agent-workflow scaffolding is branch-local, not part of the product: - AGENTS.md — the flake-hunt runbook for this campaign. - .plans/ (9 files) — the zformal plan + subplans (full history preserved in git). - .agent/.gitkeep + the /.agent/* .gitignore entries — the per-worktree comment-review queue scaffolding. Kept: the product (git-commit-lock.sh/.ps1), install.sh, README, docs/, the four test suites + harness/load wrapper, the CI workflows, and the .agent-testing/ ignore (general testing convention). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Hardening pass on
git-commit-lock: CI stress-testing guarantees + coverage, test-harness ergonomics, and a Windows CI speedup. No behavioural change to the lock protocol itself beyond the mtime-floor / read-ladder robustness already onmain; this branch is mostly tests, CI, and the normative docs that pin what the tool guarantees.What's in it
Guarantees & failure-mode docs (new, normative)
docs/guarantees.md— the normative contract (safety / recovery / interop guarantees, failure semantics, out-of-scope, the operating envelope, and a §7 verification map: each guarantee → its witnessing test).docs/failure-modes.md— the failure-mode map (§A–§K) + the accepted scope-decision log.docs/load-testing-strategy.md— the rationale for why CI is shaped the way it is.Coverage (closes the reasoned-but-untested lanes)
3-tier CI
tests.yml(Tier R) — required, no-load, strict, per-PR; 7-cell matrix incl. the canary as its own parallel cell per arch.nightly.yml(Tier N) — background-load matrix (tests/with-load.sh, oversubscription ratio), kcov gate, idempotent issue triage (nightly-triage.sh).deep-sweep.yml(Tier D) — on-demand deep flake hunt (repeat × load).Harness ergonomics
GCL_TEST_ONLYsingle-test selector;GCL_ENVELOPE_TIERstrict/relax;GCL_TEST_SWEEPAxis-A waiter-count sweep.tests/_harness.shextracted (all four suites source it).Windows CI speedup
tests/git-commit-lock.canary.test.shand run as a parallel cell → ~50% faster Windows leg, simpler than the sharding approaches it replaced.Status
bash -n/ shellcheck / actionlint clean on all edited shell + workflow files.Merge notes
AGENTS.mdflake-hunt runbook,.plans/,.agentreview-queue scaffolding) — main carries only product / docs / tests / CI.🤖 Generated with Claude Code