From 980856bc1488e8266507579bac427910c7028093 Mon Sep 17 00:00:00 2001
From: Ben Toner <bentoner@bentoner.com>
Date: Tue, 16 Jun 2026 21:32:45 +1000
Subject: [PATCH 01/76] CI(stress-branch): unique-per-run concurrency group for
 parallel dispatch

Stress-test-only change: lets many workflow_dispatch runs execute in parallel on this
single branch without cancel-in-progress cancelling the earlier runs. Do NOT merge to main.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 .github/workflows/tests.yml | 8 ++++++--
 1 file changed, 6 insertions(+), 2 deletions(-)
diff --git a/.github/workflows/tests.yml b/.github/workflows/tests.yml
index 3d1424c..c9a99da 100644
--- a/.github/workflows/tests.yml
+++ b/.github/workflows/tests.yml
@@ -19,8 +19,12 @@ on:
   workflow_dispatch:
 
 concurrency:
-  group: ${{ github.workflow }}-${{ github.ref }}
-  cancel-in-progress: true
+  # STRESS-BRANCH ONLY — do NOT merge to main. The per-run-unique group + no
+  # cancellation lets many workflow_dispatch runs execute in parallel on this one
+  # branch (flakiness stress test). On main the group is
+  # `${{ github.workflow }}-${{ github.ref }}` with cancel-in-progress: true.
+  group: ${{ github.workflow }}-${{ github.ref }}-${{ github.run_id }}
+  cancel-in-progress: false
 
 permissions:
   contents: read

From 5dec024e18e572e442991b09a1fa98542ee6fc47 Mon Sep 17 00:00:00 2001
From: Ben Toner <bentoner@bentoner.com>
Date: Tue, 16 Jun 2026 22:33:45 +1000
Subject: [PATCH 02/76] Plan: de-flake Test 17d got97 assertion (CI stress
 find)

Diagnosis + Codex review: the windows-2025 unit flake is a timing-fragile self-validation
assertion, not a product bug. The plan replaces got97>=1 with rc-in-{0,97,98} plus a
WAITING-based anti-vacuity canary, and keeps the warn17d TOCTOU regression guard untouched.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 ...2026-06-16-ci-stress-test17d-flake-plan.md | 123 ++++++++++++++++++
 1 file changed, 123 insertions(+)
 create mode 100644 .plans/2026-06-16-ci-stress-test17d-flake-plan.md

diff --git a/.plans/2026-06-16-ci-stress-test17d-flake-plan.md b/.plans/2026-06-16-ci-stress-test17d-flake-plan.md
new file mode 100644
index 0000000..87b31dc
--- /dev/null
+++ b/.plans/2026-06-16-ci-stress-test17d-flake-plan.md
@@ -0,0 +1,123 @@
+# Plan: de-flake Test 17d (`got97 >= 1`) in the unit suite
+
+Status: DRAFT — awaiting review (Claude reviewer + Codex), then implement.
+
+## Reviewer notes (add at top; do not renumber)
+_(none yet)_
+
+## Context
+CI stress test (ci-stress branch, 2026-06-16): 29 identical green runs, then run
+27616343269 failed only on `windows-2025 (unit)` with one assertion in
+`tests/git-commit-lock.test.sh` Test 17d:
+
+```
+PASS: 12 waiters polled through churn with ZERO spurious non-lock warnings
+FAIL: no waiter reached 97 under churn (got97=0/12) — timeout lane bypassed?
+```
+
+Diagnosis (Claude subagent) + independent review (Codex) — both in
+`.agent-testing/failures/27616343269/{DIAGNOSIS.md,codex-diag-review.md}`:
+
+- **Root cause.** The Windows pwsh churner (`tests/git-commit-lock.test.sh:925-931`)
+  does `WriteAllText → Delete` with **no present-hold**, unlike the POSIX perl churner
+  which sleeps 2ms present each iteration (`:944-947`). On the loaded 2-core
+  windows-2025 VM, per-iteration pwsh/.NET overhead widened the *absent*
+  (Delete→next-Write) window past the 20ms poll interval, so all 12 waiters won an
+  ordinary `O_EXCL` create-race in an absent window (`git-commit-lock.sh:1323-1356`)
+  and exited rc 0 — none reached the `MAX_WAIT=2` timeout, so `got97=0`. Proof: every
+  waiter in `churn.log` carries its **own** `tok.<pid>...` token (not the churner's
+  `tok.churn.1.1`) and there are no steal/TIMEOUT lines; the leg ran 17d in 4.4s
+  (too short for twelve 2s timeouts).
+- **Classification: test-flake, not a product bug.** Acquiring during a genuinely
+  absent window is correct behavior. `got97 >= 1` is a *self-validation* guard (was
+  the timeout lane exercised?), not a product requirement. In this test shape rc ∈
+  {0 (create-win), 97 (timeout), 98 (churner overwrote the hold before release —
+  designed theft detection; present in this run, waiter 36836 / `t17d.3.3.err`)} are
+  **all** correct outcomes. Which one occurs is machine-speed luck.
+
+The real regression Test 17d guards — `warn17d == 0`, the per-poll non-lock-warning
+TOCTOU guard — PASSED and is untouched by this plan.
+
+## Goal
+Make Test 17d non-flaky across fast and slow runners **without weakening the
+`warn17d == 0` regression guard**, while keeping a real anti-vacuous-pass canary so a
+dead/absent churner can't let the test pass without exercising the guarded poll path.
+
+## Fix (replaces the single `got97 >= 1` assertion; keeps everything else)
+Within the `for r in 1 2 3` waiter loop, replace the `got97` accumulation and its
+assertion with three assertions:
+
+1. **Regression guard — unchanged.** `warn17d == 0` ("12 waiters polled through churn
+   with ZERO spurious non-lock warnings"). Keep verbatim.
+
+2. **Every waiter reaches a designed terminal state.** Accumulate each waiter's rc;
+   require all 12 ∈ {0, 97, 98}. Any other rc (crash, 96 config error, 99, …) ⇒ `bad`,
+   listing the offending `round.idx=rc`. This is *stricter* than the old test, which
+   ignored every rc except 97.
+
+3. **Anti-vacuity: contention actually happened (the guarded path ran).** Require
+   `grep -c 'WAITING for lock' "$LOG" >= 1`. `WAITING` is logged **only** after a
+   waiter's create was blocked by a present file (`git-commit-lock.sh:1363-1370`),
+   immediately before the per-poll type-guard loop (`:1388-1570`) that `warn17d`
+   guards — so ≥1 `WAITING` proves at least one waiter entered the exact path under
+   test. A dead/absent-only churner produces 0 `WAITING` and fails this. Threshold is
+   **≥1** (the weakest non-vacuous signal) to stay robust on absent-dominant runners;
+   the failing run already had 9 `WAITING` lines, so ≥1 has wide margin both ways.
+
+### Why ≥1 WAITING is robust (not a new flake)
+`WAITING` count is machine-dependent in the *opposite* direction to `got97`: a
+present-dominant (fast) runner blocks most waiters (lots of WAITING, got97 high); an
+absent-dominant (slow) runner lets waiters acquire (fewer WAITING, got97 low) — but
+even the worst observed case (this failure) still logged 9 WAITING. The only way to
+get 0 WAITING is no contention at all (churner never ran / always absent), which is
+exactly the vacuity we want to fail on. So ≥1 has margin on both ends; no threshold
+near the machine-variance band is introduced.
+
+### Secondary hardening (cheap, include if clean)
+- **Churner readiness proves churn began.** Today the start marker is written *before*
+  the loop (`:926`), so "started" doesn't prove a single cycle ran. Move the start-marker
+  write to *after* the churner's first successful write+delete cycle (both pwsh and perl
+  branches) so `wait_for_file "$START"` implies the churn loop is actually turning over.
+- **Churner alive at reap.** Capture `kill -0 "$churn_pid"` right before `touch "$STOP"`;
+  assert it was alive ⇒ catches a churner that crashed mid-test (another vacuity route).
+  This is non-flaky: the churner loops 2,000,000× and the test lasts ~4-6s, so it is
+  always alive at reap unless it actually crashed.
+
+If either hardening proves fiddly or risks its own flake, the plan's load-bearing fix
+is assertions 1-3 alone; the start-marker move and alive-check are defense-in-depth and
+can be dropped without losing the de-flake. (Decide during implementation; record in
+changelog.)
+
+## Observability (per logging practice)
+Keep the data that made this diagnosable: emit a `note:` line with the rc distribution
+and the WAITING count every run, e.g.
+`note: T17d outcomes rc0=$n0 rc97=$n97 rc98=$n98 other=$nother; WAITING=$waited` — so a
+future failure can be classified from the suite log without re-deriving it. (The old
+test discarded this.)
+
+## Out of scope / explicitly NOT changed
+- The `warn17d`/TOCTOU regression logic and its assertion.
+- The churner shapes' core (pwsh on Windows, perl elsewhere) beyond the start-marker move.
+- Product code (`git-commit-lock.sh`) — no product defect found.
+- The `.ps1` port and other suites — Test 17d is bash-unit-only.
+
+## Testing
+1. **Static:** `bash -n tests/git-commit-lock.test.sh`; shellcheck `-S style` (the CI
+   lint gate) on the test file — must stay clean.
+2. **Local sanity (Windows, this box):** run Test 17d in isolation a handful of times via
+   the suite's single-test selector if present, else the whole unit suite once, in
+   `.agent-testing/` — confirm it passes and the new `note:` line shows a sane rc/WAITING
+   mix. (Local box is faster/less loaded, so it will likely be present-dominant — expect
+   high got97; that's fine, the test no longer asserts on it.)
+3. **Real proof = CI stress.** The genuine signal is the GitHub windows-2025 (unit) leg
+   under load. After implementing, resume the stress driver (streak reset to 0) and
+   require the previously-flaky path to survive the run to 50 clean. If 17d flakes again
+   we re-open.
+
+## Rollout
+Commit the test fix to `ci-stress` (under the git commit lock). This is a normal,
+mergeable fix (unlike the stress-only concurrency commit 980856b). Reset
+`clean_count`, relaunch the driver, continue toward 50 clean in a row.
+
+## Changelog (implementation)
+_(to be appended during implementation)_

From 9f76c292f117d26df8ebec2d4b366c0e961e46ff Mon Sep 17 00:00:00 2001
From: Ben Toner <bentoner@bentoner.com>
Date: Tue, 16 Jun 2026 22:48:20 +1000
Subject: [PATCH 03/76] Plan v2: address review round 1 (rc-set {0,1,97,98};
 drop-free WAITING canary)

Round-1 review (Claude + Codex, verified in code): rc-set must include 1 (lock_run
demotes a clean command to 1 on unverifiable-empty release); WAITING canary must read
per-waiter logs, not the shared churn.log (concurrent appends drop lines). Secondary
hardenings dropped. See reviewer notes at top of plan.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 ...2026-06-16-ci-stress-test17d-flake-plan.md | 87 +++++++++++++------
 1 file changed, 60 insertions(+), 27 deletions(-)
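Reviewer note (below the `---` cut line, so not part of the commit message): a minimal
sketch of the drop-free WAITING count from reviewer note 2, under stated assumptions.
Only the `t17d.$r.$i.log` naming, the `cat ... > "$LOG"` rebuild, and the
`WAITING for lock` pattern come from the plan. The helper name `count_waiting`, the
temp directory, and the sample log lines are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: count_waiting and the sample log lines are hypothetical.
set -uo pipefail

# Rebuild one consolidated log from single-writer per-waiter logs, then
# print how many lines mention 'WAITING for lock'.
count_waiting() {
  local work="$1" log="$2"
  # A redirect, not a pipe; tolerate an empty glob (no per-waiter logs yet).
  cat "$work"/t17d.*.log > "$log" 2>/dev/null || :
  # grep -c prints the count and exits 1 on zero matches; '|| :' keeps that harmless.
  grep -c 'WAITING for lock' "$log" || :
}

work="$(mktemp -d)"
printf 'WAITING for lock (held)\nacquired\n' > "$work/t17d.1.1.log"
printf 'acquired\n'                          > "$work/t17d.1.2.log"
printf 'WAITING for lock (held)\n'           > "$work/t17d.2.1.log"
waited="$(count_waiting "$work" "$work/churn.log")"
echo "WAITING=$waited"
rm -rf "$work"
```

Each per-waiter file has exactly one writer, so the count does not depend on how
concurrent appends to a shared file behave on the runner.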

diff --git a/.plans/2026-06-16-ci-stress-test17d-flake-plan.md b/.plans/2026-06-16-ci-stress-test17d-flake-plan.md
index 87b31dc..c5fcbe7 100644
--- a/.plans/2026-06-16-ci-stress-test17d-flake-plan.md
+++ b/.plans/2026-06-16-ci-stress-test17d-flake-plan.md
@@ -3,7 +3,33 @@
 Status: DRAFT — awaiting review (Claude reviewer + Codex), then implement.
 
 ## Reviewer notes (add at top; do not renumber)
-_(none yet)_
+Round 1 — fresh Claude reviewer + Codex (both independent), findings verified by me
+against the product code:
+
+1. **[BLOCKING — fixed in plan v2] rc-set `{0,97,98}` is not exhaustive of correct
+   outcomes → must be `{0,1,97,98}`.** Under this churn a clean `true` whose release
+   reads the held lock EMPTY (the churner's create→write window) gets release rc 2,
+   which `lock_run` maps to **rc 1** (`git-commit-lock.sh:1739-1744`). rc 1 is the
+   documented "ownership unverifiable, successful command demoted" outcome — correct,
+   not a defect. Verified. The original `{0,97,98}` was the *same class* of
+   timing-fragile assumption as the bug being fixed. Fixed below.
+2. **[BLOCKING — fixed in plan v2] the `WAITING` canary must not read the SHARED log.**
+   Plan v1 grepped `WAITING` from the single shared `churn.log` (line 916), but the
+   suite itself documents `# per-waiter logs: concurrent appends to one log drop lines`
+   (`tests/git-commit-lock.test.sh:258`) and uses per-waiter logs elsewhere for exactly
+   this reason. A shared-log `WAITING` count can under-count under concurrency and the
+   canary would itself flake. Fixed: give each waiter its OWN `AGENT_LOCK_LOG`
+   (single-writer ⇒ drop-free), count `WAITING` across those, and concatenate them into
+   `churn.log` afterwards so the preserved artifact is unchanged.
+3. **[disposition] Secondary hardenings DROPPED.** Reviewers flagged the
+   start-marker-after-first-cycle and alive-at-reap hardenings as needing care (the
+   alive check can false-fail if the churner's iteration cap is ever hit; both add
+   machinery to a delicate timing path). They are also largely redundant with the
+   drop-free `WAITING>=1` canary, which already proves the churner produced contention.
+   To keep the change minimal and the timing path untouched, v2 drops both. The
+   load-bearing fix is assertions 1-3.
+4. **[non-blocking, adopted] observability buckets** updated to `rc0/rc1/rc97/rc98/other`
+   and emitted unconditionally (pass and fail), so a drift toward an edge is visible.
 
 ## Context
 CI stress test (ci-stress branch, 2026-06-16): 29 identical green runs, then run
@@ -43,20 +69,33 @@ Make Test 17d non-flaky across fast and slow runners **without weakening the
 `warn17d == 0` regression guard**, while keeping a real anti-vacuous-pass canary so a
 dead/absent churner can't let the test pass without exercising the guarded poll path.
 
-## Fix (replaces the single `got97 >= 1` assertion; keeps everything else)
-Within the `for r in 1 2 3` waiter loop, replace the `got97` accumulation and its
-assertion with three assertions:
+## Fix (v2) — replaces the single `got97 >= 1` assertion; keeps everything else
+**Structural A — per-waiter lock logs (drop-free).** Today all 12 waiters share
+`AGENT_LOCK_LOG="$LOG"` (`$LOG=churn.log`, line 916). Change each waiter to its OWN log
+`AGENT_LOCK_LOG="$WORK/t17d.$r.$i.log"` (the churner writes only the lock *file*, never
+the log, so per-waiter logs lose nothing). After the 3 rounds,
+`cat "$WORK"/t17d.*.log > "$LOG"` to rebuild the consolidated `churn.log` artifact.
+`warn17d` is unaffected — it greps the per-waiter `.err` STDERR files, not the log.
+
+Then replace the `got97` accumulation + its assertion with three assertions:
 
 1. **Regression guard — unchanged.** `warn17d == 0` ("12 waiters polled through churn
    with ZERO spurious non-lock warnings"). Keep verbatim.
 
 2. **Every waiter reaches a designed terminal state.** Accumulate each waiter's rc;
-   require all 12 ∈ {0, 97, 98}. Any other rc (crash, 96 config error, 99, …) ⇒ `bad`,
-   listing the offending `round.idx=rc`. This is *stricter* than the old test, which
-   ignored every rc except 97.
+   require all 12 ∈ **{0, 1, 97, 98}**. For `bash -c 'true'` under this churn: `0`
+   acquired+clean release; `1` acquired but release read the held lock EMPTY (churner's
+   create→write window) ⇒ release rc 2 ⇒ `lock_run` demotes the clean command to 1
+   (`git-commit-lock.sh:1739-1744`), ownership-unverifiable/correct; `97` timed out;
+   `98` churner overwrote the hold before release (designed theft detection). Any OTHER
+   rc (crash/139, 96 config error, 99, …) ⇒ `bad`, listing the offending `round.idx=rc`.
+   Stricter than the old test (which ignored every rc but 97) and is the real new
+   product-regression check. Comment must name why rc 1 is correct so a successor does
+   not "tighten" the set back and re-introduce the flake.
 
 3. **Anti-vacuity: contention actually happened (the guarded path ran).** Require
-   `grep -c 'WAITING for lock' "$LOG" >= 1`. `WAITING` is logged **only** after a
+   `cat "$WORK"/t17d.*.log | grep -c 'WAITING for lock' >= 1` (counted from the
+   single-writer per-waiter logs ⇒ drop-free; see reviewer note 2). `WAITING` is logged **only** after a
    waiter's create was blocked by a present file (`git-commit-lock.sh:1363-1370`),
    immediately before the per-poll type-guard loop (`:1388-1570`) that `warn17d`
    guards — so ≥1 `WAITING` proves at least one waiter entered the exact path under
@@ -73,31 +112,25 @@ get 0 WAITING is no contention at all (churner never ran / always absent), which
 exactly the vacuity we want to fail on. So ≥1 has margin on both ends; no threshold
 near the machine-variance band is introduced.
 
-### Secondary hardening (cheap, include if clean)
-- **Churner readiness proves churn began.** Today the start marker is written *before*
-  the loop (`:926`), so "started" doesn't prove a single cycle ran. Move the start-marker
-  write to *after* the churner's first successful write+delete cycle (both pwsh and perl
-  branches) so `wait_for_file "$START"` implies the churn loop is actually turning over.
-- **Churner alive at reap.** Capture `kill -0 "$churn_pid"` right before `touch "$STOP"`;
-  assert it was alive ⇒ catches a churner that crashed mid-test (another vacuity route).
-  This is non-flaky: the churner loops 2,000,000× and the test lasts ~4-6s, so it is
-  always alive at reap unless it actually crashed.
-
-If either hardening proves fiddly or risks its own flake, the plan's load-bearing fix
-is assertions 1-3 alone; the start-marker move and alive-check are defense-in-depth and
-can be dropped without losing the de-flake. (Decide during implementation; record in
-changelog.)
+### Secondary hardening — DROPPED (reviewer note 3)
+v1 proposed two extra hardenings (move the start-marker after the churner's first
+write+delete cycle; assert the churner is alive at reap). Both are dropped in v2: they
+add machinery to a delicate timing path, the alive-check can false-fail if the churner's
+iteration cap is ever hit, and both are largely redundant with the drop-free
+`WAITING>=1` canary (which already proves the churner produced real contention — a
+waiter can only log `WAITING` if the churner had the lock file present). The
+load-bearing fix is the per-waiter logs + assertions 1-3.
 
 ## Observability (per logging practice)
 Keep the data that made this diagnosable: emit a `note:` line with the rc distribution
-and the WAITING count every run, e.g.
-`note: T17d outcomes rc0=$n0 rc97=$n97 rc98=$n98 other=$nother; WAITING=$waited` — so a
-future failure can be classified from the suite log without re-deriving it. (The old
-test discarded this.)
+and the WAITING count **unconditionally** (both pass and fail paths), e.g.
+`note: T17d outcomes rc0=$n0 rc1=$n1 rc97=$n97 rc98=$n98 other=$nother; WAITING=$waited`
+— so a future failure (or a pass drifting toward an edge) can be classified from the
+suite log without re-deriving it. (The old test discarded this.)
 
 ## Out of scope / explicitly NOT changed
 - The `warn17d`/TOCTOU regression logic and its assertion.
-- The churner shapes' core (pwsh on Windows, perl elsewhere) beyond the start-marker move.
+- The churner shapes' core (pwsh on Windows, perl elsewhere) — unchanged in v2.
 - Product code (`git-commit-lock.sh`) — no product defect found.
 - The `.ps1` port and other suites — Test 17d is bash-unit-only.
 

From 58c3741b5af4c6986f49eec9985785982a8ab46c Mon Sep 17 00:00:00 2001
From: Ben Toner <bentoner@bentoner.com>
Date: Tue, 16 Jun 2026 23:18:32 +1000
Subject: [PATCH 04/76] Test 17d: de-flake got97>=1 (CI stress find on
 windows-2025 unit)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The got97>=1 self-validation assertion was timing-fragile: on a loaded 2-core Windows
runner all 12 waiters won create-races in the churner's absent windows (rc 0) instead of
timing out, so got97=0 and the test failed even though the product was correct (run
27616343269; 29 prior identical runs were green).

Replace it (diagnosis + Codex review confirmed test-flake, not a product bug):
  - per-waiter AGENT_LOCK_LOG (single-writer => drop-free; shared logs drop lines
    under concurrent appends), rebuilt into churn.log for the artifact;
  - assert all 12 waiters reach a DESIGNED terminal rc in {0,1,97,98} (rc 1 = clean
    command demoted on unverifiable-empty release) — catches real product regressions;
  - anti-vacuity: require >=1 WAITING line (proves the churn produced real contention
    and the guarded per-poll type-guard path ran);
  - unconditional note: rc distribution + WAITING count for future triage.
The warn17d==0 TOCTOU regression guard is unchanged.

Local: unit suite 214/0; shellcheck -S style (v0.11.0) + bash -n clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 ...2026-06-16-ci-stress-test17d-flake-plan.md | 34 ++++++++++++++-
 tests/git-commit-lock.test.sh                 | 42 ++++++++++++++++---
 2 files changed, 69 insertions(+), 7 deletions(-)
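Reviewer note (below the `---` cut line, so not part of the commit message): a
standalone sketch of the rc bucketing that the second assertion relies on, assuming
the designed set {0,1,97,98} described above. The helper name `classify_rcs`, the
positional offender index, and the sample rc lists are hypothetical; the real test
labels offenders as `round.idx=rc`.

```shell
#!/usr/bin/env bash
# Sketch only: classify_rcs is a hypothetical helper; the rc lists are made-up inputs.
set -uo pipefail

# Bucket waiter exit codes into the designed set {0,1,97,98}; anything else is
# recorded as an offender. Prints one summary line; returns 0 only if no offenders.
classify_rcs() {
  local n0=0 n1=0 n97=0 n98=0 nother=0 rc_bad="" idx=0 rc
  for rc in "$@"; do
    idx=$((idx+1))
    case "$rc" in
      0)  n0=$((n0+1)) ;;
      1)  n1=$((n1+1)) ;;
      97) n97=$((n97+1)) ;;
      98) n98=$((n98+1)) ;;
      *)  nother=$((nother+1)); rc_bad="$rc_bad $idx=$rc" ;;
    esac
  done
  echo "rc0=$n0 rc1=$n1 rc97=$n97 rc98=$n98 other=$nother;$rc_bad"
  [ "$nother" = 0 ]
}

ok_line="$(classify_rcs 0 0 97 98 1 0)"; ok_status=$?
bad_line="$(classify_rcs 0 139 97 96)";  bad_status=$?
echo "$ok_line (status $ok_status)"
echo "$bad_line (status $bad_status)"
```

Any mix of the four designed codes passes, so the check does not depend on which
outcome a given runner's speed happens to favor; only an undesigned code such as 139
or 96 fails it.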

diff --git a/.plans/2026-06-16-ci-stress-test17d-flake-plan.md b/.plans/2026-06-16-ci-stress-test17d-flake-plan.md
index c5fcbe7..c2f4bb8 100644
--- a/.plans/2026-06-16-ci-stress-test17d-flake-plan.md
+++ b/.plans/2026-06-16-ci-stress-test17d-flake-plan.md
@@ -1,6 +1,7 @@
 # Plan: de-flake Test 17d (`got97 >= 1`) in the unit suite
 
-Status: DRAFT — awaiting review (Claude reviewer + Codex), then implement.
+Status: **DONE** (implemented + reviewed clean by Claude and Codex; local unit suite
+214/0; awaiting CI-stress confirmation toward 50 clean in a row).
 
 ## Reviewer notes (add at top; do not renumber)
 Round 1 — fresh Claude reviewer + Codex (both independent), findings verified by me
@@ -31,6 +32,17 @@ against the product code:
 4. **[non-blocking, adopted] observability buckets** updated to `rc0/rc1/rc97/rc98/other`
    and emitted unconditionally (pass and fail), so a drift toward an edge is visible.
 
+Round 2 — confirming review (fresh Claude + Codex, both independent): **CONVERGED, ok to
+implement.** Both verified against the product code that the rc-set {0,1,97,98} is
+exhaustive and tight (release rc 2 is remapped to 1, never leaks; acquire exposes only
+0/97; reentrant-1 unreachable from a fresh CLI process), per-waiter `AGENT_LOCK_LOG`
+auto-creates and breaks nothing, and `WAITING>=1` is a sound non-flaky floor. Two
+implementation reminders adopted: (a) `bad` is a function — name the "other" rc bucket
+something else (e.g. `nother`) and an offenders string; (b) avoid `cat … | grep -c`
+(ShellCheck SC2002 fires at the CI style gate). Resolution for (b): rebuild churn.log via
+`cat "$WORK"/t17d.*.log > "$LOG"` (a redirect, not a pipe — no SC2002), then
+`grep -c 'WAITING for lock' "$LOG"` on the single rebuilt file.
+
 ## Context
 CI stress test (ci-stress branch, 2026-06-16): 29 identical green runs, then run
 27616343269 failed only on `windows-2025 (unit)` with one assertion in
@@ -153,4 +165,22 @@ mergeable fix (unlike the stress-only concurrency commit 980856b). Reset
 `clean_count`, relaunch the driver, continue toward 50 clean in a row.
 
 ## Changelog (implementation)
-_(to be appended during implementation)_
+- Implemented exactly the Fix v2 design in `tests/git-commit-lock.test.sh` Test 17d
+  (the `if wait_for_file "$START" 60` block): per-waiter `AGENT_LOCK_LOG`, rc `case`
+  bucketing into `n0/n1/n97/n98/nother` + `rc_bad` offender list, `cat glob > "$LOG"`
+  rebuild, `grep -c 'WAITING for lock' "$LOG"` count, unconditional `note:` line, and
+  the three assertions (warn17d==0 kept verbatim; rc∈{0,1,97,98}; WAITING>=1). Removed
+  `got97`. No product code or other test touched.
+- Static: `bash -n` clean; `shellcheck -S style` v0.11.0 (the CI-pinned gate version)
+  clean.
+- Local run (Windows, this box, REDUCED fan-out — Test 17d is not fan-out-gated so it
+  runs identically): full unit suite **214 passed / 0 failed**. Test 17d emitted
+  `note: T17d outcomes rc0=0 rc1=0 rc97=12 rc98=0 other=0; WAITING=12` and all three
+  assertions PASS. (Idle box ⇒ present-dominant ⇒ all 12 timed out at 97 — the opposite
+  extreme to the CI failure's rc0-heavy distribution; both now accepted.)
+- Implementation review: fresh Claude reviewer — "IMPLEMENTATION OK" (confirmed
+  set -uo pipefail / no errexit so `grep -c` exit-1 is harmless; empty-glob rebuild
+  handled; no `bad`/`rc_bad` collision; `warn17d` guard intact). Codex
+  `exec review --uncommitted` — no blocking bug. Both in `.agent-testing/`.
+- Real proof pending: the windows-2025 (unit) leg under CI load. Resuming the stress
+  driver with the streak reset to 0.
diff --git a/tests/git-commit-lock.test.sh b/tests/git-commit-lock.test.sh
index 021ea22..57265a9 100755
--- a/tests/git-commit-lock.test.sh
+++ b/tests/git-commit-lock.test.sh
@@ -962,26 +962,58 @@ if [ -n "$churn_pid" ]; then
   # never churned, so bash sees it reliably. Budget 60s: pwsh cold start on
   # a loaded box can take >15s.
   if wait_for_file "$START" 60; then
-    warn17d=0; got97=0
+    # Per-waiter lock logs (single-writer => drop-free): a SHARED log drops lines
+    # under concurrent appends (cf. the per-waiter logs at Test 2B), which would make
+    # the WAITING anti-vacuity count below unreliable. Rebuilt into $LOG after the runs.
+    warn17d=0; n0=0; n1=0; n97=0; n98=0; nother=0; rc_bad=""
     for r in 1 2 3; do
       pids=()
       for i in 1 2 3 4; do
-        AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" AGENT_LOCK_STALE_SECS=300 \
+        AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$WORK/t17d.$r.$i.log" AGENT_LOCK_STALE_SECS=300 \
           AGENT_LOCK_POLL_SECS=0.02 AGENT_LOCK_MAX_WAIT=2 \
           bash "$LIB" run -- bash -c 'true' 2> "$WORK/t17d.$r.$i.err" &
         pids+=($!)
       done
       for i in 1 2 3 4; do
         wait "${pids[$((i-1))]}"; rc=$?
-        [ "$rc" = 97 ] && got97=$((got97+1))
+        # A CLEAN command ('true') under this churn has exactly FOUR correct terminal
+        # codes — do NOT tighten this set: rc 1 is the real catch that made the old
+        # got97>=1 assertion flaky (see the Test 17d de-flake plan).
+        #   0  acquired in an absent window, clean release
+        #   1  acquired, but release read the held lock EMPTY (the churner's
+        #      create->write window) -> release rc 2 -> lock_run demotes the clean
+        #      command to 1 (ownership unverifiable; correct, not a defect)
+        #   97 never won an absent window within MAX_WAIT -> timed out
+        #   98 churner overwrote the hold before release -> designed theft detection
+        case "$rc" in
+          0)  n0=$((n0+1)) ;;
+          1)  n1=$((n1+1)) ;;
+          97) n97=$((n97+1)) ;;
+          98) n98=$((n98+1)) ;;
+          *)  nother=$((nother+1)); rc_bad="$rc_bad $r.$i=$rc" ;;
+        esac
         n="$(grep -c 'is not a lock file' "$WORK/t17d.$r.$i.err")"
         warn17d=$((warn17d+n))
       done
     done
+    # Rebuild the consolidated churn.log artifact from the drop-free per-waiter logs.
+    # 'cat glob > file' is a redirect, not a pipe (no SC2002); then count WAITING from
+    # the single rebuilt file.
+    cat "$WORK"/t17d.*.log > "$LOG" 2>/dev/null || :
+    waited="$(grep -c 'WAITING for lock' "$LOG")"
+    echo "note: T17d outcomes rc0=$n0 rc1=$n1 rc97=$n97 rc98=$n98 other=$nother; WAITING=$waited"
     [ "$warn17d" = 0 ] && ok "12 waiters polled through churn with ZERO spurious non-lock warnings" \
                        || bad "churned regular file fired $warn17d non-lock warning(s) — per-poll guard TOCTOU regression!"
-    [ "$got97" -ge 1 ] && ok "waiters still timed out at 97 under churn ($got97/12)" \
-                       || bad "no waiter reached 97 under churn (got97=$got97/12) — timeout lane bypassed?"
+    # Replaces the old got97>=1 assertion (timeout is only ONE of the correct outcomes;
+    # which one occurs is machine-speed luck). Assert each waiter reached a DESIGNED
+    # terminal state instead — catches a real product regression (crash/139, 96, …).
+    [ "$nother" = 0 ] && ok "all 12 waiters reached a designed terminal state (rc in {0,1,97,98})" \
+                      || bad "waiter(s) hit an undesigned rc under churn:$rc_bad (rc0=$n0 rc1=$n1 rc97=$n97 rc98=$n98)"
+    # Anti-vacuity: WAITING is logged only after a create was blocked by a PRESENT lock,
+    # immediately before the per-poll type guard that warn17d guards — so >=1 proves the
+    # churn produced real contention and the guarded path ran. 0 => dead/absent churner.
+    [ "$waited" -ge 1 ] && ok "churn exercised the blocked-poll type-guard lane ($waited WAITING line(s))" \
+                        || bad "no WAITING logged under churn — contention never happened; test ran vacuously"
   else
     bad "T17d churner never signalled its start marker"
     echo "  diag: churner pid=$churn_pid alive=$(kill -0 "$churn_pid" 2>/dev/null && echo yes || echo no)"

From b430d739e96a7b913908cb6de306f0b869c58f53 Mon Sep 17 00:00:00 2001
From: Ben Toner <bentoner@bentoner.com>
Date: Tue, 16 Jun 2026 23:37:23 +1000
Subject: [PATCH 05/76] CI(stress-branch): add CPU/disk load wrapper to surface
 timing flakes

STRESS-BRANCH ONLY (do not merge). tests/with-load.sh runs each suite while N CPU
spin-loops and/or N disk create/write+fsync/delete loops saturate the runner, widening
the timing windows that latency/race flakes depend on (the Test 17d absent window is
driven by both CPU descheduling and slow file IO). Selected via new workflow_dispatch
inputs stress_kind (none|cpu|disk|both, default both) and stress_load (blank = core
count); both are empty on push/schedule, so the kind falls back to none. Step/job
timeouts are raised so load slowness does not trip a timeout and look like a flake.
Hogs are reaped by exact PID (never by name).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 .github/workflows/tests.yml | 34 +++++++++++------
 tests/with-load.sh          | 73 +++++++++++++++++++++++++++++++++++++
 2 files changed, 96 insertions(+), 11 deletions(-)
 create mode 100644 tests/with-load.sh
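Reviewer note (below the `---` cut line, so not part of the commit message): a minimal
sketch of the "reap by exact PID" idea, assuming CPU hogs are plain background
busy-loops. This is not the contents of tests/with-load.sh. The function name
`run_with_cpu_hogs` and the `HOG_PIDS` variable are hypothetical and exist only to
illustrate the technique.

```shell
#!/usr/bin/env bash
# Sketch only: NOT tests/with-load.sh; run_with_cpu_hogs and HOG_PIDS are hypothetical.
set -uo pipefail

# Run a command while N background busy-loops compete for CPU, then stop
# exactly the PIDs we started (never by process name).
run_with_cpu_hogs() {
  local n="$1"; shift
  local hogs=() pid rc i
  for ((i = 0; i < n; i++)); do
    ( while :; do :; done ) &
    hogs+=("$!")
  done
  "$@"; rc=$?
  for pid in "${hogs[@]}"; do
    kill "$pid" 2>/dev/null
    wait "$pid" 2>/dev/null
  done
  HOG_PIDS="${hogs[*]}"   # exposed only so the caller can verify the reap
  return "$rc"
}

run_with_cpu_hogs 2 bash -c 'exit 7'; status=$?
leftover=0
for pid in $HOG_PIDS; do
  kill -0 "$pid" 2>/dev/null && leftover=$((leftover+1))
done
echo "status=$status leftover=$leftover"
```

Recording `$!` at spawn time and killing only those PIDs avoids touching unrelated
processes that share a name with the hog. The wrapped command's exit status is passed
through unchanged, so a suite failure stays visible to the caller.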

diff --git a/.github/workflows/tests.yml b/.github/workflows/tests.yml
index c9a99da..52961e6 100644
--- a/.github/workflows/tests.yml
+++ b/.github/workflows/tests.yml
@@ -17,6 +17,13 @@ on:
   schedule:
     - cron: '17 3 * * 1'   # weekly Monday run: catches runner-image/tool drift
   workflow_dispatch:
+    inputs:
+      stress_kind:
+        description: 'STRESS BRANCH: artificial load during suites — none|cpu|disk|both'
+        default: both
+      stress_load:
+        description: 'STRESS BRANCH: hogs per kind (blank = runner core count)'
+        default: ''
 
 concurrency:
   # STRESS-BRANCH ONLY — do NOT merge to main. The per-run-unique group + no
@@ -41,17 +48,22 @@ jobs:
         # process-spawn overhead, not the PowerShell engines). Suites must NOT run
         # concurrently inside one runner: they're timing-sensitive on 2-core
         # runners. POSIX legs are fast enough to stay single-job.
-        include:
-          - { os: ubuntu-24.04, leg: all, job_timeout: 35 }
-          - { os: macos-15, leg: all, job_timeout: 35 }
-          - { os: windows-2025, leg: unit, job_timeout: 20 }
-          - { os: windows-2025, leg: interop-integration, job_timeout: 22 }
+        include:                       # STRESS BRANCH: job_timeouts raised to clear the summed step budgets under artificial load
+          - { os: ubuntu-24.04, leg: all, job_timeout: 80 }
+          - { os: macos-15, leg: all, job_timeout: 80 }
+          - { os: windows-2025, leg: unit, job_timeout: 40 }
+          - { os: windows-2025, leg: interop-integration, job_timeout: 50 }
     timeout-minutes: ${{ matrix.job_timeout }}   # backstop only: sum of the leg's step budgets + upload headroom
     defaults:
       run:
         shell: bash                  # on windows-2025 this is Git Bash (MINGW) — what the interop suite requires
     env:
       GCL_TEST_FULL: 1               # full fan-out — CI runners are dedicated; the reduced default protects live dev boxes (TODO 58)
+      # STRESS-BRANCH ONLY (do not merge): artificial CPU/disk load wrapped around each
+      # suite (tests/with-load.sh) to widen timing windows and surface latency/race
+      # flakes. From the workflow_dispatch inputs; empty on push/schedule => 'none'.
+      GCL_STRESS_KIND: ${{ inputs.stress_kind || 'none' }}
+      GCL_STRESS_LOAD: ${{ inputs.stress_load }}
     steps:
       - uses: actions/checkout@9f698171ed81b15d1823a05fc7211befd50c8ae0   # v6.0.3, SHA-pinned
         with:
@@ -76,30 +88,30 @@ jobs:
 
       - name: Unit suite
         if: ${{ matrix.leg == 'all' || matrix.leg == 'unit' }}
-        timeout-minutes: ${{ matrix.os == 'windows-2025' && 15 || 10 }}   # a step timeout FAILS the step (not the job), so the upload step reliably runs; sized from run 27325978197 + one internal MAX_WAIT hang
+        timeout-minutes: ${{ matrix.os == 'windows-2025' && 30 || 25 }}   # STRESS BRANCH: raised (15->30 / 10->25) so artificial load slowness doesn't trip the step timeout and masquerade as a flake
         env:
           GCL_TEST_PRESERVE_DIR: ${{ github.workspace }}/test-output/failed-work/unit
         run: |
           mkdir -p test-output
-          bash tests/git-commit-lock.test.sh 2>&1 | tee test-output/unit-suite.log
+          bash tests/with-load.sh bash tests/git-commit-lock.test.sh 2>&1 | tee test-output/unit-suite.log
 
       - name: Interop suite (bash + pwsh)
         if: ${{ !cancelled() && (matrix.leg == 'all' || matrix.leg == 'interop-integration') }}   # run even if an earlier suite failed — every signal is useful
-        timeout-minutes: 10
+        timeout-minutes: 25          # STRESS BRANCH: raised 10->25 for artificial load
         env:
           GCL_TEST_PRESERVE_DIR: ${{ github.workspace }}/test-output/failed-work/interop
         run: |
           mkdir -p test-output
-          bash tests/git-commit-lock.interop.test.sh 2>&1 | tee test-output/interop-suite.log
+          bash tests/with-load.sh bash tests/git-commit-lock.interop.test.sh 2>&1 | tee test-output/interop-suite.log
 
       - name: Integration suite (real concurrent commits)
         if: ${{ !cancelled() && (matrix.leg == 'all' || matrix.leg == 'interop-integration') }}
-        timeout-minutes: 7           # its internal AGENT_LOCK_MAX_WAIT cap is 240s
+        timeout-minutes: 20          # STRESS BRANCH: raised 7->20 for artificial load (internal AGENT_LOCK_MAX_WAIT cap is 240s)
         env:
           GCL_TEST_PRESERVE_DIR: ${{ github.workspace }}/test-output/failed-work/integration
         run: |
           mkdir -p test-output
-          bash tests/git-commit-lock.integration.test.sh 2>&1 | tee test-output/integration-suite.log
+          bash tests/with-load.sh bash tests/git-commit-lock.integration.test.sh 2>&1 | tee test-output/integration-suite.log
 
       - name: Upload failure diagnostics
         if: ${{ failure() || cancelled() }}   # failure() covers step timeouts (they fail the step); cancelled() is best-effort cover for manual cancels / the job-level backstop
diff --git a/tests/with-load.sh b/tests/with-load.sh
new file mode 100644
index 0000000..e19ae5f
--- /dev/null
+++ b/tests/with-load.sh
@@ -0,0 +1,73 @@
+#!/usr/bin/env bash
+# STRESS-BRANCH ONLY — do NOT merge to main.
+#
+# Run "$@" while artificial CPU and/or disk load saturates the runner, to widen the
+# timing windows that latency/race flakes depend on (e.g. Test 17d's churn "absent
+# window" — driven by both CPU descheduling of the churner AND slow file create/delete
+# IO). Hogs are reaped by their EXACT PIDs afterward (never by name), so this is safe on
+# a shared machine; on an ephemeral CI runner it is doubly safe.
+#
+#   GCL_STRESS_KIND = none | cpu | disk | both   (default: both)
+#   GCL_STRESS_LOAD = N hogs of EACH selected kind (default: detected core count)
+#
+# CPU hog  = a bare bash spin loop (one core each).
+# Disk hog = a tight create / write+fsync / delete loop of a small file on the same
+#            volume as the test's scratch dir (TMPDIR) — metadata + write-back pressure
+#            that contends with the lock-file create/delete the suite itself does.
+set -uo pipefail
+
+kind="${GCL_STRESS_KIND:-both}"
+cores="$(nproc 2>/dev/null || sysctl -n hw.ncpu 2>/dev/null || echo 4)"
+load="${GCL_STRESS_LOAD:-$cores}"
+case "$load" in ''|*[!0-9]*) load="$cores" ;; esac   # guard non-numeric / empty
+
+hogdir="${TMPDIR:-/tmp}/gcl-stress.$$"
+mkdir -p "$hogdir" 2>/dev/null || hogdir="."
+
+hogs=()
+spawn_cpu() {
+  local i
+  for ((i = 0; i < load; i++)); do
+    bash -c 'while :; do :; done' &
+    hogs+=("$!")
+  done
+}
+spawn_disk() {
+  local i
+  for ((i = 0; i < load; i++)); do
+    bash -c '
+      d="$1"; j=0
+      while :; do
+        f="$d/dh.$$.$((j % 24))"
+        dd if=/dev/zero of="$f" bs=32k count=8 conv=fsync 2>/dev/null
+        rm -f "$f"
+        j=$((j + 1))
+      done' _ "$hogdir" &
+    hogs+=("$!")
+  done
+}
+cleanup() {
+  local p
+  for p in "${hogs[@]:-}"; do
+    [ -n "$p" ] && kill "$p" 2>/dev/null
+  done
+  rm -rf "$hogdir" 2>/dev/null
+}
+trap cleanup EXIT INT TERM
+
+case "$kind" in
+  cpu)  spawn_cpu ;;
+  disk) spawn_disk ;;
+  both) spawn_cpu; spawn_disk ;;
+  none) : ;;
+  *) echo "with-load: unknown GCL_STRESS_KIND='$kind' — running with NO load" >&2 ;;
+esac
+echo "stress: kind=$kind load=$load cores=$cores hogs=${#hogs[@]} :: $*"
+
+"$@"
+rc=$?
+
+cleanup
+hogs=()
+echo "stress: hogs reaped; wrapped command rc=$rc"
+exit "$rc"
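
The load-count guard in `tests/with-load.sh` above can be exercised on its own. The following is a minimal sketch and not part of the patch: `normalize_load` is a hypothetical helper name that wraps the same `case` pattern, so its behaviour on empty, non-numeric, and numeric input is easy to see.

```shell
# Sketch only: normalize_load is a hypothetical helper, not a function in with-load.sh.
# It applies the same guard the script uses: an empty value, or any value containing a
# non-digit character, falls back to the supplied default.
normalize_load() {
  local load="$1" fallback="$2"
  case "$load" in
    '' | *[!0-9]*) load="$fallback" ;;
  esac
  printf '%s\n' "$load"
}

normalize_load "" 4      # empty (push/schedule runs leave GCL_STRESS_LOAD empty): prints 4
normalize_load "2x" 4    # non-numeric: prints 4
normalize_load "2" 4     # valid: prints 2
```

Note that `0` passes the guard, so `GCL_STRESS_LOAD=0` would spawn no hogs even when a load kind is selected.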

From 2e483de058cb2f9084a141e75c4057881b56b000 Mon Sep 17 00:00:00 2001
From: Ben Toner <bentoner@bentoner.com>
Date: Wed, 17 Jun 2026 00:00:31 +1000
Subject: [PATCH 06/76] AGENTS.md: record the CI flake-hunt mission + formal
 diagnosis loop

So the process (dispatch -> on failure: subagent diagnose -> Codex review -> plan ->
review/fix rounds -> implement -> review/fix rounds -> commit -> reset streak -> resume),
the mechanics, the process-hygiene lessons, and the progress log survive context compaction.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 AGENTS.md | 79 +++++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 79 insertions(+)
 create mode 100644 AGENTS.md

diff --git a/AGENTS.md b/AGENTS.md
new file mode 100644
index 0000000..9f10699
--- /dev/null
+++ b/AGENTS.md
@@ -0,0 +1,79 @@
+# AGENTS.md — CI flakiness stress hunt (branch `ci-stress`)
+
+> This branch exists to **flush out CI flakiness** in the test suites by running them
+> on GitHub Actions many times, under artificial load, and fixing every flake found via
+> a formal loop. Written 2026-06-16 so the mission + process survive context compaction.
+> A successor instance: read this top-to-bottom, then check `.agent-testing/` for live state.
+
+## Mission (Ben, 2026-06-16)
+Run the `tests` workflow on `ci-stress` repeatedly until **50 clean runs in a row**, or
+until agent credits run out (tell Ben; GitHub minutes are FREE — public repo — so the
+only budget is agent compute). Each time a run fails, fix the flake with the formal loop
+below, reset the streak to 0 (we want 50 clean on the *fixed* code), and resume. Ben also
+asked to run under **CPU + disk load** to surface load-sensitive flakes faster.
+
+## The formal diagnosis→fix loop (run on EVERY failure)
+1. **Capture** the failure: which leg/suite/test, the assertion, logs + preserved
+   artifacts. Save under `.agent-testing/failures/<run_id>/` (or `interop-fail-*.log`).
+2. **Diagnose** — spawn a subagent (fresh context) to root-cause from the evidence + the
+   code. Give it the evidence, WITHHOLD your own conclusion (let it reason independently).
+3. **Independent review of the diagnosis** — get a *foreign model* (Codex) to verify the
+   diagnosis against the code (uncorrelated with Claude). `codex exec --sandbox read-only
+   -c service_tier=default - < prompt > out.md` (NO `-o` — it corrupts output; capture stdout).
+4. **Classify**: test-flake (timing assumption breaks; product is correct) vs product bug.
+5. **Plan** the fix in `.plans/YYYY-MM-DD-ci-stress-<task>-plan.md`; commit it.
+6. **Plan review/fix rounds until clean** — fresh Claude reviewer AND Codex each round;
+   block ONLY on real design defects (not plan-doc pedantry); iterate until both CONVERGE.
+   Verify every reviewer finding against the actual code yourself (reviewers are fallible
+   and Claude-correlated).
+7. **Implement** the fix (test or product). `bash -n` + `shellcheck -S style` (v0.11.0 —
+   the CI gate) must stay clean. Run the affected suite locally to confirm.
+8. **Implementation review/fix rounds** — fresh Claude reviewer + Codex on the diff; clean.
+9. **Commit** to `ci-stress` under the git commit lock (`~/.local/bin/git-commit-lock.sh
+   run -- ...`, stage only your paths), **push**, mark the plan DONE + changelog.
+10. **Reset** the streak (`rm .agent-testing/clean_count`) and **resume** the driver.
+
+Quality bar (Ben): "I'm intending this library to be great" — spend tokens on rigor;
+don't cap review rounds for cost; a wrong fix that resurfaces is worse than slow.
+
+## Mechanics (all under the `ci-stress` worktree)
+- Worktree: `C:/agent_data/commit-lock/worktrees/ci-stress`. Repo public: `bentoner/git-commit-lock`.
+- **Auth**: `GH_TOKEN=$(printf 'protocol=https\nhost=github.com\n\n' | git credential fill | grep '^password=' | cut -d= -f2-)`. `gh` is at `~/scoop/shims` (add to PATH).
+- **Stress-only commits — DO NOT MERGE to main**: the workflow `concurrency` tweak
+  (unique-per-run group, so parallel dispatches don't cancel) and `tests/with-load.sh` +
+  the workflow's load wiring (inputs `stress_kind`/`stress_load`, wrapped suite steps,
+  raised timeouts). Any *test/product fixes* ARE normal mergeable commits.
+- **Driver**: `.agent-testing/driver.sh` — keeps `MAXC=5` runs in flight via
+  `workflow_dispatch` (with `-f stress_kind=$STRESS_KIND`), polls, records
+  `results.tsv`/`clean_count`/`status.txt`, and EXITS on the first failure (sentinel
+  `FAIL:<id>`, captures diagnostics) or at `TARGET` (sentinel `DONE`). Launch:
+  `cd .agent-testing && rm -f clean_count sentinel STOP && STRESS_KIND=both TARGET=50 bash ./driver.sh` (background).
+- **Load**: `tests/with-load.sh` wraps each suite, spawning N CPU spin-loops and/or N disk
+  create/write+fsync/delete loops (`GCL_STRESS_KIND`, `GCL_STRESS_LOAD`). Hogs reaped by
+  exact PID. The runner is 4-core; `load=4` saturates it.
+- **Flake-condition meter**: Test 17d's `note: T17d outcomes rc0=.. rc1=.. rc97=.. rc98=..
+  ; WAITING=..` line (in each unit-leg log) shows how hard load is biting (rc97 dropping /
+  rc0 rising == the original flake condition). Read it to confirm load is effective.
+
+## Process hygiene (LEARNED THE HARD WAY 2026-06-16)
+- **`TaskStop` does NOT kill a background bash script** — it keeps running and dispatching.
+  After stopping, VERIFY via `powershell Get-CimInstance Win32_Process -Filter
+  "Name='bash.exe'"` (match CommandLine on `driver.sh`/`calibrate.sh`) and
+  `taskkill //F //T //PID <winpid>` the SPECIFIC pid. The driver also honors a graceful
+  **STOP file**: `touch .agent-testing/STOP` → it cancels inflight and exits (sentinel STOPPED).
+- **Exactly ONE dispatcher alive at a time.** A surviving zombie + a relaunch = two
+  dispatchers racing on `ci-stress` (this corrupted a calibration run-id correlation).
+- **NEVER blanket-kill** by name (`Stop-Process -Name`, `taskkill /IM`, `pkill`) — Ben's
+  box is shared; kill only specific PIDs you spawned.
+
+## Progress log
+- **Test 17d (unit, `git-commit-lock.test.sh`)** — `got97>=1` was timing-fragile
+  (windows-unit flaked at normal load, run 27616343269). FIXED (commit 58c3741): replaced
+  with rc∈{0,1,97,98} + drop-free `WAITING>=1` anti-vacuity canary + `note:` meter.
+  Diagnosis+plan+impl all reviewed clean by Claude+Codex. See the plan in `.plans/`.
+- **Test 5 (interop, `git-commit-lock.interop.test.sh`)** — FOUND under CPU load
+  (load=4): `FAIL: expected a tok.ps.* token on line 1 of the orphan lock, got ''`. The
+  precondition read (`head -n 1 "$LOCK"` after killing the pwsh holder) is a single
+  non-retrying read that catches the token not-yet-visible under load; the actual
+  cross-impl steal asserts PASS. Looks like a test-flake (fragile precondition read).
+  STATUS: in the formal loop (diagnosis stage) as of this writing.
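
The flake-condition meter described under Mechanics in the AGENTS.md above can be read mechanically from a unit-leg log. This is a minimal sketch, assuming the `..` placeholders in the `note: T17d outcomes` line are non-negative integers; `t17d_field` is a hypothetical helper name, and the sample line is invented for illustration rather than taken from a real run.

```shell
# Sketch only: t17d_field is a hypothetical helper; the sample values are invented.
# Prints the integer that follows "<key>=" on the first "note: T17d outcomes" line of stdin.
t17d_field() {
  local key="$1"
  grep -m 1 'note: T17d outcomes' \
    | grep -o -E "(^|[^[:alnum:]])${key}=[0-9]+" \
    | head -n 1 \
    | sed -E 's/.*=//'
}

sample='note: T17d outcomes rc0=4 rc1=0 rc97=7 rc98=1 ; WAITING=9'
printf '%s\n' "$sample" | t17d_field rc0      # prints 4
printf '%s\n' "$sample" | t17d_field rc97     # prints 7
printf '%s\n' "$sample" | t17d_field WAITING  # prints 9
```

Comparing `rc0` and `rc97` across runs at different `GCL_STRESS_LOAD` values is one way to check that the load is actually biting before starting a long hunt.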

From 06c6d8e614262da42abd8254145b048eb94ec54f Mon Sep 17 00:00:00 2001
From: Ben Toner <bentoner@bentoner.com>
Date: Wed, 17 Jun 2026 00:30:39 +1000
Subject: [PATCH 07/76] Interop Test 5: de-flake via deterministic pwsh orphan
 (CPU-load find)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Under CPU load the kill -9 of the pwsh holder missed the native pwsh.exe (under MSYS,
dollar-bang names a shim, not pwsh.exe), so pwsh ran its full Start-Sleep 60 and exited
gracefully. Its PowerShell.Exiting backstop then DELETED the lock, so the precondition read
got an empty/gone file and the 3 steal asserts were vacuous (they stole an empty file that
backdate had re-created). Diagnosis + Codex agreed: test bug, product correct.
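
The re-creation step is easy to reproduce in isolation. This is a minimal sketch, assuming only that the test's backdate helper boils down to a plain `touch` without `-c` (as the plan added in this commit notes); the token value is invented and nothing here calls the real test helpers.

```shell
# Sketch only: the directory and token are invented for illustration.
demo_dir="$(mktemp -d)"
lock="$demo_dir/b5.lock"

printf 'tok.ps.12345.example\n' > "$lock"   # holder acquired: token on line 1
rm -f "$lock"                               # graceful-exit backstop deleted the lock

touch -c "$lock"                            # -c: never create; the file stays absent
[ -e "$lock" ] && after_touch_c=present || after_touch_c=absent

touch "$lock"                               # plain touch re-creates the file, empty
first_line="$(head -n 1 "$lock")"

echo "after touch -c: $after_touch_c"           # absent
echo "first line after touch: '$first_line'"    # '' (what the failing assertion reported)
rm -rf "$demo_dir"
```

The empty first line matches the `got ''` in the failing assertion, and the re-created file is what the downstream steal asserts then operated on.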

Fix (Option D): the holder now does
  if (-not (Lock-Acquire)) { Exit 3 }; write READY; [Environment]::Exit(0)
Environment.Exit bypasses BOTH release and the backstop, leaving a deterministic
token-bearing orphan with no external kill. bash drops the kill and just reaps. The
tok.ps token assertion is now genuine every run, not vacuous.

Local interop suite 141/0. Reviewed clean by fresh Claude reviewer + Codex. shellcheck
-S style + bash -n clean. Found via the CI load stress test.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 ...6-17-ci-stress-interop-test5-flake-plan.md | 119 ++++++++++++++++++
 AGENTS.md                                     |  25 +++-
 tests/git-commit-lock.interop.test.sh         |  19 +--
 3 files changed, 150 insertions(+), 13 deletions(-)
 create mode 100644 .plans/2026-06-17-ci-stress-interop-test5-flake-plan.md

diff --git a/.plans/2026-06-17-ci-stress-interop-test5-flake-plan.md b/.plans/2026-06-17-ci-stress-interop-test5-flake-plan.md
new file mode 100644
index 0000000..a6f9e8d
--- /dev/null
+++ b/.plans/2026-06-17-ci-stress-interop-test5-flake-plan.md
@@ -0,0 +1,119 @@
+# Plan: de-flake interop Test 5 (genuine-pwsh-orphan steal) under load
+
+Status: **DONE** — diagnosis + fix D validated by Claude subagent + Codex; implemented;
+implementation reviewed clean by fresh Claude reviewer ("IMPLEMENTATION OK") + Codex ("no
+correctness issues"); local interop suite 141/0 with a genuine `tok.ps.*` token. Awaiting
+CI-under-load confirmation.
+
+## Reviewer notes (top; do not renumber)
+_(none yet)_
+
+## Context
+CI stress under CPU load (load=4, 4-core Windows runner) reproducibly fails the **interop
+suite Test 5** ("bash steals a STALE lock GENUINELY created by pwsh (holder killed
+mid-hold)"), `tests/git-commit-lock.interop.test.sh:308-334`:
+```
+FAIL: expected a tok.ps.* token on line 1 of the orphan lock, got ''
+PASS: bash run exited 0 after stealing pwsh's stale lock   (+2 more PASS)
+```
+Diagnosis (Claude subagent) + independent Codex review — both in
+`.agent-testing/failures/interop-test5/{DIAGNOSIS.md,b5.log}` and
+`.agent-testing/codex-t5-diag-review.txt`. Agreed mechanism (high confidence,
+triple-corroborated by b5.log):
+
+- The holder is `pwsh ... Lock-Acquire; write READY; Start-Sleep 60 &`, with `hpid=$!`.
+  bash waits READY then `kill -9 "$hpid"`. **That kill does not terminate the native
+  pwsh** (MSYS `$!` names a shim, not `pwsh.exe`; under load it misses). Proof: b5.log
+  shows ACQUIRED 13:42:45 → RELEASED 13:43:45 = **exactly 60s = the full Start-Sleep**,
+  and the release reason is **`engine-event backstop at process exit`** which fires ONLY
+  on graceful exit (`git-commit-lock.ps1:1299-1322`), never on a hard kill.
+- That graceful-exit backstop **deletes the lock file** (`git-commit-lock.ps1:1319-1321`)
+  before bash reads it, so `head -n 1 "$LOCK"` (:320) returns `''` — a **gone file**, not
+  a slow-to-appear token. `backdate "$LOCK" 9999` (:325 = `touch`, no `-c`, :107-115)
+  then **re-creates it empty+ancient**, and bash steals THAT empty orphan (`ghost=?`,
+  b5.log). So the 3 downstream PASSes are **vacuous** (they steal an empty file, not a
+  genuine `tok.ps.*` orphan); the only assertion checking the real premise correctly FAILed.
+- **Classification: test bug, product correct.** Every product action in b5.log is right.
+- **Why load:** unloaded, the kill lands by timing luck before the sleep ends; under load
+  the kill misses and the holder self-releases.
+
+Scope: this kill-a-holder-then-read-its-orphan pattern is unique to Test 5. The other
+interop kill (`:787`, `w14b`) is cleanup of a *hung waiter* after a regression `bad` — no
+orphan read depends on it — so it is NOT affected.
+
+## Fix (Option D — make the orphan deterministic; remove the unreliable kill)
+Both reviewers recommend D over hardening the kill (B/C): it eliminates the flaky
+mechanism instead of making it reliable, and is the smaller, more deterministic change.
+
+Have the pwsh holder **acquire, signal READY, then self-exit via
+`[Environment]::Exit(0)`** — the product's *documented* hard-exit that bypasses BOTH
+`Lock-Release` and the `PowerShell.Exiting` backstop (`git-commit-lock.ps1:221-224`,
+`:1299-1301`), so it leaves a genuine token'd orphan every time, with no external kill and
+no timing dependence. `Lock-Acquire` writes+flushes+closes the token before returning
+(`git-commit-lock.ps1:650-664`) and READY is written only after acquire, so the moment
+bash sees READY the `tok.ps.*` token is already durably on disk.
+
+Concretely in `tests/git-commit-lock.interop.test.sh` Test 5:
+1. Holder command (`:314-315`): replace
+   `. '$PS1WIN'; Lock-Acquire | Out-Null; [IO.File]::WriteAllText('$READY','r'); Start-Sleep 60`
+   with
+   `. '$PS1WIN'; if (-not (Lock-Acquire)) { [Environment]::Exit(3) }; [IO.File]::WriteAllText('$READY','r'); [Environment]::Exit(0)`
+   (`Lock-Acquire` returns `$false` on failure, `git-commit-lock.ps1:1350`; guard it so a
+   failed acquire never writes READY → the existing else-branch "never readied" fires.)
+2. Success branch (`:317-324`): drop the unreliable `kill -9 "$hpid"; wait "$hpid"; sleep
+   0.3` and replace with just `wait "$hpid" 2>/dev/null` (reap the self-exited holder).
+   Keep the token read + `case tok.ps.*` assertion + `backdate` + the steal asserts
+   unchanged — but now the orphan deterministically carries the genuine pwsh token, so the
+   `tok.ps.*` assertion (and the downstream steal) are no longer vacuous.
+3. Comment (`:309-311`): rewrite to describe the new mechanism honestly — the holder
+   acquires, signals ready, then exits via `[Environment]::Exit(0)`, a CLR hard-exit that
+   bypasses release (no `PowerShell.Exiting` event), leaving a genuine no-release token'd
+   orphan; deterministically equivalent (same on-disk state) to a holder killed mid-hold,
+   without depending on a scheduler-raced external kill.
+4. else branch (`:331-333`): keep its `kill -9 "$hpid"` cleanup (harmless; the holder may
+   still be starting if it never readied).
+
+### Why D is faithful (not a weakening)
+Test 5 verifies **bash stealing a genuine stale pwsh-created lock cross-impl**. What
+matters is the on-disk state at steal time: a live lock file whose line 1 is a real
+`tok.ps.*` token, with the holder gone and no release performed. D produces exactly that
+state deterministically. The literal "killed by external TerminateProcess" flavor is only
+test *setup*, not the product behavior under test; D's CLR hard-exit leaves the identical
+artifact. The fix makes the long-vacuous downstream PASSes actually meaningful.
+
+## Also
+- Correct the `AGENTS.md` Test 5 progress-log note (it currently states the wrong
+  mechanism — "token not-yet-visible under load"); replace with the missed-kill /
+  graceful-release-deleted-the-file mechanism.
+
+## Out of scope / NOT changed
+- Product code (`git-commit-lock.ps1` / `.sh`) — no product defect.
+- The bash-worker kills in the unit suite (they kill native bash where `$!` is correct and
+  no orphan-read depends on them; they passed under load).
+- Other interop tests.
+
+## Testing
+1. Static: `bash -n` + `shellcheck -S style` (v0.11.0, the CI gate) on the interop test.
+2. Local: run the interop suite once on this box (pwsh present) — Test 5 must pass and the
+   token assertion must see a real `tok.ps.*` token. (Unloaded local box can't reproduce
+   the original miss, but confirms the rewrite is correct.)
+3. Real proof = CI under load: dispatch ci-stress with stress_kind=cpu/both several times;
+   the interop leg must stay green where it previously failed deterministically.
+
+## Changelog (implementation)
+- Implemented Fix D in `tests/git-commit-lock.interop.test.sh` Test 5: holder command now
+  `if (-not (Lock-Acquire)) { [Environment]::Exit(3) }; write READY; [Environment]::Exit(0)`
+  (was `Lock-Acquire | Out-Null; write READY; Start-Sleep 60`); success branch drops
+  `kill -9 "$hpid"; sleep 0.3`, keeps `wait "$hpid"` to reap; ok-message + comment updated.
+  No product code, no other test touched. `Lock-Acquire` returns a strict boolean
+  (git-commit-lock.ps1:1350 etc.) so the `-not` guard is valid; the token is flushed+closed
+  during acquire (before READY) so it is durably visible before `[Environment]::Exit`.
+- Static: `bash -n` + `shellcheck -S style` (v0.11.0) clean.
+- Local (Windows, pwsh 7.5.5): interop suite **141 passed / 0 failed**; Test 5 token
+  assertion now PASSes with a real `tok.ps.*` token (e.g. `tok.ps.76676.…`) — no longer the
+  vacuous empty-orphan steal.
+- Review: fresh Claude reviewer "IMPLEMENTATION OK" (verified Lock-Acquire boolean contract,
+  no pipeline pollution from dropping Out-Null, token durability, race-free `wait`, quoting);
+  Codex `exec review --uncommitted` "no correctness issues." Both in `.agent-testing/`.
+- AGENTS.md Test 5 progress note corrected (was the wrong "token not-yet-visible" mechanism).
+- Real proof pending: CI interop leg under CPU load where it previously failed 3/3.
diff --git a/AGENTS.md b/AGENTS.md
index 9f10699..07a06bd 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -71,9 +71,22 @@ don't cap review rounds for cost; a wrong fix that resurfaces is worse than slow
   (windows-unit flaked at normal load, run 27616343269). FIXED (commit 58c3741): replaced
   with rc∈{0,1,97,98} + drop-free `WAITING>=1` anti-vacuity canary + `note:` meter.
   Diagnosis+plan+impl all reviewed clean by Claude+Codex. See the plan in `.plans/`.
-- **Test 5 (interop, `git-commit-lock.interop.test.sh`)** — FOUND under CPU load
-  (load=4): `FAIL: expected a tok.ps.* token on line 1 of the orphan lock, got ''`. The
-  precondition read (`head -n 1 "$LOCK"` after killing the pwsh holder) is a single
-  non-retrying read that catches the token not-yet-visible under load; the actual
-  cross-impl steal asserts PASS. Looks like a test-flake (fragile precondition read).
-  STATUS: in the formal loop (diagnosis stage) as of this writing.
+- **Test 5 (interop, `git-commit-lock.interop.test.sh`)** — FOUND under CPU load (3/3 cpu
+  runs): `FAIL: expected a tok.ps.* token on line 1 of the orphan lock, got ''`. Mechanism
+  (diagnosis + Codex, NOT "token not-yet-visible"): `kill -9 "$hpid"` missed the native
+  pwsh (MSYS `$!` is a shim), so pwsh ran its full `Start-Sleep 60` and exited gracefully,
+  firing the `PowerShell.Exiting` backstop that DELETED its own lock — so the read hit a
+  gone file; `backdate`(touch) then re-created it empty, making the 3 "steal" PASSes
+  vacuous. Test bug, product correct. FIXED (commit <see git log>): holder now self-exits
+  via `[Environment]::Exit(0)` (bypasses release + backstop) leaving a deterministic
+  token'd orphan — no kill. Reviewed clean Claude+Codex; local interop 141/0.
+- **Calibration finding (load=4 on a 4-core runner):** `cpu` reliably breaks interop Test 5
+  (above) and otherwise the unit suite is fine. `disk` shifts Test 17d toward the acquire
+  regime (rc0 up to 4/12 — Ben's disk instinct was apt) but nothing fails. `both` (8 hogs
+  on 4 cores) is the most extreme and additionally trips TWO unit tests only under that
+  pathological oversubscription: `recovery took 33s (>20s)` (+ "rc=97 behind a crashed
+  claim" / "no STOLE-BY-CLAIM") and `claim-path warning fired 0 times (want 1)`. These two
+  are SUSPECTED load-too-high artifacts (tight internal budgets exceeded by 2x CPU
+  oversubscription + heavy disk), NOT yet confirmed genuine. STATUS: to classify before the
+  50-clean hunt — decide hunt load level (cpu-only vs moderate both) and whether to harden
+  those two budgets. Data: `.agent-testing/calibration.tsv`.
diff --git a/tests/git-commit-lock.interop.test.sh b/tests/git-commit-lock.interop.test.sh
index 06fe746..8d2a566 100644
--- a/tests/git-commit-lock.interop.test.sh
+++ b/tests/git-commit-lock.interop.test.sh
@@ -306,20 +306,25 @@ grep -q "holder=pid=99999 host=ghost" "$LOG" \
   || bad "holder from line 2 missing in pwsh's STALE log line"
 
 echo "== Test 5: bash steals a STALE lock GENUINELY created by pwsh (holder killed mid-hold) =="
-# The stale lock really is pwsh's: a pwsh process dot-sources the lock, acquires,
-# signals ready, then is hard-killed by PID mid-hold (TerminateProcess — no
-# release, no exit event), leaving its live lock FILE (token line 1) behind.
+# The stale lock really is pwsh's: a pwsh process dot-sources the lock, acquires (writing
+# its tok.ps.* token to line 1 and flushing+closing the file), signals ready, then
+# SELF-EXITS via [Environment]::Exit(0) — the port's documented hard-exit that bypasses
+# BOTH Lock-Release AND the PowerShell.Exiting backstop — leaving its live token'd lock
+# FILE behind with no release. This is DETERMINISTIC: the same on-disk state as a holder
+# killed mid-hold, but without an external kill. (An MSYS `kill -9 "$!"` does NOT reliably
+# terminate the native pwsh.exe under load — it survived, ran to completion, and its
+# graceful-exit backstop DELETED the lock, leaving an empty file to steal; observed under
+# CPU load, run 27621668323. See the Test 5 de-flake plan.)
 LOCK="$WORK/b5.lock"; LOG="$WORK/b5.log"; : > "$LOG"; MARK="$WORK/b5.mark"; printf '%s' before > "$MARK"
 READY="$WORK/b5.ready"; rm -f "$READY"
 AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" AGENT_LOCK_STALE_SECS=300 \
-  pwsh -NoProfile -Command ". '$PS1WIN'; Lock-Acquire | Out-Null; [IO.File]::WriteAllText('$READY','r'); Start-Sleep 60" &
+  pwsh -NoProfile -Command ". '$PS1WIN'; if (-not (Lock-Acquire)) { [Environment]::Exit(3) }; [IO.File]::WriteAllText('$READY','r'); [Environment]::Exit(0)" &
 hpid=$!
 if wait_for "$READY"; then
-  kill -9 "$hpid" 2>/dev/null; wait "$hpid" 2>/dev/null
-  sleep 0.3
+  wait "$hpid" 2>/dev/null                          # holder self-exited via [Environment]::Exit (no release); reap it
   tok="$(head -n 1 "$LOCK" 2>/dev/null | tr -d '\r\n')"
   case "$tok" in
-    tok.ps.*) ok "dead pwsh holder left its own lock file behind (token $tok)" ;;
+    tok.ps.*) ok "self-exited pwsh holder left its own token'd lock behind (token $tok)" ;;
     *)        bad "expected a tok.ps.* token on line 1 of the orphan lock, got '$tok'" ;;
   esac
   backdate "$LOCK" 9999                           # age the orphan past any stale window

From 3270fbd37db19221d6419a3bb60ed2c9f2df19eb Mon Sep 17 00:00:00 2001
From: Ben Toner <bentoner@bentoner.com>
Date: Wed, 17 Jun 2026 01:21:45 +1000
Subject: [PATCH 08/76] AGENTS.md: record Test 31(a) diagnosis + hunt status
 (16/50, halted on T31a)

Third flake found by the load hunt (ubuntu, both/load=2): Test 31(a)'s leaked-token-memory
DISCOVERY-HOLD assertion depends on a race between the external mv install and the leaver's
_lock_discover. Under load the mv lands first, so the direct-discover path (sh:822) adopts
the claim instead of the memory path (sh:1382) that the assertion pins. Product correct;
test-orchestration race; 31(b) covers the memory path deterministically. Diagnosed, fix
pending the formal loop. Hunt at 16/50 clean, halted here.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 AGENTS.md | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/AGENTS.md b/AGENTS.md
index 07a06bd..381f331 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -90,3 +90,23 @@ don't cap review rounds for cost; a wrong fix that resurfaces is worse than slow
   oversubscription + heavy disk), NOT yet confirmed genuine. STATUS: to classify before the
   50-clean hunt — decide hunt load level (cpu-only vs moderate both) and whether to harden
   those two budgets. Data: `.agent-testing/calibration.tsv`.
+- **Test 31(a) (unit, `git-commit-lock.test.sh:1582`)** — FOUND on **ubuntu** under
+  both/load=2 (moderate, likely genuine), run 27626826865: `FAIL: no leaked-token-memory
+  DISCOVERY-HOLD`. DIAGNOSED (not yet fixed): the product has two valid DISCOVERY-HOLD
+  paths — direct `_lock_discover` (sh:822) and the per-poll leaked-token-memory check
+  (sh:1382). 31(a)'s external `mv` (installs the leaked claim at the lock path) RACES the
+  leaver's `_lock_discover`; under load the mv landed first, so 822 adopted the claim
+  instead of the 1382 memory path the assertion pins. Product correct (rc 0, clean
+  release, no leftover all PASSed); test-orchestration race. Sibling 31(b) already covers
+  the memory path DETERMINISTICALLY (internal steering) and passed. Fix options + recommend
+  in `.agent-testing/failures/unit-test31/DIAGNOSIS.md` (recommend A: relax 31a to accept
+  generic DISCOVERY-HOLD since 31b covers memory — but VERIFY via the formal loop it's not
+  vacuous). NEEDS: subagent-diagnosis confirm + Codex review + plan + impl review.
+
+## Hunt status (as of 2026-06-17 ~01:15 local)
+- `both`/load=2 hunt reached **16/50 clean** then halted on Test 31(a) above. The driver
+  exited cleanly (sentinel FAIL); no stray dispatcher; no in-flight runs.
+- To RESUME after fixing Test 31(a): `cd .agent-testing && rm -f clean_count sentinel STOP
+  && STRESS_KIND=both STRESS_LOAD=2 TARGET=50 bash ./driver.sh` (background). Expect it to
+  surface further flakes (each is a fresh loop). Load=2 avoids the 8-hog budget artifacts.
+- TWO flakes fixed & pushed this session: Test 17d (58c3741), interop Test 5 (06c6d8e).

From b7af8102123cb455065898514a470e1c1e889182 Mon Sep 17 00:00:00 2001
From: Ben Toner <bentoner@bentoner.com>
Date: Wed, 17 Jun 2026 01:29:26 +1000
Subject: [PATCH 09/76] =?UTF-8?q?AGENTS.md:=20no=20CI=20credit=20limit=20(?=
 =?UTF-8?q?public=20repo,=20unlimited=20CI)=20=E2=80=94=20keep=20going?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Ben clarified: there are no credits to worry about; public repo => unlimited CI capped only
by GitHub concurrency. Corrects the earlier misleading agent-credits framing that prompted a
premature budget pause. Resume guidance: dispatch freely, keep going to 50 clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 AGENTS.md | 15 ++++++++++-----
 1 file changed, 10 insertions(+), 5 deletions(-)

diff --git a/AGENTS.md b/AGENTS.md
index 381f331..b15e5d2 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -6,11 +6,16 @@
 > A successor instance: read this top-to-bottom, then check `.agent-testing/` for live state.
 
 ## Mission (Ben, 2026-06-16)
-Run the `tests` workflow on `ci-stress` repeatedly until **50 clean runs in a row**, or
-until agent credits run out (tell Ben; GitHub minutes are FREE — public repo — so the
-only budget is agent compute). Each time a run fails, fix the flake with the formal loop
-below, reset the streak to 0 (we want 50 clean on the *fixed* code), and resume. Ben also
-asked to run under **CPU + disk load** to surface load-sensitive flakes faster.
+Run the `tests` workflow on `ci-stress` repeatedly until **50 clean runs in a row**. Each
+time a run fails, fix the flake with the formal loop below, reset the streak to 0 (we want
+50 clean on the *fixed* code), and resume. Ben also asked to run under **CPU + disk load**
+to surface load-sensitive flakes faster.
+
+**NO CREDITS / NO BUDGET LIMIT — DON'T PAUSE FOR "CREDITS".** Ben (2026-06-17, explicit):
+there are no credits to worry about — this is a PUBLIC repo, so we can run UNLIMITED CI,
+capped only by GitHub concurrency (excess just queues — throughput, not cost). Keep going;
+dispatch freely; run full review loops. Only surface a genuine blocker or a real decision
+for Ben.
 
 ## The formal diagnosis→fix loop (run on EVERY failure)
 1. **Capture** the failure: which leg/suite/test, the assertion, logs + preserved

From 51a1753b83da1f55d728d47e5d44756d293d9bcc Mon Sep 17 00:00:00 2001
From: Ben Toner <bentoner@bentoner.com>
Date: Wed, 17 Jun 2026 01:59:01 +1000
Subject: [PATCH 10/76] Test 31(a): de-flake the leaked-claim discovery-route
 race (CI load find)

Sub-leg (a) installs a recheck-unreadable leaked claim at the lock path via an
external mv, and until now asserted that adoption went through the per-poll
leaked-token-memory route ("DISCOVERY-HOLD (leaked-token memory)"). But the
product can adopt the
claim via EITHER of two correct routes: the inline ownership-discovery read
(git-commit-lock.sh:822) if the mv lands before it, or the per-poll memory check
(git-commit-lock.sh:1382) on a later poll if it lands after. Which fires is a pure
scheduling race -- the external mv vs the leaver's inline discover one statement
after the leak-add (sh:1112 -> sh:1114). Under both/load=2 on ubuntu the mv won and
the direct route fired, so the memory-pinned assertion failed spuriously
(run 27626826865).

The product behaved correctly in both cases (token remembered, same token observed
installed, adopted, rc 0, clean release, no residue). Fix is test-only: sub-leg (a)
now accepts EITHER DISCOVERY-HOLD route and records which fired, failing only if
neither adopted the claim. No coverage is lost -- the memory route stays pinned
deterministically by sub-leg (b), and the direct route by Test 25's 7-position
discovery-position matrix.

Diagnosis converged across four independent reviews (my code-read + the verbatim
leak.log, a fresh-context Claude subagent that did not read the prior diagnosis,
and a Codex foreign-model review). Implementation reviewed clean by a fresh Claude
reviewer and by Codex. Static checks (bash -n + shellcheck -S style v0.11.0) clean;
local unit suite 207 passed / 0 failed.
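
For readers skimming this message, the three-way route check can be sketched
standalone as below. This is an illustration under stated assumptions, not the
suite's code: classify_route is a hypothetical helper, and the sample log lines
are made-up stand-ins that only reproduce the two prefixes the greps key on.

```sh
# Sketch only: classify_route is hypothetical and does not exist in the suite.
# It reports which DISCOVERY-HOLD route a log shows: memory, direct, or none.
classify_route() {
  # Check the memory route first; its text has " (" right after HOLD.
  if grep -q "DISCOVERY-HOLD (leaked-token memory)" "$1"; then
    echo memory
  # An immediate colon after HOLD only occurs on the direct-route line.
  elif grep -q "DISCOVERY-HOLD:" "$1"; then
    echo direct
  else
    echo none
  fi
}

# Made-up sample logs (assumed shapes; only the prefixes matter here).
demo_dir="$(mktemp -d)"
printf '%s\n' "DISCOVERY-HOLD: sample direct-route line" > "$demo_dir/d1.log"
printf '%s\n' "DISCOVERY-HOLD (leaked-token memory): sample" > "$demo_dir/d2.log"
printf '%s\n' "LEAKED-CLAIM sample, never adopted" > "$demo_dir/none.log"

for name in d1 d2 none; do
  printf '%s -> %s\n' "$name" "$(classify_route "$demo_dir/$name.log")"
done
rm -rf "$demo_dir"
```

In grep's default basic-regex mode the parentheses are literal, so neither
pattern can match the other route's line; the order of the two checks is
therefore a readability choice rather than a correctness requirement.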

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 ...2026-06-17-ci-stress-test31a-flake-plan.md | 135 ++++++++++++++++++
 AGENTS.md                                     |  26 ++--
 tests/git-commit-lock.test.sh                 |  28 +++-
 3 files changed, 173 insertions(+), 16 deletions(-)
 create mode 100644 .plans/2026-06-17-ci-stress-test31a-flake-plan.md

diff --git a/.plans/2026-06-17-ci-stress-test31a-flake-plan.md b/.plans/2026-06-17-ci-stress-test31a-flake-plan.md
new file mode 100644
index 0000000..be8d801
--- /dev/null
+++ b/.plans/2026-06-17-ci-stress-test31a-flake-plan.md
@@ -0,0 +1,135 @@
+# Plan: de-flake unit Test 31(a) (leaked-claim discovery-route race) under load
+
+Status: **DONE** — diagnosis converged across 4 independent reviews (my code-read +
+leak.log + a fresh-context Claude subagent that did NOT read the prior diagnosis + Codex
+foreign-model review); fix implemented; implementation reviewed clean (see changelog).
+Test-only change; product untouched. Awaiting CI-under-load confirmation.
+
+## Reviewer notes (top; do not renumber)
+_(none yet)_
+
+## Context
+CI stress under both/load=2 (moderate, 4 hogs on a 4-core ubuntu runner — NOT the
+8-hog oversubscription regime) failed ONE assertion in unit **Test 31 sub-leg (a)**
+(`tests/git-commit-lock.test.sh:1582`), run 27626826865:
+```
+FAIL: no leaked-token-memory DISCOVERY-HOLD
+```
+Every other (a) assertion passed (recheck-unreadable feeder fired; rc 0; lock released
+cleanly; no claim/lock leftover); sub-legs (b)(c)(d) passed.
+
+### Mechanism (test-orchestration race; product correct)
+The product has TWO valid, equally-correct ways to adopt a leaked claim that a rival has
+installed at the lock path, and both log a `DISCOVERY-HOLD` line:
+- **D1 — inline ownership-discovery read.** `_lock_discover` (`git-commit-lock.sh:819`,
+  log at `:822` `DISCOVERY-HOLD: our claim ... installed ... by a rival's rename`) is the
+  unconditional final act of every post-claim non-rename exit. In (a) the steered
+  recheck-unreadable exit runs `_lock_leaked_add` (`:1112`, the `LEAKED-CLAIM` log) and
+  then **immediately, one statement later**, `_lock_discover "$tok"` (`:1114`).
+- **D2 — per-poll leaked-token-memory check.** `git-commit-lock.sh:1382`
+  (`DISCOVERY-HOLD (leaked-token memory): ...`) fires on a LATER blocked poll while the
+  memory list is non-empty.
+
+Sub-leg (a)'s harness is open-loop: it `wait_for_grep`s the `LEAKED-CLAIM` line
+(`:1574`) then does `mv -f -- "$LOCK.next" "$LOCK"` (`:1576`, the rival install). That
+`mv` races the leaver's inline `_lock_discover` at `:1114`:
+- mv lands **before** the inline discover → **D1** wins (the `:822` line). ← failing run
+- mv lands **after** the inline discover (it misses; later poll) → **D2** wins (`:1382`).
+
+The assertion at `:1582` hard-pins **D2** (`grep -q "DISCOVERY-HOLD (leaked-token
+memory)"`). Under load the leaver was descheduled between `:1112` and `:1114`, the
+harness `mv` landed first, D1 fired, D2 never logged → the assertion failed. The product
+behaved correctly in BOTH cases (token remembered, same token observed installed,
+adopted, rc 0, clean release, no residue). Classification: **test flake, product
+correct** — the assertion over-specified an implementation-incidental, scheduler-chosen
+route rather than the contract (a leaked claim installed by a rival is adopted and
+cleaned up).
+
+### Coverage (why relaxing (a) loses nothing)
+- **D2 (memory route)** is covered DETERMINISTICALLY by **sub-leg (b)** (`:1592-1627`):
+  it drives the rival install from inside `_lock_new_token` at NTC=2 so the leaver runs a
+  full aborting claim attempt and adopts only on the per-poll memory check; it asserts
+  `DISCOVERY-HOLD (leaked-token memory)` and the `leak < abort < adoption` ordering.
+- **D1 (direct route)** is covered DETERMINISTICALLY by **Test 25** (`:1323-1425`), the
+  discovery-position matrix: 7 internally-steered positions, each asserting the generic
+  `grep -q "DISCOVERY-HOLD"` + rc 0 + no orphan. (Test 25 already uses the generic grep
+  idiom this fix adopts for (a).)
+
+So (a)'s distinct, irreplaceable job is the END-TO-END "external rival installs a
+recheck-unreadable leaked claim → adopted & cleaned up" scenario, where either route is a
+correct outcome.
+
+## Fix (Option A — accept either discovery route; recommended by all four reviews)
+Test-only, in `tests/git-commit-lock.test.sh` sub-leg (a):
+1. Replace the single D2-pinning assertion (`:1582-1583`) with a three-way check that
+   accepts EITHER route, records WHICH fired (telemetry for the load hunt), and only
+   fails if NEITHER `DISCOVERY-HOLD` route adopted the claim:
+   ```sh
+   if grep -q "DISCOVERY-HOLD (leaked-token memory)" "$LOG"; then
+     ok "... per-poll memory route ..."
+   elif grep -q "DISCOVERY-HOLD:" "$LOG"; then
+     ok "... inline direct-discovery route ... (memory route pinned by sub-leg (b)) ..."
+   else
+     bad "no DISCOVERY-HOLD adoption of the leaked claim by EITHER route"
+   fi
+   ```
+   `"DISCOVERY-HOLD:"` (immediate colon) matches ONLY D1; D2's text is
+   `DISCOVERY-HOLD (leaked-token memory):` (space+paren after `HOLD`), so the two
+   patterns are disjoint and D2 is checked first regardless.
+2. Update sub-leg (a)'s header comment (`:1550-1552`) to state honestly that adoption may
+   go through either route, that the choice is a load-sensitive scheduling race, and that
+   the memory route is pinned deterministically by (b) and the direct route by Test 25.
+
+### Why A (not B/C)
+- **A** matches (a)'s real intent; not vacuous — still requires the recheck-unreadable
+  feeder (`:1574`), rc 0 (`:1581`), clean release + no leftover (`:1584-1585`), AND a
+  `DISCOVERY-HOLD` adoption (the log line only appears when `_lock_take_hold` runs via a
+  discovery path). No new timing introduced. Keeps (a) as the load-tolerant main leg.
+- **B** (force the memory route via internal steering) duplicates (b).
+- **C** (force the direct route) duplicates Test 25; also `_lock_discover` direct
+  coverage is already comprehensive there. (NB: the subagent's specific C steering — do
+  the mv inside the fire-once read shadow before returning empty — would actually
+  mis-classify the claim as `gone` not `unreadable`, killing the leak feeder; another
+  reason to avoid C. Verified against `_lock_claim_state`, `git-commit-lock.sh:840-850`.)
+
+## Out of scope / NOT changed
+- Product code (`git-commit-lock.sh`, `.ps1`) — no defect.
+- Sub-legs (b)(c)(d), Test 25, any other test.
+
+## Logging
+No product logging change. The new three-way `ok` line records which discovery route
+adopted the claim each run — a small telemetry win making the previously-hidden route
+choice visible in every (a) run's output (helps confirm load is exercising both routes).
+
+## Testing
+1. Static: `bash -n` + `shellcheck -S style` (v0.11.0, the CI gate) on the test file.
+2. Local: run the unit suite on this box; Test 31 (all sub-legs) must pass; confirm the
+   new `ok` line reports a route. Run Test 31 in a loop to confirm no regression.
+3. Real proof: CI under both/load=2 where (a) previously failed — the unit leg must stay
+   green and report a route each run.
+
+## Changelog (implementation)
+- Implemented Fix A in `tests/git-commit-lock.test.sh` sub-leg (a): the single
+  D2-pinning assertion became a three-way `if/elif/else` (memory route → ok; direct route
+  via `grep "DISCOVERY-HOLD:"` → ok; neither → bad). Rewrote (a)'s header comment to
+  document both routes, the load-sensitive race, and the deterministic coverage of each
+  (sub-leg (b) for memory, Test 25 for direct). No product code, no other test touched.
+- Static: `bash -n` + `shellcheck -S style` (v0.11.0, the CI gate) clean.
+- Local (Windows MSYS bash, pwsh 7.5.5): full unit suite **207 passed / 0 failed**
+  (fan-out auto-REDUCED under the box load). Sub-leg (a) passed via the memory route on
+  this UNLOADED box (`adoption went through the leaked-token memory (per-poll route ...)`),
+  confirming the normal path still fires and the new assertion accepts it; (b)(c)(d) green.
+- Diagnosis review (4 independent, all converged: test flake / product correct / Fix A):
+  my code-read + the verbatim leak.log, a fresh-context Claude subagent that did NOT read
+  the prior diagnosis, and a Codex foreign-model review. Codex additionally noted D1 is
+  already covered by Test 25's discovery-position matrix → option C (a new D1 sub-leg) is
+  redundant. (I verified Test 25 covers all 7 positions deterministically myself.)
+- Implementation review (2 independent, both clean / no findings): a fresh Claude reviewer
+  ("the change is correct ... no defect found") and Codex `exec` read-only ("None. The fix
+  is correct."). Both verified: grep patterns disjoint (BRE parens literal; `DISCOVERY-HOLD:`
+  needs an immediate colon, absent from the memory line), non-vacuity (a `DISCOVERY-HOLD`
+  line is logged one statement before the pure-assignment `_lock_take_hold`, so it reliably
+  implies a taken hold; backstopped by rc 0 + no-leftover + the feeder assertion), no new
+  race (greps run only after `wait "$w31"`), `$LOG` leg-dedicated (no cross-talk), and the
+  comment's sh:822/1382/1112/1114 line refs accurate.
+- Real proof pending: CI under both/load=2 where (a) previously failed (run 27626826865).
diff --git a/AGENTS.md b/AGENTS.md
index b15e5d2..7d93530 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -95,18 +95,20 @@ don't cap review rounds for cost; a wrong fix that resurfaces is worse than slow
   oversubscription + heavy disk), NOT yet confirmed genuine. STATUS: to classify before the
   50-clean hunt — decide hunt load level (cpu-only vs moderate both) and whether to harden
   those two budgets. Data: `.agent-testing/calibration.tsv`.
-- **Test 31(a) (unit, `git-commit-lock.test.sh:1582`)** — FOUND on **ubuntu** under
-  both/load=2 (moderate, likely genuine), run 27626826865: `FAIL: no leaked-token-memory
-  DISCOVERY-HOLD`. DIAGNOSED (not yet fixed): the product has two valid DISCOVERY-HOLD
-  paths — direct `_lock_discover` (sh:822) and the per-poll leaked-token-memory check
-  (sh:1382). 31(a)'s external `mv` (installs the leaked claim at the lock path) RACES the
-  leaver's `_lock_discover`; under load the mv landed first, so 822 adopted the claim
-  instead of the 1382 memory path the assertion pins. Product correct (rc 0, clean
-  release, no leftover all PASSed); test-orchestration race. Sibling 31(b) already covers
-  the memory path DETERMINISTICALLY (internal steering) and passed. Fix options + recommend
-  in `.agent-testing/failures/unit-test31/DIAGNOSIS.md` (recommend A: relax 31a to accept
-  generic DISCOVERY-HOLD since 31b covers memory — but VERIFY via the formal loop it's not
-  vacuous). NEEDS: subagent-diagnosis confirm + Codex review + plan + impl review.
+- **Test 31(a) (unit, `git-commit-lock.test.sh`)** — FOUND on **ubuntu** under both/load=2
+  (moderate, genuine), run 27626826865: `FAIL: no leaked-token-memory DISCOVERY-HOLD`.
+  Mechanism: the product has two valid DISCOVERY-HOLD adoption paths — direct
+  `_lock_discover` (sh:822) and the per-poll leaked-token-memory check (sh:1382). 31(a)'s
+  external `mv` (installs the leaked claim at the lock path) RACES the leaver's inline
+  `_lock_discover` (called one statement after the leak-add: sh:1112 -> sh:1114); under
+  load the mv landed first, so 822 adopted instead of the 1382 memory path the assertion
+  pinned. Product correct (rc 0, clean release, no leftover all PASSed); test-orchestration
+  race. **FIXED (commit <see git log>):** Fix A — sub-leg (a)'s assertion now accepts EITHER
+  DISCOVERY-HOLD route and records which fired (memory route still pinned deterministically
+  by 31(b); direct route by Test 25's 7-position discovery matrix, so no coverage lost).
+  Diagnosis converged across 4 independent reviews (code-read + leak.log + fresh Claude
+  subagent + Codex); impl reviewed clean by Claude + Codex; local unit suite 207/0. See
+  `.plans/2026-06-17-ci-stress-test31a-flake-plan.md`. Real proof pending: CI under load.
 
 ## Hunt status (as of 2026-06-17 ~01:15 local)
 - `both`/load=2 hunt reached **16/50 clean** then halted on Test 31(a) above. The driver
diff --git a/tests/git-commit-lock.test.sh b/tests/git-commit-lock.test.sh
index 57265a9..26fe69d 100755
--- a/tests/git-commit-lock.test.sh
+++ b/tests/git-commit-lock.test.sh
@@ -1548,8 +1548,18 @@ bad_touch="$(grep 'touch ' "$LIB" | grep '_LOCK_CLAIM_PATH' | grep -v -- '-c')"
 
 echo "== Test 31: LEAKED-claim discovery — the leaked-token memory closes the unverified-claim lanes =="
 # (a) main leg: a recheck-unreadable exit leaks the claim token; a rival
-# later installs that claim as the lock; the leaver's per-poll memory check
-# adopts it (HOLD) and release returns 0.
+# (the external mv below) then installs that claim as the lock; the leaver
+# adopts it (HOLD) and release returns 0. Adoption may go through EITHER of
+# the product's two discovery routes — both correct: the inline
+# ownership-discovery read that is the unreadable branch's final act
+# (git-commit-lock.sh:822, "DISCOVERY-HOLD: ...") if the external mv lands
+# before it, or the per-poll leaked-token-memory check
+# (git-commit-lock.sh:1382, "DISCOVERY-HOLD (leaked-token memory)") on a later
+# poll if it lands after. Which wins is a pure scheduling race — the external
+# mv vs the leaver's inline discover ONE statement later (sh:1112 leak-add ->
+# sh:1114 discover) — and is load-sensitive, so this leg accepts either and
+# records which fired. The memory route is pinned DETERMINISTICALLY by
+# sub-leg (b) below; the direct route by Test 25's discovery-position matrix.
 # NB: _lock_read_tok / _lock_cur_token shadows run inside COMMAND
 # SUBSTITUTIONS (subshells), so their fire-once state must live in flag
 # FILES — a variable assignment would be lost when the subshell exits.
@@ -1579,8 +1589,18 @@ else
 fi
 wait "$w31"; rc=$?
 [ "$rc" = 0 ] && ok "leaver discovered its installed leaked claim and released rc 0" || bad "leaked-discovery harness rc=$rc"
-grep -q "DISCOVERY-HOLD (leaked-token memory)" "$LOG" && ok "adoption went through the leaked-token memory" \
-                                                      || bad "no leaked-token-memory DISCOVERY-HOLD"
+# Either discovery route is correct here (see the leg comment); accept both,
+# record which fired, fail only if NEITHER adopted the leaked claim. ("$LOG"
+# is dedicated to this leg, so there is no cross-talk.) "DISCOVERY-HOLD:"
+# (immediate colon) matches ONLY the direct route; the memory route reads
+# "DISCOVERY-HOLD (leaked-token memory):" — disjoint, and checked first.
+if grep -q "DISCOVERY-HOLD (leaked-token memory)" "$LOG"; then
+  ok "adoption went through the leaked-token memory (per-poll route; the mv landed after the inline discover)"
+elif grep -q "DISCOVERY-HOLD:" "$LOG"; then
+  ok "adoption went through the inline ownership-discovery read (direct route; the mv landed first) — memory route pinned by sub-leg (b)"
+else
+  bad "no DISCOVERY-HOLD adoption of the leaked claim by EITHER route"
+fi
 [ -e "$LOCK" ] && bad "lock leftover after leaked-claim adoption" || ok "lock released cleanly after adoption"
 [ -e "$LOCK.next" ] && bad "claim leftover after leaked-claim adoption" || ok "no claim leftover"
 # Hmm wait: STALE=300 — the ghost is backdated 9999 so it IS stale; fine.

From 810ee415f398d02b920ffd274d68d206d138a24a Mon Sep 17 00:00:00 2001
From: Ben Toner <bentoner@bentoner.com>
Date: Wed, 17 Jun 2026 02:00:27 +1000
Subject: [PATCH 11/76] AGENTS.md: mark Test 31(a) fixed (51a1753); resume
 hunt; ignore *.stackdump

Fill the real fix SHA into the Test 31(a) progress entry, update the hunt status
(clean_count reset; both/load=2 hunt resumed toward 50 clean on the fixed tree;
three flakes fixed this session). Add *.stackdump to .gitignore so the suite's
transient Cygwin crash dumps stop cluttering git status during the hunt.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 .gitignore |  1 +
 AGENTS.md  | 19 +++++++++++--------
 2 files changed, 12 insertions(+), 8 deletions(-)

diff --git a/.gitignore b/.gitignore
index be293f3..9bdb6bd 100644
--- a/.gitignore
+++ b/.gitignore
@@ -5,6 +5,7 @@
 # OS / editor cruft
 .DS_Store
 Thumbs.db
+*.stackdump
 /.agent/review-queue
 /.agent/review-queue.lock
 /.agent/review-queue.lock.*
diff --git a/AGENTS.md b/AGENTS.md
index 7d93530..1a9f4ae 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -103,17 +103,20 @@ don't cap review rounds for cost; a wrong fix that resurfaces is worse than slow
   `_lock_discover` (called one statement after the leak-add: sh:1112 -> sh:1114); under
   load the mv landed first, so 822 adopted instead of the 1382 memory path the assertion
   pinned. Product correct (rc 0, clean release, no leftover all PASSed); test-orchestration
-  race. **FIXED (commit <see git log>):** Fix A — sub-leg (a)'s assertion now accepts EITHER
+  race. **FIXED (commit 51a1753):** Fix A — sub-leg (a)'s assertion now accepts EITHER
   DISCOVERY-HOLD route and records which fired (memory route still pinned deterministically
   by 31(b); direct route by Test 25's 7-position discovery matrix, so no coverage lost).
   Diagnosis converged across 4 independent reviews (code-read + leak.log + fresh Claude
   subagent + Codex); impl reviewed clean by Claude + Codex; local unit suite 207/0. See
   `.plans/2026-06-17-ci-stress-test31a-flake-plan.md`. Real proof pending: CI under load.
 
-## Hunt status (as of 2026-06-17 ~01:15 local)
-- `both`/load=2 hunt reached **16/50 clean** then halted on Test 31(a) above. The driver
-  exited cleanly (sentinel FAIL); no stray dispatcher; no in-flight runs.
-- To RESUME after fixing Test 31(a): `cd .agent-testing && rm -f clean_count sentinel STOP
-  && STRESS_KIND=both STRESS_LOAD=2 TARGET=50 bash ./driver.sh` (background). Expect it to
-  surface further flakes (each is a fresh loop). Load=2 avoids the 8-hog budget artifacts.
-- TWO flakes fixed & pushed this session: Test 17d (58c3741), interop Test 5 (06c6d8e).
+## Hunt status (as of 2026-06-17 ~02:30 local)
+- Test 31(a) FIXED (51a1753) via the full formal loop; clean_count reset to 0 and the
+  `both`/load=2 hunt RESUMED toward 50 clean (the prior 16/50 streak was on pre-fix code,
+  so it does not count — we want 50 clean on the FIXED tree). Expect more flakes; each is a
+  fresh loop. Load=2 (4 hogs/4 cores) avoids the 8-hog budget artifacts (Test 21/22a).
+- To resume after any halt: `cd .agent-testing && rm -f clean_count sentinel STOP &&
+  STRESS_KIND=both STRESS_LOAD=2 TARGET=50 bash ./driver.sh` (background). First verify no
+  stray dispatcher + current HEAD (see Process hygiene).
+- THREE flakes fixed & pushed this session: Test 17d (58c3741), interop Test 5 (06c6d8e),
+  Test 31(a) (51a1753).

From 19a28fd294ee5fb663ba4472b64aed49ae78fcdd Mon Sep 17 00:00:00 2001
From: Ben Toner <bentoner@bentoner.com>
Date: Wed, 17 Jun 2026 03:15:18 +1000
Subject: [PATCH 12/76] =?UTF-8?q?Test=2032b:=20cover=20F2=20=E2=80=94=20st?=
 =?UTF-8?q?eal=20rename=20WON=20but=20read-back=20verification=20FAILED?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

A coverage audit (subagent + my own verification against the code) found the
product's two acquire read-back-verification failure lanes were asymmetrically
covered. The create-path lane (O_EXCL create wins, path reads back the wrong token,
git-commit-lock.sh:1354-1360) is covered by Test 32. Its steal-path twin "F2"
(git-commit-lock.sh:1168-1179) was NOT: the stealer wins the claim race AND wins the
rename-over (STOLE-BY-CLAIM logged, ghost destroyed), but the mandatory post-rename
read-back at :1171 reads back the wrong token, so the product must clear its claim
token and re-enter the wait loop rather than take the hold. After a STOLE-BY-CLAIM a
silent false-hold there would be a mis-attributed hold of a destroyed-ghost path, so
this is the higher-stakes twin — and nothing exercised it.

Test 32b closes the gap. It mirrors Test 32 with the INVERSE token gate: a one-shot
_lock_cur_token shadow gated on [ -n "$_LOCK_CLAIM_TOKEN" ] lands the read-back fault
at the STEAL read-back (:1171), not the create one (:1353, where the claim token is
empty). On firing it backdates the just-installed abandoned lock so that it is
already stale and the re-steal is immediate (same trick as Test 32; keeps it fast
and deterministic); the second
attempt (shadow spent) reads back the real token and acquires, releasing rc 0. The
test asserts the F2-specific log line (not the shared "acquire verification FAILED"
prefix), STOLE-BY-CLAIM x2, the WARNING preceding the eventual ACQUIRED (no
false-hold), and no leftovers. The stale closing NOTE that called the read-back lanes
"not suite-covered" is corrected (create by Test 32, F2 by Test 32b).
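
Why the one-shot state lives in a flag FILE can be shown with a small standalone
sketch. Everything named here (real_token, token_file_gate, token_var_gate, the
tok.real value) is hypothetical and chosen for illustration; only the idea, that
a shadow called inside a command substitution runs in a subshell, comes from
the test.

```sh
# Sketch only: hypothetical stand-ins, not product or suite functions.
real_token() { printf 'tok.real'; }

gate_dir="$(mktemp -d)"
flag="$gate_dir/fired"       # flag FILE: visible to every subshell
fired_var=0                  # variable flag, kept for contrast

# One-shot shadow keyed on the flag file: fires once, then passes through.
token_file_gate() {
  if [ ! -e "$flag" ]; then
    : > "$flag"              # record "spent" on disk
    printf ''                # blank read-back, this one time only
    return 0
  fi
  real_token
}

# One-shot shadow keyed on a variable: the assignment is lost when the
# command-substitution subshell exits, so it fires on every call.
token_var_gate() {
  if [ "$fired_var" = 0 ]; then
    fired_var=1
    printf ''
    return 0
  fi
  real_token
}

a1="$(token_file_gate)"; a2="$(token_file_gate)"
b1="$(token_var_gate)";  b2="$(token_var_gate)"
printf 'file gate: first=[%s] second=[%s]\n' "$a1" "$a2"
printf 'var gate:  first=[%s] second=[%s]\n' "$b1" "$b2"
rm -rf "$gate_dir"
```

The file-gated shadow blanks only the first read; the variable-gated one blanks
both, which in a real harness would mean the fault never clears and the
acquire could only time out.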

Product code is unchanged; F2 reads correctly today. This is a missing test
(regression exposure), not a present bug.

Diagnosis from a coverage-audit subagent, verified by me against the code. Test-only;
no product change. Static checks clean; local suite 0 failed; Test 32b verified to
exercise the F2 lane (standalone + full suite). Implementation reviewed clean by a
fresh Claude reviewer and by Codex.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 ...6-06-17-ci-stress-test-f2-coverage-plan.md | 97 +++++++++++++++++++
 tests/git-commit-lock.test.sh                 | 61 +++++++++++-
 2 files changed, 155 insertions(+), 3 deletions(-)
 create mode 100644 .plans/2026-06-17-ci-stress-test-f2-coverage-plan.md

diff --git a/.plans/2026-06-17-ci-stress-test-f2-coverage-plan.md b/.plans/2026-06-17-ci-stress-test-f2-coverage-plan.md
new file mode 100644
index 0000000..e1d9f4e
--- /dev/null
+++ b/.plans/2026-06-17-ci-stress-test-f2-coverage-plan.md
@@ -0,0 +1,97 @@
+# Plan: cover F2 — steal rename WON but read-back verification FAILED (coverage gap)
+
+Status: **DONE** — implemented; reviewed clean (see changelog). Test-only addition; product
+untouched.
+
+## Reviewer notes (top; do not renumber)
+_(none yet)_
+
+## Context
+A coverage audit (subagent + my own verification against the code) found that the product's
+two acquire read-back-verification failure lanes are asymmetrically covered:
+- **Create path (outcome I)** — `git-commit-lock.sh:1354-1360`: O_EXCL create wins, the path
+  read-back ≠ our token → `WARNING: acquire verification FAILED — create won but read-back
+  found ...` → re-enter wait. **Covered** by Test 32 (`tests/git-commit-lock.test.sh:1760`),
+  whose `_lock_cur_token` shadow is gated `[ -z "$_LOCK_CLAIM_TOKEN" ]` (fires only at the
+  create read-back).
+- **Steal path (outcome F2)** — `git-commit-lock.sh:1168-1179`: the stealer WON the claim
+  race AND won the rename-over (`STOLE-BY-CLAIM` already logged, ghost destroyed), but the
+  post-rename read-back ≠ our token → `WARNING: acquire verification FAILED — steal rename
+  completed but read-back found ...` → clear `_LOCK_CLAIM_TOKEN`, return 1, re-enter wait.
+  **UNCOVERED.** Verified: no test greps the F2 string; Test 32's gate excludes it (at the
+  steal read-back `_LOCK_CLAIM_TOKEN` is set); on the success-rename path `:1171` is the only
+  `_lock_cur_token` call with the claim token set (`_lock_rename_over` `:961-979` makes none).
+
+F2 is the higher-stakes twin: it fires AFTER `STOLE-BY-CLAIM` (ghost already destroyed), so a
+future regression here (wrongly taking the hold on a mismatched read-back, or failing to clear
+`_LOCK_CLAIM_TOKEN`) would be a silent false-hold / mis-attributed release. The code reads
+correctly today — this is a missing-test (regression exposure), not a present bug.
+
+The suite's closing NOTE (`:2119-2121`) says "lock_acquire's read-back-verification failure
+lane … not suite-covered", but Test 32 already covers the create lane — the note is stale and
+does not distinguish F2.
+
+## Change (test-only)
+1. Add **Test 32b** immediately after Test 32, mirroring Test 32 with the INVERSE token gate
+   so the fault injection lands at the STEAL read-back:
+   - Set up a stale ghost (`fabricate_lock` + `backdate 9999`) so a steal is attempted.
+   - In a sourced subshell, `clone_fn _lock_cur_token _ct_orig`; shadow it to fire ONCE
+     (flag FILE `$SF1`, subshell-safe) when `[ ! -e "$SF1" ] && [ "${_LOCK_HELD:-0}" = 0 ]
+     && [ -n "$_LOCK_CLAIM_TOKEN" ]` — i.e. at the steal read-back (`:1171`), where the claim
+     token is set and the hold is not yet taken. On firing: `backdate "$AGENT_LOCK_PATH"
+     9999` (so the just-installed abandoned lock is immediately re-stealable — same trick as
+     Test 32, keeps it fast/deterministic), `printf ""` (blank read-back → F2), `return 0`.
+   - `lock_acquire || exit 72; lock_release || exit 74; exit 0`.
+   - Flow: attempt 1 — claim won, rename won (`STOLE-BY-CLAIM`), read-back blanked → F2
+     WARNING → re-enter wait; the abandoned lock is stale → attempt 2 steals it, read-back now
+     real (SF1 set) → HOLD → `ACQUIRED` → release rc 0.
+   - Assertions: rc 0; the **F2-specific** string `steal rename completed but read-back`
+     fired (else `bad "F2 lane never ran"` — guards vacuity / proves the steering reached
+     `:1171`); the WARNING precedes the final `ACQUIRED` (no false-hold on attempt 1);
+     `STOLE-BY-CLAIM` count ≥ 2 (re-stole after the failed read-back); no leftover lock/claim
+     after release.
+2. Update the stale NOTE (`:2119-2121`): both read-back lanes are now suite-covered — create
+   by Test 32, steal by Test 32b — via `_lock_cur_token` fault injection.
+
+## Why deterministic / load-robust
+Internal steering (no scheduling race); the backdate-9999 trick removes any aging wait so the
+re-steal is immediate; `MAX_WAIT=30`, `POLL=0.1` give ample headroom under CI load. Same shape
+as the already-load-robust Test 32.
+
+## Logging
+No product logging change. The new test asserts on existing product log lines (the F2 WARNING,
+`STOLE-BY-CLAIM`, `ACQUIRED`).
+
+## Out of scope / NOT changed
+- Product code (`git-commit-lock.sh`, `.ps1`) — no defect; F2 reads correct.
+- Lower-priority gaps from the audit (A2/G2 wrong-type appearing at the lock path mid-steal;
+  platform-only feeder #3) — left for a separate decision.
+
+## Testing
+1. Static: `bash -n` + `shellcheck -S style` (v0.11.0, the CI gate).
+2. Local: run the new test (and the full suite); it MUST exercise the F2 string (the
+   `bad "F2 lane never ran"` guard fails loudly if the steering misses `:1171`).
+3. Real proof: CI under load (the hunt) stays green with the new test.
+
+## Changelog (implementation)
+- Added Test 32b to `tests/git-commit-lock.test.sh` (after Test 32) and updated the closing
+  NOTE so both read-back lanes read as covered (create by Test 32, steal/F2 by Test 32b).
+  Product untouched.
+- Verified the steering empirically: a standalone extract of Test 32b (suite header + the
+  Test 32b block, `LIB` pinned absolute) passed 6/6 with the F2-specific line
+  `the steal-path read-back-verification failure lane ran (F2)` firing — proving the fault
+  lands at `git-commit-lock.sh:1171` (`_LOCK_CLAIM_TOKEN` set there; `_lock_rename_over`
+  makes no read; the create read-back at :1353 has it empty).
+- Static: `bash -n` + `shellcheck -S style` (v0.11.0) clean.
+- Local: full unit suite **220 passed / 0 failed** (count varies run-to-run via the fan-out
+  tests; 0 failed is the invariant). Test 32b: rc 0, F2 string fired, STOLE-BY-CLAIM x2,
+  WARNING-before-ACQUIRED, no leftovers.
+- Impl review (2 independent, both clean): fresh Claude reviewer ("VERDICT: CORRECT … No
+  defects") — independently ran the suite twice (220/0), grepped every `_LOCK_CLAIM_TOKEN`
+  set/clear and `_lock_cur_token` call site, confirmed gate precision (all `_lock_discover`
+  branches clear the claim token first, so the `-n` gate excludes :820; release excluded via
+  `_lock_take_hold`), determinism, non-vacuity, termination. Codex `exec` read-only ("No
+  findings … correct and non-vacuous"), confirming the same with file:line cites. Two minor
+  non-blocking notes (the SF1 flag file lives in the throwaway WORK dir; `_ct_orig "$@"` is
+  harmless) — no action.
+- Real proof: CI under load (the hunt) with Test 32b in the tree.
diff --git a/tests/git-commit-lock.test.sh b/tests/git-commit-lock.test.sh
index 26fe69d..b5ca5ee 100755
--- a/tests/git-commit-lock.test.sh
+++ b/tests/git-commit-lock.test.sh
@@ -1801,6 +1801,59 @@ grep -q "DISCOVERY-HOLD" "$LOG" && bad "FALSE discovery-HOLD on the abandoned ow
 grep -q "STOLE-BY-CLAIM" "$LOG" && ok "the abandoned lock was then reclaimed by a normal steal" \
                                 || bad "no STOLE-BY-CLAIM of the abandoned lock"
 
+echo "== Test 32b: steal-path read-back FAILED — rename-over WON but the lock did not read back our token (F2) =="
+# The steal-path twin of Test 32. Here the stealer WINS the claim race AND wins
+# the rename-over (STOLE-BY-CLAIM is logged, the ghost is destroyed), but the
+# mandatory post-rename read-back verification (git-commit-lock.sh:1171) comes
+# back wrong. The product must NOT take the hold: it clears its claim token and
+# re-enters the wait loop (git-commit-lock.sh:1176-1179) — never a silent
+# false-hold (which, after a STOLE-BY-CLAIM, would mean a mis-attributed hold of
+# a destroyed-ghost path). We fault-inject the read-back with a one-shot
+# _lock_cur_token shadow gated on the claim token being SET (the INVERSE of Test
+# 32's `-z` gate), so it lands at the STEAL read-back (claim token live, not yet
+# held), not the create one. On firing we also backdate the just-installed
+# abandoned lock stale so the re-steal is immediate (same trick as Test 32 —
+# keeps it fast and deterministic). Attempt 2 (shadow spent) reads back the real
+# token and acquires normally.
+LOCK="$WORK/stealrb.lock"; LOG="$WORK/stealrb.log"; : > "$LOG"
+fabricate_lock "$LOCK" "tok.ghost.t32b" "pid=9 host=ghost"; backdate "$LOCK" 9999
+AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" AGENT_LOCK_STALE_SECS=5 \
+  AGENT_LOCK_CLAIM_STALE_SECS=60 AGENT_LOCK_POLL_SECS=0.1 AGENT_LOCK_MAX_WAIT=30 \
+  bash -c '
+    source "$1" || exit 70
+    clone_fn _lock_cur_token _ct_orig
+    SF1="$AGENT_LOCK_PATH.steer1"      # flag FILE: the cur_token shadow runs in subshells
+    _lock_cur_token() {
+      if [ ! -e "$SF1" ] && [ "${_LOCK_HELD:-0}" = 0 ] && [ -n "$_LOCK_CLAIM_TOKEN" ]; then
+        : > "$SF1"
+        backdate "$AGENT_LOCK_PATH" 9999 2>/dev/null || true
+        printf ""
+        return 0
+      fi
+      _ct_orig "$@"
+    }
+    lock_acquire || exit 72
+    lock_release || exit 74
+    exit 0
+  ' _ "$LIB" 2>/dev/null; rc=$?
+[ "$rc" = 0 ] && ok "steal read-back failure re-entered wait; a later steal acquired and released rc 0" \
+              || bad "steal-readback harness rc=$rc"
+grep -q "steal rename completed but read-back" "$LOG" \
+  && ok "the steal-path read-back-verification failure lane ran (F2)" \
+  || bad "F2 lane never ran (the read-back fault did not land at the steal read-back)"
+nstole="$(grep -c "STOLE-BY-CLAIM" "$LOG")"
+[ "$nstole" -ge 2 ] && ok "re-stole after the failed read-back (STOLE-BY-CLAIM x$nstole)" \
+                    || bad "expected >=2 STOLE-BY-CLAIM (won-rename then re-steal), got $nstole"
+warn_line="$(grep -n "steal rename completed but read-back" "$LOG" | head -1 | cut -d: -f1)"
+acq_line="$(grep -n "ACQUIRED " "$LOG" | tail -1 | cut -d: -f1)"
+if [ -n "$warn_line" ] && [ -n "$acq_line" ] && [ "$warn_line" -lt "$acq_line" ]; then
+  ok "no false-hold: the read-back WARNING preceded the eventual ACQUIRED"
+else
+  bad "ordering: expected the F2 WARNING (line $warn_line) before ACQUIRED (line $acq_line)"
+fi
+[ -e "$LOCK" ] && bad "lock leftover after the steal-readback walk" || ok "lock released cleanly"
+[ -e "$LOCK.next" ] && bad "claim leftover after the steal-readback walk" || ok "no claim leftover"
+
 echo "== Test 33: TERM mid-claim — the trap deletes the claim (token-checked), no 98, no ageout penalty =="
 # (a) main: claimant paused inside its claim window (at the touch), TERM'd.
 # The trap must delete OUR claim, run the discovery read (miss: the ghost is
@@ -2116,9 +2169,11 @@ rm -f "$LOCK" "$LOCK.next"
 #   blocker is most naturally a pwsh FileShare.Read holder, so the interop
 #   suite owns that test (on POSIX, unlink never blocks on open handles and
 #   the lane is unreachable).
-# * lock_acquire's read-back-verification failure lane needs fault injection
-#   to make a winning create read back wrong; it is defence in depth (see the
-#   ACQUIRE VERIFICATION header section), not suite-covered.
+# * lock_acquire's read-back-verification failure lanes (defence in depth; see
+#   the ACQUIRE VERIFICATION header section) are covered via _lock_cur_token
+#   fault injection: the create-path lane (create won, read-back wrong) by
+#   Test 32, the steal-path lane (F2 — rename-over won, read-back wrong) by
+#   Test 32b.
 
 echo
 echo "==== RESULT: $PASS passed, $FAIL failed (fan-out: $GCL_MODE) ===="

From c762899aa99597f343f9c39bce5eca4d7099b82f Mon Sep 17 00:00:00 2001
From: Ben Toner <bentoner@bentoner.com>
Date: Wed, 17 Jun 2026 03:19:24 +1000
Subject: [PATCH 13/76] AGENTS.md: record F2 coverage addition (Test 32b,
 19a28fd) and hunt restart on final tree

---
 AGENTS.md | 27 +++++++++++++++++++++------
 1 file changed, 21 insertions(+), 6 deletions(-)

diff --git a/AGENTS.md b/AGENTS.md
index 1a9f4ae..c9186db 100644
--- a/AGENTS.md
+++ b/AGENTS.md
@@ -110,13 +110,28 @@ don't cap review rounds for cost; a wrong fix that resurfaces is worse than slow
   subagent + Codex); impl reviewed clean by Claude + Codex; local unit suite 207/0. See
   `.plans/2026-06-17-ci-stress-test31a-flake-plan.md`. Real proof pending: CI under load.
 
-## Hunt status (as of 2026-06-17 ~02:30 local)
-- Test 31(a) FIXED (51a1753) via the full formal loop; clean_count reset to 0 and the
-  `both`/load=2 hunt RESUMED toward 50 clean (the prior 16/50 streak was on pre-fix code,
-  so it does not count — we want 50 clean on the FIXED tree). Expect more flakes; each is a
-  fresh loop. Load=2 (4 hogs/4 cores) avoids the 8-hog budget artifacts (Test 21/22a).
+## Coverage work (not a flake — Ben asked, 2026-06-17)
+- **F2 read-back lane (commit 19a28fd):** a coverage audit (subagent + my code verification)
+  found the steal-path acquire read-back-verification failure lane uncovered — the stealer
+  WINS the claim race AND the rename-over (`STOLE-BY-CLAIM` logged, ghost destroyed) but the
+  post-rename read-back (`git-commit-lock.sh:1171`) reads the wrong token → must re-enter wait,
+  not false-hold. Its create-path twin (`:1358`) was covered by Test 32; F2 was not. Added
+  **Test 32b** (deterministic; mirrors Test 32 with the inverse `[ -n "$_LOCK_CLAIM_TOKEN" ]`
+  gate to land the fault at the steal read-back). Reviewed clean by fresh Claude + Codex;
+  suite 0-failed; F2 lane empirically exercised. Plan:
+  `.plans/2026-06-17-ci-stress-test-f2-coverage-plan.md`. Product unchanged (F2 reads correct;
+  this was regression-exposure, not a bug). Audit also flagged LOWER-priority gaps left for
+  Ben: A2/G2 (a non-file appearing AT the lock path mid-steal — `CLAIM-ABORT (wrong-type)` /
+  `(rename-refused)`), and that feeder-#3/blocked-unlink legs are Windows+pwsh-only.
+
+## Hunt status (as of 2026-06-17 ~03:20 local)
+- The `both`/load=2 hunt reached **40/50 clean** on the post-31(a)-fix tree (810ee41) with
+  ZERO failures, then I gracefully STOPped it to fold in the Test 32b coverage addition.
+  Restarted at **0/50 on the final tree 19a28fd** (with Test 32b) — a test-only change resets
+  the streak per the "50 clean on the current tree" rule, so the contiguous-50 is measured on
+  the final suite. Load=2 (4 hogs/4 cores) avoids the 8-hog budget artifacts (Test 21/22a).
 - To resume after any halt: `cd .agent-testing && rm -f clean_count sentinel STOP &&
   STRESS_KIND=both STRESS_LOAD=2 TARGET=50 bash ./driver.sh` (background). First verify no
   stray dispatcher + current HEAD (see Process hygiene).
 - THREE flakes fixed & pushed this session: Test 17d (58c3741), interop Test 5 (06c6d8e),
-  Test 31(a) (51a1753).
+  Test 31(a) (51a1753). Plus one coverage addition: Test 32b / F2 (19a28fd).

From 9438da0c85040ce4855f515e5fc976504e356dea Mon Sep 17 00:00:00 2001
From: Ben Toner <bentoner@bentoner.com>
Date: Wed, 17 Jun 2026 12:16:51 +1000
Subject: [PATCH 14/76] docs: add failure-modes design map for scope decisions

A decision-support document classifying every failure mode into three
robustness tiers (correctness / best-effort-in-envelope / out-of-scope),
each grounded in product code + tests with file:line citations, with a
recommendation on whether it should be an in-scope guarantee.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 docs/failure-modes.md | 685 ++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 685 insertions(+)
 create mode 100644 docs/failure-modes.md

diff --git a/docs/failure-modes.md b/docs/failure-modes.md
new file mode 100644
index 0000000..199e9da
--- /dev/null
+++ b/docs/failure-modes.md
@@ -0,0 +1,685 @@
+# git-commit-lock: failure-mode map and scope decisions
+
+**Status:** decision-support document. For each failure mode it states the
+tool's *current* behavior (grounded in the product code and tests), classifies
+it into one of three robustness tiers, and recommends whether it should be an
+in-scope guarantee. The owner uses this to deliberately decide, per mode, "yes,
+we guarantee this" or "no, out of scope."
+
+**Sources of truth, in order:** the product code
+(`git-commit-lock.sh`, `git-commit-lock.ps1`) and the test suites
+(`tests/git-commit-lock.test.sh`, `tests/git-commit-lock.interop.test.sh`,
+`tests/git-commit-lock.integration.test.sh`). Every claim below cites
+`file:line`. The narrative docs (`README.md`, `docs/git-commit-lock.md`) and
+the implementation header comments are corroborating, not authoritative — where
+this document relies on a header comment it has verified the comment against the
+code. (Cited line numbers are against the tree at commit `c762899`; treat them
+as anchors, not exact addresses, if the files move.)
+
+A note on epistemics: the bash file's header (`git-commit-lock.sh:1-426`) is
+itself an exhaustive design narrative and the ps1 header
+(`git-commit-lock.ps1:41-177`) mirrors it. They are unusually trustworthy as
+documentation *because* the tests pin the behaviors they describe. This document
+does not re-derive the protocol; it re-classifies it for a scope decision and
+flags the boundaries the headers state but a reader might skip.
+
+---
+
+## 1. The core guarantee (what must hold under ANY conditions)
+
+**Mutual exclusion + detectable failure.** At most one process at a time
+believes it holds the lock *and* is right about it. The lock cannot be silently
+lost: a holder whose lease was taken from it learns so — `lock_release` returns
+**98** and logs a loud WARNING — rather than reporting a serialized commit that
+wasn't (`git-commit-lock.sh:1607-1688`; `git-commit-lock.ps1:1700-1845`). The
+three reserved failure codes mean the wrapped command was provably *not* run
+(96 usage, 97 timeout) or provably *not serialized* (98)
+(`git-commit-lock.sh:392-415`). There is no fourth outcome in which a process
+whose lease was taken still reports a serialized success.
+
+This is a **lease, not a kernel lock** (`docs/git-commit-lock.md:60-126`
+explains why no OS primitive spans bash-on-MINGW and PowerShell/.NET). The
+deliberate consequence: a hold longer than the staleness window (default 300s)
+*can* be stolen mid-work — "fail-open." That is accepted by design and made
+*detectable* (the 98 path), not prevented (`git-commit-lock.sh:213-227`). So the
+core guarantee is precisely: **no silent lost update.** Eventual recovery is
+classed Tier 1 below; recovery *latency* and wall-clock stall bounds are
+best-effort within an operating envelope (Tier 2), not absolute.
+
+The integration suite is the end-to-end witness for this guarantee on the real
+use case: many workers committing into one repo, audited for "every commit
+lands, history linear, no sweep-up, no `index.lock` races, no stolen leases,
+clean tree" (`tests/git-commit-lock.integration.test.sh:10-12, 226-283`).
+
+### The three tiers used throughout
+
+1. **Correctness guarantee** — must hold under *any* conditions (load, slow FS,
+   adversarial scheduling): mutual exclusion, no corruption, no silent loss,
+   eventual recovery. If one of these can break, it is a bug.
+2. **Best-effort within a stated envelope** — holds under normal/expected
+   conditions, degrades gracefully (and *detectably*) under pathological ones.
+   Everything wall-clock-bounded lives here, because wall-clock bounds depend on
+   scheduling: timeouts, recovery latency, the diagnostic warnings that depend
+   on timing. Correctness is preserved; only liveness/latency degrades.
+3. **Out of scope** — explicitly not handled; the operating envelope excludes
+   it. Damage, if any, is bounded and documented.
+
+---
+
+## 2. Summary table
+
+Legend — **Tier:** 1 correctness / 2 best-effort-in-envelope / 3 out-of-scope.
+**Tested:** ✓ deterministic test · ~ load/timing-sensitive or partial · ○
+robust-by-code-but-unverified · S static/grep check · (plat) platform-gated.
+
+| # | Failure mode | Current behavior | Tier | Tested | Recommendation |
+|---|---|---|---|---|---|
+| A1 | Clean high contention (N workers, no crashes) | Serialized; no lost update | 1 | ✓ U:166-195, I:227-261/341-386, integ | **In scope.** Keep. |
+| A2 | Thundering herd recovering one dead lock | Claim serializes; exactly one steal, zero displacement | 1 | ✓ U:212-346, I:884-1015 | **In scope.** Keep. |
+| A3 | Many concurrent stealers on one ghost | One O_EXCL claim winner | 1 | ✓ U:1095-1128, I:1017-1088 | **In scope.** Keep. |
+| B1 | Holder dies (crash/SIGKILL/power) mid-hold | Lease ages out; stolen after STALE | 1 (recovery) / 2 (latency) | ✓ U:197-210/348-361 | **In scope** (recovery). Latency = Tier 2. |
+| B2 | Holder dies mid-CLAIM (trappable: INT/TERM) | Trap deletes claim, token-checked; discovery read | 1 | ✓ U:1857-1928, I:1151-1244 | **In scope.** Keep. |
+| B3 | Holder dies mid-claim (untrappable: SIGKILL) | Claim ages out ≤ CLAIM_STALE; rival rename can install unowned lock, recovered ≤ STALE | 2 | ✓ U:1648-1677 (forensics) | **Accept** (residual 5). Bounded, no false success. |
+| B4 | Slow but UNCONTENDED holder overruns STALE | Keeps its lock (nothing moved it) | 1 | ✓ U:419-429, I:494-499 | **In scope.** Keep. |
+| B5 | Slow CONTENDED holder overruns STALE | Stolen; robbed holder detects at release → 98 | 1 (detection) | ✓ U:387-417, I:460-492 | **In scope.** This *is* fail-open-but-detectable. |
+| C1 | Orphaned/stale lock | mtime-stale → stolen via claim | 1 | ✓ U:197-210 | **In scope.** Keep. |
+| C2 | Empty lock (crash between create+write) | Empty + stale → stealable | 1 | ✓ U:348-361 | **In scope.** Keep. |
+| C3 | Crashed-claimant / empty claim orphan | Ages out ≤ CLAIM_STALE; cleared | 1 (recovery) / 2 (latency) | ✓ U:1130-1154 | **In scope.** Keep. |
+| C4 | Leaked claim (unverifiable unlink) | Leaked-token memory keeps ownership discoverable | 1 | ✓ U:1549-1758, U:2013-2164 | **In scope.** Keep. |
+| D1 | Atomic rename-over (steal install) | `mv -T` / `File.Move(...,true)` / 5.1 unlink+move | 1 (local FS) | ✓ U:212-346, I:16d S:1141 | **In scope on local FS.** Boundary = D-axis. |
+| D2 | O_EXCL atomic create | `set -C` redirect / `FileMode.CreateNew` | 1 (local FS) | ✓ throughout | **In scope on local FS.** |
+| D3 | Wrong-type at path (dir/symlink/FIFO/dev/socket) | Never stolen/deleted; loud warn; waiters → 97 | 1 (bash + ps1-on-Win) / 2 (ps1-on-POSIX) | ✓ U:818-892/1156-1262, ~(plat) | **In scope.** ps1-on-POSIX residual = accept. |
+| D4 | Non-lock CONTENT at path (user file) | Never stolen (content guard); warn | 1 | ✓ U:1034-1076 | **In scope.** Two accepted residuals (§D4). |
+| D5 | Case-insensitive FS path collision | Not handled explicitly | 3 | ✗ | **Likely non-issue;** see §D5. Decide. |
+| E1 | Network/shared FS (NFS/SMB/9p/Dropbox) | Outside design guarantees (stated) | 3 | ✗ | **Out of scope** (stated). See §E — decide whether to *enforce*. |
+| E2 | Multi-host clock skew / NTP jump | Implicitly single-clock; **not** addressed in docs | 3 (and a doc gap) | ✗ | **Out of scope** but UNDER-documented. See §E2. |
+| F1 | Disk full (ENOSPC) during create/write | Create fails → wait; torn write ages out | 2/3 | ○ (reasoned, not tested) | **Accept**, document. See §F1. |
+| F2 | ENOSPC during LOG write | Swallowed (`|| true`); silent log loss | 2 | ○ | **Accept;** logging is best-effort by design. |
+| F3 | Inode / FD exhaustion | Create fails → wait → 97 | 2 | ○ | **Accept**, document. |
+| F4 | Read-only / unwritable lock dir or parent | `mkdir -p` best-effort; create fails → wait → 97 | 2 | ○ | **Accept**, document. See §F4. |
+| G1 | Lock path = a directory / `$HOME` typo | Never stolen/deleted; loud warn; → 97 | 1 | ✓ U:818-840 | **In scope.** Keep. |
+| G2 | Garbage numeric config | Falls back to default + stderr note | 1 | ✓ U:695-703, I:554-608 | **In scope.** Keep. |
+| G3 | `run` outside a git repo, no `AGENT_LOCK_PATH` | Refuses (96) | 1 | ✓ U:705-712 | **In scope.** Keep. |
+| G4 | `MAX_WAIT ≤ STALE + CLAIM_STALE` (default MW) | Startup warning | 2 | ✓ U:497-522 | **In scope.** Keep. |
+| H1 | SIGINT/SIGTERM mid-hold | Release + re-raise (143); traps restored | 1 | ✓ U:577-600/1989-2011 | **In scope.** Keep (bash). ps1 = §H. |
+| H2 | EXIT-while-holding | Release + chain caller's EXIT trap | 1 | ✓ U:633-648 | **In scope.** Keep. |
+| H3 | ps1 process death under `-File` | `PowerShell.Exiting` does NOT fire; relies on stale window | 2 | ○ (limit documented) | **Accept;** `run` path is covered. See §H. |
+| I1 | bash⇄pwsh wire/format compatibility | Shared format; token grammar tightened to match | 1 | ✓ I:* throughout | **In scope.** Keep. |
+| I2 | Mixed-VERSION tree (old unserialized steal) | Prevention degrades to detection (98); `.dead.*` litter | 3 | ✗ | **Out of scope:** "upgrade both together." Residual 4. |
+| J1 | Logging subsystem failure | All log writes `|| true`; 1 MB self-truncate | 2 | ○ | **Accept;** logging never blocks the lock. |
+| K1 | Extreme load / CPU oversubscription / slow FS | Correctness holds; wall-clock bounds stretch | 2 | ~ (CI stress) | **Define the envelope.** See §K — the key analytical section. |
+| K2 | Internal time budgets (poll, MAX_WAIT, read ladder) | Fixed schedules; tunable | 2 | ✓/~ | **In scope** as Tier-2 envelope. See §K. |
+
+U = `tests/git-commit-lock.test.sh`, I = `tests/git-commit-lock.interop.test.sh`,
+integ = `tests/git-commit-lock.integration.test.sh`.
+
+---
+
+## 3. Per-mode detail
+
+### A. High contention / thundering herd
+
+**A1 — Clean contention, no crashes.** N processes race to acquire a free or
+held-then-released lock. The acquire loop is one O_EXCL create attempt per poll;
+exactly one creator wins, the rest poll and take turns
+(`git-commit-lock.sh:1312-1361`). After winning, the acquirer re-reads its own
+token (read-back verification, `git-commit-lock.sh:1352-1361`) before claiming
+the hold — so even a create that "won" but whose file was concurrently
+clobbered does not produce a false hold.
+*Tier 1.* Tested heavily: unit Test 1 (8 rounds × 25 workers at FULL,
+`U:166-195`), interop Test 1/Test 6 mixed bash+pwsh (`I:227-261`, the strict
+deterministic counter `I:341-386`), and the integration suite's real-commit
+swarm. **Recommend: in scope, keep.** This is the tool's whole reason to exist.
+
+**A2 — Thundering herd recovering one dead lock.** After a holder dies, *every*
+waiter judges the same lock stale off the same mtime in the same poll window —
+the worst case for displacement. The **claim protocol** is the answer: to steal,
+a waiter must first win an O_EXCL claim file `<lock>.next`, re-verify staleness
+under the claim, then install by one atomic rename-over
+(`git-commit-lock.sh:1070-1218`, the steps narrated at `:82-115`). This
+*prevents* the straggler-robs-recovery-winner race rather than detecting and
+repairing it. *Tier 1.* Tested: unit Test 2b asserts zero spurious 98s, exactly
+one `STOLE-BY-CLAIM` per round, and — via a background sampler — that **no
+move-aside `.dead.*` file ever exists** (`U:212-346`); interop Test 16 proves
+the same across mixed impls (`I:884-1015`). The header records the unserialized
+baseline was probed to displace 5/5 with 4 waiters (`git-commit-lock.sh:233-234`).
+**Recommend: in scope, keep — this is a load-bearing correctness property.**
+
+**A3 — Many concurrent stealers.** Distilled A2: N stealers, one O_EXCL claim
+winner, the rest wait and acquire in sequence. *Tier 1.* Tested: unit Test 20
+(`U:1095-1128`), interop Test 16b (one bash claimant vs one ps1 claimant on one
+ghost, cross-parsing each other's claim files, `I:1017-1088`).
+**Recommend: in scope, keep.**
+
+> **Load caveat on A2/A3 (see §K):** *correctness* is load-independent (it rests
+> on O_EXCL + atomic rename, not timing). What stretches under load is the
+> *latency* to recover, and the *test harness's* ability to set up the race
+> deterministically — Test 2b/16 carry heavy sync scaffolding and bounded
+> discard-and-retry precisely because a fast waiter can complete an entire steal
+> before the harness finishes backdating the ghost (`U:70-104, 285-336`). That
+> is a test-harness envelope concern, not a protocol gap.
+
+### B. Holder death
+
+**B1 — Crash/SIGKILL/power loss mid-hold.** The lease ages out: once the lock
+file's mtime is older than `STALE_SECS`, a waiter steals it. *Recovery is Tier
+1; recovery latency is Tier 2* (bounded by STALE + poll cadence under normal
+load). Tested via the stale-lock and empty-orphan steals (`U:197-210, 348-361`).
+**Recommend: in scope (recovery). Document the latency bound (§K).**
+
+**B2 — Trappable death mid-claim (INT/TERM).** The EXIT/INT/TERM handlers are
+armed at acquire *start*, not at hold, in "claim-window mode"
+(`git-commit-lock.sh:1299-1310, 987-997`). A trappable exit while a claim is in
+flight runs the token-checked claim deletion (one bounded retry) and a final
+discovery read; it never runs lock-release (98) semantics on a *mere claim*.
+*Tier 1.* Tested: unit Test 33 — TERM mid-claim deletes our claim, leaves a
+*foreign* claim intact, no 98, no ageout penalty (`U:1857-1928`); the matching
+ps1 lane is interop Test 16e (`I:1151-1244`). **Recommend: in scope, keep.**
+
+**B3 — Untrappable death mid-claim (SIGKILL between claim and rename).**
+Deliberately **accepted, not prevented** (residual 5,
+`git-commit-lock.sh:266-282`). The orphaned claim normally just ages out at
+CLAIM_STALE; the rare bad case is a suspended rival's rename installing it as an
+*unowned* lock that stalls waiters ≤ STALE before the lease recovers it. Crucial
+property: **no false success anywhere** — nobody believes they hold; the only
+cost is a bounded stall, same class as B1 at far lower probability. The preventing
+alternative (a two-rename compare-and-swap) was evaluated and rejected because it
+reintroduces crash litter (`git-commit-lock.sh:276-282`). *Tier 2.* Tested for
+forensics/recovery via the crashed-leaver leg of Test 31 (`U:1648-1677`).
+**Recommend: accept as a documented bounded residual. Do not build the
+two-rename CAS** — the cure is worse than the disease and the failure is already
+false-success-free.
+
+**B4 — Slow but uncontended holder.** With no waiter, nothing moves the file;
+the token still matches at release; success. *Tier 1.* Tested: unit Test 4c,
+interop Test 9 (`U:419-429`, `I:494-499`). **Recommend: in scope, keep** — this
+is what stops the lock punishing every slow-but-safe hold.
+
+**B5 — Slow CONTENDED holder (the fail-open ceiling).** A hold past STALE *with*
+a contender gets stolen; the robbed holder detects it at release (file gone, or
+a foreign token — both definitive because acquire's read-back proved our token
+was at the path) and returns exactly **98** plus a WARNING
+(`git-commit-lock.sh:1620-1688`). *Tier 1 for detection.* Tested: unit Test 4b,
+interop Test 8 both directions (`U:387-417`, `I:460-492`). **Recommend: in
+scope, keep.** This is the deliberate fail-open-but-detectable contract; the
+mitigation is operational — "commits must be fast" (the golden rule,
+`docs/git-commit-lock.md:433-458`), and raise STALE for a genuinely slow hold.
+
+### C. Orphaned / stale locks and claims
+
+**C1/C2 — Stale or empty lock.** Staleness is judged by the lock file's own
+mtime; a lock older than STALE and *lock-shaped* (empty, or line 1 starts
+`tok.`) is stealable (`git-commit-lock.sh:1408-1446`). The empty case is the
+crash-between-create-and-write orphan and is explicitly stealable. *Tier 1.*
+Tested: Test 2 (stale), Test 3 (empty orphan regression) (`U:197-210, 348-361`).
+**Recommend: in scope, keep.**
+
+**C3 — Crashed-claimant / empty-claim orphan.** A claim older than CLAIM_STALE
+(default 60s; claims are normally held for ms) is cleared by any waiter, which
+re-races the claim create (`git-commit-lock.sh:1228-1267`). A crashed claimant
+therefore delays only *steals*, only by ≤ the claim window; a free lock path is
+never blocked by a claim. *Recovery Tier 1, latency Tier 2.* Tested: Test 21
+(aged foreign claim and empty claim both age out and recovery completes,
+`U:1130-1154`). **Recommend: in scope, keep.**
+
+> **Test 21's `≤20s` latency assertion is Tier 2, not Tier 1.** `U:1144` asserts
+> wall-clock recovery `≤20s` with STALE=1, CLAIM_STALE=2, MAX_WAIT=30. The
+> *protocol* recovers correctly regardless; the 20s number is a generous
+> envelope bound that a sufficiently oversubscribed runner (e.g. 8 CPU hogs on a
+> 2-core box under the stress wrapper) can blow without any protocol defect.
+> This is exactly the kind of bound §K says to treat as a test-harness envelope:
+> if it flakes under extreme artificial load, **relax the test's bound or scope
+> the stress level — do not harden the code.**
+
+**C4 — Leaked claim.** A few exits must leave a claim behind without a verifiable
+unlink (an unreadable claim; an unlink blocked by a foreign handle — exactly
+three feeders, `git-commit-lock.sh:138-157`). These append the attempt token to
+an in-process **leaked-token memory**. While non-empty, every poll (and a pass
+at release/timeout) also reads the lock's line 1: a listed token there means a
+rival's rename installed *our* leaked claim as the lock → adopt the hold, or, at
+release, recognise our real hold was displaced, clean the leaked file
+best-effort, and report 98. The result is structural: **no process inside an
+acquire/hold/release arc can leave an *unowned* lock** (per-attempt tokens make
+the discovery read conclusive). *Tier 1.* Tested extensively: Test 31 (the four
+leaked lanes, including a real Windows no-delete-share feeder), Test 35
+(release-time cleanup of a leak installed over a held hold → 98), Test 36
+(inconclusive-read keeps the entry) (`U:1549-1758, 2013-2164`); ps1 parity in
+interop Test 16e. **Recommend: in scope, keep.** This is the most intricate
+machinery in the tool and the most thoroughly tested.
+
+### D. Filesystem semantics the protocol depends on
+
+These are the **load-bearing FS assumptions**. Where one does not hold, that is a
+real robustness boundary, not a bug to fix.
+
+**D1 — Atomic rename-over.** The steal installs by replacing the lock in one
+`rename(2)` with no path-absent window. bash uses GNU `mv -T` where available,
+probed once, with a guarded `[ -d ]` + bare-`mv` fallback on BSD/macOS
+(`git-commit-lock.sh:954-979`); pwsh 7 uses the 3-arg `File.Move(src,dst,true)`,
+**Windows PowerShell 5.1 has no such overload** and falls back to unlink-then-
+2-arg-Move (`git-commit-lock.ps1:941-982`). `File.Replace` is *deliberately
+never used* (throws on read-only dest; partial-failure states) — pinned by a
+static grep in interop Test 16d (`I:1141-1149`). **Boundary:** atomic-replace
+rename is guaranteed on local POSIX FS and NTFS (probe R1: 400 replaces, zero
+absent reads, `git-commit-lock.sh:380-382`); it is *not* guaranteed on some
+network filesystems (see §E). The 5.1 unlink+move lane has a real absent window,
+making it the one engine where a rival's create can win the recovered path —
+documented as a fairness loss, never a clobber (`docs/git-commit-lock.md:471-476`).
+*Tier 1 on local FS.* **Recommend: in scope on local FS; the network-FS boundary
+is §E.**
+
+**D2 — O_EXCL atomic create.** `set -C` noclobber redirect (bash) /
+`FileMode.CreateNew` with `FileShare.ReadWrite|Delete` (ps1,
+`git-commit-lock.ps1:650-670`). Atomic create-or-fail on local POSIX and NTFS;
+exactly one creator wins. *Tier 1 on local FS.* **Recommend: in scope on local
+FS.** Boundary: O_EXCL is the classic NFS weak spot (§E).
+
+**D3 — Wrong-type object at the lock or claim path.** A directory, symlink, FIFO,
+socket, or device at the path is **never stolen or deleted**. bash has a
+pre-create type guard (`[ -f ] && ! [ -L ]`) plus a per-poll wrong-type
+classifier with two-consecutive-poll confirmation to survive Windows
+delete-pending ghosts (`git-commit-lock.sh:1322-1327, 1518-1570`); the same
+guards apply to the claim path with independent per-path warn-once state
+(`:1458-1487`). The FIFO case is *why the pre-create guard is mandatory*: a
+noclobber `>` onto a FIFO blocks in `open(2)` before any timeout logic — a hang,
+not a warning. *Tier 1 on bash, and on ps1-on-Windows.* Tested: Test 17
+(dir/symlink/FIFO at lock path), Test 22 (claim path), Test 17d (churn must not
+false-warn) (`U:818-892, 1156-1262, 894-1032`).
+
+> **The one real D3 boundary — ps1 on POSIX (Tier 2, accepted).** The .NET API
+> exposes no portable type bit for FIFO/device/socket on Unix; they stat as size
+> 0 and take the **empty-orphan steal lane** (lock path) or empty-claim clear
+> lane (`git-commit-lock.ps1:62-78, 520-525`; `docs/git-commit-lock.md:215-222`).
+> Damage is capped at the one misconfigured inode (consumed by the rename). This
+> is an **unsupported configuration** (ps1 is Windows-only; POSIX runs it solely
+> as cross-impl protocol verification, `README.md:91-95`). **Recommend: accept,
+> as documented.** Closing it would need a `stat(2)` shell-out the port avoids;
+> not worth it for an unsupported config.
+
+**D4 — Non-lock CONTENT at the path.** An age-gated content guard steals only
+empty or `tok.`-prefixed line-1 content; a real user file at a typo'd path
+survives forever (`git-commit-lock.sh:1411-1444`). *Tier 1.* Tested: Test 18
+(user file untouched; sub-prefix torn write `to` never stolen; `tok.`-prefixed
+torn write *is* stolen) (`U:1034-1076`). **Two accepted residuals** make the
+guarantee precise (`git-commit-lock.sh:298-311`): (a) a stale **empty** user
+file is indistinguishable from the crash orphan and *is* stolen; (b) a stale
+user file whose line 1 happens to start `tok.` passes the wire test and *is*
+stolen. Both are deliberate (a fuller shape check buys near-zero protection for a
+harder-bound wire format). **Recommend: in scope, keep, with the two residuals
+documented** (already are).
+
+**D5 — Case-insensitive filesystem.** Not handled explicitly. The lock and claim
+paths differ by the `.next` suffix (`<lock>` vs `<lock>.next`), so they cannot
+collide under case folding, and the token content is case-exact regardless of FS
+case sensitivity. The only theoretical exposure is two *different* configured
+`AGENT_LOCK_PATH` values that differ only in case resolving to one file on
+NTFS/APFS — but that would be a single shared lock, which is *correct* behavior
+(they'd serialize), not a break. *Tier 3 (non-issue).* **Recommend: out of
+scope as a non-issue; no action.** (Cheap to add one sentence to the design doc
+if desired.)
+
+### E. Network / shared filesystems and clocks
+
+**E1 — Network/shared FS (NFS, SMB/CIFS, 9p, Dropbox/OneDrive sync).** The design
+doc states this plainly: the repo must live on a **local FS with atomic
+create/rename and sane mtimes**; "repos on network or sync-backed storage … are
+outside the design's guarantees" (`docs/git-commit-lock.md:122-126`). This is the
+honest boundary, because the protocol's *correctness* rests on D1 (atomic
+rename-over) and D2 (O_EXCL create), and both are exactly the operations network
+filesystems weaken:
+- **NFS:** `O_EXCL` create is famously unreliable on older NFS (the client can't
+  guarantee exclusive create across the network); `rename` atomicity and mtime
+  granularity vary by version/server. On such a mount, **D2 can let two creators
+  both "win"** → two live holders, and the read-back verification
+  (`:1352-1361`) is the only backstop (it would catch *some* but not all
+  interleavings).
+- **SMB/CIFS:** delete/rename semantics and the no-delete-share handle behavior
+  differ from both POSIX and local NTFS; mtime resolution and clock source may be
+  the *server's*, not the client's.
+- **Sync folders (Dropbox/OneDrive):** asynchronous replication means the lock
+  file's existence and content are *not* globally consistent — two machines can
+  both create "the" lock locally before sync reconciles. Fundamentally broken;
+  not a tunable.
+
+*Tier 3 (out of scope, stated).* Untested (CI runs local FS only). **Recommend:
+keep out of scope — but consider making it harder to *fall into* accidentally.**
+The current failure mode on a bad FS is *silent* (the tool runs, exclusion may
+just not hold). Options, in increasing cost: (i) leave as-is, documented — the
+default lock lives in `.git`, which is almost always local, so accidental
+network use is rare; (ii) a one-line caveat in `README.md` (currently only in the
+deeper design doc); (iii) an optional best-effort startup probe of the lock dir's
+FS type with a stderr warning on a known-network type (cheap on Linux via
+`stat -f`, awkward cross-platform, and inherently incomplete). **My
+recommendation: (ii) now** (surface the boundary in the README, where an operator
+actually looks), and treat (iii) as optional polish — do *not* try to *support*
+network FS.
+
+**E2 — Multi-host clock skew / NTP jumps / timezone.** *This is the one place
+the documentation is genuinely thin, and it deserves a deliberate decision.*
+Staleness is mtime-vs-`now` arithmetic (`git-commit-lock.sh:928, 1409`). The
+lock file records `host=<hostname>` (`:519`), which *suggests* cross-host use —
+but the staleness math implicitly assumes **the mtime and the comparing
+process's clock come from the same time source.** Reasoning from first
+principles about what can go wrong:
+- On a **single host** (the actual supported case — all contenders share one
+  checkout, hence one machine), mtime and `now` are the same clock; skew is a
+  non-issue, and the **mtime floor** (946684800 / 2000-01-01,
+  `git-commit-lock.sh:925`) already absorbs the only real local clock glitch:
+  the Windows FILETIME-zero (1601) transient on fresh files
+  (`docs/git-commit-lock.md:283-293`, probed at 0.04–0.5% of readings).
+- A **backward NTP step / large clock correction** on the one host could make a
+  live lock look stale (premature steal) or a stale lock look fresh (delayed
+  recovery). The first is the dangerous one — but it degrades into the *already
+  handled* B5 lane: a premature steal of a still-live hold is detected at release
+  as 98, never a silent double-commit. So even a local clock jump is
+  **correctness-safe, liveness-degraded** — Tier 2.
+- **Cross-host** use over a shared FS (already E1-out-of-scope) is where skew
+  would actually bite: host A's mtime compared against host B's `now` with
+  minutes of skew could steal live locks wholesale. But this only arises *on a
+  network FS*, which is already excluded.
+- **Timezone** is a non-factor: all arithmetic is in epoch seconds
+  (`git-commit-lock.sh:439-449`, `git-commit-lock.ps1:448-451`), never local
+  time.
+
+*Tier 3 for cross-host (rides on E1); Tier 2 for a local NTP jump.* Untested.
+**Recommend:** (a) **document explicitly** that the tool assumes a single time
+source — i.e. single-host use (the common case) or a shared FS with a single
+server clock — and that this is *why* network/multi-host is out of scope; the
+current docs imply it but never say "one clock." (b) Note the reassuring part: a
+*local* clock jump is correctness-safe (degrades to the detected-98 lane), so no
+code change is warranted. This is a **doc gap, not a code gap.**
+
+### F. Resource exhaustion
+
+**F1 — Disk full (ENOSPC) during a claim/lock create or write.** The create is
+one open+write+close in a subshell; if the write fails (ENOSPC), the subshell
+fails and the acquirer falls through to wait (`git-commit-lock.sh:1336-1361`,
+comment at `:1341-1343`). A created-but-write-failed file is an empty orphan that
+ages into the steal lane. A torn write *shorter than `tok.`* (e.g. `to`) is the
+accepted residual at `:299-304`: non-empty, non-prefixed → never stolen, loud,
+fixed by one manual `rm`. *Tier 2 (degrades to wait/97) / Tier 3 (the torn-write
+manual-fix residual).* Reasoned from code, **not tested** (no ENOSPC fault
+injection). **Recommend: accept and document.** ENOSPC is a host-health failure;
+the tool degrades safely (no corruption, no false hold) and the one sharp edge
+(sub-`tok.` torn write needing manual `rm`) is already documented. Not worth
+fault-injection tests.
+
+**F2 — ENOSPC during a LOG write.** All log writes end in `|| true`
+(`git-commit-lock.sh:561`); a failed log write is silently lost. *Tier 2.*
+**Recommend: accept** — logging is best-effort by explicit design (it must never
+block or fail the lock). The only downside is reduced post-mortem signal under
+disk pressure, which is acceptable.
+
+**F3 — Inode / FD exhaustion.** Same shape as F1: a create that can't get an
+inode fails → wait → eventually 97. The tool holds at most a couple of FDs
+briefly. *Tier 2.* Untested. **Recommend: accept, document as host-health.**
+
+**F4 — Read-only / unwritable lock dir or parent.** `lock_acquire` does a
+best-effort `mkdir -p "$(dirname …)"` (`git-commit-lock.sh:1278`); if the dir is
+unwritable the create fails every poll and the waiter times out at 97. No
+corruption, no false hold. A *release* unlink blocked by an unwritable parent
+routes to the LEFTOVER lane (`:1699-1711`). *Tier 2.* Untested directly.
+**Recommend: accept, document.** A correct, if blunt, outcome (97); arguably an
+*earlier, clearer* error would be nicer — optional polish, low priority.
+
+**F5 — Memory exhaustion.** The scripts allocate trivially (a few shell vars; the
+leaked-token list is "almost always empty"). Not a meaningful failure surface.
+*Tier 3 / non-issue.* **Recommend: no action.**
+
+### G. Misconfiguration
+
+**G1 — Lock path is a directory / `$HOME` / a real file.** Covered by D3/D4:
+never stolen or deleted, loud one-time warning, waiters reach 97
+(`U:818-840`). *Tier 1.* The security note (`docs/git-commit-lock.md:530-541`)
+bounds the worst case even for a *hostile* repo redirecting the git dir: the tool
+only ever creates its own small set of files at its own names and never deletes
+recursively. **Recommend: in scope, keep.**
+
+**G2 — Garbage numeric config.** Each knob is validated at source time; invalid
+values fall back to default with a stderr note (`git-commit-lock.sh:481-500`).
+The ps1 port *tightens* .NET's permissive parser to bash's grammar so the same
+env var configures the same value on both impls — e.g. rejecting `"1e3"`,
+trailing newlines, whitespace (`git-commit-lock.ps1:327-359`). *Tier 1.* Tested:
+unit Test 13, interop Test 12 (cross-impl parity, including `1e3`/`+2`/`'   '`/
+trailing-newline) (`U:695-703`, `I:554-608`). **Recommend: in scope, keep.**
+
+**G3 — `run` outside a git repo, no `AGENT_LOCK_PATH`.** Refused with 96 — a
+CWD-scoped lock would serialize against nobody (`git-commit-lock.sh:1768-1773`).
+Sourcing keeps a CWD fallback with a stderr warning and creates no files
+(`:570-572`; unit Test 14/14b). *Tier 1.* **Recommend: in scope, keep.**
+
+**G4 — `MAX_WAIT ≤ STALE + CLAIM_STALE`.** A startup warning, gated on MAX_WAIT
+being left at its default (a caller who set it chose the relationship). The
+relation is the stacked worst-case recovery: a crashed holder *plus* a crashed
+claimant (`git-commit-lock.sh:502-514`). *Tier 2 (advisory).* Tested: Test 8
+exercises the gate and the stacking (`U:497-522`). **Recommend: in scope,
+keep.**
+
+### H. Signals, interrupts, cleanup-on-exit
+
+**H1/H2 — bash INT/TERM/EXIT.** Handlers armed at acquire start; on a held lock
+they release and re-raise the signal (wrapper dies 143, what a watchdog needs);
+they restore the caller's pre-acquire traps exactly (`git-commit-lock.sh:1037-1054,
+1002-1023, 780-784`). *Tier 1.* Tested: Test 11 (TERM mid-hold → 143,
+released), Test 12c (exit-while-holding chains the caller's EXIT trap), Test 12d/e
+(trap restoration), Test 34 (TERM on a *steal*-acquired hold behaves identically
+— all acquisition paths funnel through one hold helper) (`U:577-600, 633-693,
+1989-2011`). One documented caveat: a SIGINT delivered to the `run` wrapper alone
+while its foreground child survives is discarded by bash before any trap
+(`git-commit-lock.sh:1030-1036`) — a real Ctrl+C hits the whole group and does
+take the path. **Recommend: in scope, keep.**
+
+**H3 — ps1 process death.** PowerShell has no `trap SIGTERM`. The port substitutes
+(a) `try/finally` inside `Lock-Acquire`, which runs on Ctrl+C/pipeline-stop/
+terminating errors and does the claim-window cleanup + discovery read
+(`git-commit-lock.ps1:1378, 1672-1683, 1240-1295`); and (b) a `PowerShell.Exiting`
+engine-event backstop for a *held* lock (`:704, 1303-1324`). **Documented limit:**
+`PowerShell.Exiting` fires under `-Command` and interactively but **NOT under
+`-File`**, and not on hard kill / `[Environment]::Exit()`
+(`git-commit-lock.ps1:241-245, 1298-1302`). So a held lock abandoned by a
+forgetful dot-source `-File` caller relies on the stale window, not the backstop.
+The **`run` contract path is unaffected** — it pairs Acquire/Release in
+try/finally (`:1928-1979`). *Tier 2 (for the dot-source `-File` gap).* The happy
+path and trap-time claim cleanup are tested (interop Test 16e); the `-File`
+non-firing is documented, not test-pinned. **Recommend: accept the `-File`
+backstop gap as documented** — the stale window recovers it, and the supported
+`run`/try-finally paths are covered. If you want to close it, the documented
+option is handle-based ops (`git-commit-lock.ps1:146-151`), a larger change not
+worth it for a forgetful-caller edge.
+
+### I. Cross-implementation
+
+**I1 — Wire/format compatibility.** One on-disk format (token line 1, owner line
+2, `tok.` prefix as wire contract), one read-retry schedule (8 attempts,
+20/40/80/160/320/320/320 ms — verified byte-identical between
+`git-commit-lock.sh:670` and `git-commit-lock.ps1:597-629`), one set of release
+verdicts, one config grammar. *Tier 1.* The interop suite is built to break this:
+mixed bash+pwsh exclusion (T1/T6), each side steals the other's genuine stale
+lock (T4/T5), robbed-holder 98 both directions (T8), release-classification
+agreement (T11), cross-impl claim staleness clearing (T16c), and a Windows
+PowerShell 5.1 smoke lane (T17). **Recommend: in scope, keep — and keep the
+interop suite as the guard.** Two independent implementations hammering one lock
+is the cheap adversarial verification (`README.md:92-95`).
+
+**I2 — Mixed-version tree.** Prevention (the claim protocol) holds only when
+*all* parties run it; older releases stole with an unserialized move-aside, so a
+mixed tree degrades prevention to detection (98) and can leave `.dead.*` litter
+current versions don't clean (residual 4, `git-commit-lock.sh:261-265`). *Tier
+3.* Untested (would require shipping an old version into the suite). **Recommend:
+out of scope; keep the "upgrade both implementations together" deployment note**
+(it's in `README.md` and the design doc). Acceptable because the degraded mode is
+still *detected* (98), never silent.
+
+### J. Logging subsystem failure
+
+**J1.** Every log write is `|| true`; the log self-truncates past ~1 MB rather
+than rotating (`git-commit-lock.sh:554-562`). A broken log never blocks or fails
+the lock. Under a redirected git dir, log *content* (the owner line) is
+attacker-influenceable — one-line text spoofing, no execution; the tool itself
+writes only its token, owner line, and protocol events, never secrets
+(`docs/git-commit-lock.md:543-551`). *Tier 2.* **Recommend: accept** — logging
+is best-effort by design, which is the right call for a lock that must keep
+working when the disk is full or the log path is bad. The only follow-on: don't
+build automation that *trusts* log text from an untrusted repo (already
+documented).
+
+### K. Behavior under extreme load / scheduling pressure, and internal time budgets
+
+**This is the most important analytical section** — it separates "must hold under
+any load" from "holds within an envelope," and tells the owner which apparent
+flakes are real gaps vs harness concerns.
+
+**The clean split: correctness is load-independent; liveness/latency is not.**
+
+- **Load-independent (Tier 1, must always hold):** mutual exclusion, no silent
+  lost update, no corruption, eventual recovery. These rest on O_EXCL create +
+  atomic rename + per-attempt-token discovery — *structural* properties that do
+  not reference the clock for their *correctness*. The mtime floor
+  (`:925`) and the read-retry ladder (`:668-684`) exist precisely so that the
+  one timing-sensitive input (mtime, and transient empty reads) cannot corrupt a
+  correctness decision: a sub-floor or unsettled reading is treated as "wait,"
+  never "steal." A 25-worker round can go 3s → 41s under load
+  (`agents/600-claude.md` observation) and *still* lose no update.
+
+- **Load-dependent (Tier 2, best-effort in an envelope):** every wall-clock bound.
+  - **Recovery latency** ≈ STALE (+ CLAIM_STALE if a claimant also crashed) +
+    poll cadence. Under CPU oversubscription or a slow FS, polls stretch, so
+    recovery takes longer — but still completes.
+  - **`MAX_WAIT` timeout (97):** a waiter on a genuinely squatted/blocked lock
+    gives up at MAX_WAIT. Under load the *real* time to MAX_WAIT stretches with
+    poll cadence; the guarantee is "bounded by MAX_WAIT polls," not "exactly
+    MAX_WAIT seconds." Interop Test 14b explicitly checks that a blocked steal
+    **never busy-spins past MAX_WAIT** and logs in a damped, bounded way
+    (`I:746-817`) — a real correctness-adjacent property (no busy-spin), with a
+    timing-dependent upper bound on the STALE-line count (`[1,8]`).
+  - **The read-retry ladder (~1.26s budget):** sized to ride out a sub-second
+    transient (AV scanner handle, probe-F create→write gap). Under pathological
+    load a transient *longer* than ~1.26s would surface as the unverifiable-2 /
+    run-1 verdict (a detected, non-corrupting outcome), not a wrong hold. Test
+    16c pins that a 0.4s transient is ridden out (`U:784-817`).
+
+**Internal time budgets, enumerated** (all tunable via `AGENT_LOCK_*`):
+
+| Budget | Default | Role | Load sensitivity |
+|---|---|---|---|
+| `STALE_SECS` | 300s | steal threshold (the lease length) | the fail-open ceiling; raise for slow holds |
+| `CLAIM_STALE_SECS` | 60s | crashed-claimant ageout | delays only steals |
+| `POLL_SECS` | 2s | poll interval | cadence stretches under load |
+| `MAX_WAIT` | 420s | total wait cap → 97 | real wall-clock stretches with cadence |
+| read-retry ladder | ~1.26s | ride out transient empty reads | a longer transient → detected-2, not wrong hold |
+| mtime floor | 2000-01-01 | reject FILETIME-zero | static, not load-sensitive |
+
+**Judgments on the load-sensitive behaviors — gap, degradation, or harness
+concern:**
+
+1. **Protocol correctness under load — (c) non-issue / already guaranteed.**
+   The stress branch wraps every suite in artificial CPU+disk load
+   (`tests/with-load.sh`) specifically to widen timing windows and surface
+   *latency/race flakes*, and the protocol assertions (exclusion, one-steal,
+   zero-98) are written to hold regardless. **Recommend: nothing to harden.**
+
+2. **Wall-clock test *bounds* under extreme load — (b) acceptable degradation;
+   fix the TEST, not the code.** Three examples surfaced by the prior stress
+   effort (each verified independently against the code, not taken on trust):
+   - *Test 21's `≤20s` recovery-latency assertion* (`U:1144`),
+   - *Test 22(a)'s claim-warning timing* (which needs ≥2 blocked polls before
+     MAX_WAIT to fire the two-consecutive-poll-confirmed warning, `U:1162-1168`),
+   - and *Test 29's `≥2 CLAIM lines` discriminator* (explicitly given `MAX_WAIT=6`
+     headroom, `U:1514-1518`).
+
+   Each asserts a wall-clock or poll-count bound that an oversubscribed runner
+   (e.g. 8 hogs on 2 cores) can blow *without any protocol defect* — the
+   protocol still recovers/warns correctly, just slower. **Recommend: where these
+   flake only under extreme artificial load, relax the bound or scope the stress
+   level for that test; do NOT change product code.** The correctness assertions
+   in the same tests must stay strict.
+
+3. **Test-*harness* race setup under load — (c) harness concern, already
+   mitigated.** Tests 2b/16/16b carry heavy sync scaffolding (`sync_waiting_fresh`,
+   token-guarded `backdate_ghost`, bounded discard-and-retry, `U:70-151`) because
+   a fast waiter can complete an entire steal before the harness finishes setting
+   up the race. This is purely about *constructing* the scenario deterministically;
+   the protocol is fine. **Recommend: keep the scaffolding; it is the right fix.**
+
+4. **No-busy-spin under a permanently blocked lock — (a) a real property, and
+   it's guarded.** A failed-steal lane that `continue`d past the timeout+sleep
+   would busy-spin and never reach 97 — a genuine bug class. Interop Test 14b is
+   the regression guard (`I:746-817`). **Recommend: keep that test; treat any
+   regression here as Tier 1.**
+
+**Net K recommendation:** adopt the explicit envelope — *"correctness holds under
+any load; wall-clock recovery/timeout latency scales with poll cadence and
+scheduling, bounded by the configured knobs."* Put that sentence in the design
+doc. Then audit the suite's wall-clock assertions and **scope each to the load
+level it's meant to run at** (the stress branch's extreme `both/8-hog` mode is a
+flake-hunting tool, not a contract the product must meet on a 2-core runner).
+This is the cleanest way to stop chasing "flakes" that are really the test
+asserting a Tier-1 bound on a Tier-2 quantity.
+
+---
+
+## 4. Open questions / recommended scope decisions
+
+Ordered by how much they need an explicit owner decision.
+
+1. **Define and document the load/timing envelope (§K) — highest value.**
+   *Recommendation:* state in `docs/git-commit-lock.md` that correctness
+   (exclusion, no silent loss, eventual recovery) is load-independent, while all
+   wall-clock bounds (recovery latency, MAX_WAIT, the read ladder) are
+   best-effort and scale with scheduling. Then **scope the suite's wall-clock
+   assertions to a defined load level** so extreme-stress flakes (Test 21's 20s,
+   Test 22a's warning timing, Test 29's poll count) are recognised as Tier-2
+   envelope misses, not product regressions. *This resolves the recurring
+   "flake" question structurally.* Cost: doc + a test-bound audit; no product
+   change.
+
+2. **Multi-host / clock-skew assumption is under-documented (§E2) — doc gap, not
+   code gap.** The tool implicitly assumes a single time source; a *local* NTP
+   jump is correctness-safe (degrades to the detected-98 lane), and cross-host
+   skew only bites on a network FS that's already out of scope. *Recommendation:*
+   add one explicit sentence — "assumes a single clock, i.e. single-host (the
+   common case) or a shared FS with one server clock" — and the reassurance that
+   a local clock jump cannot cause a silent double-commit. No code change.
+
+3. **Network/shared FS is out of scope but fails *silently* if entered (§E1).**
+   The boundary is correctly stated in the design doc but only there.
+   *Recommendation:* surface it in `README.md` (where operators look), since the
+   failure on a bad FS is silent loss of exclusion. Do **not** attempt to
+   *support* network FS. An optional best-effort FS-type startup probe is
+   possible but cross-platform-awkward and incomplete — treat as low-priority
+   polish, not a requirement.
+
+4. **ps1-on-POSIX FIFO/device residual (§D3) and ps1 `-File` exit backstop gap
+   (§H3) — accept as documented.** Both are real but confined to an unsupported
+   config (ps1-on-POSIX) or a forgetful-caller edge that the stale window
+   recovers. *Recommendation:* no code change; confirm they stay documented.
+   Reconsider only if PowerShell-on-POSIX ever becomes supported (it isn't,
+   `README.md:91-95`).
+
+5. **Untested-but-robust-by-code lanes (resource exhaustion F1/F3/F4, log-write
+   failure F2/J1).** These degrade safely (wait/97, or silent best-effort log
+   loss) but have **no fault-injection tests** — they are reasoned-correct, not
+   verified. *Recommendation:* accept without adding ENOSPC/EMFILE injection
+   tests (low ROI; the degradation is structurally safe). If the owner wants one
+   belt-and-braces test, the highest-value single one is an **unwritable lock dir
+   → clean 97** (cheap to write deterministically; F4), since that's the most
+   likely real-world misconfiguration of the set.
+
+6. **Mixed-version tree (§I2) and case-insensitive FS (§D5) — out of scope,
+   confirm.** The first degrades to detection (98), never silent, and is covered
+   by the "upgrade both together" note. The second is a non-issue. *Recommendation:*
+   leave both out of scope; optionally one sentence each in the design doc.
+
+### Things explicitly NOT to do (the design already considered and rejected them)
+
+- **A background heartbeat** to refresh the lease — would make the tool more than
+  a single synchronous script; the fail-open-but-detectable lease is the
+  deliberate alternative (`git-commit-lock.sh:217-218`).
+- **A two-rename compare-and-swap** to prevent residual 5 (B3) — reintroduces
+  crash litter + a sweep, for a failure that is already bounded and
+  false-success-free (`git-commit-lock.sh:276-282`).
+- **`File.Replace` in the ps1 port** — pinned out by interop Test 16d for good
+  reasons (read-only-dest throw, partial-failure states).
+- **Trying to support network/shared filesystems** — the protocol's correctness
+  rests on local-FS atomic create/rename; this is a boundary to *document*, not
+  to engineer around.

From 402dc1e538ac7ca48fcbe7297050470d8069530a Mon Sep 17 00:00:00 2001
From: Ben Toner <bentoner@bentoner.com>
Date: Wed, 17 Jun 2026 12:33:38 +1000
Subject: [PATCH 15/76] =?UTF-8?q?docs(failure-modes):=20review=20round=201?=
 =?UTF-8?q?=20=E2=80=94=20sharpen=20core=20guarantee;=20fix=20clock=20dir;?=
 =?UTF-8?q?=20add=20E3/H4?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Address findings from a foreign-model (Codex) + fresh-Claude review of the
failure-modes map, each verified against the code:

- Core-guarantee precision (the doc's central thesis): the unconditional safety
  property is "no silent lost update, given cooperative wrapper unwind", NOT
  unconditional mutual exclusion. Strict mutual exclusion holds only within the
  staleness window; beyond it the lease is fail-open-but-detectable. Split Tier 1
  into safety (unconditional) vs recovery (lock-shaped orphans only, under a
  readable-clock / supported-FS envelope; foreign objects at the path are
  deliberately never auto-removed).
- Add H4: hard kill (SIGKILL) or a wrapped command's [Environment]::Exit() while
  holding bypasses release-time detection -> the explicit boundary of the
  no-silent-loss guarantee.
- Add E3: mtime probe entirely unreadable -> staleness detection disabled; fails
  SAFE (never steals a lock whose age it cannot establish), recovery lost, loudly
  announced once per process (both ports).
- Fix E2 clock-jump direction (age = now - mtime: a FORWARD jump makes a live lock
  look stale -> premature steal -> detected-98; a BACKWARD jump delays recovery).
- D1: separate the atomic-overwrite engines (mv -T / 3-arg File.Move) from the
  non-atomic Windows PowerShell 5.1 unlink-then-Move fallback (claim-guarded;
  fairness loss, never a clobber).
- Note the leaked-token memory is process-local (ties the "no unowned lock"
  framing to residual 5); correct the README-location claim for the mixed-version
  note (it is in the design doc only); minor citation fixes (README quote/line,
  Test 22a over-attribution).

Reviewers confirmed the central thesis (correctness load-independent; only latency
degrades) holds against every interleaving attacked on a local FS.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 docs/failure-modes.md | 203 +++++++++++++++++++++++++++++++-----------
 1 file changed, 152 insertions(+), 51 deletions(-)

diff --git a/docs/failure-modes.md b/docs/failure-modes.md
index 199e9da..d078453 100644
--- a/docs/failure-modes.md
+++ b/docs/failure-modes.md
@@ -27,24 +27,41 @@ flags the boundaries the headers state but a reader might skip.
 
 ## 1. The core guarantee (what must hold under ANY conditions)
 
-**Mutual exclusion + detectable failure.** At most one process at a time
-believes it holds the lock *and* is right about it. The lock cannot be silently
-lost: a holder whose lease was taken from it learns so — `lock_release` returns
-**98** and logs a loud WARNING — rather than reporting a serialized commit that
-wasn't (`git-commit-lock.sh:1607-1688`; `git-commit-lock.ps1:1700-1845`). The
-two reserved failure codes mean the wrapped command was provably *not* run
-(96 usage, 97 timeout) or provably *not serialized* (98)
-(`git-commit-lock.sh:392-415`). There is no fourth outcome in which two
-processes both believe they hold an exclusive lock and both are wrong.
-
-This is a **lease, not a kernel lock** (`docs/git-commit-lock.md:60-126`
-explains why no OS primitive spans bash-on-MINGW and PowerShell/.NET). The
-deliberate consequence: a hold longer than the staleness window (default 300s)
-*can* be stolen mid-work — "fail-open." That is accepted by design and made
-*detectable* (the 98 path), not prevented (`git-commit-lock.sh:213-227`). So the
-core guarantee is precisely: **no silent lost update.** Liveness (eventual
-recovery from any crash) and bounded stalls are best-effort within an operating
-envelope (Tier 2), not absolute.
+**No silent lost update — given cooperative wrapper unwind.** The absolute safety
+property is that the tool never reports a *serialized* critical section that
+wasn't: a holder whose lease was taken from it learns so — `lock_release` returns
+**98** and logs a loud WARNING — rather than exiting success
+(`git-commit-lock.sh:1607-1688`; `git-commit-lock.ps1:1717-1837`). The two
+reserved failure codes mean the wrapped command was provably *not* run (96 usage,
+97 timeout) or provably *not serialized* (98) (`git-commit-lock.sh:392-415`).
+
+Two honest qualifications make this a precise property rather than a slogan, and
+both matter for the scope decision:
+
+- **It is a lease, not a kernel lock** (`docs/git-commit-lock.md:60-126` explains
+  why no OS primitive spans bash-on-MINGW and PowerShell/.NET). **Strict mutual
+  exclusion holds only *within* the staleness window** (default 300s): a hold that
+  overruns it *can* be stolen mid-work — "fail-open" — so two processes can
+  briefly *both* believe they hold the lock. That overlap is accepted by design
+  and made *detectable* (the displaced holder's 98 at release), not prevented
+  (`git-commit-lock.sh:213-227`). At most one process is ever the *legitimate*
+  holder; a displaced believer finds out at release. So "mutual exclusion" is a
+  Tier-1 guarantee **within the envelope (commits faster than STALE)**, not an
+  unconditional one.
+- **Detection requires the wrapper to actually reach release.** The 98 path fires
+  on normal return and on trapped signals. It does **not** fire if the held
+  process is *hard-killed* (SIGKILL) or if the wrapped command terminates the
+  process abruptly — notably PowerShell `[Environment]::Exit()`, which bypasses
+  both `Lock-Release` and the `PowerShell.Exiting` backstop
+  (`git-commit-lock.ps1:221-245`). Such an abrupt exit can report success without
+  the 98 (see **§H4**). The *next* holder still recovers via staleness, but the
+  abruptly-exiting one is not warned. Hence the precise statement: **no silent
+  lost update, provided the wrapper unwinds cooperatively.**
+
+Liveness (eventual recovery) and bounded stalls are best-effort within an
+operating envelope (Tier 2), not absolute — and "recovery" means lock-shaped
+orphans get reclaimed, **not** that every bad state self-heals (a foreign object
+at the path is deliberately never auto-removed; see the tier split).
 
 The integration suite is the end-to-end witness for this guarantee on the real
 use case: many workers committing into one repo, audited for "every commit
@@ -54,8 +71,22 @@ clean tree" (`tests/git-commit-lock.integration.test.sh:10-12, 226-283`).
 ### The three tiers used throughout
 
 1. **Correctness guarantee** — must hold under *any* conditions (load, slow FS,
-   adversarial scheduling): mutual exclusion, no corruption, no silent loss,
-   eventual recovery. If one of these can break, it is a bug.
+   adversarial scheduling). Two kinds, and the distinction matters:
+   - **Safety (unconditional):** no corruption, and **no silent lost update** —
+     the displaced holder detects the loss (98) *provided its wrapper reaches
+     release* (§1's hard-kill/`Exit()` caveat). Strict **mutual exclusion holds
+     within the staleness window**; beyond it the lease is
+     fail-open-but-detectable.
+   - **Recovery (for lock-shaped stale state, under the supported FS/clock/tooling
+     envelope):** a crashed holder's stale lock, an orphaned claim, and an empty
+     crash-orphan are eventually reclaimed. This does **not** extend to *foreign*
+     objects at the path — a directory, a real user file, or non-`tok.` junk
+     content are deliberately *never* auto-removed; they wait at 97 for an
+     operator. "Eventual recovery" means lock-shaped orphans self-clear, not that
+     every bad state self-heals.
+   If a *safety* property can break, it is a bug; a *recovery* property failing
+   outside its envelope (e.g. a foreign object, an unreadable clock) is a
+   classified Tier-2/3 degradation, not a Tier-1 violation.
 2. **Best-effort within a stated envelope** — holds under normal/expected
    conditions, degrades gracefully (and *detectably*) under pathological ones.
    Everything wall-clock-bounded lives here, because wall-clock bounds depend on
@@ -93,6 +124,7 @@ robust-by-code-but-unverified · S static/grep check · (plat) platform-gated.
 | D5 | Case-insensitive FS path collision | Not handled explicitly | 3 | ✗ | **Likely non-issue;** see §D5. Decide. |
 | E1 | Network/shared FS (NFS/SMB/9p/Dropbox) | Outside design guarantees (stated) | 3 | ✗ | **Out of scope** (stated). See §E — decide whether to *enforce*. |
 | E2 | Multi-host clock skew / NTP jump | Implicitly single-clock; **not** addressed in docs | 3 (and a doc gap) | ✗ | **Out of scope** but UNDER-documented. See §E2. |
+| E3 | mtime probe unreadable (staleness clock broken) | Warns loudly once; treats as not-stale → safe, recovery disabled → 97 | 2 | ○ | **Accept** — fails safe + announced. See §E3. |
 | F1 | Disk full (ENOSPC) during create/write | Create fails → wait; torn write ages out | 2/3 | ○ (reasoned, not tested) | **Accept**, document. See §F1. |
 | F2 | ENOSPC during LOG write | Swallowed (`|| true`); silent log loss | 2 | ○ | **Accept;** logging is best-effort by design. |
 | F3 | Inode / FD exhaustion | Create fails → wait → 97 | 2 | ○ | **Accept**, document. |
@@ -104,6 +136,7 @@ robust-by-code-but-unverified · S static/grep check · (plat) platform-gated.
 | H1 | SIGINT/SIGTERM mid-hold | Release + re-raise (143); traps restored | 1 | ✓ U:577-600/1989-2011 | **In scope.** Keep (bash). ps1 = §H. |
 | H2 | EXIT-while-holding | Release + chain caller's EXIT trap | 1 | ✓ U:633-648 | **In scope.** Keep. |
 | H3 | ps1 process death under `-File` | `PowerShell.Exiting` does NOT fire; relies on stale window | 2 | ○ (limit documented) | **Accept;** `run` path is covered. See §H. |
+| H4 | Hard kill / `[Environment]::Exit()` while held | Bypasses release → a displaced holder is unwarned (no 98) | 2 | ~ (I:308-334 indirect) | **Document** the no-silent-loss boundary. See §H4. |
 | I1 | bash⇄pwsh wire/format compatibility | Shared format; token grammar tightened to match | 1 | ✓ I:* throughout | **In scope.** Keep. |
 | I2 | Mixed-VERSION tree (old unserialized steal) | Prevention degrades to detection (98); `.dead.*` litter | 3 | ✗ | **Out of scope:** "upgrade both together." Residual 4. |
 | J1 | Logging subsystem failure | All log writes `|| true`; 1 MB self-truncate | 2 | ○ | **Accept;** logging never blocks the lock. |
@@ -240,7 +273,14 @@ rival's rename installed *our* leaked claim as the lock → adopt the hold, or,
 release, recognise our real hold was displaced, clean the leaked file
 best-effort, and report 98. The result is structural: **no process inside an
 acquire/hold/release arc can leave an *unowned* lock** (per-attempt tokens make
-the discovery read conclusive). *Tier 1.* Tested extensively: Test 31 (the four
+the discovery read conclusive). One scope nuance worth stating, because the
+memory is **process-local**: only the leaking process can *adopt* its own
+installed claim. If that process exits the arc first — times out (97), releases
+cleanly, or dies — *before* adopting, the installed claim becomes an unowned lock
+recovered by the ordinary staleness lane, never adopted by another process (this
+is exactly residual 5 / §B3). Per-attempt-token uniqueness still guarantees that
+lock can never be *mistaken* for owned by anyone, so there is **no false
+success** — the only cost is a bounded stall. *Tier 1.* Tested extensively: Test 31 (the four
 leaked lanes, including a real Windows no-delete-share feeder), Test 35
 (release-time cleanup of a leak installed over a held hold → 98), Test 36
 (inconclusive-read keeps the entry) (`U:1549-1758, 2013-2164`); ps1 parity in
@@ -252,21 +292,27 @@ machinery in the tool and the most thoroughly tested.
 These are the **load-bearing FS assumptions**. Where one does not hold, that is a
 real robustness boundary, not a bug to fix.
 
-**D1 — Atomic rename-over.** The steal installs by replacing the lock in one
-`rename(2)` with no path-absent window. bash uses GNU `mv -T` where available,
-probed once, with a guarded `[ -d ]` + bare-`mv` fallback on BSD/macOS
-(`git-commit-lock.sh:954-979`); pwsh 7 uses the 3-arg `File.Move(src,dst,true)`,
-**Windows PowerShell 5.1 has no such overload** and falls back to unlink-then-
-2-arg-Move (`git-commit-lock.ps1:941-982`). `File.Replace` is *deliberately
-never used* (throws on read-only dest; partial-failure states) — pinned by a
-static grep in interop Test 16d (`I:1141-1149`). **Boundary:** atomic-replace
-rename is guaranteed on local POSIX FS and NTFS (probe R1: 400 replaces, zero
-absent reads, `git-commit-lock.sh:380-382`); it is *not* guaranteed on some
-network filesystems (see §E). The 5.1 unlink+move lane has a real absent window,
-making it the one engine where a rival's create can win the recovered path —
-documented as a fairness loss, never a clobber (`docs/git-commit-lock.md:471-476`).
-*Tier 1 on local FS.* **Recommend: in scope on local FS; the network-FS boundary
-is §E.**
+**D1 — Steal install: atomic overwrite vs. the 5.1 fallback.** The steal installs
+its lock at the path by replacing whatever is there. There are two engine classes
+and they differ in atomicity — so this row is *not* uniformly "atomic rename":
+- **Atomic overwrite (the guaranteed lane):** one `rename(2)`-class replace with
+  no path-absent window. bash uses GNU `mv -T` where available, probed once, with
+  a guarded `[ -d ]` + bare-`mv` fallback on BSD/macOS
+  (`git-commit-lock.sh:954-979`); pwsh 7 uses the 3-arg `File.Move(src,dst,true)`
+  (`git-commit-lock.ps1:941-982`). Atomic replace is guaranteed on local POSIX FS
+  and NTFS (probe R1: 400 replaces, zero absent reads,
+  `git-commit-lock.sh:380-382`); *not* guaranteed on some network FS (§E).
+- **Windows PowerShell 5.1 fallback (NOT atomic, but claim-guarded):** 5.1 has no
+  3-arg overload, so it unlinks then does a 2-arg `Move` (`git-commit-lock.ps1:941-982`).
+  This lane has a real path-absent window in which a rival's *create* can win the
+  recovered path — a **fairness loss, never a clobber** (claim serialization still
+  admits one stealer; the loser re-polls), documented at
+  `docs/git-commit-lock.md:471-476`.
+`File.Replace` is *deliberately never used* (throws on read-only dest;
+partial-failure states) — pinned by a static grep in interop Test 16d
+(`I:1141-1149`). *The atomic lane is Tier 1 on local FS; the 5.1 fallback is Tier
+1 for safety (no clobber) but gives up rename atomicity (fairness only).*
+**Recommend: in scope on local FS; the network-FS boundary is §E.**
 
 **D2 — O_EXCL atomic create.** `set -C` noclobber redirect (bash) /
 `FileMode.CreateNew` with `FileShare.ReadWrite|Delete` (ps1,
@@ -367,12 +413,15 @@ principles about what can go wrong:
   `git-commit-lock.sh:925`) already absorbs the only real local clock glitch:
   the Windows FILETIME-zero (1601) transient on fresh files
   (`docs/git-commit-lock.md:283-293`, probed at 0.04–0.5% of readings).
-- A **backward NTP step / large clock correction** on the one host could make a
-  live lock look stale (premature steal) or a stale lock look fresh (delayed
-  recovery). The first is the dangerous one — but it degrades into the *already
-  handled* B5 lane: a premature steal of a still-live hold is detected at release
-  as 98, never a silent double-commit. So even a local clock jump is
-  **correctness-safe, liveness-degraded** — Tier 2.
+- A **large local clock correction** on the one host splits by sign, because
+  staleness is `age = now - mtime` (`git-commit-lock.sh:928, 1409`): a **forward**
+  jump (now leaps ahead) inflates the computed age, so a *live* lock can look
+  stale → premature steal; a **backward** jump (NTP steps back) shrinks the age,
+  so a genuinely *stale* lock can look fresh → delayed recovery. The
+  forward/premature-steal case is the only worrying one — and it degrades into the
+  *already handled* B5 lane: a premature steal of a still-live hold is detected at
+  release as 98 (given cooperative unwind), never a silent double-commit. So even
+  a local clock jump is **correctness-safe, liveness-degraded** — Tier 2.
 - **Cross-host** use over a shared FS (already E1-out-of-scope) is where skew
   would actually bite: host A's mtime compared against host B's `now` with
   minutes of skew could steal live locks wholesale. But this only arises *on a
@@ -389,6 +438,25 @@ current docs imply it but never say "one clock." (b) Note the reassuring part: a
 *local* clock jump is correctness-safe (degrades to the detected-98 lane), so no
 code change is warranted. This is a **doc gap, not a code gap.**
 
+**E3 — mtime probe fails entirely (the staleness clock is unreadable).** Distinct
+from a *wrong* clock (E2): here the lock file's mtime cannot be read at all. Both
+ports retry three times on a *present* file, then warn loudly once per process —
+bash via `stat -c %Y` / `stat -f %m` / `date -r` (`git-commit-lock.sh:629-645`),
+pwsh via `Get-Item.LastWriteTimeUtc` (`git-commit-lock.ps1:531-560`): *"Staleness
+detection is BROKEN: stale locks will never be stolen, so a crashed holder wedges
+waiters until MAX_WAIT."* The stale check then treats an unreadable mtime as **not
+stale** — the floor guard `[ "$mt" -gt 946684800 ]` fails closed to "fresh"
+(`git-commit-lock.sh:925-927`). **Safety is preserved**: the tool never steals a
+lock whose age it cannot establish, so no premature steal and no corruption — but
+**recovery of a genuinely crashed holder is disabled**, and waiters block to
+MAX_WAIT (97). *Tier 2 (safety held, recovery lost — and loudly announced).*
+Untested (no stat-failure injection). **Recommend: accept and document** — it is a
+host/FS-health failure the tool already detects and announces, and it fails *safe*
+(no false steal). Fault injection is low-ROI; the loud warning is the right
+behavior. This is also the clean reason recovery is a *Tier-1-within-envelope*
+property, not unconditional (see the tier split under §1): it presumes a readable
+clock.
+
 ### F. Resource exhaustion
 
 **F1 — Disk full (ENOSPC) during a claim/lock create or write.** The create is
@@ -487,6 +555,30 @@ backstop gap as documented** — the stale window recovers it, and the supported
 option is handle-based ops (`git-commit-lock.ps1:146-151`), a larger change not
 worth it for a forgetful-caller edge.
 
+**H4 — Hard process termination / `[Environment]::Exit()` while holding (the
+no-silent-loss boundary).** §1's safety guarantee — a displaced holder reports 98
+rather than a false success — relies on the wrapper *reaching its release path*.
+Two ways that doesn't happen while a lease is held: (a) the held process is
+SIGKILL'd (untrappable; no handler runs in either port); (b) the wrapped command
+itself ends the process abruptly, the sharpest case being PowerShell
+`[Environment]::Exit(n)`, which bypasses `Lock-Release`, the `finally`, *and* the
+`PowerShell.Exiting` backstop (`git-commit-lock.ps1:221-245`). If such a process
+was *already displaced* (its lease stolen past STALE) and exits **0**, its caller
+sees success with no 98 — the one interleaving that defeats "no silent lost
+update." Two bounds keep it narrow: a SIGKILL yields a non-zero wait status, so a
+caller that checks exit codes does *not* see success; and the `run` contract pairs
+acquire/release in `try/finally`, so only a command that *itself* hard-exits the
+process (or an external SIGKILL) skips release — a normal-returning or
+signal-trapped command always reaches it. The *next* holder still recovers via
+staleness; only the abruptly-exiting one is unwarned. *Tier 2 — the residual edge
+of the fail-open lease.* Exercised indirectly: interop Test 5 *uses*
+`[Environment]::Exit()` to fabricate a no-release orphan, confirming the bypass
+(`I:308-334`). **Recommend: document this as the explicit boundary of the
+no-silent-loss guarantee**, alongside the "commits must be fast" golden rule — a
+command that hard-exits mid-critical-section *after being displaced* is exactly
+the fail-open case the STALE budget exists to make rare. No code change closes it
+without the handle-based ops the design rejected (§H3).
+
 ### I. Cross-implementation
 
 **I1 — Wire/format compatibility.** One on-disk format (token line 1, owner line
@@ -499,7 +591,7 @@ lock (T4/T5), robbed-holder 98 both directions (T8), release-classification
 agreement (T11), cross-impl claim staleness clearing (T16c), and a Windows
 PowerShell 5.1 smoke lane (T17). **Recommend: in scope, keep — and keep the
 interop suite as the guard.** Two independent implementations hammering one lock
-is the cheap adversarial verification (`README.md:92-95`).
+is "cheap adversarial verification of the protocol" (`README.md:94`).
 
 **I2 — Mixed-version tree.** Prevention (the claim protocol) holds only when
 *all* parties run it; older releases stole with an unserialized move-aside, so a
@@ -507,8 +599,9 @@ mixed tree degrades prevention to detection (98) and can leave `.dead.*` litter
 current versions don't clean (residual 4, `git-commit-lock.sh:261-265`). *Tier
 3.* Untested (would require shipping an old version into the suite). **Recommend:
 out of scope; keep the "upgrade both implementations together" deployment note**
-(it's in `README.md` and the design doc). Acceptable because the degraded mode is
-still *detected* (98), never silent.
+— currently in the design doc only (`docs/git-commit-lock.md:251-255`), **not** in
+`README.md`; surface it there too, where operators actually look. Acceptable
+because the degraded mode is still *detected* (98), never silent.
 
 ### J. Logging subsystem failure
 
@@ -531,10 +624,15 @@ flakes are real gaps vs harness concerns.
 
 **The clean split: correctness is load-independent; liveness/latency is not.**
 
-- **Load-independent (Tier 1, must always hold):** mutual exclusion, no silent
-  lost update, no corruption, eventual recovery. These rest on O_EXCL create +
-  atomic rename + per-attempt-token discovery — *structural* properties that do
-  not reference the clock for their *correctness*. The mtime floor
+- **Load-independent (Tier 1 *safety*, must always hold):** no silent lost update
+  (given cooperative unwind, §1/§H4), no corruption, and strict mutual exclusion
+  *within the staleness window*. These rest on O_EXCL create + atomic rename +
+  per-attempt-token discovery — *structural* properties that do not reference the
+  clock for their *correctness*. (Recovery of lock-shaped orphans is also
+  load-independent in *correctness* — only its latency degrades — but it presumes
+  a readable clock, §E3, and does not extend to foreign objects, per the tier
+  split under §1.) The mtime
+  floor
   (`:925`) and the read-retry ladder (`:668-684`) exist precisely so that the
   one timing-sensitive input (mtime, and transient empty reads) cannot corrupt a
   correctness decision: a sub-floor or unsettled reading is treated as "wait,"
@@ -582,8 +680,11 @@ concern:**
    fix the TEST, not the code.** Two examples surfaced by the prior stress
    effort (which I verified independently against the code, not adopted):
    - *Test 21's `≤20s` recovery-latency assertion* (`U:1144`) and
-   - *Test 22(a)'s claim-warning timing* (which needs ≥2 blocked polls before
-     MAX_WAIT to fire the two-consecutive-poll-confirmed warning, `U:1162-1168`),
+   - *Test 22(a)'s claim-path warning* — the warning relies on the
+     two-consecutive-poll confirmation (the mechanism Test 17d pins for the lock
+     path) having poll *headroom* before MAX_WAIT, which an oversubscribed runner
+     can starve (`U:1156-1172`); the test asserts the warning fires, not a specific
+     poll count,
    - and *Test 29's `≥2 CLAIM lines` discriminator* (explicitly given `MAX_WAIT=6`
      headroom, `U:1514-1518`).
 

From 57b141854a0d2c180bf5653dce75c3ec25a7500a Mon Sep 17 00:00:00 2001
From: Ben Toner <bentoner@bentoner.com>
Date: Wed, 17 Jun 2026 12:40:47 +1000
Subject: [PATCH 16/76] =?UTF-8?q?docs(failure-modes):=20review=20round=202?=
 =?UTF-8?q?=20=E2=80=94=20generalize=20the=20no-silent-loss=20bypass=20cla?=
 =?UTF-8?q?ss?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Address the round-2 reviews (Codex + fresh Claude), verified against the code:

- Codex (blocking): §H4's bypass list was incomplete for bash. lock_run runs the
  wrapped command in the wrapper shell itself (git-commit-lock.sh:1733), so a
  wrapped `exec` replaces that shell and skips BOTH lock_release and the EXIT trap
  — the same silent-loss boundary as the pwsh [Environment]::Exit() case.
  Generalize §H4 and the §1 bullet from an enumeration to the class
  "termination/replacement without wrapper unwind" (external SIGKILL / bash exec /
  [Environment]::Exit()), and add the contrast that a plain `exit` is safe
  (it unwinds: bash EXIT trap, pwsh finally).
- Claude (nit): §H4 had attributed try/finally to the bash run path; corrected to
  bash EXIT trap vs pwsh try/finally.

Both rounds confirmed the central thesis holds (no two believed-legitimate holders;
no UNdetected lost update on a local FS within the envelope) and that round 1's
revisions 1-7 are factually correct and internally consistent.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 docs/failure-modes.md | 72 ++++++++++++++++++++++++-------------------
 1 file changed, 41 insertions(+), 31 deletions(-)

diff --git a/docs/failure-modes.md b/docs/failure-modes.md
index d078453..5bb0f4f 100644
--- a/docs/failure-modes.md
+++ b/docs/failure-modes.md
@@ -49,14 +49,16 @@ both matter for the scope decision:
   Tier-1 guarantee **within the envelope (commits faster than STALE)**, not an
   unconditional one.
 - **Detection requires the wrapper to actually reach release.** The 98 path fires
-  on normal return and on trapped signals. It does **not** fire if the held
-  process is *hard-killed* (SIGKILL) or if the wrapped command terminates the
-  process abruptly — notably PowerShell `[Environment]::Exit()`, which bypasses
-  both `Lock-Release` and the `PowerShell.Exiting` backstop
-  (`git-commit-lock.ps1:221-245`). Such an abrupt exit can report success without
-  the 98 (see **§H4**). The *next* holder still recovers via staleness, but the
-  abruptly-exiting one is not warned. Hence the precise statement: **no silent
-  lost update, provided the wrapper unwinds cooperatively.**
+  on normal return and on trapped signals. It does **not** fire if the held process
+  is terminated or *replaced* without unwinding — an external SIGKILL, a bash
+  `exec` in the wrapped command (which replaces the holding shell, so neither
+  `lock_release` nor the EXIT trap runs), or PowerShell `[Environment]::Exit()`
+  (bypasses `Lock-Release`, the `finally`, and the `PowerShell.Exiting` backstop,
+  `git-commit-lock.ps1:221-245`). A *plain* `exit` is safe — it unwinds. A
+  non-unwinding exit returning 0 *while displaced* can report success without the
+  98 (see **§H4**). The *next* holder still recovers via staleness, but the
+  abruptly-exiting one is not warned. Hence the precise statement: **no silent lost
+  update, provided the wrapper unwinds cooperatively.**
 
 Liveness (eventual recovery) and bounded stalls are best-effort within an
 operating envelope (Tier 2), not absolute — and "recovery" means lock-shaped
@@ -136,7 +138,7 @@ robust-by-code-but-unverified · S static/grep check · (plat) platform-gated.
 | H1 | SIGINT/SIGTERM mid-hold | Release + re-raise (143); traps restored | 1 | ✓ U:577-600/1989-2011 | **In scope.** Keep (bash). ps1 = §H. |
 | H2 | EXIT-while-holding | Release + chain caller's EXIT trap | 1 | ✓ U:633-648 | **In scope.** Keep. |
 | H3 | ps1 process death under `-File` | `PowerShell.Exiting` does NOT fire; relies on stale window | 2 | ○ (limit documented) | **Accept;** `run` path is covered. See §H. |
-| H4 | Hard kill / `[Environment]::Exit()` while held | Bypasses release → a displaced holder is unwarned (no 98) | 2 | ~ (I:308-334 indirect) | **Document** the no-silent-loss boundary. See §H4. |
+| H4 | Non-unwinding exit while held (SIGKILL / bash `exec` / `[Environment]::Exit()`) | Skips release → a displaced holder is unwarned (no 98); plain `exit` is safe | 2 | ~ (I:308-334 indirect) | **Document** the no-silent-loss boundary. See §H4. |
 | I1 | bash⇄pwsh wire/format compatibility | Shared format; token grammar tightened to match | 1 | ✓ I:* throughout | **In scope.** Keep. |
 | I2 | Mixed-VERSION tree (old unserialized steal) | Prevention degrades to detection (98); `.dead.*` litter | 3 | ✗ | **Out of scope:** "upgrade both together." Residual 4. |
 | J1 | Logging subsystem failure | All log writes `|| true`; 1 MB self-truncate | 2 | ○ | **Accept;** logging never blocks the lock. |
@@ -555,29 +557,37 @@ backstop gap as documented** — the stale window recovers it, and the supported
 option is handle-based ops (`git-commit-lock.ps1:146-151`), a larger change not
 worth it for a forgetful-caller edge.
 
-**H4 — Hard process termination / `[Environment]::Exit()` while holding (the
-no-silent-loss boundary).** §1's safety guarantee — a displaced holder reports 98
-rather than a false success — relies on the wrapper *reaching its release path*.
-Two ways that doesn't happen while a lease is held: (a) the held process is
-SIGKILL'd (untrappable; no handler runs in either port); (b) the wrapped command
-itself ends the process abruptly, the sharpest case being PowerShell
-`[Environment]::Exit(n)`, which bypasses `Lock-Release`, the `finally`, *and* the
-`PowerShell.Exiting` backstop (`git-commit-lock.ps1:221-245`). If such a process
-was *already displaced* (its lease stolen past STALE) and exits **0**, its caller
-sees success with no 98 — the one interleaving that defeats "no silent lost
-update." Two bounds keep it narrow: a SIGKILL yields a non-zero wait status, so a
-caller that checks exit codes does *not* see success; and the `run` contract pairs
-acquire/release in `try/finally`, so only a command that *itself* hard-exits the
-process (or an external SIGKILL) skips release — a normal-returning or
-signal-trapped command always reaches it. The *next* holder still recovers via
-staleness; only the abruptly-exiting one is unwarned. *Tier 2 — the residual edge
-of the fail-open lease.* Exercised indirectly: interop Test 5 *uses*
-`[Environment]::Exit()` to fabricate a no-release orphan, confirming the bypass
-(`I:308-334`). **Recommend: document this as the explicit boundary of the
+**H4 — Process termination/replacement *without wrapper unwind* (the no-silent-loss
+boundary).** §1's safety guarantee — a displaced holder reports 98 rather than a
+false success — relies on the wrapper *reaching its release path*. The bypass class
+is any termination or replacement of the holding process that skips that unwind;
+crucially it is **not** triggered by a normal `exit`. The instances:
+- **External SIGKILL** — untrappable; no handler runs in either port.
+- **bash `exec` in the wrapped command** — `run` executes `"$@"` *in the wrapper
+  shell itself* (`git-commit-lock.sh:1733`), so an `exec` replaces that shell's
+  process image and *neither* the trailing `lock_release` *nor* the `EXIT` trap
+  (`git-commit-lock.sh:1002-1013`, armed at `:1308`) runs.
+- **PowerShell `[Environment]::Exit(n)`** — a CLR hard-exit that bypasses
+  `Lock-Release`, the `finally`, *and* the `PowerShell.Exiting` backstop
+  (`git-commit-lock.ps1:221-245`).
+
+The useful contrast: a **plain `exit` is safe** — bash `exit` fires the EXIT trap
+(which releases), and a plain `exit` inside the pwsh `run` body unwinds its
+`finally` (`git-commit-lock.ps1:1928-1979`). Only *non-unwinding* termination or
+replacement escapes. If such a process was *already displaced* (its lease stolen
+past STALE) and exits **0**, its caller sees success with no 98 — the one
+interleaving that defeats "no silent lost update." What keeps it narrow: an external
+SIGKILL yields a non-zero wait status (`128+9`), so a caller checking exit codes does
+*not* see success; the leak needs a command that *deliberately* replaces or
+hard-exits the process **and** returns 0 **while displaced**. The *next* holder
+still recovers via staleness; only the abruptly-exiting one is unwarned. *Tier 2 —
+the residual edge of the fail-open lease.* Exercised indirectly: interop Test 5
+*uses* `[Environment]::Exit()` to fabricate a no-release orphan, confirming the
+bypass (`I:308-334`). **Recommend: document this as the explicit boundary of the
 no-silent-loss guarantee**, alongside the "commits must be fast" golden rule — a
-command that hard-exits mid-critical-section *after being displaced* is exactly
-the fail-open case the STALE budget exists to make rare. No code change closes it
-without the handle-based ops the design rejected (§H3).
+command that replaces/hard-exits the process mid-critical-section *after being
+displaced* is exactly the fail-open case the STALE budget exists to make rare. No
+code change closes it without the handle-based ops the design rejected (§H3).
 
 ### I. Cross-implementation
 

From 534a0073dc5f799575e24b8218686c4e00c6eb8f Mon Sep 17 00:00:00 2001
From: Ben Toner <bentoner@bentoner.com>
Date: Wed, 17 Jun 2026 17:39:45 +1000
Subject: [PATCH 17/76] ben comments: docs/failure-modes.md

---
 docs/failure-modes.md | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/docs/failure-modes.md b/docs/failure-modes.md
index 5bb0f4f..61a4dce 100644
--- a/docs/failure-modes.md
+++ b/docs/failure-modes.md
@@ -733,6 +733,8 @@ asserting a Tier-1 bound on a Tier-2 quantity.
 
 Ordered by how much they need an explicit owner decision.
 
+agree except where indicated
+
 1. **Define and document the load/timing envelope (§K) — highest value.**
    *Recommendation:* state in `docs/git-commit-lock.md` that correctness
    (exclusion, no silent loss, eventual recovery) is load-independent, while all
@@ -760,14 +762,16 @@ Ordered by how much they need an explicit owner decision.
    possible but cross-platform-awkward and incomplete — treat as low-priority
    polish, not a requirement.
 
-4. **ps1-on-POSIX FIFO/device residual (§D3) and ps1 `-File` exit backstop gap
+Don't do the polish, just document.
+
+1. **ps1-on-POSIX FIFO/device residual (§D3) and ps1 `-File` exit backstop gap
    (§H3) — accept as documented.** Both are real but confined to an unsupported
    config (ps1-on-POSIX) or a forgetful-caller edge that the stale window
    recovers. *Recommendation:* no code change; confirm they stay documented.
    Reconsider only if PowerShell-on-POSIX ever becomes supported (it isn't,
    `README.md:91-95`).
 
-5. **Untested-but-robust-by-code lanes (resource exhaustion F1/F3/F4, log-write
+2. **Untested-but-robust-by-code lanes (resource exhaustion F1/F3/F4, log-write
    failure F2/J1).** These degrade safely (wait/97, or silent best-effort log
    loss) but have **no fault-injection tests** — they are reasoned-correct, not
    verified. *Recommendation:* accept without adding ENOSPC/EMFILE injection
@@ -776,7 +780,9 @@ Ordered by how much they need an explicit owner decision.
    → clean 97** (cheap to write deterministically; F4), since that's the most
    likely real-world misconfiguration of the set.
 
-6. **Mixed-version tree (§I2) and case-insensitive FS (§D5) — out of scope,
+i'd add test coverage for the various scenarios. It just makes the project easier to maintain and for future users to use if the these sorts of edge cases are actually tested rather than reasoned correct but untested.
+
+1. **Mixed-version tree (§I2) and case-insensitive FS (§D5) — out of scope,
    confirm.** The first degrades to detection (98), never silent, and is covered
    by the "upgrade both together" note. The second is a non-issue. *Recommendation:*
    leave both out of scope; optionally one sentence each in the design doc.

From 959cca90e839af37b0b96f1cd2edd9413678f5fe Mon Sep 17 00:00:00 2001
From: Ben Toner <bentoner@bentoner.com>
Date: Wed, 17 Jun 2026 17:39:45 +1000
Subject: [PATCH 18/76] Revert "ben comments: docs/failure-modes.md"

This reverts commit 534a0073dc5f799575e24b8218686c4e00c6eb8f.
---
 docs/failure-modes.md | 12 +++---------
 1 file changed, 3 insertions(+), 9 deletions(-)

diff --git a/docs/failure-modes.md b/docs/failure-modes.md
index 61a4dce..5bb0f4f 100644
--- a/docs/failure-modes.md
+++ b/docs/failure-modes.md
@@ -733,8 +733,6 @@ asserting a Tier-1 bound on a Tier-2 quantity.
 
 Ordered by how much they need an explicit owner decision.
 
-agree except where indicated
-
 1. **Define and document the load/timing envelope (§K) — highest value.**
    *Recommendation:* state in `docs/git-commit-lock.md` that correctness
    (exclusion, no silent loss, eventual recovery) is load-independent, while all
@@ -762,16 +760,14 @@ agree except where indicated
    possible but cross-platform-awkward and incomplete — treat as low-priority
    polish, not a requirement.
 
-Don't do the polish, just document.
-
-1. **ps1-on-POSIX FIFO/device residual (§D3) and ps1 `-File` exit backstop gap
+4. **ps1-on-POSIX FIFO/device residual (§D3) and ps1 `-File` exit backstop gap
    (§H3) — accept as documented.** Both are real but confined to an unsupported
    config (ps1-on-POSIX) or a forgetful-caller edge that the stale window
    recovers. *Recommendation:* no code change; confirm they stay documented.
    Reconsider only if PowerShell-on-POSIX ever becomes supported (it isn't,
    `README.md:91-95`).
 
-2. **Untested-but-robust-by-code lanes (resource exhaustion F1/F3/F4, log-write
+5. **Untested-but-robust-by-code lanes (resource exhaustion F1/F3/F4, log-write
    failure F2/J1).** These degrade safely (wait/97, or silent best-effort log
    loss) but have **no fault-injection tests** — they are reasoned-correct, not
    verified. *Recommendation:* accept without adding ENOSPC/EMFILE injection
@@ -780,9 +776,7 @@ Don't do the polish, just document.
    → clean 97** (cheap to write deterministically; F4), since that's the most
    likely real-world misconfiguration of the set.
 
-i'd add test coverage for the various scenarios. It just makes the project easier to maintain and for future users to use if the these sorts of edge cases are actually tested rather than reasoned correct but untested.
-
-1. **Mixed-version tree (§I2) and case-insensitive FS (§D5) — out of scope,
+6. **Mixed-version tree (§I2) and case-insensitive FS (§D5) — out of scope,
    confirm.** The first degrades to detection (98), never silent, and is covered
    by the "upgrade both together" note. The second is a non-issue. *Recommendation:*
    leave both out of scope; optionally one sentence each in the design doc.

From a5df9d9ae2d2f72199b27f320bc2a09f10d36288 Mon Sep 17 00:00:00 2001
From: Ben Toner <bentoner@bentoner.com>
Date: Wed, 17 Jun 2026 17:45:36 +1000
Subject: [PATCH 19/76] c converged: docs/failure-modes.md
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Applied Ben's review comments (frozen in 534a0073) to the failure-modes
scope-decisions doc:
- §4 status: the recommendations are reviewed and accepted, except for two overrides.
- Network FS (§4.3 / §E1): document-only — surface the boundary in README, do NOT
  build the FS-type startup probe ("don't do the polish, just document").
- Untested-but-robust lanes (§4.5 / F1-F4, J1): OVERRIDE the prior "accept untested"
  -> add test coverage. Rationale (Ben): actually-tested edge cases make the project
  easier to maintain and give future users confidence vs reasoned-correct-but-untested.
  Propagated to the per-mode F1/F2/F3/F4/J1 entries and their summary-table rows.
Disposition check passed (fresh verifier): every comment dispositioned, propagation
consistent, no leaked comment text, §4 numbering coherent.

comment-commit: 534a0073dc5f799575e24b8218686c4e00c6eb8f

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 docs/failure-modes.md | 79 +++++++++++++++++++++++++++----------------
 1 file changed, 49 insertions(+), 30 deletions(-)

diff --git a/docs/failure-modes.md b/docs/failure-modes.md
index 5bb0f4f..0332055 100644
--- a/docs/failure-modes.md
+++ b/docs/failure-modes.md
@@ -127,10 +127,10 @@ robust-by-code-but-unverified · S static/grep check · (plat) platform-gated.
 | E1 | Network/shared FS (NFS/SMB/9p/Dropbox) | Outside design guarantees (stated) | 3 | ✗ | **Out of scope** (stated). See §E — decide whether to *enforce*. |
 | E2 | Multi-host clock skew / NTP jump | Implicitly single-clock; **not** addressed in docs | 3 (and a doc gap) | ✗ | **Out of scope** but UNDER-documented. See §E2. |
 | E3 | mtime probe unreadable (staleness clock broken) | Warns loudly once; treats as not-stale → safe, recovery disabled → 97 | 2 | ○ | **Accept** — fails safe + announced. See §E3. |
-| F1 | Disk full (ENOSPC) during create/write | Create fails → wait; torn write ages out | 2/3 | ○ (reasoned, not tested) | **Accept**, document. See §F1. |
-| F2 | ENOSPC during LOG write | Swallowed (`|| true`); silent log loss | 2 | ○ | **Accept;** logging is best-effort by design. |
-| F3 | Inode / FD exhaustion | Create fails → wait → 97 | 2 | ○ | **Accept**, document. |
-| F4 | Read-only / unwritable lock dir or parent | `mkdir -p` best-effort; create fails → wait → 97 | 2 | ○ | **Accept**, document. See §F4. |
+| F1 | Disk full (ENOSPC) during create/write | Create fails → wait; torn write ages out | 2/3 | ○ → test planned | **Add test** (§4.5) + document. See §F1. |
+| F2 | ENOSPC during LOG write | Swallowed (`|| true`); silent log loss | 2 | ○ → test planned | **Add test** (§4.5); logging best-effort, lock unaffected. |
+| F3 | Inode / FD exhaustion | Create fails → wait → 97 | 2 | ○ → test planned | **Add test** (§4.5, FD via `ulimit`), document. |
+| F4 | Read-only / unwritable lock dir or parent | `mkdir -p` best-effort; create fails → wait → 97 | 2 | ○ → test planned | **Add test** (§4.5, highest-value). See §F4. |
 | G1 | Lock path = a directory / `$HOME` typo | Never stolen/deleted; loud warn; → 97 | 1 | ✓ U:818-840 | **In scope.** Keep. |
 | G2 | Garbage numeric config | Falls back to default + stderr note | 1 | ✓ U:695-703, I:554-608 | **In scope.** Keep. |
 | G3 | `run` outside a git repo, no `AGENT_LOCK_PATH` | Refuses (96) | 1 | ✓ U:705-712 | **In scope.** Keep. |
@@ -141,7 +141,7 @@ robust-by-code-but-unverified · S static/grep check · (plat) platform-gated.
 | H4 | Non-unwinding exit while held (SIGKILL / bash `exec` / `[Environment]::Exit()`) | Skips release → a displaced holder is unwarned (no 98); plain `exit` is safe | 2 | ~ (I:308-334 indirect) | **Document** the no-silent-loss boundary. See §H4. |
 | I1 | bash⇄pwsh wire/format compatibility | Shared format; token grammar tightened to match | 1 | ✓ I:* throughout | **In scope.** Keep. |
 | I2 | Mixed-VERSION tree (old unserialized steal) | Prevention degrades to detection (98); `.dead.*` litter | 3 | ✗ | **Out of scope:** "upgrade both together." Residual 4. |
-| J1 | Logging subsystem failure | All log writes `|| true`; 1 MB self-truncate | 2 | ○ | **Accept;** logging never blocks the lock. |
+| J1 | Logging subsystem failure | All log writes `|| true`; 1 MB self-truncate | 2 | ○ → test planned | **Add test** (§4.5, via F2); logging never blocks the lock. |
 | K1 | Extreme load / CPU oversubscription / slow FS | Correctness holds; wall-clock bounds stretch | 2 | ~ (CI stress) | **Define the envelope.** See §K — the key analytical section. |
 | K2 | Internal time budgets (poll, MAX_WAIT, read ladder) | Fixed schedules; tunable | 2 | ✓/~ | **In scope** as Tier-2 envelope. See §K. |
 
@@ -469,28 +469,35 @@ ages into the steal lane. A torn write *shorter than `tok.`* (e.g. `to`) is the
 accepted residual at `:299-304`: non-empty, non-prefixed → never stolen, loud,
 fixed by one manual `rm`. *Tier 2 (degrades to wait/97) / Tier 3 (the torn-write
 manual-fix residual).* Reasoned from code, **not tested** (no ENOSPC fault
-injection). **Recommend: accept and document.** ENOSPC is a host-health failure;
-the tool degrades safely (no corruption, no false hold) and the one sharp edge
-(sub-`tok.` torn write needing manual `rm`) is already documented. Not worth
-fault-injection tests.
+injection). **Recommend: document + add a fault-injection test (per §4.5).** ENOSPC
+is a host-health failure; the tool degrades safely (no corruption, no false hold)
+and the one sharp edge (sub-`tok.` torn write needing manual `rm`) is already
+documented. Per Ben's §4.5 decision, add an ENOSPC test where it can be injected
+deterministically and portably (e.g. a small dedicated tmpfs/quota); if portable
+injection proves impractical, say so in the plan rather than shipping a flaky test.
 
 **F2 — ENOSPC during a LOG write.** All log writes end in `|| true`
 (`git-commit-lock.sh:561`); a failed log write is silently lost. *Tier 2.*
-**Recommend: accept** — logging is best-effort by explicit design (it must never
-block or fail the lock). The only downside is reduced post-mortem signal under
-disk pressure, which is acceptable.
+**Recommend: accept + add a test (per §4.5)** — logging is best-effort by explicit
+design (it must never block or fail the lock); the only downside is reduced
+post-mortem signal under disk pressure. Add a test that an unwritable/failing log
+path leaves the lock fully working (the write is swallowed) — this also covers J1.
 
 **F3 — Inode / FD exhaustion.** Same shape as F1: a create that can't get an
 inode fails → wait → eventually 97. The tool holds at most a couple of FDs
-briefly. *Tier 2.* Untested. **Recommend: accept, document as host-health.**
+briefly. *Tier 2.* Untested. **Recommend: document as host-health + add a test (per §4.5).**
+An FD-exhaustion test via `ulimit -n` is the deterministic, portable option; add inode
+exhaustion only if it can be injected cleanly.
 
 **F4 — Read-only / unwritable lock dir or parent.** `lock_acquire` does a
 best-effort `mkdir -p "$(dirname …)"` (`git-commit-lock.sh:1278`); if the dir is
 unwritable the create fails every poll and the waiter times out at 97. No
 corruption, no false hold. A *release* unlink blocked by an unwritable parent
 routes to the LEFTOVER lane (`:1699-1711`). *Tier 2.* Untested directly.
-**Recommend: accept, document.** A correct, if blunt, outcome (97); arguably an
-*earlier, clearer* error would be nicer — optional polish, low priority.
+**Recommend: add a test (per §4.5 — the highest-value one).** An unwritable lock
+dir → clean 97 is cheap and deterministic to write. The 97 itself is a correct, if blunt,
+outcome; an *earlier, clearer* error would be nicer but is optional polish, low
+priority.
 
 **F5 — Memory exhaustion.** The scripts allocate trivially (a few shell vars; the
 leaked-token list is "almost always empty"). Not a meaningful failure surface.
@@ -620,11 +627,11 @@ than rotating (`git-commit-lock.sh:554-562`). A broken log never blocks or fails
 the lock. Under a redirected git dir, log *content* (the owner line) is
 attacker-influenceable — one-line text spoofing, no execution; the tool itself
 writes only its token, owner line, and protocol events, never secrets
-(`docs/git-commit-lock.md:543-551`). *Tier 2.* **Recommend: accept** — logging
-is best-effort by design, which is the right call for a lock that must keep
-working when the disk is full or the log path is bad. The only follow-on: don't
-build automation that *trusts* log text from an untrusted repo (already
-documented).
+(`docs/git-commit-lock.md:543-551`). *Tier 2.* **Recommend: accept + covered by the
+F2 log-failure test (per §4.5)** — logging is best-effort by design, which is the
+right call for a lock that must keep working when the disk is full or the log path
+is bad. The follow-on (unchanged): don't build automation that *trusts* log text
+from an untrusted repo (already documented).
 
 ### K. Behavior under extreme load / scheduling pressure, and internal time budgets
 
@@ -733,6 +740,12 @@ asserting a Tier-1 bound on a Tier-2 quantity.
 
 Ordered by how much they need an explicit owner decision.
 
+**Status (Ben, 2026-06-17): reviewed and accepted — with two changes marked below.**
+Item 3 (network FS) is **document-only**: do not build the FS-type probe. Item 5 is
+**overridden** — the untested-but-robust lanes *will* get test coverage (actually-tested
+edge cases make the tool more maintainable and give future users confidence), rather than
+"accept untested". Every other recommendation is accepted as written.
+
 1. **Define and document the load/timing envelope (§K) — highest value.**
    *Recommendation:* state in `docs/git-commit-lock.md` that correctness
    (exclusion, no silent loss, eventual recovery) is load-independent, while all
@@ -754,11 +767,11 @@ Ordered by how much they need an explicit owner decision.
 
 3. **Network/shared FS is out of scope but fails *silently* if entered (§E1).**
    The boundary is correctly stated in the design doc but only there.
-   *Recommendation:* surface it in `README.md` (where operators look), since the
-   failure on a bad FS is silent loss of exclusion. Do **not** attempt to
-   *support* network FS. An optional best-effort FS-type startup probe is
-   possible but cross-platform-awkward and incomplete — treat as low-priority
-   polish, not a requirement.
+   *Decision (Ben — document-only):* surface the boundary in `README.md` (where
+   operators look), since the failure on a bad FS is silent loss of exclusion. Do
+   **not** attempt to *support* network FS, and **do not build** the optional
+   FS-type startup probe — just document. (It would be cross-platform-awkward and
+   incomplete anyway; Ben: "don't do the polish, just document.")
 
 4. **ps1-on-POSIX FIFO/device residual (§D3) and ps1 `-File` exit backstop gap
    (§H3) — accept as documented.** Both are real but confined to an unsupported
@@ -770,11 +783,17 @@ Ordered by how much they need an explicit owner decision.
 5. **Untested-but-robust-by-code lanes (resource exhaustion F1/F3/F4, log-write
    failure F2/J1).** These degrade safely (wait/97, or silent best-effort log
    loss) but have **no fault-injection tests** — they are reasoned-correct, not
-   verified. *Recommendation:* accept without adding ENOSPC/EMFILE injection
-   tests (low ROI; the degradation is structurally safe). If the owner wants one
-   belt-and-braces test, the highest-value single one is an **unwritable lock dir
-   → clean 97** (cheap to write deterministically; F4), since that's the most
-   likely real-world misconfiguration of the set.
+   verified. *Decision (Ben — overrides the prior "accept untested"):* **add test
+   coverage** for these lanes. Rationale: actually-tested edge cases make the
+   project easier to maintain and give future users confidence, versus
+   "reasoned-correct but untested." Add deterministic fault-injection tests where
+   feasible — **unwritable lock dir → clean 97** (F4, cheapest/highest-value and
+   the most likely real-world misconfig); an **unwritable log path → the lock
+   still works, the log write is swallowed** (F2/J1); and the **ENOSPC / inode /
+   FD-exhaustion** lanes (F1/F3) where they can be injected deterministically and
+   portably (e.g. a small dedicated tmpfs or quota for ENOSPC, `ulimit -n` for
+   FDs). Flag in the plan any lane that proves genuinely impractical to fault-inject
+   portably, rather than forcing a flaky test.
 
 6. **Mixed-version tree (§I2) and case-insensitive FS (§D5) — out of scope,
    confirm.** The first degrades to detection (98), never silent, and is covered

From 9048400ae6ef92f904e01e62fda46fd7a18aba4b Mon Sep 17 00:00:00 2001
From: Ben Toner <bentoner@bentoner.com>
Date: Wed, 17 Jun 2026 17:48:10 +1000
Subject: [PATCH 20/76] Plan proposal: guarantees spec + failure-modes
 follow-ups (await Ben review)

---
 ...-ci-stress-guarantees-and-coverage-plan.md | 124 ++++++++++++++++++
 1 file changed, 124 insertions(+)
 create mode 100644 .plans/2026-06-17-ci-stress-guarantees-and-coverage-plan.md

diff --git a/.plans/2026-06-17-ci-stress-guarantees-and-coverage-plan.md b/.plans/2026-06-17-ci-stress-guarantees-and-coverage-plan.md
new file mode 100644
index 0000000..7b067ce
--- /dev/null
+++ b/.plans/2026-06-17-ci-stress-guarantees-and-coverage-plan.md
@@ -0,0 +1,124 @@
+# Plan proposal: guarantees spec + close the failure-modes follow-ups
+
+Status: **PROPOSAL — awaiting Ben's review.** No implementation until approved.
+This is the action list + proposed workflow Ben asked for after the `/c` pass on
+`docs/failure-modes.md` (his comments converged at commit a5df9d9; recorded 534a0073).
+
+## Where this comes from
+`docs/failure-modes.md` is the **analysis / decision-support** doc (current behavior,
+3-tier classification, recommendations). Ben has now decided on its §4 (agree, with two
+overrides). The follow-ups below turn those decisions into work, and add the new doc Ben
+asked for: a **normative spec** ("what we guarantee / what's out of scope") — distinct from
+the analysis doc.
+
+## Action list (requirements / things to do)
+
+### Bucket 1 — NEW normative guarantees spec (Ben's explicit ask)
+- **A1.** Create a normative spec doc — *what the tool guarantees* and *what is out of
+  scope* — derived from `failure-modes.md`'s tiers but written as a contract, not analysis.
+  - Guarantees: the Tier-1 **safety** properties (no silent lost update given cooperative
+    unwind; strict mutual exclusion within the staleness window; no corruption) and the
+    Tier-1 **recovery** properties (lock-shaped orphans reclaimed), each with their stated
+    conditions/envelope.
+  - Out of scope: network/shared FS, multi-host/clock-skew, mixed-version trees,
+    ps1-on-POSIX, the non-unwinding-exit boundary (§H4) — the documented boundaries.
+  - Defines the **operating envelope** precisely (the load/timing envelope from §4.1) — the
+    reference Bucket 4 scopes tests against.
+  - *Open decision D-a:* location/name — `docs/guarantees.md` (new), or a normative section
+    inside `docs/git-commit-lock.md`? (Recommend a dedicated `docs/guarantees.md` — a crisp
+    contract is easier to point users/CI at than a section.)
+
+### Bucket 2 — Test coverage for the untested-but-robust lanes (§4.5, Ben's override)
+Decision (Ben): tested edge cases > reasoned-correct-but-untested. Add deterministic,
+**portable**, fault-injection tests; flag any lane that can't be injected portably rather
+than shipping a flake. **All test execution via CI** (local runs are banned: they slow down
+Ben's box).
+- **B-F4.** Unwritable lock dir/parent → clean 97 (cheapest, highest-value; `chmod`).
+- **B-F2/J1.** Unwritable / failing log path → lock still works, the log write is swallowed.
+- **B-F1.** ENOSPC during claim/lock create+write (small dedicated tmpfs or quota).
+- **B-F3.** FD exhaustion via `ulimit -n` (portable); inode exhaustion only if cleanly
+  injectable.
+- **B-E3 (candidate).** mtime probe unreadable → staleness-detection-disabled, fail-safe
+  (no steal), 97 + the once-per-process warning. (Also a ○ untested lane; fits the same
+  decision — include unless Ben says skip.)
+- *Open decision D-b:* scope — just the §4.5 set (F1-F4, J1) + E3, or also fold in the two
+  **deferred F2-audit gaps**: #7 wrong-type object appearing *at the lock path mid-steal*
+  (A2/G2 — `CLAIM-ABORT (wrong-type)`/`(rename-refused)`), and #8 the Windows-only
+  blocked-unlink legs? (Recommend: do F4/F2/J1/F3 now; treat F1-ENOSPC, E3, and #7/#8 as a
+  second tier to confirm.)
+- Platform reality: several lanes are POSIX-only (tmpfs, `ulimit`, chmod semantics) — guard
+  by platform like the existing suite does; Windows-specific lanes (no-delete-share) already
+  have their own gated tests.
+
+### Bucket 3 — Documentation gaps (all "document" decisions: §4.1-4.3, §4.6, §I2)
+- **C-envelope (§4.1).** Document the load/timing envelope in `docs/git-commit-lock.md`:
+  "correctness is load-independent; wall-clock bounds (recovery latency, MAX_WAIT, the read
+  ladder) are best-effort and scale with scheduling."
+- **C-clock (§4.2).** One sentence: the tool assumes a single time source (single-host, or a
+  shared FS with one server clock); a local clock jump is correctness-safe.
+- **C-netfs (§4.3).** Surface the network/shared-FS boundary in `README.md` (document-only,
+  **no** FS-type probe).
+- **C-mixedver (§I2).** Add the "upgrade both implementations together" note to `README.md`
+  (currently design-doc-only).
+- **C-misc (§4.6, optional).** One line each for mixed-version + case-insensitive FS in the
+  design doc.
+
+### Bucket 4 — Scope the wall-clock test bounds (§4.1 — the Test 21/22a resolution)
+- **S1.** Relax / scope the wall-clock assertions that flake only under extreme artificial
+  load — **Test 21** (≤20s recovery), **Test 22a** (claim-warning timing), **Test 29**
+  (≥2-CLAIM poll count) — to the envelope Bucket 1 defines, so the protocol's correctness
+  assertions in those tests stay strict while the latency/poll-count bounds get headroom (or
+  are gated to a defined load level). *Depends on Bucket 1's envelope.*
+- *Open decision D-c:* relax the numbers in place, or split the suite into a
+  "correctness" tier (always strict) and a "latency/envelope" tier the extreme-stress runs
+  don't hard-fail on? (Recommend the latter — it makes the envelope explicit and stops
+  future stress runs re-raising these as "flakes".)
+
+### Bucket 5 — Branch hygiene (standing, NOT part of this workflow unless wanted)
+- Separate the mergeable commits (the 4 test fixes 58c3741/06c6d8e/51a1753/19a28fd + the docs) from the
+  **stress-only, do-not-merge** commits (980856b concurrency tweak, b430d73 load wrapper).
+  When this lands on `main`, cherry-pick the mergeable set and leave the stress scaffolding.
+  *Open decision D-d:* do this work on `ci-stress` and cherry-pick later, or branch a clean
+  `failure-modes` off `main` now? (Recommend: keep working on `ci-stress`; cherry-pick at the
+  end — the stress wrapper is useful for CI-verifying the new tests under load.)
+
+## Proposed workflow (our usual approach: spec → plan → implement → review)
+
+Each phase ends with **Claude + Codex review rounds to convergence** and a **Ben gate**.
+Test execution is **CI-only** throughout.
+
+**Phase 1 — Spec.** Write the Bucket-1 guarantees/scope spec + the precise operating
+envelope. Review (Claude + Codex) against the code and `failure-modes.md`. → Ben approves the
+spec before any implementation. (This is where the new doc Ben asked for gets created.)
+
+**Phase 2 — Plan.** A concrete implementation plan for Buckets 2-4: per-test injection method
+(tmpfs / `ulimit` / chmod) + platform guard + CI wiring; the exact doc edits; the test-bound
+scoping approach (per D-c). Include a logging/observability note (what each new test asserts
+in the logs). Record in `.plans/`, review (Claude + Codex). → Ben approves the plan.
+
+**Phase 3 — Implementation.** Build the fault-injection tests (Bucket 2), apply the doc edits
+(Bucket 3), scope the wall-clock bounds (Bucket 4). Commit incrementally under the
+commit-lock. **Verify via CI** (dispatch `tests.yml` on `ci-stress`) — never locally.
+
+**Phase 4 — Review.** Review the diff (Claude + Codex); run the full suite via CI **under the
+stress load wrapper** to confirm (a) the new tests pass and are non-flaky, and (b) the scoped
+bounds stop Test 21/22a/29 flaking at extreme load while keeping correctness strict. Iterate
+to clean. → Ben's final review.
+
+**Execution mechanics (open decision D-e):** run the phases by hand (subagent review rounds as
+this session has been doing), or drive Phases 3-4 with a Claude Code **Workflow** (multi-agent
+fan-out — one agent per test lane, adversarial verify, etc.)? (Recommend: hand-run Phase 1-2;
+consider a Workflow for Phase 3-4 if the test count grows. Your call.)
+
+## Decisions I need from Ben (summary)
+- **D-a:** new `docs/guarantees.md` vs a section in the design doc. (rec: new doc)
+- **D-b:** test scope — §4.5 set + E3 now, defer #7/#8? (rec: F4/F2/J1/F3 first; F1/E3/#7/#8 second tier)
+- **D-c:** scope test bounds by relaxing numbers vs a correctness/envelope test split. (rec: split)
+- **D-d:** keep on `ci-stress` + cherry-pick later vs clean branch now. (rec: ci-stress)
+- **D-e:** hand-run vs Workflow for Phase 3-4. (rec: hand-run 1-2, decide later for 3-4)
+
+## Out of scope for this plan
+- Anything the design already rejected (heartbeat, two-rename CAS, `File.Replace`, supporting
+  network FS) — see `failure-modes.md` §4 "Things explicitly NOT to do".
+- No product *behavior* changes are implied by any of the above — these are tests + docs +
+  test-bound scoping. (If a new test surfaces a real product bug, that's a separate loop.)

From 261744963fa68dae4ce090f2ec025a414b369550 Mon Sep 17 00:00:00 2001
From: Ben Toner <bentoner@bentoner.com>
Date: Wed, 17 Jun 2026 18:05:29 +1000
Subject: [PATCH 21/76] Plan: lock decisions D-a..e; add Bucket 6 / Phase 1b
 (load-testing strategy recommendation, Ben f)

---
 ...-ci-stress-guarantees-and-coverage-plan.md | 96 ++++++++++++-------
 1 file changed, 64 insertions(+), 32 deletions(-)

diff --git a/.plans/2026-06-17-ci-stress-guarantees-and-coverage-plan.md b/.plans/2026-06-17-ci-stress-guarantees-and-coverage-plan.md
index 7b067ce..0bf4445 100644
--- a/.plans/2026-06-17-ci-stress-guarantees-and-coverage-plan.md
+++ b/.plans/2026-06-17-ci-stress-guarantees-and-coverage-plan.md
@@ -82,40 +82,72 @@ Ben's box).
   `failure-modes` off `main` now? (Recommend: keep working on `ci-stress`; cherry-pick at the
   end — the stress wrapper is useful for CI-verifying the new tests under load.)
 
-## Proposed workflow (our usual approach: spec → plan → implement → review)
+### Bucket 6 — Principled load-&-matrix testing STRATEGY (Ben "f", 2026-06-17) — RECOMMENDATION DOC, not code
+The current load injection (`tests/with-load.sh`: N CPU spin-loops + N disk write/fsync/delete
+loops) was thrown together from a few lines of discussion. Ben wants a **considered,
+first-principles rethink** — explicitly **not anchored on the existing approach** — whose
+**deliverable is a recommendation doc for Ben, NOT an implementation.** Scope:
+- **Is the load injection right?** From first principles: which KINDS of load actually stress
+  *this* tool's timing-critical windows (claim→rename, read-back, discovery, mtime/staleness,
+  fsync durability, scheduler preemption at critical points)? Are CPU-spin + disk-fsync the
+  right proxies, or are better mechanisms warranted (cgroup CPU throttling, `taskset`/`nice`,
+  `ionice`, `stress-ng` stressors, FUSE/FS-latency injection, memory pressure)? Faithfulness,
+  reproducibility, and calibration (load relative to runner core count).
+- **Expand the CI matrix** on free public GitHub runners: run the suite across
+  {OS} × {load level} × {load kind} × {config} in parallel. Decide how many cells counts as
+  *considered* rather than *blowing it up*: diminishing returns, signal-per-cell, GitHub
+  concurrency limits, a small per-PR tier vs a larger nightly tier.
+- **Get more from EXISTING tests, routinely:** parametrize the fan-out/timing tests across
+  waiter counts and knob values (STALE / CLAIM_STALE / POLL / MAX_WAIT) so each run exercises
+  more surface, without adding flakiness. Identify which tests benefit most.
+- **Considered, not maximalist:** principles for choosing the matrix + a routine cadence.
+Output: `docs/load-testing-strategy.md` (recommendation). Runs EARLY (Phase 1b) because it
+shapes Buckets 2 & 4 and the Phase-2 plan.
+
+## Workflow (settled: spec → plan → implement → review)
 
 Each phase ends with **Claude + Codex review rounds to convergence** and a **Ben gate**.
-Test execution is **CI-only** throughout.
-
-**Phase 1 — Spec.** Write the Bucket-1 guarantees/scope spec + the precise operating
-envelope. Review (Claude + Codex) against the code and `failure-modes.md`. → Ben approves the
-spec before any implementation. (This is where the new doc Ben asked for gets created.)
-
-**Phase 2 — Plan.** A concrete implementation plan for Buckets 2-4: per-test injection method
-(tmpfs / `ulimit` / chmod) + platform guard + CI wiring; the exact doc edits; the test-bound
-scoping approach (per D-c). Include a logging/observability note (what each new test asserts
-in the logs). Record in `.plans/`, review (Claude + Codex). → Ben approves the plan.
-
-**Phase 3 — Implementation.** Build the fault-injection tests (Bucket 2), apply the doc edits
-(Bucket 3), scope the wall-clock bounds (Bucket 4). Commit incrementally under the
-commit-lock. **Verify via CI** (dispatch `tests.yml` on `ci-stress`) — never locally.
-
-**Phase 4 — Review.** Review the diff (Claude + Codex); run the full suite via CI **under the
-stress load wrapper** to confirm (a) the new tests pass and are non-flaky, and (b) the scoped
-bounds stop Test 21/22a/29 flaking at extreme load while keeping correctness strict. Iterate
-to clean. → Ben's final review.
-
-**Execution mechanics (open decision D-e):** run the phases by hand (subagent review rounds as
-this session has been doing), or drive Phases 3-4 with a Claude Code **Workflow** (multi-agent
-fan-out — one agent per test lane, adversarial verify, etc.)? (Recommend: hand-run Phase 1-2;
-consider a Workflow for Phase 3-4 if the test count grows. Your call.)
-
-## Decisions I need from Ben (summary)
-- **D-a:** new `docs/guarantees.md` vs a section in the design doc. (rec: new doc)
-- **D-b:** test scope — §4.5 set + E3 now, defer #7/#8? (rec: F4/F2/J1/F3 first; F1/E3/#7/#8 second tier)
-- **D-c:** scope test bounds by relaxing numbers vs a correctness/envelope test split. (rec: split)
-- **D-d:** keep on `ci-stress` + cherry-pick later vs clean branch now. (rec: ci-stress)
-- **D-e:** hand-run vs Workflow for Phase 3-4. (rec: hand-run 1-2, decide later for 3-4)
+Test execution is **CI-only** throughout (local runs slow down Ben's box).
+
+**Phase 1a — Guarantees spec.** Write `docs/guarantees.md` (D-a) — what we guarantee / what's
+out of scope, as a normative contract + the precise operating envelope. Review (Claude +
+Codex) against the code + `failure-modes.md`. → Ben gate.
+
+**Phase 1b — Load-&-matrix testing STRATEGY recommendation (Bucket 6 / Ben "f").** Run a
+considered, first-principles process (parallel research agents on distinct facets: the tool's
+timing-window→load-type mapping + critique of the current wrapper; CI-matrix design on free
+runners; existing-test parametrization), synthesize into `docs/load-testing-strategy.md`,
+review (Claude + Codex). **Recommendation only — NO implementation.** → Ben reviews; his chosen
+recommendations feed Phase 2. Runs early because it shapes Buckets 2 & 4. (1a and 1b are
+independent and can run in parallel.)
+
+**Phase 2 — Plan.** Concrete implementation plan for Buckets 2-4, incorporating Ben's chosen
+load/matrix recommendations: per-test injection method (tmpfs / `ulimit` / chmod) + platform
+guard + CI wiring; the matrix/parametrization to adopt; exact doc edits; the
+correctness/envelope test split (D-c); a logging/observability note. Record in `.plans/`,
+review. → Ben gate.
+
+**Phase 3 — Implementation.** Build the fault-injection tests (Bucket 2, tiered per D-b), apply
+the doc edits (Bucket 3), scope the wall-clock bounds + split the tiers (Bucket 4 / D-c), wire
+the agreed CI matrix (Bucket 6). Commit incrementally under the commit-lock. **Verify via CI**
+(dispatch `tests.yml` on `ci-stress`) — never locally.
+
+**Phase 4 — Review.** Review the diff (Claude + Codex); run the full suite via CI **across the
+agreed matrix** to confirm new tests pass + are non-flaky, the scoped bounds hold, and the
+matrix surfaces no new flakes. Iterate to clean. → Ben's final review. Then (D-d) cherry-pick
+the mergeable commits to `main`.
+
+## Decisions (settled 2026-06-17)
+- **D-a → new `docs/guarantees.md`** (dedicated normative doc).
+- **D-b → accept rec:** F4 / F2-J1 / F3 first tier; F1-ENOSPC, E3, and the deferred F2-audit
+  gaps (#7 wrong-type-mid-steal, #8 Windows blocked-unlink) as a second tier.
+- **D-c → split the suite** into a strict-correctness tier (always enforced) and a
+  latency/envelope tier (not hard-failed by extreme-stress runs).
+- **D-d → keep on `ci-stress`**, cherry-pick the mergeable commits to `main` at the end.
+- **D-e → my choice:** hand-run Phases 1-2; decide Phase 3-4 (hand vs Workflow) once the
+  test/matrix count is known.
+- **"f" → Bucket 6**, above: a considered, first-principles load-&-matrix testing
+  **recommendation doc** (not implementation), run early as Phase 1b.
 
 ## Out of scope for this plan
 - Anything the design already rejected (heartbeat, two-rename CAS, `File.Replace`, supporting

From 0397aaa1e971974cac6402873fbd7475673f1043 Mon Sep 17 00:00:00 2001
From: Ben Toner <bentoner@bentoner.com>
Date: Wed, 17 Jun 2026 18:12:36 +1000
Subject: [PATCH 22/76] docs: load-&-matrix testing strategy recommendation
 (Ben f, Phase 1b draft)

---
 docs/load-testing-strategy.md | 311 ++++++++++++++++++++++++++++++++++
 1 file changed, 311 insertions(+)
 create mode 100644 docs/load-testing-strategy.md

diff --git a/docs/load-testing-strategy.md b/docs/load-testing-strategy.md
new file mode 100644
index 0000000..c7ebd73
--- /dev/null
+++ b/docs/load-testing-strategy.md
@@ -0,0 +1,311 @@
+# Load & matrix testing strategy — recommendation
+
+**Status: RECOMMENDATION for Ben's decision — not an implementation.** Produced by a
+considered, first-principles process (three parallel research agents — load fidelity, CI
+matrix, test parametrization — synthesized and cross-checked against the code), deliberately
+**not anchored** on the current `tests/with-load.sh` approach (which was thrown together from a
+few lines of discussion). It answers: are we injecting load the right way / of the right
+kinds; how to use the free public GitHub runners for a load×config matrix; and how to get more
+from the existing tests routinely — while staying **considered, not maximalist**.
+
+Grounded in `docs/failure-modes.md` (esp. §K and the correctness-vs-liveness split) and the
+product/test code. Where it cites a fact about GitHub Actions limits, treat the number as
+"current as of writing, confirm against GitHub docs before relying on it."
+
+---
+
+## 0. Headline recommendations (skim)
+
+1. **Reframe load's job.** Correctness here is *load-independent* (O_EXCL + atomic rename +
+   per-attempt tokens never consult the clock for a correctness decision). So load can't break
+   exclusion or cause a silent lost update. Load has exactly two jobs: **(J1)** perturb
+   scheduling so the protocol's multi-syscall sequences get preempted at adversarial points
+   (race-surfacing), and **(J2)** broaden configs to exercise different code paths. Load
+   *magnitude* past ~2× CPU oversubscription mostly manufactures *harness wall-clock flakes*,
+   not bugs.
+2. **The biggest race-coverage lever is NOT external load — it's deterministic steering.** The
+   genuinely dangerous windows are reachable *deterministically* only by the in-process
+   function-interposition the suite already uses. Invest there first; external load is a
+   secondary, probabilistic complement for the few windows it can actually move.
+3. **Three-tier CI:** a **Required** per-PR gate with **no artificial load** (so a red gate
+   always means a real correctness bug); a **Nightly** non-blocking tier that adds calibrated
+   load × kind and the parametrization sweeps, with wall-clock assertions relaxed to warnings;
+   and an on-demand **Deep sweep** (the current stress design) for the 50-clean hunt.
+4. **Fix the injection: calibrate, target, record.** Express load as an *oversubscription
+   ratio* relative to core count (not an absolute hog count); prefer calibrated mechanisms
+   (`stress-ng`, Linux cgroup `cpu.max`/`io.max`) over free-running spinners; write a per-run
+   load-manifest artifact so a flake is reproducible.
+5. **Embrace platform asymmetry** instead of a uniform injection layer: steering everywhere
+   (portable); calibrated latency on the Linux leg only; plain CPU oversubscription as the
+   macOS/Windows fallback — and record per-leg which regime actually ran.
+6. **Get more from existing tests** via a *bounded* parametrization of a named handful (waiter
+   count, fail-open ratio, poll cadence) — with strict correctness assertions kept
+   config-independent and wall-clock assertions moved to the envelope tier.
+
+---
+
+## 1. What load testing is FOR here (the reframe that drives everything)
+
+This is **not** a throughput-bound system whose correctness degrades under load. Per
+`failure-modes.md` §1/§K, safety/exclusion rest on structural primitives (atomic
+create/rename, per-attempt-token discovery) that never reference the clock for a *correctness*
+decision. No amount of CPU/IO pressure makes `rename(2)` non-atomic or lets two O_EXCL creates
+both win on a local FS.
+
+So load's honest purpose is narrow: **make the protocol's multi-syscall sequences (which are
+not individually atomic) get preempted at adversarial points, so the inter-process
+interleavings the code claims to handle are actually exercised** — plus widen the few
+genuinely timing-derived decisions (mtime staleness, the FILETIME-zero floor, empty-read
+retries). The right metric for a load regime is *"does it raise the probability that process A
+is suspended between syscall N and N+1 while process B advances?"* — **not** *"does it consume
+the box?"*
+
+**Direct consequence (the most important single point):** beyond ~2× CPU oversubscription,
+more load does not find new correctness bugs — it only stretches wall-clock latency and starts
+blowing the suite's *Tier-2* wall-clock assertions (Test 21's ≤20s recovery, Test 22a's
+warning timing, Test 29's poll-count), which `failure-modes.md` §K already identifies as
+asserting a Tier-1 bound on a Tier-2 quantity. The fix is to **scope the bound**, not pile on
+load. This is why the strategy below puts load in non-blocking tiers and keeps the gate clean.
+
+---
+
+## 2. The biggest lever is deterministic steering, not load
+
+The protocol's scary windows — and whether *external load* can even reach them:
+
+| Window | Code | Reachable by external load? |
+|---|---|---|
+| create → read-back verify | `git-commit-lock.sh:1336-1357` | Only probabilistically (the window is one command substitution wide); deterministically via steering |
+| **claim recheck → touch → re-verify → rename** (residual 1/2 — THE delicate path) | `:1092-1168` | Probabilistically via CPU preemption; deterministically only via steering |
+| rename-over → read-back (steal install) | `:1168-1179` | Same — steering for determinism |
+| **mtime staleness / fail-open boundary (B5)** | `:1408-1410`, `:928` | **Yes** — CPU/IO load stretches cadence and can push a contended holder past STALE → exercises the 98-detect lane. The most realistic "load surfaces a real lane" case. |
+| two-poll wrong-type confirmation (ghosts) | `:1518-1567` | **Yes, but mostly the bad way** — oversubscription *starves* the poll headroom → manufactures the Test 22a-style flake rather than finding a bug |
+| FILETIME-zero floor (Windows) | `:925`, `:1408` | **No** — a *create-churn* artifact, not load-driven |
+| empty-read retry ladder (AV/create→write) | `:668-684` | Realistic trigger is Windows AV/filter-drivers, not synthetic load |
+
+**Takeaway:** the windows where a *wrong interleaving could actually corrupt state*
+(create→readback, claim→rename, rename→readback, release boundary) are reached *deterministically*
+only by the in-process function-interposition steering the suite already does (`clone_fn`,
+`tests/git-commit-lock.test.sh:127-136`). External load merely raises the background
+probability of hitting an interleaving nobody scripted. **So the primary race-coverage
+investment is MORE STEERED SCENARIOS** (portable, deterministic, attributable) — e.g. steered
+cases that park the claimant between recheck and rename, and between touch and rename, firing a
+clearer + rival. External load is a *secondary, probabilistic* complement, valuable mainly for
+the staleness/fail-open boundary (B5) it can genuinely move.
+
+A corollary for triage: because external load *cannot* break correctness, a load run that
+produces a *correctness* failure is surfacing either (a) a real logic bug in a steering-only
+window (high value) or (b) a *test-harness* setup race (`sync_waiting_fresh`/`backdate_ghost`
+losing its race under load) — a harness fix, not a code fix. Prefer deterministic mechanisms so
+an observed failure is *attributable*.
+
+---
+
+## 3. Fix the load injection: calibrate, target, record
+
+**Critique of the current `tests/with-load.sh`** (N bare CPU spinners + N `dd … conv=fsync`
+create/write/delete loops): it is a *reasonable background-jitter generator* and adequate for
+"run the whole suite under generic pressure," but from first principles it is:
+- **Uncalibrated / non-reproducible:** `LOAD=N` spinners produce wildly different real
+  preemption pressure on a 2-core vs 4-core runner, so "we tested at load N" doesn't mean a
+  fixed thing — violating the reproducible-experiments requirement.
+- **Untargeted:** a box-wide hog perturbs *everyone uniformly* (including the rival you wanted
+  to advance), so it adds jitter but doesn't *bias* the interleaving toward the adversarial
+  order. The high-value windows need a *scalpel* (slow one syscall in one process), which it
+  can't do.
+- **Blind to two windows:** it can't widen the create→write gap (the lock create is one
+  redirect, no fsync to delay) and can't *produce* the Windows delete-pending ghost (it churns
+  unrelated files); its main effect on those is the *poll-starvation false-flake* direction.
+- **Self-defeating at high N:** on a 2-core runner it pushes wall-clock far enough to blow the
+  harness's own timeouts (the workflow already had to raise every step timeout 2–3×) — load
+  manufacturing churn, not findings.
+
+**Recommendations:**
+- **Express load as an oversubscription ratio `R = stressors / nproc`** (e.g. R ∈ {0, 1, 2}),
+  not an absolute hog count, so a level is runner-independent.
+- **Prefer calibrated mechanisms:** `stress-ng --cpu $((R*nproc)) --cpu-load … --metrics`
+  (defined, measurable) over bare spinners; on **Linux**, prefer **cgroup throttling**
+  (`systemd-run --user --scope -p CPUQuota=…` / `io.max`) which gives *deterministic,
+  reproducible* latency — the right tool for **envelope validation** (a 10% CPU quota means the
+  same everywhere; "8 hogs" does not).
+- **Record a per-run `load-manifest`** artifact next to the suite logs: `{kind, R, nproc,
+  achieved-slowdown, tool versions, runner os/arch, git sha}`, uploaded on *success too* (you
+  need the negatives to interpret the positives). Optionally probe achieved slowdown with a
+  fixed micro-benchmark before/during load.
+- **Cap routine load at ~2× oversubscription;** higher R only on the deep-sweep flake-hunt leg
+  (whose *correctness* assertions stay strict but *wall-clock* assertions are relaxed).
+
+---
+
+## 4. Embrace platform asymmetry (don't build a uniform injection layer)
+
+The platforms diverge too much for a "uniform" load layer (cgroups & FUSE are Linux-only;
+macOS SIP blocks `DYLD_INSERT_LIBRARIES` on system binaries; Windows has neither). Don't fight
+it — structure around it and **record which regime ran per leg**:
+
+- **Deterministic steering** — *everywhere* (portable bash; pwsh equivalent). The real
+  race-coverage tool.
+- **Calibrated latency** (cgroup `cpu.max`/`io.max`; optionally `strace -e inject` to slow one
+  syscall in one process; a FUSE fsync-delay shim only if window W7 is prioritized) — **Linux
+  leg only**.
+- **CPU oversubscription** (`stress-ng` or the bash-spinner fallback) — the **macOS/Windows**
+  fallback, uncalibrated; document the asymmetry.
+
+Low-yield, **avoid:** memory/swap pressure (trivial allocation surface; risks OOM-killing the
+harness), raw disk-bandwidth saturation (doesn't touch metadata-op latency), de-prioritizing
+the background hogs. `ulimit`/inode/FD exhaustion belong to the *fault-injection tests* (the
+§4.5 work), not the timing-load regime.
+
+---
+
+## 5. The three-tier CI structure (the matrix)
+
+The organizing recommendation. It maps directly onto the already-decided correctness/envelope
+test split (D-c).
+
+### Tier R — Required / per-PR (blocking) — KEEP the existing 4 cells, STRIP the load
+| Cell | OS | Engines | Buys |
+|---|---|---|---|
+| R1 | ubuntu | bash + pwsh7 (all suites) | Linux correctness + interop baseline |
+| R2 | macos | bash + pwsh7 (all suites) | BSD `stat`/`mv` lanes (D1/E3) — *only* place these run |
+| R3 | windows (unit leg) | bash (MINGW) | delete-pending ghosts, FILETIME floor |
+| R4 | windows (interop+integration leg) | bash + pwsh7 + **PowerShell 5.1** | the 5.1 non-atomic-fallback path (D1) + real NTFS commit swarm |
+
+This is exactly today's matrix **minus the stress env**. Running it at **`none` load** means it
+only ever asserts Tier-1 correctness — it *cannot* flake on a Tier-2 wall-clock bound, so **a
+red required check always means a real bug.** Target < ~8 min. (Also: flip the concurrency group
+back to `${{ github.workflow }}-${{ github.ref }}` + `cancel-in-progress: true` — the current
+per-run-unique group is a *deep-sweep* setting, which is exactly why the stress branch is marked
+"do NOT merge to main.")
+
+### Tier N — Nightly / scheduled (non-blocking, triaged)
+~6 cells adding load **kind** (cpu / disk / both) at **one** oversubscribed level (R≈2), plus
+the §6 parametrization sweeps. Run with **`GCL_ENVELOPE_TIER=relax`** so the three known
+load-sensitive assertions (Test 21 ≤20s, Test 22a warning, Test 29 poll-count) **downgrade to
+warnings** while correctness assertions stay hard. Example cells: ubuntu×{disk, both, cpu},
+macos×disk, windows×{disk on the interop+5.1 leg (highest-value); both on the unit leg}.
+Auto-file a triaged issue on failure tagged `correctness` (investigate) vs `envelope-flake`
+(expected). macOS gets one harsh cell only (it's the scarce/slow runner); ubuntu absorbs the
+extra kinds (cheapest).
+
+### Tier D — On-demand deep sweep (`workflow_dispatch`, never gates)
+The current stress-branch design *is* this tier — keep its `stress_kind`/`stress_load` inputs
+and per-run-unique concurrency (many parallel dispatches), add `repeat` (run a cell K times)
+and `width` inputs. This is the "50-clean under both/8-hog" hunt: informational, time-boxed by
+choice, never a contract.
+
+**Why this is the linchpin:** keeping artificial load *off the required gate* is what makes the
+gate trustworthy; putting all load in non-blocking tiers with the envelope assertions relaxed is
+what stops load from manufacturing flakes that erode trust. The split needs a small product/test
+change: a `GCL_ENVELOPE_TIER=relax` env that downgrades the wall-clock assertions — nightly/deep
+set it, required never does.
+
+---
+
+## 6. Get more from existing tests: bounded parametrization
+
+Today there are only two coarse knobs: `GCL_TEST_FULL` (global fan-out) and per-case
+hard-coded `AGENT_LOCK_*` values (never swept). Add **one** mechanism — a per-axis sweep over a
+**named handful** of tests (sum the axes, do **not** cross-product):
+
+- **Axis A — waiter/stealer count (highest value):** T2b (frozen at 4), T20, interop T16. Sweep
+  N ∈ {4, 12, 24}. Widens the thundering-herd/claim-serialization and displacement windows that
+  re-running N=4 never will.
+- **Axis B — fail-open ratio (hold ÷ STALE):** a parametrized T4b/T1 variant running hold ≪
+  STALE / hold ≈ STALE / hold > STALE, asserting the *correct verdict per regime* (clean → 0
+  steals; over → exactly one steal + a 98).
+- **Axis C — poll cadence:** {fast 0.05, **default 2s**}. The shipped 2s default is currently
+  never exercised under contention.
+- **Axis D — CLAIM_STALE depth (lower value):** {2, 60} on T21.
+
+**Do not sweep:** round count (keep as the nightly *soak* dial, not a coverage axis), MAX_WAIT
+(timeout-only), the deterministic steered protocol tests (T23–T36 — re-running reruns the same
+steered path), or the integration suite's worker count beyond FULL/REDUCED (it's strict in both
+modes by design and wall-clock-bound by serialized commits).
+
+**Flakiness discipline (critical):** keep correctness assertions **config-independent** — when
+sweeping N, hold STALE ≫ hold so "zero-98 / one-steal" stays a pure correctness statement, and
+**scale MAX_WAIT with N** (more waiters = more serialized turns) so a large-N run doesn't time
+out and *look* like a product failure. Move wall-clock/poll-count assertions to the envelope
+tier. Keep the existing `sync_waiting_fresh`/`backdate_ghost` scaffolding — at higher N it
+matters more.
+
+**Cadence:** per-PR runs the floor point of each axis (today's behavior, deterministic);
+nightly runs the sweeps under a `GCL_TEST_SWEEP=1` gate. The sweep (per-suite fan-out/knobs) is
+*orthogonal* to the OS/leg matrix — compose additively (per-PR = matrix × floor; nightly =
+matrix × sweep), never multiply everything on every PR.
+
+---
+
+## 7. GitHub Actions realities (the real constraints — confirm against current docs)
+
+- **Minutes are free on public repos, but concurrency is the real ceiling.** Free/public
+  accounts cap concurrent jobs on the order of ~20 (with a much smaller macOS sub-limit). A
+  matrix past that **queues** (serialises into waves), it doesn't fail. Design any single
+  triggered workflow to ≤ ~15–20 jobs to run in one wave; the deep sweep intentionally exceeds
+  this and accepts waves.
+- **Runner scarcity ≠ billing:** even free, **macOS runners are scarce/slow (~10× cost-weight),
+  windows ~2×, ubuntu 1×.** Be stingy with macOS cells, liberal with ubuntu.
+- **`strategy.matrix`:** `fail-fast: false` (keep — an OS-specific failure is the signal);
+  `max-parallel` on nightly/deep so a big sweep doesn't starve the required gate of runners;
+  256-job hard cap per workflow (irrelevant at our scale).
+- **Triggers:** required on `pull_request` + `push: main`; nightly on `schedule` (cron,
+  off-peak minute) + `workflow_dispatch`; deep on `workflow_dispatch` only — heavy load never
+  sits in a PR's critical path. Keep `paths-ignore` (`**.md`, `.plans/**`) on required.
+  (Note: `schedule` triggers are auto-disabled after ~60 days of repo inactivity.)
+- **Artifacts:** keep the existing `upload-artifact` (with `include-hidden-files` for the
+  `.git/`-buried lock logs); name uniquely per (os, leg, kind, level) so parallel cells don't
+  collide.
+
+---
+
+## 8. Considered, not maximalist — the decision rule
+
+> **A cell enters the routine matrix (R or N) only if it can surface a bug class no other
+> routine cell can. Otherwise it's a deep-sweep cell, or it doesn't exist.**
+
+- Cap the routine matrix: **R ≤ 4, N ≤ ~8.** New routine cells must *displace* one, forcing the
+  "does this find something the others can't?" question.
+- **Earn the slot:** a config/cell graduates deep → nightly only after the deep sweep actually
+  caught a distinct failure there (mirrors the project's own "tested edge cases earn confidence"
+  philosophy). Demote a cell that's been green for ~60 days and whose window is a subset of
+  another green cell's.
+- Prefer *one* oversubscribed level over a level sweep; prefer *attributable* single-kind cells
+  over `both`-only when you want to localise a flake.
+- **Trustworthiness invariant:** required = always-meaningful-red; nightly = triaged-amber-
+  tolerant; deep = noise-by-design. Don't retry-mask the required tier (a retry that hides a
+  1-in-20 real race is exactly the silent-loss class this tool exists to prevent).
+
+---
+
+## 9. Open decisions for Ben (what to pick before Phase 2 plans the build)
+
+1. **Nightly aggressiveness:** ~6 cells, cron daily vs weekly? (rec: ~6 cells, daily off-peak;
+   start smaller and grow by the earn-the-slot rule.)
+2. **Linux load mechanism:** adopt calibrated cgroup `cpu.max`/`io.max` throttling on the Linux
+   leg (reproducible, the right envelope-validation tool) vs keep the simple wrapper but
+   calibrate it by oversubscription ratio? (rec: cgroup on Linux for the envelope leg; keep a
+   ratio-calibrated `stress-ng`/spinner as the cross-platform race-jitter lane.)
+3. **`stress-ng` dependency:** add an install step (apt/brew) vs keep a pure bash spinner
+   (zero-dep, uncalibrated)? (rec: `stress-ng` where available + spinner fallback on Windows.)
+4. **Parametrization scope now:** Axis A (waiter count) only, or A+B+C? (rec: A first — highest
+   value, lowest flake risk — then B, then C.)
+5. **The envelope-tier switch** (`GCL_ENVELOPE_TIER=relax`): confirm this is how we implement the
+   D-c correctness/envelope split (a small test-harness change downgrading the 3 wall-clock
+   assertions to warnings under load). (rec: yes — it's the cleanest implementation of D-c.)
+6. **Nightly triage channel:** auto-file/track issues on nightly failure, tagged correctness vs
+   envelope? (rec: yes — otherwise scheduled-run reds are invisible.)
+
+These choices feed **Phase 2** (the implementation plan). This doc is a recommendation only —
+no code, no workflow changes, until you've decided.
+
+---
+
+## Appendix — provenance
+Synthesized from three parallel first-principles research passes (load fidelity & injection
+mechanisms; CI matrix on free public runners; existing-test parametrization), each grounded in
+`git-commit-lock.sh`/`.ps1`, the three suites, `tests/with-load.sh`, `.github/workflows/tests.yml`,
+and `docs/failure-modes.md`, and cross-checked against the code (one agent's claim that
+`tests/with-load.sh` was absent was verified false — it exists and is tracked). Pending: a
+foreign-model (Codex) review pass over the GitHub-Actions limit claims and the load-mechanism
+portability claims before this is treated as settled.

From aeba95c435e36f8729e2a160b672c2b698c50799 Mon Sep 17 00:00:00 2001
From: Ben Toner <bentoner@bentoner.com>
Date: Wed, 17 Jun 2026 18:25:49 +1000
Subject: [PATCH 23/76] docs(load-testing): apply Codex factual review (cgroup
 probe-required; max-parallel/paths-ignore caveats; billing vs scarcity;
 FUSE/SIP hedges; reconcile non-Linux disk cells)

---
 docs/load-testing-strategy.md | 97 ++++++++++++++++++++++++-----------
 1 file changed, 66 insertions(+), 31 deletions(-)

diff --git a/docs/load-testing-strategy.md b/docs/load-testing-strategy.md
index c7ebd73..e26d68c 100644
--- a/docs/load-testing-strategy.md
+++ b/docs/load-testing-strategy.md
@@ -27,8 +27,9 @@ product/test code. Where it cites a fact about GitHub Actions limits, treat the
    genuinely dangerous windows are reachable *deterministically* only by the in-process
    function-interposition the suite already uses. Invest there first; external load is a
    secondary, probabilistic complement for the few windows it can actually move.
-3. **Three-tier CI:** a **Required** per-PR gate with **no artificial load** (so a red gate
-   always means a real correctness bug); a **Nightly** non-blocking tier that adds calibrated
+3. **Three-tier CI:** a **Required** per-PR gate with **no artificial load** (so a red gate is
+   never a stress-manufactured wall-clock flake — it's actionable); a **Nightly** non-blocking
+   tier that adds calibrated
    load × kind and the parametrization sweeps, with wall-clock assertions relaxed to warnings;
    and an on-demand **Deep sweep** (the current stress design) for the 50-clean hunt.
 4. **Fix the injection: calibrate, target, record.** Express load as an *oversubscription
@@ -122,12 +123,19 @@ create/write/delete loops): it is a *reasonable background-jitter generator* and
 
 **Recommendations:**
 - **Express load as an oversubscription ratio `R = stressors / nproc`** (e.g. R ∈ {0, 1, 2}),
-  not an absolute hog count, so a level is runner-independent.
+  not an absolute hog count, so a level is runner-independent. Note `R` is **per kind**: the
+  current wrapper's `GCL_STRESS_LOAD=N` spawns N hogs per selected kind, so `both` doubles total
+  hogs — define and cap `R_total`, and record cpu- and disk-stressor counts separately.
 - **Prefer calibrated mechanisms:** `stress-ng --cpu $((R*nproc)) --cpu-load … --metrics`
-  (defined, measurable) over bare spinners; on **Linux**, prefer **cgroup throttling**
-  (`systemd-run --user --scope -p CPUQuota=…` / `io.max`) which gives *deterministic,
-  reproducible* latency — the right tool for **envelope validation** (a 10% CPU quota means the
-  same everywhere; "8 hogs" does not).
+  (defined, measurable) over bare spinners. On **Linux**, calibrated **CPU** throttling is the
+  cleanest *envelope-validation* tool — `sudo systemd-run --scope -p CPUQuota=10%` gives a
+  runner-independent quota (a 10% quota means the same everywhere; "8 hogs" does not). **Treat
+  this as a probe-required Linux-only option, not a turnkey fact:** it needs cgroup v2 +
+  controller delegation + a usable systemd manager on the GitHub `ubuntu-24.04` runner, so gate
+  it behind a CI capability probe with the `stress-ng`/ratio path as the fallback. **IO** cgroup
+  throttling is *experimental* here — it is not a simple `systemd-run -p io.max`; systemd
+  exposes it as `IOReadBandwidthMax=`/`IOWriteBandwidthMax=` with device/path caveats — so don't
+  rely on it until proven on the runner.
 - **Record a per-run `load-manifest`** artifact next to the suite logs: `{kind, R, nproc,
   achieved-slowdown, tool versions, runner os/arch, git sha}`, uploaded on *success too* (you
   need the negatives to interpret the positives). Optionally probe achieved slowdown with a
@@ -139,17 +147,23 @@ create/write/delete loops): it is a *reasonable background-jitter generator* and
 
 ## 4. Embrace platform asymmetry (don't build a uniform injection layer)
 
-The platforms diverge too much for a "uniform" load layer (cgroups & FUSE are Linux-only;
-macOS SIP blocks `DYLD_INSERT_LIBRARIES` on system binaries; Windows has neither). Don't fight
-it — structure around it and **record which regime ran per leg**:
+The platforms diverge too much for a "uniform" *calibrated/targeted* load layer (cgroup
+throttling and FUSE fault-injection filesystems are Linux-only for this CI plan; `strace`
+inject is Linux-only; `DYLD_INSERT_LIBRARIES` injection is unreliable on macOS for
+SIP-protected Apple/system binaries like `mv`/`git` — possible only for non-protected helper
+binaries). Don't fight it — structure around it and **record which regime ran per leg**:
 
 - **Deterministic steering** — *everywhere* (portable bash; pwsh equivalent). The real
   race-coverage tool.
-- **Calibrated latency** (cgroup `cpu.max`/`io.max`; optionally `strace -e inject` to slow one
-  syscall in one process; a FUSE fsync-delay shim only if window W7 is prioritized) — **Linux
-  leg only**.
-- **CPU oversubscription** (`stress-ng` or the bash-spinner fallback) — the **macOS/Windows**
-  fallback, uncalibrated; document the asymmetry.
+- **Calibrated / targeted latency** (cgroup CPU quota; optionally `strace -e inject` to slow one
+  syscall in one process; a FUSE fsync-delay shim — charybdefs-style — only if window W7 is
+  prioritized) — **Linux leg only** (probe-gated, per §3).
+- **Uncalibrated oversubscription — the macOS/Windows fallback.** Both **CPU** (`stress-ng` or
+  the bash-spinner fallback) **and the simple disk-churn hog** (the current
+  `dd`/create-write-fsync-delete wrapper) run cross-platform; they are *low-fidelity and
+  uncalibrated* but real metadata-op pressure, which is why the Tier-N macOS/Windows `disk`
+  cells (§5) use them. Document the asymmetry: calibrated latency only on Linux; everywhere else
+  it's blunt oversubscription.
 
 Low-yield, **avoid:** memory/swap pressure (trivial allocation surface; risks OOM-killing the
 harness), raw disk-bandwidth saturation (doesn't touch metadata-op latency), de-prioritizing
@@ -173,7 +187,9 @@ test split (D-c).
 
 This is exactly today's matrix **minus the stress env**. Running it at **`none` load** means it
 only ever asserts Tier-1 correctness — it *cannot* flake on a Tier-2 wall-clock bound, so **a
-red required check always means a real bug.** Target < ~8 min. (Also: flip the concurrency group
+red required check is never stress-manufactured envelope noise.** It's always actionable — a
+real bug, or at worst runner-image/action-download/infra drift (which is also worth knowing) —
+never a "load was too high" false alarm. Target < ~8 min. (Also: flip the concurrency group
 back to `${{ github.workflow }}-${{ github.ref }}` + `cancel-in-progress: true` — the current
 per-run-unique group is a *deep-sweep* setting, which is exactly why the stress branch is marked
 "do NOT merge to main.")
@@ -239,20 +255,34 @@ matrix × sweep), never multiply everything on every PR.
 
 ## 7. GitHub Actions realities (the real constraints — confirm against current docs)
 
-- **Minutes are free on public repos, but concurrency is the real ceiling.** Free/public
-  accounts cap concurrent jobs on the order of ~20 (with a much smaller macOS sub-limit). A
-  matrix past that **queues** (serialises into waves), it doesn't fail. Design any single
-  triggered workflow to ≤ ~15–20 jobs to run in one wave; the deep sweep intentionally exceeds
-  this and accepts waves.
-- **Runner scarcity ≠ billing:** even free, **macOS runners are scarce/slow (~10× cost-weight),
-  windows ~2×, ubuntu 1×.** Be stingy with macOS cells, liberal with ubuntu.
-- **`strategy.matrix`:** `fail-fast: false` (keep — an OS-specific failure is the signal);
-  `max-parallel` on nightly/deep so a big sweep doesn't starve the required gate of runners;
-  256-job hard cap per workflow (irrelevant at our scale).
+- **Minutes are free on public repos; concurrency is the real ceiling.** Free-plan accounts cap
+  concurrent jobs at **20 total, with a 5-job macOS sub-limit** (confirm against GitHub's
+  current limits page). A matrix past that **queues** (serialises into waves); it doesn't fail.
+  Design any single triggered workflow to ≤ ~15–20 jobs to run in one wave; the deep sweep
+  intentionally exceeds this and accepts waves.
+- **Cost-weight is separate from queue scarcity (don't conflate).** On a public repo standard
+  runners are *free* — the per-minute rates don't consume credits or set queue priority. They do
+  signal relative runner *cost/scarcity*: roughly Linux 1×, **Windows ~1.7×** ($0.010 vs
+  $0.006/min), **macOS ~10×** ($0.062/min). The real constraint on macOS is the **5-job
+  sub-limit** above, plus it being the slowest pool. → keep macOS cells **sparse**, ubuntu
+  liberal.
+- **`strategy.matrix`:** `fail-fast: false` (keep — an OS-specific failure is the signal).
+  **`max-parallel` only limits parallelism *within a single matrix run*** — it does **not**
+  reserve capacity across separate workflow runs or the deep sweep's many `workflow_dispatch`
+  invocations. To stop a sweep starving the required gate, **bound the deep/nightly tiers with a
+  workflow-level `concurrency` group (and cap the dispatcher width)**, not `max-parallel` alone.
+  256-job hard cap per workflow run (irrelevant at our scale).
 - **Triggers:** required on `pull_request` + `push: main`; nightly on `schedule` (cron,
   off-peak minute) + `workflow_dispatch`; deep on `workflow_dispatch` only — heavy load never
-  sits in a PR's critical path. Keep `paths-ignore` (`**.md`, `.plans/**`) on required.
-  (Note: `schedule` triggers are auto-disabled after ~60 days of repo inactivity.)
+  sits in a PR's critical path. (Note: `schedule` triggers are auto-disabled after ~60 days of
+  repo inactivity.)
+- **`paths-ignore` gotcha on a *required* check.** A workflow skipped by path filtering leaves
+  its checks **Pending**, which *blocks merge* if those checks are required. So **don't** put
+  `paths-ignore` on the workflow whose jobs are the required checks and expect doc-only PRs to
+  merge. Instead either (a) keep the required workflow always-running with a tiny always-green
+  job and path-filter only the expensive test jobs, or (b) make a separate cheap job the
+  required check. (Doc-only-skip is still worth doing — just not on the required-check workflow
+  naively.)
 - **Artifacts:** keep the existing `upload-artifact` (with `include-hidden-files` for the
   `.git/`-buried lock logs); name uniquely per (os, leg, kind, level) so parallel cells don't
   collide.
@@ -306,6 +336,11 @@ Synthesized from three parallel first-principles research passes (load fidelity
 mechanisms; CI matrix on free public runners; existing-test parametrization), each grounded in
 `git-commit-lock.sh`/`.ps1`, the three suites, `tests/with-load.sh`, `.github/workflows/tests.yml`,
 and `docs/failure-modes.md`, and cross-checked against the code (one agent's claim that
-`tests/with-load.sh` was absent was verified false — it exists and is tracked). Pending: a
-foreign-model (Codex) review pass over the GitHub-Actions limit claims and the load-mechanism
-portability claims before this is treated as settled.
+`tests/with-load.sh` was absent was verified false — it exists and is tracked). A foreign-model
+(Codex, web-grounded) review has been applied: it confirmed the §2 window→load reachability
+table against the code and the core GitHub-Actions facts (20-total / 5-macOS free-plan
+concurrency, 256-job matrix cap, 60-day schedule auto-disable, `cancel-in-progress`, `stress-ng`
+availability), and its corrections are folded in — the cgroup mechanism is now marked
+**probe-required** (CPU quota only; IO throttling experimental), the `max-parallel` and
+`paths-ignore`-on-required caveats added, billing-weight separated from queue-scarcity, and the
+FUSE/SIP claims hedged.

From 8ba634177214304fb4dcba6c27befdd68a6b7af0 Mon Sep 17 00:00:00 2001
From: Ben Toner <bentoner@bentoner.com>
Date: Wed, 17 Jun 2026 18:54:30 +1000
Subject: [PATCH 24/76] =?UTF-8?q?Plan:=20=C2=A79=20accepted;=20add=20Bucke?=
 =?UTF-8?q?t=207=20(steering=20coverage=20/=20Phase=201c)=20+=20Bucket=208?=
 =?UTF-8?q?=20(harness=20ergonomics=20research)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 ...-ci-stress-guarantees-and-coverage-plan.md | 49 ++++++++++++++++++-
 1 file changed, 48 insertions(+), 1 deletion(-)

diff --git a/.plans/2026-06-17-ci-stress-guarantees-and-coverage-plan.md b/.plans/2026-06-17-ci-stress-guarantees-and-coverage-plan.md
index 0bf4445..757f601 100644
--- a/.plans/2026-06-17-ci-stress-guarantees-and-coverage-plan.md
+++ b/.plans/2026-06-17-ci-stress-guarantees-and-coverage-plan.md
@@ -102,7 +102,46 @@ first-principles rethink** — explicitly **not anchored on the existing approac
   more surface — without adding flakiness. Which tests benefit most.
 - **Considered, not maximalist:** principles for choosing the matrix + a routine cadence.
 Output: `docs/load-testing-strategy.md` (recommendation). Runs EARLY (Phase 1b) because it
-shapes Buckets 2 & 4 and the Phase-2 plan.
+shapes Buckets 2 & 4 and the Phase-2 plan. **§9 open decisions: all accepted by Ben (2026-06-17)
+with the doc's recommendations** — daily ~6-cell nightly (start smaller, grow by earn-the-slot);
+Linux cgroup CPU quota (probe-gated) for the envelope leg + ratio-calibrated stress-ng/spinner
+as the cross-platform race-jitter lane; stress-ng with a Windows spinner fallback;
+parametrization Axis A (waiter count) first; `GCL_ENVELOPE_TIER=relax` as the D-c
+correctness/envelope-split implementation; nightly issue auto-triage (correctness vs envelope).
+
+### Bucket 7 — Complete deterministic-steering coverage (Ben raised 2026-06-17)
+The load-strategy doc establishes deterministic STEERING (in-process function-interposition) —
+not external load — as the primary lever for the protocol's race-critical windows, and "more
+steered scenarios" as the #1 coverage investment. We have **not** scoped what *complete*
+steering coverage requires.
+- **Audit (Phase 1c):** enumerate every window/branch/residual across acquire / steal / hold /
+  release and map each to its deterministic-steering test or a GAP. Inputs: `failure-modes.md`,
+  the load-strategy §2 reachability table, the earlier F2 audit. Known gaps already: residual-
+  1/2/3 (claimant parked between recheck / touch and rename), and the F2-audit #7/#8 (wrong-type
+  appearing at the lock path mid-steal — A2/G2; Windows blocked-unlink legs). Add a **mechanical
+  branch-coverage pass (kcov for bash, on the Linux CI leg)** to find never-executed lines
+  objectively, as an input to the manual window audit.
+- **Output:** a coverage gap-list doc that scopes the steering-test work.
+- **Fill (Phase 3):** write the missing steered tests, bundled with Bucket 2.
+
+### Bucket 8 — Test-harness ergonomics (research done 2026-06-17; small, zero-dep)
+A subagent researched "big bash files vs alternatives." Verdict: **keep the plain-bash, zero-dep,
+custom-harness, steering-friendly design** — do NOT adopt bats-core (its forced `set -e` fights
+the suite's deliberate `set -uo` + exit-code assertions; its Windows/MINGW path quirks add risk
+on this project's most fragile axis) or shunit2 (lateral move, weaker Windows story). But the
+*monolith* (not the harness) costs us a single-test selector + machine-readable reporting.
+Recommended incremental, **zero-dependency** additions, priority order:
+  1. **TAP output** from `ok`/`bad` + a `1..N` plan line (~15 lines) — machine-readable CI
+     reporting AND closes the silent-undercount gap (an early `exit`/crash currently drops every
+     later assertion from the count, yet the total still prints "passed").
+  2. **A single-test selector** (`GCL_TEST_ONLY=<regex>`) — the biggest day-to-day pain (today
+     you run all 36 unit tests to iterate on one, on the slowest leg).
+  3. **Extract the duplicated helpers** into `tests/_harness.sh` (ok/bad/backdate/clone_fn/
+     wait — copy-pasted verbatim across all three files).
+  4. (Optional) split the two large files by concern; leave the integration suite whole (its
+     cross-test repo-state audit is an intentional dependency).
+Fold into the Phase-2 plan / Phase-3 build; items 1–2 are an afternoon and pay off every
+iteration (esp. given the local-test ban → faster CI triage from machine-readable output).
 
 ## Workflow (settled: spec → plan → implement → review)
 
@@ -121,6 +160,14 @@ review (Claude + Codex). **Recommendation only — NO implementation.** → Ben
 recommendations feed Phase 2. Runs early because it shapes Buckets 2 & 4. (1a and 1b are
 independent and can run in parallel.)
 
+**Phase 1c — Complete-steering-coverage audit (Bucket 7 / Ben raised 2026-06-17).**
+Systematically map every protocol window/branch/residual to its deterministic-steering test or
+a GAP, plus a mechanical kcov branch-coverage pass to find never-executed lines objectively;
+output a coverage gap-list doc. Analytical (read-only), parallel to 1a/1b; its gap-list is a
+major input to the Phase-2 test plan (steering is the #1 race-coverage lever per the
+load-strategy doc). **Audit now; gap-filling is Phase 3.** → Ben gate. (Also folds in the
+Bucket-8 harness-ergonomics items, which the new tests will want.)
+
 **Phase 2 — Plan.** Concrete implementation plan for Buckets 2-4, incorporating Ben's chosen
 load/matrix recommendations: per-test injection method (tmpfs / `ulimit` / chmod) + platform
 guard + CI wiring; the matrix/parametrization to adopt; exact doc edits; the

From e1f31a5e1fa54ed508105597c86f66a9332163ef Mon Sep 17 00:00:00 2001
From: Ben Toner <bentoner@bentoner.com>
Date: Wed, 17 Jun 2026 19:57:36 +1000
Subject: [PATCH 25/76] Phase 1a + 1c: add guarantees contract +
 steering-coverage gap-list
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Phase 1a (Bucket 1 / D-a) — docs/guarantees.md: the normative contract
derived from failure-modes.md. Operating envelope (E1-E7), guarantees
(safety G-S1..5, recovery G-R1..4, interop G-I1), failure semantics,
best-effort tier (BE-1..5), out-of-scope (OOS-1..7 incl. the non-unwinding-
exit no-silent-loss boundary), operating rules, and a verification map.

Phase 1c (Bucket 7) — docs/steering-coverage.md: the deterministic-steering
coverage audit + prioritized gap list, synthesized from two manual window
audits and an objective kcov pass (83.1% line coverage, 451/543; ~30 lines
platform-gated, ~62 Linux-reachable). kcov corrected three manual over-
credits (step-3.3 CLAIM-ABORT, the foreign claim-recheck branch, the EXIT-
trap no-hold twin). Gaps ranked: Tier A portable steering (A1 rename-refused
wrong-type-mid-steal = headline; A2 step-3.3 abort lane; A4 the exec/H4
boundary), Tier B fault-injection (failure-modes 4.5), Tier C platform-only,
Tier D document-not-test.

Both reviewed to convergence: a fresh-context Claude reviewer plus two Codex
rounds. The foreign Codex check plus a 4-line empirical test corrected the
exec-bypass characterization across all three docs: "run -- bash -c 'exec'"
does NOT skip release (the child shell is replaced, the wrapper releases
normally); only an exec in the lock-holding shell itself (a sourced
lock_acquire+exec, or "run -- exec") bypasses. Propagated to guarantees.md
OOS-5, steering-coverage.md A4, and the failure-modes.md H4 precision fix in
this commit.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 docs/failure-modes.md     |  13 +-
 docs/guarantees.md        | 408 ++++++++++++++++++++++++++++++++++++++
 docs/steering-coverage.md | 283 ++++++++++++++++++++++++++
 3 files changed, 700 insertions(+), 4 deletions(-)
 create mode 100644 docs/guarantees.md
 create mode 100644 docs/steering-coverage.md

diff --git a/docs/failure-modes.md b/docs/failure-modes.md
index 0332055..a187c15 100644
--- a/docs/failure-modes.md
+++ b/docs/failure-modes.md
@@ -570,10 +570,15 @@ false success — relies on the wrapper *reaching its release path*. The bypass
 is any termination or replacement of the holding process that skips that unwind;
 crucially it is **not** triggered by a normal `exit`. The instances:
 - **External SIGKILL** — untrappable; no handler runs in either port.
-- **bash `exec` in the wrapped command** — `run` executes `"$@"` *in the wrapper
-  shell itself* (`git-commit-lock.sh:1733`), so an `exec` replaces that shell's
-  process image and *neither* the trailing `lock_release` *nor* the `EXIT` trap
-  (`git-commit-lock.sh:1002-1013`, armed at `:1308`) runs.
+- **bash `exec` that replaces the lock-holding shell** — `run` executes `"$@"`
+  *in the wrapper shell itself* (`git-commit-lock.sh:1733`), so the bypass needs the
+  exec to run in *that* shell: the wrapped command *is* an exec (`run -- exec …`),
+  or a **sourced** caller does `lock_acquire; exec …` in its own shell. Then the
+  exec replaces that shell's process image and *neither* the trailing `lock_release`
+  *nor* the `EXIT` trap (`git-commit-lock.sh:1002-1013`, armed at `:1308`) runs. An
+  exec **nested in a child** — the ordinary `run -- bash -c 'exec …'` — does **not**
+  bypass (the child is replaced; the wrapper waits and releases normally). *Verified
+  empirically 2026-06-17.*
 - **PowerShell `[Environment]::Exit(n)`** — a CLR hard-exit that bypasses
   `Lock-Release`, the `finally`, *and* the `PowerShell.Exiting` backstop
   (`git-commit-lock.ps1:221-245`).
diff --git a/docs/guarantees.md b/docs/guarantees.md
new file mode 100644
index 0000000..dca88b1
--- /dev/null
+++ b/docs/guarantees.md
@@ -0,0 +1,408 @@
+# git-commit-lock: guarantees and scope (the normative contract)
+
+**Status: normative.** This document states *what the tool guarantees*, *under
+what conditions* (the operating envelope), and *what is explicitly out of
+scope*. It is the contract a user or a CI gate can point at: a behavior listed
+under [Guarantees](#2-guarantees) is a property the code must uphold and the
+tests defend; a behavior under [Out of scope](#5-out-of-scope-not-guaranteed) is
+one the tool deliberately does not promise.
+
+**How this relates to the other two docs.** This is the *contract*;
+[`failure-modes.md`](failure-modes.md) is the *analysis* behind it (per-mode
+current behavior, tier classification, and the scope decisions that produced
+this contract); [`git-commit-lock.md`](git-commit-lock.md) is the *design
+reference* (why the protocol is shaped this way and how it works). Where they
+appear to disagree, the **code and tests are authoritative**, then this contract,
+then the analysis, then the design narrative. Each guarantee below cites its
+witnessing test(s) and the failure-modes section that justifies it; the
+[Verification map](#7-verification-map) collects those pointers.
+
+This contract makes **no new claims** about behavior — it is a re-statement of
+the decisions recorded in `failure-modes.md` §4 as commitments. It does not
+re-derive the protocol (see the design doc) or re-argue the tiers (see the
+analysis).
+
+---
+
+## 1. The operating envelope
+
+Every guarantee in §2 holds **within this envelope**. Outside it, the tool
+degrades as described in §4 (best-effort) or §5 (out of scope) — in most cases
+*detectably and without corruption*, but the strict guarantees are not promised.
+The envelope is not a disclaimer bolted on; it is the precise set of assumptions
+the filesystem-lease design rests on.
+
+**E1 — Single host, single time source.** All contenders share one working tree,
+hence one machine, hence one clock. Staleness is `age = now − mtime` arithmetic
+(`git-commit-lock.sh:928,1409`); it assumes the mtime and the comparing process's
+`now` come from the *same* clock. Single-host use satisfies this. A *local* clock
+jump remains correctness-safe (it degrades to the detected-98 lane, never a
+silent double-commit; see G-S1 and `failure-modes.md` §E2). Multi-host use over a
+shared FS does not satisfy it and is out of scope (§5, OOS-2).
+
+**E2 — Local filesystem with atomic create/rename and sane mtimes.** The protocol
+is built from three filesystem operations — atomic create-or-fail (`O_EXCL` /
+`FileMode.CreateNew`), atomic rename-over, and unlink — each atomic on local
+POSIX filesystems and NTFS (ext4, APFS, NTFS, and kin). (The one exception is the
+Windows PowerShell 5.1 steal, which lacks the atomic 3-arg move and uses a
+claim-guarded unlink-then-move — a fairness loss, never a clobber; see BE-5.)
+Network and sync-backed storage (NFS, SMB/CIFS, 9p, Dropbox/OneDrive) weaken
+exactly these operations and are out of scope (§5, OOS-1;
+`git-commit-lock.md:122-126`).
+
+**E3 — Cooperative wrapper unwind.** The theft-detection guarantee (G-S1) fires
+when the lock-holding shell *reaches its release path* — on normal return, on a
+handled INT/TERM, or on a plain `exit` (all of which unwind). It is **not**
+triggered by a termination or replacement that skips the unwind: an external
+SIGKILL, an `exec` that replaces the lock-holding shell itself, or PowerShell
+`[Environment]::Exit()`. (An `exec` nested in a child — the ordinary
+`run -- bash -c 'exec …'` — does *not* skip release.) See §5, OOS-5 for the
+precise boundary.
+
+**E4 — Commits fast relative to the staleness window (for *strict* exclusion).**
+The lease is fail-open: a hold older than `AGENT_LOCK_STALE_SECS` (default 300s)
+can be stolen mid-work. *Strict* mutual exclusion (G-S3) is therefore guaranteed
+only for holds that complete within the staleness window. A hold that overruns it
+is still *safe* — a displaced holder is detected (G-S1) — but two processes can
+briefly both believe they hold the lock. Keep commits well inside the window, or
+raise `AGENT_LOCK_STALE_SECS` for a deliberately slow hold (the golden rule,
+`git-commit-lock.md:433-458`).
+
+**E5 — Matching protocol version on all parties.** Prevention of the
+crash-recovery-under-contention race (G-S3's no-displacement property) holds only
+when every contender runs the claim protocol. A mixed-version tree degrades
+prevention to detection and is out of scope (§5, OOS-3).
+
+**E6 — Supported platforms.** `git-commit-lock.sh` (bash) is supported on Linux,
+macOS, and Windows under Git-for-Windows' MINGW bash. `git-commit-lock.ps1`
+(PowerShell) is supported on **Windows only**. Running the `.ps1` port on POSIX is
+a CI-only cross-implementation protocol check, not a supported configuration (§5,
+OOS-4; `README.md:91-95`).
+
+**E7 — Cooperating, non-hostile agents.** The lock is advisory: it serializes
+*cooperating* agents. It detects interference where it can (token checks; exit 98)
+but cannot prevent a process running as the same user from deleting or
+overwriting the lock file. The threat model is honest agents racing each other,
+not an actively hostile local process (§5, OOS-6;
+`git-commit-lock.md:520-528`).
+
+---
+
+## 2. Guarantees
+
+Each guarantee holds **within the envelope (§1)**. The defaults named are knobs
+(`AGENT_LOCK_*`); the guarantee is in terms of the configured value, not a fixed
+number of seconds.
+
+### 2A. Safety (unconditional within the envelope)
+
+These are correctness properties. If one can break inside the envelope, that is a
+bug.
+
+- **G-S1 — No silent lost update.** A holder whose lease is taken from it never
+  reports a serialized critical section that wasn't. On release, a **definitive**
+  theft (the lock file is gone, or carries a foreign token) returns **98** with a
+  loud WARNING rather than success (`git-commit-lock.sh:1607-1688`;
+  `git-commit-lock.ps1:1717-1837`); a state the release cannot disambiguate (the
+  file is present but reads **empty** after the retry ladder — possibly a successor
+  mid-create after a boundary steal) returns the distinct **unverifiable** code
+  (`lock_release` 2; `run` maps it to 1 when the command itself succeeded, else
+  keeps the command's code) — still **never** a silent success. *Condition:* the
+  wrapper unwinds cooperatively (E3). *Witness:* unit Test 4b (98 + WARNING), Test
+  16 (unverifiable lane), interop Test 8 (98 both directions) (`U:387-417`,
+  `I:460-492`). *Basis:* `failure-modes.md` §1, §B5.
+
+- **G-S2 — No corruption and no false hold.** An acquirer that cannot prove its
+  own token is at the lock path (after the read-back retry ladder) treats the lock
+  as **not** acquired and logs loudly; it never "repairs" a failed read-back by
+  rewriting the path (`git-commit-lock.sh:1352-1361`). Every path that cannot
+  establish a fact fails toward "wait", never toward "steal" or "hold". This
+  extends to resource-exhaustion lanes: a create that fails (ENOSPC, FD/inode
+  exhaustion, an unwritable lock dir) **never produces a false hold or corruption**
+  — it falls through to wait/97 (an empty orphan ages into the recovery lane). The
+  guarantee is *no false hold*, not a uniformly clean 97: a torn write shorter than
+  `tok.` is a non-lock-shaped residual, never stolen, that needs manual removal
+  (`failure-modes.md` §F1 — an accepted residual). *Witness:* the read-back-failure lanes —
+  create-path Test 32, steal-path Test 32b (`U:1760-1855`); resource lanes —
+  coverage planned (Bucket 2 / `failure-modes.md` §4.5). *Basis:* §1, §A1, §F.
+
+- **G-S3 — Strict mutual exclusion within the staleness window, with no
+  displacement during crash recovery.** Within `AGENT_LOCK_STALE_SECS` no steal
+  occurs at all, so at most one process holds the lock. When a holder dies and a
+  herd of waiters recovers the one stale lock, the **claim protocol** admits
+  exactly one stealer and the recovering waiter keeps the lock it recovered — a
+  straggler whose stale judgement predates the recovery cannot displace it
+  (`git-commit-lock.sh:1070-1218`). At most one process is ever the *legitimate*
+  holder. (On the supported Windows PowerShell 5.1 unlink-then-move lane the
+  recovering waiter can *lose* the recovered path to a rival's create in the
+  transient absent window — a fairness loss, never a clobber; see BE-5.)
+  *Condition:* holds complete within the window (E4); a stable clock (E1) — a local
+  clock jump preserves *no silent loss* (G-S1) but can break *strict exclusion* by
+  making a live lock look stale (a premature, but detected, steal); and matching
+  version (E5). *Witness:* unit Tests 1/2b/20, interop Tests 1/6/16/16b, integration suite
+  (`U:166-195,212-346,1095-1128`; `I:227-261,341-386,884-1088`). *Basis:*
+  §A1/§A2/§A3.
+
+- **G-S4 — Never destroys a non-lock-shaped object.** A directory, symlink, FIFO,
+  device, socket, or a regular file whose line 1 is neither empty nor `tok.`-
+  prefixed is **never** stolen or deleted, at either the lock path or the claim
+  path (`git-commit-lock.sh:1322-1327,1411-1444,1458-1487,1518-1570`). The
+  never-steal *safety* is unconditional; the *warning* is best-effort — it normally
+  fires once and names the object, but an **actively-rewritten** user file may never
+  age into the content guard and then times out at 97 *without* the warning
+  (`git-commit-lock.sh:308`). Deletion is
+  never recursive; the tool only ever removes its own named lock-protocol files.
+  *Two accepted residuals* bound this and are documented, not bugs: a stale
+  *empty* user file, and a stale file whose line 1 happens to start `tok.`, are
+  stolen (`git-commit-lock.sh:298-311`). *Witness:* unit Tests 17/17d/18/22
+  (`U:818-892,894-1032,1034-1076,1156-1262`). *Basis:* §D3/§D4/§G1. *Scoped
+  exception:* ps1-on-POSIX has no .NET type probe for FIFO/device/socket (§5,
+  OOS-4).
+
+- **G-S5 — Truthful exit codes.** The three reserved high codes from `run` are
+  exact: **96** = usage error (command **not** run), **97** = acquisition timed
+  out (command **not** run), **98** = lock stolen mid-hold (command **ran but was
+  not serialized** — redo it) (`git-commit-lock.sh:392-415`). A `run` exit of the
+  command's own code (including 0) means the command was serialized — *subject to
+  the one carve-out in OOS-5* (a non-unwinding exit returning 0 while displaced).
+  *Two stated assumptions* keep the high-code contract exact: the wrapped command
+  must not itself exit 96/97/98 (such an exit is indistinguishable from a tool
+  verdict, `git-commit-lock.sh:392`), and an **unverifiable** release maps a
+  *successful* command to **1** (G-S1), so 0 is never reported over an unverifiable
+  hold. *Witness:* Test 7 (96), Test 8 (97), Test 4b (98), Test 5 (propagation),
+  Test 16 (unverifiable→1), interop `run` verdict tests. *Basis:* §1, §H4.
+
+### 2B. Recovery (within the FS/clock/tooling envelope)
+
+These hold given a readable clock (E1) and lock-shaped state; latency is
+best-effort (§4).
+
+- **G-R1 — Lock-shaped orphans are reclaimed.** A crashed holder's stale lock, an
+  orphaned or empty claim, and an empty crash-orphan (a crash between create and
+  content write) all eventually become stealable and are recovered, bounded by
+  `STALE` (+ `CLAIM_STALE` if a claimant also crashed) plus poll cadence
+  (`git-commit-lock.sh:1408-1446,1228-1267`). This does **not** extend to *foreign*
+  objects (G-S4) — those wait for an operator. *Witness:* unit Tests 2/3/21
+  (`U:197-210,348-361,1130-1154`). *Basis:* §B1/§C1/§C2/§C3.
+
+- **G-R2 — One stuck agent cannot wedge the fleet.** Because the lock is a lease
+  and the claim is itself leased, a hung-but-alive holder or claimant is recovered
+  within its window; the fleet does not deadlock behind it. *Witness:* the stale-
+  steal and crashed-claimant lanes above. *Basis:* §1, `git-commit-lock.md:60-82`
+  (the explicit reason for a lease over a kernel lock).
+
+- **G-R3 — No busy-spin; bounded wait.** A waiter on a genuinely squatted or
+  delete-blocked lock gives up at `MAX_WAIT` and never busy-spins past it; the
+  failed-steal lane logs in a damped, bounded way (`I:746-817`). *Witness:* interop
+  Test 14b. *Basis:* §K(4).
+
+- **G-R4 — No process leaves an *unowned* lock behind.** Per-attempt tokens make
+  the ownership-discovery read conclusive, so no process inside an
+  acquire/hold/release arc can install a lock nobody owns and walk away: it either
+  discovers it holds, or the lock is recovered by staleness, and in no case is a
+  steal-installed lock mistaken for owned by the wrong process
+  (`git-commit-lock.sh:138-157` + the leaked-token memory). The one bounded
+  residual — an untrappably-killed claimant's claim installed as an unowned lock —
+  stalls waiters ≤ one stale window with **no false success** (accepted; §B3).
+  *Witness:* unit Tests 31/35/36 (`U:1549-1758,2013-2164`). *Basis:* §C4.
+
+### 2C. Interoperation
+
+- **G-I1 — bash and PowerShell take the same lock.** One on-disk wire format
+  (`tok.`-prefixed line 1, owner line 2), one read-retry ladder
+  (8 attempts, 20/40/80/160/320/320/320 ms — byte-identical between ports), one
+  set of release verdicts, one config grammar. A `.sh` holder and a `.ps1` holder
+  in one tree serialize against each other and steal each other's genuinely stale
+  locks. *Condition:* Windows for the supported ps1 config (E6). *Witness:* the
+  interop suite throughout (`I:*`). *Basis:* §I1.
+
+---
+
+## 3. Failure semantics (the shape of every degradation)
+
+When the tool cannot uphold a property it fails in one of these bounded,
+documented ways — **never** silently:
+
+- **Detect, don't pretend** — a displaced holder returns 98 + WARNING (G-S1).
+- **Wait, don't guess** — an unprovable state routes to poll/wait → 97, never to
+  a steal or a hold (G-S2).
+- **Refuse, don't destroy** — a non-lock-shaped object is left in place (and
+  normally warned about — the warning is best-effort, see G-S4); waiters reach 97.
+- **Announce, don't hide** — a broken staleness clock (unreadable mtime) warns
+  loudly once and disables stealing (fails safe; §4, BE-3).
+
+**Within the operating envelope**, the only place a *correctness* degradation can
+be silent — a non-unwinding exit returning 0 while displaced — is carved out
+explicitly in OOS-5. Two silences fall *outside* that scope and are disclosed
+separately: a degradation **outside** the envelope (a network/sync FS silently
+losing exclusion, OOS-1), and a **non-correctness** loss (a swallowed log write,
+BE-4). Logging is best-effort by design; correctness is not.
+
+---
+
+## 4. Best-effort (within the envelope, not a hard guarantee)
+
+These hold under normal conditions and degrade *gracefully and detectably* under
+pathological scheduling or host-health failures. **Correctness (§2) is preserved
+throughout; only liveness/latency degrades.** This tier is the reference against
+which Bucket 4 scopes the suite's wall-clock test assertions (the strict/envelope
+test split, `failure-modes.md` §4.1 / D-c).
+
+- **BE-1 — Wall-clock latency bounds are in poll-count, not seconds.** Recovery
+  latency (≈ `STALE` + poll cadence), the `MAX_WAIT` timeout, and the ~1.26s
+  read-retry ladder all *stretch* under CPU oversubscription or a slow FS while
+  still completing. The guarantee is "bounded by the configured knobs in
+  poll-count," not "exactly N seconds." Tests asserting a specific wall-clock or
+  poll-count number (Test 21's ≤20s, Test 22a's warning timing, Test 29's ≥2-CLAIM
+  count) assert an *envelope* bound, not a correctness bound, and may be relaxed or
+  gated to a defined load level (`GCL_ENVELOPE_TIER=relax`) without any product
+  change. *Basis:* `failure-modes.md` §K, §4.1.
+
+- **BE-2 — Diagnostic warnings are best-effort.** The wrong-type config warning
+  and the claim-path warning rely on poll headroom that an oversubscribed runner
+  can starve; the guarantee is that the *condition is handled safely*, not that a
+  specific warning fires within a specific time. *Basis:* §K(2), §D3.
+
+- **BE-3 — Recovery presumes a readable clock; an unreadable mtime fails safe.**
+  If the lock's mtime cannot be read at all, both ports retry three times, then
+  warn loudly once per process and treat the lock as **not** stale (the mtime floor
+  fails closed to "fresh"): no premature steal, no corruption — but recovery of a
+  genuinely crashed holder is *disabled* and waiters block to `MAX_WAIT` (97).
+  Safety is preserved; recovery is lost and announced. *Coverage planned* (Bucket
+  2 / §4.5). *Basis:* §E3.
+
+- **BE-4 — Logging is best-effort and never blocks the lock.** Every log write
+  ends `|| true`; a failed or unwritable log write is swallowed and the lock works
+  unaffected (the log self-truncates past ~1 MB). *Coverage planned* (Bucket 2 /
+  §4.5, the F2/J1 test). *Basis:* §F2/§J1.
+
+- **BE-5 — The PowerShell 5.1 steal is claim-guarded, not atomic.** Windows
+  PowerShell 5.1 lacks the 3-arg `File.Move` overload, so its steal is
+  unlink-then-move with a transient absent window. Under the claim this is a
+  *fairness loss* (a rival's create can win the recovered path; the claimant backs
+  off cleanly), **never a clobber**. *Basis:* §D1, `git-commit-lock.md:471-476`.
+
+---
+
+## 5. Out of scope (not guaranteed)
+
+The tool deliberately does not promise the following. Where it can, it still fails
+*safely and detectably*; the point of listing them is that the strict guarantees
+of §2 are **not** claimed here.
+
+- **OOS-1 — Network / shared / sync-backed filesystems.** NFS, SMB/CIFS, 9p,
+  Dropbox/OneDrive. These weaken the atomic create/rename the protocol rests on, so
+  exclusion may silently not hold. Documented boundary only — surfaced in the
+  README; **no** FS-type probe is built (decision: `failure-modes.md` §4 item 3).
+  *Basis:* §E1.
+
+- **OOS-2 — Multi-host use / clock skew across hosts.** Rides on OOS-1 (only arises
+  on a shared FS). A *local* clock jump on the single host is **in scope and
+  correctness-safe** (degrades to the detected-98 lane). *Basis:* §E2.
+
+- **OOS-3 — Mixed-version trees.** If contenders run different protocol versions,
+  the no-displacement prevention (G-S3) degrades to detection (98), and old-style
+  stealers can leave `.dead.*` litter. Never silent, but the prevention property is
+  not guaranteed. Deployment rule: **upgrade both implementations together**
+  (`git-commit-lock.md:251-256`; to be surfaced in the README too — Bucket 3).
+  *Basis:* §I2.
+
+- **OOS-4 — PowerShell port on POSIX.** Supported on Windows only; on POSIX it runs
+  solely as a cross-implementation protocol check. Its one residual there
+  (FIFO/device/socket stat as empty and take the empty-orphan lane, capping damage
+  at the one misconfigured inode) is accepted and documented. *Basis:* §D3.
+
+- **OOS-5 — A non-unwinding exit returning 0 while displaced (the no-silent-loss
+  boundary).** G-S1's detection requires the *lock-holding shell* to reach release
+  (E3). If a *displaced* holder is terminated or replaced **without unwinding** —
+  external SIGKILL, an `exec` that replaces the **lock-holding shell itself**, or
+  PowerShell `[Environment]::Exit()` — *and* the resulting process exits **0**, the
+  caller can see success with no 98. The `exec` case is **narrower than it looks**
+  (verified empirically): `lock_run` runs the wrapped command vector in the wrapper
+  shell (`git-commit-lock.sh:1733`), so the bypass needs the exec to run in *that*
+  shell — a **sourced** caller doing `lock_acquire; exec …` in its own shell, or
+  the contrived `run -- exec …` where the wrapped command *is* an exec. An exec
+  **nested in a child** — the normal `run -- bash -c 'exec …'` — does **not**
+  bypass: the child is replaced, the wrapper waits and releases normally. A **plain
+  `exit` is safe** (it unwinds). What keeps the whole class narrow: an external
+  SIGKILL yields a non-zero wait status (POSIX `128+9`), so a caller checking exit
+  codes does not see success; the hole needs a process that *deliberately* replaces
+  or hard-exits the lock-holding shell **and** returns 0 **while displaced**. The
+  *next* holder still recovers via staleness; only the abruptly-exiting one is
+  unwarned. No code change closes this without the handle-based ops the design
+  rejected. *Witness (boundary exercised indirectly):* interop Test 5 (`I:308-334`,
+  ps1 `[Environment]::Exit()`); the bash `exec` lane is a coverage gap
+  (`steering-coverage.md` A4). *Basis:* §H4.
+
+- **OOS-6 — Adversarial / hostile local processes.** The lock is advisory. Against
+  a process actively trying to break it (deleting/overwriting the lock file, or a
+  hostile repo redirecting the git dir), the tool *detects* interference where it
+  can but does not prevent it; damage from a redirected git dir is bounded to the
+  tool's own named files with non-recursive deletion. *Basis:*
+  `git-commit-lock.md:520-551`.
+
+- **OOS-7 — Non-issues, explicitly.** A case-insensitive FS path collision (the
+  lock and claim paths never collide under case folding; two case-differing
+  configured paths resolving to one file is *correct* shared-lock behavior) and
+  memory exhaustion (the scripts allocate trivially). No action. *Basis:* §D5/§F5.
+
+### Things deliberately NOT built (and why)
+
+The design considered and rejected each of these; they are not roadmap items
+(`failure-modes.md` §4 "Things explicitly NOT to do"):
+
+- A **background heartbeat** to refresh the lease — would make the tool more than a
+  single synchronous script; the fail-open-but-detectable lease is the deliberate
+  alternative.
+- A **two-rename compare-and-swap** to prevent the B3 residual — reintroduces crash
+  litter and a sweep, for a failure that is already bounded and false-success-free.
+- **`File.Replace`** in the ps1 port — throws on a read-only destination and has
+  partial-failure states (pinned out by interop Test 16d).
+- **Supporting network/shared filesystems** — correctness rests on local-FS atomic
+  create/rename; this is a boundary to document, not to engineer around.
+
+---
+
+## 6. Staying inside the envelope (operating rules)
+
+- **Hold the lock only to commit.** Decide what to stage, build any patch, and
+  resolve failures *outside* the lock; a normal stage+commit holds it for seconds
+  (the golden rule, `git-commit-lock.md:433-458`). This keeps holds inside the
+  staleness window (E4) so G-S3 applies.
+- **For a deliberately slow hold, raise `AGENT_LOCK_STALE_SECS`** for that
+  invocation rather than risking a fail-open steal.
+- **Keep the lock on a local filesystem** (the default `<gitdir>/commit.lock`
+  almost always is) so E2 holds.
+- **Upgrade both implementations together** (E5) so G-S3's prevention holds.
+- **Never `git stash` in a shared checkout** — it rewrites the working tree and
+  clobbers other agents' edits (orthogonal to the lock, but part of operating in a
+  shared tree).
+
+---
+
+## 7. Verification map
+
+Each guarantee → its witnessing test(s) and the failure-modes section. `U` =
+`tests/git-commit-lock.test.sh`, `I` = `tests/git-commit-lock.interop.test.sh`,
+`integ` = `tests/git-commit-lock.integration.test.sh`. "Coverage planned" marks a
+guarantee that is currently reasoned-correct-but-untested and slated for a
+fault-injection test under Bucket 2 (`failure-modes.md` §4.5, Ben's override to
+add coverage); the *guarantee* is made now, the *test* lands in Phase 3.
+
+| Guarantee | Witness | failure-modes § |
+|---|---|---|
+| G-S1 no silent lost update | U Test 4b + Test 16 (unverifiable lane); I Test 8 (both dirs) | §1, §B5 |
+| G-S2 no corruption / no false hold | U Tests 32/32b (read-back failure); **resource lanes: coverage planned** (F1/F3/F4) | §1, §A1, §F |
+| G-S3 strict exclusion in window + no displacement | U Tests 1/2b/20; I Tests 1/6/16/16b; integ | §A1/§A2/§A3 |
+| G-S4 never destroys non-lock-shaped | U Tests 17/17d/18/22 | §D3/§D4/§G1 |
+| G-S5 truthful exit codes | U Tests 7/8/4b/5/16; I run-verdict tests | §1, §H4 |
+| G-R1 lock-shaped orphans reclaimed | U Tests 2/3/21 | §B1/§C1/§C2/§C3 |
+| G-R2 one stuck agent can't wedge | stale-steal + crashed-claimant lanes | §1 |
+| G-R3 no busy-spin; bounded wait | I Test 14b | §K(4) |
+| G-R4 no unowned lock left behind | U Tests 31/35/36 | §C4 |
+| G-I1 bash⇄pwsh same lock | I suite throughout | §I1 |
+| BE-3 unreadable mtime fails safe | **coverage planned** (E3) | §E3 |
+| BE-4 logging best-effort | **coverage planned** (F2/J1) | §F2/§J1 |
+
+The "coverage planned" rows are exactly the lanes Phase 1c (the steering-coverage
+audit) and Bucket 2 (the new fault-injection tests) exist to close.
diff --git a/docs/steering-coverage.md b/docs/steering-coverage.md
new file mode 100644
index 0000000..dd98461
--- /dev/null
+++ b/docs/steering-coverage.md
@@ -0,0 +1,283 @@
+# Deterministic-steering coverage: audit and gap list
+
+**Status: analysis / work-scoping.** This document maps the protocol's
+race-critical windows and branches to their deterministic-steering tests (or
+gaps), and scopes the test work that closes the gaps. It is the output of Phase
+1c of the [guarantees-and-coverage plan](../.plans/2026-06-17-ci-stress-guarantees-and-coverage-plan.md)
+(Bucket 7). Gap-*filling* is Phase 3 (bundled with the Bucket 2 fault-injection
+tests); this doc decides *what* to fill and *how*.
+
+**Why steering, not load.** As [`load-testing-strategy.md`](load-testing-strategy.md)
+establishes, the protocol's correctness rests on structural properties (O_EXCL
+create + atomic rename + per-attempt tokens), so the primary coverage lever is
+**in-process function interposition** — the test suite's `clone_fn` mechanism
+shadows internal `_lock_*` functions (and `mv`/`rm`/`touch`) to force an exact
+interleaving deterministically. External load only *probabilistically* widens
+the same windows. This audit therefore measures *steering* coverage, with an
+objective `kcov` line-coverage pass as a cross-check.
+
+---
+
+## 1. Method and headline numbers
+
+Three independent inputs, reconciled below:
+
+1. **Manual window audit — acquire + steal paths.** Every branch/residual mapped
+   to its steering test or a gap.
+2. **Manual window audit — hold + release + discovery + staleness/mtime paths.**
+3. **`kcov` objective line coverage** (the mechanical cross-check) — built from
+   source (kcov v43; no apt package / prebuilt binary exists) and run on the unit
+   suite at FULL fan-out under WSL Ubuntu-24.04. Artifacts (gitignored):
+   `.agent-testing/kcov/` (`cobertura.xml`, merged unit+integration, line-by-line
+   HTML). Repro commands in [§5](#5-kcov-reproduction).
+
+**kcov result: 83.1% line coverage — 451 / 543 instrumented lines; 92 never
+executed.** (kcov does not do real branch coverage on bash — its branch numbers
+are trivially 1.0 and must be ignored.) The integration suite added **zero** lines
+over the unit suite, so the unit suite is the comprehensive measurement.
+
+Of the 92 uncovered lines:
+
+- **~30 are platform-gated and *correctly* unreachable on Linux** — ~23 in the
+  Windows no-delete-share handle lanes (an open handle blocking `unlink`/`rename`,
+  which never happens on POSIX), plus 3 in the macOS/BSD `mv` fallback. The handle
+  lanes are covered on the **Windows** CI leg (interop Tests 13/31d/33c); the `mv`
+  fallback would need a **macOS/BSD** leg. None of them are Linux gaps. The
+  practical Linux line-coverage ceiling is therefore ~94% ((543−30)/543), not
+  100%.
+- **~62 are Linux-reachable** — the real targets, prioritized in [§3](#3-the-gap-list-prioritized).
+
+**The cross-check earned its place.** kcov objectively corrected **three
+over-credits** in the manual audit — branches the manual reasoning inferred were
+covered, but which `kcov` shows were never executed:
+
+| Branch | Manual audit said | kcov (objective) | Reconciled |
+|---|---|---|---|
+| step-3.3 pre-rename CLAIM-ABORT block (`:1151-1160`) | covered via the step-2 / `deletion-gone` matrix positions | **hits=0** | **GAP** — the step-2 twin is steered, the near-identical step-3.3 twin is not |
+| `foreign` claim-recheck branch (`:1103-1106`) | covered via Test 33b + the matrix | **hits=0** | **GAP** — only the `gone` recheck leg is steered |
+| EXIT-trap no-hold arc-end (`:1009,1017-1018`) | transitively covered | **hits=0** | **GAP** — only the *signal* (TERM) no-hold twin is steered, not the EXIT-while-waiting one |
+
+This is the value of a mechanical pass over correlated manual reasoning: trust the
+instance, verify the output against the tool. Where this doc and a manual claim
+disagree, **kcov's `hits=0` wins**.
+
+(Line numbers below are anchors against the current `ci-stress` tree and may drift
+a few lines; the manual audits re-located everything and found the
+failure-modes.md anchors had moved ~9 lines.)
+
+---
+
+## 2. What is already well covered (for confidence)
+
+The audit confirms the protocol's *delicate* paths are strongly steered, so the
+gaps are at the edges, not the core:
+
+- **The two read-back "twins"** are each independently steered with opposite
+  claim-token gates: the create-path "I twin" (`acquire verification FAILED`,
+  `:1354-1361`) by **Test 32**, and the steal-path "F2 twin" (`steal rename
+  completed but read-back`, `:1171-1179`) by **Test 32b**.
+- **The discovery rule** — the ownership-discovery read on every non-rename exit —
+  by **Test 25**'s 7-position matrix (`step2-fresh`, `recheck-gone`, `touch-gone`,
+  `lock-gone`, `contested`, `deletion-gone`, `source-gone`), each steering a rival
+  install to an exact protocol point.
+- **The two discovery routes** (direct `_lock_discover` vs the per-poll
+  leaked-token-memory check) each independently steered (Test 25 vs Test 31b),
+  with Test 31a deliberately accepting *either* route on the genuine scheduling
+  race between them.
+- **The claim re-verify / touch / lease-reset lane** (Tests 23/24/26/27), the
+  leaked-claim family (Tests 31/35/36), the never-steal guards for dir/symlink/FIFO
+  at both lock and claim paths (Tests 17/22), and the trap-time claim cleanup
+  (Test 33).
+
+---
+
+## 3. The gap list, prioritized
+
+Each gap: location, what it is, how to steer it, and a priority. "Portable
+interposition" = a `clone_fn`/shadow test that runs on every OS (the cheapest,
+most valuable kind). "Fault injection" = needs a real resource/IO failure. "Platform"
+= only reachable / only meaningful on a specific OS leg.
+
+### Tier A — Portable deterministic steering (do these first; no fault injection)
+
+These are new `clone_fn`/shadow tests in the unit suite, runnable on every leg.
+
+- **A1 — `CLAIM-ABORT (rename-refused)`: wrong-type object at the lock path
+  mid-steal** (`:1195-1202`). *Headline gap.* The only acquire/steal **verdict**
+  branch with no steering test, and it has its own log string. (This is the
+  F2-audit #7 lane; the strategy doc's §2 reachability table missed it.) *Steer:*
+  `clone_fn _lock_verify_stale` (or shadow `mv`) to `mkdir` a directory onto the
+  lock path immediately before the rename; assert `rename-refused` + claim deleted
+  + discovery + no false hold. **Highest value.**
+
+- **A2 — step-3.3 pre-rename CLAIM-ABORT block** (`:1151-1160`; kcov-corrected
+  over-credit). The `gone`/`wrongtype`/`fresh` reason map + claim-delete +
+  discovery + `return 1`, near-identical to the step-2 block but separately
+  reachable. *Steer:* a `_lock_verify_stale` shadow with a call-counter that flips
+  to not-stale on the **second** call (step-3.3), the first call (step-2) passing.
+  **High value** (a whole unexercised abort lane).
+
+- **A3 — `foreign` claim-recheck branch** (`:1103-1106`; kcov-corrected
+  over-credit). A clearer removed our claim and a rival re-claimed → leave it,
+  discovery read, back off. *Steer:* shadow the claim read at recheck to return a
+  foreign token. **Medium-high.**
+
+- **A4 — `exec`-bypass of release / the §H4 no-silent-loss boundary** (`lock_run`
+  runs the wrapped command vector in the wrapper shell, `:1733`). No test exercises
+  the bash bypass; the ps1 `[Environment]::Exit()` twin *is* (interop Test 5).
+  **Empirically verified (2026-06-17):** the bypass needs the exec to run in the
+  **lock-holding shell itself** — `run -- exec true` (the wrapped command *is* an
+  exec), or a sourced `lock_acquire; exec true` — **not** `run -- bash -c 'exec
+  true'`, which execs a *child* and lets the wrapper release normally (so that
+  recipe would silently pass without testing anything). *Steer, two parts:* (a)
+  benign — `run -- exec true` (or sourced `lock_acquire; exec …`) and assert no
+  `RELEASED` line / lock left held; (b) the silent-loss — backdate the lease + park
+  a contender so the holder is *displaced*, then exec a 0-exit and assert the caller
+  sees 0 with **no** 98 (pinning [`guarantees.md`](guarantees.md) OOS-5). **High
+  value** — the one interleaving that can silently lose an update. *Note:* this
+  corrected the original audit recipe, which used the non-bypassing `bash -c 'exec'`
+  form — a foreign-model (Codex) review + a 4-line empirical check caught it; the
+  manual audit and a same-model reviewer both had it wrong.
+
+- **A5 — forward clock-jump → premature steal of a live lock** (§E2; age = now −
+  mtime, `:928,1409`). Code-safe (degrades to the detected-98 lane) but untested.
+  *Steer:* `clone_fn _lock_now` to return now+offset on the poll while the real
+  holder's mtime stays current, forcing age ≥ STALE on a live lock; assert the
+  victim's release hits 98 (a clock-driven analogue of Test 4b). **Medium.**
+
+- **A6 — mtime-unreadable fail-safe** (§E3; `:639-645` warn, `:912-926` consume).
+  Only a *negative* assertion exists (the warning must NOT fire under normal
+  contention, Test 1). *Steer:* `clone_fn` the mtime helper (`_lock_path_mtime` /
+  the `stat` shadow) to return empty on a present file; assert the warn-once fires,
+  no steal occurs, and a waiter reaches 97. **Medium** (it is the clean reason
+  recovery is Tier-1-*within-envelope*, so worth pinning).
+
+- **A7 — malformed/unreadable content classification tails** (the `_lock_verify_stale`
+  tail `:940-949`; the in-acquire steal content guard `:1429-1443`; the
+  `_lock_claim_stale_check` content tail `:1240-1249`). The `tok.`-prefixed and
+  empty-orphan lanes are covered; the **non-empty-blank-line-1** (`#18`),
+  **unreadable-content steal-skip** (`#17`), and **vanished-mid-check** sibling
+  branches are not. *Steer:* fabricate a line-1-whitespace file and a
+  read-fault shadow; backdate; assert no-steal + the right warning. **Low-medium,
+  cheap** (several branches per small test).
+
+- **A8 — socket & device-node wrong-type arms** (`:1474-1475` claim path,
+  `:1561-1562` lock path; kcov-new). The dir/symlink/FIFO arms are tested; the
+  socket (`-S`) and device (`-b/-c`) arms are not. *Steer:* bind a unix socket /
+  reference a device node (`/dev/null`) at the path; assert refusal. **Low, cheap**
+  (sibling arms of a tested guard; both creatable on Linux).
+
+- **A9 — log rotation past 1 MB** (`:558-559`; kcov-new). *Steer:* pre-write a
+  >1 MB log, trigger a log call, assert truncate-restart. **Low, trivial** (no
+  fault injection).
+
+- **A10 — EXIT-trap no-hold arc-end** (`:1009,1017-1018`; kcov-corrected
+  over-credit). EXIT while *waiting* without a hold or in-flight claim. *Steer:* a
+  sourced `lock_acquire` that exits while still blocked; assert the no-hold
+  cleanup/restore path runs. **Low.**
+
+- **A11 — `mv -T` fallback forced on** (`:969,976-977`). Naturally hit only on
+  BSD/macOS, but **made Linux-steerable** by forcing `_LOCK_MVT=0` (or shadowing
+  the probe's `mv -T` to fail) in a sourced steering shell, then running a steal —
+  and a steal-into-a-directory to hit the `[ -d ]` guard (dovetails with A1).
+  **Low-medium** (closes a real engine lane on the common leg instead of waiting
+  for a BSD runner).
+
+### Tier B — Fault injection (real resource/IO failures; mostly POSIX-only)
+
+These are the [`failure-modes.md`](failure-modes.md) §4.5 lanes (Ben's override to
+add coverage) plus the read-fault siblings. They need a real failure, not
+interposition; guard by platform and **flag any that can't be injected portably
+rather than shipping a flake** (per the §4.5 decision).
+
+- **B1 — Unwritable lock dir/parent → clean 97** (F4). `chmod` the dir.
+  POSIX; the cheapest and highest-value fault-injection test. **High.**
+- **B2 — Unwritable/failing log path → lock still works, log swallowed** (F2/J1).
+  A bad log path, or one made unwritable with `chmod` as in B1. POSIX. **Medium-high.**
+- **B3 — ENOSPC during claim/lock create+write** (F1; the create write-fail branch
+  `#5` and the read-fault lanes `:848,871-873`). Small dedicated tmpfs/quota.
+  Linux-friendliest; flag if not portable. **Medium.**
+- **B4 — FD exhaustion via `ulimit -n`** (F3). Portable POSIX; inode exhaustion
+  only if cleanly injectable. **Medium.**
+
+### Tier C — Platform-only (verify off-Linux; not a Linux gap)
+
+- **C1 — Windows no-delete-share handle lanes** (~23 lines: `:881-890,993,
+  1639-1647,1700-1712`). Already covered by interop Tests 13/31d/33c on the Windows
+  CI leg. *Action:* confirm the Windows leg's coverage exercises them (it does by
+  construction); no Linux work. A kcov-equivalent on Windows is impractical, so
+  rely on the explicit interop tests.
+- **C2 — macOS/BSD `mv` fallback real path** (`:969,976-977`). A11 makes this
+  Linux-steerable by forcing the probe off; a *genuine* BSD `mv` exercise needs a
+  macOS leg. *Action:* prefer A11 (portable) and treat a macOS leg as optional
+  per the load-strategy matrix.
+
+### Tier D — Bounded residuals: document, don't test
+
+Low-value, bounded, detected, or self-healing; the manual audits rate these
+not worth a dedicated test. *Action:* ensure each is named in the code header /
+`guarantees.md` as an accepted residual; fold into a broader test opportunistically
+if cheap, but do not build bespoke tests.
+
+- **D1 — residual-1** (verify→rename: our rename clobbers a freshly-created rival
+  lock → victim detects 98). Detection is covered structurally; the specific
+  interleaving is bounded + detected.
+- **D2 — residual-3** (claimant suspended between touch and rename installs an
+  aged-mtime lock). Bounded shortfall, self-healing; the *positive* lease-reset is
+  covered (Test 26).
+- **D3 — leaked-resolve rare arc-end legs** (`:755-758,1260-1262`) and the
+  release boundary-re-read in isolation (`R2`). Reachable only with a non-empty
+  leaked set; transitively exercised.
+
+---
+
+## 4. Scoping summary for Phase 2
+
+- **Tier A (11 tests, portable interposition)** is the bulk of the value and the
+  bulk of the work — all runnable on every CI leg, no fault-injection fragility.
+  A1, A2, A4 are the high-value three (a real verdict branch, a whole unexercised
+  abort lane, and the single silent-loss boundary). Bundle these into the unit
+  suite alongside the Bucket-2 work.
+- **Tier B (4 tests, fault injection)** is the failure-modes §4.5 set; platform-gate
+  them and flag any non-portable lane in the Phase-2 plan rather than shipping a
+  flake.
+- **Tier C** is verification on the Windows leg (already covered) + an optional
+  macOS leg; **Tier D** is documentation, not tests.
+- **Expected effect:** closing Tier A + the Linux-injectable parts of Tier B should
+  take Linux line coverage from 83.1% toward the ~94% platform ceiling; the
+  remaining ~6% is the Windows/BSD platform-gated lanes covered on their own legs.
+- **Harness ergonomics (Bucket 8)** pay off here: a `GCL_TEST_ONLY=<regex>`
+  selector and TAP output make iterating on ~15 new steered tests far cheaper —
+  schedule them before/with the test build.
+
+---
+
+## 5. kcov reproduction
+
+For re-running the objective coverage measurement (per the reproducible-experiments
+principle). All from Git Bash; `MSYS_NO_PATHCONV=1` stops Git Bash mangling a
+leading `/tmp` arg into a Windows path before WSL sees it.
+
+```bash
+# Build kcov v43 (no apt package; upstream ships no prebuilt binary):
+wsl.exe -d Ubuntu-24.04 -e bash -c 'sudo apt-get install -y cmake libdw-dev libelf-dev \
+  binutils-dev libcurl4-openssl-dev zlib1g-dev libiberty-dev'
+wsl.exe -d Ubuntu-24.04 -e bash -c '
+  cd /tmp && curl -fsSL https://github.com/SimonKagstrom/kcov/archive/refs/tags/v43.tar.gz \
+    | tar xz && mkdir kcov-build && cd kcov-build && cmake ../kcov-43 && make -j"$(nproc)"'
+
+# Run the unit suite under kcov (FULL fan-out) and list never-executed lines:
+MSYS_NO_PATHCONV=1 wsl.exe -d Ubuntu-24.04 -e bash -c '
+  cd /mnt/c/agent_data/commit-lock/worktrees/ci-stress &&
+  GCL_TEST_FULL=1 /tmp/kcov-build/src/kcov --include-path=git-commit-lock.sh \
+    /tmp/gcl-cov tests/git-commit-lock.test.sh'
+MSYS_NO_PATHCONV=1 wsl.exe -d Ubuntu-24.04 -e bash -c '
+  F=/tmp/gcl-cov/git-commit-lock.test.sh.*/cobertura.xml;
+  grep -oE "<line number=\"[0-9]+\" hits=\"[0-9]+\"/>" $F |
+    sed -E "s/.*number=\"([0-9]+)\" hits=\"([0-9]+)\".*/\1 \2/" |
+    awk "\$2==0 {print \$1}" | sort -n'
+```
+
+When the kcov pass becomes a permanent CI leg (Phase 3 / Bucket 7), it runs on the
+Linux runner against the unit suite at FULL, and the platform-gated ~30 lines (§1)
+are expected-uncovered there by design.

From b504f873308645a10b7b31f7cf4394f68730407f Mon Sep 17 00:00:00 2001
From: Ben Toner <bentoner@bentoner.com>
Date: Wed, 17 Jun 2026 20:21:46 +1000
Subject: [PATCH 26/76] Phase 2 plan: implementation plan for Buckets 2/3/4/6/8
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Concrete build plan derived from the Phase 1 outputs (guarantees.md,
steering-coverage.md) + the accepted failure-modes §4 and load-strategy §9
decisions. Sections:
- 2A Tier-A steering tests (A1-A11; per-test mechanism / assertion / platform /
  priority — the audit gave each steering technique).
- 2B Tier-B fault-injection (F4 + F2/J1 first cut; F1 gated-or-doc, F3 doc-only)
  — each injection empirically prototyped; refines the original D-b (F3 was not
  deterministically injectable; ulimit -f is a SIGXFSZ trap).
- 3 doc edits (exact text: envelope + single-clock in the design doc; network-FS
  boundary + upgrade-both in the README).
- 4 GCL_ENVELOPE_TIER=relax mechanism + the 3 downgrade sites (D-c).
- 6 three-tier CI (Required / Nightly / Deep), event-conditional concurrency
  (keeps the deep-sweep group off the required gate), kcov coverage job, nightly
  auto-triage, paths-ignore-on-required fix, and a refined do-not-merge
  disposition (with-load.sh graduates calibrated).
- 8 TAP + 1..N + the silent-undercount sentinel fix, GCL_TEST_ONLY selector
  (integration excluded by design), tests/_harness.sh extraction.

Bucket 2B/6/8 designs feasibility-validated by parallel agents (prototypes in
.agent-testing/, gitignored). Awaiting Ben's Phase 2 gate.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 .../2026-06-17-ci-stress-phase2-build-plan.md | 396 ++++++++++++++++++
 1 file changed, 396 insertions(+)
 create mode 100644 .plans/2026-06-17-ci-stress-phase2-build-plan.md

diff --git a/.plans/2026-06-17-ci-stress-phase2-build-plan.md b/.plans/2026-06-17-ci-stress-phase2-build-plan.md
new file mode 100644
index 0000000..a7c2edf
--- /dev/null
+++ b/.plans/2026-06-17-ci-stress-phase2-build-plan.md
@@ -0,0 +1,396 @@
+# Phase 2 plan: implement the guarantees-and-coverage build (Buckets 2/3/4/6/8)
+
+Status: **PROPOSAL — Phase 2 of the [guarantees-and-coverage
+plan](2026-06-17-ci-stress-guarantees-and-coverage-plan.md).** Awaiting Ben's
+gate. No implementation (Phase 3) until approved.
+
+## What this plans
+The concrete build that follows from the (committed, queued) Phase 1 outputs:
+- `docs/guarantees.md` — the normative contract (Phase 1a).
+- `docs/steering-coverage.md` — the prioritized steering-coverage gap list (Phase 1c).
+- `docs/failure-modes.md` §4 — the accepted scope decisions (incl. Ben's §4.5
+  override to add fault-injection coverage).
+- `docs/load-testing-strategy.md` §9 — accepted load/matrix recommendations.
+
+It turns those into: new tests (Bucket 2 — the Tier-A steering + Tier-B
+fault-injection gaps), documentation edits (Bucket 3), the correctness/envelope
+test split (Bucket 4 / D-c, via `GCL_ENVELOPE_TIER=relax`), the CI matrix wiring
+(Bucket 6), and harness ergonomics (Bucket 8). **Verification is CI-first** (the
+new tests run across the matrix); local runs are allowed but the box lags under
+heavy fan-out.
+
+Each section gives per-item designs concrete enough for Phase 3 to implement
+directly. Three sections (Bucket 2 Tier-B, Bucket 6, Bucket 8) were
+feasibility-validated by parallel design agents and are integrated below.
+
+---
+
+## Bucket 2A — Tier-A steering tests (portable, deterministic; the bulk of the value)
+
+From `steering-coverage.md` §3 Tier A. All are new `clone_fn`/shadow tests in
+`tests/git-commit-lock.test.sh` (unit suite), runnable on every CI leg — no
+fault-injection fragility. The audit already established each steering technique;
+line anchors are current-tree and may drift (re-locate at build).
+
+| ID | Gap (location) | Steering mechanism | Asserts | Platform | Priority |
+|---|---|---|---|---|---|
+| **A1** | `CLAIM-ABORT (rename-refused)` — wrong-type object at the lock path mid-steal (`:1195-1202`) | `clone_fn _lock_verify_stale` (or shadow `mv`) to `mkdir` a directory onto `$AGENT_LOCK_PATH` immediately before the rename | `CLAIM-ABORT (rename-refused)` + "non-file at the lock path" log; claim deleted; discovery read; **no false hold**; ghost handled | all | **HIGH** — the only acquire/steal *verdict* branch with no test; its own log string |
+| **A2** | step-3.3 pre-rename CLAIM-ABORT block (`:1151-1160`; kcov hits=0) | `_lock_verify_stale` shadow with a **call-counter**: pass on call 1 (step-2), flip to `not stale` (gone/wrongtype/fresh) on call 2 (step-3.3) | the step-3.3 abort reason-map fires; claim-delete + discovery + `return 1`; no false hold | all | **HIGH** — a whole unexercised abort lane |
+| **A3** | `foreign` claim-recheck branch (`:1103-1106`; kcov hits=0) | shadow the claim read at recheck to return a *foreign* token (a clearer removed our claim, a rival re-claimed) | leave the foreign claim; discovery read; back off; no 98-on-mere-claim | all | MED-HIGH |
+| **A4** | `exec`-bypass / §H4 no-silent-loss boundary (`lock_run` runs `"$@"` in the wrapper shell, `:1733`) | **(corrected, verified empirically)** the exec must run in the lock-holding shell: `run -- exec true` or sourced `lock_acquire; exec true` — **NOT** `run -- bash -c 'exec true'` (that execs a child, releases normally) | (a) benign: no `RELEASED` line / lock left held; (b) displaced (backdated lease + parked contender) + exec 0 → caller sees 0 with **no** 98 — pins `guarantees.md` OOS-5 | all (bash) | **HIGH** — the one silent-loss boundary |
+| **A5** | forward clock-jump → premature steal of a live lock (§E2; `:928,1409`) | `clone_fn _lock_now` to return now+offset on the poll while the live holder's mtime stays current | the live lock is judged stale and stolen; the victim's release hits **98** (clock-driven analogue of Test 4b) | all | MED |
+| **A6** | mtime-unreadable fail-safe (§E3; `:639-645`, consumed `:912-926`) | `clone_fn` the mtime helper (`_lock_path_mtime` / its `stat` shadow) to return empty on a *present* file | warn-once "Staleness detection is BROKEN"; **no steal**; waiter → 97; (closes BE-3's "coverage planned") | all (bash; + ps1 parity if feasible) | MED |
+| **A7** | malformed/unreadable content classification tails (`_lock_verify_stale` `:940-949`; in-acquire steal guard `:1429-1443`; claim-stale-check `:1240-1249`) | fabricate a line-1-whitespace file (non-empty blank line 1 = `#18`); shadow a read-fault (`#17`) | no steal; the right `not a lock/claim file` / `unreadable` warning; covers several sibling branches per test | all | LOW-MED (cheap, multi-branch) |
+| **A8** | socket & device-node wrong-type arms (`:1474-1475` claim, `:1561-1562` lock; kcov-new) | bind a unix socket / reference a device node (`/dev/null`) at the path | refusal (never stolen/deleted); the `-S`/`-b`/`-c` arms execute | POSIX | LOW (cheap; sibling of tested guard) |
+| **A9** | log rotation past 1 MB (`:558-559`; kcov-new) | pre-write a >1 MB `$AGENT_LOCK_LOG`, trigger a log call | truncate-restart (log shrinks; lock unaffected) | all | LOW (trivial, no injection) |
+| **A10** | EXIT-trap no-hold arc-end (`:1009,1017-1018`; kcov hits=0) | a sourced `lock_acquire` that `exit`s while still *waiting* (no hold, no in-flight claim) | the no-hold cleanup/restore path runs (vs the TERM twin already tested) | all | LOW |
+| **A11** | `mv -T` fallback forced on (`:969,976-977`) | pre-set `_LOCK_MVT=0` (or shadow the probe's `mv -T` to fail) in a sourced steering shell, then run a steal + a steal-into-a-directory | the BSD/macOS unlink+bare-`mv` lane + the `[ -d ]` last-instant guard execute on Linux/MINGW | all (forces the lane) | LOW-MED (closes an engine lane on the common leg) |
+
+**Sequencing:** A1/A2/A4 first (high value, real verdict/abort/silent-loss lanes);
+A3/A5/A6 next; A7-A11 as a cheap batch. Each is a self-contained unit test using
+the existing fabricate + backdate + `clone_fn` idioms.
+
+---
+
+## Bucket 2B — Tier-B fault-injection tests (empirically feasibility-validated)
+
+Each injection was prototyped against the real `git-commit-lock.sh` (Git Bash + WSL).
+The §4.5 discipline applies: **ship only lanes that inject portably/deterministically;
+flag the rest rather than ship a flake.** This **refines the original D-b** (which had
+F3 in the first cut) based on the feasibility results.
+
+| Lane | Injection | Asserts | Guard | Status |
+|---|---|---|---|---|
+| **F4 — unwritable lock dir → 97** | `chmod 0555` the lock dir; create fails O_EXCL every poll. Cap `MAX_WAIT=1-2`, `POLL=0.1`. | `rc==97`; command never ran (no marker); no lock created; log `WAITING` then `TIMEOUT after Ns` | **POSIX-only** (guard is **load-bearing**: `chmod 0555` is a *no-op for writes* on Git Bash/NTFS → would falsely pass rc=0; skip-with-note like Test 17's symlink branch) | **First cut.** Deterministic (5/5 rc=97 on WSL). The §F4 highest-value lane (most likely real misconfig). |
+| **F2/J1 — failing log → lock works, write swallowed** | Point `AGENT_LOCK_LOG` at `<regular-file>/x.log` so every append fails **ENOTDIR** (portable; no chmod/perms). | `rc==0`; command ran (marker); lock cleaned up (gone); log **not written** (`[ ! -s "$LOG" ]` / uncreated). Covers F2 **and** J1 in one test. | **Portable — no guard.** | **First cut.** Deterministic, both platforms. **Caveat:** bash's redirection-open failure leaks to stderr (the `||true` is on the write, not the open) — do **not** assert clean stderr, and do **not** `grep RELEASED "$LOG"` (nothing is written). |
+| **F1 — ENOSPC on create/write** | Real full FS only: `sudo mount -t tmpfs -o size=400k` + `dd` fill, point the lock there. | `rc==97`; command never ran; an **empty-orphan lock left behind** (create 0-byte, write failed — matches §F1) | **Linux-only AND needs root/sudo** | **Second cut — gated, or document-only.** Behavior validated end-to-end on WSL. **`ulimit -f 0` is a trap** — it raises SIGXFSZ (rc=153) killing the *wrapper*, not the create. **No portable injection.** |
+| **F3 — FD / inode exhaustion** | (intended `ulimit -n` / small-inode FS) | (intended `rc==97`, create-fail→wait) | Linux-only; inode→root | **Document-only.** **Cannot inject deterministically:** the create uses **~1 FD**, so any `ulimit -n` low enough to fail *it* first starves bash's own startup (machine-/load-dependent harness corruption, not the lib's 97 lane). Inode exhaustion needs root. §F3 is already reasoned-correct (same shape as F1). |
+
+**D-b tier split (refined by feasibility):**
+- **First cut (implement now):** F4 (POSIX-guarded) + F2/J1 (portable). Both deterministic,
+  single-shot (no fan-out), ~3-4 s total. These close the resource-lane coverage on every
+  leg with zero flake risk.
+- **Second cut:** F1 — **recommend** a Linux-only test gated behind both `uname`==Linux
+  **and** a `sudo -n true` capability probe that **skips-with-note** when sudo is
+  unavailable (never fails the suite), with `sudo umount` in cleanup (GitHub `ubuntu-*`
+  runners have passwordless sudo). *Alternative:* document-only, since the behavior is
+  validated. *(Decision point for Ben — see Open decisions.)*
+- **Document-only:** F3 (and F1 if Ben prefers zero root in the suite). Note the validated
+  behavior in `failure-modes.md` §F1/§F3 (the empty-orphan→97 path) rather than shipping a
+  flaky/non-portable test.
+
+**Implementation notes (match existing idioms):** use the `LOCK`/`LOG`/`AGENT_LOCK_*` env
+vocabulary and the `rc=$?; [ "$rc" = 97 ] && ok … || bad …` + `grep -q "TIMEOUT after"`
+pattern; mirror Test 17's `2> "$WORK/tNN.err"` capture and skip-with-note. **F4 cleanup is
+load-bearing:** a `chmod 0555` dir blocks `rm -rf` of its *contents* — keep that lock dir
+**empty** (nothing is created in it) so the suite's `cleanup()` `rm -rf "$WORK"` succeeds.
+**F2 assertion polarity** is inverted: assert the log was **not** written; the lock-success
+signal is `rc==0` + the command's marker + lock-file-gone, not a log line.
+
+---
+
+## Bucket 3 — Documentation edits (exact text)
+
+Small, concrete edits surfacing the boundaries the analysis decided to document.
+
+### C-envelope (§4.1) → `docs/git-commit-lock.md`
+Add, near the staleness/clock discussion (after the "One caveat on the mtime
+clock" block, ~`:283-293`), a short **operating-envelope** statement:
+> **Correctness is load-independent; latency is not.** Exclusion, no-silent-loss,
+> and eventual recovery rest on atomic create/rename + per-attempt tokens and hold
+> under any load. The wall-clock bounds — recovery latency (≈ STALE + poll
+> cadence), the `MAX_WAIT` timeout, and the ~1.3 s read-retry ladder — are
+> best-effort and scale with scheduling: under CPU oversubscription or a slow FS
+> they stretch, but the protocol still recovers and never loses an update.
+
+### C-clock (§4.2) → `docs/git-commit-lock.md`
+One sentence in the same caveat block:
+> The tool assumes a **single time source** — single-host use (the common case,
+> all contenders share one checkout hence one clock), or a shared FS with one
+> server clock. A local clock jump is correctness-safe: a forward jump can make a
+> live lock look stale and be prematurely stolen, but that degrades to the
+> detected exit-98 lane, never a silent double-commit.
+
+### C-netfs (§4.3) → `README.md`
+The boundary is in the design doc (`git-commit-lock.md:122-126`) but not the
+README, where operators look. Add to "How it works" (after the atomic-create
+sentence, ~`README.md:57`):
+> The protocol's correctness rests on these operations being atomic, which holds
+> on local filesystems (ext4, APFS, NTFS, and kin) but **not** on network or
+> sync-backed storage — NFS, SMB shares, Dropbox/OneDrive-synced directories —
+> where exclusion may silently fail. Keep the repo (and so its `.git/`) on a local
+> disk. (The default lock lives in `.git`, which almost always is.)
+
+### C-mixedver (§I2) → `README.md`
+The "upgrade both together" rule is design-doc-only (`git-commit-lock.md:251-256`).
+Add to the two-implementations section (~`README.md:82-95`):
+> **Upgrade both implementations together.** Older releases stole with an
+> unserialized move-aside instead of the claim protocol, so the
+> no-displacement-during-recovery guarantee holds only when every party in a tree
+> runs a current version; a mixed-version tree degrades that prevention to
+> detection (exit 98) and can leave `.dead.*` files current versions don't clean.
+
+### C-misc (§4.6, optional) → `docs/git-commit-lock.md`
+One line each (low priority): case-insensitive FS is a non-issue (the lock/claim
+paths never collide under case folding); the mixed-version `.dead.*` litter note
+cross-referenced.
+
+---
+
+## Bucket 4 — Correctness/envelope test split (D-c; `GCL_ENVELOPE_TIER=relax`)
+
+D-c is implemented as a **tagged assertion downgrade**, not a physical file split
+(a file split would duplicate Test 21/29's heavy `clone_fn` setup and break the
+single-suite kcov measurement). Add an `ok`/`bad`-adjacent helper pair (in
+`tests/_harness.sh` once Bucket 8 item 3 lands; inline in the unit suite until
+then — same signature, so the later move is mechanical):
+
+```bash
+ENVELOPE_TIER="${GCL_ENVELOPE_TIER:-strict}"   # default strict; nightly/deep set relax
+ENV_WARN=0
+ok_envelope()  { echo "PASS[env]: $*"; PASS=$((PASS+1)); }
+bad_envelope() {   # the FAIL branch of a wall-clock/poll-count (Tier-2) assertion only
+  if [ "$ENVELOPE_TIER" = relax ]; then echo "WARN[env-relaxed]: $*"; ENV_WARN=$((ENV_WARN+1))
+  else echo "FAIL: $*"; FAIL=$((FAIL+1)); fi
+}
+```
+
+- **`ok`/`bad` = the strict-correctness tier** (always hard, both tiers);
+  **`ok_envelope`/`bad_envelope` = the latency/envelope tier** (hard in `strict`,
+  warn-only in `relax`). Exit code is driven by real `FAIL` only — `ENV_WARN` never
+  reds a run; the summary prints the `ENV_WARN` count so it's visible.
+- **The three (and only three) downgraded call sites** — swap `ok`/`bad` →
+  `*_envelope` on the *wall-clock* assertion only; every neighbouring correctness
+  assertion (rc=97, no-steal, dir-untouched, STOLE-BY-CLAIM, …) **keeps `ok`/`bad`**:
+  - **Test 21** `:1144` — recovery latency `≤20s`.
+  - **Test 22a** `:1167` (warning fired — relies on two-poll-confirm headroom),
+    `:1170` (fired exactly once), and `:1168` (warning names the type — contingent
+    on the same starved warning). The never-steal / never-delete assertions stay strict.
+  - **Test 29** `:1531` — `≥2` CLAIM lines (poll-count).
+- **Required CI sets `strict` (or leaves it unset)** — at zero artificial load the
+  three pass comfortably, so the gate behavior is unchanged; **nightly/deep set
+  `relax`** so an oversubscribed runner can't turn an envelope miss into a red.
+- Anchors are current-tree; re-locate the three sites at build (each is the single
+  `-le 20` / warning-count / `-ge 2` line).
+
+---
+
+## Bucket 6 — CI matrix wiring (the accepted load-strategy §9 decisions)
+
+**Two-workflow structure:** keep `tests.yml` for **Tier R (required)** + **Tier D
+(deep dispatch)**; add a new `nightly.yml` for **Tier N (nightly)** + the kcov job +
+triage. Rationale: the nightly tier is non-blocking and must never be a required
+check, so a separate workflow keeps its `concurrency`, `issues: write` permission,
+and schedule independent of the gate.
+
+**Tier R — Required / per-PR (blocking), `tests.yml`.** The current 4 cells
+unchanged (ubuntu all / macos all / windows unit / windows interop+integration),
+**no load**, `GCL_ENVELOPE_TIER=strict` (default — the 3 wall-clock assertions pass
+comfortably at zero load), `GCL_TEST_FULL=1`. Diff from today: **revert** the
+per-run-unique concurrency group (`980856b`) → `group: ${{ github.workflow }}-${{ github.ref }}`
+with `cancel-in-progress: true`; **drop** the `GCL_STRESS_*` env + `with-load.sh`
+wrap + raised timeouts from the required job (`b430d73`'s workflow half); restore the
+original step/job timeouts. Target < ~8 min. A red here is therefore never a
+stress-manufactured flake.
+
+**Tier N — Nightly (non-blocking, triaged), new `nightly.yml`.** `schedule` (daily,
+off-peak) + `workflow_dispatch`; one oversubscribed level **R≈2**;
+`GCL_ENVELOPE_TIER=relax` + `GCL_TEST_SWEEP=1`; `concurrency: nightly` + cancel
+(one run at a time). **6 explicit cells** (`matrix.include`): N1 ubuntu/cpu, N2
+ubuntu/disk, N3 ubuntu/both, N4 macos/disk (the single harsh macOS cell — scarce/slow/
+5-job sub-limit), N5 windows interop+integration/disk (highest-value: delete-pending
+ghosts + 5.1 unlink-then-move under churn), N6 windows unit/both. 6 cells + kcov +
+triage ≈ 8 jobs → one wave under the ~20/5 ceiling. Nightly steps keep the raised
+timeouts (correct here).
+
+**Tier D — Deep sweep (on-demand, never gates), `tests.yml`.** `workflow_dispatch`
+only, inputs `stress_kind`/`stress_load`/**`repeat`**/`envelope_tier` (default relax).
+**The key mechanism that lets Deep + Required coexist in one file** — an
+event-conditional concurrency group so the per-run-unique group never leaks onto the
+gate:
+```yaml
+concurrency:
+  group: >-
+    ${{ github.event_name == 'workflow_dispatch'
+        && format('{0}-deep-{1}', github.workflow, github.run_id)
+        || format('{0}-{1}', github.workflow, github.ref) }}
+  cancel-in-progress: ${{ github.event_name != 'workflow_dispatch' }}
+```
+
+**Axis-A waiter-count sweep {4,12,24}** under `GCL_TEST_SWEEP=1` (nightly/deep only;
+unset per-PR → today's floor `N=4`, deterministic). A `T_AXIS_A` list read at suite
+top; each of **Test 2b / Test 20 / interop Test 16** loops `N` over it, naming `N` in
+every message. Anti-flake discipline baked into the loop: keep correctness assertions
+config-independent (hold `STALE ≫ hold` so "zero-98 / one-steal" holds at every N —
+these stay `ok`/`bad` strict, *not* `_envelope`), and **scale `MAX_WAIT` with N** so a
+large-N run doesn't time out and look like a product failure. Mechanism generalizes to
+Axis B/C later (deferred per §9.4).
+
+**kcov coverage job** (nightly.yml, Linux-only): build kcov v43 from source (no
+apt/prebuilt), run the **unit suite at FULL, strict, no-load** (`--include-path=git-
+commit-lock.sh`), upload HTML + cobertura (30-day retention), and gate on a
+**conservative line-coverage floor of 0.80** (below the current 83.1%, above noise;
+the Linux ceiling is ~94% because ~30 lines are platform-gated). **Ratchet the floor up
+toward ~0.90 as Bucket-2 lands the Tier-A tests** — the floor tracks achieved coverage,
+it doesn't lead it.
+
+**Nightly issue auto-triage** (nightly.yml, `if: always()`, `issues: write`): parse the
+preserved logs — `^FAIL:` and/or job `failure` → **correctness** (file/append a
+labelled issue, investigate); no FAIL but `WARN[env-relaxed]` and job `success` →
+**envelope-flake** (tracked, no action); timeout/checkout failure → **infra**.
+Idempotent (search-then-append, one issue per (date, class); no all-green spam).
+**Empty-round guard (learned-once):** every cell's artifact missing / workflow errored
+before any suite ran is an **infra** failure — do NOT read "0 FAIL across 0 logs" as
+green. Upload nightly logs on success too (need the negatives to read the positives).
+
+**Load calibration** (`with-load.sh` graduates from scaffolding): express load as
+oversubscription ratio `R = stressors/nproc` (cap `R_total`), prefer `stress-ng`
+(Windows spinner fallback) and a **probe-gated** Linux cgroup CPU-quota path for the
+calibrated envelope leg (IO throttling experimental — don't rely on it); emit a per-run
+**load-manifest** artifact (`{kind, R, nproc, achieved-slowdown, tool versions, os/arch,
+sha}`) uploaded on success too.
+
+**What lands on `main` vs stays scaffolding (refines Bucket 5 / D-d):**
+- **Graduate to `main`:** the calibrated `with-load.sh` (strip the do-not-merge banner;
+  add ratio calibration + load-manifest); `ok_envelope`/`bad_envelope` + the 3
+  reassigned assertions; `GCL_TEST_SWEEP` + Axis-A loop (default-off → per-PR identical
+  to today); the new `nightly.yml`; the `tests.yml` event-conditional-concurrency edit +
+  dispatch inputs. So `b430d73` is **not** wholly do-not-merge — its `with-load.sh`
+  payload graduates; only its *required-job wiring* is dropped.
+- **Revert / drop:** `980856b` (flat per-run-unique group); `b430d73`'s load-wrap +
+  raised-timeouts **on the required job** (they move to nightly.yml).
+
+**§7 GitHub-Actions gotchas the diff MUST honor:**
+- **`paths-ignore` on a *required* check blocks doc-only PRs** (skipped workflow → checks
+  Pending → merge blocked). The current `tests.yml` has both `paths-ignore` and the
+  required jobs. **Fix (required, not optional):** keep the workflow always-running and
+  path-filter only the expensive `test`/`lint` *steps*, with a tiny always-green job
+  satisfying the required check on doc-only PRs (recommended), or make a separate cheap
+  job the required check.
+- **`max-parallel` is intra-matrix only** — bound Deep/Nightly with workflow-level
+  `concurrency` groups (done), never `max-parallel`.
+- **`schedule` auto-disables after ~60 days of repo inactivity** — note in `nightly.yml`;
+  rely on `workflow_dispatch` to re-trigger. A successor should know an empty nightly
+  history may mean "disabled," not "passing."
+- **Artifact names** unique per `(os, leg, kind)`; keep `include-hidden-files: true`
+  (the lock logs live under the scratch `.git/`). `fail-fast: false` stays (per-OS
+  signal + triage needs every cell's verdict). 256-job cap irrelevant at this scale.
+
+---
+
+## Bucket 8 — Harness ergonomics (zero-dep; prototype-validated)
+
+Tests are straight-line `echo "== Test N: … =="` blocks (no registry): **43** in the
+unit suite (the "~36" figure was stale), 25 interop, 2+1 integration. Sequencing is
+**TAP → selector → extract** (each its own commit).
+
+**Item 1 — TAP + `1..N` plan line + the undercount fix (do FIRST, ~20 lines/suite).**
+The bug: under `set -uo pipefail` (no `-e`), an early `exit`/crash terminates the
+suite before the final `echo RESULT` + `[ "$FAIL" = 0 ]`, dropping later assertions
+from the count — and a stray `exit 0` after a recorded FAIL exits **0 with no RESULT
+line** (a *silent green*). Fix, three parts (all prototype-validated):
+- Make `ok`/`bad` TAP-aware, gated by `GCL_TAP=1` (dev runs byte-unchanged): bump a
+  running `TAPN` and emit `ok N - desc` / `not ok N - desc`; keep the `return 0` that
+  the `A && ok || bad` idiom needs.
+- Emit a **trailing `1..$TAPN`** plan line before the verdict — a consumer fails on a
+  short count.
+- A **"reached-the-end" sentinel**: `DONE=0` set to `1` as the last action before the
+  verdict; a `finish` EXIT trap (wrapping the existing per-suite `cleanup`) that, if it
+  fires with `DONE!=1`, prints `Bail out!` and **`exit 1`**. (Key validated detail: a
+  bare trap *return* is ignored — the script keeps its pre-trap code — so the guard
+  needs an explicit `exit 1`; this is what converts the silent early-`exit 0`-after-FAIL
+  into a red.) No hand-maintained expected-count constant — the sentinel catches *any*
+  premature termination with zero upkeep. Apply to all three suites.
+
+**Item 2 — `GCL_TEST_ONLY=<regex>` selector (SECOND; 43 mechanical header rewrites).**
+Wrap each block: `echo "== Test N: … =="` → `if section "Test N: …"; then … fi`, where
+`section` echoes the header and returns success iff `GCL_TEST_ONLY` is unset or its
+regex matches the label. **Care point:** a few blocks do trailing cleanup *after* the
+last assertion before the next header — those lines must move *inside* the `fi`.
+**Integration is EXCLUDED by design:** its Tests 1-3 share one repo + `ALL_IDS`
+accumulator (Test 3 audits 1+2's output), so it is one indivisible scenario — it
+must *note-and-ignore* `GCL_TEST_ONLY` (loud stderr note), never per-block select.
+Unit first; interop the same treatment (lower priority). Anchoring tip for docs:
+`'Test 2'` also matches `Test 2b/20/25` — use `'Test 2:'` / `'Test 2b'`.
+
+**Item 3 — extract `tests/_harness.sh` (LAST; pure dedup, largest diff).** Source one
+shared file from each suite. Tier 1 (all three): the `PASS/FAIL/TAPN/DONE` inits +
+`GCL_TAP`/`GCL_TEST_ONLY` reads, `ok`/`bad`, `section`, the `finish`/sentinel helper,
+and the shared shellcheck disables. Tier 2 (unit+interop only — integration uses none):
+`epoch_to_stamp`, `backdate`, `backdate_ghost`, `sync_waiting_fresh`, `fabricate_lock`,
+`wait_for_grep`, `clone_fn` + its `export -f` line. Tier 3: keep **both** poll helpers
+under their existing names/semantics (`wait_for_file` `$2`=seconds, interop's `wait_for`
+`$2`=50ms-iterations) — do *not* unify signatures this pass (would touch every call site
+on the most fragile timing axis). **Do NOT extract `cleanup`** — it closes over each
+suite's `$WORK` and interop's body genuinely differs; the shared `finish` just calls the
+suite-local `cleanup`. Do it last so the final TAP/selector code is extracted once.
+Verify byte-identical behavior by diffing a FULL run's sorted `PASS:`/`FAIL:` set
+before/after (CI or local).
+
+Prototypes (gitignored, `.agent-testing/bucket8-proto/`) validate TAP emission, the
+trailing plan, selector matching, TAP+selector composition, and the sentinel closing
+the exact silent-green bug.
+
+---
+
+## Phasing for Phase 3 (the build)
+
+Order chosen so cheap, enabling work lands first and each step is CI-verifiable:
+
+1. **Bucket 8 items 1-2 first** (TAP + `GCL_TEST_ONLY`) — they make iterating on
+   ~15 new tests far cheaper and give machine-readable CI output to read the new
+   tests' results back from. (Per the harness design's safe-increment order.)
+2. **Bucket 3 doc edits** — independent, low-risk, can land anytime; do early so
+   the docs match the contract.
+3. **Bucket 4 envelope switch** (`GCL_ENVELOPE_TIER`) — needed before the nightly
+   CI tier and before scoping Test 21/22a/29.
+4. **Bucket 2A steering tests** (A1/A2/A4 first, then the rest) — the coverage core.
+5. **Bucket 2B fault-injection tests** (the feasible D-b first cut; flag/defer any
+   non-portable lane).
+6. **Bucket 8 item 3** (`_harness.sh` extraction) — after the new tests exist, so
+   the shared helpers are settled.
+7. **Bucket 6 CI matrix** — wire the three tiers + kcov leg + parametrization last,
+   once the tests and the envelope switch exist for it to orchestrate.
+
+Each step commits incrementally under the commit-lock; verification dispatches
+`tests.yml` on `ci-stress`. **Build vs Workflow:** decide hand-run vs a Claude Code
+Workflow once the final test count is known (plan D-e) — likely a Workflow for the
+~15 steering tests (fan-out write + per-test CI verify).
+
+## Logging / observability design (per engineering practices)
+- **New tests** assert on the product's existing protocol log strings (the coverage
+  proxy the audit used) — every new steering test greps a specific log line, so a
+  silent behavior change is caught.
+- **TAP output** (Bucket 8) makes each assertion's pass/fail individually visible in
+  CI logs, and the `1..N` plan line makes a truncated run fail loudly (closing the
+  silent-undercount gap).
+- **The load-manifest artifact** (Bucket 6) records `{kind, R, nproc,
+  achieved-slowdown, tool versions, runner os/arch, git sha}` per nightly/deep run,
+  uploaded on success too, so any flake is reproducible (the reproducible-experiments
+  requirement).
+- **kcov coverage artifact** (Bucket 6) uploaded per Linux run; the gap list in
+  `steering-coverage.md` is the baseline to diff against.
+- **Nightly auto-triage** tags a failing scheduled run `correctness` (investigate)
+  vs `envelope` (expected under load), so scheduled reds are visible, not silent.
+
+## Open decisions for Ben
+- **D-b tiering (confirm):** build all of Tier A (A1-A11) + the Tier-B first cut
+  (F4, F2/J1) now? The original D-b's "second tier" items are all accounted for —
+  E3 → **A6** (steering, not fault-injection), F2-audit #7 (rename-refused) → **A1**,
+  #8 (Windows blocked-unlink) → **Tier C** (platform-only, verified on the Windows
+  leg); only **F1/F3** are genuinely not portably injectable. (Recommend: yes — Tier A
+  is all portable; defer only F1/F3.)
+- **F1 (ENOSPC) — gated test vs document-only:** F1's behavior is validated but its
+  injection needs Linux root (`mount`). Ship as a Linux-only test gated behind a
+  `sudo -n` capability probe (skip-with-note elsewhere, `sudo umount` in cleanup), or
+  document-only? (Recommend: the **gated test** — GitHub `ubuntu-*` runners have
+  passwordless sudo so it actually runs there and skips cleanly everywhere else; falls
+  back to document-only if you'd rather keep zero root in the suite.) **F3 is
+  document-only either way** (no deterministic injection exists — the create needs ~1 FD).
+- **Build mechanism (D-e):** hand-run Phase 3, or a Claude Code Workflow for the test
+  fan-out? (Recommend: decide once the count is final — ~13 steering + 2-3 fault tests;
+  lean Workflow for the steering batch, hand-run the CI/doc edits.)
+- Anything else needing a call is surfaced inline in the integrated sections.
+
+## Changelog
+(empty — Phase 2 planning; implementation changelog starts in Phase 3.)

From 26c9c29fbf11e4149d40bb8e9c97fe843c7a9871 Mon Sep 17 00:00:00 2001
From: Ben Toner <bentoner@bentoner.com>
Date: Wed, 17 Jun 2026 20:36:13 +1000
Subject: [PATCH 27/76] Phase 2 plan: fold review round (Claude + Codex); both
 verdicts sound-to-gate
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

CI (Codex): split into THREE workflows — tests.yml (required) + a stable
tests-passed aggregator as the ONLY required context; nightly.yml; deep-sweep.yml
with distinct job names. This fixes the workflow_dispatch-publishes-check-contexts
gating risk AND the paths-ignore-on-required gotcha, and drops the event-conditional
concurrency expression. Made ok_envelope/bad_envelope TAP-aware (Bucket 8 item 1
lands first, so TAPN/GCL_TAP exist). Added a GCL_TEST_ONLY zero-match guard.
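
A minimal sketch of the selector plus zero-match guard, for orientation only. The
names `section`, `GCL_TEST_ONLY` and `SECTIONS_RUN` come from the plan; the function
bodies, the `zero_match_guard` and `run_selected` helpers, and the three sample labels
are assumptions made for illustration, not the code that will land.

```shell
#!/usr/bin/env bash
# Sketch only: names follow the plan, bodies are assumed.
GCL_TEST_ONLY="${GCL_TEST_ONLY:-}"
SECTIONS_RUN=0

section() {   # $1 = label; succeeds iff the selector is unset or its regex matches
  if [ -z "$GCL_TEST_ONLY" ] || [[ "$1" =~ $GCL_TEST_ONLY ]]; then
    SECTIONS_RUN=$((SECTIONS_RUN+1))
    echo "== $1 =="
    return 0
  fi
  return 1
}

zero_match_guard() {   # hypothetical helper: red when a selector matched nothing
  if [ -n "$GCL_TEST_ONLY" ] && [ "$SECTIONS_RUN" = 0 ]; then
    echo "FAIL: GCL_TEST_ONLY='$GCL_TEST_ONLY' matched no test section" >&2
    return 1
  fi
  return 0
}

run_selected() {   # hypothetical driver: $1 = selector regex (may be empty)
  GCL_TEST_ONLY="$1"; SECTIONS_RUN=0
  local label
  for label in "Test 2: basic" "Test 2b: waiters" "Test 20: sweep"; do
    if section "$label" >/dev/null; then
      :   # the block's assertions would run here
    fi
  done
  zero_match_guard
}
```

With these sample labels, `run_selected 'Test 2:'` runs one block, `run_selected 'Test 2'`
runs all three (the anchoring pitfall noted in the plan), and a typo such as
`run_selected 'Tset 2'` returns non-zero instead of reporting a vacuous green.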

Tests (Claude): A6 must shadow the INNER _lock_stat_mtime (:606), NOT _lock_path_mtime
(:639-643). The latter is the function that emits the warn-once the test asserts, so
shadowing it would defeat the assertion (verified against the code). Test 22a downgrade
refined to only the warning-fired-at-all
assertion (keep the warn-once dedup n<=1 and names-type strict). A reviewer's Test-22a
line numbers were a mislocation — the plan mapping (:1167/:1168/:1170) is verified
correct against the tree. F3's reclassification as document-only supersedes
steering-coverage B4's "portable POSIX" rating (ulimit -n can't fail the ~1-FD create
without starving bash startup); steering-coverage.md B4 is corrected to match.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 .../2026-06-17-ci-stress-phase2-build-plan.md | 96 ++++++++++++-------
 docs/steering-coverage.md                     | 15 ++-
 2 files changed, 72 insertions(+), 39 deletions(-)

diff --git a/.plans/2026-06-17-ci-stress-phase2-build-plan.md b/.plans/2026-06-17-ci-stress-phase2-build-plan.md
index a7c2edf..cb0e1a1 100644
--- a/.plans/2026-06-17-ci-stress-phase2-build-plan.md
+++ b/.plans/2026-06-17-ci-stress-phase2-build-plan.md
@@ -39,7 +39,7 @@ line anchors are current-tree and may drift (re-locate at build).
 | **A3** | `foreign` claim-recheck branch (`:1103-1106`; kcov hits=0) | shadow the claim read at recheck to return a *foreign* token (a clearer removed our claim, a rival re-claimed) | leave the foreign claim; discovery read; back off; no 98-on-mere-claim | all | MED-HIGH |
 | **A4** | `exec`-bypass / §H4 no-silent-loss boundary (`lock_run` runs `"$@"` in the wrapper shell, `:1733`) | **(corrected, verified empirically)** the exec must run in the lock-holding shell: `run -- exec true` or sourced `lock_acquire; exec true` — **NOT** `run -- bash -c 'exec true'` (that execs a child, releases normally) | (a) benign: no `RELEASED` line / lock left held; (b) displaced (backdated lease + parked contender) + exec 0 → caller sees 0 with **no** 98 — pins `guarantees.md` OOS-5 | all (bash) | **HIGH** — the one silent-loss boundary |
 | **A5** | forward clock-jump → premature steal of a live lock (§E2; `:928,1409`) | `clone_fn _lock_now` to return now+offset on the poll while the live holder's mtime stays current | the live lock is judged stale and stolen; the victim's release hits **98** (clock-driven analogue of Test 4b) | all | MED |
-| **A6** | mtime-unreadable fail-safe (§E3; `:639-645`, consumed `:912-926`) | `clone_fn` the mtime helper (`_lock_path_mtime` / its `stat` shadow) to return empty on a *present* file | warn-once "Staleness detection is BROKEN"; **no steal**; waiter → 97; (closes BE-3's "coverage planned") | all (bash; + ps1 parity if feasible) | MED |
+| **A6** | mtime-unreadable fail-safe (§E3; `:639-645`, consumed `:912-926`) | `clone_fn _lock_stat_mtime` (the **inner** stat probe at `:606`) to return empty on a *present* file — **NOT** `_lock_path_mtime`, which is the function that *emits* the warn-once (`:639-643`); shadowing it would defeat the assertion | warn-once "Staleness detection is BROKEN"; **no steal**; waiter → 97; (closes BE-3's "coverage planned") | all (bash; + ps1 parity if feasible) | MED |
 | **A7** | malformed/unreadable content classification tails (`_lock_verify_stale` `:940-949`; in-acquire steal guard `:1429-1443`; claim-stale-check `:1240-1249`) | fabricate a line-1-whitespace file (non-empty blank line 1 = `#18`); shadow a read-fault (`#17`) | no steal; the right `not a lock/claim file` / `unreadable` warning; covers several sibling branches per test | all | LOW-MED (cheap, multi-branch) |
 | **A8** | socket & device-node wrong-type arms (`:1474-1475` claim, `:1561-1562` lock; kcov-new) | bind a unix socket / reference a device node (`/dev/null`) at the path | refusal (never stolen/deleted); the `-S`/`-b`/`-c` arms execute | POSIX | LOW (cheap; sibling of tested guard) |
 | **A9** | log rotation past 1 MB (`:558-559`; kcov-new) | pre-write a >1 MB `$AGENT_LOCK_LOG`, trigger a log call | truncate-restart (log shrinks; lock unaffected) | all | LOW (trivial, no injection) |
@@ -77,7 +77,10 @@ F3 in the first cut) based on the feasibility results.
   validated. *(Decision point for Ben — see Open decisions.)*
 - **Document-only:** F3 (and F1 if Ben prefers zero root in the suite). Note the validated
   behavior in `failure-modes.md` §F1/§F3 (the empty-orphan→97 path) rather than shipping a
-  flaky/non-portable test.
+  flaky/non-portable test. **This supersedes `steering-coverage.md` §3 B4's "portable POSIX"
+  rating and the failure-modes §4.5/Q5 "`ulimit -n` for FDs" suggestion** — the empirical
+  check shows the create needs ~1 FD, so no `ulimit -n` fails it without first starving
+  bash's own startup (harness corruption). `steering-coverage.md` B4 is corrected to match.
 
 **Implementation notes (match existing idioms):** use the `LOCK`/`LOG`/`AGENT_LOCK_*` env
 vocabulary and the `rc=$?; [ "$rc" = 97 ] && ok … || bad …` + `grep -q "TIMEOUT after"`
@@ -148,11 +151,20 @@ then — same signature, so the later move is mechanical):
 ```bash
 ENVELOPE_TIER="${GCL_ENVELOPE_TIER:-strict}"   # default strict; nightly/deep set relax
 ENV_WARN=0
-ok_envelope()  { echo "PASS[env]: $*"; PASS=$((PASS+1)); }
-bad_envelope() {   # the FAIL branch of a wall-clock/poll-count (Tier-2) assertion only
-  if [ "$ENVELOPE_TIER" = relax ]; then echo "WARN[env-relaxed]: $*"; ENV_WARN=$((ENV_WARN+1))
-  else echo "FAIL: $*"; FAIL=$((FAIL+1)); fi
-}
+# TAP-aware (Bucket 8 item 1 lands FIRST, so TAPN/GCL_TAP already exist — review catch).
+# An envelope PASS is a normal `ok`; an envelope FAIL is a hard `bad` in strict, but in
+# relax it is a TAP-passing line with a `# env-relaxed` directive — it counts toward the
+# 1..N plan and bumps ENV_WARN (for triage), and NEVER reds the run.
+ok_envelope()  { PASS=$((PASS+1)); TAPN=$((TAPN+1)); echo "PASS[env]: $*"
+                 [ "${GCL_TAP:-0}" = 1 ] && echo "ok $TAPN - $*"; return 0; }
+bad_envelope() {
+  if [ "$ENVELOPE_TIER" = relax ]; then
+    ENV_WARN=$((ENV_WARN+1)); TAPN=$((TAPN+1)); echo "WARN[env-relaxed]: $*"
+    [ "${GCL_TAP:-0}" = 1 ] && echo "ok $TAPN - $* # env-relaxed"
+  else
+    FAIL=$((FAIL+1)); TAPN=$((TAPN+1)); echo "FAIL: $*"
+    [ "${GCL_TAP:-0}" = 1 ] && echo "not ok $TAPN - $*"
+  fi; return 0; }
 ```
 
 - **`ok`/`bad` = the strict-correctness tier** (always hard, both tiers);
@@ -163,9 +175,16 @@ bad_envelope() {   # the FAIL branch of a wall-clock/poll-count (Tier-2) asserti
   `*_envelope` on the *wall-clock* assertion only; every neighbouring correctness
   assertion (rc=97, no-steal, dir-untouched, STOLE-BY-CLAIM, …) **keeps `ok`/`bad`**:
   - **Test 21** `:1144` — recovery latency `≤20s`.
-  - **Test 22a** `:1167` (warning fired — relies on two-poll-confirm headroom),
-    `:1170` (fired exactly once), and `:1168` (warning names the type — contingent
-    on the same starved warning). The never-steal / never-delete assertions stay strict.
+  - **Test 22a** — downgrade ONLY the *warning-fired-at-all* assertion (`:1167`,
+    `grep -q "is not a claim file"`, i.e. count `≥1`), which depends on two-poll-confirm
+    headroom under load. Keep the warn-once **correctness** strict: **split** the current
+    `n==1` check (`:1170`) into `n≥1` (→ `bad_envelope`, timing) **+** `n≤1` (→ `bad`,
+    strict — the dedup property: never warns twice), and **guard** "names the type"
+    (`:1168`) on a warning having fired (assert strictly only when `n≥1`). So a real
+    warn-once regression (n≥2, or wrong type) stays a hard FAIL even under `relax`.
+    (Mapping `:1167`/`:1168`/`:1170` verified against the current tree — a reviewer's
+    alternate line numbers were a mislocation; re-confirm at build.) The never-steal /
+    never-delete assertions (`:1171`/`:1172`) stay strict.
   - **Test 29** `:1531` — `≥2` CLAIM lines (poll-count).
 - **Required CI sets `strict` (or leaves it unset)** — at zero artificial load the
   three pass comfortably, so the gate behavior is unchanged; **nightly/deep set
@@ -177,11 +196,24 @@ bad_envelope() {   # the FAIL branch of a wall-clock/poll-count (Tier-2) asserti
 
 ## Bucket 6 — CI matrix wiring (the accepted load-strategy §9 decisions)
 
-**Two-workflow structure:** keep `tests.yml` for **Tier R (required)** + **Tier D
-(deep dispatch)**; add a new `nightly.yml` for **Tier N (nightly)** + the kcov job +
-triage. Rationale: the nightly tier is non-blocking and must never be a required
-check, so a separate workflow keeps its `concurrency`, `issues: write` permission,
-and schedule independent of the gate.
+**Three-workflow structure** (revised after review — a `workflow_dispatch` run
+publishes check contexts on the head SHA, so keeping Deep in `tests.yml` under shared
+job names risks a failed Deep run gating a PR; separate files + a stable required
+aggregator remove that risk *and* the event-conditional concurrency):
+- **`tests.yml`** — Tier R (required): the 4-cell `test` matrix + `lint` + a single
+  stable **`tests-passed` aggregator** (`needs: [test, lint]`, `if: always()`, succeeds
+  iff every needed job *succeeded or was skipped*). **Branch protection requires ONLY
+  `tests-passed`**, not the per-cell matrix contexts. Concurrency: `group: ${{
+  github.workflow }}-${{ github.ref }}` + `cancel-in-progress`.
+- **`nightly.yml`** — Tier N + the kcov job + triage (`issues: write`, `schedule`, its
+  own `concurrency: nightly`).
+- **`deep-sweep.yml`** — Tier D (`workflow_dispatch` only), with **distinct job names**
+  (`deep-*`) so it never publishes the `tests-passed` context, and per-run-unique
+  concurrency.
+This also fixes the **`paths-ignore`-on-required gotcha** cleanly: path-filter the
+expensive `test`/`lint` jobs (they *skip* on doc-only PRs) while `tests-passed` always
+runs and reports green (its needs were skipped, not failed) — so a doc-only PR satisfies
+the one required context without the expensive jobs running.
 
 **Tier R — Required / per-PR (blocking), `tests.yml`.** The current 4 cells
 unchanged (ubuntu all / macos all / windows unit / windows interop+integration),
@@ -203,19 +235,13 @@ ghosts + 5.1 unlink-then-move under churn), N6 windows unit/both. 6 cells + kcov
 triage ≈ 8 jobs → one wave under the ~20/5 ceiling. Nightly steps keep the raised
 timeouts (correct here).
 
-**Tier D — Deep sweep (on-demand, never gates), `tests.yml`.** `workflow_dispatch`
-only, inputs `stress_kind`/`stress_load`/**`repeat`**/`envelope_tier` (default relax).
-**The key mechanism that lets Deep + Required coexist in one file** — an
-event-conditional concurrency group so the per-run-unique group never leaks onto the
-gate:
-```yaml
-concurrency:
-  group: >-
-    ${{ github.event_name == 'workflow_dispatch'
-        && format('{0}-deep-{1}', github.workflow, github.run_id)
-        || format('{0}-{1}', github.workflow, github.ref) }}
-  cancel-in-progress: ${{ github.event_name != 'workflow_dispatch' }}
-```
+**Tier D — Deep sweep (`deep-sweep.yml`, `workflow_dispatch` only, never gates).**
+Inputs `stress_kind`/`stress_load`/**`repeat`**/`envelope_tier` (default relax). Its
+jobs use **distinct names** (`deep-*`) so a failed dispatch never publishes the
+`tests-passed` required context (the review catch), with per-run-unique concurrency
+(`group: deep-${{ github.run_id }}`, `cancel-in-progress: false`) so many parallel
+dispatches each run and accept queue waves. Living in its own file removes any need for
+an event-conditional concurrency expression.
 
 **Axis-A waiter-count sweep {4,12,24}** under `GCL_TEST_SWEEP=1` (nightly/deep only;
 unset per-PR → today's floor `N=4`, deterministic). A `T_AXIS_A` list read at suite
@@ -262,11 +288,10 @@ sha}`) uploaded on success too.
 
 **§7 GitHub-Actions gotchas the diff MUST honor:**
 - **`paths-ignore` on a *required* check blocks doc-only PRs** (skipped workflow → checks
-  Pending → merge blocked). The current `tests.yml` has both `paths-ignore` and the
-  required jobs. **Fix (required, not optional):** keep the workflow always-running and
-  path-filter only the expensive `test`/`lint` *steps*, with a tiny always-green job
-  satisfying the required check on doc-only PRs (recommended), or make a separate cheap
-  job the required check.
+  Pending → merge blocked). **Fixed** by the `tests-passed` aggregator above: it is the
+  sole required context and always runs (green when the path-filtered `test`/`lint` jobs
+  skip), so doc-only PRs merge. Branch protection must require **`tests-passed`**, NOT the
+  per-cell matrix contexts (else skipped cells sit Pending).
 - **`max-parallel` is intra-matrix only** — bound Deep/Nightly with workflow-level
   `concurrency` groups (done), never `max-parallel`.
 - **`schedule` auto-disables after ~60 days of repo inactivity** — note in `nightly.yml`;
@@ -311,7 +336,10 @@ last assertion before the next header — those lines must move *inside* the `fi
 accumulator (Test 3 audits 1+2's output), so it is one indivisible scenario — it
 must *note-and-ignore* `GCL_TEST_ONLY` (loud stderr note), never per-block select.
 Unit first; interop the same treatment (lower priority). Anchoring tip for docs:
-`'Test 2'` also matches `Test 2b/20/25` — use `'Test 2:'` / `'Test 2b'`.
+`'Test 2'` also matches `Test 2b/20/25` — use `'Test 2:'` / `'Test 2b'`. **Zero-match
+guard (review catch):** `section` bumps a `SECTIONS_RUN` counter when it runs a block;
+at the end, if `GCL_TEST_ONLY` is set and `SECTIONS_RUN==0`, fail loudly — a typo'd regex
+must not report a vacuous `PASS=0 FAIL=0` green (same spirit as the undercount sentinel).
 
 **Item 3 — extract `tests/_harness.sh` (LAST; pure dedup, largest diff).** Source one
 shared file from each suite. Tier 1 (all three): the `PASS/FAIL/TAPN/DONE` inits +
diff --git a/docs/steering-coverage.md b/docs/steering-coverage.md
index dd98461..8abaa03 100644
--- a/docs/steering-coverage.md
+++ b/docs/steering-coverage.md
@@ -193,12 +193,17 @@ rather than shipping a flake** (per the §4.5 decision).
 - **B1 — Unwritable lock dir/parent → clean 97** (F4). `chmod` the dir.
   POSIX; the cheapest and highest-value fault-injection test. **High.**
 - **B2 — Unwritable/failing log path → lock still works, log swallowed** (F2/J1).
-  Bad/again-`chmod`'d log path. POSIX. **Medium-high.**
+  *Phase-2 feasibility:* use the **ENOTDIR trick** (`AGENT_LOCK_LOG` under a regular
+  file) — **portable, no chmod/guard**. **First cut.**
 - **B3 — ENOSPC during claim/lock create+write** (F1; the create write-fail branch
-  `#5` and the read-fault lanes `:848,871-873`). Small dedicated tmpfs/quota.
-  Linux-friendliest; flag if not portable. **Medium.**
-- **B4 — FD exhaustion via `ulimit -n`** (F3). Portable POSIX; inode exhaustion
-  only if cleanly injectable. **Medium.**
+  `#5` and the read-fault lanes `:848,871-873`). *Phase-2 feasibility:* real injection
+  needs `mount` (Linux **root**); `ulimit -f` is a SIGXFSZ trap (wrong lane). **Second
+  cut — Linux + `sudo -n` probe-gated, or document-only.**
+- **B4 — FD exhaustion via `ulimit -n`** (F3). **Corrected (Phase-2 feasibility,
+  supersedes the earlier "portable POSIX" rating):** NOT portably/deterministically
+  injectable — the create needs only ~1 FD, so any `ulimit -n` low enough to fail it
+  first starves bash's own startup (machine-dependent harness corruption); inode
+  exhaustion needs root. **Document-only.**
 
 ### Tier C — Platform-only (verify off-Linux; not a Linux gap)
 

From 3789be9a8a79daa88c841ac26edf775eaaccadf8 Mon Sep 17 00:00:00 2001
From: Ben Toner <bentoner@bentoner.com>
Date: Thu, 18 Jun 2026 00:58:06 +1000
Subject: [PATCH 28/76] Bucket 8 item 1: TAP output + 1..N plan line +
 undercount sentinel
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

All three suites: ok/bad now emit TAP13 (`ok`/`not ok N - desc`) gated by
GCL_TAP=1 (dev runs byte-unchanged); a trailing `1..N` plan line lets a consumer
fail on a short count; and a DONE sentinel + a finish() EXIT-trap wrapper turn any
early exit/crash into a loud `Bail out!` + exit 1 — closing the silent-undercount
gap (a stray `exit 0` after a recorded FAIL no longer reports green). A bare trap
`return` is ignored by bash, so the guard uses an explicit `exit 1`.
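
A small self-contained sketch of the sentinel behaviour described above, assuming a
miniature two-assertion suite run in a child bash. `run_mini_suite` and its `MODE`
variable are hypothetical and exist only for this illustration; the real suites also
call their own `cleanup` from `finish` and gate the TAP lines on `GCL_TAP`.

```shell
#!/usr/bin/env bash
# Sketch: run a miniature suite in a child bash so its exit code can be inspected.
run_mini_suite() {   # $1 = "early" simulates a stray `exit 0` after a recorded FAIL
  MODE="$1" bash -c '
    set -uo pipefail
    PASS=0; FAIL=0; TAPN=0; DONE=0
    finish() {
      if [ "${DONE:-0}" != 1 ]; then
        echo "Bail out! suite terminated early; ran ${TAPN:-0} assertion(s)" >&2
        exit 1          # a bare return would keep the pre-trap exit code
      fi
    }
    trap finish EXIT
    ok()  { PASS=$((PASS+1)); TAPN=$((TAPN+1)); echo "ok $TAPN - $*"; return 0; }
    bad() { FAIL=$((FAIL+1)); TAPN=$((TAPN+1)); echo "not ok $TAPN - $*"; return 0; }
    ok "first assertion"
    if [ "$MODE" = early ]; then bad "recorded failure"; exit 0; fi
    ok "second assertion"
    DONE=1
    echo "1..$TAPN"
    [ "$FAIL" = 0 ]
  '
}

run_mini_suite full  >/dev/null 2>&1; rc_full=$?
run_mini_suite early >/dev/null 2>&1; rc_early=$?
echo "full run exit=$rc_full, early-exit run exit=$rc_early"
```

The full run should exit 0 and the early-exit run should exit 1, even though the stray
`exit 0` asked for a green: the explicit `exit 1` in the trap is what overrides it.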

Validated: unit suite REDUCED + GCL_TAP=1 -> 220/220; the `1..220` plan line
matches the assertion count, exit 0, and the sentinel does not false-fire. The interop
and integration suites are syntax-checked here; full runs verify via CI.

Phase 3, step 1 of the Phase 2 build plan (Bucket 8 item 1).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 tests/git-commit-lock.integration.test.sh | 29 +++++++++++++++++++----
 tests/git-commit-lock.interop.test.sh     | 29 +++++++++++++++++++----
 tests/git-commit-lock.test.sh             | 29 +++++++++++++++++++----
 3 files changed, 75 insertions(+), 12 deletions(-)

diff --git a/tests/git-commit-lock.integration.test.sh b/tests/git-commit-lock.integration.test.sh
index a142bba..579a5da 100644
--- a/tests/git-commit-lock.integration.test.sh
+++ b/tests/git-commit-lock.integration.test.sh
@@ -59,11 +59,30 @@ cleanup() {
     rm -rf "$WORK" 2>/dev/null || true
   fi
 }
-trap cleanup EXIT
+# Sentinel: the suite reaching its end sets DONE=1. If the EXIT trap fires with
+# DONE!=1, the suite died early (a stray exit/crash) and the assertion count is
+# unreliable — fail loudly even if the pre-trap code was 0. A bare trap `return`
+# is IGNORED (the script keeps its pre-trap code), so the guard must `exit 1`.
+finish() {
+  cleanup
+  if [ "${DONE:-0}" != 1 ]; then
+    echo "Bail out! suite terminated early before the plan line; ran ${TAPN:-0} assertion(s), count unreliable" >&2
+    exit 1
+  fi
+}
+trap finish EXIT
 
-PASS=0; FAIL=0
-ok()  { echo "PASS: $*"; PASS=$((PASS+1)); }
-bad() { echo "FAIL: $*"; FAIL=$((FAIL+1)); }
+PASS=0; FAIL=0; TAPN=0; DONE=0
+GCL_TAP="${GCL_TAP:-0}"           # CI sets GCL_TAP=1 for machine-readable TAP13 output
+# ok/bad are TAP-aware (gated by GCL_TAP so plain dev runs are byte-unchanged) and
+# bump the running assertion number TAPN. The trailing `1..$TAPN` plan line (emitted
+# just before the verdict) lets a TAP consumer fail on a short count; together with the
+# DONE sentinel above this closes the silent-undercount gap. `return 0` preserves the
+# "ok/bad cannot fail" property the `<assert> && ok ... || bad ...` idiom relies on.
+ok()  { PASS=$((PASS+1)); TAPN=$((TAPN+1)); echo "PASS: $*"
+        [ "$GCL_TAP" = 1 ] && echo "ok $TAPN - $*"; return 0; }
+bad() { FAIL=$((FAIL+1)); TAPN=$((TAPN+1)); echo "FAIL: $*"
+        [ "$GCL_TAP" = 1 ] && echo "not ok $TAPN - $*"; return 0; }
 
 # --- sizing ------------------------------------------------------------------
 # Commits serialise (that's the whole point), so wall time ≈ workers x commit
@@ -301,5 +320,7 @@ done
                   || bad "$n_next leftover claim file(s) beside the lock"
 
 echo
+DONE=1
 echo "==== INTEGRATION RESULT: $PASS passed, $FAIL failed (fan-out: $GCL_MODE) ===="
+[ "$GCL_TAP" = 1 ] && echo "1..$TAPN"
 [ "$FAIL" = 0 ]
diff --git a/tests/git-commit-lock.interop.test.sh b/tests/git-commit-lock.interop.test.sh
index 8d2a566..a638005 100644
--- a/tests/git-commit-lock.interop.test.sh
+++ b/tests/git-commit-lock.interop.test.sh
@@ -67,9 +67,17 @@ WORK="$(pwsh -NoProfile -Command '[IO.Path]::Combine([IO.Path]::GetTempPath(), "
 WORK="${WORK//\\//}"
 mkdir -p "$WORK"
 
-PASS=0; FAIL=0
-ok()  { echo "PASS: $*"; PASS=$((PASS+1)); }
-bad() { echo "FAIL: $*"; FAIL=$((FAIL+1)); }
+PASS=0; FAIL=0; TAPN=0; DONE=0
+GCL_TAP="${GCL_TAP:-0}"           # CI sets GCL_TAP=1 for machine-readable TAP13 output
+# ok/bad are TAP-aware (gated by GCL_TAP so plain dev runs are byte-unchanged) and
+# bump the running assertion number TAPN. The trailing `1..$TAPN` plan line (emitted
+# just before the verdict) lets a TAP consumer fail on a short count; together with the
+# DONE sentinel below this closes the silent-undercount gap. `return 0` preserves the
+# "ok/bad cannot fail" property the `<assert> && ok ... || bad ...` idiom relies on.
+ok()  { PASS=$((PASS+1)); TAPN=$((TAPN+1)); echo "PASS: $*"
+        [ "$GCL_TAP" = 1 ] && echo "ok $TAPN - $*"; return 0; }
+bad() { FAIL=$((FAIL+1)); TAPN=$((TAPN+1)); echo "FAIL: $*"
+        [ "$GCL_TAP" = 1 ] && echo "not ok $TAPN - $*"; return 0; }
 
 # Failure post-mortems need the logs: keep $WORK when anything failed, and
 # honour GCL_TEST_PRESERVE_DIR (the CI preserve-logs knob) by copying
@@ -86,7 +94,18 @@ cleanup() {
   fi
   rm -rf "$WORK" 2>/dev/null || true
 }
-trap cleanup EXIT
+# Sentinel: the suite reaching its end sets DONE=1. If the EXIT trap fires with
+# DONE!=1, the suite died early (a stray exit/crash) and the assertion count is
+# unreliable — fail loudly even if the pre-trap code was 0. A bare trap `return`
+# is IGNORED (the script keeps its pre-trap code), so the guard must `exit 1`.
+finish() {
+  cleanup
+  if [ "${DONE:-0}" != 1 ]; then
+    echo "Bail out! suite terminated early before the plan line; ran ${TAPN:-0} assertion(s), count unreliable" >&2
+    exit 1
+  fi
+}
+trap finish EXIT
 
 # Poll for a marker file: ready-markers replace fixed head-start sleeps so a
 # slow pwsh cold-start (1-3s+ under load) can't fake an ordering failure.
@@ -1380,5 +1399,7 @@ else
 fi
 
 echo
+DONE=1
 echo "==== INTEROP RESULT: $PASS passed, $FAIL failed (fan-out: $GCL_MODE) ===="
+[ "$GCL_TAP" = 1 ] && echo "1..$TAPN"
 [ "$FAIL" = 0 ]
diff --git a/tests/git-commit-lock.test.sh b/tests/git-commit-lock.test.sh
index b5ca5ee..5491768 100755
--- a/tests/git-commit-lock.test.sh
+++ b/tests/git-commit-lock.test.sh
@@ -51,11 +51,30 @@ cleanup() {
     rm -rf "$WORK" 2>/dev/null || true
   fi
 }
-trap cleanup EXIT
+# Sentinel: the suite reaching its end sets DONE=1. If the EXIT trap fires with
+# DONE!=1, the suite died early (a stray exit/crash) and the assertion count is
+# unreliable — fail loudly even if the pre-trap code was 0. A bare trap `return`
+# is IGNORED (the script keeps its pre-trap code), so the guard must `exit 1`.
+finish() {
+  cleanup
+  if [ "${DONE:-0}" != 1 ]; then
+    echo "Bail out! suite terminated early before the plan line; ran ${TAPN:-0} assertion(s), count unreliable" >&2
+    exit 1
+  fi
+}
+trap finish EXIT
 
-PASS=0; FAIL=0
-ok()  { echo "PASS: $*"; PASS=$((PASS+1)); }
-bad() { echo "FAIL: $*"; FAIL=$((FAIL+1)); }
+PASS=0; FAIL=0; TAPN=0; DONE=0
+GCL_TAP="${GCL_TAP:-0}"           # CI sets GCL_TAP=1 for machine-readable TAP13 output
+# ok/bad are TAP-aware (gated by GCL_TAP so plain dev runs are byte-unchanged) and
+# bump the running assertion number TAPN. The trailing `1..$TAPN` plan line (emitted
+# just before the verdict) lets a TAP consumer fail on a short count; together with the
+# DONE sentinel above this closes the silent-undercount gap. `return 0` preserves the
+# "ok/bad cannot fail" property the `<assert> && ok ... || bad ...` idiom relies on.
+ok()  { PASS=$((PASS+1)); TAPN=$((TAPN+1)); echo "PASS: $*"
+        [ "$GCL_TAP" = 1 ] && echo "ok $TAPN - $*"; return 0; }
+bad() { FAIL=$((FAIL+1)); TAPN=$((TAPN+1)); echo "FAIL: $*"
+        [ "$GCL_TAP" = 1 ] && echo "not ok $TAPN - $*"; return 0; }
 
 # Backdate a path's mtime by $2 seconds — the lock's staleness clock is the
 # lock FILE's own mtime (stamped by the creating write), so this is how a
@@ -2175,6 +2194,8 @@ rm -f "$LOCK" "$LOCK.next"
 #   Test 32, the steal-path lane (F2 — rename-over won, read-back wrong) by
 #   Test 32b.
 
+DONE=1
 echo
 echo "==== RESULT: $PASS passed, $FAIL failed (fan-out: $GCL_MODE) ===="
+[ "$GCL_TAP" = 1 ] && echo "1..$TAPN"
 [ "$FAIL" = 0 ]

From dbecc0201c1ec1cb31478d717089b0d228cae500 Mon Sep 17 00:00:00 2001
From: Ben Toner <bentoner@bentoner.com>
Date: Thu, 18 Jun 2026 01:00:07 +1000
Subject: [PATCH 29/76] Bucket 3: documentation edits (envelope, single-clock,
 network-FS, upgrade-both)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

docs/git-commit-lock.md: add the operating-envelope statement (correctness is
load-independent; the wall-clock bounds are best-effort and scale with scheduling)
and the single-time-source note (a local clock jump is correctness-safe, degrading
to the detected exit-98 lane), both beside the mtime-clock caveat; cross-ref
guarantees.md.

README.md: surface the network/sync-FS boundary in "How it works" (exclusion may
silently fail off a local FS), and add the "upgrade both implementations together"
deployment note (a mixed-version tree degrades prevention to detection).

C-misc (the optional case-insensitive-FS / mixed-version one-liners) skipped as
low-value — the design doc already covers mixed-version.

Plan changelog updated: Phase 3 step 1 done; the Bucket 8 item-2 selector is
deferred to bundle with item 3 (revised phasing recorded).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 .plans/2026-06-17-ci-stress-phase2-build-plan.md | 14 ++++++++++++--
 README.md                                        | 12 +++++++++++-
 docs/git-commit-lock.md                          | 16 ++++++++++++++++
 3 files changed, 39 insertions(+), 3 deletions(-)
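
Reviewer note: the README change leans on create-or-fail being atomic on a local filesystem. As a rough illustration only, the sketch below uses the shell's noclobber redirection as the create-or-fail primitive. That choice is an assumption made for this example (the real script may use a different mechanism), and `try_create` plus the demo path are hypothetical.

```shell
# Assumed demo path under the temp dir; not a path the tool uses.
lock="${TMPDIR:-/tmp}/gcl-demo.$$.lock"
rm -f "$lock"

# Create the file with the caller's token, or fail if it already exists.
try_create() {
  # With noclobber (set -C), `>` refuses to overwrite an existing file.
  ( set -C; printf '%s\n' "$1" > "$lock" ) 2>/dev/null
}

try_create tok.a; first=$?     # path is free: the create succeeds
try_create tok.b; second=$?    # path is taken: the create is refused
holder=$(cat "$lock")          # content is still the first token
rm -f "$lock"
echo "first=$first second=$second holder=$holder"
```

The second caller learns it lost from the failed create alone, which is the property that may silently not hold on network or sync-backed storage.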

diff --git a/.plans/2026-06-17-ci-stress-phase2-build-plan.md b/.plans/2026-06-17-ci-stress-phase2-build-plan.md
index cb0e1a1..69b3bb6 100644
--- a/.plans/2026-06-17-ci-stress-phase2-build-plan.md
+++ b/.plans/2026-06-17-ci-stress-phase2-build-plan.md
@@ -420,5 +420,15 @@ Workflow once the final test count is known (plan D-e) — likely a Workflow for
   lean Workflow for the steering batch, hand-run the CI/doc edits.)
 - Anything else needing a call is surfaced inline in the integrated sections.
 
-## Changelog
-(empty — Phase 2 planning; implementation changelog starts in Phase 3.)
+## Changelog (Phase 3 implementation)
+- **Step 1 (commit `3789be9`) — Bucket 8 item 1 done.** TAP + `1..N` + the
+  `DONE`/`finish` undercount sentinel in all three suites. Unit validated locally
+  (220/220 REDUCED + matching plan line, exit 0, sentinel does not false-fire);
+  interop/integration syntax-checked, full runs via CI.
+- **Deviation — defer Bucket 8 item 2 (the `GCL_TEST_ONLY` selector).** Wrapping 43
+  blocks in `if section …; then … fi` is a large, boundary-sensitive change whose only
+  benefit is per-test iteration speed; for this batch the steering tests are validated
+  by a full-suite run, so it doesn't justify front-loading its risk. Bundled with item 3
+  (`_harness.sh` extraction — also a large harness change) into one validated
+  harness-restructure step near the end. **Revised phasing: 8.1 → 3 → 4 → 2A → 2B →
+  (8.2 + 8.3 together) → 6.**
diff --git a/README.md b/README.md
index 5bebc3a..9c7d595 100644
--- a/README.md
+++ b/README.md
@@ -57,7 +57,11 @@ atomic create-or-fail open (`O_CREAT|O_EXCL` / `FileMode.CreateNew`) — atomic
 on local POSIX filesystems and NTFS alike, with no dependency on `flock` —
 whose content is the holder's unique token. Every worktree has its own git
 dir, so independent worktrees get independent locks, while all agents sharing
-one checkout contend on the same lock. The lock is deliberately a stealable
+one checkout contend on the same lock. The protocol's correctness rests on these
+operations being atomic, which holds on local filesystems (ext4, APFS, NTFS, and
+kin) but **not** on network or sync-backed storage — NFS, SMB shares,
+Dropbox/OneDrive-synced directories — where exclusion may silently fail. Keep the
+repo (and so its `.git/`) on a local disk. The lock is deliberately a stealable
 **lease**, not a kernel lock: in unattended agent fleets a hung-but-alive
 holder is at least as common as a crashed one, and a lock that can't be taken
 from a stuck holder halts the whole run — while a rare collision costs little
@@ -94,6 +98,12 @@ against each other on all three OSes — not as platform support, but because
 two independent implementations hammering one lock is cheap adversarial
 verification of the protocol.
 
+**Upgrade both implementations together.** Older releases stole with an
+unserialized move-aside instead of the claim protocol, so the
+no-displacement-during-recovery guarantee holds only when every party in a tree
+runs a current version; a mixed-version tree degrades that prevention to
+detection (exit 98) and can leave `.dead.*` files current versions don't clean.
+
 ## Suggested agent instructions
 
 Agents only benefit from the lock if their instructions tell them to use it.
diff --git a/docs/git-commit-lock.md b/docs/git-commit-lock.md
index 828cfc4..f47fbb8 100644
--- a/docs/git-commit-lock.md
+++ b/docs/git-commit-lock.md
@@ -292,6 +292,22 @@ settles in milliseconds. The
 same floor governs the claim file's ageout: a sub-floor claim mtime reads as
 "just created", never "ancient — clear".
 
+**The operating envelope — correctness is load-independent; latency is not.**
+Exclusion, no-silent-loss, and eventual recovery rest on atomic create/rename
+plus per-attempt tokens, and hold under any load. The wall-clock bounds —
+recovery latency (≈ `STALE_SECS` + poll cadence), the `MAX_WAIT` timeout, and the
+~1.3 s read-retry ladder — are best-effort and scale with scheduling: under CPU
+oversubscription or a slow filesystem they stretch, but the protocol still
+recovers and never loses an update. (For the precise guarantee/scope split, see
+[`guarantees.md`](guarantees.md).)
+
+**One time source.** The tool assumes a single clock — single-host use (the
+common case: all contenders share one checkout, hence one machine and one clock),
+or a shared filesystem with one server clock. A local clock jump is
+correctness-safe: a forward jump can make a live lock look stale and be
+prematurely stolen, but that degrades to the detected exit-98 lane (the robbed
+holder's release fails loudly), never a silent double-commit.
+
 ## The PowerShell port (`git-commit-lock.ps1`)
 
 Some agents (Codex on Windows, for example) run their commands in

From 750be3cdb9d534da098d99036c515b350ab1f634 Mon Sep 17 00:00:00 2001
From: Ben Toner <bentoner@bentoner.com>
Date: Thu, 18 Jun 2026 01:06:53 +1000
Subject: [PATCH 30/76] Bucket 4: correctness/envelope test split (D-c;
 GCL_ENVELOPE_TIER)

Add TAP-aware ok_envelope/bad_envelope to the unit suite: default 'strict' is
identical to ok/bad; GCL_ENVELOPE_TIER=relax downgrades an envelope FAIL to a WARN
that never reds the run (ENV_WARN counted + reported in the summary). Reassign the
three load-sensitive wall-clock/poll-count assertions to the envelope tier, keeping
every neighbouring correctness assertion strict:
- Test 21: recovery <=20s
- Test 22a: "warning fired at all" (n>=1) -> envelope; the warn-once dedup (n<=1)
  and the type-naming stay STRICT (names-type guarded on a warning having fired)
- Test 29: >=2 CLAIM lines

Validated: strict (default) -> 220/220, 0 envelope warnings, 1..220 consistent, the
3 sites report PASS[env]. Downgrade logic checked deterministically (relax->WARN,
strict->FAIL). No product (git-commit-lock.sh) change.

Phase 3, step 3 (Bucket 4). Required CI will set GCL_ENVELOPE_TIER=strict;
nightly/deep runs will set relax.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 tests/git-commit-lock.test.sh | 40 +++++++++++++++++++++++++++++------
 1 file changed, 33 insertions(+), 7 deletions(-)
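
Reviewer note: the downgrade rule can be checked in isolation. The sketch below is a trimmed stand-alone copy of the rule with the TAP output left out; the names mirror the patch, but it is an illustration rather than the suite's code.

```shell
# Trimmed stand-alone copy of the downgrade rule (TAP lines omitted).
FAIL=0; ENV_WARN=0
bad_envelope() {
  if [ "${ENVELOPE_TIER:-strict}" = relax ]; then
    # relax: count a warning, leave FAIL untouched
    ENV_WARN=$((ENV_WARN+1)); echo "WARN[env-relaxed]: $*"
  else
    # strict (default): behaves like a plain bad()
    FAIL=$((FAIL+1)); echo "FAIL: $*"
  fi
  return 0
}

ENVELOPE_TIER=strict; bad_envelope "recovery took 25s (>20s)"
ENVELOPE_TIER=relax;  bad_envelope "recovery took 25s (>20s)"
echo "FAIL=$FAIL ENV_WARN=$ENV_WARN"
```

The same latency miss increments `FAIL` once under strict and `ENV_WARN` once under relax, so only the strict tier can red the run.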

diff --git a/tests/git-commit-lock.test.sh b/tests/git-commit-lock.test.sh
index 5491768..eb8b662 100755
--- a/tests/git-commit-lock.test.sh
+++ b/tests/git-commit-lock.test.sh
@@ -76,6 +76,26 @@ ok()  { PASS=$((PASS+1)); TAPN=$((TAPN+1)); echo "PASS: $*"
 bad() { FAIL=$((FAIL+1)); TAPN=$((TAPN+1)); echo "FAIL: $*"
         [ "$GCL_TAP" = 1 ] && echo "not ok $TAPN - $*"; return 0; }
 
+# Envelope-tier assertions (Bucket 4 / decision D-c). A wall-clock or poll-count
+# bound is a Tier-2 (best-effort latency) property, NOT a correctness one (see
+# guarantees.md BE-1). In the default 'strict' tier these behave exactly like
+# ok/bad. Under GCL_ENVELOPE_TIER=relax (nightly/deep stress runs) an envelope FAIL
+# becomes a WARN that does NOT increment FAIL — so an oversubscribed runner can't
+# turn a latency miss into a red — while every CORRECTNESS assertion keeps ok/bad
+# and stays hard in both tiers. TAP-aware so envelope assertions still count toward 1..N.
+ENVELOPE_TIER="${GCL_ENVELOPE_TIER:-strict}"
+ENV_WARN=0
+ok_envelope()  { PASS=$((PASS+1)); TAPN=$((TAPN+1)); echo "PASS[env]: $*"
+                 [ "$GCL_TAP" = 1 ] && echo "ok $TAPN - $*"; return 0; }
+bad_envelope() {
+  if [ "$ENVELOPE_TIER" = relax ]; then
+    ENV_WARN=$((ENV_WARN+1)); TAPN=$((TAPN+1)); echo "WARN[env-relaxed]: $*"
+    [ "$GCL_TAP" = 1 ] && echo "ok $TAPN - $* # env-relaxed"
+  else
+    FAIL=$((FAIL+1)); TAPN=$((TAPN+1)); echo "FAIL: $*"
+    [ "$GCL_TAP" = 1 ] && echo "not ok $TAPN - $*"
+  fi; return 0; }
+
 # Backdate a path's mtime by $2 seconds — the lock's staleness clock is the
 # lock FILE's own mtime (stamped by the creating write), so this is how a
 # test fakes a stale lock. Portable: BSD touch has no `-d @epoch`, so convert
@@ -1160,7 +1180,7 @@ t21_t1=$(date +%s)
 [ "$rc" = 0 ] && ok "waiter recovered through a crashed claimant's claim (rc 0)" || bad "rc=$rc behind a crashed claim"
 grep -q "CLAIM-STALE-CLEARED" "$LOG" && ok "aged claim cleared (CLAIM-STALE-CLEARED logged, with age)" || bad "no CLAIM-STALE-CLEARED entry"
 grep -q "STOLE-BY-CLAIM" "$LOG" && ok "steal completed after the clear" || bad "no STOLE-BY-CLAIM after clearing the crashed claim"
-[ $((t21_t1 - t21_t0)) -le 20 ] && ok "recovery latency bounded ($((t21_t1 - t21_t0))s)" || bad "recovery took $((t21_t1 - t21_t0))s (>20s)"
+[ $((t21_t1 - t21_t0)) -le 20 ] && ok_envelope "recovery latency bounded ($((t21_t1 - t21_t0))s)" || bad_envelope "recovery took $((t21_t1 - t21_t0))s (>20s)"
 [ -e "$LOCK.next" ] && bad "claim leftover after recovery" || ok "claim path clean after recovery"
 # (b) an EMPTY claim file (claimant died between create and write): same lane.
 LOCK="$WORK/ccempty.lock"; LOG="$WORK/ccempty.log"; : > "$LOG"
@@ -1183,10 +1203,16 @@ AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" AGENT_LOCK_STALE_SECS=1 \
   bash "$LIB" run -- bash -c 'true' 2> "$WORK/t22a.err"; rc=$?
 [ "$rc" = 97 ] && ok "dir at claim path: steals blocked, waiter timed out (97)" || bad "dir at claim path: rc=$rc (want 97)"
 [ -f "$LOCK.next/sub/f" ] && ok "directory at claim path untouched" || bad "directory at claim path was damaged!"
-grep -q "is not a claim file" "$WORK/t22a.err" && ok "loud claim-path config warning on stderr" || bad "no claim-path config warning"
-grep -q "it is a directory" "$WORK/t22a.err" && ok "claim warning names the detected type (directory)" || bad "claim warning does not name the type"
 n="$(grep -c "is not a claim file" "$WORK/t22a.err")"
-[ "$n" = 1 ] && ok "claim-path warning fired exactly once (got $n)" || bad "claim-path warning fired $n times (want 1)"
+# "warning fired at all" is timing-dependent (the two-poll confirmation needs poll
+# headroom before MAX_WAIT, which an oversubscribed runner can starve) -> envelope.
+# The warn-once dedup (never >1) and the type-naming are CORRECTNESS -> strict (the
+# latter only asserted when a warning actually fired).
+[ "$n" -ge 1 ] && ok_envelope "claim-path config warning fired (got $n)" || bad_envelope "no claim-path config warning (n=$n)"
+[ "$n" -le 1 ] && ok "claim-path warning not duplicated (n=$n)" || bad "claim-path warning fired $n times (warn-once broken)"
+if [ "$n" -ge 1 ]; then
+  grep -q "it is a directory" "$WORK/t22a.err" && ok "claim warning names the detected type (directory)" || bad "claim warning does not name the type"
+fi
 grep -q "STOLE-BY-CLAIM" "$LOG" && bad "stole despite a squatted claim path" || ok "no steal through a squatted claim path"
 [ -f "$LOCK" ] && ok "stale lock left in place (cannot be stolen safely)" || bad "lock vanished behind a squatted claim path"
 # (b) a free LOCK path is UNaffected by claim-path junk: normal acquire works.
@@ -1547,8 +1573,8 @@ AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" AGENT_LOCK_STALE_SECS=1 \
   ' _ "$LIB" 2>/dev/null; rc=$?
 [ "$rc" = 97 ] && ok "blocked-steal waiter honoured MAX_WAIT (97)" || bad "blocked-steal rc=$rc (want 97)"
 nclaim="$(grep -c "] CLAIM " "$LOG")"
-[ "$nclaim" -ge 2 ] && ok "claim re-created on later attempts (x$nclaim) — deleted immediately, no ageout penalty" \
-                    || bad "only $nclaim CLAIM line(s) — the failed steal's claim was left to age out (60s-class penalty)"
+[ "$nclaim" -ge 2 ] && ok_envelope "claim re-created on later attempts (x$nclaim) — deleted immediately, no ageout penalty" \
+                    || bad_envelope "only $nclaim CLAIM line(s) — the failed steal's claim was left to age out (60s-class penalty)"
 grep -q "steal FAILED" "$LOG" && ok "blocked rename logged (damped steal FAILED)" || bad "no steal FAILED log line"
 [ -e "$LOCK.next" ] && bad "claim leftover after the blocked steal attempts" || ok "no claim leftover at exit"
 [ -f "$LOCK" ] && ok "squatted lock left in place" || bad "lock vanished in the blocked lane"
@@ -2196,6 +2222,6 @@ rm -f "$LOCK" "$LOCK.next"
 
 DONE=1
 echo
-echo "==== RESULT: $PASS passed, $FAIL failed (fan-out: $GCL_MODE) ===="
+echo "==== RESULT: $PASS passed, $FAIL failed, $ENV_WARN envelope warning(s) (fan-out: $GCL_MODE) ===="
 [ "$GCL_TAP" = 1 ] && echo "1..$TAPN"
 [ "$FAIL" = 0 ]

From cbc1eca65bb5db3ac5d6a38cb20833530afb8220 Mon Sep 17 00:00:00 2001
From: Ben Toner <bentoner@bentoner.com>
Date: Thu, 18 Jun 2026 01:27:50 +1000
Subject: [PATCH 31/76] Bucket 2A wave 1: steering tests 37-40 (rename-refused,
 step-3.3, foreign-recheck, exec/H4)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Four new deterministic-steering unit tests closing high-value Tier-A gaps from
steering-coverage.md (each drafted + self-validated by a sub-agent against a faithful
harness, then re-validated together by me + the full suite):
- Test 37 (A1): CLAIM-ABORT (rename-refused) — a directory appears at the lock path
  mid-steal; the only acquire/steal VERDICT branch previously untested.
- Test 38 (A2): the step-3.3 pre-rename re-verify abort lane (kcov hits=0); a
  call-counter shadow proves the steal got past step-2 to the 3.3 position.
- Test 39 (A3): the foreign claim-recheck branch (kcov hits=0) — rival's claim left
  intact, discovery read, no false 98 (mutation-checked: 6 FAILs against a broken branch).
- Test 40 (A4): the exec-bypass / OOS-5 no-silent-loss boundary — exec in the
  lock-holding shell skips release (lock left, no RELEASED); exec in a child
  (run -- bash -c 'exec') does NOT; plus the displaced-holder silent-loss case.

Full unit suite: 259 passed, 0 failed, 1..259 consistent (REDUCED). No product change.
Tests 41-47 (A5-A11) land in waves 2-3.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 tests/git-commit-lock.test.sh | 289 ++++++++++++++++++++++++++++++++++
 1 file changed, 289 insertions(+)
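
Reviewer note: Tests 38 and 39 steer by cloning a library function and shadowing it with a wrapper that acts on one specific call. The sketch below shows that shape on a toy function. The `clone_fn` defined here is an assumed stand-in written for this example (the harness's own helper is not part of this patch), and `verify` is a hypothetical placeholder for the real verify function.

```shell
out=$(bash -c '
  # Assumed stand-in: copy function $1 under the new name $2 (bash-only).
  clone_fn() { eval "$(declare -f "$1" | sed "1s/^$1 /$2 /")"; }

  # Toy function standing in for the library function being steered.
  verify() { echo "real verdict for $1"; }

  clone_fn verify verify_orig
  N=0
  verify() {
    N=$((N+1))
    # Steer only the second call; every call still reaches the original.
    [ "$N" = 2 ] && echo "steering before call $N"
    verify_orig "$@"
  }

  verify a
  verify b
  echo "calls=$N"
')
echo "$out"
```

Because the wrapper always delegates to the cloned original, the verdict stays the real one; only the state the original observes on the chosen call is changed.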

diff --git a/tests/git-commit-lock.test.sh b/tests/git-commit-lock.test.sh
index eb8b662..c44f8ae 100755
--- a/tests/git-commit-lock.test.sh
+++ b/tests/git-commit-lock.test.sh
@@ -2208,6 +2208,295 @@ grep -q "resolved tok=tok.leak.t36.2" "$LOG" && ok "conclusive resolution logged
                                              || bad "no resolution log line for the conclusive drop"
 rm -f "$LOCK" "$LOCK.next"
 
+echo "== Test 37: rename-refused — a directory appearing at the lock path mid-steal aborts the steal, no false hold =="
+# The only acquire/steal VERDICT branch with no test: a NON-regular object (a
+# directory) appears AT the lock path between the claimant's final re-verify
+# (step 3.3, sees a stale FILE) and its rename-over, so the rename is refused
+# with the lock path occupied by a non-file. The claimant must classify this
+# as rename-refused (non-file at the lock path), delete its claim, take NO
+# hold, and re-poll to MAX_WAIT. Steered deterministically by shadowing mv:
+# the claim->lock rename (the `.next` move) is intercepted to swap the stale
+# lock FILE for a DIRECTORY at the lock path, then the real `mv -T` runs and
+# fails NATURALLY (mv refuses to overwrite a directory with a non-directory) —
+# exactly the wrong-type rename lane. The verifies don't call mv, so the lock
+# reads as a stale file through step 3.3; only the rename sees the directory.
+# Mutation check: an implementation that mis-classifies the refused rename
+# (e.g. treats it as blocked, or proceeds to STOLE-BY-CLAIM) fails the
+# no-false-hold / rename-refused assertions below.
+LOCK="$WORK/renref.lock"; LOG="$WORK/renref.log"; : > "$LOG"
+fabricate_lock "$LOCK" "tok.ghost.t37" "pid=9 host=ghost"; backdate "$LOCK" 9999
+AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" AGENT_LOCK_STALE_SECS=1 \
+  AGENT_LOCK_CLAIM_STALE_SECS=600 AGENT_LOCK_POLL_SECS=0.2 AGENT_LOCK_MAX_WAIT=3 \
+  bash -c '
+    source "$1" || exit 70
+    # Shadow mv: on the claim->lock rename (the only mv touching ".next"),
+    # replace the stale lock file with a directory, then run the real mv -T,
+    # which refuses to overwrite a directory with a non-directory. The mv -T
+    # capability probe inside _lock_rename_over operates on its own temp paths
+    # (never ".next"), so it is unaffected.
+    mv() {
+      case "$*" in
+        *".next"*)
+          command rm -f -- "$AGENT_LOCK_PATH" 2>/dev/null
+          command mkdir -- "$AGENT_LOCK_PATH" 2>/dev/null
+          ;;
+      esac
+      command mv "$@"
+    }
+    lock_acquire
+    exit $?
+  ' _ "$LIB" 2>/dev/null; rc=$?
+[ "$rc" = 97 ] && ok "rename-refused waiter honoured MAX_WAIT (97), never falsely held" \
+               || bad "rename-refused rc=$rc (want 97 — a false hold would exit 0)"
+grep -q "CLAIM-ABORT (rename-refused)" "$LOG" \
+  && ok "CLAIM-ABORT (rename-refused) logged — the wrong-type rename branch was hit" \
+  || bad "no CLAIM-ABORT (rename-refused) — branch not exercised"
+grep -q "non-file at the lock path" "$LOG" \
+  && ok "rename refusal classified as non-file at the lock path" \
+  || bad "missing 'non-file at the lock path' classification wording"
+grep -q "STOLE-BY-CLAIM" "$LOG" \
+  && bad "spurious STOLE-BY-CLAIM — the steal was claimed despite the refused rename" \
+  || ok "no STOLE-BY-CLAIM (no false steal of the directory-occupied path)"
+grep -q "DISCOVERY-HOLD" "$LOG" \
+  && bad "spurious discovery-HOLD — the victim wrongly believed it acquired" \
+  || ok "no spurious discovery-HOLD — ownership-discovery read found no hold"
+grep -q "acquire verification FAILED" "$LOG" \
+  && bad "read-back path entered — the rename was treated as having succeeded" \
+  || ok "rename treated as refused, not as a completed-then-unverified steal"
+[ -e "$LOCK.next" ] \
+  && bad "claim leftover (\$LOCK.next) after the rename-refused abort" \
+  || ok "claim file cleaned up — no leftover \$LOCK.next"
+[ -d "$LOCK" ] \
+  && ok "directory left in place at the lock path (never overwritten)" \
+  || bad "lock path is no longer the squatting directory"
+rm -rf "$LOCK" "$LOCK.next"
+
+echo "== Test 38: step-3.3 pre-rename re-verify abort — claim cleaned, discovery, no false hold =="
+# The step-2 re-verify (sh:1075) and the step-3.3 re-verify immediately before
+# the rename (sh:1149) are near-identical abort lanes; Test 23/27 exercise the
+# step-2 lane only, leaving 3.3 untested. Steered with a CALL-COUNTER on
+# _lock_verify_stale: call 1 (step-2) passes through to the REAL verdict
+# (stale — the ghost is backdated 9999s), so the steal proceeds PAST step-2;
+# call 2 (step-3.3) freshens the lock first, so the real verify reports "fresh"
+# and the abort fires SPECIFICALLY at step-3.3. The proof is the log suffix
+# "(lock re-verify before rename: fresh)" — step-2's suffix is "after claim",
+# so the string can only be the 3.3 lane. STALE_SECS=30 keeps the freshened
+# ghost fresh long enough that the post-abort re-poll does NOT re-steal before
+# the test removes the lock — so the waiter then acquires via the CREATE race
+# (no second STOLE-BY-CLAIM), the same shape as Test 23.
+LOCK="$WORK/pr33.lock"; LOG="$WORK/pr33.log"; : > "$LOG"
+fabricate_lock "$LOCK" "tok.ghost.t38" "pid=9 host=slow"; backdate "$LOCK" 9999
+AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" AGENT_LOCK_STALE_SECS=30 \
+  AGENT_LOCK_CLAIM_STALE_SECS=60 AGENT_LOCK_POLL_SECS=0.1 AGENT_LOCK_MAX_WAIT=30 \
+  bash -c '
+    source "$1" || exit 70
+    clone_fn _lock_verify_stale _vs_orig
+    N=0
+    _lock_verify_stale() {
+      N=$((N+1))
+      # call 1 = step-2: pass through to the real verdict (stale). call 2 =
+      # step-3.3: freshen the ghost lock so the real verify now sees "fresh",
+      # tripping the pre-rename abort at the 3.3 position.
+      if [ "$N" = 2 ]; then command touch -- "$AGENT_LOCK_PATH"; fi
+      _vs_orig "$@"
+    }
+    lock_acquire || exit 72
+    lock_release || exit 74
+    exit 0
+  ' _ "$LIB" 2>/dev/null &
+w38=$!
+# Proof the 3.3 lane ran AND the steal got PAST step-2: the "before rename"
+# suffix is unique to the step-3.3 position (step-2 logs "after claim").
+wait_for_grep "lock re-verify before rename: fresh" "$LOG" 20 \
+  && ok "step-3.3 pre-rename re-verify aborted (fresh) — got past step-2 to the 3.3 lane" \
+  || bad "no step-3.3 'before rename' abort — the 3.3 lane did not run"
+grep -q "CLAIM-ABORT (fresh) tok=.* (lock re-verify before rename: fresh)" "$LOG" \
+  && ok "CLAIM-ABORT (fresh) logged at the 3.3 position (reason map: fresh)" \
+  || bad "no CLAIM-ABORT (fresh) with the 'before rename' suffix"
+grep -q "lock re-verify after claim" "$LOG" \
+  && bad "the abort fired at step-2 (after claim) — the call-counter let call 1 trip, not the 3.3 lane" \
+  || ok "no step-2 (after claim) abort — call 1 passed; only the 3.3 lane aborted"
+grep -q "STOLE-BY-CLAIM" "$LOG" \
+  && bad "a rename installed the claim — the 3.3 fresh abort did not prevent the steal" \
+  || ok "no STOLE-BY-CLAIM — no rename onto the lock from the aborted attempt"
+grep -q "DISCOVERY-HOLD" "$LOG" \
+  && bad "spurious DISCOVERY-HOLD — the victim wrongly held after the 3.3 abort" \
+  || ok "no false hold — the discovery read ran and the victim did not wrongly hold"
+[ -e "$LOCK.next" ] && bad "claim leftover immediately after the 3.3 fresh abort" \
+                    || ok "claim deleted on the 3.3 fresh abort"
+rm -f "$LOCK"                       # the slow holder releases normally
+wait "$w38"; rc=$?
+[ "$rc" = 0 ] && ok "waiter re-polled past the 3.3 abort, then acquired/released (rc 0)" \
+              || bad "waiter rc=$rc after the slow holder released (want 0)"
+[ -e "$LOCK.next" ] && bad "claim leftover after the waiter finished" || ok "no claim leftover at exit"
+rm -f "$LOCK" "$LOCK.next"
+
+
+echo "== Test 39: foreign claim at recheck — left intact, discovery, no false 98 =="
+# After winning its claim and passing step-2 re-verify, the claimant rechecks
+# its OWN claim file before installing. The `gone` recheck leg is covered (Test
+# 25 recheck-gone / Test 32); the `foreign` leg is NOT: a waiter judged our
+# claim abandoned, cleared it, and a RIVAL re-claimed in its place, so the
+# recheck reads back a FOREIGN token at the claim path. The claimant must then
+# LEAVE the rival's claim alone, run the ownership-discovery read (the lock is
+# still the ghost, not ours -> no hold), and back off to re-poll — never a 98
+# (a mere claim recheck carries NO stolen-lease semantics) and never a deletion
+# of the rival's claim.
+#
+# Steering (Test 24/25 idiom): clone _lock_claim_state and, on the FIRST recheck
+# only (fire-once via a flag FILE so a subshell can't lose the state), overwrite
+# <lock>.next with a fresh-mtime foreign "tok.rival.*" token before delegating
+# to the original — exactly what a waiter-cleared + rival-reclaimed claim path
+# looks like. The original then classifies it `foreign`. CLAIM_STALE is large
+# and MAX_WAIT small so the freshly-planted rival claim is never aged out: it
+# survives, the create on the next poll loses to it, and the waiter times out
+# 97. Mutation check: an implementation that 98'd on a foreign recheck, or that
+# deleted/overwrote the rival's claim, or that false-HELD, fails the asserts.
+LOCK="$WORK/foreign-recheck.lock"; LOG="$WORK/foreign-recheck.log"; : > "$LOG"
+fabricate_lock "$LOCK" "tok.ghost.t39" "pid=9 host=ghost"; backdate "$LOCK" 9999
+SF="$LOCK.steered"; RIVAL="tok.rival.t39.deadbeef"; rm -f "$SF"
+AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" AGENT_LOCK_STALE_SECS=1 \
+  AGENT_LOCK_CLAIM_STALE_SECS=600 AGENT_LOCK_POLL_SECS=0.1 AGENT_LOCK_MAX_WAIT=3 \
+  SF="$SF" RIVAL="$RIVAL" \
+  bash -c '
+    source "$1" || exit 70
+    clone_fn _lock_claim_state _cs_orig
+    _lock_claim_state() {
+      # Fire ONCE, at the post-win recheck of OUR claim: a waiter cleared ours
+      # and a rival re-claimed. Plant the rival token (fresh mtime => not stale)
+      # then classify via the real function.
+      if [ ! -e "$SF" ] && [ "$1" = "$_LOCK_CLAIM_TOKEN" ] \
+         && [ "$_LOCK_CLAIM_PATH" -ef "$AGENT_LOCK_PATH.next" ] 2>/dev/null; then
+        : > "$SF"
+        printf "%s\n%s\n" "$RIVAL" "pid=4242 host=rival" > "$_LOCK_CLAIM_PATH"
+      fi
+      _cs_orig "$@"
+    }
+    lock_acquire
+    exit $?
+  ' _ "$LIB" 2>/dev/null; rc=$?
+
+# The foreign-recheck branch ran (its log line is the proof the leg executed).
+grep -q "claim recheck: foreign token '$RIVAL' at the claim" "$LOG" \
+  && ok "foreign-recheck branch ran (rival token left at the claim, discovery read)" \
+  || bad "no foreign-recheck log line — branch not executed"
+# A mere claim recheck must NEVER report a stolen-lease 98.
+[ "$rc" = 98 ] && bad "false 98 on a foreign CLAIM recheck (no lease was ever held)" \
+              || ok "no false 98 on the foreign claim recheck (rc=$rc)"
+# No hold was ever taken: discovery saw the ghost, not our token.
+grep -q "DISCOVERY-HOLD" "$LOG" && bad "false discovery-HOLD on the foreign recheck" \
+                               || ok "no false hold (ownership-discovery read found the ghost, not ours)"
+grep -q "STOLE-BY-CLAIM" "$LOG" && bad "claimant stole despite a foreign claim at recheck" \
+                                || ok "no STOLE-BY-CLAIM — claimant backed off the foreign claim"
+# The rival's claim file SURVIVES, unmodified (left intact, never deleted).
+[ -e "$LOCK.next" ] && ok "rival's foreign claim file still present (not deleted)" \
+                    || bad "rival's foreign claim was deleted — must be left alone"
+rl1=""; IFS= read -r rl1 < "$LOCK.next" 2>/dev/null || true
+[ "$rl1" = "$RIVAL" ] && ok "rival's claim token intact (untouched: $rl1)" \
+                      || bad "rival's claim token modified (line1=$rl1, want $RIVAL)"
+grep -q "CLAIM-STALE-CLEARED" "$LOG" && bad "claimant aged-out/cleared the rival's fresh claim" \
+                                     || ok "rival's fresh claim never cleared as stale"
+# Clean outcome: the lock was never acquired; the waiter timed out (97).
+[ "$rc" = 97 ] && ok "waiter re-polled past the foreign claim and timed out cleanly (97)" \
+              || bad "rc=$rc (want 97 — clean re-poll/timeout behind the surviving rival claim)"
+# The ghost lock is untouched (never stolen).
+gl1=""; IFS= read -r gl1 < "$LOCK" 2>/dev/null || true
+[ "$gl1" = "tok.ghost.t39" ] && ok "ghost lock untouched by the foreign-recheck backoff" \
+                             || bad "ghost lock modified (line1=$gl1)"
+rm -f "$LOCK" "$LOCK.next" "$SF"
+
+echo "== Test 40: exec-bypass boundary — exec in the lock-holding shell skips release (OOS-5); exec in a child does not =="
+# `lock_run` runs the wrapped command vector with `"$@"` IN THE WRAPPER SHELL
+# (git-commit-lock.sh), so a command that is itself an `exec` REPLACES the
+# lock-holding wrapper process: the trailing `lock_release` AND the EXIT trap
+# are both skipped, and the lock is left held with no RELEASED logged. This is
+# the one interleaving that can SILENTLY lose an update (guarantees.md OOS-5) —
+# this test pins the exact boundary so a future change to the release/trap
+# wiring can't quietly widen or close it without a red.
+
+# (a1) BYPASS: `run -- exec true` — the wrapped command IS an exec, so it
+# replaces the wrapper. Release + EXIT trap are skipped: lock LEFT, no RELEASED
+# (ACQUIRED proves the hold was taken, so "no RELEASED" means the trap really
+# was bypassed, not that nothing ran).
+LOCK="$WORK/t40.bypass.lock"; LOG="$WORK/t40.bypass.log"; : > "$LOG"
+AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" bash "$LIB" run -- exec true; rc=$?
+[ "$rc" = 0 ] && ok "run -- exec true exits 0 (the exec'd command's code)" \
+              || bad "run -- exec true rc=$rc (want 0)"
+grep -q ACQUIRED "$LOG" && ok "run -- exec true did take the lock (ACQUIRED logged)" \
+                        || bad "run -- exec true: no ACQUIRED — the hold never happened, test is vacuous"
+[ -e "$LOCK" ] && ok "run -- exec true LEFT the lock file (release bypassed by exec)" \
+               || bad "run -- exec true: lock released — exec did NOT bypass (boundary changed)"
+grep -q RELEASED "$LOG" && bad "run -- exec true logged RELEASED — the EXIT trap was NOT skipped (boundary changed)" \
+                        || ok "run -- exec true logged NO RELEASED (EXIT trap skipped — OOS-5 boundary)"
+rm -f "$LOCK"
+
+# (a2) CONTROL — NO bypass: `run -- bash -c 'exec true'` — the exec replaces the
+# CHILD, not the wrapper, so the wrapper releases normally: lock GONE, RELEASED
+# logged. The opposite outcome to (a1) is the whole point; assert both so the
+# test documents the exact boundary.
+LOCK="$WORK/t40.child.lock"; LOG="$WORK/t40.child.log"; : > "$LOG"
+AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" bash "$LIB" run -- bash -c 'exec true'; rc=$?
+[ "$rc" = 0 ] && ok "run -- bash -c 'exec true' exits 0" \
+              || bad "run -- bash -c 'exec true' rc=$rc (want 0)"
+[ -e "$LOCK" ] && bad "run -- bash -c 'exec true' LEFT the lock — exec in a child must NOT bypass" \
+               || ok "run -- bash -c 'exec true' released the lock (exec in a child does not bypass)"
+grep -q RELEASED "$LOG" && ok "run -- bash -c 'exec true' logged RELEASED (the control: release ran)" \
+                        || bad "run -- bash -c 'exec true' logged NO RELEASED — the control case did not release"
+rm -f "$LOCK"
+
+# (a3) REALISTIC sourced bypass: `lock_acquire; exec true` in a sourcing shell
+# (a subshell so it can't take the suite down) — the holder execs away before
+# release, leaving the lock held. This is the shape a real caller hits if it
+# execs while holding instead of calling lock_release.
+LOCK="$WORK/t40.sourced.lock"; LOG="$WORK/t40.sourced.log"; : > "$LOG"
+( AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" bash -c '
+    source "$1" || exit 70
+    lock_acquire || exit 72
+    exec true
+  ' _ "$LIB" ); rc=$?
+[ "$rc" = 0 ] && ok "sourced lock_acquire; exec true exits 0" \
+              || bad "sourced lock_acquire; exec true rc=$rc (want 0)"
+[ -e "$LOCK" ] && ok "sourced lock_acquire; exec true LEFT the lock held (release skipped)" \
+               || bad "sourced lock_acquire; exec true released the lock — exec did not bypass"
+grep -q RELEASED "$LOG" && bad "sourced exec-while-holding logged RELEASED — the trap was not skipped" \
+                        || ok "sourced exec-while-holding logged NO RELEASED (release + trap skipped)"
+rm -f "$LOCK"
+
+# (b) SILENT-LOSS boundary: a DISPLACED holder that execs a 0-exit is UNWARNED.
+# Build a holder H that (sourced) acquires, backdates its OWN lock ancient so a
+# contender steals it (H is now displaced — a rival token sits at the path),
+# then execs a 0-exit. Because the exec skips BOTH release and the EXIT trap,
+# the displacement-detection in lock_release NEVER runs: H exits 0 with no
+# WARNING and no 98. This is exactly the documented silent boundary (OOS-5): a
+# non-unwinding exit while displaced cannot report that the hold was not
+# exclusive. (backdate/epoch_to_stamp are export -f'd by the preamble, so the
+# steering shell inherits them.)
+LOCK="$WORK/t40.silent.lock"; LOG="$WORK/t40.silent.log"; : > "$LOG"
+AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" AGENT_LOCK_STALE_SECS=1 \
+  AGENT_LOCK_POLL_SECS=0.1 AGENT_LOCK_MAX_WAIT=10 bash -c '
+    source "$1" || exit 70
+    lock_acquire || exit 72             # H holds the lock
+    backdate "$2" 9999                  # H'"'"'s own lock now ancient -> instantly stealable
+    # A contender steals it (separate process) — H is displaced once a rival
+    # token lands at the path.
+    AGENT_LOCK_PATH="$2" AGENT_LOCK_LOG="$3" AGENT_LOCK_STALE_SECS=1 \
+      AGENT_LOCK_POLL_SECS=0.1 AGENT_LOCK_MAX_WAIT=10 \
+      bash "$1" run -- true
+    exec true                           # H execs 0 — neither release nor trap runs
+  ' _ "$LIB" "$LOCK" "$LOG"; rc=$?
+[ "$rc" = 0 ] && ok "displaced holder's exec-0 exits 0 (no unwinding ran)" \
+              || bad "displaced holder's exec-0 rc=$rc (want 0)"
+grep -q "STOLE-BY-CLAIM" "$LOG" \
+  && ok "the contender genuinely displaced H (STOLE-BY-CLAIM logged) — H WAS displaced" \
+  || bad "no STOLE-BY-CLAIM — H was not actually displaced, the (b) premise is gone"
+grep -q "lock LOST" "$LOG" \
+  && bad "H logged a 'lock LOST' displacement WARNING — the exec did NOT skip release/trap" \
+  || ok "displaced holder's exec-0 emitted NO 'lock LOST' WARNING (silent boundary — OOS-5)"
+grep -q "WARNING" "$LOG" \
+  && bad "an unexpected WARNING was logged by the displaced exec-0 holder" \
+  || ok "displaced holder's exec-0 emitted NO WARNING at all (unwarned silent loss)"
+rm -f "$LOCK"
+
 # NOTES (deliberately untested here):
 # * lock_release's LEFTOVER lane (the unlink blocked persistently) needs a
 #   foreign no-delete-share handle on the lock file — Windows-only, and the

From dee154342dbd01efa00e5b43440cfd35c8db7649 Mon Sep 17 00:00:00 2001
From: Ben Toner <bentoner@bentoner.com>
Date: Thu, 18 Jun 2026 01:42:09 +1000
Subject: [PATCH 32/76] Bucket 2A waves 2-3: steering tests 41-47 (Tier-A
 coverage complete)

Seven more deterministic-steering unit tests (each drafted + self-validated by a
sub-agent against a faithful harness, then re-validated together by me + the full suite):
- Test 41 (A5): forward clock jump steals a live lock -> detected 98, never silent (E2).
- Test 42 (A6): mtime unreadable -> staleness disabled, fail-safe no-steal, warn-once,
  97 (E3). Shadows the INNER _lock_stat_mtime (NOT _lock_path_mtime, which emits the warning).
- Test 43 (A7): malformed/unreadable lock content at the poll guard -- #18 blank line 1
  ("not lock-shaped"), #17 unreadable steal-read ("steal skipped ... unreadable") -- never stolen.
- Test 44 (A8): socket & device-node wrong-type arms -> refused, 97 (socket arm POSIX/CI-gated;
  device-node arm runs everywhere via /dev/null, proven non-destructive).
- Test 45 (A9): log self-truncates past ~1 MB (truncation in place, no rotated backup), with a sub-threshold negative control.
- Test 46 (A10): EXIT while waiting (no hold) -> the no-hold trap arc, no spurious release.
  kcov-confirmed it flips :1009/:1017/:1018 from hits=0; corrected from a wrong initial recipe
  (a post-97 exit has the EXIT trap already restored, so it can't reach this arc).
- Test 47 (A11): the no-mv-T rename-over fallback (BSD/macOS lane) forced via _LOCK_MVT=0 ->
  steal still installs; the [ -d ] guard refuses a directory. Lane proven via an mv trace
  (bare mv vs mv -T).
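
The function-shadowing steer used by Tests 41 and 42 can be sketched roughly as
below. This is an illustration only: clone_fn lives in the test preamble, which
this patch does not show, so the body here is a hypothetical stand-in, and the
fixed-value _lock_now is an assumed placeholder for the library's real clock.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the suite's clone_fn helper (real body not shown
# in this patch): copy function $1 under the new name $2.
clone_fn() {
  local body
  body="$(declare -f "$1")" || return 1
  # declare -f prints "name () { ... }"; swap the leading name for the new one.
  eval "$2${body#"$1"}"
}

# Assumed placeholder for the library clock; fixed so the sketch is deterministic.
_lock_now() { echo 1000; }

# Keep the original reachable, then shadow it with a forward-jumped clock,
# the same shape Test 41's waiter uses.
clone_fn _lock_now _now_orig
_lock_now() { echo $(( $(_now_orig) + 9999 )); }

echo "real=$(_now_orig) jumped=$(_lock_now)"
```

Cloning before redefining is what lets the shadow defer to the real
implementation for everything except the behaviour under test.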

Full unit suite: 311 passed, 0 failed, 1..311 consistent (REDUCED). No product change.
Bucket 2A (the 11 Tier-A steering gaps from steering-coverage.md) is complete.
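
Test 43's caller-gated shadow of the read builtin might look like the sketch
below. The function names guarded and other are hypothetical stand-ins (guarded
plays the role of lock_acquire); only FUNCNAME and `builtin read` are real bash
features.

```shell
#!/usr/bin/env bash
# Shadow the read builtin: fail only when the DIRECT caller is `guarded`
# (hypothetical stand-in for lock_acquire); all other callers reach the real
# builtin via `builtin read`.
read() {
  if [ "${FUNCNAME[1]:-}" = guarded ]; then return 1; fi
  builtin read "$@"
}

# Two callers performing the identical read; only the first is perturbed.
guarded() {
  local l
  if IFS= read -r l <<< "tok.x"; then echo "guarded: $l"; else echo "guarded: read failed"; fi
}
other() {
  local l
  if IFS= read -r l <<< "tok.x"; then echo "other: $l"; else echo "other: read failed"; fi
}

guarded
other
```

Gating on FUNCNAME[1] keeps the perturbation to a single call site, so helper
functions that also call read are left intact.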

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 tests/git-commit-lock.test.sh | 553 ++++++++++++++++++++++++++++++++++
 1 file changed, 553 insertions(+)

diff --git a/tests/git-commit-lock.test.sh b/tests/git-commit-lock.test.sh
index c44f8ae..aca7323 100755
--- a/tests/git-commit-lock.test.sh
+++ b/tests/git-commit-lock.test.sh
@@ -2497,6 +2497,559 @@ grep -q "WARNING" "$LOG" \
   || ok "displaced holder's exec-0 emitted NO WARNING at all (unwarned silent loss)"
 rm -f "$LOCK"
 
+echo "== Test 41: forward clock jump steals a live lock — detected as 98, never silent (E2) =="
+# Staleness is age = now - mtime (git-commit-lock.sh ~:928, ~:1409), where `now`
+# is _lock_now. A process whose clock has LEAPED FORWARD computes an inflated age
+# for everyone's lock, so it can judge a LIVE, fresh lock ancient and steal it.
+# This is correctness-safe but liveness-degraded: it degrades into the already-
+# handled robbed-holder lane (Test 4b) — the displaced holder DETECTS the theft
+# at release and exits 98 with a loud WARNING; it never silently double-commits.
+#
+# Steering (no real sleep/backdate): holder H acquires and HOLDS a fresh lock on
+# a NORMAL clock. Waiter W has _lock_now shadowed to return the real now PLUS a
+# large offset (+9999s), so H's just-created lock looks ~9999s old to W and W
+# steals it. STALE=100 means the lock is genuinely fresh under a normal clock
+# (without the jump W would block, never steal — the jump is what's causal);
+# CLAIM_STALE=99999 keeps W's own just-created claim (also judged ~9999s old by
+# W's jumped clock) well under the claim-stale window, so W's recheck does not
+# self-abort (contested) and the steal proceeds to rename.
+LOCK="$WORK/fwdjump.lock"; LOG="$WORK/fwdjump.log"; : > "$LOG"; OUT="$WORK/fwdjump-out"; : > "$OUT"
+READY="$WORK/t41.ready"; TDONE="$WORK/t41.thief-done"
+# Holder H (sourced, NORMAL clock): create+hold a fresh lock, signal READY, hold
+# until told the waiter is done, then release and exit with the release rc.
+AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" AGENT_LOCK_STALE_SECS=100 \
+  AGENT_LOCK_CLAIM_STALE_SECS=99999 AGENT_LOCK_POLL_SECS=0.1 AGENT_LOCK_MAX_WAIT=120 \
+  bash -c '
+    source "$1" || exit 70
+    lock_acquire || exit 72
+    echo h-work >> "$2"
+    touch "$3"
+    until [ -e "$4" ]; do sleep 0.05; done
+    lock_release
+    exit $?
+  ' _ "$LIB" "$OUT" "$READY" "$TDONE" &
+hpid=$!
+wait_for_file "$READY" || bad "T41 holder never signalled ready (lock not held)"
+# Waiter W (sourced, clock JUMPED +9999s): _lock_now returns real now + offset, so
+# every age it computes is inflated and H's fresh lock reads as ancient. W acquires
+# (by stealing) then releases; run in the FOREGROUND so its rc is captured.
+AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" AGENT_LOCK_STALE_SECS=100 \
+  AGENT_LOCK_CLAIM_STALE_SECS=99999 AGENT_LOCK_POLL_SECS=0.1 AGENT_LOCK_MAX_WAIT=30 \
+  bash -c '
+    source "$1" || exit 70
+    clone_fn _lock_now _now_orig
+    _lock_now() { echo $(( $(_now_orig) + 9999 )); }
+    lock_acquire || exit 72
+    echo w-work >> "$2"
+    lock_release
+    exit $?
+  ' _ "$LIB" "$OUT"
+wpid_rc=$?
+touch "$TDONE"
+wait "$hpid"; h_rc=$?
+# W judged H's live, fresh lock ancient under the jumped clock and stole it.
+grep -q "STOLE-BY-CLAIM" "$LOG" \
+  && ok "forward-jumped waiter stole a LIVE fresh lock (STOLE-BY-CLAIM)" \
+  || bad "no STOLE-BY-CLAIM — jumped waiter did not steal the live lock"
+[ "$wpid_rc" = 0 ] && ok "thief (its own fresh hold) released cleanly (rc 0)" \
+                   || bad "thief rc=$wpid_rc (its own fresh hold should release 0)"
+grep -q w-work "$OUT" && ok "thief did its work" || bad "thief work missing"
+# The proof: the premature steal was DETECTED, not silent — H exits exactly 98.
+[ "$h_rc" = 98 ] && ok "robbed holder detected the premature steal — exits exactly 98" \
+                 || bad "robbed holder rc=$h_rc (forward-jump steal must degrade to 98, never silent)"
+grep -q "WARNING: lock LOST" "$LOG" \
+  && ok "robbed holder logged a loud theft WARNING (no silent double-commit)" \
+  || bad "no theft WARNING logged for the forward-jump steal"
+rm -f "$LOCK" "$LOCK.next"
+
+echo "== Test 42: mtime unreadable — staleness disabled, fail-safe (no steal), warn-once, 97 (E3) =="
+# §E3: if the lock file's mtime cannot be read AT ALL (every probe fails on a
+# PRESENT file), staleness detection is BROKEN. The mtime floor fails closed to
+# "fresh": _lock_verify_stale returns state=fresh, so a crashed/stale holder is
+# NEVER stolen — recovery is disabled and waiters block to MAX_WAIT (97). The
+# tool must say so LOUDLY, exactly once per process. Test 1 only asserts the
+# NEGATIVE (the warning must NOT fire under healthy contention); this drives the
+# positive lane.
+#
+# Steering: shadow _lock_stat_mtime — the INNER single-probe (sh:606, runs
+# stat/date and prints the epoch) — to return EMPTY for the LOCK path while it
+# is PRESENT. We must NOT shadow _lock_path_mtime (sh:629): that is the 3x-retry
+# wrapper that EMITS the warn-once, so shadowing it would remove the very
+# warning we assert. With the inner probe empty on a present file,
+# _lock_path_mtime retries 3x, sees the file present-but-unreadable, fires the
+# warn-once and sets _LOCK_MTIME="" -> _lock_verify_stale -> fresh -> no steal.
+# The shadow returns empty ONLY for the lock path: _lock_stat_mtime is also used
+# for the CLAIM file's mtime (sh:1120/1230), which must keep working, and other
+# paths fall through to the real probe.
+T42_LOCK="$WORK/t42.lock"; T42_LOG="$WORK/t42.log"; T42_ERR="$WORK/t42.err"
+: > "$T42_LOG"; : > "$T42_ERR"
+# A STALE ghost that WOULD normally be stolen (backdated 9999s, well past STALE):
+# the whole point is that it is NOT stolen because its age can't be established.
+fabricate_lock "$T42_LOCK" "tok.ghost.t42.99999" "pid=99999 host=ghost"
+backdate "$T42_LOCK" 9999
+T42_INNER='
+  source "$1" || exit 70
+  clone_fn _lock_stat_mtime _sm_orig
+  # Return EMPTY for the present lock path; defer to the real probe otherwise
+  # (the claim-file mtime at sh:1120/1230 must stay readable).
+  _lock_stat_mtime() {
+    if [ "$1" = "$AGENT_LOCK_PATH" ]; then printf ""; return 0; fi
+    _sm_orig "$@"
+  }
+  lock_acquire; exit $?
+'
+# Tight timing: small MAX_WAIT so the blocked waiter reaches 97 in ~2-3s.
+AGENT_LOCK_PATH="$T42_LOCK" AGENT_LOCK_LOG="$T42_LOG" AGENT_LOCK_STALE_SECS=2 \
+  AGENT_LOCK_POLL_SECS=0.1 AGENT_LOCK_MAX_WAIT=2 \
+  bash -c "$T42_INNER" _ "$LIB" 2>"$T42_ERR"; t42_rc=$?
+
+# (1) The fail-safe lane ran: the warn-once line appears. It is logged via
+#     _lock_log (lock log) AND echoed to stderr; assert either surface.
+if grep -q "Staleness detection is BROKEN" "$T42_LOG" "$T42_ERR" 2>/dev/null \
+   || grep -q "cannot read the lock file's mtime" "$T42_ERR" 2>/dev/null; then
+  ok "mtime-unreadable: 'Staleness detection is BROKEN' fail-safe warning fired"
+else
+  bad "mtime-unreadable: no broken-staleness warning (fail-safe lane did not run); err=$(cat "$T42_ERR")"
+fi
+# (2) NO steal: the stale ghost is NOT stolen and is left in place.
+if grep -q "STOLE" "$T42_LOG" 2>/dev/null; then   # "STOLE" also matches STOLE-BY-CLAIM
+  bad "mtime-unreadable: ghost was STOLEN — staleness should have been disabled"
+else
+  ok "mtime-unreadable: no steal (recovery disabled, ghost not stolen)"
+fi
+g42="$(head -n 1 -- "$T42_LOCK" 2>/dev/null | tr -d '\r')"
+[ "$g42" = "tok.ghost.t42.99999" ] \
+  && ok "mtime-unreadable: stale ghost lock left in place (token unchanged)" \
+  || bad "mtime-unreadable: ghost lock disturbed (line1=$g42, want tok.ghost.t42.99999)"
+# (3) The waiter blocks to MAX_WAIT and exits 97 (recovery disabled).
+[ "$t42_rc" = 97 ] \
+  && ok "mtime-unreadable: waiter blocked to MAX_WAIT and exited 97" \
+  || bad "mtime-unreadable: waiter rc=$t42_rc (want 97 — was the stale ghost stolen?)"
+# (4) Warn-once: the warning appears at most once on stderr ((1) proves it fired).
+t42_warns="$(grep -c "Staleness detection is BROKEN" "$T42_ERR" 2>/dev/null || true)"
+[ "${t42_warns:-0}" -le 1 ] \
+  && ok "mtime-unreadable: broken-staleness warning fired at most once on stderr (${t42_warns:-0})" \
+  || bad "mtime-unreadable: warning repeated ($t42_warns times; warn-once broken)"
+rm -f "$T42_LOCK" "$T42_LOCK.next"
+
+echo "== Test 43: malformed/unreadable lock content at the poll guard — never stolen, warned/skipped =="
+# Two sibling branches of the in-acquire steal CONTENT GUARD (git-commit-lock.sh
+# ~:1419-1444), both gated on an already-stale candidate, neither of which the
+# torn/empty/tok.-prefixed cases (Tests 17/18) reach:
+#   (a) #18 — line 1 is NON-EMPTY but BLANK (whitespace/CR only): the trim at
+#       :1421 reduces it to empty, but the file is NOT empty (`-s` true) and the
+#       read SUCCEEDED, so it lands in the final `else` -> _lock_warn_nonlock
+#       "its content is not lock-shaped" (the `is not a lock file` config
+#       warning). NO steal; waiters reach 97.
+#   (b) #17 — the content read FAILS on a present, non-empty regular file (the
+#       `[ "$rdrc" -ne 0 ]` lane at :1432): logs "steal skipped: stale lock
+#       content unreadable"; NO steal; waiters reach 97. We can't make a real
+#       file unreadable on every platform (a chmod-000 file still reads for its
+#       owner on Windows/Cygwin), so we STEER it: source the lib in-process and
+#       shadow the `read` builtin to fail ONLY for the inline steal-guard read,
+#       identified by its direct caller `lock_acquire` (FUNCNAME[1]) — the
+#       _lock_read_tok / _lock_verify_stale reads delegate to `builtin read`, so
+#       only the :1420 site is perturbed.
+
+# (a) #18 — whitespace-only line 1: non-empty, blank, read OK -> never stolen, warned.
+LOCK="$WORK/t43blank.lock"; LOG="$WORK/t43blank.log"; : > "$LOG"
+printf ' \n' > "$LOCK"; backdate "$LOCK" 9999          # one space + LF: non-empty, blank line 1
+AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" AGENT_LOCK_STALE_SECS=1 \
+  AGENT_LOCK_POLL_SECS=0.1 AGENT_LOCK_MAX_WAIT=2 \
+  bash "$LIB" run -- bash -c 'true' 2> "$WORK/t43a.err"; rc=$?
+[ "$rc" = 97 ] && ok "#18 blank line 1: waiter timed out (97) instead of stealing" \
+               || bad "#18 blank line 1: rc=$rc (want 97)"
+grep -q "is not a lock file" "$WORK/t43a.err" \
+  && ok "#18 config warning fired (line 1 not lock-shaped)" || bad "#18 no config warning for blank line 1"
+grep -q "non-lock object at lock path (its content is not lock-shaped)" "$LOG" \
+  && ok "#18 log records the non-lock-shaped classification (branch ran)" \
+  || bad "#18 missing the non-lock-shaped log line (branch did not run)"
+grep -q "STOLE" "$LOG" && bad "#18 blank-content file was STOLEN" || ok "#18 no steal of the blank-content file"
+[ -f "$LOCK" ] && ok "#18 blank-content file left in place" || bad "#18 blank-content file was removed"
+rm -f "$LOCK"
+
+# (b) #17 — steal-guard content read FAILS on a present, non-empty file.
+# Steering shell: source the lib, shadow the `read` builtin to fail ONLY when
+# invoked directly by lock_acquire (the inline steal read at sh:1420). The ghost
+# is tok.-prefixed and ancient, so absent the shadow it WOULD be stolen — the
+# 97 outcome plus the "steal skipped ... unreadable" line prove the failed-read
+# lane (not some other refusal) is what blocked the steal.
+LOCK="$WORK/t43unread.lock"; LOG="$WORK/t43unread.log"; : > "$LOG"
+fabricate_lock "$LOCK" "tok.ghost.t43" "pid=9 host=ghost"; backdate "$LOCK" 9999
+AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" AGENT_LOCK_STALE_SECS=1 \
+  AGENT_LOCK_POLL_SECS=0.1 AGENT_LOCK_MAX_WAIT=2 \
+  bash -c '
+    source "$1" || exit 70
+    # Shadow the read builtin; reach the real one via `builtin read`. Fail only
+    # the steal-guard read (its direct caller is lock_acquire) so the
+    # _lock_read_tok / _lock_verify_stale reads stay intact.
+    read() {
+      if [ "${FUNCNAME[1]:-}" = lock_acquire ]; then return 1; fi
+      builtin read "$@"
+    }
+    lock_acquire || exit 97
+    lock_release || exit 74
+    exit 0
+  ' _ "$LIB" 2> "$WORK/t43b.err"; rc=$?
+[ "$rc" = 97 ] && ok "#17 unreadable steal content: waiter timed out (97) instead of stealing" \
+               || bad "#17 unreadable steal content: rc=$rc (want 97)"
+grep -q "steal skipped: stale lock content unreadable" "$LOG" \
+  && ok "#17 log records the skipped steal (unreadable branch ran)" \
+  || bad "#17 missing the 'steal skipped ... unreadable' log line (branch did not run)"
+grep -q "STOLE" "$LOG" && bad "#17 ghost was STOLEN despite the unreadable content read" \
+                       || ok "#17 no steal while the steal-guard read fails"
+[ -f "$LOCK" ] && ok "#17 stale ghost left in place" || bad "#17 stale ghost was removed"
+rm -f "$LOCK"
+
+echo "== Test 44: socket & device-node at the lock path — never stolen/deleted, refused (97) =="
+# The never-steal wrong-type guard (git-commit-lock.sh ~:1557-1567) classifies
+# NON-regular objects at the lock path so they are NEVER stolen and NEVER
+# deleted: a real config error (a typo'd AGENT_LOCK_PATH, a stray special file)
+# must wedge waiters to 97 with a loud one-time config warning, not get
+# clobbered. Test 17 covers the directory / symlink / FIFO arms of that
+# classifier; this test covers the two remaining arms — the SOCKET (-S) and the
+# DEVICE NODE (-b/-c) — both of which name their detected type in the warning.
+# For each: rc 97, the object survives unchanged (same type), the warning fires
+# naming the type, and nothing is ever stolen.
+
+# (a) a UNIX-DOMAIN SOCKET at the lock path. Fabricated with a backgrounded
+# python3 AF_UNIX bind (the socket inode persists while the process holds it);
+# skipped unless a real socket can be both created AND classified -S by the
+# running shell. It cannot on default Git-Bash on Windows, whose bundled python
+# is a native build with no socket.AF_UNIX (probed: bind raises AttributeError,
+# so no inode appears). CI's POSIX legs exercise this arm. The listener is
+# reaped by its EXACT pid at the end (never by name).
+LOCK="$WORK/sock.lock"; LOG="$WORK/sock.log"; : > "$LOG"
+SOCKERR="$WORK/sock.py.err"; sock_pid=""; sock_ok=0
+if command -v python3 >/dev/null 2>&1; then
+  rm -f "$LOCK"
+  python3 -c 'import socket,sys,time
+s=socket.socket(socket.AF_UNIX)
+s.bind(sys.argv[1])
+sys.stderr.write("bound\n"); sys.stderr.flush()
+time.sleep(30)' "$LOCK" 2> "$SOCKERR" &
+  sock_pid=$!
+  # Gate on the socket actually existing AND classifying -S (not just the pid
+  # being alive): on a no-AF_UNIX build the process exits immediately with no
+  # inode, so we must positively confirm the object before relying on it.
+  for _ in $(seq 1 100); do
+    [ -S "$LOCK" ] && { sock_ok=1; break; }
+    kill -0 "$sock_pid" 2>/dev/null || break
+    sleep 0.05
+  done
+fi
+if [ "$sock_ok" = 1 ]; then
+  AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" AGENT_LOCK_STALE_SECS=1 \
+    AGENT_LOCK_POLL_SECS=0.1 AGENT_LOCK_MAX_WAIT=3 \
+    bash "$LIB" run -- bash -c 'true' 2> "$WORK/t44a.err"; rc=$?
+  [ "$rc" = 97 ] && ok "socket at lock path: waiter timed out (97), command never ran" \
+                 || bad "socket at lock path: rc=$rc (want 97)"
+  [ -S "$LOCK" ] && ok "socket untouched (never stolen/deleted, still a socket)" \
+                 || bad "socket at lock path was removed/replaced!"
+  grep -q "is not a lock file" "$WORK/t44a.err" && ok "loud config warning on stderr (socket)" \
+                                                || bad "no config warning for socket at lock path"
+  grep -q "it is a socket" "$WORK/t44a.err" && ok "warning names the detected type (socket)" \
+                                            || bad "warning does not name the socket type"
+  n="$(grep -c "is not a lock file" "$WORK/t44a.err")"
+  [ "$n" = 1 ] && ok "socket config warning fired exactly once per process (got $n)" \
+               || bad "socket config warning fired $n times (want 1)"
+  grep -q STOLE "$LOG" && bad "socket was STOLEN" || ok "no steal attempted on a socket"
+else
+  echo "note: cannot create a unix-domain socket here (no socket.AF_UNIX / not classified -S) — socket guard not exercised (CI POSIX legs cover it)"
+fi
+# Reap the listener by ITS exact pid only (bounded wait, then hard-kill of the
+# same pid as a last resort) — never by name. Harmless if it already exited.
+if [ -n "$sock_pid" ]; then
+  kill "$sock_pid" 2>/dev/null
+  for _ in $(seq 1 40); do kill -0 "$sock_pid" 2>/dev/null || break; sleep 0.05; done
+  kill -0 "$sock_pid" 2>/dev/null && kill -9 "$sock_pid" 2>/dev/null
+  wait "$sock_pid" 2>/dev/null
+fi
+rm -f "$LOCK"
+
+# (b) a DEVICE NODE at the lock path. mknod needs root, but /dev/null is a
+# character device that always exists, so we point AGENT_LOCK_PATH straight at
+# it: the -c arm of the classifier must refuse it. This is SAFE precisely
+# because the guard refuses — it is never opened-for-write, stolen, or deleted —
+# which the post-run assertion below proves (/dev/null is still a char device).
+# Skipped only if /dev/null somehow isn't a char device on this platform.
+if [ -c /dev/null ]; then
+  LOG="$WORK/dev.log"; : > "$LOG"
+  AGENT_LOCK_PATH="/dev/null" AGENT_LOCK_LOG="$LOG" AGENT_LOCK_STALE_SECS=1 \
+    AGENT_LOCK_POLL_SECS=0.1 AGENT_LOCK_MAX_WAIT=3 \
+    bash "$LIB" run -- bash -c 'true' 2> "$WORK/t44b.err"; rc=$?
+  [ "$rc" = 97 ] && ok "device node (/dev/null) at lock path: waiter timed out (97), command never ran" \
+                 || bad "device node at lock path: rc=$rc (want 97)"
+  [ -c /dev/null ] && ok "/dev/null untouched (never stolen/deleted, still a char device)" \
+                   || bad "/dev/null was damaged — the guard must NEVER touch a device node!"
+  grep -q "is not a lock file" "$WORK/t44b.err" && ok "loud config warning on stderr (device node)" \
+                                                || bad "no config warning for device node at lock path"
+  grep -q "it is a device node" "$WORK/t44b.err" && ok "warning names the detected type (device node)" \
+                                                 || bad "warning does not name the device-node type"
+  n="$(grep -c "is not a lock file" "$WORK/t44b.err")"
+  [ "$n" = 1 ] && ok "device-node config warning fired exactly once per process (got $n)" \
+               || bad "device-node config warning fired $n times (want 1)"
+  grep -q STOLE "$LOG" && bad "device node was STOLEN" || ok "no steal attempted on a device node"
+else
+  echo "note: /dev/null is not a char device here — device-node guard not exercised (CI POSIX legs cover it)"
+fi
+
+
+echo "== Test 45: log self-truncates past ~1 MB (truncation in place, not unbounded growth) =="
+# _lock_log truncates the log in place (no rotated backup) once it grows past
+# ~1MB: the size check at the top of _lock_log empties the file before the
+# write, so a normal log-producing op on an oversized log leaves a small,
+# well-formed log carrying only the fresh protocol lines. Pre-fill > 1MB, run
+# one clean acquire+release, assert the log SHRANK and the lock still worked.
+LOCK="$WORK/t45.lock"; LOG="$WORK/t45.log"
+# Pre-fill comfortably above the 1048576-byte (1MB) threshold (~1.2MB of 'x').
+head -c 1200000 /dev/zero | tr '\0' 'x' > "$LOG"
+before=$(wc -c < "$LOG")
+[ "$before" -gt 1048576 ] && ok "pre-fill exceeds the 1MB threshold (${before} bytes)" \
+                          || bad "pre-fill not over threshold (${before} bytes)"
+AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" bash "$LIB" run -- bash -c 'true'; rc=$?
+[ "$rc" = 0 ] && ok "lock op succeeded over an oversized log (rc=0)" \
+             || bad "lock op rc=$rc over oversized log (want 0)"
+after=$(wc -c < "$LOG")
+# Truncation fired iff the log is now far below the threshold (it holds only a
+# handful of fresh lines). Use 1MB as the boundary: any non-truncation leaves
+# it at/above the 1.2MB pre-fill.
+[ "$after" -lt 1048576 ] && ok "log shrank below threshold after the op (${before} -> ${after} bytes; truncation fired)" \
+                         || bad "log did NOT shrink (${before} -> ${after} bytes; truncation never fired)"
+# Well-formed: the new log carries the fresh protocol lines, not the old giant
+# 'x' content, and records the truncation.
+grep -q 'log exceeded 1MB; truncated' "$LOG" && ok "log records the self-truncation notice" \
+                                             || bad "no truncation notice in the restarted log"
+grep -q 'ACQUIRED' "$LOG" && grep -q 'RELEASED' "$LOG" \
+  && ok "restarted log carries fresh ACQUIRED + RELEASED protocol lines" \
+  || bad "restarted log missing fresh protocol lines (ACQUIRED/RELEASED)"
+grep -q 'xxxx' "$LOG" && bad "old oversized 'x' content survived into the restarted log" \
+                      || ok "old oversized content is gone (clean restart, not appended)"
+[ -e "$LOCK" ] && bad "lock left held after run" || ok "lock released after the over-threshold run"
+rm -f "$LOCK" "$LOG"
+
+echo "== Test 46: EXIT while waiting (no hold) — no-hold trap arc, no spurious release =="
+# A10 (steering-coverage.md): _lock_on_exit's no-hold arc-end (:1009,1017-1018).
+# A sourced waiter, blocked in the wait loop against a LIVE held lock, exits 0
+# while still parked — the EXIT trap is STILL '_lock_on_exit' (the timeout's
+# trap-restore has NOT run, because we never time out), so EXIT fires the
+# handler on the NO-HOLD path: claim-trap cleanup (no token => no-op),
+# leaked-resolve, restore traps. NO release semantics may run (we never held).
+#
+# Why interposition and not "lock_acquire times out 97 then exit": the 97
+# timeout path itself runs _lock_restore_traps BEFORE returning, so by the time
+# the caller exits the EXIT trap is already gone and _lock_on_exit never fires
+# (verified: post-97 `trap -p EXIT` is empty). To exercise the EXIT-while-
+# WAITING arc the process must leave the loop via `exit` with the trap still
+# armed — so W shadows `sleep` (called once per poll inside the wait loop) to
+# park on a marker, then `exit 0` from inside that first poll-sleep. At that
+# point _LOCK_HELD=0 and no claim is in flight (the live lock is never stale, so
+# no steal/claim was attempted), which is exactly the no-hold arc.
+T46_INNER='
+  source "$1" || exit 70
+  F46=0
+  sleep() {
+    if [ "$F46" = 0 ]; then
+      F46=1
+      command touch "$T46R"                 # signal: parked in the wait loop
+      until [ -e "$T46G" ]; do command sleep 0.05; done
+      # Record the live EXIT trap so the assertions can prove _lock_on_exit
+      # (not a bare/restored trap) is what fires on the exit below.
+      trap -p EXIT > "$T46T"
+      exit 0                                  # EXIT while waiting, no hold held
+    fi
+    command sleep "$@"
+  }
+  lock_acquire
+  echo "REACHED-UNEXPECTED rc=$?" >&2        # the shadowed sleep must exit first
+'
+LOCK="$WORK/exitwait.lock"; LOG="$WORK/exitwait.log"; : > "$LOG"
+HLOG="$WORK/exitwait.h.log"; : > "$HLOG"
+T46R="$WORK/t46.ready"; T46G="$WORK/t46.go"; T46T="$WORK/t46.trap"
+rm -f "$T46R" "$T46G" "$T46T" "$LOCK" "$LOCK.next"
+# H: holder — sourced, takes a FRESH live lock and parks until released. STALE is
+# huge so the lock is never judged stealable; W therefore stays a pure waiter.
+HR="$WORK/t46.hready"; HG="$WORK/t46.hgo"; rm -f "$HR" "$HG"
+HR="$HR" HG="$HG" \
+AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$HLOG" AGENT_LOCK_STALE_SECS=600 \
+  AGENT_LOCK_CLAIM_STALE_SECS=600 AGENT_LOCK_POLL_SECS=0.1 AGENT_LOCK_MAX_WAIT=60 \
+  bash -c '
+    source "$1" || exit 70
+    lock_acquire || exit 72
+    touch "$HR"
+    until [ -e "$HG" ]; do sleep 0.05; done
+    lock_release
+  ' _ "$LIB" 2>/dev/null &
+h46=$!
+wait_for_file "$HR" 30 || bad "T46 holder never acquired the lock"
+htok=""; IFS= read -r htok < "$LOCK" || true       # the live holder's token
+# W: the waiter that will exit while parked in the wait loop (no hold).
+T46R="$T46R" T46G="$T46G" T46T="$T46T" \
+AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" AGENT_LOCK_STALE_SECS=600 \
+  AGENT_LOCK_CLAIM_STALE_SECS=600 AGENT_LOCK_POLL_SECS=0.1 AGENT_LOCK_MAX_WAIT=60 \
+  bash -c "$T46_INNER" _ "$LIB" 2>/dev/null &
+w46=$!
+# Gate on W proving it reached the wait-loop poll (its WAITING line is logged,
+# and its shadowed sleep touched the ready marker) before releasing it to exit.
+wait_for_grep "WAITING for lock" "$LOG" 30 || bad "T46 waiter never logged WAITING"
+wait_for_file "$T46R" 30 || bad "T46 waiter never reached its wait-loop poll"
+touch "$T46G"
+wait "$w46"; rc=$?
+# Core assertion: W exited cleanly via the EXIT no-hold arc, with NO release
+# semantics — it never held the lock, so a RELEASED or a 98/'lock LOST' would
+# mean the handler wrongly ran the holding branch.
+[ "$rc" = 0 ] && ok "waiter exited 0 via the EXIT-while-waiting no-hold arc" \
+              || bad "T46 waiter rc=$rc (want 0; EXIT trap mishandled the no-hold arc?)"
+grep -q RELEASED "$LOG" && bad "spurious RELEASED on the no-hold EXIT arc (release ran without a hold)" \
+                        || ok "no RELEASED on the no-hold EXIT arc (no release semantics)"
+grep -q "lock LOST" "$LOG" && bad "98-classification ran on the no-hold EXIT arc" \
+                           || ok "no 98 classification on the no-hold EXIT arc"
+# The trap that fired was our handler, not a bare/restored one — this is the
+# discriminator that the EXIT-WHILE-WAITING arc ran (vs a post-97 exit, where
+# the trap is already empty). Mirrors Test 12d's trap-restoration idiom.
+grep -q "_lock_on_exit" "$T46T" && ok "EXIT trap still armed as _lock_on_exit at exit (no-hold arc, not post-97)" \
+                                || bad "EXIT trap was not _lock_on_exit at exit (got: $(cat "$T46T" 2>/dev/null))"
+# The waiter left no claim behind (it never claimed — the live lock is not stale).
+[ -e "$LOCK.next" ] && bad "waiter left a claim file behind on the no-hold EXIT arc" \
+                    || ok "no leftover claim from the no-hold EXIT waiter"
+# H's lock is untouched — still the holder's original token, still held.
+l1=""; IFS= read -r l1 < "$LOCK" 2>/dev/null || true
+[ -n "$htok" ] && [ "$l1" = "$htok" ] && ok "holder's lock untouched by the dying waiter (token intact)" \
+                                      || bad "holder's lock changed by the dying waiter (was=$htok now=$l1)"
+# Release H and confirm it shut down cleanly (no fallout from W's exit).
+touch "$HG"; wait "$h46" 2>/dev/null
+grep -q "lock LOST" "$HLOG" && bad "holder saw a stolen lease (98) — the waiter's exit disturbed the hold" \
+                            || ok "holder released its still-held lock cleanly (no 98)"
+rm -f "$LOCK" "$LOCK.next" "$T46R" "$T46G" "$T46T" "$HR" "$HG"
+
+echo "== Test 47: no-mv-T rename-over fallback (BSD/macOS lane) forced via _LOCK_MVT=0 — steal still installs =="
+# _lock_rename_over (git-commit-lock.sh ~:961-979) probes once for GNU `mv -T`
+# and caches the verdict in _LOCK_MVT (""=unprobed, 1=supported, 0=not). On
+# Linux/MINGW the probe ALWAYS picks `mv -T`, so the no-`-T` fallback lane
+# (~:976-977: a last-instant `[ -d "$dst" ]` guard + a bare `mv`) is NEVER
+# executed in CI except on a real BSD/macOS runner. Pre-seeding _LOCK_MVT=0 in
+# the sourced steal shell BEFORE any acquire makes the `[ -z "$_LOCK_MVT" ]`
+# probe short-circuit (the var is already non-empty), forcing the fallback on
+# the common leg. Two scenarios:
+#   (a) a normal steal of a stale ghost under _LOCK_MVT=0 installs the lock via
+#       the unlink-free bare-`mv` fallback (STOLE-BY-CLAIM, the steal acquires);
+#   (b) a DIRECTORY squatting the lock path under _LOCK_MVT=0 is refused by the
+#       fallback's `[ -d ]` last-instant guard (no clobber) — the fallback-path
+#       analogue of Test 37's `mv -T` natural refusal.
+# Determinism proof that the fallback truly ran (not GNU `mv -T`): scenario (a)
+# shadows `mv` to record, per invocation touching ".next", whether `-T` was
+# passed; under _LOCK_MVT=0 the steal's claim->lock rename MUST be a bare `mv`
+# (no `-T`). A control run WITHOUT the override is asserted to still steal, so a
+# pass cannot come from the override having silently broken the steal entirely.
+
+# ---- (a) forced-fallback steal of a stale ghost: STOLE-BY-CLAIM via bare mv ----
+LOCK="$WORK/mvt0.lock"; LOG="$WORK/mvt0.log"; : > "$LOG"
+MVTRACE="$WORK/mvt0.mvtrace"; : > "$MVTRACE"
+fabricate_lock "$LOCK" "tok.ghost.t47" "pid=9 host=ghost"; backdate "$LOCK" 9999
+# Sourced steal shell: pre-seed _LOCK_MVT=0, shadow `mv` to log the flags it was
+# called with on the ".next" (claim->lock) rename, then call the real `mv`.
+AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" AGENT_LOCK_STALE_SECS=2 \
+  AGENT_LOCK_CLAIM_STALE_SECS=600 AGENT_LOCK_POLL_SECS=0.1 AGENT_LOCK_MAX_WAIT=10 \
+  bash -c '
+    source "$1" || exit 70
+    _LOCK_MVT=0                                  # force the no-mv-T fallback lane
+    export MVTRACE_PATH="$2"                     # pass the trace path into mv() via env
+    mv() {
+      case "$*" in
+        *".next"*) printf "%s\n" "$*" >> "$MVTRACE_PATH" ;;  # record claim->lock rename flags
+      esac
+      command mv "$@"
+    }
+    lock_acquire || exit 72
+    lock_release || exit 74
+    exit 0
+  ' _ "$LIB" "$MVTRACE" 2>/dev/null; rc=$?
+[ "$rc" = 0 ] && ok "T47(a): forced-fallback steal acquired+released rc 0 (_LOCK_MVT=0)" \
+              || bad "T47(a): forced-fallback steal rc=$rc (want 0)"
+grep -q "STOLE-BY-CLAIM" "$LOG" \
+  && ok "T47(a): stale ghost stolen via the no-mv-T fallback (STOLE-BY-CLAIM logged)" \
+  || bad "T47(a): no STOLE-BY-CLAIM under _LOCK_MVT=0 — fallback did not install the lock"
+grep -q "ACQUIRED" "$LOG" && grep -q "RELEASED" "$LOG" \
+  && ok "T47(a): fallback steal produced a clean ACQUIRED/RELEASED pair" \
+  || bad "T47(a): missing ACQUIRED/RELEASED after the fallback steal"
+# The mv trace proves the fallback lane (bare mv, no -T) actually carried the
+# claim->lock rename — the whole point of forcing _LOCK_MVT=0.
+[ -s "$MVTRACE" ] \
+  && ok "T47(a): claim->lock rename went through the shadowed mv (trace non-empty)" \
+  || bad "T47(a): no .next rename recorded — the steal did not rename-over as expected"
+if grep -q -- '-T' "$MVTRACE"; then
+  bad "T47(a): claim->lock rename used 'mv -T' — the GNU fast path ran, fallback NOT forced"
+else
+  ok "T47(a): claim->lock rename used a BARE mv (no -T) — the BSD/macOS fallback lane was taken"
+fi
+{ [ -e "$LOCK" ] || [ -e "$LOCK.next" ]; } \
+  && bad "T47(a): leftover lock/claim after the fallback steal+release" \
+  || ok "T47(a): clean final state (no lock, no claim) after fallback steal+release"
+
+# ---- (a-control) same steal WITHOUT the override still succeeds ----
+# Guards against a false pass where _LOCK_MVT=0 silently broke the steal: the
+# unmodified library must steal the identical ghost too (here via mv -T).
+LOCKC="$WORK/mvt0c.lock"; LOGC="$WORK/mvt0c.log"; : > "$LOGC"
+fabricate_lock "$LOCKC" "tok.ghost.t47c" "pid=9 host=ghost"; backdate "$LOCKC" 9999
+AGENT_LOCK_PATH="$LOCKC" AGENT_LOCK_LOG="$LOGC" AGENT_LOCK_STALE_SECS=2 \
+  AGENT_LOCK_CLAIM_STALE_SECS=600 AGENT_LOCK_POLL_SECS=0.1 AGENT_LOCK_MAX_WAIT=10 \
+  bash -c 'source "$1" || exit 70; lock_acquire || exit 72; lock_release || exit 74; exit 0' \
+  _ "$LIB" 2>/dev/null; rcc=$?
+[ "$rcc" = 0 ] && grep -q "STOLE-BY-CLAIM" "$LOGC" \
+  && ok "T47(a-control): unmodified steal of the same ghost also succeeds (override didn't trivially break it)" \
+  || bad "T47(a-control): control steal rc=$rcc / no STOLE-BY-CLAIM (the (a) pass may be vacuous)"
+
+# ---- (b) directory at the lock path under _LOCK_MVT=0: [ -d ] guard refuses ----
+# The fallback's last-instant `[ -d "$dst" ]` guard (sh:976) must refuse to
+# rename a file over a directory — Test 37's no-clobber outcome, reached via the
+# fallback rather than `mv -T`'s natural directory refusal. Test 37 shadows `mv`
+# so the directory appears just before the real `mv -T` refuses it; that timing
+# does NOT exercise the fallback's `[ -d ]` because the swap lands AFTER the
+# library has already passed line 976. To hit the fallback guard itself we wrap
+# `_lock_rename_over`: the wrapper installs the directory and pins _LOCK_MVT=0,
+# THEN calls the unmodified original — whose own `[ -d "$dst" ]` check (line 976)
+# now sees the directory and returns 1, with NO library `mv`/`mv -T` ever run.
+# The verifies (step 3.3) ran before the wrapper, so they saw a stale FILE; the
+# directory exists only from the wrapper's first line onward. This is the
+# fallback-lane analogue of Test 37's wrong-type refusal.
+LOCKB="$WORK/mvt0dir.lock"; LOGB="$WORK/mvt0dir.log"; : > "$LOGB"
+fabricate_lock "$LOCKB" "tok.ghost.t47b" "pid=9 host=ghost"; backdate "$LOCKB" 9999
+AGENT_LOCK_PATH="$LOCKB" AGENT_LOCK_LOG="$LOGB" AGENT_LOCK_STALE_SECS=1 \
+  AGENT_LOCK_CLAIM_STALE_SECS=600 AGENT_LOCK_POLL_SECS=0.2 AGENT_LOCK_MAX_WAIT=3 \
+  bash -c '
+    source "$1" || exit 70
+    clone_fn _lock_rename_over _ro_orig
+    _lock_rename_over() {
+      # Land a DIRECTORY at the lock path, then force the fallback lane and run
+      # the REAL rename-over: its own `[ -d ]` guard (sh:976) must refuse (rc 1).
+      command rm -f -- "$AGENT_LOCK_PATH" 2>/dev/null
+      command mkdir -- "$AGENT_LOCK_PATH" 2>/dev/null
+      _LOCK_MVT=0
+      _ro_orig
+    }
+    lock_acquire
+    exit $?
+  ' _ "$LIB" 2>/dev/null; rcb=$?
+[ "$rcb" = 97 ] && ok "T47(b): fallback [ -d ] guard refused; waiter honoured MAX_WAIT (97), no false hold" \
+               || bad "T47(b): rc=$rcb (want 97 — a clobber/false hold would differ)"
+grep -q "CLAIM-ABORT (rename-refused)" "$LOGB" \
+  && ok "T47(b): CLAIM-ABORT (rename-refused) logged — fallback guard hit the wrong-type lane" \
+  || bad "T47(b): no CLAIM-ABORT (rename-refused) — fallback guard branch not exercised"
+grep -q "non-file at the lock path" "$LOGB" \
+  && ok "T47(b): refusal classified as non-file at the lock path" \
+  || bad "T47(b): missing 'non-file at the lock path' classification"
+grep -q "STOLE-BY-CLAIM" "$LOGB" \
+  && bad "T47(b): spurious STOLE-BY-CLAIM — the directory-occupied path was falsely stolen" \
+  || ok "T47(b): no STOLE-BY-CLAIM (the [ -d ] guard prevented a false steal)"
+[ -d "$LOCKB" ] \
+  && ok "T47(b): directory left in place at the lock path (never clobbered by the fallback mv)" \
+  || bad "T47(b): lock path no longer the squatting directory — the guard failed to protect it"
+[ -e "$LOCKB.next" ] \
+  && bad "T47(b): claim leftover (\$LOCK.next) after the fallback rename-refused abort" \
+  || ok "T47(b): claim file cleaned up — no leftover \$LOCK.next"
+rm -rf "$LOCK" "$LOCK.next" "$LOCKC" "$LOCKC.next" "$LOCKB" "$LOCKB.next"
+
+
 # NOTES (deliberately untested here):
 # * lock_release's LEFTOVER lane (the unlink blocked persistently) needs a
 #   foreign no-delete-share handle on the lock file — Windows-only, and the

From 3f7bd2372b2de2e3e02544504400dc17d12dfcfb Mon Sep 17 00:00:00 2001
From: Ben Toner <bentoner@bentoner.com>
Date: Thu, 18 Jun 2026 01:52:28 +1000
Subject: [PATCH 33/76] Bucket 2B: fault-injection tests 48-50 (F4, F2/J1, F1)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Three new fault-injection unit tests (failure-modes §4.5), validated on Windows
(Git Bash) + Linux (WSL):
- Test 48 (F4): unwritable lock dir (chmod 0555) -> clean 97, command never runs,
  no false hold. POSIX-only (chmod 0555 is a no-op on NTFS -- skip-with-note on
  Windows; the POSIX CI legs run it). WSL: 5/5.
- Test 49 (F2/J1): a failing log path (AGENT_LOCK_LOG under a regular file ->
  ENOTDIR) -> the lock still acquires+releases, the log write is swallowed.
  Portable (no guard), runs everywhere. 4/4 both platforms.
- Test 50 (F1): ENOSPC on create/write (a tiny full tmpfs) -> wait then 97, no
  false hold. Linux + passwordless-sudo only (ulimit -f is not usable: it raises
  SIGXFSZ and kills the wrapper instead of returning ENOSPC) -- skip-with-note
  otherwise; the Linux CI leg runs it. WSL: 2/2.

F3 (FD/inode exhaustion) is document-only -- not deterministically injectable
(the create needs ~1 FD), per steering-coverage B4.

Full unit suite Windows REDUCED: 315 passed, 0 failed, 1..315. No product change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 tests/git-commit-lock.test.sh | 79 +++++++++++++++++++++++++++++++++++
 1 file changed, 79 insertions(+)
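
Reviewer note (sits below the "---" line, so it is not part of the commit
message): a minimal sketch of the ENOTDIR trick that Test 49 relies on. The
helper name log_line and the file names are hypothetical and are not taken
from git-commit-lock.sh; the sketch only illustrates why a log path placed
under a regular file fails on every open, and how a best-effort logger can
swallow that failure without changing the caller's exit status.

```shell
#!/usr/bin/env bash
# Sketch only: log_line and these paths are hypothetical, not the library's API.
work="$(mktemp -d)"
notadir="$work/notadir"; : > "$notadir"   # a regular FILE, not a directory
badlog="$notadir/x.log"                    # any open under it fails (ENOTDIR)

# Best-effort logger: a failed open/append must never fail the caller.
# The braces put the redirection error itself under 2>/dev/null.
log_line() {
  { printf '%s\n' "$*" >> "$badlog"; } 2>/dev/null || true
}

log_line "ACQUIRED"; log_rc=$?
if [ -e "$badlog" ]; then log_created=yes; else log_created=no; fi
echo "log_rc=$log_rc log_created=$log_created"
rm -rf "$work"
```

Unlike a chmod-based approach, this does not depend on file permissions, so it
should behave the same for root and on filesystems that ignore mode bits.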

diff --git a/tests/git-commit-lock.test.sh b/tests/git-commit-lock.test.sh
index aca7323..e33a98b 100755
--- a/tests/git-commit-lock.test.sh
+++ b/tests/git-commit-lock.test.sh
@@ -3050,6 +3050,85 @@ grep -q "STOLE-BY-CLAIM" "$LOGB" \
 rm -rf "$LOCK" "$LOCK.next" "$LOCKC" "$LOCKC.next" "$LOCKB" "$LOCKB.next"
 
 
+echo "== Test 48: unwritable lock dir -> clean 97, command never runs, no false hold (F4) =="
+# F4 (failure-modes.md §4.5): a read-only / unwritable lock-dir parent makes the
+# O_EXCL create fail every poll, so the waiter times out at 97 — no corruption, no
+# false hold, and the wrapped command never runs. POSIX-only: chmod 0555 is a no-op
+# for writes on Git-Bash/NTFS (the create would wrongly succeed), so skip-with-note
+# on Windows; the Linux/macOS CI legs exercise it.
+case "$(uname -s)" in
+  MINGW*|MSYS*|CYGWIN*)
+    echo "note: Test 48 skipped on Windows — chmod 0555 does not deny writes on NTFS; the POSIX CI legs cover it" ;;
+  *)
+    T48DIR="$WORK/t48.nowrite"; T48LOG="$WORK/t48.log"; mkdir -p "$T48DIR"; : > "$T48LOG"
+    T48MARK="$WORK/t48.ran"; rm -f "$T48MARK"
+    chmod 0555 "$T48DIR"
+    AGENT_LOCK_PATH="$T48DIR/commit.lock" AGENT_LOCK_LOG="$T48LOG" \
+      AGENT_LOCK_STALE_SECS=1 AGENT_LOCK_POLL_SECS=0.1 AGENT_LOCK_MAX_WAIT=2 \
+      bash "$LIB" run -- bash -c "touch '$T48MARK'" 2> "$WORK/t48.err"; rc=$?
+    [ "$rc" = 97 ] && ok "F4 unwritable lock dir: waiter timed out (97)" \
+                   || bad "F4 unwritable lock dir: rc=$rc (want 97)"
+    [ ! -e "$T48MARK" ] && ok "F4: the wrapped command never ran" \
+                        || bad "F4: the wrapped command ran despite no lock"
+    [ ! -e "$T48DIR/commit.lock" ] && ok "F4: no lock file created in the unwritable dir" \
+                                   || bad "F4: a lock file appeared in an unwritable dir"
+    grep -q "WAITING for lock" "$T48LOG" && ok "F4: logged WAITING (the create kept failing)" \
+                                         || bad "F4: no WAITING log"
+    grep -q "TIMEOUT after" "$T48LOG" && ok "F4: logged the TIMEOUT" || bad "F4: no TIMEOUT log"
+    chmod 0755 "$T48DIR" 2>/dev/null; rm -rf "$T48DIR"   # restore so cleanup() can rm -rf $WORK
+    ;;
+esac
+
+echo "== Test 49: failing log path -> lock still works, the log write is swallowed (F2/J1) =="
+# F2/J1 (failure-modes.md §4.5): logging is best-effort (every write ends || true).
+# Point AGENT_LOCK_LOG under a REGULAR FILE so every append/open fails ENOTDIR — the
+# lock must still acquire+release cleanly (rc 0) with the log write swallowed.
+# Portable (no chmod/perms). NOTE: bash's redirection-OPEN failure leaks to stderr
+# (the ||true is on the write, not the open), so do NOT assert clean stderr; and do
+# NOT grep the log (nothing is ever written to it).
+T49P="$WORK/t49.notadir"; : > "$T49P"          # a regular FILE; using it as a dir -> ENOTDIR
+T49LOG="$T49P/x.log"                            # every open/append under it fails ENOTDIR
+T49MARK="$WORK/t49.ran"; rm -f "$T49MARK"
+AGENT_LOCK_PATH="$WORK/t49.lock" AGENT_LOCK_LOG="$T49LOG" \
+  bash "$LIB" run -- bash -c "touch '$T49MARK'" 2>/dev/null; rc=$?
+[ "$rc" = 0 ] && ok "F2/J1 failing log: lock acquired+released, command ran (rc 0)" \
+             || bad "F2/J1 failing log: rc=$rc (want 0 — a bad log must not fail the lock)"
+[ -e "$T49MARK" ] && ok "F2/J1: the wrapped command ran" \
+                  || bad "F2/J1: the wrapped command did not run"
+[ ! -e "$WORK/t49.lock" ] && ok "F2/J1: lock released/cleaned up despite the failing log" \
+                          || bad "F2/J1: lock left behind"
+[ ! -e "$T49LOG" ] && ok "F2/J1: the log write was swallowed (no log file under the non-dir)" \
+                   || bad "F2/J1: a log file was created under a non-dir"
+rm -f "$T49P" "$WORK/t49.lock"
+
+echo "== Test 50: ENOSPC on lock create/write -> wait then 97, no false hold (F1) =="
+# F1 (failure-modes.md §4.5): a full filesystem makes the create's write fail
+# (ENOSPC); the created-but-write-failed file is an empty orphan and the waiter
+# times out at 97 — no corruption, no false hold. Real ENOSPC needs a full FS, which
+# needs root (a small tmpfs); `ulimit -f` is NOT usable (it raises SIGXFSZ and kills
+# the wrapper, the wrong lane). So: Linux + passwordless sudo only; skip-with-note
+# otherwise. The Linux CI leg (ubuntu runners have passwordless sudo) exercises it.
+if [ "$(uname -s)" = Linux ] && sudo -n true 2>/dev/null; then
+  T50MNT="$WORK/t50.full"; T50LOG="$WORK/t50.log"; mkdir -p "$T50MNT"; : > "$T50LOG"
+  T50MARK="$WORK/t50.ran"; rm -f "$T50MARK"
+  if sudo mount -t tmpfs -o size=64k tmpfs "$T50MNT" 2>/dev/null; then
+    dd if=/dev/zero of="$T50MNT/fill" bs=1k count=256 2>/dev/null || true   # fill to ENOSPC
+    AGENT_LOCK_PATH="$T50MNT/commit.lock" AGENT_LOCK_LOG="$T50LOG" \
+      AGENT_LOCK_STALE_SECS=1 AGENT_LOCK_POLL_SECS=0.1 AGENT_LOCK_MAX_WAIT=2 \
+      bash "$LIB" run -- bash -c "touch '$T50MARK'" 2> "$WORK/t50.err"; rc=$?
+    [ "$rc" = 97 ] && ok "F1 ENOSPC: waiter timed out (97)" \
+                   || bad "F1 ENOSPC: rc=$rc (want 97)"
+    [ ! -e "$T50MARK" ] && ok "F1: the wrapped command never ran under ENOSPC" \
+                        || bad "F1: the wrapped command ran despite ENOSPC"
+    sudo umount "$T50MNT" 2>/dev/null
+  else
+    echo "note: Test 50 skipped — could not mount a tmpfs (sudo mount failed); covered where mountable"
+  fi
+  rmdir "$T50MNT" 2>/dev/null || true
+else
+  echo "note: Test 50 skipped — ENOSPC injection needs Linux + passwordless sudo (a small tmpfs); the Linux CI leg covers it"
+fi
+
 # NOTES (deliberately untested here):
 # * lock_release's LEFTOVER lane (the unlink blocked persistently) needs a
 #   foreign no-delete-share handle on the lock file — Windows-only, and the

From ba443c7b1823feda324a52a3e9616fe5e1c99a93 Mon Sep 17 00:00:00 2001
From: Ben Toner <bentoner@bentoner.com>
Date: Thu, 18 Jun 2026 01:59:39 +1000
Subject: [PATCH 34/76] Fix Test 37 (rename-refused) portability on macOS/BSD
 mv

Test 37's mv shadow created the squatting directory INSIDE the mv call, which
lands AFTER the no-mv-T fallback's [ -d ] guard -- and BSD "mv file dir" moves the
file INTO the dir rather than erroring, so the macOS CI leg saw the steal succeed
(STOLE-BY-CLAIM) instead of rename-refused (4 FAILs on macos-15; the product is fine).

Fix: make the directory appear BEFORE rename-over runs, by wrapping
_lock_rename_over (mkdir, then delegate to the original), so the rename is refused
PORTABLY: GNU "mv -T" refuses to overwrite a directory, and the no-mv-T fallback
[ -d ] guard (BSD/macOS) refuses it too. Validated locally on both engine paths
(native mv -T and a forced _LOCK_MVT=0 fallback: 8/8 each); Test 47 (A11) already
exercises the same wrap+guard and passed on macOS. Full Windows suite: 315 passed,
0 failed.

Caught by the cross-platform CI dispatch (run 27701297220).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 tests/git-commit-lock.test.sh | 26 +++++++++++++-------------
 1 file changed, 13 insertions(+), 13 deletions(-)
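
Reviewer note (below the "---" line, so not part of the commit message): a
minimal sketch of the wrap-then-delegate idea described above. Everything in it
is a stand-in: rename_over imitates only the no-"mv -T" fallback lane (a [ -d ]
guard followed by a bare mv), and this clone_fn is an assumed implementation
built on "declare -f"; the suite's real helper may differ.

```shell
#!/usr/bin/env bash
# Sketch only: rename_over, clone_fn and the paths are hypothetical stand-ins.
work="$(mktemp -d)"; dst="$work/commit.lock"; src="$dst.next"

# Stand-in for the fallback lane: refuse a directory, otherwise a bare mv.
rename_over() {
  [ -d "$2" ] && return 1
  command mv -- "$1" "$2"
}

# Copy function $1 under the new name $2 (assumed helper).
clone_fn() {
  eval "$2() $(declare -f "$1" | tail -n +2)"
}

# Wrap: squat a directory at the destination FIRST, then run the original,
# so the original's own guard is what sees the directory.
clone_fn rename_over rename_over_orig
rename_over() {
  command rm -f -- "$2"
  command mkdir -- "$2"
  rename_over_orig "$@"
}

echo claim > "$src"; echo stale > "$dst"
rename_over "$src" "$dst"; rc=$?
if [ -d "$dst" ]; then dst_is_dir=yes; else dst_is_dir=no; fi
if [ -f "$src" ]; then claim_kept=yes; else claim_kept=no; fi
echo "rc=$rc dst_is_dir=$dst_is_dir claim_kept=$claim_kept"
rm -rf "$work"
```

Because the directory exists before the guard runs, the refusal does not depend
on how the platform's mv treats "mv file dir", which is the portability point
of this patch.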

diff --git a/tests/git-commit-lock.test.sh b/tests/git-commit-lock.test.sh
index e33a98b..56cc7c2 100755
--- a/tests/git-commit-lock.test.sh
+++ b/tests/git-commit-lock.test.sh
@@ -2229,19 +2229,19 @@ AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" AGENT_LOCK_STALE_SECS=1 \
   AGENT_LOCK_CLAIM_STALE_SECS=600 AGENT_LOCK_POLL_SECS=0.2 AGENT_LOCK_MAX_WAIT=3 \
   bash -c '
     source "$1" || exit 70
-    # Shadow mv: on the claim->lock rename (the only mv touching ".next"),
-    # replace the stale lock file with a directory, then run the real mv -T,
-    # which refuses to overwrite a directory with a non-directory. The mv -T
-    # capability probe inside _lock_rename_over operates on its own temp paths
-    # (never ".next"), so it is unaffected.
-    mv() {
-      case "$*" in
-        *".next"*)
-          command rm -f -- "$AGENT_LOCK_PATH" 2>/dev/null
-          command mkdir -- "$AGENT_LOCK_PATH" 2>/dev/null
-          ;;
-      esac
-      command mv "$@"
+    # Make a DIRECTORY appear at the lock path BEFORE the real rename-over runs,
+    # by wrapping _lock_rename_over (NOT by shadowing mv). It is refused PORTABLY:
+    # GNU "mv -T" refuses to overwrite a directory with a non-directory, AND the
+    # no-mv-T fallback [ -d ] guard (BSD/macOS) refuses it too. A mv shadow that
+    # mkdirs the dir INSIDE the mv call works only on GNU: it lands AFTER the
+    # fallback [ -d ] check, and BSD "mv file dir" MOVES the file INTO the dir
+    # rather than erroring (this failed the macOS CI leg). NB: no apostrophes
+    # here -- this comment lives inside the bash -c single-quoted steering shell.
+    clone_fn _lock_rename_over _ro_orig
+    _lock_rename_over() {
+      command rm -f -- "$AGENT_LOCK_PATH" 2>/dev/null
+      command mkdir -- "$AGENT_LOCK_PATH" 2>/dev/null
+      _ro_orig
     }
     lock_acquire
     exit $?

From f47185756dfc5dcb4479dc55b3eb849b95a6249b Mon Sep 17 00:00:00 2001
From: Ben Toner <bentoner@bentoner.com>
Date: Thu, 18 Jun 2026 02:18:39 +1000
Subject: [PATCH 35/76] =?UTF-8?q?Plan:=20REOPEN=20D-d=20(merge-to-main=20s?=
 =?UTF-8?q?trategy)=20=E2=80=94=20cherry-pick=20vs=20tidy-rebase=20vs=20sq?=
 =?UTF-8?q?uash?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Ben reopened the merge-to-main mechanism: cherry-picking may not beat tidying
up and preserving history. Recorded the alternatives + git facts in Bucket 5 of
the guarantees-and-coverage plan, flipped D-d from settled to open, and
cross-referenced from the phase2 build plan's "what lands on main" section.

Key facts captured: main has not diverged (merge-base == main HEAD), so a
cleaned branch can ff-merge; b430d73 is a mixed commit (with-load.sh graduates,
CI wiring drops); and Bucket 6 already rewrites the CI workflows, so the final
ci-stress tree is main-worthy and the decision is about history, not the tree.
Recommendation: (B) tidy-rebase + ff-merge. Still Ben's call; merge is last.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 ...-ci-stress-guarantees-and-coverage-plan.md | 52 +++++++++++++++----
 .../2026-06-17-ci-stress-phase2-build-plan.md |  6 ++-
 2 files changed, 47 insertions(+), 11 deletions(-)
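
Reviewer note (below the "---" line, so not part of the commit message): a
minimal sketch of how the "main has not diverged" fact can be checked before
choosing a strategy. It builds a throwaway repository with made-up commits; only
the branch names main and ci-stress come from the plan, and the two-commit
history is purely illustrative.

```shell
#!/usr/bin/env bash
# Sketch only: a throwaway repo with hypothetical contents.
set -u
repo="$(mktemp -d)"
g() { git -C "$repo" -c user.name=sketch -c user.email=sketch@example.invalid "$@"; }

g init -q
g symbolic-ref HEAD refs/heads/main        # name the unborn branch "main"
g commit -q --allow-empty -m "base"
g checkout -q -b ci-stress
g commit -q --allow-empty -m "work 1"
g commit -q --allow-empty -m "work 2"

# Not diverged means: the merge base of the two branches is main's own tip.
base="$(g merge-base main ci-stress)"
main_head="$(g rev-parse main)"
ahead="$(g rev-list --count main..ci-stress)"
if [ "$base" = "$main_head" ]; then diverged=no; else diverged=yes; fi
echo "diverged=$diverged ahead=$ahead"

# With no divergence, a fast-forward-only merge succeeds without a merge commit.
g checkout -q main
g merge -q --ff-only ci-stress; ff_rc=$?
echo "ff_rc=$ff_rc"
rm -rf "$repo"
```

If main had gained commits of its own, the merge base would differ from main's
tip and the --ff-only merge would be refused rather than creating a merge commit.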

diff --git a/.plans/2026-06-17-ci-stress-guarantees-and-coverage-plan.md b/.plans/2026-06-17-ci-stress-guarantees-and-coverage-plan.md
index 757f601..23ba646 100644
--- a/.plans/2026-06-17-ci-stress-guarantees-and-coverage-plan.md
+++ b/.plans/2026-06-17-ci-stress-guarantees-and-coverage-plan.md
@@ -74,13 +74,41 @@ Ben's box).
   don't hard-fail on? (Recommend the latter — it makes the envelope explicit and stops
   future stress runs re-raising these as "flakes".)
 
-### Bucket 5 — Branch hygiene (standing, NOT part of this workflow unless wanted)
-- The mergeable commits (the 4 test fixes 58c3741/06c6d8e/51a1753/19a28fd + the docs) vs the
-  **stress-only, do-not-merge** commits (980856b concurrency tweak, b430d73 load wrapper).
-  When this lands on `main`, cherry-pick the mergeable set and leave the stress scaffolding.
-  *Open decision D-d:* do this work on `ci-stress` and cherry-pick later, or branch a clean
-  `failure-modes` off `main` now? (Recommend: keep working on `ci-stress`; cherry-pick at the
-  end — the stress wrapper is useful for CI-verifying the new tests under load.)
+### Bucket 5 — Merge-to-`main` strategy (**D-d REOPENED 2026-06-18**)
+Ben reopened this: cherry-picking may not be the best path — "tidying up and preserving
+history" is a live alternative. **Git facts (verified 2026-06-18):**
+- **`main` has not diverged** — `merge-base(main, ci-stress) == main HEAD (fa43f30)`. So
+  ci-stress is strictly **34 commits ahead**, and a cleaned-up branch can **fast-forward**
+  onto `main` (no merge commit).
+- The 34 commits are a mix: genuine product/test/doc work; **pure stress-only scaffolding**
+  (`980856b` concurrency tweak; `b430d73`'s `tests.yml` load-wiring + raised timeouts — *but
+  `b430d73` also adds `tests/with-load.sh`, which graduates*, so it is a **mixed** commit);
+  intermediate **plan / AGENTS.md churn**; and the **`/c` commit+revert pairs**
+  (`534a007` → `959cca9` → `a5df9d9`).
+- **Bucket 6 itself rewrites the CI workflows** (3 new files) and reverts the stress wiring.
+  So after Bucket 6 lands, **ci-stress's final *tree* is already main-worthy** — the
+  stress-only commits are a *history* concern, not a tree concern. **The decision is therefore
+  mostly about what history `main` should carry, not about keeping bad code out of the tree.**
+
+Options:
+- **(A) Cherry-pick a curated subset** onto `main` (the prior plan). Surgical, but ~20
+  interdependent picks (later commits edit the same test file repeatedly → conflict-prone),
+  new SHAs disconnected from the branch, and `b430d73` must be split by hand. Drops the
+  review/decision narrative.
+- **(B) Tidy-rebase `ci-stress`, then `--ff-only` merge** ("tidy up + preserve history").
+  Interactively rewrite the branch: squash the `/c` commit+revert pairs and the intermediate
+  plan/changelog churn into their content commits, excise the pure scaffolding (or rely on
+  Bucket 6 having already removed the wiring from the tree), curate messages; then `git -C
+  <main> merge ci-stress --ff-only` lands a clean linear history in one operation. Keeps a
+  curated narrative; **rewrites history** — gotcha: `rebase.updateRefs=true` moves any branch
+  pointing into the range, so back up with a **raw SHA/tag, never a branch**.
+- **(C) Squash-merge** to one (or a few) curated commit(s). Cleanest `main` log, trivially
+  excludes scaffolding (final tree only), but discards all granular history.
+
+*Recommendation:* **(B)** — enabled cleanly by `main` not having diverged; gives a
+curated-but-real history (which (C) discards and (A) reconstructs laboriously) and matches
+"tidy up and preserve." **Still Ben's call** (it's about `main`'s permanent history); settle it
+before the merge step. **Not a blocker for the rest of Phase 3 — the merge is last.**
 
 ### Bucket 6 — Principled load-&-matrix testing STRATEGY (Ben "f", 2026-06-17) — RECOMMENDATION DOC, not code
 The current load injection (`tests/with-load.sh`: N CPU spin-loops + N disk write/fsync/delete
@@ -181,8 +209,9 @@ the agreed CI matrix (Bucket 6). Commit incrementally under the commit-lock. **V
 
 **Phase 4 — Review.** Review the diff (Claude + Codex); run the full suite via CI **across the
 agreed matrix** to confirm new tests pass + are non-flaky, the scoped bounds hold, and the
-matrix surfaces no new flakes. Iterate to clean. → Ben's final review. Then (D-d) cherry-pick
-the mergeable commits to `main`.
+matrix surfaces no new flakes. Iterate to clean. → Ben's final review. Then land on `main`
+per **D-d** (merge strategy reopened 2026-06-18 — cherry-pick vs tidy-rebase+ff-merge vs
+squash; see Bucket 5).
 
 ## Decisions (settled 2026-06-17)
 - **D-a → new `docs/guarantees.md`** (dedicated normative doc).
@@ -190,7 +219,10 @@ the mergeable commits to `main`.
   gaps (#7 wrong-type-mid-steal, #8 Windows blocked-unlink) as a second tier.
 - **D-c → split the suite** into a strict-correctness tier (always enforced) and a
   latency/envelope tier (not hard-failed by extreme-stress runs).
-- **D-d → keep on `ci-stress`**, cherry-pick the mergeable commits to `main` at the end.
+- **D-d → REOPENED 2026-06-18** (was: keep on `ci-stress`, cherry-pick mergeable commits at
+  the end). Work continues on `ci-stress`; the *merge-to-`main` mechanism* is now an **open
+  decision** — cherry-pick (A) vs tidy-rebase + ff-merge (B, recommended) vs squash (C). See
+  **Bucket 5** for the analysis. Settle before the merge step (it's the last step).
 - **D-e → my choice:** hand-run Phases 1-2; decide Phase 3-4 (hand vs Workflow) once the
   test/matrix count is known.
 - **"f" → Bucket 6**, above: a considered, first-principles load-&-matrix testing
diff --git a/.plans/2026-06-17-ci-stress-phase2-build-plan.md b/.plans/2026-06-17-ci-stress-phase2-build-plan.md
index 69b3bb6..a547f8b 100644
--- a/.plans/2026-06-17-ci-stress-phase2-build-plan.md
+++ b/.plans/2026-06-17-ci-stress-phase2-build-plan.md
@@ -276,7 +276,11 @@ calibrated envelope leg (IO throttling experimental — don't rely on it); emit
 **load-manifest** artifact (`{kind, R, nproc, achieved-slowdown, tool versions, os/arch,
 sha}`) uploaded on success too.
 
-**What lands on `main` vs stays scaffolding (refines Bucket 5 / D-d):**
+**What lands on `main` vs stays scaffolding (refines Bucket 5 / D-d):** *(This lists the
+mergeable **content**; it is mechanism-agnostic. The merge **mechanism** — cherry-pick vs
+tidy-rebase+ff-merge vs squash — was **reopened 2026-06-18**; see the guarantees-and-coverage
+plan's Bucket 5. Note that after this Bucket 6 lands, ci-stress's tree already excludes the
+stress wiring, so "what graduates" is mostly a history-curation question, not a tree one.)*
 - **Graduate to `main`:** the calibrated `with-load.sh` (strip the do-not-merge banner;
   add ratio calibration + load-manifest); `ok_envelope`/`bad_envelope` + the 3
   reassigned assertions; `GCL_TEST_SWEEP` + Axis-A loop (default-off → per-PR identical

From 353a2dd1c741964f026dac6c993a85a49d43ea8f Mon Sep 17 00:00:00 2001
From: Ben Toner <bentoner@bentoner.com>
Date: Thu, 18 Jun 2026 02:21:11 +1000
Subject: [PATCH 36/76] =?UTF-8?q?Plan:=20RESOLVE=20D-d=20=E2=80=94=20mild?=
 =?UTF-8?q?=20tidy-up,=20merge=20to=20main=20via=20GitHub=20PR=20(extent?=
 =?UTF-8?q?=20=3D=20Ben's=20call)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Ben resolved the merge-to-main strategy: option (B), a *mild* tidy-up of the
ci-stress history, then merge via a GitHub pull request (not a local ff-merge).
The extent of the tidy-up is Ben's call — propose the specific commits to
drop/squash and get sign-off before any history rewrite. Updated Bucket 5, the
D-d decision line, and Phase 4. Merge remains the last step.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 ...-ci-stress-guarantees-and-coverage-plan.md | 26 +++++++++++++------
 1 file changed, 18 insertions(+), 8 deletions(-)

diff --git a/.plans/2026-06-17-ci-stress-guarantees-and-coverage-plan.md b/.plans/2026-06-17-ci-stress-guarantees-and-coverage-plan.md
index 23ba646..523118a 100644
--- a/.plans/2026-06-17-ci-stress-guarantees-and-coverage-plan.md
+++ b/.plans/2026-06-17-ci-stress-guarantees-and-coverage-plan.md
@@ -107,8 +107,17 @@ Options:
 
 *Recommendation:* **(B)** — enabled cleanly by `main` not having diverged; gives a
 curated-but-real history (which (C) discards and (A) reconstructs laboriously) and matches
-"tidy up and preserve." **Still Ben's call** (it's about `main`'s permanent history); settle it
-before the merge step. **Not a blocker for the rest of Phase 3 — the merge is last.**
+"tidy up and preserve."
+
+**RESOLVED (Ben, 2026-06-18): (B) — a *mild* tidy-up, then merge via a GitHub pull request**
+(ci-stress → main), **not** a local ff-merge. Refinements:
+- **Extent of tidy-up is Ben's call.** Keep it mild. Before any history rewrite, propose the
+  specific tidy (candidates: drop the pure scaffolding commits `980856b` + `b430d73`'s
+  required-job wiring; squash the obvious `/c` commit+revert noise `534a007`→`959cca9`→
+  `a5df9d9`; leave the rest) and get Ben's sign-off on the extent — do not decide it autonomously.
+- **Merge via a GitHub PR**, so the PR's CI is the gate and the merge is reviewable. `main`
+  has not diverged, so the PR stays clean.
+- Still the **last** step of Phase 3/4; not a blocker for the harness/CI work.
 
 ### Bucket 6 — Principled load-&-matrix testing STRATEGY (Ben "f", 2026-06-17) — RECOMMENDATION DOC, not code
 The current load injection (`tests/with-load.sh`: N CPU spin-loops + N disk write/fsync/delete
@@ -210,8 +219,8 @@ the agreed CI matrix (Bucket 6). Commit incrementally under the commit-lock. **V
 **Phase 4 — Review.** Review the diff (Claude + Codex); run the full suite via CI **across the
 agreed matrix** to confirm new tests pass + are non-flaky, the scoped bounds hold, and the
 matrix surfaces no new flakes. Iterate to clean. → Ben's final review. Then land on `main`
-per **D-d** (merge strategy reopened 2026-06-18 — cherry-pick vs tidy-rebase+ff-merge vs
-squash; see Bucket 5).
+per **D-d** (resolved 2026-06-18: a mild tidy-up — extent is Ben's call — then merge via a
+GitHub PR; see Bucket 5).
 
 ## Decisions (settled 2026-06-17)
 - **D-a → new `docs/guarantees.md`** (dedicated normative doc).
@@ -219,10 +228,11 @@ squash; see Bucket 5).
   gaps (#7 wrong-type-mid-steal, #8 Windows blocked-unlink) as a second tier.
 - **D-c → split the suite** into a strict-correctness tier (always enforced) and a
   latency/envelope tier (not hard-failed by extreme-stress runs).
-- **D-d → REOPENED 2026-06-18** (was: keep on `ci-stress`, cherry-pick mergeable commits at
-  the end). Work continues on `ci-stress`; the *merge-to-`main` mechanism* is now an **open
-  decision** — cherry-pick (A) vs tidy-rebase + ff-merge (B, recommended) vs squash (C). See
-  **Bucket 5** for the analysis. Settle before the merge step (it's the last step).
+- **D-d → RESOLVED 2026-06-18: (B) mild tidy-up, then merge via a GitHub PR** (ci-stress →
+  main), not a local ff-merge. **Extent of tidy-up is Ben's call** — propose the specific
+  commits to drop/squash and get his sign-off before rewriting history. (Was briefly reopened
+  2026-06-18 across cherry-pick (A) / tidy-rebase (B) / squash (C); see **Bucket 5**.) Still the
+  last step.
 - **D-e → my choice:** hand-run Phases 1-2; decide Phase 3-4 (hand vs Workflow) once the
   test/matrix count is known.
 - **"f" → Bucket 6**, above: a considered, first-principles load-&-matrix testing

From 4ee5899869420d7ae72c77f812a695fe2154f498 Mon Sep 17 00:00:00 2001
From: Ben Toner <bentoner@bentoner.com>
Date: Thu, 18 Jun 2026 03:09:30 +1000
Subject: [PATCH 37/76] Bucket 8 item 2: GCL_TEST_ONLY single-test selector
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Add a regex single-test selector to the suites (Phase 3, harness ergonomics):

- Unit + interop: each top-level `== Test N: … ==` header becomes
  `if section "Test N: …"; then … fi`. `section` echoes the header (visible
  output unchanged) and runs the block iff GCL_TEST_ONLY is unset/empty or its
  regex matches the label, bumping SECTIONS_RUN on a match. Each `fi` sits
  immediately before the next block's `if section`, so trailing per-test
  cleanup stays inside the block.
- Zero-match guard: if GCL_TEST_ONLY is set but matched no block
  (SECTIONS_RUN==0), bail loudly with exit 1 — a typo'd regex can't produce a
  vacuous PASS=0/FAIL=0 green (same spirit as the undercount sentinel).
- Integration suite note-and-ignores GCL_TEST_ONLY: it is one indivisible
  scenario (Tests 1-3 share a repo + the ALL_IDS audit), so it prints a loud
  stderr note and runs the whole suite.

Default runs are byte-identical (selector logic is gated on GCL_TEST_ONLY).
Validated: unit 315/0, interop 141/0, integration 12/0 (reduced, exit 0);
sorted PASS/FAIL set identical before/after (volatile token/path fields aside);
selector precision proven (regex match, trailing-colon anchoring so 'Test 2:'
excludes Test 20/2b); zero-match guard exits 1. shellcheck -S style clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 tests/git-commit-lock.integration.test.sh |  10 ++
 tests/git-commit-lock.interop.test.sh     | 105 +++++++++---
 tests/git-commit-lock.test.sh             | 195 +++++++++++++++-------
 3 files changed, 226 insertions(+), 84 deletions(-)
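
Reviewer note (illustrative, not part of the change): the selector and zero-match guard described in the commit message can be exercised in isolation with a minimal sketch like the one below. `section` mirrors the helper this patch adds; `LABELS`, `MATCHED`, and `run_with_selector` are hypothetical names introduced only for this demonstration, and the labels are shortened stand-ins rather than the suites' real headers. The guard here returns 1 where the real suites exit 1.

```shell
#!/usr/bin/env bash
# Sketch only: LABELS, MATCHED and run_with_selector are hypothetical helpers.
SECTIONS_RUN=0
LABELS=("Test 2: stale lock is stolen"
        "Test 2b: crash recovery under contention"
        "Test 20: hypothetical extra block")

# Same shape as the patch's helper: echo the header, succeed iff the selector
# is unset/empty or its regex matches the label, and count each match.
section() {
  echo "== $1 =="
  if [ -z "${GCL_TEST_ONLY:-}" ] || [[ "$1" =~ $GCL_TEST_ONLY ]]; then
    SECTIONS_RUN=$((SECTIONS_RUN + 1)); return 0
  fi
  return 1
}

# Hypothetical driver: apply one selector to the sample labels, collect the
# labels whose block would run, then apply the zero-match guard.
run_with_selector() {  # $1 = selector regex ('' runs every block)
  GCL_TEST_ONLY="$1"; SECTIONS_RUN=0; MATCHED=""
  local label
  for label in "${LABELS[@]}"; do
    if section "$label" >/dev/null; then MATCHED+="[$label]"; fi
  done
  if [ -n "$GCL_TEST_ONLY" ] && [ "$SECTIONS_RUN" = 0 ]; then
    echo "Bail out! GCL_TEST_ONLY=\"$GCL_TEST_ONLY\" matched no test" >&2
    return 1
  fi
}

# The trailing colon keeps 'Test 2:' from also selecting Test 2b or Test 20.
run_with_selector 'Test 2:' && echo "matched: $MATCHED ($SECTIONS_RUN block(s))"
```

Because the regex is unanchored, the trailing colon is what provides the precision; a selector of `Test 2` alone would match all three sample labels.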

diff --git a/tests/git-commit-lock.integration.test.sh b/tests/git-commit-lock.integration.test.sh
index 579a5da..e7837f4 100644
--- a/tests/git-commit-lock.integration.test.sh
+++ b/tests/git-commit-lock.integration.test.sh
@@ -114,6 +114,16 @@ echo "fan-out mode: $GCL_MODE (bash swarm ${BROUNDS}x${BN}, mixed swarm ${MSH}+$
 # bounded max wait so a wedge fails the suite instead of hanging it.
 LK_ENV=(AGENT_LOCK_STALE_SECS=300 AGENT_LOCK_POLL_SECS=0.2 AGENT_LOCK_MAX_WAIT=240)
 
+# Note-and-ignore the per-test selector the unit/interop suites honour: this
+# suite is ONE indivisible scenario (Tests 1-3 share a single repo + the ALL_IDS
+# accumulator, and Test 3 audits Tests 1+2's output), so a per-block selector
+# can't apply. If GCL_TEST_ONLY is set, say so loudly on stderr and run the
+# whole scenario as normal.
+GCL_TEST_ONLY="${GCL_TEST_ONLY:-}"
+if [ -n "$GCL_TEST_ONLY" ]; then
+    echo "NOTE: integration suite ignores GCL_TEST_ONLY=\"$GCL_TEST_ONLY\" — Tests 1-3 are one indivisible scenario (shared repo + ALL_IDS audit); running the whole suite." >&2
+fi
+
 # --- scratch repo ------------------------------------------------------------
 REPO="$WORK/repo"; OUTD="$WORK/out"; NOHOOKS="$WORK/nohooks"
 mkdir -p "$REPO" "$OUTD" "$NOHOOKS"
diff --git a/tests/git-commit-lock.interop.test.sh b/tests/git-commit-lock.interop.test.sh
index a638005..8bda7c7 100644
--- a/tests/git-commit-lock.interop.test.sh
+++ b/tests/git-commit-lock.interop.test.sh
@@ -67,8 +67,13 @@ WORK="$(pwsh -NoProfile -Command '[IO.Path]::Combine([IO.Path]::GetTempPath(), "
 WORK="${WORK//\\//}"
 mkdir -p "$WORK"
 
-PASS=0; FAIL=0; TAPN=0; DONE=0
+PASS=0; FAIL=0; TAPN=0; DONE=0; SECTIONS_RUN=0
 GCL_TAP="${GCL_TAP:-0}"           # CI sets GCL_TAP=1 for machine-readable TAP13 output
+# Single-test selector: GCL_TEST_ONLY=<regex> runs only the test blocks whose
+# `== Test N: <desc> ==` label matches the regex (BASH regex, =~). Unset/empty
+# runs every block (default). A typo'd regex that matches nothing bails out
+# loudly at the verdict (the zero-match guard) rather than passing vacuously.
+GCL_TEST_ONLY="${GCL_TEST_ONLY:-}"
 # ok/bad are TAP-aware (gated by GCL_TAP so plain dev runs are byte-unchanged) and
 # bump the running assertion number TAPN. The trailing `1..$TAPN` plan line (emitted
 # just before the verdict) lets a TAP consumer fail on a short count; together with the
@@ -79,6 +84,19 @@ ok()  { PASS=$((PASS+1)); TAPN=$((TAPN+1)); echo "PASS: $*"
 bad() { FAIL=$((FAIL+1)); TAPN=$((TAPN+1)); echo "FAIL: $*"
         [ "$GCL_TAP" = 1 ] && echo "not ok $TAPN - $*"; return 0; }
 
+# Per-test gate: echoes the block header (so a normal run is byte-unchanged) and
+# returns success iff GCL_TEST_ONLY is unset/empty OR its regex matches the label.
+# Each top-level `== Test N: <desc> ==` block is wrapped `if section "..."; then ... fi`.
+# Bumps SECTIONS_RUN on a match so the verdict's zero-match guard can catch a
+# selector regex that matched nothing.
+section() {
+  echo "== $1 =="
+  if [ -z "${GCL_TEST_ONLY:-}" ] || [[ "$1" =~ $GCL_TEST_ONLY ]]; then
+    SECTIONS_RUN=$((SECTIONS_RUN + 1)); return 0
+  fi
+  return 1
+}
+
 # Failure post-mortems need the logs: keep $WORK when anything failed, and
 # honour GCL_TEST_PRESERVE_DIR (the CI preserve-logs knob) by copying
 # the work dir there unconditionally when it is set.
@@ -243,7 +261,7 @@ ps_worker() {  # $1=lock $2=log $3=holder $4=violations $5=id
     pwsh -NoProfile -File "$PS1WIN" run "$body"
 }
 
-echo "== Test 1: mixed pwsh+bash workers, mutual exclusion across implementations ($GCL_MODE width) =="
+if section "Test 1: mixed pwsh+bash workers, mutual exclusion across implementations ($GCL_MODE width)"; then
 NSH=$T1_NSH; NPS=$T1_NPS; TOT=$((NSH+NPS))
 LOCK="$WORK/excl.lock"
 HOLDER="$WORK/holder"; : > "$HOLDER"; VIOL="$WORK/violations"; : > "$VIOL"
@@ -278,8 +296,9 @@ else
   [ "$st" != 0 ] && { echo "  STALE/STEAL log lines:"; grep -E "STALE|STOLE" "$WORK/excl-all.log" | sed 's/^/    /'; }
   bad "cross-impl exclusion/balance: violations=$nv steals=$st acquired=$a (floor $((TOT/2))) released=$rl leftover=$([ -e "$LOCK" ] && echo yes || echo no)"
 fi
+fi
 
-echo "== Test 2: a bash holder blocks a pwsh waiter (no concurrent hold, no wrongful steal) =="
+if section "Test 2: a bash holder blocks a pwsh waiter (no concurrent hold, no wrongful steal)"; then
 LOCK="$WORK/b2.lock"; LOG="$WORK/b2.log"; : > "$LOG"; ORDER="$WORK/b2.order"; : > "$ORDER"
 READY="$WORK/b2.ready"; rm -f "$READY"
 AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" AGENT_LOCK_STALE_SECS=300 AGENT_LOCK_POLL_SECS=0.1 AGENT_LOCK_MAX_WAIT=60 \
@@ -295,8 +314,9 @@ wait "$holder"
 got="$(tr '\n' ',' < "$ORDER")"
 [ "$got" = "sh-start,sh-end,ps-ran," ] && ok "bash-holds / pwsh-waits ordering correct" || bad "ordering wrong: $got"
 grep -q STOLE "$LOG" && bad "pwsh wrongly STOLE a live bash lock" || ok "pwsh did not steal the live bash lock"
+fi
 
-echo "== Test 3: a pwsh holder blocks a bash waiter =="
+if section "Test 3: a pwsh holder blocks a bash waiter"; then
 LOCK="$WORK/b3.lock"; LOG="$WORK/b3.log"; : > "$LOG"; ORDER="$WORK/b3.order"; : > "$ORDER"
 READY="$WORK/b3.ready"; rm -f "$READY"
 AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" AGENT_LOCK_STALE_SECS=300 AGENT_LOCK_POLL_SECS=0.1 AGENT_LOCK_MAX_WAIT=60 \
@@ -309,8 +329,9 @@ wait "$holder"
 got="$(tr '\n' ',' < "$ORDER")"
 [ "$got" = "ps-start,ps-end,sh-ran," ] && ok "pwsh-holds / bash-waits ordering correct" || bad "ordering wrong: $got"
 grep -q STOLE "$LOG" && bad "bash wrongly STOLE a live pwsh lock" || ok "bash did not steal the live pwsh lock"
+fi
 
-echo "== Test 4: pwsh steals a STALE lock fabricated as bash's (old file mtime) =="
+if section "Test 4: pwsh steals a STALE lock fabricated as bash's (old file mtime)"; then
 # AGENT_LOCK_MAX_WAIT caps the run so a steal regression fails in ~20s, not 420s.
 LOCK="$WORK/b4.lock"; LOG="$WORK/b4.log"; : > "$LOG"; MARK="$WORK/b4.mark"; printf '%s' before > "$MARK"
 fabricate_lock "$LOCK" "tok.sh.ghost.1" "pid=99999 host=ghost"
@@ -323,8 +344,9 @@ grep -q STOLE "$LOG" && ok "log records the cross-impl steal" || bad "no STOLE e
 grep -q "holder=pid=99999 host=ghost" "$LOG" \
   && ok "STALE log line carries the holder parsed from line 2 (cross-impl wire format)" \
   || bad "holder from line 2 missing in pwsh's STALE log line"
+fi
 
-echo "== Test 5: bash steals a STALE lock GENUINELY created by pwsh (holder killed mid-hold) =="
+if section "Test 5: bash steals a STALE lock GENUINELY created by pwsh (holder killed mid-hold)"; then
 # The stale lock really is pwsh's: a pwsh process dot-sources the lock, acquires (writing
 # its tok.ps.* token to line 1 and flushing+closing the file), signals ready, then
 # SELF-EXITS via [Environment]::Exit(0) — the port's documented hard-exit that bypasses
@@ -356,8 +378,9 @@ else
   kill -9 "$hpid" 2>/dev/null; wait "$hpid" 2>/dev/null
   bad "T5 pwsh holder never acquired/signalled ready"
 fi
+fi
 
-echo "== Test 6: deterministic lost-update counter, mixed bash+pwsh increments ($GCL_MODE width) =="
+if section "Test 6: deterministic lost-update counter, mixed bash+pwsh increments ($GCL_MODE width)"; then
 # The deterministic complement to Test 1's exclusion probe (which has a blind
 # window and tolerates launch flakiness): every worker MUST launch (strict rc
 # checks) and the final counter MUST equal the total increments — any lost
@@ -403,8 +426,9 @@ cat "$WORK"/cnt-*.log > "$WORK/cnt-all.log" 2>/dev/null || : > "$WORK/cnt-all.lo
 a="$(grep -c ACQUIRED "$WORK/cnt-all.log")"; rl="$(grep -c RELEASED "$WORK/cnt-all.log")"
 [ "$a" = "$CTOT" ] && [ "$rl" = "$CTOT" ] && ok "lock logs balanced ($a acquired / $rl released)" || bad "lock logs unbalanced: acquired=$a released=$rl want=$CTOT"
 [ -e "$LOCK" ] && bad "leftover counter lock" || ok "no leftover lock"
+fi
 
-echo "== Test 7: pwsh run propagates the command's exit code (two contending runs in parallel) =="
+if section "Test 7: pwsh run propagates the command's exit code (two contending runs in parallel)"; then
 LOCK="$WORK/rc.lock"; LOG="$WORK/rc.log"; : > "$LOG"
 AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" AGENT_LOCK_MAX_WAIT=60 \
   pwsh -NoProfile -File "$PS1WIN" run "exit 0" & p0=$!
@@ -415,8 +439,9 @@ wait "$p7"; rc7=$?
 [ "$rc0" = 0 ] && ok "pwsh exit 0 propagated" || bad "pwsh exit 0 not propagated (rc=$rc0)"
 [ "$rc7" = 7 ] && ok "pwsh exit 7 propagated" || bad "pwsh exit code not propagated ($rc7)"
 [ -e "$LOCK" ] && bad "lock left held after pwsh run" || ok "lock released after pwsh run (success and failure)"
+fi
 
-echo "== Test 7b: ps1 run verdicts for PowerShell-NATIVE failure (a failing cmdlet must not exit 0) =="
+if section "Test 7b: ps1 run verdicts for PowerShell-NATIVE failure (a failing cmdlet must not exit 0)"; then
 # A cmdlet's non-terminating error never sets LASTEXITCODE, so a runner
 # consulting only LASTEXITCODE would return 0 for a failed command. The
 # runner must consult the staged script's FINAL '$?' when no nonzero native
@@ -454,8 +479,9 @@ AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" AGENT_LOCK_MAX_WAIT=20 \
 [ "$rc" = 0 ] && ok "mid-command cmdlet failure + succeeding final statement -> 0 (the documented final-statement limitation)" \
               || bad "limitation pin: rc=$rc (want 0 — has the final-statement contract changed?)"
 [ -e "$LOCK" ] && bad "lock left held after the failing-cmdlet verdict runs" || ok "no leftover lock after the failing-cmdlet verdict runs"
+fi
 
-echo "== Test 7c: ps1 CLI help/usage convention — explicit help -> stdout + exit 0; usage errors -> stderr + 96 =="
+if section "Test 7c: ps1 CLI help/usage convention — explicit help -> stdout + exit 0; usage errors -> stderr + 96"; then
 # (bash's side of the same convention is pinned in the unit suite, Test 7.)
 for h in --help -h; do
   pwsh -NoProfile -File "$PS1WIN" "$h" > "$WORK/t7c.out" 2> "$WORK/t7c.err"; rc=$?
@@ -475,8 +501,9 @@ pwsh -NoProfile -File "$PS1WIN" > "$WORK/t7c-noargs.out" 2> "$WORK/t7c-noargs.er
   || bad "ps1 no-args rc=$rc (want 96) stderr-usage=$(grep -c '^usage:' "$WORK/t7c-noargs.err")"
 pwsh -NoProfile -File "$PS1WIN" frobnicate >/dev/null 2>&1; rc=$?
 [ "$rc" = 96 ] && ok "ps1 unknown subcommand -> 96" || bad "ps1 unknown subcommand rc=$rc (want 96)"
+fi
 
-echo "== Test 8: a ROBBED holder exits 98 — pwsh victim/bash thief, then bash victim/pwsh thief =="
+if section "Test 8: a ROBBED holder exits 98 — pwsh victim/bash thief, then bash victim/pwsh thief"; then
 # Fail-open ceiling, cross-impl: the victim holds past its 1s stale window
 # UNTIL THE THIEF IS DONE (marker, not a fixed sleep — a fixed hold once let a
 # slow-starting thief arrive after the victim had already released), the other
@@ -509,15 +536,17 @@ touch "$TDONE"
 wait "$vic"; vic_rc=$?
 [ "$vic_rc" = 98 ] && ok "robbed bash holder exited 98" || bad "robbed bash holder exited $vic_rc (want 98)"
 [ "$thief_rc" = 0 ] && ok "pwsh thief exited 0" || bad "pwsh thief exited $thief_rc"
+fi
 
-echo "== Test 9: a slow but UNCONTENDED pwsh holder keeps its lock (slowness != failure) =="
+if section "Test 9: a slow but UNCONTENDED pwsh holder keeps its lock (slowness != failure)"; then
 LOCK="$WORK/slow.lock"; LOG="$WORK/slow.log"; : > "$LOG"
 AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" AGENT_LOCK_STALE_SECS=1 AGENT_LOCK_POLL_SECS=0.1 AGENT_LOCK_MAX_WAIT=30 \
   pwsh -NoProfile -File "$PS1WIN" run "Start-Sleep 2"; rc=$?
 [ "$rc" = 0 ] && ok "uncontended slow pwsh holder exited 0" || bad "uncontended slow pwsh holder exited $rc"
 grep -q "WARNING" "$LOG" && bad "spurious theft WARNING with no contender" || ok "no spurious WARNING when uncontended"
+fi
 
-echo "== Test 10: default lock location is <gitdir>/commit.lock for BOTH impls (regression: item 1) =="
+if section "Test 10: default lock location is <gitdir>/commit.lock for BOTH impls (regression: item 1)"; then
 # The BLOCKER this guards against: the .ps1 silently fell back to a CWD lock at
 # default config, so the two impls never contended. Run BOTH impls from a
 # SUBDIRECTORY of a scratch repo with AGENT_LOCK_PATH/LOG unset; each command
@@ -539,8 +568,9 @@ nps="$(grep -c "ACQUIRED.*tok=tok\.ps\." "$DLOG" 2>/dev/null)"
   && ok "shared <gitdir> log shows 1 bash + 1 pwsh acquisition" \
   || bad "default-log evidence wrong: ACQUIRED=$na (want 2), pwsh tokens=$nps (want 1) in $DLOG"
 [ -e "$GITDIR2/commit.lock" ] && bad "leftover default lock" || ok "no leftover default lock"
+fi
 
-echo "== Test 11: release-time classification agrees across impls — truncated => unverifiable (1); deleted => theft (98) =="
+if section "Test 11: release-time classification agrees across impls — truncated => unverifiable (1); deleted => theft (98)"; then
 # (i) TRUNCATED at release: the file still exists but reads EMPTY after the
 # retry ladder. NOT provable theft (it is the probe-F create->write window of
 # a successor after a boundary steal, or external truncation), so BOTH impls
@@ -569,8 +599,9 @@ AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" AGENT_LOCK_MAX_WAIT=20 \
   pwsh -NoProfile -File "$PS1WIN" run "Remove-Item -LiteralPath '$LOCK' -Force" 2>/dev/null; rc_ps=$?
 [ "$rc_sh" = 98 ] && ok "bash: lock GONE at release -> exit 98 (theft)" || bad "bash gone-at-release rc=$rc_sh (want 98)"
 [ "$rc_ps" = 98 ] && ok "pwsh: lock GONE at release -> exit 98 (theft)" || bad "pwsh gone-at-release rc=$rc_ps (want 98)"
+fi
 
-echo "== Test 12: fractional STALE/MAX_WAIT rejected identically by both impls (note + default) =="
+if section "Test 12: fractional STALE/MAX_WAIT rejected identically by both impls (note + default)"; then
 # These two knobs are integers in both impls; a fractional value silently
 # rounded by one side but rejected by the other would give the two impls
 # DIFFERENT steal thresholds for the same env. Both must note + use defaults.
@@ -625,10 +656,11 @@ n_ps="$(grep -c 'ignoring invalid' "$WORK/poll-ps.err")"
 [ "$rc_sh" = 0 ] && [ "$n_sh" = 0 ] && [ "$rc_ps" = 0 ] && [ "$n_ps" = 0 ] \
   && ok "POLL_SECS='' (empty): silent default in BOTH impls (no note)" \
   || bad "POLL_SECS='' parity: sh rc=$rc_sh notes=$n_sh; pwsh rc=$rc_ps notes=$n_ps (want rc 0 + 0 notes each)"
+fi
 
 if [ "$GCL_WINDOWS" = 1 ]; then
 
-echo "== Test 13: blocked release (no-delete-share handle) — deterministic LEFTOVER, run keeps the command's code, then recovery =="
+if section "Test 13: blocked release (no-delete-share handle) — deterministic LEFTOVER, run keeps the command's code, then recovery"; then
 # Probe D1 made this lane deterministically testable (TODO #30): a pwsh
 # FileShare.Read handle on the lock file blocks the release unlink (and any
 # steal rename) until it closes. (a) sourced bash: lock_release returns 1 and
@@ -732,8 +764,9 @@ AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" AGENT_LOCK_STALE_SECS=2 AGENT_LOCK
 [ "$rc" = 0 ] && ok "leftover reclaimed once the handle closed + stale window elapsed (TODO #30 lane)" \
               || bad "leftover recovery rc=$rc (want 0)"
 grep -q STOLE "$LOG" && ok "recovery steal logged" || bad "no STOLE entry during leftover recovery"
+fi
 
-echo "== Test 14: blocked steal — a no-delete-share handle on a STALE lock defers the steal until it closes =="
+if section "Test 14: blocked steal — a no-delete-share handle on a STALE lock defers the steal until it closes"; then
 # Same handle class against a stale lock: the stealer's rename keeps failing
 # while the handle is open (probe D1), so it re-polls — and acquires promptly
 # once the handle closes. Run with the ps1 stealer: this exercises its
@@ -761,8 +794,9 @@ else
   touch "$BGO"; wait "$blk14" 2>/dev/null
   bad "T14 blocker never signalled its handle open"
 fi
+fi
 
-echo "== Test 14b: blocked steal NEVER bypasses MAX_WAIT — squatted stale lock => 97 with bounded logging (regression: busy-spin) =="
+if section "Test 14b: blocked steal NEVER bypasses MAX_WAIT — squatted stale lock => 97 with bounded logging (regression: busy-spin)"; then
   # Discriminator: when the steal rename keeps
 # failing with the lock file still present (a no-delete-share handle squatting
   # it), a failed-steal lane that `continue`s past the timeout check AND the
@@ -834,13 +868,14 @@ else
   bad "T14b squatter never signalled its handle open"
 fi
 rm -f "$LOCK"
+fi
 
 else
   echo "== Tests 13/14/14b SKIPPED (POSIX): open handles never block unlink/rename here =="
   echo "note: the LEFTOVER and blocked-steal lanes are Windows-only by construction (.NET's Unix FileShare gates no namespace operation); the Windows CI leg covers them"
 fi
 
-echo "== Test 15: ps1-side never-steal guards — dir, dangling symlink, non-lock content (parity with the bash guards) =="
+if section "Test 15: ps1-side never-steal guards — dir, dangling symlink, non-lock content (parity with the bash guards)"; then
 # The ps1 guards use different APIs than bash (PSIsContainer, reparse
 # attributes, the catch-all CreateNew exception), so bash coverage proves
 # nothing about them. The wrong-type warning needs the SAME concrete type on
@@ -899,8 +934,9 @@ grep -q "is not a lock file" "$WORK/psuser.err" && ok "ps1: config warning names
                                                 || bad "ps1: no config warning for non-lock content"
 grep -q STOLE "$LOG" && bad "ps1 STOLE the user file" || ok "ps1: no steal of the user file"
 rm -f "$LOCK"
+fi
 
-echo "== Test 16: crash recovery under CONTENTION, mixed impls — claim-serialized: zero displacement, zero 98s =="
+if section "Test 16: crash recovery under CONTENTION, mixed impls — claim-serialized: zero displacement, zero 98s"; then
 # Cross-impl variant of the unit suite's Test 2b (which carries the full
 # rationale): 2 bash + 2 pwsh waiters race ONE crashed lock. Under the claim
 # protocol the straggler-robs-recovery-winner race is PREVENTED (the claim
@@ -1032,8 +1068,9 @@ if [ "$t16_valid" = 1 ]; then
 else
   bad "T16: no clean run under a conclusive backdate in $T16_TRIES attempts (see above)"
 fi
+fi
 
-echo "== Test 16b: bash claimant vs ps1 claimant racing ONE ghost — one claim winner, cross-impl wire parity =="
+if section "Test 16b: bash claimant vs ps1 claimant racing ONE ghost — one claim winner, cross-impl wire parity"; then
 # The 1+1 distilled version of Test 16: one bash and one pwsh waiter race the
 # same ancient ghost. Exactly one wins the O_EXCL claim and steals
 # (STOLE-BY-CLAIM x1); the loser either loses the claim create (a young
@@ -1105,8 +1142,9 @@ if [ "$t16b_valid" = 1 ]; then
 else
   bad "T16b: no clean run under a conclusive backdate in $T16B_TRIES attempts (see above)"
 fi
+fi
 
-echo "== Test 16c: cross-impl claim staleness — each side clears the OTHER side's aged claim; young foreign claims are respected =="
+if section "Test 16c: cross-impl claim staleness — each side clears the OTHER side's aged claim; young foreign claims are respected"; then
 # (a) bash clears an aged ps1-tokened claim, then completes the steal.
 LOCK="$WORK/cstale.lock"; LOG="$WORK/cstale.log"; : > "$LOG"
 fabricate_lock "$LOCK" "tok.ghost.cstale" "pid=9 host=ghost"; backdate "$LOCK" 9999
@@ -1156,8 +1194,9 @@ AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" AGENT_LOCK_STALE_SECS=1 \
   && ok "ps1 respected a young bash claim (97, claim intact, no clear/steal)" \
   || bad "ps1 young-bash-claim handling: rc=$rc intact=$([ -f "$LOCK.next" ] && echo yes || echo no)"
 rm -f "$LOCK" "$LOCK.next"
+fi
 
-echo "== Test 16d: static checks — no File.Replace in the ps1 port =="
+if section "Test 16d: static checks — no File.Replace in the ps1 port"; then
 # File.Replace is deliberately never used: it throws on a
 # read-only destination and has partial-failure states when called without a
 # backup file. The 5.1 lane must stay unlink + fail-if-exists Move.
@@ -1166,8 +1205,9 @@ if grep -qE 'File\]?::Replace' "$ROOT/git-commit-lock.ps1"; then
 else
   ok "git-commit-lock.ps1 contains no File.Replace call"
 fi
+fi
 
-echo "== Test 16e: ps1 arc-end pass keeps INCONCLUSIVE entries; trap-time discovery-HOLD releases per normal release semantics =="
+if section "Test 16e: ps1 arc-end pass keeps INCONCLUSIVE entries; trap-time discovery-HOLD releases per normal release semantics"; then
 # Driven directly via a dot-sourcing pwsh driver — the ps1 side's
 # unit-equivalent steering mechanism (the lib skips its CLI when
 # dot-sourced). Part 1: the arc-end resolution pass's entry-drop is gated
@@ -1261,8 +1301,9 @@ PSEOF
 else
   echo "note: the blocked trap-time release leg is Windows-only by construction (POSIX open handles never block unlink); the happy-path leg above pins the honest-log contract"
 fi
+fi
 
-echo "== Test 16f: ps1 claim-gone-at-touch — the SetLastWriteTimeUtc FileNotFound gone signal fires; no resurrection =="
+if section "Test 16f: ps1 claim-gone-at-touch — the SetLastWriteTimeUtc FileNotFound gone signal fires; no resurrection"; then
 # The unit suite's discovery-position matrix (T25) covers bash's
 # touch-gone lane; this is the ps1 counterpart: the claim passes the
 # step-3.1 recheck, vanishes before the step-3.2 touch (steered via the
@@ -1321,9 +1362,10 @@ PSEOF
 else
   echo "== Test 16f SKIPPED: claim-gone-at-touch steering uses Windows pwsh (POSIX legs cover the protocol via the bash matrix; the ps1 gone-catch is probed Q1) =="
 fi
+fi
 
 if command -v powershell >/dev/null 2>&1; then
-echo "== Test 17: Windows PowerShell 5.1 smoke lane — the ps1 must run, not just parse, on the in-box engine =="
+if section "Test 17: Windows PowerShell 5.1 smoke lane — the ps1 must run, not just parse, on the in-box engine"; then
 # Everything above runs the port under pwsh (7+). 5.1 ships in every Windows
 # 10/11 box and stays supported, so its claim is tested, not asserted: the
   # run lane's exit-code contract (0 / exit 7 / the failing-cmdlet -> 1) and
@@ -1393,12 +1435,23 @@ AGENT_LOCK_PATH="$LOCK51" AGENT_LOCK_LOG="$LOG51" AGENT_LOCK_STALE_SECS=2 \
 grep -q "CLAIM .*tok=tok\.ps\." "$LOG51" && ok "5.1: claim create logged with its per-attempt token" || bad "5.1: no CLAIM line with a tok.ps.* token"
 [ -e "$LOCK51" ] && bad "5.1: leftover lock after the steal ladder" || ok "5.1: no leftover lock"
 [ -e "$LOCK51.next" ] && bad "5.1: leftover claim after the steal ladder" || ok "5.1: no leftover claim"
+fi
 else
   echo "== Test 17 SKIPPED: Windows PowerShell 5.1 (powershell) not on PATH — POSIX leg; the Windows CI leg covers it =="
   echo "note: the 5.1 unlink+Move steal-ladder leg is part of this lane and is covered by the Windows CI leg"
 fi
 
 echo
+# Zero-match guard: a set-but-non-matching GCL_TEST_ONLY ran no test block, so
+# the (vacuously green) verdict below would lie. Bail loudly instead — a typo'd
+# selector regex must FAIL, not pass with zero assertions.
+if [ -n "${GCL_TEST_ONLY:-}" ] && [ "$SECTIONS_RUN" = 0 ]; then
+  echo "Bail out! GCL_TEST_ONLY=\"$GCL_TEST_ONLY\" matched no test" >&2
+  exit 1
+fi
+# When a selector is active, report how many blocks it matched (the default run
+# stays byte-unchanged because this is gated on GCL_TEST_ONLY being non-empty).
+[ -n "${GCL_TEST_ONLY:-}" ] && echo "selector GCL_TEST_ONLY=\"$GCL_TEST_ONLY\" ran $SECTIONS_RUN test block(s)"
 DONE=1
 echo "==== INTEROP RESULT: $PASS passed, $FAIL failed (fan-out: $GCL_MODE) ===="
 [ "$GCL_TAP" = 1 ] && echo "1..$TAPN"
diff --git a/tests/git-commit-lock.test.sh b/tests/git-commit-lock.test.sh
index 56cc7c2..7fc5f2b 100755
--- a/tests/git-commit-lock.test.sh
+++ b/tests/git-commit-lock.test.sh
@@ -64,8 +64,21 @@ finish() {
 }
 trap finish EXIT
 
-PASS=0; FAIL=0; TAPN=0; DONE=0
+PASS=0; FAIL=0; TAPN=0; DONE=0; SECTIONS_RUN=0
 GCL_TAP="${GCL_TAP:-0}"           # CI sets GCL_TAP=1 for machine-readable TAP13 output
+GCL_TEST_ONLY="${GCL_TEST_ONLY:-}"  # if set, run ONLY test blocks whose label REGEX-matches (single-test selector)
+# section() replaces each per-test header `echo "== Test N: … =="`: it echoes the
+# header verbatim (visible output unchanged) and returns success — gating the
+# `if section …; then … fi` block — iff GCL_TEST_ONLY is unset/empty OR its regex
+# matches the label. A run-counter (SECTIONS_RUN) backs the zero-match guard below,
+# so a typo'd selector regex can't masquerade as a vacuous PASS=0/FAIL=0 green.
+section() {
+  echo "== $1 =="
+  if [ -z "${GCL_TEST_ONLY:-}" ] || [[ "$1" =~ $GCL_TEST_ONLY ]]; then
+    SECTIONS_RUN=$((SECTIONS_RUN + 1)); return 0
+  fi
+  return 1
+}
 # ok/bad are TAP-aware (gated by GCL_TAP so plain dev runs are byte-unchanged) and
 # bump the running assertion number TAPN. The trailing `1..$TAPN` plan line (emitted
 # just before the verdict) lets a TAP consumer fail on a short count; together with the
@@ -202,7 +215,7 @@ wait_for_grep() {
 # Critical section that loses updates without a mutex: read, gap, write+1.
 INCR='n="$(cat "$1")"; sleep 0.03; echo $((n+1)) > "$1"'
 
-echo "== Test 1: concurrent workers, mutual exclusion (repeated rounds, $GCL_MODE width) =="
+if section "Test 1: concurrent workers, mutual exclusion (repeated rounds, $GCL_MODE width)"; then
 # A single pass is too weak to trust a rare exclusion race (the release-steal
 # bug found 2026-05-30 lost ~1 update per 25 only intermittently). Repeat
 # several rounds; ANY lost update across ALL rounds fails the test.
@@ -232,8 +245,9 @@ done
 grep -q "Staleness detection is BROKEN" "$T1ERR" \
   && bad "spurious mtime-probe WARNING under contention (see $T1ERR)" \
   || ok "no spurious mtime-probe warnings under contention"
+fi
 
-echo "== Test 2: stale lock (old file mtime) is stolen; holder comes from line 2 =="
+if section "Test 2: stale lock (old file mtime) is stolen; holder comes from line 2"; then
 LOCK="$WORK/steal.lock"; LOG="$WORK/steal.log"; : > "$LOG"; MARKER="$WORK/steal-marker"
 fabricate_lock "$LOCK" "tok.fake.99999.1" "pid=99999 host=ghost"
 backdate "$LOCK" 9999                       # make the FILE mtime ancient -> stale
@@ -247,8 +261,9 @@ grep -q STOLE "$LOG" && ok "log records a steal" || bad "no STOLE entry"
 grep -q "holder=pid=99999 host=ghost" "$LOG" \
   && ok "STALE log line carries the holder parsed from line 2" \
   || bad "holder from line 2 missing in the STALE log line"
+fi
 
-echo "== Test 2b: crash recovery under CONTENTION — claim-serialized: zero displacement, zero 98s ($GCL_MODE: $T2B_ROUNDS rounds) =="
+if section "Test 2b: crash recovery under CONTENTION — claim-serialized: zero displacement, zero 98s ($GCL_MODE: $T2B_ROUNDS rounds)"; then
 # The claim SERIALIZES stealers, so the straggler-robs-recovery-winner race
 # is PREVENTED, not detected-and-repaired. Scenario: one crashed lock, N
 # waiters judging stale in the same poll window (the launch/backdate sync
@@ -383,8 +398,9 @@ done
   || bad "'STOLE stale lock' line appeared x$t2b_old_shape — an unserialized steal lane is present"
 [ "$t2b_disp" = 0 ] && ok "zero STEAL-DISPLACED lines (prevention, not detect-and-repair)" \
   || bad "STEAL-DISPLACED fired x$t2b_disp — displacement-repair machinery present?"
+fi
 
-echo "== Test 3: REGRESSION — EMPTY lock file (crash between create and write) is still stolen =="
+if section "Test 3: REGRESSION — EMPTY lock file (crash between create and write) is still stolen"; then
 # The file-protocol descendant of the 2026-05-30 orphan bug: an acquirer that
 # died after the open but before (or mid-) content write leaves an empty file.
 # Staleness MUST come from the file mtime and the content guard MUST class an
@@ -398,8 +414,9 @@ AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" AGENT_LOCK_STALE_SECS=2 \
   bash "$LIB" run -- bash -c 'echo after > "$1"' _ "$MARKER"; rc=$?
 [ "$rc" = 0 ] && ok "empty-file orphan stolen (no hang)" || bad "orphan NOT stolen (rc=$rc) — regression!"
 [ "$(cat "$MARKER")" = after ] && ok "command ran after stealing orphan" || bad "command did not run"
+fi
 
-echo "== Test 4: a LIVE lock is NOT stolen (waiter logs WAITING, blocks, then proceeds) =="
+if section "Test 4: a LIVE lock is NOT stolen (waiter logs WAITING, blocks, then proceeds)"; then
 LOCK="$WORK/live.lock"; LOG="$WORK/live.log"; : > "$LOG"; ORDER="$WORK/order"; echo none > "$ORDER"
 READY="$WORK/t4.ready"; GO4="$WORK/t4.go"
 # Holder keeps the lock until the test has SEEN the waiter contend (the
@@ -422,8 +439,9 @@ wait "$waiter"; wait "$holder"
 [ "$(tr '\n' ',' < "$ORDER")" = "none,holder-start,holder-end,waiter-ran," ] \
   && ok "ordering correct" || bad "ordering wrong: $(tr '\n' ',' < "$ORDER")"
 grep -q STOLE "$LOG" && bad "waiter wrongly STOLE a live lock" || ok "no wrongful steal of live lock"
+fi
 
-echo "== Test 4b: a ROBBED slow holder detects the theft and FAILS with 98 on release =="
+if section "Test 4b: a ROBBED slow holder detects the theft and FAILS with 98 on release"; then
 # The fail-open ceiling: a hold longer than the stale window CAN be stolen by a
 # contender. The robbed holder must DETECT this at release (the lock file is
 # gone, or carries the thief's token) and exit EXACTLY 98 (the reserved
@@ -454,8 +472,9 @@ wait "$vpid"; victim_rc=$?
 grep -q "WARNING: lock LOST" "$LOG" && ok "robbed holder logged a loud theft WARNING" || bad "no theft WARNING logged"
 [ "$thief_rc" = 0 ] && ok "thief (its own fresh hold) released cleanly (rc 0)" || bad "thief rc=$thief_rc (should be 0)"
 grep -q thief-work "$OUT" && ok "thief did its work" || bad "thief work missing"
+fi
 
-echo "== Test 4c: a slow but UNCONTENDED holder keeps its lock (slowness != failure) =="
+if section "Test 4c: a slow but UNCONTENDED holder keeps its lock (slowness != failure)"; then
 # Documents the boundary: exceeding the stale window is only dangerous when a
 # contender actually steals. With no waiter, the file is never moved, the token
 # still matches, and release succeeds. (If this failed, the lock would punish
@@ -466,16 +485,18 @@ AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" AGENT_LOCK_STALE_SECS=1 AGENT_LOCK
 [ "$solo_rc" = 0 ] && ok "uncontended slow holder released cleanly (rc 0)" || bad "uncontended slow holder rc=$solo_rc (should be 0)"
 grep -q "WARNING: lock LOST" "$LOG" && bad "spurious theft WARNING with no contender" || ok "no spurious WARNING when uncontended"
 grep -q solo-done "$OUT" && ok "uncontended slow holder did its work" || bad "work missing"
+fi
 
-echo "== Test 5: run propagates the command's exit code, releases either way =="
+if section "Test 5: run propagates the command's exit code, releases either way"; then
 LOCK="$WORK/rc.lock"; LOG="$WORK/rc.log"; : > "$LOG"
 AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" bash "$LIB" run -- bash -c 'exit 0'; rc=$?
 [ "$rc" = 0 ] && ok "exit 0 propagated" || bad "exit 0 not propagated (rc=$rc)"
 AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" bash "$LIB" run -- bash -c 'exit 7'; rc=$?
 [ "$rc" = 7 ] && ok "exit 7 propagated" || bad "exit code not propagated (rc=$rc)"
 [ -e "$LOCK" ] && bad "lock left held after run" || ok "lock released after run (success and failure)"
+fi
 
-echo "== Test 6: default lock FILE and log live in the git dir =="
+if section "Test 6: default lock FILE and log live in the git dir"; then
 SCRATCH="$WORK/scratch"; mkdir -p "$SCRATCH"
 git -C "$SCRATCH" init -q; git -C "$SCRATCH" config user.email t@t; git -C "$SCRATCH" config user.name t
 GITDIR="$(git -C "$SCRATCH" rev-parse --absolute-git-dir)"
@@ -494,8 +515,9 @@ touch "$GO6"
 wait "$h6"
 [ -e "$GITDIR/commit.lock" ] && bad "default lock file left behind after release" || ok "default lock file removed on release"
 [ -f "$GITDIR/git-commit-lock.log" ] && ok "lock log created in git dir ($GITDIR)" || bad "no log in git dir"
+fi
 
-echo "== Test 7: CLI usage errors exit 96 (stderr); explicit --help/-h exits 0 (stdout) =="
+if section "Test 7: CLI usage errors exit 96 (stderr); explicit --help/-h exits 0 (stdout)"; then
 bash "$LIB" >/dev/null 2>&1;            [ "$?" = 96 ] && ok "no args -> 96" || bad "no args rc=$? (want 96)"
 bash "$LIB" frobnicate > "$WORK/t7.err.out" 2> "$WORK/t7.err.err"
 [ "$?" = 96 ] && ok "unknown subcommand -> 96" || bad "unknown subcommand rc=$? (want 96)"
@@ -514,8 +536,9 @@ for h in --help -h; do
     && ok "$h -> usage on stdout, exit 0, stderr empty" \
     || bad "$h rc=$rc (want 0) stdout-usage=$(grep -c '^usage:' "$WORK/t7.help.out") stderr=$(head -c 60 "$WORK/t7.help.err")"
 done
+fi
 
-echo "== Test 8: acquire timeout exits 97 and the command NEVER runs =="
+if section "Test 8: acquire timeout exits 97 and the command NEVER runs"; then
 LOCK="$WORK/tmo.lock"; LOG="$WORK/tmo.log"; : > "$LOG"; READY="$WORK/t8.ready"; DONE8="$WORK/t8.done"
 # Holder keeps the lock until the test says so (marker, not a fixed sleep —
 # under heavy load a slow-starting waiter once arrived AFTER a 4s holder had
@@ -561,8 +584,9 @@ grep -q "raise AGENT_LOCK_MAX_WAIT" "$WORK/t8.warn3.err" \
   || ok "explicit MAX_WAIT silences the knob-relation warning (left-default gate kept)"
 wait "$h8"; rc=$?
 [ "$rc" = 0 ] && ok "holder unaffected by the timed-out waiter" || bad "holder rc=$rc (want 0)"
+fi
 
-echo "== Test 9: sub-floor (pre-2000) file mtime is NOT treated as stale =="
+if section "Test 9: sub-floor (pre-2000) file mtime is NOT treated as stale"; then
 # The FILETIME-zero guard: a freshly created file can transiently report a 1601
 # mtime to an observer on Windows (probes C/C1b);
 # anything before 2000-01-01 must be classed unsettled — the waiter WAITS (and
@@ -578,8 +602,9 @@ AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" AGENT_LOCK_STALE_SECS=1 \
 grep -q STOLE "$LOG" && bad "sub-floor lock was wrongly STOLEN" || ok "no steal of sub-floor lock"
 [ -f "$LOCK" ] && ok "sub-floor lock file untouched" || bad "sub-floor lock file was removed"
 rm -f "$LOCK"
+fi
 
-echo "== Test 10: every worktree gets its OWN lock (git-dir scoping) =="
+if section "Test 10: every worktree gets its OWN lock (git-dir scoping)"; then
 WTREPO="$WORK/wtrepo"; mkdir -p "$WTREPO"
 git -C "$WTREPO" init -q; git -C "$WTREPO" config user.email t@t; git -C "$WTREPO" config user.name t
 git -C "$WTREPO" commit -q --allow-empty -m init
@@ -612,8 +637,9 @@ wait "$h10"
 [ -e "$WTGD/commit.lock" ] && bad "worktree lock left behind" || ok "worktree lock released"
 [ -f "$WTGD/git-commit-lock.log" ] && ok "worktree log lives in its worktree git dir" || bad "no log at $WTGD"
 [ -e "$MAINGD/commit.lock" ] && bad "main-repo lock left behind" || ok "main-repo lock released"
+fi
 
-echo "== Test 11: TERM mid-hold — lock released, wrapper dies with 128+15 =="
+if section "Test 11: TERM mid-hold — lock released, wrapper dies with 128+15"; then
 # Two discriminators: (a) the EXIT/TERM trap must actually
 # release the lock when the `run` wrapper is killed; (b) the wrapper must NOT
 # swallow the signal (a swallowing wrapper releases, keeps going, and exits 0
@@ -637,8 +663,9 @@ wait "$w11"; rc=$?
                 || bad "TERM'd run wrapper rc=$rc (want 143)"
 [ -e "$LOCK" ] && bad "lock left held after TERM" || ok "lock released on TERM"
 grep -q RELEASED "$LOG" && ok "release logged on TERM path" || bad "no RELEASED entry on TERM path"
+fi
 
-echo "== Test 12: sourced API — acquire/release, traps, strict-mode hygiene =="
+if section "Test 12: sourced API — acquire/release, traps, strict-mode hygiene"; then
 # 12a: sourcing must not impose errexit/nounset/pipefail; acquire/release work
 # across separate commands; reentrant acquire is refused (rc 1, lock kept);
 # release is idempotent. Distinct failure codes pinpoint the broken step.
@@ -730,8 +757,9 @@ done
 wait "$p12"; rc=$?
 [ "$rc" = 143 ] && ok "post-release shell dies on TERM (143) — signal disposition restored" \
                 || bad "post-release shell rc=$rc on TERM (want 143; signal-immune shell?)"
+fi
 
-echo "== Test 13: garbage AGENT_LOCK_* numerics fall back to defaults with a note =="
+if section "Test 13: garbage AGENT_LOCK_* numerics fall back to defaults with a note"; then
 LOCK="$WORK/num.lock"; LOG="$WORK/num.log"; : > "$LOG"
 AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" \
   AGENT_LOCK_STALE_SECS=banana AGENT_LOCK_POLL_SECS=-1 AGENT_LOCK_MAX_WAIT=0 \
@@ -740,8 +768,9 @@ AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" \
 [ "$rc" = 0 ] && ok "run succeeds despite garbage numeric config" || bad "rc=$rc with garbage numerics"
 n="$(grep -c "ignoring invalid" "$WORK/t13.err")"
 [ "$n" = 4 ] && ok "all 4 garbage values noted on stderr, incl. CLAIM_STALE_SECS (got $n)" || bad "expected 4 'ignoring invalid' notes, got $n"
+fi
 
-echo "== Test 14: run outside any git repo hard-fails 96 unless AGENT_LOCK_PATH is set =="
+if section "Test 14: run outside any git repo hard-fails 96 unless AGENT_LOCK_PATH is set"; then
 NR="$WORK/norepo"; mkdir -p "$NR"
 ( cd "$NR" && env GIT_CEILING_DIRECTORIES="$WORK" bash "$LIB" run -- bash -c 'true' ) 2> "$WORK/t14.err"; rc=$?
 [ "$rc" = 96 ] && ok "run outside a repo refused with 96" || bad "run outside a repo rc=$rc (want 96)"
@@ -749,8 +778,9 @@ grep -q "AGENT_LOCK_PATH" "$WORK/t14.err" && ok "refusal message mentions AGENT_
 ( cd "$NR" && env GIT_CEILING_DIRECTORIES="$WORK" AGENT_LOCK_PATH="$NR/x.lock" AGENT_LOCK_LOG="$NR/x.log" \
     bash "$LIB" run -- bash -c 'true' ) 2>/dev/null; rc=$?
 [ "$rc" = 0 ] && ok "explicit AGENT_LOCK_PATH works outside a repo" || bad "explicit AGENT_LOCK_PATH outside repo rc=$rc"
+fi
 
-echo "== Test 14b: SOURCING outside a repo warns on stderr and creates NO files =="
+if section "Test 14b: SOURCING outside a repo warns on stderr and creates NO files"; then
 # Sourcing keeps the CWD fallback (it must never explode), but the warning
 # goes to STDERR — warning via the lock log instead would, as a side
 # effect, CREATE ./git-commit-lock.log in whatever random directory the
@@ -770,8 +800,9 @@ leftovers="$(ls -A "$NRS" 2>/dev/null)"
 # (There is deliberately no Test 15: the steal installs by rename-over and
 # never creates a move-aside (.dead.*) file, so there is no sweep to test.
 # An implementation must never create one; Test 2b's sampler enforces that.)
+fi
 
-echo "== Test 16: EMPTY lock file at release — unverifiable lane (2 / run:1), NOT a theft verdict =="
+if section "Test 16: EMPTY lock file at release — unverifiable lane (2 / run:1), NOT a theft verdict"; then
 # Truncation stands in for the probe-F window: a file that reads empty after
 # the retry ladder is a successor mid-create after a boundary steal, or
 # external truncation — it canNOT be our own failed write (acquire's
@@ -799,8 +830,9 @@ AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" \
   bash "$LIB" run -- bash -c ': > "$AGENT_LOCK_PATH"; exit 7' 2>/dev/null; rc=$?
 [ "$rc" = 7 ] && ok "run keeps a failing command's own code (7) over the unverifiable 1" || bad "run empty-file+exit-7 rc=$rc (want 7)"
 rm -f "$LOCK"
+fi
 
-echo "== Test 16b: lock file GONE at release — definitive theft, exactly 98 =="
+if section "Test 16b: lock file GONE at release — definitive theft, exactly 98"; then
 # Acquire's read-back proved our
 # token was AT the path, so a missing file at release can only mean someone
 # renamed/removed it (a steal, or external interference) — report 98, loudly.
@@ -819,8 +851,9 @@ AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" \
   bash "$LIB" run -- bash -c 'rm -f "$AGENT_LOCK_PATH"' 2>/dev/null; rc=$?
 [ "$rc" = 98 ] && ok "run reports 98 (overrides a successful command) when the lock file is gone" \
                || bad "run gone-at-release rc=$rc (want 98)"
+fi
 
-echo "== Test 16c: release rides out a TRANSIENT empty read (escalating retry ladder — ps1 parity) =="
+if section "Test 16c: release rides out a TRANSIENT empty read (escalating retry ladder — ps1 parity)"; then
 # A sub-second window in which the lock file reads EMPTY (stand-in for an AV
 # scanner's blocking handle, or a probe-F create->write gap that resolves)
 # must NOT produce the unverifiable verdict: the read-retry ladder (shared
@@ -853,8 +886,9 @@ AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" bash -c '
 grep -q "EMPTY/unreadable at release" "$WORK/t16c.err" \
   && bad "spurious unverifiable warning despite the token reappearing" \
   || ok "no unverifiable warning for the ridden-out transient"
+fi
 
-echo "== Test 17: NON-FILE at the lock path — never stolen, loud one-time config warning, waiters reach 97 =="
+if section "Test 17: NON-FILE at the lock path — never stolen, loud one-time config warning, waiters reach 97"; then
 # (a) a directory (a config typo like AGENT_LOCK_PATH=\$HOME, or a directory
 # lock left by an older release). The per-poll type guard fires regardless of
 # age — but only after the SAME concrete type is seen on two consecutive
@@ -929,8 +963,9 @@ else
   rm -f "$LOCK" 2>/dev/null
   echo "note: mkfifo unavailable/unusable here — FIFO guard not exercised (CI POSIX legs cover it)"
 fi
+fi
 
-echo "== Test 17d: REGRESSION — create/delete churn at the lock path must NOT fire the non-lock warning =="
+if section "Test 17d: REGRESSION — create/delete churn at the lock path must NOT fire the non-lock warning"; then
 # The per-poll guard's existence (-e/-L) and classification (-f && ! -L)
 # checks are SEPARATE stats. A rival's release/steal unlink landing between
 # them — or a Windows delete-pending ghost (the unlink queues behind a rival
@@ -1069,8 +1104,9 @@ if [ -n "$churn_pid" ]; then
 else
   echo "note: $churn_skip — churn-vs-guard regression not exercised here (CI legs cover it)"
 fi
+fi
 
-echo "== Test 18: stale NON-LOCK CONTENT at the lock path is never stolen; torn tokens split on the tok. prefix =="
+if section "Test 18: stale NON-LOCK CONTENT at the lock path is never stolen; torn tokens split on the tok. prefix"; then
 # The content guard (age-gated): steal only an empty file or a line 1 starting
 # "tok.". A real user file at a typo'd AGENT_LOCK_PATH must survive, forever.
 # (a) a user file
@@ -1113,8 +1149,9 @@ AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" AGENT_LOCK_STALE_SECS=2 \
   && ok "tok.-prefixed torn token IS stolen by staleness (crash-orphan lane)" \
   || bad "tok.-prefixed torn token not stolen (rc=$rc marker=$(cat "$MARKER"))"
 grep -q STOLE "$LOG" && ok "steal of the torn token logged" || bad "no STOLE entry for torn token"
+fi
 
-echo "== Test 19: wire format — token on line 1 (tok.-prefixed), owner on line 2 =="
+if section "Test 19: wire format — token on line 1 (tok.-prefixed), owner on line 2"; then
 # Pins the on-disk format the ps1 port must match, and that token parsing
 # takes LINE 1 only (an owner line present must not pollute the token).
 LOCK="$WORK/wire.lock"; LOG="$WORK/wire.log"; : > "$LOG"
@@ -1130,8 +1167,9 @@ AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" bash -c '
 ' _ "$LIB" "$LOCK"; rc=$?
 [ "$rc" = 0 ] && ok "lock file carries token (line 1, tok.-prefixed) + owner (line 2); release parses line 1 with owner present" \
               || bad "wire-format check failed at step code $rc"
+fi
 
-echo "== Test 20: claim contention — N concurrent stealers, ONE claim winner ($GCL_MODE: $T20_N workers) =="
+if section "Test 20: claim contention — N concurrent stealers, ONE claim winner ($GCL_MODE: $T20_N workers)"; then
 # N stealers race one ancient ghost: exactly one wins the O_EXCL claim and
 # steals (one STOLE-BY-CLAIM); the rest lose the claim create and acquire
 # normally in sequence after the winner releases. No displacement (zero
@@ -1165,8 +1203,9 @@ nlost="$(grep -c "lock LOST" "$WORK/contend.all.log")"
 [ "$nlost" = 0 ] && ok "zero LOST warnings under claim contention" || bad "$nlost LOST warnings under claim contention"
 [ -e "$LOCK" ] && bad "leftover lock after contention" || ok "no leftover lock"
 [ -e "$LOCK.next" ] && bad "leftover claim after contention" || ok "no leftover claim"
+fi
 
-echo "== Test 21: crashed-claimant and empty-claim orphans age out; steals resume =="
+if section "Test 21: crashed-claimant and empty-claim orphans age out; steals resume"; then
 # (a) an aged foreign claim (crashed claimant): cleared by CLAIM-STALE-CLEARED,
 # then the steal completes; recovery latency bounded.
 LOCK="$WORK/cc.lock"; LOG="$WORK/cc.log"; : > "$LOG"
@@ -1191,8 +1230,9 @@ AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" AGENT_LOCK_STALE_SECS=1 \
   bash "$LIB" run -- bash -c 'true' 2>/dev/null; rc=$?
 [ "$rc" = 0 ] && ok "empty claim orphan aged out and recovery completed (rc 0)" || bad "rc=$rc behind an empty claim orphan"
 grep -q "CLAIM-STALE-CLEARED" "$LOG" && ok "empty claim cleared via the same staleness lane" || bad "empty claim was not cleared"
+fi
 
-echo "== Test 22: NON-CLAIM objects at the claim path — never deleted, per-path warn state =="
+if section "Test 22: NON-CLAIM objects at the claim path — never deleted, per-path warn state"; then
 # (a) a directory at ${LOCK}.next blocks steals (waiter reaches 97), is never
 # deleted, and warns once naming the claim path.
 LOCK="$WORK/cwt.lock"; LOG="$WORK/cwt.log"; : > "$LOG"
@@ -1305,8 +1345,9 @@ AGENT_LOCK_PATH="$PPD2/c1.lock" AGENT_LOCK_LOG="$PPD2/ppg2.log" AGENT_LOCK_STALE
 grep -q "is not a claim file" "$PPD2/ba.err" && grep -q "is not a lock file" "$PPD2/ba.err" \
   && ok "claim-path warning did not suppress the lock-path warning (reverse order)" \
   || bad "lock-path warning suppressed after a claim-path warning (shared warn-once state?)"
+fi
 
-echo "== Test 23: live-slow holder — re-verify under the claim sees a fresh lock, CLAIM-ABORT (fresh), no steal =="
+if section "Test 23: live-slow holder — re-verify under the claim sees a fresh lock, CLAIM-ABORT (fresh), no steal"; then
 # Steered deterministically: the lock's mtime is renewed (as a live-slow
 # holder's re-create/renewal would) at the exact step-2 re-verify position,
 # via a sourced shell that wraps the library's verify internal. The claimant
@@ -1337,8 +1378,9 @@ wait "$w23"; rc=$?
 [ "$rc" = 0 ] && ok "waiter then acquired and released normally (rc 0)" || bad "waiter rc=$rc after the slow holder released"
 grep -q "STOLE-BY-CLAIM" "$LOG" && bad "live lock was STOLEN despite the fresh re-verify" || ok "no steal of the live-slow holder's lock"
 [ -e "$LOCK.next" ] && bad "claim leftover after the fresh abort" || ok "claim deleted on the fresh abort"
+fi
 
-echo "== Test 24: OVERAGED own claim — CLAIM-ABORT (contested), no rename =="
+if section "Test 24: OVERAGED own claim — CLAIM-ABORT (contested), no rename"; then
 # A suspended claimant's recheck must refuse to proceed on its own overaged
 # claim (a clearer may be acting on it). Steered: every recheck sees the
 # claim backdated past CLAIM_STALE. Mutation check: an implementation that
@@ -1364,8 +1406,9 @@ l1=""; IFS= read -r l1 < "$LOCK" || true
 [ "$l1" = "tok.ghost.t24" ] && ok "ghost lock untouched by the contested aborts" || bad "ghost lock was modified (line1=$l1)"
 [ -e "$LOCK.next" ] && bad "claim leftover after contested aborts" || ok "claim deleted on each contested abort"
 rm -f "$LOCK"
+fi
 
-echo "== Test 25: discovery-position matrix — own-claim-installed discovered on EVERY exit =="
+if section "Test 25: discovery-position matrix — own-claim-installed discovered on EVERY exit"; then
 # A rival's rename can install OUR claim as the lock while we sit at any
 # post-claim position. Each position steers that rename to the exact spot
 # (wrapping a library internal or shadowing mv/rm/touch in a sourced shell)
@@ -1468,8 +1511,9 @@ for pos in step2-fresh recheck-gone touch-gone lock-gone contested deletion-gone
     bad "position $pos: rc=$rc discovery=$(grep -c DISCOVERY-HOLD "$LOG") expect-line=$(grep -cF "$expect" "$LOG") lock-left=$([ -e "$LOCK" ] && echo yes || echo no) claim-left=$([ -e "$LOCK.next" ] && echo yes || echo no)"
   fi
 done
+fi
 
-echo "== Test 26: delayed claim still installs a FRESH lease (the pre-rename touch) =="
+if section "Test 26: delayed claim still installs a FRESH lease (the pre-rename touch)"; then
 # A claim aged close to CLAIM_STALE (steered: backdated 40s of 60 at the
 # recheck) must still install a lock whose mtime is ~now — the step-3.2
 # touch resets the clock; rename preserves it (probe R2). A no-touch
@@ -1500,8 +1544,9 @@ case "$rc" in
   *)  bad "delayed-claim lease harness rc=$rc" ;;
 esac
 grep -q "STOLE-BY-CLAIM" "$LOG" && ok "the delayed claim still completed its steal" || bad "no STOLE-BY-CLAIM in the lease test"
+fi
 
-echo "== Test 27: lock GONE at re-verify — CLAIM-ABORT (gone), NO rename onto the absent path =="
+if section "Test 27: lock GONE at re-verify — CLAIM-ABORT (gone), NO rename onto the absent path"; then
 # A live-slow holder releasing under a claimant must route to the normal
 # create race, never a rename onto the absent path. Mutation check: a
 # renaming implementation would install the CLAIM token; the correct one
@@ -1532,8 +1577,9 @@ else
   bad "claim token vs acquired token: claim='$ctok' acquired='$atok' (equal or missing => renamed onto the absent path?)"
 fi
 grep -q "DISCOVERY-HOLD" "$LOG" && bad "spurious discovery-HOLD in the gone lane" || ok "no spurious discovery-HOLD"
+fi
 
-echo "== Test 28: SUB-FLOOR claim mtime is never cleared — treated as just-created =="
+if section "Test 28: SUB-FLOOR claim mtime is never cleared — treated as just-created"; then
 LOCK="$WORK/cfloor.lock"
 LOG="$WORK/cfloor.log"
 : >"$LOG"
@@ -1549,8 +1595,9 @@ grep -q "CLAIM-STALE-CLEARED" "$LOG" && bad "sub-floor claim was CLEARED — mti
                                      || ok "sub-floor claim never cleared (floor applies to the claim)"
 [ -f "$LOCK.next" ] && ok "sub-floor claim file untouched" || bad "sub-floor claim file was removed"
 rm -f "$LOCK" "$LOCK.next"
+fi
 
-echo "== Test 29: BLOCKED steal rename — claim deleted IMMEDIATELY, no CLAIM_STALE penalty =="
+if section "Test 29: BLOCKED steal rename — claim deleted IMMEDIATELY, no CLAIM_STALE penalty"; then
 # The rename is forced to fail-with-the-lock-still-present (a shadowed mv —
 # the no-delete-share squat, deterministically). The claimant must delete its
 # own claim at once and re-poll: with CLAIM_STALE=600, a leftover claim would
@@ -1579,8 +1626,9 @@ grep -q "steal FAILED" "$LOG" && ok "blocked rename logged (damped steal FAILED)
 [ -e "$LOCK.next" ] && bad "claim leftover after the blocked steal attempts" || ok "no claim leftover at exit"
 [ -f "$LOCK" ] && ok "squatted lock left in place" || bad "lock vanished in the blocked lane"
 rm -f "$LOCK"
+fi
 
-echo "== Test 30: static checks — the claim touch is NON-creating with an explicit existence check =="
+if section "Test 30: static checks — the claim touch is NON-creating with an explicit existence check"; then
 grep -q 'touch -c -- "\$_LOCK_CLAIM_PATH"' "$LIB" \
   && ok "claim touch uses 'touch -c --' (non-creating)" \
   || bad "no 'touch -c -- \$_LOCK_CLAIM_PATH' in the implementation"
@@ -1590,8 +1638,9 @@ grep -A3 'touch -c -- "\$_LOCK_CLAIM_PATH"' "$LIB" | grep -q -- '-e "\$_LOCK_CLA
 bad_touch="$(grep 'touch ' "$LIB" | grep '_LOCK_CLAIM_PATH' | grep -v -- '-c')"
 [ -z "$bad_touch" ] && ok "no creating touch of the claim path anywhere" \
                     || bad "creating touch of the claim path found: $bad_touch"
+fi
 
-echo "== Test 31: LEAKED-claim discovery — the leaked-token memory closes the unverified-claim lanes =="
+if section "Test 31: LEAKED-claim discovery — the leaked-token memory closes the unverified-claim lanes"; then
 # (a) main leg: a recheck-unreadable exit leaks the claim token; a rival
 # (the external mv below) then installs that claim as the lock; the leaver
 # adopts it (HOLD) and release returns 0. Adoption may go through EITHER of
@@ -1801,8 +1850,9 @@ case "$(uname -s 2>/dev/null)" in
     echo "note: the blocked-unlink feeder leg is Windows-only by construction (POSIX open handles never block unlink); the read-shadow legs above cover the memory machinery"
     ;;
 esac
+fi
 
-echo "== Test 32: per-attempt tokens — an abandoned own-token lock never aliases discovery or release =="
+if section "Test 32: per-attempt tokens — an abandoned own-token lock never aliases discovery or release"; then
 # Walk: the first CREATE's read-back is forced blank (and the abandoned lock
 # backdated stale). A later CLAIM attempt is steered into a recheck-gone
 # discovery against that abandoned lock: a reused-per-acquire-token
@@ -1845,8 +1895,9 @@ grep -q "DISCOVERY-HOLD" "$LOG" && bad "FALSE discovery-HOLD on the abandoned ow
                                 || ok "no false discovery-HOLD — the abandoned token did not alias the claim attempt"
 grep -q "STOLE-BY-CLAIM" "$LOG" && ok "the abandoned lock was then reclaimed by a normal steal" \
                                 || bad "no STOLE-BY-CLAIM of the abandoned lock"
+fi
 
-echo "== Test 32b: steal-path read-back FAILED — rename-over WON but the lock did not read back our token (F2) =="
+if section "Test 32b: steal-path read-back FAILED — rename-over WON but the lock did not read back our token (F2)"; then
 # The steal-path twin of Test 32. Here the stealer WINS the claim race AND wins
 # the rename-over (STOLE-BY-CLAIM is logged, the ghost is destroyed), but the
 # mandatory post-rename read-back verification (git-commit-lock.sh:1171) comes
@@ -1898,8 +1949,9 @@ else
 fi
 [ -e "$LOCK" ] && bad "lock leftover after the steal-readback walk" || ok "lock released cleanly"
 [ -e "$LOCK.next" ] && bad "claim leftover after the steal-readback walk" || ok "no claim leftover"
+fi
 
-echo "== Test 33: TERM mid-claim — the trap deletes the claim (token-checked), no 98, no ageout penalty =="
+if section "Test 33: TERM mid-claim — the trap deletes the claim (token-checked), no 98, no ageout penalty"; then
 # (a) main: claimant paused inside its claim window (at the touch), TERM'd.
 # The trap must delete OUR claim, run the discovery read (miss: the ghost is
 # foreign), restore traps, re-raise (143) — and must NOT touch the lock.
@@ -2030,8 +2082,9 @@ case "$(uname -s 2>/dev/null)" in
     echo "note: TERM-blocked-unlink leg is Windows-only by construction (POSIX open handles never block unlink)"
     ;;
 esac
+fi
 
-echo "== Test 34: TERM on a STEAL-acquired hold releases exactly like a create-acquired one =="
+if section "Test 34: TERM on a STEAL-acquired hold releases exactly like a create-acquired one"; then
 # All acquisition paths go through the shared claim-the-hold helper, so a
 # steal-acquired holder must run the same HELD/trap machinery: release on
 # TERM, re-raise, 143 (T11's contract, on a steal-acquired hold).
@@ -2054,8 +2107,9 @@ wait "$w34"; rc=$?
 [ "$rc" = 143 ] && ok "TERM'd steal-acquired holder exited 143 (signal re-raised)" || bad "steal-acquired TERM rc=$rc (want 143)"
 [ -e "$LOCK" ] && bad "lock left held after TERM on a steal-acquired hold" || ok "steal-acquired lock released on TERM"
 grep -q "RELEASED" "$LOG" && ok "release logged on the steal-acquired TERM path" || bad "no RELEASED entry for the steal-acquired hold"
+fi
 
-echo "== Test 35: release-time leaked-claim cleanup — displaced hold cleans its own installed leak, 98 =="
+if section "Test 35: release-time leaked-claim cleanup — displaced hold cleans its own installed leak, 98"; then
 # (a) B leaks token L (recheck-unreadable; the ghost vanishes at the same
 # moment), acquires fresh N normally; a rival installs L over the lock,
 # displacing B's held N. B's release must return 98 AND unlink L (the lock
@@ -2148,8 +2202,9 @@ esac
 grep -q "RELEASE-CLEANED-LEAKED-CLAIM" "$LOG" && bad "boundary variant wrongly logged a leaked-claim cleanup" \
                                               || ok "no cleanup line when the re-read backed off"
 rm -f "$LOCK" "$LOCK.next" "$WORK/t35b.succ"
+fi
 
-echo "== Test 36: arc-end resolution pass — an INCONCLUSIVE lock read keeps the entry pending; conclusive ones drop it =="
+if section "Test 36: arc-end resolution pass — an INCONCLUSIVE lock read keeps the entry pending; conclusive ones drop it"; then
 # The pass's entry-drop is gated on one lock-path read. That read resolves
 # the entry ONLY when it is conclusive: a DIFFERENT readable token, or the
 # path definitively absent. A lock PRESENT but unreadable/empty proves
@@ -2207,8 +2262,9 @@ grep -q "DISCOVERY-HOLD (leaked-token memory)" "$LOG" && ok "the surviving entry
 grep -q "resolved tok=tok.leak.t36.2" "$LOG" && ok "conclusive resolution logged for the dropped entry" \
                                              || bad "no resolution log line for the conclusive drop"
 rm -f "$LOCK" "$LOCK.next"
+fi
 
-echo "== Test 37: rename-refused — a directory appearing at the lock path mid-steal aborts the steal, no false hold =="
+if section "Test 37: rename-refused — a directory appearing at the lock path mid-steal aborts the steal, no false hold"; then
 # The only acquire/steal VERDICT branch with no test: a NON-regular object (a
 # directory) appears AT the lock path between the claimant's final re-verify
 # (step 3.3, sees a stale FILE) and its rename-over, so the rename is refused
@@ -2270,8 +2326,9 @@ grep -q "acquire verification FAILED" "$LOG" \
   && ok "directory left in place at the lock path (never overwritten)" \
   || bad "lock path is no longer the squatting directory"
 rm -rf "$LOCK" "$LOCK.next"
+fi
 
-echo "== Test 38: step-3.3 pre-rename re-verify abort — claim cleaned, discovery, no false hold =="
+if section "Test 38: step-3.3 pre-rename re-verify abort — claim cleaned, discovery, no false hold"; then
 # The step-2 re-verify (sh:1075) and the step-3.3 re-verify immediately before
 # the rename (sh:1149) are near-identical abort lanes; Test 23/27 exercise the
 # step-2 lane only, leaving 3.3 untested. Steered with a CALL-COUNTER on
@@ -2330,9 +2387,10 @@ wait "$w38"; rc=$?
               || bad "waiter rc=$rc after the slow holder released (want 0)"
 [ -e "$LOCK.next" ] && bad "claim leftover after the waiter finished" || ok "no claim leftover at exit"
 rm -f "$LOCK" "$LOCK.next"
+fi
 
 
-echo "== Test 39: foreign claim at recheck — left intact, discovery, no false 98 =="
+if section "Test 39: foreign claim at recheck — left intact, discovery, no false 98"; then
 # After winning its claim and passing step-2 re-verify, the claimant rechecks
 # its OWN claim file before installing. The `gone` recheck leg is covered (Test
 # 25 recheck-gone / Test 32); the `foreign` leg is NOT: a waiter judged our
@@ -2404,8 +2462,9 @@ gl1=""; IFS= read -r gl1 < "$LOCK" 2>/dev/null || true
 [ "$gl1" = "tok.ghost.t39" ] && ok "ghost lock untouched by the foreign-recheck backoff" \
                              || bad "ghost lock modified (line1=$gl1)"
 rm -f "$LOCK" "$LOCK.next" "$SF"
+fi
 
-echo "== Test 40: exec-bypass boundary — exec in the lock-holding shell skips release (OOS-5); exec in a child does not =="
+if section "Test 40: exec-bypass boundary — exec in the lock-holding shell skips release (OOS-5); exec in a child does not"; then
 # `lock_run` runs the wrapped command vector with `"$@"` IN THE WRAPPER SHELL
 # (git-commit-lock.sh), so a command that is itself an `exec` REPLACES the
 # lock-holding wrapper process: the trailing `lock_release` AND the EXIT trap
@@ -2496,8 +2555,9 @@ grep -q "WARNING" "$LOG" \
   && bad "an unexpected WARNING was logged by the displaced exec-0 holder" \
   || ok "displaced holder's exec-0 emitted NO WARNING at all (unwarned silent loss)"
 rm -f "$LOCK"
+fi
 
-echo "== Test 41: forward clock jump steals a live lock — detected as 98, never silent (E2) =="
+if section "Test 41: forward clock jump steals a live lock — detected as 98, never silent (E2)"; then
 # Staleness is age = now - mtime (git-commit-lock.sh ~:928, ~:1409), where `now`
 # is _lock_now. A process whose clock has LEAPED FORWARD computes an inflated age
 # for everyone's lock, so it can judge a LIVE, fresh lock ancient and steal it.
@@ -2561,8 +2621,9 @@ grep -q "WARNING: lock LOST" "$LOG" \
   && ok "robbed holder logged a loud theft WARNING (no silent double-commit)" \
   || bad "no theft WARNING logged for the forward-jump steal"
 rm -f "$LOCK" "$LOCK.next"
+fi
 
-echo "== Test 42: mtime unreadable — staleness disabled, fail-safe (no steal), warn-once, 97 (E3) =="
+if section "Test 42: mtime unreadable — staleness disabled, fail-safe (no steal), warn-once, 97 (E3)"; then
 # §E3: if the lock file's mtime cannot be read AT ALL (every probe fails on a
 # PRESENT file), staleness detection is BROKEN. The mtime floor fails closed to
 # "fresh": _lock_verify_stale returns state=fresh, so a crashed/stale holder is
@@ -2631,8 +2692,9 @@ t42_warns="$(grep -c "Staleness detection is BROKEN" "$T42_ERR" 2>/dev/null || e
   && ok "mtime-unreadable: broken-staleness warning fired at most once on stderr ($t42_warns)" \
   || bad "mtime-unreadable: warning repeated ($t42_warns times — warn-once broken)"
 rm -f "$T42_LOCK" "$T42_LOCK.next"
+fi
 
-echo "== Test 43: malformed/unreadable lock content at the poll guard — never stolen, warned/skipped =="
+if section "Test 43: malformed/unreadable lock content at the poll guard — never stolen, warned/skipped"; then
 # Two sibling branches of the in-acquire steal CONTENT GUARD (git-commit-lock.sh
 # ~:1419-1444), both gated on an already-stale candidate, neither of which the
 # torn/empty/tok.-prefixed cases (Tests 17/18) reach:
@@ -2700,8 +2762,9 @@ grep -q "STOLE" "$LOG" && bad "#17 ghost was STOLEN despite the unreadable conte
                        || ok "#17 no steal while the steal-guard read fails"
 [ -f "$LOCK" ] && ok "#17 stale ghost left in place" || bad "#17 stale ghost was removed"
 rm -f "$LOCK"
+fi
 
-echo "== Test 44: socket & device-node at the lock path — never stolen/deleted, refused (97) =="
+if section "Test 44: socket & device-node at the lock path — never stolen/deleted, refused (97)"; then
 # The never-steal wrong-type guard (git-commit-lock.sh ~:1557-1567) classifies
 # NON-regular objects at the lock path so they are NEVER stolen and NEVER
 # deleted: a real config error (a typo'd AGENT_LOCK_PATH, a stray special file)
@@ -2793,9 +2856,10 @@ if [ -c /dev/null ]; then
 else
   echo "note: /dev/null is not a char device here — device-node guard not exercised (CI POSIX legs cover it)"
 fi
+fi
 
 
-echo "== Test 45: log self-truncates past ~1 MB (rotation, not unbounded growth) =="
+if section "Test 45: log self-truncates past ~1 MB (rotation, not unbounded growth)"; then
 # _lock_log starts the log over (not rotate) once it grows past ~1MB: the size
 # check at the top of _lock_log truncates the file to empty before the write,
 # so a normal log-producing op on an oversized log leaves a small, well-formed
@@ -2827,8 +2891,9 @@ grep -q 'xxxx' "$LOG" && bad "old oversized 'x' content survived into the restar
                       || ok "old oversized content is gone (clean restart, not appended)"
 [ -e "$LOCK" ] && bad "lock left held after run" || ok "lock released after the over-threshold run"
 rm -f "$LOCK" "$LOG"
+fi
 
-echo "== Test 46: EXIT while waiting (no hold) — no-hold trap arc, no spurious release =="
+if section "Test 46: EXIT while waiting (no hold) — no-hold trap arc, no spurious release"; then
 # A10 (steering-coverage.md): _lock_on_exit's no-hold arc-end (:1009,1017-1018).
 # A sourced waiter, blocked in the wait loop against a LIVE held lock, exits 0
 # while still parked — the EXIT trap is STILL '_lock_on_exit' (the timeout's
@@ -2921,8 +2986,9 @@ touch "$HG"; wait "$h46" 2>/dev/null
 grep -q "lock LOST" "$HLOG" && bad "holder saw a stolen lease (98) — the waiter's exit disturbed the hold" \
                             || ok "holder released its still-held lock cleanly (no 98)"
 rm -f "$LOCK" "$LOCK.next" "$T46R" "$T46G" "$T46T" "$HR" "$HG"
+fi
 
-echo "== Test 47: no-mv-T rename-over fallback (BSD/macOS lane) forced via _LOCK_MVT=0 — steal still installs =="
+if section "Test 47: no-mv-T rename-over fallback (BSD/macOS lane) forced via _LOCK_MVT=0 — steal still installs"; then
 # _lock_rename_over (git-commit-lock.sh ~:961-979) probes once for GNU `mv -T`
 # and caches the verdict in _LOCK_MVT (""=unprobed, 1=supported, 0=not). On
 # Linux/MINGW the probe ALWAYS picks `mv -T`, so the no-`-T` fallback lane
@@ -3048,9 +3114,10 @@ grep -q "STOLE-BY-CLAIM" "$LOGB" \
   && bad "T47(b): claim leftover (\$LOCK.next) after the fallback rename-refused abort" \
   || ok "T47(b): claim file cleaned up — no leftover \$LOCK.next"
 rm -rf "$LOCK" "$LOCK.next" "$LOCKC" "$LOCKC.next" "$LOCKB" "$LOCKB.next"
+fi
 
 
-echo "== Test 48: unwritable lock dir -> clean 97, command never runs, no false hold (F4) =="
+if section "Test 48: unwritable lock dir -> clean 97, command never runs, no false hold (F4)"; then
 # F4 (failure-modes.md §4.5): a read-only / unwritable lock-dir parent makes the
 # O_EXCL create fail every poll, so the waiter times out at 97 — no corruption, no
 # false hold, and the wrapped command never runs. POSIX-only: chmod 0555 is a no-op
@@ -3078,8 +3145,9 @@ case "$(uname -s)" in
     chmod 0755 "$T48DIR" 2>/dev/null; rm -rf "$T48DIR"   # restore so cleanup() can rm -rf $WORK
     ;;
 esac
+fi
 
-echo "== Test 49: failing log path -> lock still works, the log write is swallowed (F2/J1) =="
+if section "Test 49: failing log path -> lock still works, the log write is swallowed (F2/J1)"; then
 # F2/J1 (failure-modes.md §4.5): logging is best-effort (every write ends || true).
 # Point AGENT_LOCK_LOG under a REGULAR FILE so every append/open fails ENOTDIR — the
 # lock must still acquire+release cleanly (rc 0) with the log write swallowed.
@@ -3100,8 +3168,9 @@ AGENT_LOCK_PATH="$WORK/t49.lock" AGENT_LOCK_LOG="$T49LOG" \
 [ ! -e "$T49LOG" ] && ok "F2/J1: the log write was swallowed (no log file under the non-dir)" \
                    || bad "F2/J1: a log file was created under a non-dir"
 rm -f "$T49P" "$WORK/t49.lock"
+fi
 
-echo "== Test 50: ENOSPC on lock create/write -> wait then 97, no false hold (F1) =="
+if section "Test 50: ENOSPC on lock create/write -> wait then 97, no false hold (F1)"; then
 # F1 (failure-modes.md §4.5): a full filesystem makes the create's write fail
 # (ENOSPC); the created-but-write-failed file is an empty orphan and the waiter
 # times out at 97 — no corruption, no false hold. Real ENOSPC needs a full FS, which
@@ -3128,6 +3197,7 @@ if [ "$(uname -s)" = Linux ] && sudo -n true 2>/dev/null; then
 else
   echo "note: Test 50 skipped — ENOSPC injection needs Linux + passwordless sudo (a small tmpfs); the Linux CI leg covers it"
 fi
+fi
 
 # NOTES (deliberately untested here):
 # * lock_release's LEFTOVER lane (the unlink blocked persistently) needs a
@@ -3141,6 +3211,15 @@ fi
 #   Test 32, the steal-path lane (F2 — rename-over won, read-back wrong) by
 #   Test 32b.
 
+# Zero-match guard: a set-but-non-matching GCL_TEST_ONLY ran NO test block. Without
+# this, the suite would fall through to a vacuous PASS=0 FAIL=0 "green" — a typo'd
+# selector regex would silently look like success. Fail loudly instead. (The finish
+# EXIT trap also fires here since DONE is still 0; this exit is non-zero regardless.)
+if [ -n "${GCL_TEST_ONLY:-}" ] && [ "$SECTIONS_RUN" = 0 ]; then
+  echo "Bail out! GCL_TEST_ONLY=\"$GCL_TEST_ONLY\" matched no test" >&2
+  exit 1
+fi
+
 DONE=1
 echo
 echo "==== RESULT: $PASS passed, $FAIL failed, $ENV_WARN envelope warning(s) (fan-out: $GCL_MODE) ===="

From b8e29513406265f0905a0e6770586313079735aa Mon Sep 17 00:00:00 2001
From: Ben Toner <bentoner@bentoner.com>
Date: Thu, 18 Jun 2026 03:42:29 +1000
Subject: [PATCH 38/76] Bucket 8 item 3: extract tests/_harness.sh (shared
 TAP/selector/helpers)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Pure deduplication, zero behavior change — the last harness-restructure step.
A new tests/_harness.sh (177 lines), sourced by all three suites, holds the
genuinely-shared code:

- Tier 1 (all three): the PASS/FAIL/TAPN/DONE/SECTIONS_RUN inits; the GCL_TAP /
  GCL_TEST_ONLY reads; ok()/bad(); section(); the finish/sentinel EXIT-trap
  helper (which calls the suite-local cleanup); the shared shellcheck disables;
  and a unified selector_report() (zero-match guard + the "ran N block(s)" line)
  so unit and interop behave identically.
- Tier 2 (unit + interop, each verified byte-identical before extracting):
  epoch_to_stamp, backdate, backdate_ghost, sync_waiting_fresh, fabricate_lock,
  wait_for_grep.
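
For illustration only, a minimal sketch of how the shared selector behaves in
a section-using suite. section() is copied from tests/_harness.sh and
selector_report() is trimmed to its zero-match guard; the two block labels,
the run_demo wrapper and the selector strings are hypothetical and do not
exist in the suites.

```shell
#!/usr/bin/env bash
# Sketch: labels, selectors and run_demo are hypothetical stand-ins.
set -uo pipefail

# Copied from tests/_harness.sh: echo the header, run the block on a match.
section() {
  echo "== $1 =="
  if [ -z "${GCL_TEST_ONLY:-}" ] || [[ "$1" =~ $GCL_TEST_ONLY ]]; then
    SECTIONS_RUN=$((SECTIONS_RUN + 1)); return 0
  fi
  return 1
}

# Trimmed to the zero-match guard: a set selector that matched nothing fails.
selector_report() {
  if [ -n "${GCL_TEST_ONLY:-}" ] && [ "$SECTIONS_RUN" = 0 ]; then
    echo "Bail out! GCL_TEST_ONLY=\"$GCL_TEST_ONLY\" matched no test" >&2
    exit 1
  fi
  return 0
}

run_demo() {  # $1 = selector regex ("" runs every block)
  GCL_TEST_ONLY="$1"; SECTIONS_RUN=0
  if section "Test 1: hypothetical first block"; then :; fi
  if section "Test 2: hypothetical second block"; then :; fi
  selector_report
  echo "ran=$SECTIONS_RUN"
}

all="$(run_demo '' | tail -n 1)"          # no selector: every block runs
one="$(run_demo 'Test 2:' | tail -n 1)"   # regex matches one label
( run_demo 'Tset 2:' ) >/dev/null 2>&1    # typo'd selector: guard exits 1
typo_rc=$?
echo "$all $one typo_rc=$typo_rc"
```

The typo'd selector runs in a subshell because the guard calls exit, which
is what makes a mistyped regex fail the run instead of passing with zero
assertions.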

Left per-suite (deliberately): cleanup (closes over each $WORK; interop differs);
clone_fn + its export -f (unit-only); ok_envelope/bad_envelope/ENV_WARN
(unit-only); the two poll helpers wait_for_file (unit, secs) and wait_for
(interop, 50ms iters) — different names/semantics, NOT unified; and each suite's
verdict line + GCL_TEST_FULL mode handling.

Sourcing is CWD-independent (resolved from BASH_SOURCE). A
`# shellcheck source=tests/_harness.sh` directive at each source site resolves
SC1091, and tests/_harness.sh is added to the CI shellcheck file list so the
shared code is linted.
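
A minimal sketch of the CWD-independent sourcing, assuming a throwaway
directory layout: the temp dir, the stand-in harness body (HARNESS_LOADED)
and suite.sh are hypothetical; only the three-line source idiom is the one
the suites use.

```shell
#!/usr/bin/env bash
# Sketch: demo paths and file contents are hypothetical stand-ins.
set -uo pipefail

demo="$(mktemp -d)"
mkdir -p "$demo/tests"

# Stand-in for tests/_harness.sh (the real file holds the shared helpers).
printf '%s\n' 'HARNESS_LOADED=1' > "$demo/tests/_harness.sh"

# Stand-in suite: resolves the harness from its own location, not from CWD.
cat > "$demo/tests/suite.sh" <<'EOF'
_HARNESS_DIR="$(CDPATH='' cd -- "$(dirname -- "${BASH_SOURCE[0]}")" && pwd)"
# shellcheck source=tests/_harness.sh
. "$_HARNESS_DIR/_harness.sh"
echo "loaded=$HARNESS_LOADED"
EOF

# Run the suite from an unrelated directory; sourcing must still succeed.
result="$(cd / && bash "$demo/tests/suite.sh")"
echo "$result"
rm -rf "$demo"
```

Clearing CDPATH for the cd keeps a user-set CDPATH from redirecting the
lookup or printing the resolved path into the command substitution.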

Validated (reduced, exit 0): unit 315/0, interop 141/0, integration 12/0; sorted
PASS/FAIL identical before/after (volatile token/path/bounded-count fields
aside); selector + zero-match guard + integration note-and-ignore all intact;
shellcheck -S style clean across all files incl. _harness.sh. Net -41 lines.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 .github/workflows/tests.yml               |   1 +
 tests/_harness.sh                         | 177 ++++++++++++++++++++++
 tests/git-commit-lock.integration.test.sh |  37 ++---
 tests/git-commit-lock.interop.test.sh     | 160 ++++---------------
 tests/git-commit-lock.test.sh             | 152 ++++---------------
 5 files changed, 243 insertions(+), 284 deletions(-)
 create mode 100644 tests/_harness.sh

diff --git a/.github/workflows/tests.yml b/.github/workflows/tests.yml
index 52961e6..268c257 100644
--- a/.github/workflows/tests.yml
+++ b/.github/workflows/tests.yml
@@ -145,6 +145,7 @@ jobs:
           /tmp/shellcheck-v0.11.0/shellcheck --version
           /tmp/shellcheck-v0.11.0/shellcheck -S style \
             git-commit-lock.sh \
+            tests/_harness.sh \
             tests/git-commit-lock.test.sh \
             tests/git-commit-lock.interop.test.sh \
             tests/git-commit-lock.integration.test.sh \
diff --git a/tests/_harness.sh b/tests/_harness.sh
new file mode 100644
index 0000000..d5d8215
--- /dev/null
+++ b/tests/_harness.sh
@@ -0,0 +1,177 @@
+# shellcheck shell=bash
+# tests/_harness.sh — shared test harness for the git-commit-lock suites.
+#
+# Sourced by all three suites (git-commit-lock.test.sh, .interop.test.sh,
+# .integration.test.sh) to share the bits they all copy-pasted: the PASS/FAIL/
+# TAP counters, the GCL_TAP / GCL_TEST_ONLY reads, ok()/bad(), section(), the
+# end-of-suite DONE sentinel (finish), and the per-test selector verdict helper.
+# Pure deduplication — ZERO behaviour change vs the inline copies it replaces.
+#
+# Contract for sourcing suites:
+#   * Source this EARLY (before any use of the inits/helpers below), CWD-
+#     independently — resolve it from the sourcing script's own location:
+#       _HARNESS_DIR="$(CDPATH='' cd -- "$(dirname -- "${BASH_SOURCE[0]}")" && pwd)"
+#       # shellcheck source=tests/_harness.sh
+#       . "$_HARNESS_DIR/_harness.sh"
+#   * Each suite still defines its OWN cleanup() (it closes over the suite's
+#     $WORK and the bodies genuinely differ); finish() below calls whatever
+#     cleanup() is in scope when the EXIT trap fires.
+#   * Each suite installs the trap itself: `trap finish EXIT`.
+#   * The suite reaching its end sets DONE=1 before its verdict line.
+#
+# The whole project runs its suites under `set -uo pipefail` (NOT set -e); these
+# helpers are written for that (they assert on values, never on implicit exit
+# propagation), and the disables below cover the idioms that pervade the suites.
+#
+# shellcheck disable=SC2015  # The pervasive `<assert> && ok ... || bad ...`
+# idiom is deliberate throughout: ok/bad are echo+counter helpers that cannot
+# fail, so the classic A && B || C pitfall (C running after B fails) is moot.
+# shellcheck disable=SC2310,SC2312  # info-level, deliberate: helper functions
+# and command substitutions run inside conditions all over a test suite; the
+# suites run WITHOUT errexit (set -uo only) and assert on values, not on
+# implicit exit propagation.
+
+PASS=0; FAIL=0; TAPN=0; DONE=0; SECTIONS_RUN=0
+GCL_TAP="${GCL_TAP:-0}"           # CI sets GCL_TAP=1 for machine-readable TAP13 output
+GCL_TEST_ONLY="${GCL_TEST_ONLY:-}"  # if set, run ONLY test blocks whose label REGEX-matches (single-test selector)
+
+# ok/bad are TAP-aware (gated by GCL_TAP so plain dev runs are byte-unchanged) and
+# bump the running assertion number TAPN. The trailing `1..$TAPN` plan line (emitted
+# by each suite just before its verdict) lets a TAP consumer fail on a short count;
+# together with the DONE sentinel below this closes the silent-undercount gap.
+# `return 0` preserves the "ok/bad cannot fail" property the
+# `<assert> && ok ... || bad ...` idiom relies on.
+ok()  { PASS=$((PASS+1)); TAPN=$((TAPN+1)); echo "PASS: $*"
+        [ "$GCL_TAP" = 1 ] && echo "ok $TAPN - $*"; return 0; }
+bad() { FAIL=$((FAIL+1)); TAPN=$((TAPN+1)); echo "FAIL: $*"
+        [ "$GCL_TAP" = 1 ] && echo "not ok $TAPN - $*"; return 0; }
+
+# Per-test gate: echoes the block header (so a normal run is byte-unchanged) and
+# returns success iff GCL_TEST_ONLY is unset/empty OR its regex matches the label.
+# Each top-level `== Test N: <desc> ==` block is wrapped `if section "..."; then ... fi`.
+# Bumps SECTIONS_RUN on a match so the verdict's zero-match guard (selector_report)
+# can catch a selector regex that matched nothing.
+section() {
+  echo "== $1 =="
+  if [ -z "${GCL_TEST_ONLY:-}" ] || [[ "$1" =~ $GCL_TEST_ONLY ]]; then
+    SECTIONS_RUN=$((SECTIONS_RUN + 1)); return 0
+  fi
+  return 1
+}
+
+# Sentinel: the suite reaching its end sets DONE=1. If the EXIT trap fires with
+# DONE!=1, the suite died early (a stray exit/crash) and the assertion count is
+# unreliable — fail loudly even if the pre-trap code was 0. A bare trap `return`
+# is IGNORED (the script keeps its pre-trap code), so the guard must `exit 1`.
+# Calls the suite-local cleanup() (each suite defines its own, closing over its
+# own $WORK); whatever cleanup is in scope when the trap fires is used.
+finish() {
+  cleanup
+  if [ "${DONE:-0}" != 1 ]; then
+    echo "Bail out! suite terminated early before the plan line; ran ${TAPN:-0} assertion(s), count unreliable" >&2
+    exit 1
+  fi
+}
+
+# Selector verdict helper, called by the section-using suites just before their
+# verdict line. Two parts, both gated on GCL_TEST_ONLY being non-empty so a
+# default run stays byte-identical:
+#   1. Zero-match guard: a set-but-non-matching GCL_TEST_ONLY ran NO test block,
+#      so the (vacuously green) verdict would lie — a typo'd selector regex must
+#      FAIL, not pass with zero assertions. Bail loudly. (The finish EXIT trap
+#      also fires here since DONE is still 0; this exit is non-zero regardless.)
+#   2. Report how many blocks the selector matched.
+# Integration does NOT call this — it is one indivisible scenario that does not
+# use section(), so it note-and-ignores GCL_TEST_ONLY at its top instead.
+selector_report() {
+  if [ -n "${GCL_TEST_ONLY:-}" ] && [ "$SECTIONS_RUN" = 0 ]; then
+    echo "Bail out! GCL_TEST_ONLY=\"$GCL_TEST_ONLY\" matched no test" >&2
+    exit 1
+  fi
+  [ -n "${GCL_TEST_ONLY:-}" ] && echo "selector GCL_TEST_ONLY=\"$GCL_TEST_ONLY\" ran $SECTIONS_RUN test block(s)"
+  return 0
+}
+
+# --- Shared timing/lock helpers (unit + interop; integration uses none) -------
+# Backdate a path's mtime by $2 seconds — how a test fakes a stale lock (the
+# lock's staleness clock is the lock FILE's own mtime, stamped by the creating
+# write). Portable: BSD/macOS touch has no `-d @epoch`, so convert the target
+# epoch to a `touch -t` stamp via GNU `date -d @` with BSD `date -r` as
+# fallback.
+epoch_to_stamp() {
+  date -d "@$1" +%Y%m%d%H%M.%S 2>/dev/null || date -r "$1" +%Y%m%d%H%M.%S 2>/dev/null
+}
+backdate() { touch -t "$(epoch_to_stamp "$(( $(date +%s) - $2 ))")" "$1"; }
+
+# Token-guarded backdate for the contended-recovery rounds (unit T2b /
+# interop T16/T16b). Why: under load a fast waiter can complete its ENTIRE steal
+# (claim -> rename-over -> ACQUIRED) before the harness's `touch` executes, so a
+# blind backdate lands on the WINNER'S freshly installed lock, making it
+# instantly stale for every rival — a legitimate re-steal then fails the round's
+# "zero 98s / exactly one STOLE-BY-CLAIM" assertions although the protocol
+# behaved exactly as designed (observed 2026-06-12 on a loaded box). Verdicts:
+#   * pre-read not the ghost: a waiter stole the ghost BEFORE the touch (it
+#     aged stale naturally during a stalled sync); no touch is performed and
+#     the round premise is gone — invalid, the caller retries the round.
+#   * post-read the ghost: conclusive — nothing ever rewrites the ghost
+#     token at the path, so the touch verifiably hit the ghost; any steal
+#     after the post-read steals an ALREADY-ancient ghost, exactly the
+#     scenario the round wants. Valid.
+#   * post-read anything else: a steal raced the touch->re-read window —
+#     COMMON under load (waiters poll every 0.05s; the post-read costs
+#     subprocess spawns), so it must not blindly invalidate. The lock's
+#     MTIME arbitrates which file the touch hit: a winner's installed lock
+#     is FRESH (the rename carries the claim file's just-created mtime), so
+#     fresh => the touch hit the GHOST and a legitimate steal followed —
+#     valid; ancient => the touch landed on the WINNER'S live lock and
+#     corrupted the round — invalid, retry. Vanished => cannot arbitrate —
+#     invalid, retry.
+backdate_ghost() {  # $1=lock $2=ghost token $3=age-secs -> 0 iff the round premise is intact
+  local pre post now mt
+  pre="$(head -n 1 -- "$1" 2>/dev/null | tr -d '\r')"
+  [ "$pre" = "$2" ] || return 1
+  backdate "$1" "$3" 2>/dev/null || return 1
+  post="$(head -n 1 -- "$1" 2>/dev/null | tr -d '\r')"
+  [ "$post" = "$2" ] && return 0
+  [ -e "$1" ] || return 1
+  now="$(date +%s)"
+  mt="$(stat -c %Y -- "$1" 2>/dev/null || stat -f %m -- "$1" 2>/dev/null)" || return 1
+  [ $(( now - mt )) -lt $(( $3 / 2 )) ]
+}
+
+# Wait for every waiter's WAITING line while keeping the ghost lock FRESH
+# (touch -c to now, no-create so a released path is never resurrected): a
+# fresh ghost cannot be judged stale, so no waiter can steal it before the
+# guarded backdate — without this, a sync stalled past STALE (slow worker
+# cold starts on a loaded box) lets the ghost age stale naturally and a
+# waiter steals it mid-sync. Freshening is race-safe: if a steal slipped in
+# anyway, touching the winner's (already fresh) live lock to "now" is a
+# harmless no-op, and backdate_ghost's pre-read catches the broken premise.
+sync_waiting_fresh() {  # $1=lock $2=timeout-secs $3..=waiter logs -> 0 iff all logged WAITING
+  local lock="$1" deadline f ok=1
+  deadline=$(( $(date +%s) + $2 )); shift 2
+  for f in "$@"; do
+    until grep -q "WAITING for lock" "$f" 2>/dev/null; do
+      touch -c "$lock" 2>/dev/null
+      if [ "$(date +%s)" -ge "$deadline" ]; then ok=0; break; fi
+      sleep 0.2
+    done
+  done
+  [ "$ok" = 1 ]
+}
+
+# Fabricate a lock file the way a real (foreign) holder would have written it:
+# token line + owner line. The token MUST be "tok."-prefixed (wire format) or
+# the steal's content guard will — correctly — refuse to steal it.
+fabricate_lock() {  # $1=path $2=token $3=owner
+  printf '%s\n%s\n' "$2" "$3" > "$1"
+}
+
+# Wait (up to $3 seconds, default 15) for a pattern to appear in a file. Used to
+# gate on the WAITING log line: proof the waiter actually contended, without a
+# fixed-length hold.
+wait_for_grep() {
+  local pat="$1" f="$2" tries=$(( ${3:-15} * 20 ))
+  while ! grep -q "$pat" "$f" 2>/dev/null && [ "$tries" -gt 0 ]; do sleep 0.05; tries=$((tries-1)); done
+  grep -q "$pat" "$f" 2>/dev/null
+}
diff --git a/tests/git-commit-lock.integration.test.sh b/tests/git-commit-lock.integration.test.sh
index e7837f4..49badf8 100644
--- a/tests/git-commit-lock.integration.test.sh
+++ b/tests/git-commit-lock.integration.test.sh
@@ -36,6 +36,13 @@
 # they expand inside a worker's `bash -c` invocation, not here.
 set -uo pipefail
 
+# Shared harness: PASS/FAIL/TAP counters, GCL_TAP/GCL_TEST_ONLY reads, ok/bad,
+# section, the finish EXIT-trap sentinel (calls our cleanup below). Resolved from
+# THIS script's own dir so it sources regardless of CWD.
+_HARNESS_DIR="$(CDPATH='' cd -- "$(dirname -- "${BASH_SOURCE[0]}")" && pwd)"
+# shellcheck source=tests/_harness.sh
+. "$_HARNESS_DIR/_harness.sh"
+
 DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
 ROOT="$(cd "$DIR/.." && pwd)"   # the implementations live at the repo root
 LIB="$ROOT/git-commit-lock.sh"
@@ -59,31 +66,10 @@ cleanup() {
     rm -rf "$WORK" 2>/dev/null || true
   fi
 }
-# Sentinel: the suite reaching its end sets DONE=1. If the EXIT trap fires with
-# DONE!=1, the suite died early (a stray exit/crash) and the assertion count is
-# unreliable — fail loudly even if the pre-trap code was 0. A bare trap `return`
-# is IGNORED (the script keeps its pre-trap code), so the guard must `exit 1`.
-finish() {
-  cleanup
-  if [ "${DONE:-0}" != 1 ]; then
-    echo "Bail out! suite terminated early before the plan line; ran ${TAPN:-0} assertion(s), count unreliable" >&2
-    exit 1
-  fi
-}
+# The finish EXIT-trap sentinel (defined in _harness.sh) calls the cleanup()
+# above and fails loudly if the suite died before setting DONE=1.
 trap finish EXIT
 
-PASS=0; FAIL=0; TAPN=0; DONE=0
-GCL_TAP="${GCL_TAP:-0}"           # CI sets GCL_TAP=1 for machine-readable TAP13 output
-# ok/bad are TAP-aware (gated by GCL_TAP so plain dev runs are byte-unchanged) and
-# bump the running assertion number TAPN. The trailing `1..$TAPN` plan line (emitted
-# just before the verdict) lets a TAP consumer fail on a short count; together with the
-# DONE sentinel above this closes the silent-undercount gap. `return 0` preserves the
-# "ok/bad cannot fail" property the `<assert> && ok ... || bad ...` idiom relies on.
-ok()  { PASS=$((PASS+1)); TAPN=$((TAPN+1)); echo "PASS: $*"
-        [ "$GCL_TAP" = 1 ] && echo "ok $TAPN - $*"; return 0; }
-bad() { FAIL=$((FAIL+1)); TAPN=$((TAPN+1)); echo "FAIL: $*"
-        [ "$GCL_TAP" = 1 ] && echo "not ok $TAPN - $*"; return 0; }
-
 # --- sizing ------------------------------------------------------------------
 # Commits serialise (that's the whole point), so wall time ≈ workers x commit
 # cost, and on this Windows/Cygwin box a spawn+add+commit is ~0.5-1s, a pwsh
@@ -117,9 +103,8 @@ LK_ENV=(AGENT_LOCK_STALE_SECS=300 AGENT_LOCK_POLL_SECS=0.2 AGENT_LOCK_MAX_WAIT=2
 # Note-and-ignore the per-test selector the unit/interop suites honour: this
 # suite is ONE indivisible scenario (Tests 1-3 share a single repo + the ALL_IDS
 # accumulator, and Test 3 audits Tests 1+2's output), so a per-block selector
-# can't apply. If GCL_TEST_ONLY is set, say so loudly on stderr and run the
-# whole scenario as normal.
-GCL_TEST_ONLY="${GCL_TEST_ONLY:-}"
+# can't apply. If GCL_TEST_ONLY is set (read by _harness.sh), say so loudly on
+# stderr and run the whole scenario as normal.
 if [ -n "$GCL_TEST_ONLY" ]; then
     echo "NOTE: integration suite ignores GCL_TEST_ONLY=\"$GCL_TEST_ONLY\" — Tests 1-3 are one indivisible scenario (shared repo + ALL_IDS audit); running the whole suite." >&2
 fi
diff --git a/tests/git-commit-lock.interop.test.sh b/tests/git-commit-lock.interop.test.sh
index 8bda7c7..4bad30f 100644
--- a/tests/git-commit-lock.interop.test.sh
+++ b/tests/git-commit-lock.interop.test.sh
@@ -40,6 +40,16 @@
 # they expand inside a worker's `bash -c` or pwsh invocation, not here.
 set -uo pipefail
 
+# Shared harness: PASS/FAIL/TAP counters, GCL_TAP/GCL_TEST_ONLY reads, ok/bad,
+# section, the finish EXIT-trap sentinel (calls our cleanup below), and the
+# shared timing/lock helpers (epoch_to_stamp, backdate, backdate_ghost,
+# sync_waiting_fresh, fabricate_lock, wait_for_grep). Resolved from THIS
+# script's own dir so it sources regardless of CWD; sourced EARLY (before any
+# use of the inits/helpers below).
+_HARNESS_DIR="$(CDPATH='' cd -- "$(dirname -- "${BASH_SOURCE[0]}")" && pwd)"
+# shellcheck source=tests/_harness.sh
+. "$_HARNESS_DIR/_harness.sh"
+
 DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
 ROOT="$(cd "$DIR/.." && pwd)"   # the implementations live at the repo root
 SH="$ROOT/git-commit-lock.sh"
@@ -67,35 +77,11 @@ WORK="$(pwsh -NoProfile -Command '[IO.Path]::Combine([IO.Path]::GetTempPath(), "
 WORK="${WORK//\\//}"
 mkdir -p "$WORK"
 
-PASS=0; FAIL=0; TAPN=0; DONE=0; SECTIONS_RUN=0
-GCL_TAP="${GCL_TAP:-0}"           # CI sets GCL_TAP=1 for machine-readable TAP13 output
-# Single-test selector: GCL_TEST_ONLY=<regex> runs only the test blocks whose
-# `== Test N: <desc> ==` label matches the regex (BASH regex, =~). Unset/empty
-# runs every block (default). A typo'd regex that matches nothing bails out
-# loudly at the verdict (the zero-match guard) rather than passing vacuously.
-GCL_TEST_ONLY="${GCL_TEST_ONLY:-}"
-# ok/bad are TAP-aware (gated by GCL_TAP so plain dev runs are byte-unchanged) and
-# bump the running assertion number TAPN. The trailing `1..$TAPN` plan line (emitted
-# just before the verdict) lets a TAP consumer fail on a short count; together with the
-# DONE sentinel below this closes the silent-undercount gap. `return 0` preserves the
-# "ok/bad cannot fail" property the `<assert> && ok ... || bad ...` idiom relies on.
-ok()  { PASS=$((PASS+1)); TAPN=$((TAPN+1)); echo "PASS: $*"
-        [ "$GCL_TAP" = 1 ] && echo "ok $TAPN - $*"; return 0; }
-bad() { FAIL=$((FAIL+1)); TAPN=$((TAPN+1)); echo "FAIL: $*"
-        [ "$GCL_TAP" = 1 ] && echo "not ok $TAPN - $*"; return 0; }
-
-# Per-test gate: echoes the block header (so a normal run is byte-unchanged) and
-# returns success iff GCL_TEST_ONLY is unset/empty OR its regex matches the label.
-# Each top-level `== Test N: <desc> ==` block is wrapped `if section "..."; then ... fi`.
-# Bumps SECTIONS_RUN on a match so the verdict's zero-match guard can catch a
-# selector regex that matched nothing.
-section() {
-  echo "== $1 =="
-  if [ -z "${GCL_TEST_ONLY:-}" ] || [[ "$1" =~ $GCL_TEST_ONLY ]]; then
-    SECTIONS_RUN=$((SECTIONS_RUN + 1)); return 0
-  fi
-  return 1
-}
+# The PASS/FAIL/TAP/SECTIONS_RUN inits, the GCL_TAP/GCL_TEST_ONLY reads, ok/bad,
+# and section() all come from _harness.sh (sourced above). GCL_TEST_ONLY is the
+# single-test selector: a <regex> that runs only the `== Test N: <desc> ==`
+# blocks whose label matches (BASH =~); unset/empty runs every block; a typo'd
+# regex that matches nothing bails out loudly at the verdict (selector_report).
 
 # Failure post-mortems need the logs: keep $WORK when anything failed, and
 # honour GCL_TEST_PRESERVE_DIR (the CI preserve-logs knob) by copying
@@ -112,17 +98,8 @@ cleanup() {
   fi
   rm -rf "$WORK" 2>/dev/null || true
 }
-# Sentinel: the suite reaching its end sets DONE=1. If the EXIT trap fires with
-# DONE!=1, the suite died early (a stray exit/crash) and the assertion count is
-# unreliable — fail loudly even if the pre-trap code was 0. A bare trap `return`
-# is IGNORED (the script keeps its pre-trap code), so the guard must `exit 1`.
-finish() {
-  cleanup
-  if [ "${DONE:-0}" != 1 ]; then
-    echo "Bail out! suite terminated early before the plan line; ran ${TAPN:-0} assertion(s), count unreliable" >&2
-    exit 1
-  fi
-}
+# The finish EXIT-trap sentinel (defined in _harness.sh) calls the cleanup()
+# above and fails loudly if the suite died before setting DONE=1.
 trap finish EXIT
 
 # Poll for a marker file: ready-markers replace fixed head-start sleeps so a
@@ -132,88 +109,11 @@ wait_for() {  # $1=file $2=max iterations of 50ms (default 200 = 10s)
   return 1
 }
 
-# Wait (up to $3 seconds, default 15) for a pattern to appear in a file —
-# used to gate on the WAITING log line (proof a waiter actually contended)
-# without a fixed-length hold. Same helper as the unit suite.
-wait_for_grep() {
-  local pat="$1" f="$2" tries=$(( ${3:-15} * 20 ))
-  while ! grep -q "$pat" "$f" 2>/dev/null && [ "$tries" -gt 0 ]; do sleep 0.05; tries=$((tries-1)); done
-  grep -q "$pat" "$f" 2>/dev/null
-}
-
-# Backdate a path's mtime by $2 seconds — how a test fakes a stale lock (the
-# staleness clock is the lock FILE's own mtime, stamped by the creating
-# write). Portable: BSD/macOS touch has no `-d @epoch`, so convert the target
-# epoch to a `touch -t` stamp via GNU `date -d @` with BSD `date -r` as
-# fallback (same helper as the unit suite).
-epoch_to_stamp() {
-  date -d "@$1" +%Y%m%d%H%M.%S 2>/dev/null || date -r "$1" +%Y%m%d%H%M.%S 2>/dev/null
-}
-backdate() { touch -t "$(epoch_to_stamp "$(( $(date +%s) - $2 ))")" "$1"; }
-
-# Token-guarded backdate for the contended-recovery tests (T16/T16b; same
-# guard as the unit suite's T2b — full rationale there). Why: under load a
-# fast waiter can complete its ENTIRE steal (claim -> rename-over ->
-# ACQUIRED) before the harness's `touch` executes, so a blind backdate lands
-# on the WINNER'S freshly installed lock, making it instantly stale for
-# every rival — a legitimate re-steal then fails the test's "zero 98s /
-# exactly one STOLE-BY-CLAIM" assertions although the protocol behaved
-# exactly as designed (observed 2026-06-12 on a loaded box: a fast pwsh
-# waiter judged the FRESH ghost at age==STALE, stole and ACQUIRED before the
-# touch, which then aged its live lock to 10000s and a rival re-stole it).
-# Verdicts:
-#   * pre-read not the ghost: stolen BEFORE the touch (no touch performed) —
-#     invalid, the caller retries the run.
-#   * post-read the ghost: conclusive — the touch hit the ghost. Valid.
-#   * post-read anything else: a steal raced the touch->re-read window —
-#     COMMON under load (waiters poll every 0.05s; the post-read costs
-#     subprocess spawns), so it must not blindly invalidate. The lock's
-#     MTIME arbitrates which file the touch hit: a winner's installed lock
-#     is FRESH (the rename carries the claim file's just-created mtime), so
-#     fresh => the touch hit the GHOST and a legitimate steal followed —
-#     valid; ancient => the touch landed on the WINNER'S live lock and
-#     corrupted the run — invalid, retry. Vanished => cannot arbitrate —
-#     invalid, retry.
-backdate_ghost() {  # $1=lock $2=ghost token $3=age-secs -> 0 iff the run premise is intact
-  local pre post now mt
-  pre="$(head -n 1 -- "$1" 2>/dev/null | tr -d '\r')"
-  [ "$pre" = "$2" ] || return 1
-  backdate "$1" "$3" 2>/dev/null || return 1
-  post="$(head -n 1 -- "$1" 2>/dev/null | tr -d '\r')"
-  [ "$post" = "$2" ] && return 0
-  [ -e "$1" ] || return 1
-  now="$(date +%s)"
-  mt="$(stat -c %Y -- "$1" 2>/dev/null || stat -f %m -- "$1" 2>/dev/null)" || return 1
-  [ $(( now - mt )) -lt $(( $3 / 2 )) ]
-}
-
-# Wait for every waiter's WAITING line while keeping the ghost lock FRESH
-# (touch -c to now, no-create so a released path is never resurrected): a
-# fresh ghost cannot be judged stale, so no waiter can steal it before the
-# guarded backdate — without this, a sync stalled past STALE (slow pwsh cold
-# starts on a loaded box) lets the ghost age stale naturally and a waiter
-# steals it mid-sync. Freshening is race-safe: if a steal slipped in anyway,
-# touching the winner's (already fresh) live lock to "now" is a harmless
-# no-op, and backdate_ghost's pre-read catches the broken premise.
-sync_waiting_fresh() {  # $1=lock $2=timeout-secs $3..=waiter logs -> 0 iff all logged WAITING
-  local lock="$1" deadline f ok=1
-  deadline=$(( $(date +%s) + $2 )); shift 2
-  for f in "$@"; do
-    until grep -q "WAITING for lock" "$f" 2>/dev/null; do
-      touch -c "$lock" 2>/dev/null
-      if [ "$(date +%s)" -ge "$deadline" ]; then ok=0; break; fi
-      sleep 0.2
-    done
-  done
-  [ "$ok" = 1 ]
-}
-
-# Fabricate a lock file the way a real (foreign) holder would have written it:
-# token line + owner line. The token MUST be "tok."-prefixed (wire format) or
-# the steal's content guard will — correctly — refuse to steal it.
-fabricate_lock() {  # $1=path $2=token $3=owner
-  printf '%s\n%s\n' "$2" "$3" > "$1"
-}
+# wait_for_grep, epoch_to_stamp, backdate, backdate_ghost, sync_waiting_fresh,
+# and fabricate_lock now live in _harness.sh (sourced above) — shared
+# byte-for-byte with the unit suite. (wait_for above is interop-only: its arg-2
+# is a count of 50ms iterations, distinct from the unit suite's wait_for_file
+# whole-seconds semantics, so the two poll helpers stay separate.)
 
 # A pwsh process that holds the lock FILE open with FileShare.Read — the
 # no-delete-share handle class that blocks unlink AND rename alike (probe
@@ -1442,16 +1342,12 @@ else
 fi
 
 echo
-# Zero-match guard: a set-but-non-matching GCL_TEST_ONLY ran no test block, so
-# the (vacuously green) verdict below would lie. Bail loudly instead — a typo'd
-# selector regex must FAIL, not pass with zero assertions.
-if [ -n "${GCL_TEST_ONLY:-}" ] && [ "$SECTIONS_RUN" = 0 ]; then
-  echo "Bail out! GCL_TEST_ONLY=\"$GCL_TEST_ONLY\" matched no test" >&2
-  exit 1
-fi
-# When a selector is active, report how many blocks it matched (the default run
-# stays byte-unchanged because this is gated on GCL_TEST_ONLY being non-empty).
-[ -n "${GCL_TEST_ONLY:-}" ] && echo "selector GCL_TEST_ONLY=\"$GCL_TEST_ONLY\" ran $SECTIONS_RUN test block(s)"
+# Zero-match guard + selector-report line (shared helper in _harness.sh): a
+# set-but-non-matching GCL_TEST_ONLY ran no test block, so the (vacuously green)
+# verdict below would lie — bail loudly; a typo'd selector regex must FAIL, not
+# pass with zero assertions. When the selector matched, report how many blocks
+# ran. Both gated on GCL_TEST_ONLY non-empty so the default run stays unchanged.
+selector_report
 DONE=1
 echo "==== INTEROP RESULT: $PASS passed, $FAIL failed (fan-out: $GCL_MODE) ===="
 [ "$GCL_TAP" = 1 ] && echo "1..$TAPN"
diff --git a/tests/git-commit-lock.test.sh b/tests/git-commit-lock.test.sh
index 7fc5f2b..8b1aa08 100755
--- a/tests/git-commit-lock.test.sh
+++ b/tests/git-commit-lock.test.sh
@@ -25,6 +25,16 @@
 # inside the worker's `bash -c`, not here.
 set -uo pipefail
 
+# Shared harness: PASS/FAIL/TAP counters, GCL_TAP/GCL_TEST_ONLY reads, ok/bad,
+# section, the finish EXIT-trap sentinel (calls our cleanup below), and the
+# shared timing/lock helpers (epoch_to_stamp, backdate, backdate_ghost,
+# sync_waiting_fresh, fabricate_lock, wait_for_grep). Resolved from THIS
+# script's own dir so it sources regardless of CWD; sourced EARLY (before any
+# use of the inits/helpers below).
+_HARNESS_DIR="$(CDPATH='' cd -- "$(dirname -- "${BASH_SOURCE[0]}")" && pwd)"
+# shellcheck source=tests/_harness.sh
+. "$_HARNESS_DIR/_harness.sh"
+
 DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
 ROOT="$(cd "$DIR/.." && pwd)"   # the implementations live at the repo root
 LIB="$ROOT/git-commit-lock.sh"
@@ -51,44 +61,10 @@ cleanup() {
     rm -rf "$WORK" 2>/dev/null || true
   fi
 }
-# Sentinel: the suite reaching its end sets DONE=1. If the EXIT trap fires with
-# DONE!=1, the suite died early (a stray exit/crash) and the assertion count is
-# unreliable — fail loudly even if the pre-trap code was 0. A bare trap `return`
-# is IGNORED (the script keeps its pre-trap code), so the guard must `exit 1`.
-finish() {
-  cleanup
-  if [ "${DONE:-0}" != 1 ]; then
-    echo "Bail out! suite terminated early before the plan line; ran ${TAPN:-0} assertion(s), count unreliable" >&2
-    exit 1
-  fi
-}
+# The finish EXIT-trap sentinel (defined in _harness.sh) calls the cleanup()
+# above and fails loudly if the suite died before setting DONE=1.
 trap finish EXIT
 
-PASS=0; FAIL=0; TAPN=0; DONE=0; SECTIONS_RUN=0
-GCL_TAP="${GCL_TAP:-0}"           # CI sets GCL_TAP=1 for machine-readable TAP13 output
-GCL_TEST_ONLY="${GCL_TEST_ONLY:-}"  # if set, run ONLY test blocks whose label REGEX-matches (single-test selector)
-# section() replaces each per-test header `echo "== Test N: … =="`: it echoes the
-# header verbatim (visible output unchanged) and returns success — gating the
-# `if section …; then … fi` block — iff GCL_TEST_ONLY is unset/empty OR its regex
-# matches the label. A run-counter (SECTIONS_RUN) backs the zero-match guard below,
-# so a typo'd selector regex can't masquerade as a vacuous PASS=0/FAIL=0 green.
-section() {
-  echo "== $1 =="
-  if [ -z "${GCL_TEST_ONLY:-}" ] || [[ "$1" =~ $GCL_TEST_ONLY ]]; then
-    SECTIONS_RUN=$((SECTIONS_RUN + 1)); return 0
-  fi
-  return 1
-}
-# ok/bad are TAP-aware (gated by GCL_TAP so plain dev runs are byte-unchanged) and
-# bump the running assertion number TAPN. The trailing `1..$TAPN` plan line (emitted
-# just before the verdict) lets a TAP consumer fail on a short count; together with the
-# DONE sentinel above this closes the silent-undercount gap. `return 0` preserves the
-# "ok/bad cannot fail" property the `<assert> && ok ... || bad ...` idiom relies on.
-ok()  { PASS=$((PASS+1)); TAPN=$((TAPN+1)); echo "PASS: $*"
-        [ "$GCL_TAP" = 1 ] && echo "ok $TAPN - $*"; return 0; }
-bad() { FAIL=$((FAIL+1)); TAPN=$((TAPN+1)); echo "FAIL: $*"
-        [ "$GCL_TAP" = 1 ] && echo "not ok $TAPN - $*"; return 0; }
-
 # Envelope-tier assertions (Bucket 4 / decision D-c). A wall-clock or poll-count
 # bound is a Tier-2 (best-effort latency) property, NOT a correctness one (see
 # guarantees.md BE-1). In the default 'strict' tier these behave exactly like
@@ -109,72 +85,8 @@ bad_envelope() {
     [ "$GCL_TAP" = 1 ] && echo "not ok $TAPN - $*"
   fi; return 0; }
 
-# Backdate a path's mtime by $2 seconds — the lock's staleness clock is the
-# lock FILE's own mtime (stamped by the creating write), so this is how a
-# test fakes a stale lock. Portable: BSD touch has no `-d @epoch`, so convert
-# the target epoch to a `touch -t` stamp via GNU `date -d @` with BSD
-# `date -r` as fallback.
-epoch_to_stamp() {
-  date -d "@$1" +%Y%m%d%H%M.%S 2>/dev/null || date -r "$1" +%Y%m%d%H%M.%S 2>/dev/null
-}
-backdate() { touch -t "$(epoch_to_stamp "$(( $(date +%s) - $2 ))")" "$1"; }
-
-# Token-guarded backdate for the contended-recovery rounds (T2b). Why: under
-# load a fast waiter can complete its ENTIRE steal (claim -> rename-over ->
-# ACQUIRED) before the harness's `touch` executes, so a blind backdate lands
-# on the WINNER'S freshly installed lock, making it instantly stale for every
-# rival — a legitimate re-steal then fails the round's "zero 98s / exactly
-# one STOLE-BY-CLAIM" assertions although the protocol behaved exactly as
-# designed (observed 2026-06-12 on a loaded box). Verdicts:
-#   * pre-read not the ghost: a waiter stole the ghost BEFORE the touch (it
-#     aged stale naturally during a stalled sync); no touch is performed and
-#     the round premise is gone — invalid, the caller retries the round.
-#   * post-read the ghost: conclusive — nothing ever rewrites the ghost
-#     token at the path, so the touch verifiably hit the ghost; any steal
-#     after the post-read steals an ALREADY-ancient ghost, exactly the
-#     scenario the round wants. Valid.
-#   * post-read anything else: a steal raced the touch->re-read window —
-#     COMMON under load (waiters poll every 0.05s; the post-read costs
-#     subprocess spawns), so it must not blindly invalidate. The lock's
-#     MTIME arbitrates which file the touch hit: a winner's installed lock
-#     is FRESH (the rename carries the claim file's just-created mtime), so
-#     fresh => the touch hit the GHOST and a legitimate steal followed —
-#     valid; ancient => the touch landed on the WINNER'S live lock and
-#     corrupted the round — invalid, retry. Vanished => cannot arbitrate —
-#     invalid, retry.
-backdate_ghost() {  # $1=lock $2=ghost token $3=age-secs -> 0 iff the round premise is intact
-  local pre post now mt
-  pre="$(head -n 1 -- "$1" 2>/dev/null | tr -d '\r')"
-  [ "$pre" = "$2" ] || return 1
-  backdate "$1" "$3" 2>/dev/null || return 1
-  post="$(head -n 1 -- "$1" 2>/dev/null | tr -d '\r')"
-  [ "$post" = "$2" ] && return 0
-  [ -e "$1" ] || return 1
-  now="$(date +%s)"
-  mt="$(stat -c %Y -- "$1" 2>/dev/null || stat -f %m -- "$1" 2>/dev/null)" || return 1
-  [ $(( now - mt )) -lt $(( $3 / 2 )) ]
-}
-
-# Wait for every waiter's WAITING line while keeping the ghost lock FRESH
-# (touch -c to now, no-create so a released path is never resurrected): a
-# fresh ghost cannot be judged stale, so no waiter can steal it before the
-# guarded backdate — without this, a sync stalled past STALE (slow worker
-# cold starts on a loaded box) lets the ghost age stale naturally and a
-# waiter steals it mid-sync. Freshening is race-safe: if a steal slipped in
-# anyway, touching the winner's (already fresh) live lock to "now" is a
-# harmless no-op, and backdate_ghost's pre-read catches the broken premise.
-sync_waiting_fresh() {  # $1=lock $2=timeout-secs $3..=waiter logs -> 0 iff all logged WAITING
-  local lock="$1" deadline f ok=1
-  deadline=$(( $(date +%s) + $2 )); shift 2
-  for f in "$@"; do
-    until grep -q "WAITING for lock" "$f" 2>/dev/null; do
-      touch -c "$lock" 2>/dev/null
-      if [ "$(date +%s)" -ge "$deadline" ]; then ok=0; break; fi
-      sleep 0.2
-    done
-  done
-  [ "$ok" = 1 ]
-}
+# epoch_to_stamp, backdate, backdate_ghost, and sync_waiting_fresh now live in
+# _harness.sh (sourced above) — shared byte-for-byte with the interop suite.
 
 # Clone a shell function under a new name — the steering tests' interposition
 # mechanism: a sourced test shell wraps a library internal (or a command like
@@ -187,31 +99,19 @@ clone_fn() {  # $1=existing function $2=new name
 }
 export -f clone_fn epoch_to_stamp backdate
 
-# Fabricate a lock file the way a real (foreign) holder would have written it:
-# token line + owner line. The token MUST be "tok."-prefixed (wire format) or
-# the steal's content guard will — correctly — refuse to steal it.
-fabricate_lock() {  # $1=path $2=token $3=owner
-  printf '%s\n%s\n' "$2" "$3" > "$1"
-}
+# fabricate_lock and wait_for_grep now live in _harness.sh (sourced above) —
+# shared byte-for-byte with the interop suite.
 
 # Wait (up to $2 seconds, default 15) for a marker file to appear. Holders
 # touch a ready-marker as their first act INSIDE the lock; tests gate on that
-# instead of sleep-margin head starts, which flaked under load.
+# instead of sleep-margin head starts, which flaked under load. Unit-only: the
+# interop suite has its own poll helper (wait_for, 50ms-iteration semantics).
 wait_for_file() {
   local f="$1" tries=$(( ${2:-15} * 20 ))
   while [ ! -e "$f" ] && [ "$tries" -gt 0 ]; do sleep 0.05; tries=$((tries-1)); done
   [ -e "$f" ]
 }
 
-# Wait (up to $3 seconds, default 15) for a pattern to appear in a file.
-# Used to gate on the WAITING log line: proof the waiter actually contended,
-# without a fixed-length hold.
-wait_for_grep() {
-  local pat="$1" f="$2" tries=$(( ${3:-15} * 20 ))
-  while ! grep -q "$pat" "$f" 2>/dev/null && [ "$tries" -gt 0 ]; do sleep 0.05; tries=$((tries-1)); done
-  grep -q "$pat" "$f" 2>/dev/null
-}
-
 # Critical section that loses updates without a mutex: read, gap, write+1.
 INCR='n="$(cat "$1")"; sleep 0.03; echo $((n+1)) > "$1"'
 
@@ -3211,14 +3111,14 @@ fi
 #   Test 32, the steal-path lane (F2 — rename-over won, read-back wrong) by
 #   Test 32b.
 
-# Zero-match guard: a set-but-non-matching GCL_TEST_ONLY ran NO test block. Without
-# this, the suite would fall through to a vacuous PASS=0 FAIL=0 "green" — a typo'd
-# selector regex would silently look like success. Fail loudly instead. (The finish
-# EXIT trap also fires here since DONE is still 0; this exit is non-zero regardless.)
-if [ -n "${GCL_TEST_ONLY:-}" ] && [ "$SECTIONS_RUN" = 0 ]; then
-  echo "Bail out! GCL_TEST_ONLY=\"$GCL_TEST_ONLY\" matched no test" >&2
-  exit 1
-fi
+# Zero-match guard + selector-report line (shared helper in _harness.sh): a
+# set-but-non-matching GCL_TEST_ONLY ran NO test block, which without the guard
+# would fall through to a vacuous PASS=0 FAIL=0 "green" — a typo'd selector regex
+# would silently look like success; bail loudly instead. (The finish EXIT trap
+# also fires there since DONE is still 0; that exit is non-zero regardless.) When
+# the selector matched, it reports how many blocks ran. Both are gated on
+# GCL_TEST_ONLY being non-empty, so a default run stays byte-identical.
+selector_report
 
 DONE=1
 echo

From d2ac607e050a34e5fd9d639a49c90281aa65da28 Mon Sep 17 00:00:00 2001
From: Ben Toner <bentoner@bentoner.com>
Date: Thu, 18 Jun 2026 03:43:05 +1000
Subject: [PATCH 39/76] Plan changelog: Bucket 8 items 2+3 done (selector +
 _harness.sh extraction)

Record completion of 8.2 (GCL_TEST_ONLY selector, 4ee5899) and 8.3
(tests/_harness.sh extraction, b8e2951). Next is Bucket 6. Cross-platform CI
verification of both commits is pending.
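
For illustration only, a minimal sketch of the selector gating recorded above.
The section() body mirrors the helper shown in the unit suite before the
extraction; the three test labels are hypothetical and exist only to show why a
trailing colon in the regex keeps "Test 2:" from also selecting "Test 20:" or
"Test 12:".

```bash
#!/usr/bin/env bash
# Sketch of the GCL_TEST_ONLY gating. The labels below are made up for the demo.
SECTIONS_RUN=0
section() {
  echo "== $1 =="
  # Run the block iff no selector is set, or the selector regex matches the label.
  if [ -z "${GCL_TEST_ONLY:-}" ] || [[ "$1" =~ $GCL_TEST_ONLY ]]; then
    SECTIONS_RUN=$((SECTIONS_RUN + 1)); return 0
  fi
  return 1
}

GCL_TEST_ONLY='Test 2:'   # trailing colon anchors the test number
ran=""
for label in "Test 2: basic acquire" "Test 20: stale steal" "Test 12: release"; do
  if section "$label" >/dev/null; then ran="$ran[$label]"; fi
done
echo "ran=$ran sections=$SECTIONS_RUN"
# prints: ran=[Test 2: basic acquire] sections=1
```

With GCL_TEST_ONLY unset or empty, every section returns 0, so a default run
executes all blocks and SECTIONS_RUN equals the number of sections.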

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 .../2026-06-17-ci-stress-phase2-build-plan.md | 23 +++++++++++++++++++
 1 file changed, 23 insertions(+)

diff --git a/.plans/2026-06-17-ci-stress-phase2-build-plan.md b/.plans/2026-06-17-ci-stress-phase2-build-plan.md
index a547f8b..75700f1 100644
--- a/.plans/2026-06-17-ci-stress-phase2-build-plan.md
+++ b/.plans/2026-06-17-ci-stress-phase2-build-plan.md
@@ -436,3 +436,26 @@ Workflow once the final test count is known (plan D-e) — likely a Workflow for
   (`_harness.sh` extraction — also a large harness change) into one validated
   harness-restructure step near the end. **Revised phasing: 8.1 → 3 → 4 → 2A → 2B →
   (8.2 + 8.3 together) → 6.**
+- **Step (commit `4ee5899`) — Bucket 8 item 2 done** (`GCL_TEST_ONLY` selector). Each
+  top-level `== Test N: … ==` header in unit + interop became `if section "Test N: …";
+  then … fi` (each `fi` before the next `if section`, so trailing cleanup stays inside);
+  `section` runs a block iff `GCL_TEST_ONLY` is unset/empty or its regex matches, bumping
+  `SECTIONS_RUN`. Zero-match guard bails loudly (exit 1) on a set-but-non-matching regex
+  (no vacuous green). Integration note-and-ignores (one indivisible scenario). Built by 3
+  parallel sub-agents (one per suite), each self-validating byte-identical + selector
+  precision + the guard; orchestrator re-verified independently. Validated reduced: unit
+  315/0, interop 141/0, integration 12/0; selector precision proven (regex, trailing-colon
+  anchoring); `shellcheck -S style` clean.
+- **Step (commit `b8e2951`) — Bucket 8 item 3 done** (`tests/_harness.sh` extraction, 177
+  lines, net −42). Tier 1 (all three): inits + `GCL_TAP`/`GCL_TEST_ONLY` reads + `ok`/`bad`
+  + `section` + the `finish` sentinel + shared shellcheck disables + a unified
+  `selector_report` (so unit/interop match). Tier 2 (unit+interop, byte-identical-verified
+  first): `epoch_to_stamp`, `backdate`, `backdate_ghost`, `sync_waiting_fresh`,
+  `fabricate_lock`, `wait_for_grep`. Left per-suite: `cleanup` (closes over `$WORK`),
+  `clone_fn`+`export -f` (unit-only), `ok_envelope`/`bad_envelope` (unit-only), both poll
+  helpers (`wait_for_file` secs vs `wait_for` 50ms-iters — Tier 3, not unified), verdict
+  lines. CWD-independent sourcing (`BASH_SOURCE`) + `# shellcheck source=` directive;
+  `tests/_harness.sh` added to the CI lint list. Byte-identical (315/141/12), `shellcheck`
+  clean, selector/guard/integration-note all intact; orchestrator re-verified independently.
+- **(8.2 + 8.3 COMPLETE.) Next: Bucket 6 (CI matrix wiring).** Cross-platform CI verification
+  of these two commits pending (dispatch `tests.yml` on `ci-stress`).

From 6f20a5b53d7af9a16b3a667415f467284f28e14a Mon Sep 17 00:00:00 2001
From: Ben Toner <bentoner@bentoner.com>
Date: Thu, 18 Jun 2026 04:22:08 +1000
Subject: [PATCH 40/76] Bucket 6a: de-stress tests.yml + record
 no-branch-protection decision
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Ben (2026-06-18): no branch protection — single-dev project. So the planned
`tests-passed` aggregator + per-job doc-only path-filter are dropped (both only
served a *required* check). tests.yml keeps the simple workflow-level
paths-ignore and reports per-cell matrix contexts directly. Recorded the
decision at the top of Bucket 6 in the build plan.

tests.yml de-stress = reverse-apply of the two stress-only commits' tests.yml
hunks (precise, nothing else touched):
- 980856b: per-run-unique concurrency group -> group: {workflow}-{ref} +
  cancel-in-progress: true.
- b430d73: drop the stress workflow_dispatch inputs, the GCL_STRESS_* env, and
  the `tests/with-load.sh` wrapper on each suite (suites run un-wrapped);
  restore original step timeouts (unit 15win/10posix, interop 10, integration 7)
  and job_timeouts (ubuntu/macos 35, win-unit 20, win-interop-integration 22).
The later tests/_harness.sh lint-list entry (b8e2951) is preserved.

actionlint clean (-shellcheck=); no with-load/GCL_STRESS/aggregator residue.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 .github/workflows/tests.yml                   | 42 ++++++-------------
 .../2026-06-17-ci-stress-phase2-build-plan.md | 16 +++++++
 2 files changed, 29 insertions(+), 29 deletions(-)

diff --git a/.github/workflows/tests.yml b/.github/workflows/tests.yml
index 268c257..2156133 100644
--- a/.github/workflows/tests.yml
+++ b/.github/workflows/tests.yml
@@ -17,21 +17,10 @@ on:
   schedule:
     - cron: '17 3 * * 1'   # weekly Monday run: catches runner-image/tool drift
   workflow_dispatch:
-    inputs:
-      stress_kind:
-        description: 'STRESS BRANCH: artificial load during suites — none|cpu|disk|both'
-        default: both
-      stress_load:
-        description: 'STRESS BRANCH: hogs per kind (blank = runner core count)'
-        default: ''
 
 concurrency:
-  # STRESS-BRANCH ONLY — do NOT merge to main. The per-run-unique group + no
-  # cancellation lets many workflow_dispatch runs execute in parallel on this one
-  # branch (flakiness stress test). On main the group is
-  # `${{ github.workflow }}-${{ github.ref }}` with cancel-in-progress: true.
-  group: ${{ github.workflow }}-${{ github.ref }}-${{ github.run_id }}
-  cancel-in-progress: false
+  group: ${{ github.workflow }}-${{ github.ref }}
+  cancel-in-progress: true
 
 permissions:
   contents: read
@@ -48,22 +37,17 @@ jobs:
         # process-spawn overhead, not the PowerShell engines). Suites must NOT run
         # concurrently inside one runner: they're timing-sensitive on 2-core
         # runners. POSIX legs are fast enough to stay single-job.
-        include:                       # STRESS BRANCH: job_timeouts raised to clear the summed step budgets under artificial load
-          - { os: ubuntu-24.04, leg: all, job_timeout: 80 }
-          - { os: macos-15, leg: all, job_timeout: 80 }
-          - { os: windows-2025, leg: unit, job_timeout: 40 }
-          - { os: windows-2025, leg: interop-integration, job_timeout: 50 }
+        include:
+          - { os: ubuntu-24.04, leg: all, job_timeout: 35 }
+          - { os: macos-15, leg: all, job_timeout: 35 }
+          - { os: windows-2025, leg: unit, job_timeout: 20 }
+          - { os: windows-2025, leg: interop-integration, job_timeout: 22 }
     timeout-minutes: ${{ matrix.job_timeout }}   # backstop only: sum of the leg's step budgets + upload headroom
     defaults:
       run:
         shell: bash                  # on windows-2025 this is Git Bash (MINGW) — what the interop suite requires
     env:
       GCL_TEST_FULL: 1               # full fan-out — CI runners are dedicated; the reduced default protects live dev boxes (TODO 58)
-      # STRESS-BRANCH ONLY (do not merge): artificial CPU/disk load wrapped around each
-      # suite (tests/with-load.sh) to widen timing windows and surface latency/race
-      # flakes. From the workflow_dispatch inputs; empty on push/schedule => 'none'.
-      GCL_STRESS_KIND: ${{ inputs.stress_kind || 'none' }}
-      GCL_STRESS_LOAD: ${{ inputs.stress_load }}
     steps:
       - uses: actions/checkout@9f698171ed81b15d1823a05fc7211befd50c8ae0   # v6.0.3, SHA-pinned
         with:
@@ -88,30 +72,30 @@ jobs:
 
       - name: Unit suite
         if: ${{ matrix.leg == 'all' || matrix.leg == 'unit' }}
-        timeout-minutes: ${{ matrix.os == 'windows-2025' && 30 || 25 }}   # STRESS BRANCH: raised (15->30 / 10->25) so artificial load slowness doesn't trip the step timeout and masquerade as a flake
+        timeout-minutes: ${{ matrix.os == 'windows-2025' && 15 || 10 }}   # a step timeout FAILS the step (not the job), so the upload step reliably runs; sized from run 27325978197 + one internal MAX_WAIT hang
         env:
           GCL_TEST_PRESERVE_DIR: ${{ github.workspace }}/test-output/failed-work/unit
         run: |
           mkdir -p test-output
-          bash tests/with-load.sh bash tests/git-commit-lock.test.sh 2>&1 | tee test-output/unit-suite.log
+          bash tests/git-commit-lock.test.sh 2>&1 | tee test-output/unit-suite.log
 
       - name: Interop suite (bash + pwsh)
         if: ${{ !cancelled() && (matrix.leg == 'all' || matrix.leg == 'interop-integration') }}   # run even if an earlier suite failed — every signal is useful
-        timeout-minutes: 25          # STRESS BRANCH: raised 10->25 for artificial load
+        timeout-minutes: 10
         env:
           GCL_TEST_PRESERVE_DIR: ${{ github.workspace }}/test-output/failed-work/interop
         run: |
           mkdir -p test-output
-          bash tests/with-load.sh bash tests/git-commit-lock.interop.test.sh 2>&1 | tee test-output/interop-suite.log
+          bash tests/git-commit-lock.interop.test.sh 2>&1 | tee test-output/interop-suite.log
 
       - name: Integration suite (real concurrent commits)
         if: ${{ !cancelled() && (matrix.leg == 'all' || matrix.leg == 'interop-integration') }}
-        timeout-minutes: 20          # STRESS BRANCH: raised 7->20 for artificial load (internal AGENT_LOCK_MAX_WAIT cap is 240s)
+        timeout-minutes: 7           # its internal AGENT_LOCK_MAX_WAIT cap is 240s
         env:
           GCL_TEST_PRESERVE_DIR: ${{ github.workspace }}/test-output/failed-work/integration
         run: |
           mkdir -p test-output
-          bash tests/with-load.sh bash tests/git-commit-lock.integration.test.sh 2>&1 | tee test-output/integration-suite.log
+          bash tests/git-commit-lock.integration.test.sh 2>&1 | tee test-output/integration-suite.log
 
       - name: Upload failure diagnostics
         if: ${{ failure() || cancelled() }}   # failure() covers step timeouts (they fail the step); cancelled() is best-effort cover for manual cancels / the job-level backstop
diff --git a/.plans/2026-06-17-ci-stress-phase2-build-plan.md b/.plans/2026-06-17-ci-stress-phase2-build-plan.md
index 75700f1..8da0f2a 100644
--- a/.plans/2026-06-17-ci-stress-phase2-build-plan.md
+++ b/.plans/2026-06-17-ci-stress-phase2-build-plan.md
@@ -196,6 +196,22 @@ bad_envelope() {
 
 ## Bucket 6 — CI matrix wiring (the accepted load-strategy §9 decisions)
 
+> **DECISION (Ben, 2026-06-18): NO branch protection — single-dev project.** We will not
+> enforce required status checks. Consequences for this bucket:
+> 1. **The `tests-passed` aggregator and the per-job doc-only path-filter (the `changes`
+>    job) are DROPPED.** Both existed only to make a *required* check behave well (one
+>    green context to require; doc-only PRs not blocked by it). With nothing required,
+>    `tests.yml` keeps the simple **workflow-level `paths-ignore`** and reports the per-cell
+>    matrix contexts directly. So **Bucket 6a = the de-stress revert only** (revert
+>    `980856b` + `b430d73`'s `tests.yml` half; restore original concurrency/timeouts; drop
+>    the stress `workflow_dispatch` inputs; suites run un-wrapped).
+> 2. The 3-workflow file split (`tests.yml` / `nightly.yml` / `deep-sweep.yml`) is **kept**,
+>    but now purely for separation of concerns (per-PR no-load gate vs scheduled load vs
+>    on-demand deep) — not to stop `workflow_dispatch` publishing gating contexts (moot
+>    without protection). The "distinct `deep-*` job names" detail is likewise now cosmetic.
+> The paragraphs below that describe the aggregator / path-filter / required-context gotchas
+> are **SUPERSEDED** by this note; keep them only as the rationale for why they're unneeded.
+
 **Three-workflow structure** (revised after review — a `workflow_dispatch` run
 publishes check contexts on the head SHA, so keeping Deep in `tests.yml` under shared
 job names risks a failed Deep run gating a PR; separate files + a stable required

From 43cb64810f541e0b4adc77c7f27b885a41e30aa7 Mon Sep 17 00:00:00 2001
From: Ben Toner <bentoner@bentoner.com>
Date: Thu, 18 Jun 2026 04:22:32 +1000
Subject: [PATCH 41/76] Bucket 6b: graduate tests/with-load.sh (calibrated
 ratio + load-manifest)

Promote with-load.sh from stress-branch scaffolding to a main-worthy, calibrated
load wrapper (used by the nightly/deep-sweep tiers, not the per-PR tests.yml):
- Load expressed as oversubscription ratio R = stressors/nproc (GCL_STRESS_RATIO),
  with a total-ratio cap (GCL_STRESS_RATIO_MAX, default 2); GCL_STRESS_LOAD kept as
  a back-compat raw-count override. GCL_STRESS_KIND=none|cpu|disk|both; none/unset
  is a clean pass-through (zero load, propagates the wrapped command's exit code).
- Prefers stress-ng when present, portable shell spinner fallback (Windows + here);
  disk churn via dd conv=fsync. Probe-gated Linux cgroup-v2 CPU-quota path (recorded,
  Linux-only; not actuated elsewhere). IO throttling intentionally not relied on.
- Emits a per-run load-manifest JSON (kind, R, nproc, stressor counts, achieved
  slowdown, tool versions, os/arch, git sha) under test-output/ for reproducibility.
- Robust teardown: every spawned stressor PID tracked and killed by exact PID on a
  trap (never by name); verified no leak on success and on a failing wrapped command.
- Do-not-merge banner stripped.

Validated locally: shellcheck -S style + bash -n clean; pass-through (none) -> exit
propagated; cpu R=1 -> R*nproc spinners, 2.29x slowdown, clean reap, manifest written.
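
As a rough illustration of the calibration described above, the sketch below
computes a per-kind stressor count from an integer ratio R, a core count, the
number of selected kinds, and the total-ratio cap. The function name
stressors_per_kind is hypothetical, and the cap arithmetic (integer division of
the capped total across kinds) is an assumption about how the proportional
scale-down could work, not a copy of with-load.sh.

```bash
#!/usr/bin/env bash
# Hypothetical helper, not part of with-load.sh.
stressors_per_kind() {  # $1=R $2=nproc $3=n_kinds $4=ratio_max
  local r="$1" cores="$2" kinds="$3" cap="$4" per_kind total_cap
  per_kind=$(( r * cores ))
  # Floor at 1 when a positive ratio would otherwise yield no stressors.
  if [ "$r" -gt 0 ] && [ "$per_kind" -lt 1 ]; then per_kind=1; fi
  # Assumed cap rule: total stressors across kinds may not exceed cap * cores.
  total_cap=$(( cap * cores ))
  if [ "$kinds" -gt 0 ] && [ $(( per_kind * kinds )) -gt "$total_cap" ]; then
    per_kind=$(( total_cap / kinds ))
  fi
  echo "$per_kind"
}

stressors_per_kind 1 4 1 2   # cpu only, R=1 on 4 cores: prints 4
stressors_per_kind 2 4 2 2   # both, R=2 on 4 cores: total 16 is cut to 8, prints 4
stressors_per_kind 1 2 2 2   # both, R=1 on 2 cores: total 4 is within the cap, prints 2
```

Under this assumed rule, `both` at R=1 already sits exactly at the default cap
of 2, so raising R alone has no effect until GCL_STRESS_RATIO_MAX is raised too.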

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 tests/with-load.sh | 279 +++++++++++++++++++++++++++++++++++++++++----
 1 file changed, 257 insertions(+), 22 deletions(-)

diff --git a/tests/with-load.sh b/tests/with-load.sh
index e19ae5f..077511f 100644
--- a/tests/with-load.sh
+++ b/tests/with-load.sh
@@ -1,40 +1,183 @@
 #!/usr/bin/env bash
-# STRESS-BRANCH ONLY — do NOT merge to main.
+# with-load.sh — run a command under a calibrated, reproducible background load.
 #
-# Run "$@" while artificial CPU and/or disk load saturates the runner, to widen the
-# timing windows that latency/race flakes depend on (e.g. Test 17d's churn "absent
-# window" — driven by both CPU descheduling of the churner AND slow file create/delete
-# IO). Hogs are reaped by their EXACT PIDs afterward (never by name), so this is safe on
-# a shared machine; on an ephemeral CI runner it is doubly safe.
+# Usage:   bash tests/with-load.sh <cmd> [args...]
+# Example: bash tests/with-load.sh bash tests/git-commit-lock.test.sh
 #
-#   GCL_STRESS_KIND = none | cpu | disk | both   (default: both)
-#   GCL_STRESS_LOAD = N hogs of EACH selected kind (default: detected core count)
+# Wraps "$@", applies artificial background load for the command's lifetime, then
+# tears the load down (by EXACT spawned PIDs — never by name, so it is safe on a
+# shared dev box and doubly safe on an ephemeral CI runner) and exits with the
+# wrapped command's exit code.
 #
-# CPU hog  = a bare bash spin loop (one core each).
-# Disk hog = a tight create / write+fsync / delete loop of a small file on the same
-#            volume as the test's scratch dir (TMPDIR) — metadata + write-back pressure
-#            that contends with the lock-file create/delete the suite itself does.
+# WHY load exists here (see docs/load-testing-strategy.md §1): this protocol's
+# *correctness* is load-independent (O_EXCL + atomic rename + per-attempt tokens
+# never consult the clock for a correctness decision), so load cannot break
+# exclusion. Load's only jobs are (J1) perturb scheduling so the protocol's
+# multi-syscall sequences get preempted at adversarial points, and (J2) stretch
+# the few genuinely timing-derived decisions. Magnitude past ~2x CPU
+# oversubscription mostly manufactures harness wall-clock flakes, not bugs — which
+# is why load is expressed as an oversubscription RATIO and the total ratio is
+# CAPPED.
+#
+# ── Calibrated interface (the contract nightly/deep-sweep CI calls against) ──────
+#
+#   GCL_STRESS_KIND        none | cpu | disk | both        (default: none)
+#                          none/unset => CLEAN PASS-THROUGH: zero added load, the
+#                          command's exit code is propagated verbatim.
+#
+#   GCL_STRESS_RATIO       Oversubscription ratio R = stressors / nproc, PER KIND.
+#                          (default: 1)  Stressors-per-kind = round(R * nproc),
+#                          floored at 1 when a kind is selected. Runner-independent:
+#                          "R=2" means the same pressure on a 2-core and a 32-core box,
+#                          whereas a raw hog count does not.
+#
+#   GCL_STRESS_RATIO_MAX   Cap on the TOTAL oversubscription ratio across all kinds
+#                          (default: 2). `both` runs cpu + disk, so its total ratio is
+#                          2*R; this cap scales each kind's stressor count down
+#                          proportionally so the runner is never wedged. Set the
+#                          deep-sweep flake-hunt higher deliberately.
+#
+#   GCL_STRESS_LOAD        BACK-COMPAT raw-count override. If set to a positive
+#                          integer it REPLACES the ratio computation: exactly N
+#                          stressors per selected kind (still capped by RATIO_MAX
+#                          unless GCL_STRESS_RATIO_MAX is also raised). Empty/unset =>
+#                          use the ratio. Kept so the existing deep-sweep
+#                          `stress_load=N` dispatch input keeps working.
+#
+#   GCL_STRESS_CGROUP      1 => on Linux with a writable cgroup v2 cpu controller,
+#                          PROBE the calibrated cgroup CPU-quota path (envelope leg).
+#                          The probe is recorded in the manifest. cgroup IO throttling
+#                          is experimental and intentionally NOT attempted here.
+#                          (default: 0)  Absent/unwritable => fall back to spinners.
+#
+#   GCL_LOAD_MANIFEST      Path for the per-run load-manifest JSON
+#                          (default: test-output/load-manifest.<pid>.json, created
+#                          under a known dir so CI can upload it). One file per run,
+#                          capturing {kind, R, nproc, stressor counts, achieved
+#                          slowdown, tool versions, os/arch, git sha} so any flake is
+#                          reproducible. Written on success too.
+#
+# CPU stressor: `stress-ng --cpu` when available (calibrated, measurable), else a
+#               portable bash spin loop (one busy core each).
+# Disk stressor: a tight create / write+fsync / delete loop over a small file on the
+#               same volume as the test scratch dir — metadata + write-back pressure
+#               that contends with the lock-file create/delete the suite itself does.
+#               (Always the portable shell hog; cross-platform, low-fidelity but real
+#               metadata-op pressure — see strategy §4.)
 set -uo pipefail
 
-kind="${GCL_STRESS_KIND:-both}"
-cores="$(nproc 2>/dev/null || sysctl -n hw.ncpu 2>/dev/null || echo 4)"
-load="${GCL_STRESS_LOAD:-$cores}"
-case "$load" in ''|*[!0-9]*) load="$cores" ;; esac   # guard non-numeric / empty
+# ── Inputs ───────────────────────────────────────────────────────────────────
+kind="${GCL_STRESS_KIND:-none}"
+nproc_count="$(nproc 2>/dev/null || sysctl -n hw.ncpu 2>/dev/null || echo 4)"
+case "$nproc_count" in ''|*[!0-9]*) nproc_count=4 ;; esac
+[ "$nproc_count" -lt 1 ] && nproc_count=1
+
+ratio="${GCL_STRESS_RATIO:-1}"
+case "$ratio" in ''|*[!0-9]*) ratio=1 ;; esac   # integer ratios only (R in {0,1,2,…})
+
+ratio_max="${GCL_STRESS_RATIO_MAX:-2}"
+case "$ratio_max" in ''|*[!0-9]*) ratio_max=2 ;; esac
+
+raw_load="${GCL_STRESS_LOAD:-}"
+case "$raw_load" in *[!0-9]*) raw_load="" ;; esac   # non-numeric => ignore, use ratio
+
+manifest="${GCL_LOAD_MANIFEST:-test-output/load-manifest.$$.json}"
+
+# ── Stressor-count calibration ─────────────────────────────────────────────────
+# Per-kind count: raw-count override wins, else R * nproc (integer R), floored at 1 when R > 0.
+if [ -n "$raw_load" ]; then
+  per_kind="$raw_load"
+else
+  per_kind=$(( ratio * nproc_count ))
+  [ "$ratio" -gt 0 ] && [ "$per_kind" -lt 1 ] && per_kind=1
+fi
+
+# How many kinds spawn stressors.
+n_kinds=0
+case "$kind" in
+  cpu|disk) n_kinds=1 ;;
+  both)     n_kinds=2 ;;
+esac
+
+# R_total cap: total stressors must not exceed ratio_max * nproc. `both` would
+# otherwise be 2*per_kind; scale each kind down proportionally if it would breach.
+cpu_count=0
+disk_count=0
+capped="no"
+if [ "$n_kinds" -gt 0 ] && [ "$per_kind" -gt 0 ]; then
+  total_cap=$(( ratio_max * nproc_count ))
+  [ "$total_cap" -lt "$n_kinds" ] && total_cap="$n_kinds"   # always allow >=1 per active kind
+  requested_total=$(( per_kind * n_kinds ))
+  if [ "$requested_total" -gt "$total_cap" ]; then
+    per_kind=$(( total_cap / n_kinds ))
+    [ "$per_kind" -lt 1 ] && per_kind=1
+    capped="yes"
+  fi
+  case "$kind" in
+    cpu)  cpu_count="$per_kind" ;;
+    disk) disk_count="$per_kind" ;;
+    both) cpu_count="$per_kind"; disk_count="$per_kind" ;;
+  esac
+fi
 
+# ── Tool discovery ─────────────────────────────────────────────────────────────
+stress_ng_bin="$(command -v stress-ng 2>/dev/null || true)"
+stress_ng_ver="none"
+[ -n "$stress_ng_bin" ] && stress_ng_ver="$("$stress_ng_bin" --version 2>/dev/null | head -1 | tr -d '\r')"
+bash_ver="$(bash --version 2>/dev/null | head -1 | tr -d '\r')"
+os_uname="$(uname -srm 2>/dev/null | tr -d '\r' || echo unknown)"
+git_sha="$(git rev-parse --short HEAD 2>/dev/null || echo unknown)"
+
+# CPU mechanism actually used.
+cpu_mech="none"
+[ "$cpu_count" -gt 0 ] && { [ -n "$stress_ng_bin" ] && cpu_mech="stress-ng" || cpu_mech="spinner"; }
+
+# ── cgroup v2 CPU-quota probe (Linux envelope leg only; probe-gated) ───────────
+# We only PROBE writability + record it; we do not create scopes here (that needs a
+# usable systemd manager — see strategy §3). IO throttling is experimental: skipped.
+cgroup_probe="not-requested"
+if [ "${GCL_STRESS_CGROUP:-0}" = 1 ]; then
+  cgroup_probe="unavailable"
+  if [ "$(uname -s 2>/dev/null)" = "Linux" ] && [ -r /sys/fs/cgroup/cgroup.controllers ]; then
+    if grep -qw cpu /sys/fs/cgroup/cgroup.controllers 2>/dev/null; then
+      # cpu controller present at the v2 root; can we delegate it (subtree_control writable)?
+      if [ -w /sys/fs/cgroup/cgroup.subtree_control ] 2>/dev/null; then
+        cgroup_probe="writable"   # the calibrated quota path is reachable on this leg
+      else
+        cgroup_probe="present-not-delegated"
+      fi
+    else
+      cgroup_probe="no-cpu-controller"
+    fi
+  else
+    cgroup_probe="no-cgroup-v2"
+  fi
+fi
+
+# ── Stressor scratch dir (same volume as the test scratch) ─────────────────────
 hogdir="${TMPDIR:-/tmp}/gcl-stress.$$"
 mkdir -p "$hogdir" 2>/dev/null || hogdir="."
 
+# ── Spawn / teardown (track EXACT PIDs; kill only those) ───────────────────────
 hogs=()
+
 spawn_cpu() {
   local i
-  for ((i = 0; i < load; i++)); do
-    bash -c 'while :; do :; done' &
+  if [ "$cpu_mech" = "stress-ng" ]; then
+    # One stress-ng manager spawning $cpu_count workers; reap the manager's PID.
+    "$stress_ng_bin" --cpu "$cpu_count" --cpu-load 100 >/dev/null 2>&1 &
     hogs+=("$!")
-  done
+  else
+    for ((i = 0; i < cpu_count; i++)); do
+      bash -c 'while :; do :; done' &
+      hogs+=("$!")
+    done
+  fi
 }
+
 spawn_disk() {
   local i
-  for ((i = 0; i < load; i++)); do
+  for ((i = 0; i < disk_count; i++)); do
     bash -c '
       d="$1"; j=0
       while :; do
@@ -46,24 +189,116 @@ spawn_disk() {
     hogs+=("$!")
   done
 }
+
 cleanup() {
   local p
   for p in "${hogs[@]:-}"; do
     [ -n "$p" ] && kill "$p" 2>/dev/null
   done
+  # stress-ng forks workers under its manager; kill the worker group too (only the
+  # manager PIDs we spawned are used as the group leader — never a name match).
+  if [ "$cpu_mech" = "stress-ng" ]; then
+    for p in "${hogs[@]:-}"; do
+      [ -n "$p" ] && kill -- "-$p" 2>/dev/null   # negative PID = the manager's process group
+    done
+  fi
   rm -rf "$hogdir" 2>/dev/null
 }
 trap cleanup EXIT INT TERM
 
+# ── Achieved-slowdown micro-benchmark (cheap fixed busy-loop, baseline vs loaded) ─
+# A small fixed integer loop timed once unloaded (baseline) and once mid-load gives a
+# coarse, reproducible "how much did this load slow a CPU-bound task" figure for the
+# manifest. Pure bash, no deps. Only run when load is actually applied — on the
+# none/pass-through path it would be pure overhead.
+micro_bench() {
+  local start end k=0
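+  start="$(date +%s%N 2>/dev/null)"; case "$start" in ''|*[!0-9]*) start=0 ;; esac   # a date without %N yields non-digits
+  while [ "$k" -lt 50000 ]; do k=$((k + 1)); done
+  end="$(date +%s%N 2>/dev/null)"; case "$end" in ''|*[!0-9]*) end=0 ;; esac
+  echo $(( (end - start) / 1000000 ))   # ms; 0 when %N is unsupported, which skips the slowdown calc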
+}
+
+# Will any stressors spawn? (kind selected AND a positive per-kind count.)
+will_load="no"
+case "$kind" in
+  cpu)  [ "$cpu_count"  -gt 0 ] && will_load="yes" ;;
+  disk) [ "$disk_count" -gt 0 ] && will_load="yes" ;;
+  both) { [ "$cpu_count" -gt 0 ] || [ "$disk_count" -gt 0 ]; } && will_load="yes" ;;
+esac
+
+base_ms=0
+loaded_ms=0
+slowdown="1.00"
+[ "$will_load" = yes ] && base_ms="$(micro_bench)"
+
+# ── Apply load ─────────────────────────────────────────────────────────────────
 case "$kind" in
   cpu)  spawn_cpu ;;
   disk) spawn_disk ;;
   both) spawn_cpu; spawn_disk ;;
   none) : ;;
-  *) echo "with-load: unknown GCL_STRESS_KIND='$kind' — running with NO load" >&2 ;;
+  *) echo "with-load: unknown GCL_STRESS_KIND='$kind' — running with NO load" >&2; kind="none" ;;
 esac
-echo "stress: kind=$kind load=$load cores=$cores hogs=${#hogs[@]} :: $*"
 
+if [ "${#hogs[@]}" -gt 0 ] && [ "$base_ms" -gt 0 ]; then
+  loaded_ms="$(micro_bench)"
+  # slowdown = loaded/base to 2 dp, integer-only arithmetic. Pad the centi-value to
+  # >=3 digits so the integer part is always whatever precedes the last 2 digits
+  # (handles slowdown <1.00 from timing noise, e.g. 80 -> "0.80").
+  centi="$(( loaded_ms * 100 / base_ms ))"
+  while [ "${#centi}" -lt 3 ]; do centi="0$centi"; done
+  slowdown="${centi%??}.${centi: -2}"
+fi
+
+# ── Write the load-manifest (best-effort; never fails the run) ──────────────────
+write_manifest() {
+  local dir
+  dir="$(dirname "$manifest")"
+  mkdir -p "$dir" 2>/dev/null || return 0
+  # Hand-rolled JSON (no jq/python dependency on the runner). Escape the JSON-special
+  # chars in string values: backslash, double-quote, and the control chars that the
+  # wrapped command line can legitimately contain (newline/tab/CR) — a raw newline in
+  # a value is invalid JSON. awk keeps this robust where sed's newline handling is not.
+  esc() {
+    printf '%s' "$1" | awk '
+      BEGIN { ORS = "" }
+      {
+        if (NR > 1) printf "\\n"          # join input lines with an escaped newline
+        gsub(/\\/, "&&"); gsub(/"/, "\\\""); gsub(/\t/, "\\t"); gsub(/\r/, "\\r")   # "&&" = matched backslash twice
+        print
+      }'
+  }
+  {
+    printf '{\n'
+    printf '  "kind": "%s",\n'            "$(esc "$kind")"
+    printf '  "ratio_R": %s,\n'          "$ratio"
+    printf '  "ratio_max": %s,\n'        "$ratio_max"
+    printf '  "raw_load_override": "%s",\n' "$(esc "${raw_load:-}")"
+    printf '  "nproc": %s,\n'            "$nproc_count"
+    printf '  "cpu_stressors": %s,\n'    "$cpu_count"
+    printf '  "disk_stressors": %s,\n'   "$disk_count"
+    printf '  "total_stressors": %s,\n'  "${#hogs[@]}"
+    printf '  "ratio_total_capped": "%s",\n' "$capped"
+    printf '  "cpu_mechanism": "%s",\n'  "$(esc "$cpu_mech")"
+    printf '  "cgroup_cpu_probe": "%s",\n' "$(esc "$cgroup_probe")"
+    printf '  "baseline_ms": %s,\n'      "$base_ms"
+    printf '  "loaded_ms": %s,\n'        "$loaded_ms"
+    printf '  "achieved_slowdown": %s,\n' "$slowdown"
+    printf '  "stress_ng_version": "%s",\n' "$(esc "$stress_ng_ver")"
+    printf '  "bash_version": "%s",\n'   "$(esc "$bash_ver")"
+    printf '  "os_arch": "%s",\n'        "$(esc "$os_uname")"
+    printf '  "git_sha": "%s",\n'        "$(esc "$git_sha")"
+    printf '  "command": "%s"\n'         "$(esc "$*")"
+    printf '}\n'
+  } > "$manifest" 2>/dev/null || true
+}
+write_manifest "$@"
+
+echo "stress: kind=$kind R=$ratio nproc=$nproc_count cpu=$cpu_count disk=$disk_count" \
+     "mech=$cpu_mech capped=$capped slowdown=${slowdown}x manifest=$manifest :: $*"
+
+# ── Run the wrapped command, tear down, propagate its exit code ─────────────────
 "$@"
 rc=$?
 

From 36b0033eec931ef0e9552abecc9295905892054d Mon Sep 17 00:00:00 2001
From: Ben Toner <bentoner@bentoner.com>
Date: Thu, 18 Jun 2026 04:24:05 +1000
Subject: [PATCH 42/76] chore: gitignore test-output/ (runtime CI/test artifact
 dir)

CI jobs `mkdir -p test-output` for suite logs, and the graduated with-load.sh
writes its load-manifest there. It's never committed; ignore it so it can't be
swept into a commit locally.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 .gitignore | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/.gitignore b/.gitignore
index 9bdb6bd..abf679e 100644
--- a/.gitignore
+++ b/.gitignore
@@ -11,3 +11,6 @@ Thumbs.db
 /.agent/review-queue.lock.*
 /.agent/last-opened
 /.agent/.tmp.*
+
+# Test/CI artifact output (manifests, suite logs); created at runtime, never committed.
+test-output/

From 6a33cbe37116a852cb9e7bb12f47808ca3300ec4 Mon Sep 17 00:00:00 2001
From: Ben Toner <bentoner@bentoner.com>
Date: Thu, 18 Jun 2026 05:58:01 +1000
Subject: [PATCH 43/76] Bucket 6e: Axis-A waiter-count sweep (GCL_TEST_SWEEP),
 nightly/deep-only

Parametrize the fan-out/contention tests over a waiter-count axis so the
nightly/deep CI tiers exercise more surface, while per-PR (default) runs stay
byte-identical.

- T_AXIS_A list + GCL_TEST_SWEEP read in tests/_harness.sh: unset/0 -> "4"
  (today's floor, deterministic); =1 -> "4 12 24".
- Test 2b, Test 20 (unit) and interop Test 16 loop their waiter count N over
  T_AXIS_A, naming N in every assertion. Test 20 keeps its mode floor and
  appends 12,24 when sweeping.
- Anti-flake discipline: correctness assertions stay strict (ok/bad) and
  config-independent; MAX_WAIT and STALE scale with N (a real N=24 over-steal
  was caught and fixed by scaling STALE>=N when sweeping, keeping exactly-one-
  steal strict at every N). Codex-reviewed default-byte-identicality hardenings
  adopted (Test 20 default MAX_WAIT, fixture-token N-segment, recov-log glob).

Validated: default unit 315/0 + interop 141/0 (byte-identical); GCL_TEST_SWEEP=1
unit 337/0 + interop 163/0, all N pass; selector still works; shellcheck clean.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
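Notes: a minimal sketch of the per-N budget rule described in the message above,
for anyone reproducing a sweep by hand. `sweep_budget` is a hypothetical helper
that exists only in this note; the suites inline the same arithmetic as
t2b_maxwait/t2b_stale and t16_maxwait/t16_stale. It assumes the unit-suite
behaviour, where the default run pins MAX_WAIT=120 and STALE=8.

```shell
#!/usr/bin/env bash
# Illustrative only: sweep_budget is a hypothetical helper, not part of the suites.
# Prints "<max_wait> <stale>" for waiter count N.
#   default run (sweep=0): today's fixed 120 / 8
#   sweep run   (sweep=1): MAX_WAIT = 30*N, STALE = N when N > 8, else 8
sweep_budget() {
  local n="$1" sweep="${2:-0}" max_wait=120 stale=8
  if [ "$sweep" = 1 ]; then
    max_wait=$(( 30 * n ))
    if [ "$n" -gt 8 ]; then stale="$n"; fi
  fi
  echo "$max_wait $stale"
}

# Walk the sweep list the harness uses under GCL_TEST_SWEEP=1.
for n in 4 12 24; do
  echo "N=$n -> $(sweep_budget "$n" 1)"
done
```

With the sweep list "4 12 24" this prints 120 8, 360 12 and 720 24, so N=4
matches the default run's values and only the wider counts raise the budgets.
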
 tests/_harness.sh                     |  15 +++
 tests/git-commit-lock.interop.test.sh | 126 ++++++++++++---------
 tests/git-commit-lock.test.sh         | 155 +++++++++++++++++++-------
 3 files changed, 206 insertions(+), 90 deletions(-)

diff --git a/tests/_harness.sh b/tests/_harness.sh
index d5d8215..88b344c 100644
--- a/tests/_harness.sh
+++ b/tests/_harness.sh
@@ -35,6 +35,21 @@ PASS=0; FAIL=0; TAPN=0; DONE=0; SECTIONS_RUN=0
 GCL_TAP="${GCL_TAP:-0}"           # CI sets GCL_TAP=1 for machine-readable TAP13 output
 GCL_TEST_ONLY="${GCL_TEST_ONLY:-}"  # if set, run ONLY test blocks whose label REGEX-matches (single-test selector)
 
+# Axis-A waiter-count sweep (Bucket 6). GCL_TEST_SWEEP=1 (nightly/deep CI) widens
+# the fan-out/contention tests over several waiter counts to wring more coverage
+# from the existing tests; unset/0 (per-PR default + plain dev) keeps the floor so
+# default runs are byte-identical to today. T_AXIS_A is the shared waiter-count
+# list the contention tests (unit Test 2b, interop Test 16) iterate N over; each
+# names N in every assertion message so a sweep failure says which N broke. The
+# floor is 4 — the count those two tests hardcode today, so the single-element
+# default reproduces today's behaviour exactly. (Test 20's floor is mode-driven
+# `$T20_N` (5 REDUCED / 10 FULL), not 4, so it composes its own list from $T20_N +
+# the sweep's higher counts rather than from T_AXIS_A — see that test.)
+GCL_TEST_SWEEP="${GCL_TEST_SWEEP:-0}"
+# shellcheck disable=SC2034  # T_AXIS_A is consumed by the sourcing suites (unit
+# Test 2b, interop Test 16), not within this harness file.
+if [ "$GCL_TEST_SWEEP" = 1 ]; then T_AXIS_A="4 12 24"; else T_AXIS_A="4"; fi
+
 # ok/bad are TAP-aware (gated by GCL_TAP so plain dev runs are byte-unchanged) and
 # bump the running assertion number TAPN. The trailing `1..$TAPN` plan line (emitted
 # by each suite just before its verdict) lets a TAP consumer fail on a short count;
diff --git a/tests/git-commit-lock.interop.test.sh b/tests/git-commit-lock.interop.test.sh
index 4bad30f..0244d1a 100644
--- a/tests/git-commit-lock.interop.test.sh
+++ b/tests/git-commit-lock.interop.test.sh
@@ -838,18 +838,18 @@ fi
 
 if section "Test 16: crash recovery under CONTENTION, mixed impls — claim-serialized: zero displacement, zero 98s"; then
 # Cross-impl variant of the unit suite's Test 2b (which carries the full
-# rationale): 2 bash + 2 pwsh waiters race ONE crashed lock. Under the claim
-# protocol the straggler-robs-recovery-winner race is PREVENTED (the claim
-# serializes stealers across the wire format), not detected-and-repaired, so
-# the assertions are strict: every waiter exits 0 (zero spurious 98s — an
-# unserialized implementation displaces the recovery winner near-certainly),
-# exactly ONE STOLE-BY-CLAIM, NO move-aside file ever exists (an
-# implementation that staged the steal through an intermediate .dead.* file
-# would re-open the displacement race; a background sampler proves no such
-# file ever appears — and the unserialized "STOLE stale lock" line shape and
-# any STEAL-DISPLACED repair line must never appear), and the final state
-# is clean (no lock, no claim). Sync: waiters launch against a FRESH
-# fabricated lock and only once all four have logged WAITING is it
+# rationale): N waiters split half bash / half pwsh race ONE crashed lock.
+# Under the claim protocol the straggler-robs-recovery-winner race is
+# PREVENTED (the claim serializes stealers across the wire format), not
+# detected-and-repaired, so the assertions are strict: every waiter exits 0
+# (zero spurious 98s — an unserialized implementation displaces the recovery
+# winner near-certainly), exactly ONE STOLE-BY-CLAIM, NO move-aside file ever
+# exists (an implementation that staged the steal through an intermediate
+# .dead.* file would re-open the displacement race; a background sampler proves
+# no such file ever appears — and the unserialized "STOLE stale lock" line
+# shape and any STEAL-DISPLACED repair line must never appear), and the final
+# state is clean (no lock, no claim). Sync: waiters launch against a FRESH
+# fabricated lock and only once all have logged WAITING is it
 # backdated, so all judge stale within one poll window despite pwsh's slow
 # cold start; the sync keeps the ghost fresh while it waits
 # (sync_waiting_fresh) so a stalled sync can't let the ghost age stale on
@@ -861,13 +861,34 @@ if section "Test 16: crash recovery under CONTENTION, mixed impls — claim-seri
 # the run's premise is broken (the touch may have aged the WINNER'S live
 # lock), so the run is discarded and retried (bounded) instead of failing
 # assertions the protocol never violated.
+#
+# Waiter count is swept over $T_AXIS_A (Bucket 6): one iteration at N=4 by
+# default (2 bash + 2 pwsh — byte-identical to today) and at N=4,12,24 under
+# GCL_TEST_SWEEP=1. N is split into a bash half (N/2) and a pwsh half (the
+# remainder); at N=4 that is 2+2 exactly. The correctness invariants stay strict
+# at EVERY N — but that needs STALE >> the winner's EFFECTIVE hold, which grows
+# with N under load (the winner is one of N concurrent processes), so STALE is
+# floored to N when sweeping (t16_stale); at the default floor it is the same 8
+# as today. MAX_WAIT scales too (30*N => 120 at N=4) so a wide, pwsh-cold-start-
+# heavy sweep has time to drain. The per-N tag on the non-count-naming
+# assertions is suppressed in the default run so the messages stay byte-identical.
 LOCK="$WORK/recov.lock"
 T16_TRIES=3
 T16_GRAVESEEN="$WORK/recov.graveseen"; T16_SAMPSTOP="$WORK/recov.sampstop"
+for T16_N in $T_AXIS_A; do
+t16_nsh=$(( T16_N / 2 )); t16_nps=$(( T16_N - t16_nsh ))   # bash half + pwsh half (2+2 at N=4)
+t16_maxwait=$(( 30 * T16_N ))
+# STALE budget: today's 8 in the default (non-sweep) run for byte-identical
+# behaviour; when sweeping, floor it to N so a wide fan-out's load-stretched
+# winner hold can never make its own live lock look stale (a legitimate but
+# unwanted second steal), keeping "exactly one steal" strict at every N.
+if [ "$GCL_TEST_SWEEP" = 1 ] && [ "$T16_N" -gt 8 ]; then t16_stale="$T16_N"; else t16_stale=8; fi
+if [ "$GCL_TEST_SWEEP" = 1 ]; then t16_ntag=" at N=$T16_N"; else t16_ntag=""; fi
 t16_valid=0; t16_sync=1; t16_fail=0; n98=0
 for t16_try in $(seq 1 "$T16_TRIES"); do
-  T16_GHOST="tok.ghost.recov.$t16_try"
-  rm -f "$WORK"/recov.ran.* "$T16_GRAVESEEN" "$T16_SAMPSTOP" "$LOCK" "$LOCK.next" 2>/dev/null
+  T16_GHOST="tok.ghost.recov.$T16_N.$t16_try"
+  rm -f "$WORK"/recov.ran.* "$WORK"/recov-sh*.log "$WORK"/recov-ps*.log \
+        "$T16_GRAVESEEN" "$T16_SAMPSTOP" "$LOCK" "$LOCK.next" 2>/dev/null
   fabricate_lock "$LOCK" "$T16_GHOST" "pid=999 host=ghost"   # fresh mtime: not yet stale
   (
     while [ ! -e "$T16_SAMPSTOP" ]; do
@@ -878,41 +899,45 @@ for t16_try in $(seq 1 "$T16_TRIES"); do
     done
   ) &
   t16_sampler=$!
-  pids=()
-  for i in 1 2; do
+  pids=(); t16_logs=()
+  for i in $(seq 1 "$t16_nsh"); do
     : > "$WORK/recov-sh$i.log"   # per-waiter logs: concurrent appends to one log drop lines
-    AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$WORK/recov-sh$i.log" AGENT_LOCK_STALE_SECS=8 \
-      AGENT_LOCK_CLAIM_STALE_SECS=60 AGENT_LOCK_POLL_SECS=0.05 AGENT_LOCK_MAX_WAIT=120 \
+    t16_logs+=("$WORK/recov-sh$i.log")
+    AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$WORK/recov-sh$i.log" AGENT_LOCK_STALE_SECS="$t16_stale" \
+      AGENT_LOCK_CLAIM_STALE_SECS=60 AGENT_LOCK_POLL_SECS=0.05 AGENT_LOCK_MAX_WAIT="$t16_maxwait" \
       bash "$SH" run -- bash -c 'touch "$1"; sleep 0.1' _ "$WORK/recov.ran.sh$i" 2>/dev/null &
     pids+=($!)
   done
-  for i in 1 2; do
+  for i in $(seq 1 "$t16_nps"); do
     : > "$WORK/recov-ps$i.log"
-    AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$WORK/recov-ps$i.log" AGENT_LOCK_STALE_SECS=8 \
-      AGENT_LOCK_CLAIM_STALE_SECS=60 AGENT_LOCK_POLL_SECS=0.05 AGENT_LOCK_MAX_WAIT=120 \
+    t16_logs+=("$WORK/recov-ps$i.log")
+    AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$WORK/recov-ps$i.log" AGENT_LOCK_STALE_SECS="$t16_stale" \
+      AGENT_LOCK_CLAIM_STALE_SECS=60 AGENT_LOCK_POLL_SECS=0.05 AGENT_LOCK_MAX_WAIT="$t16_maxwait" \
       pwsh -NoProfile -File "$PS1WIN" run "[IO.File]::WriteAllText('$WORK/recov.ran.ps$i', 'x'); Start-Sleep -Milliseconds 100" 2>/dev/null &
     pids+=($!)
   done
   t16_sync=1
-  if ! sync_waiting_fresh "$LOCK" 90 "$WORK/recov-sh1.log" "$WORK/recov-sh2.log" \
-                          "$WORK/recov-ps1.log" "$WORK/recov-ps2.log"; then
+  if ! sync_waiting_fresh "$LOCK" 90 "${t16_logs[@]}"; then
     t16_sync=0
-    for f in "$WORK/recov-sh1.log" "$WORK/recov-sh2.log" "$WORK/recov-ps1.log" "$WORK/recov-ps2.log"; do
-      grep -q "WAITING for lock" "$f" 2>/dev/null || echo "  T16 waiter never contended (no WAITING in ${f##*/})"
+    for f in "${t16_logs[@]}"; do
+      grep -q "WAITING for lock" "$f" 2>/dev/null || echo "  T16 N=$T16_N waiter never contended (no WAITING in ${f##*/})"
     done
   fi
-  backdate_ghost "$LOCK" "$T16_GHOST" 9999; t16_bd=$?   # all four now judge the ghost stale together
+  backdate_ghost "$LOCK" "$T16_GHOST" 9999; t16_bd=$?   # all waiters now judge the ghost stale together
   t16_fail=0; n98=0
   for p in "${pids[@]}"; do
     wait "$p"; rc=$?
     case "$rc" in
       0)  ;;
-      98) n98=$((n98+1)); echo "  T16 waiter rc=98 — displacement under the claim protocol" ;;
-      *)  t16_fail=1; echo "  T16 waiter rc=$rc (want 0)" ;;
+      98) n98=$((n98+1)); echo "  T16 N=$T16_N waiter rc=98 — displacement under the claim protocol" ;;
+      *)  t16_fail=1; echo "  T16 N=$T16_N waiter rc=$rc (want 0)" ;;
     esac
   done
   touch "$T16_SAMPSTOP"; wait "$t16_sampler" 2>/dev/null
-  cat "$WORK"/recov-*.log > "$WORK/recov-all.log" 2>/dev/null || : > "$WORK/recov-all.log"
+  # Aggregate from the explicit per-waiter log list, NOT a recov-*.log glob: the
+  # glob would also match recov-all.log itself, which now persists across sweep N
+  # iterations, so a glob could self-cat a stale aggregate into the count.
+  cat "${t16_logs[@]}" > "$WORK/recov-all.log" 2>/dev/null || : > "$WORK/recov-all.log"
   if [ "$t16_bd" != 0 ]; then
     # The backdate was NOT conclusively clean (see backdate_ghost; under
     # load the whole steal+release cycle often completes before the
@@ -929,7 +954,7 @@ for t16_try in $(seq 1 "$T16_TRIES"); do
     [ "$(grep -c "lock LOST" "$WORK/recov-all.log")" = 0 ] || t16_dirty=1
     { [ -e "$LOCK" ] || [ -e "$LOCK.next" ]; } && t16_dirty=1
     if [ "$t16_dirty" = 1 ]; then
-      echo "  T16 try $t16_try: non-conclusive backdate AND dirty outcome — attempt discarded, retrying"
+      echo "  T16 N=$T16_N try $t16_try: non-conclusive backdate AND dirty outcome — attempt discarded, retrying"
       rm -f "$LOCK" "$LOCK.next" 2>/dev/null
       continue
     fi
@@ -944,30 +969,31 @@ if [ "$t16_valid" = 1 ]; then
   nold="$(grep -c "STOLE stale lock" "$WORK/recov-all.log")"
   ndisp="$(grep -c "STEAL-DISPLACED" "$WORK/recov-all.log")"
   [ "$t16_fail" = 0 ] && [ "$t16_sync" = 1 ] \
-    && ok "2 bash + 2 pwsh waiters on one crashed lock: every waiter exited 0" \
-    || bad "mixed crash-recovery exits wrong (see above)"
-  [ "$n98" = 0 ] && ok "zero spurious 98s — the claim serialized recovery across implementations" \
-                 || bad "$n98 waiter(s) exited 98 — displacement happened under the claim protocol"
-  [ "$nran" = 4 ] && ok "all 4 waiter commands ran" || bad "only $nran/4 waiter commands ran"
-  [ "$nstole" = 1 ] && ok "exactly ONE STOLE-BY-CLAIM (the claim serialized the cross-impl recovery)" \
-                    || bad "STOLE-BY-CLAIM x$nstole (want exactly 1)"
+    && ok "$t16_nsh bash + $t16_nps pwsh waiters on one crashed lock: every waiter exited 0" \
+    || bad "mixed crash-recovery exits wrong$t16_ntag (see above)"
+  [ "$n98" = 0 ] && ok "zero spurious 98s$t16_ntag — the claim serialized recovery across implementations" \
+                 || bad "$n98 waiter(s) exited 98$t16_ntag — displacement happened under the claim protocol"
+  [ "$nran" = "$T16_N" ] && ok "all $T16_N waiter commands ran" || bad "only $nran/$T16_N waiter commands ran"
+  [ "$nstole" = 1 ] && ok "exactly ONE STOLE-BY-CLAIM$t16_ntag (the claim serialized the cross-impl recovery)" \
+                    || bad "STOLE-BY-CLAIM x$nstole$t16_ntag (want exactly 1)"
   grep -q "STOLE-BY-CLAIM.*ghost=pid=999 host=ghost" "$WORK/recov-all.log" \
-    && ok "the steal line attributes the crashed ghost cross-impl (wire-format line 2 parsed)" \
-    || bad "STOLE-BY-CLAIM does not carry the ghost's line-2 attribution"
+    && ok "the steal line attributes the crashed ghost cross-impl (wire-format line 2 parsed)$t16_ntag" \
+    || bad "STOLE-BY-CLAIM does not carry the ghost's line-2 attribution$t16_ntag"
   grep -q "CLAIM .*tok=tok\." "$WORK/recov-all.log" \
-    && ok "claim create logged with its per-attempt token (CLAIM ... tok=)" \
-    || bad "no CLAIM line with a token in the recovery logs"
-  [ "$nold" = 0 ] && ok "unserialized-steal line shape ('STOLE stale lock') never logged" \
-    || bad "'STOLE stale lock' shape appeared x$nold — an unserialized steal lane is present"
-  [ "$ndisp" = 0 ] && ok "zero STEAL-DISPLACED lines (prevention, not detect-and-repair)" \
-    || bad "STEAL-DISPLACED fired x$ndisp — displacement-repair machinery present?"
-  [ -e "$T16_GRAVESEEN" ] && bad "a move-aside file (.dead.*) existed during recovery — the steal is staged through an intermediate file!" \
-    || ok "no move-aside file (.dead.*) ever existed during recovery (sampler)"
-  [ -e "$LOCK" ] && bad "leftover crash-recovery lock" || ok "no leftover lock"
-  [ -e "$LOCK.next" ] && bad "leftover claim after recovery" || ok "no leftover claim"
+    && ok "claim create logged with its per-attempt token (CLAIM ... tok=)$t16_ntag" \
+    || bad "no CLAIM line with a token in the recovery logs$t16_ntag"
+  [ "$nold" = 0 ] && ok "unserialized-steal line shape ('STOLE stale lock') never logged$t16_ntag" \
+    || bad "'STOLE stale lock' shape appeared x$nold$t16_ntag — an unserialized steal lane is present"
+  [ "$ndisp" = 0 ] && ok "zero STEAL-DISPLACED lines (prevention, not detect-and-repair)$t16_ntag" \
+    || bad "STEAL-DISPLACED fired x$ndisp$t16_ntag — displacement-repair machinery present?"
+  [ -e "$T16_GRAVESEEN" ] && bad "a move-aside file (.dead.*) existed during recovery$t16_ntag — the steal is staged through an intermediate file!" \
+    || ok "no move-aside file (.dead.*) ever existed during recovery (sampler)$t16_ntag"
+  [ -e "$LOCK" ] && bad "leftover crash-recovery lock$t16_ntag" || ok "no leftover lock$t16_ntag"
+  [ -e "$LOCK.next" ] && bad "leftover claim after recovery$t16_ntag" || ok "no leftover claim$t16_ntag"
 else
-  bad "T16: no clean run under a conclusive backdate in $T16_TRIES attempts (see above)"
+  bad "T16: no clean run under a conclusive backdate in $T16_TRIES attempts$t16_ntag (see above)"
 fi
+done
 fi
 
 if section "Test 16b: bash claimant vs ps1 claimant racing ONE ghost — one claim winner, cross-impl wire parity"; then
diff --git a/tests/git-commit-lock.test.sh b/tests/git-commit-lock.test.sh
index 8b1aa08..3bffabd 100755
--- a/tests/git-commit-lock.test.sh
+++ b/tests/git-commit-lock.test.sh
@@ -184,16 +184,49 @@ if section "Test 2b: crash recovery under CONTENTION — claim-serialized: zero
 # WINNER'S live lock), the attempt is kept only if its outcome is clean and
 # otherwise discarded and retried (bounded), instead of failing assertions
 # the protocol never violated.
-T2B_N=4
+#
+# Waiter count is swept over $T_AXIS_A (Bucket 6): one iteration at N=4 by
+# default (byte-identical to today) and at N=4,12,24 under GCL_TEST_SWEEP=1.
+# Every sweep iteration's assertions carry an " at N=<count>" tag so a sweep
+# failure says which N broke; that tag is SUPPRESSED in the default (non-sweep)
+# run (t2b_ntag empty) so the messages are byte-identical to today — the first
+# assertion already names the count via "$T2B_N waiters". The correctness
+# invariants asserted here (zero 98, exactly one steal, no move-aside, clean
+# final state) stay ok/bad strict (not envelope) at all N — but that requires
+# STALE >> the winner's EFFECTIVE hold, which grows with N under load (the
+# winner is one of N concurrent processes; oversubscription stretches the wall
+# time between its create and release), so STALE is floored to N when sweeping
+# (t2b_stale) — at the default floor it is the same 8 as today. The per-waiter
+# wall-clock budget scales too: MAX_WAIT = 30*N (=> 120 at N=4, today's value)
+# so a wide sweep, where the losing waiters acquire in sequence after the winner
+# releases, has time to drain instead of timing out and looking like a product
+# failure.
 T2B_TRIES=3   # per-round attempts; see the backdate_ghost note
+for T2B_N in $T_AXIS_A; do
+# MAX_WAIT and STALE: today's exact values (120 / 8) in the default (non-sweep)
+# run so the env passed to the library is byte-identical; only the sweep's wider
+# N raise them. MAX_WAIT scales 30*N (=> 120 at N=4 anyway). STALE floors to N so
+# a wide fan-out's load-stretched winner hold (the winner is one of N concurrent
+# processes) can never make its own live lock look stale and trigger a
+# legitimate-but-unwanted second steal.
+if [ "$GCL_TEST_SWEEP" = 1 ]; then
+  t2b_maxwait=$(( 30 * T2B_N ))
+  [ "$T2B_N" -gt 8 ] && t2b_stale="$T2B_N" || t2b_stale=8
+  t2b_ntag=" at N=$T2B_N"
+else
+  t2b_maxwait=120; t2b_stale=8; t2b_ntag=""
+fi
 t2b_fail=0; t2b_stole=0; t2b_old_shape=0; t2b_disp=0; t2b_98=0; t2b_retried=0
 for r in $(seq 1 "$T2B_ROUNDS"); do
   t2b_valid=0
   for try in $(seq 1 "$T2B_TRIES"); do
-    GHOST="tok.ghost.t2b.$r.$try"
+    # Ghost token carries an N segment only when sweeping (distinct per N); the
+    # default keeps today's exact "tok.ghost.t2b.$r.$try" so the lock CONTENT
+    # the library sees is byte-identical.
+    if [ "$GCL_TEST_SWEEP" = 1 ]; then GHOST="tok.ghost.t2b.$T2B_N.$r.$try"; else GHOST="tok.ghost.t2b.$r.$try"; fi
     LOCK="$WORK/recov.$r.lock"; RAN="$WORK/recov.$r.ran"; : > "$RAN"
     GRAVESEEN="$WORK/recov.$r.graveseen"; SAMPSTOP="$WORK/recov.$r.sampstop"
-    rm -f "$GRAVESEEN" "$SAMPSTOP" "$LOCK" "$LOCK.next"
+    rm -f "$GRAVESEEN" "$SAMPSTOP" "$LOCK" "$LOCK.next" "$WORK/recov.$r".*.log
     fabricate_lock "$LOCK" "$GHOST" "pid=999 host=ghost" # fresh mtime: not yet stale
     # Move-aside sampler: ANY .dead.* sighting at ANY moment during the round
     # means the implementation stages the steal through an intermediate file
@@ -207,21 +240,21 @@ for r in $(seq 1 "$T2B_ROUNDS"); do
       done
     ) &
     sampler=$!
-    pids=()
+    pids=(); waiter_logs=()
     for i in $(seq 1 "$T2B_N"); do
       : > "$WORK/recov.$r.$i.log"   # per-waiter logs: concurrent appends to one log drop lines
-      AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$WORK/recov.$r.$i.log" AGENT_LOCK_STALE_SECS=8 \
-        AGENT_LOCK_CLAIM_STALE_SECS=60 AGENT_LOCK_POLL_SECS=0.05 AGENT_LOCK_MAX_WAIT=120 \
+      waiter_logs+=("$WORK/recov.$r.$i.log")
+      AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$WORK/recov.$r.$i.log" AGENT_LOCK_STALE_SECS="$t2b_stale" \
+        AGENT_LOCK_CLAIM_STALE_SECS=60 AGENT_LOCK_POLL_SECS=0.05 AGENT_LOCK_MAX_WAIT="$t2b_maxwait" \
         bash "$LIB" run -- bash -c 'echo ran >> "$1"; sleep 0.1' _ "$RAN" 2>/dev/null &
       pids+=($!)
     done
     t2b_sync=1
-    if ! sync_waiting_fresh "$LOCK" 60 "$WORK/recov.$r.1.log" "$WORK/recov.$r.2.log" \
-                            "$WORK/recov.$r.3.log" "$WORK/recov.$r.4.log"; then
+    if ! sync_waiting_fresh "$LOCK" 60 "${waiter_logs[@]}"; then
       t2b_sync=0
       for i in $(seq 1 "$T2B_N"); do
         grep -q "WAITING for lock" "$WORK/recov.$r.$i.log" 2>/dev/null \
-          || echo "  round $r: waiter $i never logged WAITING"
+          || echo "  N=$T2B_N round $r: waiter $i never logged WAITING"
       done
     fi
     backdate_ghost "$LOCK" "$GHOST" 9999; bd=$?   # all waiters now judge the ghost stale together
@@ -230,8 +263,8 @@ for r in $(seq 1 "$T2B_ROUNDS"); do
       wait "${pids[$((i-1))]}"; rc=$?
       case "$rc" in
         0)  ;;
-        98) round_98=$((round_98+1)); echo "  round $r: waiter $i rc=98 — displacement under the claim protocol" ;;
-        *)  round_badrc=$((round_badrc+1)); echo "  round $r: waiter $i rc=$rc (want 0)" ;;
+        98) round_98=$((round_98+1)); echo "  N=$T2B_N round $r: waiter $i rc=98 — displacement under the claim protocol" ;;
+        *)  round_badrc=$((round_badrc+1)); echo "  N=$T2B_N round $r: waiter $i rc=$rc (want 0)" ;;
       esac
     done
     touch "$SAMPSTOP"; wait "$sampler" 2>/dev/null
@@ -254,7 +287,7 @@ for r in $(seq 1 "$T2B_ROUNDS"); do
       { [ -e "$LOCK" ] || [ -e "$LOCK.next" ]; } && round_dirty=1
       if [ "$round_dirty" = 1 ]; then
         t2b_retried=$((t2b_retried+1))
-        echo "  round $r try $try: non-conclusive backdate AND dirty outcome — attempt discarded, retrying"
+        echo "  N=$T2B_N round $r try $try: non-conclusive backdate AND dirty outcome — attempt discarded, retrying"
         rm -f "$LOCK" "$LOCK.next" "$RAN" "$GRAVESEEN" "$SAMPSTOP"
         continue
       fi
@@ -266,38 +299,39 @@ for r in $(seq 1 "$T2B_ROUNDS"); do
     nran="$(grep -c ran "$RAN")"
     [ "$nran" = "$T2B_N" ] || {
       t2b_fail=1
-      echo "  round $r: only $nran/$T2B_N commands ran"
+      echo "  N=$T2B_N round $r: only $nran/$T2B_N commands ran"
     }
     [ -e "$LOCK" ] && {
       t2b_fail=1
-      echo "  round $r: leftover lock"
+      echo "  N=$T2B_N round $r: leftover lock"
     }
     [ -e "$LOCK.next" ] && {
       t2b_fail=1
-      echo "  round $r: leftover claim"
+      echo "  N=$T2B_N round $r: leftover claim"
     }
     [ -e "$GRAVESEEN" ] && {
       t2b_fail=1
-      echo "  round $r: a move-aside file (.dead.*) existed during recovery — the steal is staged through an intermediate file!"
+      echo "  N=$T2B_N round $r: a move-aside file (.dead.*) existed during recovery — the steal is staged through an intermediate file!"
     }
     t2b_stole=$((t2b_stole + $(grep -c "STOLE-BY-CLAIM" "$WORK/recov.$r.all.log")))
     t2b_old_shape=$((t2b_old_shape + $(grep -c "STOLE stale lock" "$WORK/recov.$r.all.log")))
     t2b_disp=$((t2b_disp + $(grep -c "STEAL-DISPLACED" "$WORK/recov.$r.all.log")))
     break
   done
-  [ "$t2b_valid" = 1 ] || { t2b_fail=1; echo "  round $r: no clean round under a conclusive backdate in $T2B_TRIES attempts"; }
+  [ "$t2b_valid" = 1 ] || { t2b_fail=1; echo "  N=$T2B_N round $r: no clean round under a conclusive backdate in $T2B_TRIES attempts"; }
 done
-[ "$t2b_retried" = 0 ] || echo "  note: $t2b_retried discarded attempt(s) — harness backdate race, not a protocol verdict"
+[ "$t2b_retried" = 0 ] || echo "  note: $t2b_retried discarded attempt(s) at N=$T2B_N — harness backdate race, not a protocol verdict"
 [ "$t2b_fail" = 0 ] && ok "$T2B_ROUNDS rounds x $T2B_N waiters on one crashed lock: all ran, clean final state, no move-aside file ever existed" \
-  || bad "crash-recovery contention failure (see above)"
-[ "$t2b_98" = 0 ] && ok "zero spurious 98s — the claim serialized recovery (unserialized: near-certain displacement)" \
-  || bad "$t2b_98 waiter(s) exited 98 — displacement happened under the claim protocol"
-[ "$t2b_stole" = "$T2B_ROUNDS" ] && ok "exactly one STOLE-BY-CLAIM per recovery (x$t2b_stole/$T2B_ROUNDS rounds)" \
-  || bad "STOLE-BY-CLAIM count $t2b_stole != $T2B_ROUNDS rounds (want exactly one steal per recovery)"
-[ "$t2b_old_shape" = 0 ] && ok "unserialized-steal line shape ('STOLE stale lock') never logged" \
-  || bad "'STOLE stale lock' line appeared x$t2b_old_shape — an unserialized steal lane is present"
-[ "$t2b_disp" = 0 ] && ok "zero STEAL-DISPLACED lines (prevention, not detect-and-repair)" \
-  || bad "STEAL-DISPLACED fired x$t2b_disp — displacement-repair machinery present?"
+  || bad "crash-recovery contention failure$t2b_ntag (see above)"
+[ "$t2b_98" = 0 ] && ok "zero spurious 98s$t2b_ntag — the claim serialized recovery (unserialized: near-certain displacement)" \
+  || bad "$t2b_98 waiter(s) exited 98$t2b_ntag — displacement happened under the claim protocol"
+[ "$t2b_stole" = "$T2B_ROUNDS" ] && ok "exactly one STOLE-BY-CLAIM per recovery$t2b_ntag (x$t2b_stole/$T2B_ROUNDS rounds)" \
+  || bad "STOLE-BY-CLAIM count $t2b_stole != $T2B_ROUNDS rounds$t2b_ntag (want exactly one steal per recovery)"
+[ "$t2b_old_shape" = 0 ] && ok "unserialized-steal line shape ('STOLE stale lock') never logged$t2b_ntag" \
+  || bad "'STOLE stale lock' line appeared x$t2b_old_shape$t2b_ntag — an unserialized steal lane is present"
+[ "$t2b_disp" = 0 ] && ok "zero STEAL-DISPLACED lines$t2b_ntag (prevention, not detect-and-repair)" \
+  || bad "STEAL-DISPLACED fired x$t2b_disp$t2b_ntag — displacement-repair machinery present?"
+done
 fi
 
 if section "Test 3: REGRESSION — EMPTY lock file (crash between create and write) is still stolen"; then
@@ -1073,36 +1107,77 @@ if section "Test 20: claim contention — N concurrent stealers, ONE claim winne
 # N stealers race one ancient ghost: exactly one wins the O_EXCL claim and
 # steals (one STOLE-BY-CLAIM); the rest lose the claim create and acquire
 # normally in sequence after the winner releases. No displacement (zero
-# LOST/98), no leftovers. STALE=5 keeps a loaded box from re-stealing the
-# winner's brief hold.
+# LOST/98), no leftovers. STALE keeps a loaded box from re-stealing the
+# winner's brief hold — that bound only holds while STALE >> the winner's
+# effective hold, which (counter-intuitively) grows with N: the WINNER is one
+# of N concurrently-spawned bash processes, so under oversubscription the wall
+# time between its create and its release stretches with the contention. So
+# STALE must scale with N too (see t20_stale below), keeping "exactly one
+# steal" a strict, config-independent correctness invariant at every N.
+#
+# Waiter count is swept (Bucket 6). Unlike Test 2b/16, this test's floor is NOT
+# 4 — it is the MODE-driven $T20_N (5 REDUCED / 10 FULL), the count CI already
+# stresses. So instead of iterating the shared T_AXIS_A ("4 ...") it builds its
+# own list: just $T20_N by default (byte-identical), and $T20_N plus the sweep's
+# higher counts (12, 24) under GCL_TEST_SWEEP=1 — preserving today's per-PR AND
+# full-mode coverage while still widening the sweep. MAX_WAIT scales 30*N (the
+# workers run `true`, so this is ample headroom, never the floor's behaviour).
 LOCK="$WORK/contend.lock"
-fabricate_lock "$LOCK" "tok.ghost.t20" "pid=888 host=ghost"
+T20_FLOOR="$T20_N"
+if [ "$GCL_TEST_SWEEP" = 1 ]; then
+  T20_AXIS="$T20_FLOOR"
+  for _n in 12 24; do [ "$_n" = "$T20_FLOOR" ] || T20_AXIS="$T20_AXIS $_n"; done
+else
+  T20_AXIS="$T20_FLOOR"
+fi
+for T20_N in $T20_AXIS; do
+# N-tag for assertion messages: empty in the default run (byte-identical), set
+# only when sweeping so each N's pass/fail line is attributable.
+if [ "$GCL_TEST_SWEEP" = 1 ]; then t20_ntag=" at N=$T20_N"; else t20_ntag=""; fi
+# MAX_WAIT and STALE: keep today's exact values (120 / 5) in the default
+# (non-sweep) run so the env passed to the library is byte-identical; only the
+# sweep's wider N raise them. MAX_WAIT scales 30*N (workers run `true`, ample
+# headroom). STALE is max(5, N) so a wide fan-out's load-stretched winner hold
+# can NEVER make a live lock look stale -> the "exactly one steal" invariant
+# stays true at N=24 just as at the floor. The fixture ghost token likewise
+# carries an N segment only when sweeping (distinct tokens per N), so the
+# default lock CONTENT the library sees is unchanged too.
+if [ "$GCL_TEST_SWEEP" = 1 ]; then
+  t20_maxwait=$(( 30 * T20_N ))
+  [ "$T20_N" -gt 5 ] && t20_stale="$T20_N" || t20_stale=5
+  t20_ghost="tok.ghost.t20.$T20_N"
+else
+  t20_maxwait=120; t20_stale=5; t20_ghost="tok.ghost.t20"
+fi
+rm -f "$WORK/contend".*.log "$LOCK" "$LOCK.next"
+fabricate_lock "$LOCK" "$t20_ghost" "pid=888 host=ghost"
 backdate "$LOCK" 9999
 pids=(); t20_fail=0
 for i in $(seq 1 "$T20_N"); do
   : > "$WORK/contend.$i.log"
-  AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$WORK/contend.$i.log" AGENT_LOCK_STALE_SECS=5 \
-    AGENT_LOCK_CLAIM_STALE_SECS=60 AGENT_LOCK_POLL_SECS=0.05 AGENT_LOCK_MAX_WAIT=120 \
+  AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$WORK/contend.$i.log" AGENT_LOCK_STALE_SECS="$t20_stale" \
+    AGENT_LOCK_CLAIM_STALE_SECS=60 AGENT_LOCK_POLL_SECS=0.05 AGENT_LOCK_MAX_WAIT="$t20_maxwait" \
     bash "$LIB" run -- bash -c 'true' 2>/dev/null &
   pids+=($!)
 done
 for i in $(seq 1 "$T20_N"); do
   wait "${pids[$((i-1))]}"; rc=$?
-  [ "$rc" = 0 ] || { t20_fail=1; echo "  worker $i rc=$rc (want 0)"; }
+  [ "$rc" = 0 ] || { t20_fail=1; echo "  N=$T20_N worker $i rc=$rc (want 0)"; }
 done
 cat "$WORK/contend."*.log > "$WORK/contend.all.log"
 nst="$(grep -c "STOLE-BY-CLAIM" "$WORK/contend.all.log")"
 nacq="$(grep -c "ACQUIRED" "$WORK/contend.all.log")"
 nrel="$(grep -c "RELEASED" "$WORK/contend.all.log")"
 nlost="$(grep -c "lock LOST" "$WORK/contend.all.log")"
-[ "$t20_fail" = 0 ] && ok "$T20_N concurrent stealers all completed with rc 0" || bad "claim-contention worker failures (see above)"
-[ "$nst" = 1 ] && ok "exactly ONE claim winner stole the ghost (STOLE-BY-CLAIM x$nst)" \
-               || bad "STOLE-BY-CLAIM x$nst (want exactly 1 — the claim must serialize stealers)"
+[ "$t20_fail" = 0 ] && ok "$T20_N concurrent stealers all completed with rc 0" || bad "claim-contention worker failures$t20_ntag (see above)"
+[ "$nst" = 1 ] && ok "exactly ONE claim winner stole the ghost$t20_ntag (STOLE-BY-CLAIM x$nst)" \
+               || bad "STOLE-BY-CLAIM x$nst$t20_ntag (want exactly 1 — the claim must serialize stealers)"
 [ "$nacq" = "$T20_N" ] && [ "$nrel" = "$T20_N" ] && ok "balanced ACQUIRED/RELEASED ($nacq/$nrel of $T20_N)" \
-                                                  || bad "ACQUIRED=$nacq RELEASED=$nrel (want $T20_N each)"
-[ "$nlost" = 0 ] && ok "zero LOST warnings under claim contention" || bad "$nlost LOST warnings under claim contention"
-[ -e "$LOCK" ] && bad "leftover lock after contention" || ok "no leftover lock"
-[ -e "$LOCK.next" ] && bad "leftover claim after contention" || ok "no leftover claim"
+                                                  || bad "ACQUIRED=$nacq RELEASED=$nrel$t20_ntag (want $T20_N each)"
+[ "$nlost" = 0 ] && ok "zero LOST warnings under claim contention$t20_ntag" || bad "$nlost LOST warnings under claim contention$t20_ntag"
+[ -e "$LOCK" ] && bad "leftover lock after contention$t20_ntag" || ok "no leftover lock$t20_ntag"
+[ -e "$LOCK.next" ] && bad "leftover claim after contention$t20_ntag" || ok "no leftover claim$t20_ntag"
+done
 fi
 
 if section "Test 21: crashed-claimant and empty-claim orphans age out; steals resume"; then

From 792ab90e29b2990e89d67dff26c2256c3316ed6a Mon Sep 17 00:00:00 2001
From: Ben Toner <bentoner@bentoner.com>
Date: Thu, 18 Jun 2026 05:58:02 +1000
Subject: [PATCH 44/76] Bucket 6c: nightly.yml (load matrix + kcov + idempotent
 issue triage)

New scheduled/dispatch workflow for the load tier (non-blocking, triaged):
- 6 cells (matrix.include, fail-fast: false): ubuntu cpu/disk/both, macos disk,
  windows interop-integration/disk, windows unit/both. R=2 oversubscription via
  with-load.sh; GCL_ENVELOPE_TIER=relax + GCL_TEST_SWEEP=1 + GCL_TEST_FULL=1.
  Each cell uploads logs + load-manifest on success too. concurrency: nightly.
- kcov coverage job (Linux): build kcov v43 from source, run the unit suite FULL
  strict no-load, gate on a 0.80 line-coverage floor (tracks the achieved ~0.83;
  ratchet up as Tier-A coverage lands), upload HTML + cobertura (30d).
- Issue auto-triage (.github/scripts/nightly-triage.sh, issues: write,
  if: always()): per-cell ground-truth (cell-conclusion.txt, not the misleading
  matrix-aggregate result); classes correctness / envelope / infra; idempotent
  one-issue-per-(date,class); empty-round guard (missing artifact != green).
  Added the triage script to the shellcheck lint list.

actionlint clean; nightly-triage.sh shellcheck -S style + bash -n clean; kcov
floor parse verified against the committed 451/543=0.83 fixture. Schedule
auto-disables after ~60d inactivity; workflow_dispatch revives it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
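The per-cell precedence that nightly-triage.sh applies (missing logs, then
correctness, then infra, then envelope) can be sketched as one function. This
is a minimal illustration only, assuming the four facts have already been
extracted from the artifact; `classify_cell` and its argument order are
hypothetical names for this note and do not appear in the script.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the patch): classify ONE cell from four facts.
#   $1 = recorded conclusion (success | failure | cancelled | unknown)
#   $2 = number of suite logs found in the cell's artifact
#   $3 = 1 if any log contains a ^FAIL: line, else 0
#   $4 = 1 if any log contains WARN[env-relaxed], else 0
classify_cell() {
  local concl="$1" nlogs="$2" has_fail="$3" has_envwarn="$4"
  if [ "$nlogs" -eq 0 ]; then
    echo infra            # no evidence at all is never read as green
  elif [ "$has_fail" -eq 1 ] || [ "$concl" = failure ]; then
    echo correctness      # an assertion FAIL, or the job itself failed
  elif [ "$concl" != success ]; then
    echo infra            # logs exist, but the job was cancelled or unknown
  elif [ "$has_envwarn" -eq 1 ]; then
    echo envelope         # succeeded, only relaxed wall-clock warnings
  else
    echo ok
  fi
}

classify_cell success   3 0 0
classify_cell success   3 0 1
classify_cell failure   3 0 0
classify_cell cancelled 3 0 0
classify_cell success   0 0 0
```

Run as written, the five calls print ok, envelope, correctness, infra, infra.
A job that concluded `failure` is classed correctness even when no log carries
a `^FAIL:` line, which is why that check sits ahead of the infra branch.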
 .github/scripts/nightly-triage.sh | 220 +++++++++++++++++++++++
 .github/workflows/nightly.yml     | 284 ++++++++++++++++++++++++++++++
 .github/workflows/tests.yml       |   1 +
 3 files changed, 505 insertions(+)
 create mode 100644 .github/scripts/nightly-triage.sh
 create mode 100644 .github/workflows/nightly.yml

diff --git a/.github/scripts/nightly-triage.sh b/.github/scripts/nightly-triage.sh
new file mode 100644
index 0000000..485764d
--- /dev/null
+++ b/.github/scripts/nightly-triage.sh
@@ -0,0 +1,220 @@
+#!/usr/bin/env bash
+# nightly-triage.sh — classify a nightly stress run's results and file/append a
+# single labelled GitHub issue per (date, class), idempotently.
+#
+# Invoked by the `triage` job in .github/workflows/nightly.yml AFTER it has
+# downloaded every matrix cell's `test-output/` artifact (each into a directory
+# named `nightly-logs-<cell-id>/`) and written the per-cell job conclusions to a
+# JSON file. It reads only files on disk + `gh`; it makes no test decisions of its
+# own beyond parsing the preserved logs.
+#
+# CLASSIFICATION (per the Bucket 6 spec):
+#   correctness  — any `^FAIL:` line in a suite log, OR a cell job concluded
+#                  `failure`. Files/append a `nightly-correctness` issue. The one
+#                  class that demands investigation.
+#   envelope     — no FAIL anywhere, but at least one `WARN[env-relaxed]` line in a
+#                  log of a cell that *succeeded*. Tracked (`nightly-envelope`); the
+#                  three wall-clock envelope assertions stretched under load — by
+#                  design under GCL_ENVELOPE_TIER=relax — so NO investigation action.
+#   infra        — a cell's artifact is missing, the cell job neither succeeded nor
+#                  cleanly failed-on-an-assertion (timeout / cancelled / checkout
+#                  failure / errored before any suite ran), OR — the EMPTY-ROUND
+#                  GUARD — *no* cell produced any log at all. Filed `nightly-infra`.
+#                  Crucially, "0 FAIL across 0 logs" is NEVER read as green: with no
+#                  evidence we classify infra, not success.
+#
+# Idempotency: one open issue per (run-date, class). We search open issues by a
+# stable title prefix + label; if one exists we append a comment, else we create.
+# Re-running triage for the same date therefore appends rather than spamming.
+#
+# All-green (every cell success, no FAIL, no env warn, every artifact present) ⇒
+# NO issue of any kind is filed.
+#
+# Inputs (environment):
+#   ARTIFACTS_DIR   dir holding the downloaded per-cell artifact directories
+#                   (default: ./artifacts). Each cell dir is `nightly-logs-<id>/`.
+#   (per-cell file) each cell's conclusion (success|failure|cancelled) is read from
+#                   `<cell-dir>/cell-conclusion.txt`, which each stress cell writes
+#                   (always()) into its own artifact, so the conclusion is ground
+#                   truth PER CELL, never a matrix aggregate. No CONCLUSIONS variable
+#                   or JSON file is consulted; a missing file reads as `unknown`.
+#   EXPECTED_CELLS  space-separated list of cell ids that were supposed to run
+#                   (default: the six N1..N6 ids). Lets the empty-round / missing-
+#                   artifact guard know what to expect.
+#   RUN_DATE        UTC date stamp for the issue title (default: today, UTC).
+#   GITHUB_REPOSITORY / GH_TOKEN(GITHUB_TOKEN)  the usual `gh` env.
+#   DRY_RUN=1       print the `gh` actions instead of running them (for local tests).
+set -uo pipefail
+
+ARTIFACTS_DIR="${ARTIFACTS_DIR:-./artifacts}"
+EXPECTED_CELLS="${EXPECTED_CELLS:-N1 N2 N3 N4 N5 N6}"
+RUN_DATE="${RUN_DATE:-$(date -u +%Y-%m-%d)}"
+DRY_RUN="${DRY_RUN:-0}"
+
+log() { printf '%s\n' "$*" >&2; }
+
+# A cell's log directory and its suite logs (may be absent ⇒ infra).
+cell_logdir() { printf '%s/nightly-logs-%s' "$ARTIFACTS_DIR" "$1"; }
+
+# ── Read a cell's OWN recorded conclusion from its artifact (ground truth: each
+#    stress cell writes job.status to cell-conclusion.txt under always()). Absent
+#    file ⇒ `unknown` (handled like a missing artifact). ──────────────────────────
+cell_conclusion() {
+  local cell="$1" f val=""
+  f="$(cell_logdir "$cell")/cell-conclusion.txt"
+  if [ -f "$f" ]; then
+    val="$(tr -d '[:space:]' < "$f" 2>/dev/null)"
+  fi
+  printf '%s' "${val:-unknown}"
+}
+
+# ── Classify each expected cell. Accumulate evidence lines per class. ───────────
+correctness_evidence=""
+envelope_evidence=""
+infra_evidence=""
+
+any_log_seen=0          # for the empty-round guard
+
+for cell in $EXPECTED_CELLS; do
+  dir="$(cell_logdir "$cell")"
+  concl="$(cell_conclusion "$cell")"
+
+  # Gather this cell's suite logs (unit/interop/integration *.log under the artifact).
+  logs=()
+  if [ -d "$dir" ]; then
+    while IFS= read -r f; do logs+=("$f"); done \
+      < <(find "$dir" -type f -name '*.log' 2>/dev/null)
+  fi
+
+  if [ "${#logs[@]}" -eq 0 ]; then
+    # No artifact / no logs for an expected cell. Distinguish: a clean job that
+    # somehow uploaded nothing is still suspect ⇒ infra (we cannot prove it green).
+    infra_evidence+="- ${cell}: no logs found (artifact missing or empty; job conclusion='${concl}')"$'\n'
+    log "[$cell] INFRA: no logs (conclusion=$concl)"
+    continue
+  fi
+  any_log_seen=1
+
+  # Scan the logs.
+  cell_fail=0
+  cell_envwarn=0
+  fail_lines=""
+  for f in "${logs[@]}"; do
+    if grep -qE '^FAIL:' "$f" 2>/dev/null; then
+      cell_fail=1
+      # Keep up to 5 FAIL lines per log as evidence.
+      fail_lines+="$(grep -nE '^FAIL:' "$f" 2>/dev/null | head -5 | sed "s#^#    ${f##*/}: #")"$'\n'
+    fi
+    if grep -qE 'WARN\[env-relaxed\]' "$f" 2>/dev/null; then
+      cell_envwarn=1
+    fi
+  done
+
+  if [ "$cell_fail" -eq 1 ] || [ "$concl" = "failure" ]; then
+    correctness_evidence+="- ${cell}: job='${concl}'"
+    [ "$cell_fail" -eq 1 ] && correctness_evidence+=", FAIL lines present:"$'\n'"${fail_lines}" || correctness_evidence+=" (job failed; no ^FAIL: in logs — see job log)"$'\n'
+    log "[$cell] CORRECTNESS (cell_fail=$cell_fail conclusion=$concl)"
+  elif [ "$concl" != "success" ]; then
+    # Logs exist but the job did not cleanly succeed and there is no assertion FAIL:
+    # timeout / cancelled / errored late ⇒ infra, not green.
+    infra_evidence+="- ${cell}: logs present but job conclusion='${concl}' (timeout/cancel/late error)"$'\n'
+    log "[$cell] INFRA (conclusion=$concl, no FAIL)"
+  elif [ "$cell_envwarn" -eq 1 ]; then
+    envelope_evidence+="- ${cell}: succeeded with WARN[env-relaxed] (envelope assertion(s) stretched under load — expected)"$'\n'
+    log "[$cell] ENVELOPE (success + env-relaxed warn)"
+  else
+    log "[$cell] OK (success, no FAIL, no env warn)"
+  fi
+done
+
+# ── EMPTY-ROUND GUARD: if not a single expected cell produced any log, the run
+#    errored before any suite ran (checkout failure, total infra collapse). That is
+#    INFRA, never green — do not let "0 FAIL across 0 logs" pass as success. ──────
+if [ "$any_log_seen" -eq 0 ]; then
+  empty_msg="EMPTY ROUND: none of the expected cells (${EXPECTED_CELLS}) produced any suite log. The workflow errored before any suite ran (checkout failure / total infra collapse) — this is NOT a passing nightly."
+  infra_evidence="${empty_msg}"$'\n'"${infra_evidence}"
+  log "EMPTY-ROUND GUARD fired: no logs from any cell."
+fi
+
+# ── File/append issues, idempotently, one per (date, class). ────────────────────
+# Title prefix is stable per class+date so search-then-append is reliable.
+file_issue() {  # $1=class-label  $2=title  $3=body
+  local label="$1" title="$2" body="$3" existing=""
+
+  if [ "$DRY_RUN" = 1 ]; then
+    log "DRY_RUN: would search open issues label=$label title~='$title'"
+    log "DRY_RUN: title='$title'"
+    log "DRY_RUN: body:"; printf '%s\n' "$body" >&2
+    return 0
+  fi
+
+  # Search OPEN issues with this label whose title exactly matches (idempotency key).
+  # `gh issue list --search` uses GitHub search; we additionally filter the JSON by
+  # exact title to avoid a substring collision.
+  existing="$(gh issue list --state open --label "$label" \
+                --search "$title in:title" --json number,title \
+                --jq ".[] | select(.title == \"$title\") | .number" 2>/dev/null | head -1)"
+
+  if [ -n "$existing" ]; then
+    log "Appending to existing issue #$existing ($label)"
+    if gh issue comment "$existing" --body "$body" >/dev/null; then
+      log "Appended comment to #$existing"
+    else
+      log "WARN: failed to append to #$existing"
+    fi
+  else
+    log "Creating new issue ($label): $title"
+    if gh issue create --title "$title" --label "$label" --body "$body" >/dev/null; then
+      log "Created issue ($label)"
+    else
+      log "WARN: failed to create issue ($label)"
+    fi
+  fi
+}
+
+run_url="${GITHUB_SERVER_URL:-https://github.com}/${GITHUB_REPOSITORY:-}/actions/runs/${GITHUB_RUN_ID:-}"
+filed=0
+
+if [ -n "$correctness_evidence" ]; then
+  body="Nightly stress run on **${RUN_DATE}** has CORRECTNESS failures (a \`FAIL:\` assertion and/or a cell job concluded \`failure\`). **Investigate.**
+
+$correctness_evidence
+Run: ${run_url}
+
+(Auto-filed by nightly-triage.sh; idempotent per (date, class) — re-runs append.)"
+  file_issue "nightly-correctness" "Nightly correctness failure — ${RUN_DATE}" "$body"
+  filed=1
+fi
+
+if [ -n "$infra_evidence" ]; then
+  body="Nightly stress run on **${RUN_DATE}** had INFRA issues (missing artifact / timeout / cancel / errored before suites ran). Not a product failure, but the run did not produce trustworthy results — re-dispatch or investigate the runner.
+
+$infra_evidence
+Run: ${run_url}
+
+(Auto-filed by nightly-triage.sh; idempotent per (date, class).)"
+  file_issue "nightly-infra" "Nightly infra issue — ${RUN_DATE}" "$body"
+  filed=1
+fi
+
+# Envelope is filed ONLY when there is no correctness failure (a correctness issue
+# subsumes it — under a red run the env warns are noise). Tracked, no action.
+if [ -z "$correctness_evidence" ] && [ -n "$envelope_evidence" ]; then
+  body="Nightly stress run on **${RUN_DATE}**: no correctness failures, but envelope (wall-clock) assertions were relaxed under load (\`WARN[env-relaxed]\`). This is EXPECTED under GCL_ENVELOPE_TIER=relax — tracked, **no investigation needed** unless it becomes persistent at low load.
+
+$envelope_evidence
+Run: ${run_url}
+
+(Auto-filed by nightly-triage.sh; idempotent per (date, class).)"
+  file_issue "nightly-envelope" "Nightly envelope warning — ${RUN_DATE}" "$body"
+  filed=1
+fi
+
+if [ "$filed" -eq 0 ]; then
+  log "ALL GREEN: every expected cell succeeded, no FAIL, no env warn, all artifacts present. No issue filed."
+fi
+
+# Triage itself succeeds whenever it ran to completion; it must not red the
+# workflow for finding failures (those are surfaced as issues). Without `-e`, the
+# only abort path is a genuine scripting error such as an unbound variable (`set -u`).
+exit 0
diff --git a/.github/workflows/nightly.yml b/.github/workflows/nightly.yml
new file mode 100644
index 0000000..6c72d6a
--- /dev/null
+++ b/.github/workflows/nightly.yml
@@ -0,0 +1,284 @@
+name: nightly
+
+# Scheduled stress run: the test suites under calibrated background load (the
+# `tests/with-load.sh` wrapper) at one oversubscription level R≈2, plus a kcov
+# line-coverage gate and auto-triage of the results into labelled issues.
+#
+# This is NON-BLOCKING: there is no branch protection on this single-dev project
+# (decision 2026-06-18), so nightly never gates a PR. Its job is to catch
+# load-sensitive flakes and coverage regressions that the per-PR `tests.yml`
+# (no-load, strict) cannot.
+#
+# NOTE for a future maintainer: GitHub auto-DISABLES a `schedule` trigger after
+# ~60 days of repo inactivity. If the nightly history is empty, that may mean the
+# schedule was disabled (not that every run passed) — re-enable / revive it with a
+# manual `workflow_dispatch` run from the Actions tab. Rely on `workflow_dispatch`
+# as the always-available manual trigger.
+
+on:
+  schedule:
+    - cron: '23 8 * * *'   # 08:23 UTC daily — off-peak (low GitHub-hosted-runner contention)
+  workflow_dispatch:
+
+# One nightly at a time; a newer run supersedes an in-flight one.
+concurrency:
+  group: nightly
+  cancel-in-progress: true
+
+permissions:
+  contents: read
+
+env:
+  # The suites run at full fan-out, with the envelope (wall-clock) assertions
+  # RELAXED so an oversubscribed runner cannot turn a latency stretch into a red
+  # (only correctness assertions can fail the suite under load), and with the
+  # Axis-A waiter-count sweep {4,12,24} enabled.
+  GCL_TEST_FULL: 1
+  GCL_ENVELOPE_TIER: relax
+  GCL_TEST_SWEEP: 1
+  # One oversubscription level R≈2 (stressors ≈ 2 * nproc per kind, total capped at
+  # GCL_STRESS_RATIO_MAX * nproc by with-load.sh).
+  GCL_STRESS_RATIO: 2
+
+jobs:
+  # ── The 6 stress cells. Each runs the relevant suite(s) wrapped in with-load.sh
+  #    under one GCL_STRESS_KIND. `leg` selects which suites run (mirrors tests.yml):
+  #    ubuntu/macos run the full set; windows splits unit vs interop-integration. ──
+  stress:
+    name: ${{ matrix.id }} ${{ matrix.os }} (${{ matrix.kind }}${{ matrix.leg != 'all' && format(', {0}', matrix.leg) || '' }})
+    runs-on: ${{ matrix.os }}
+    strategy:
+      fail-fast: false   # every cell's verdict is signal — and triage needs them all
+      matrix:
+        include:
+          - { id: N1, os: ubuntu-24.04, leg: all,                  kind: cpu,  job_timeout: 70 }
+          - { id: N2, os: ubuntu-24.04, leg: all,                  kind: disk, job_timeout: 70 }
+          - { id: N3, os: ubuntu-24.04, leg: all,                  kind: both, job_timeout: 70 }
+          - { id: N4, os: macos-15,     leg: all,                  kind: disk, job_timeout: 70 }
+          - { id: N5, os: windows-2025, leg: interop-integration,  kind: disk, job_timeout: 55 }
+          - { id: N6, os: windows-2025, leg: unit,                 kind: both, job_timeout: 60 }
+    timeout-minutes: ${{ matrix.job_timeout }}   # generous: load slows everything; backstop only
+    defaults:
+      run:
+        shell: bash                  # on windows-2025 this is Git Bash (MINGW) — what the interop suite requires
+    env:
+      GCL_STRESS_KIND: ${{ matrix.kind }}
+    steps:
+      - uses: actions/checkout@9f698171ed81b15d1823a05fc7211befd50c8ae0   # v6.0.3, SHA-pinned
+        with:
+          persist-credentials: false
+
+      - name: Toolchain versions (for reconstructing failures)
+        run: |
+          uname -a
+          bash --version | head -1
+          git --version
+          command -v stress-ng >/dev/null && stress-ng --version | head -1 || echo "stress-ng: NOT FOUND (with-load.sh uses the portable bash spinner)"
+          if command -v pwsh >/dev/null; then
+            pwsh -NoProfile -Command '"pwsh " + $PSVersionTable.PSVersion.ToString()'
+          else
+            echo "pwsh: NOT FOUND (interop suite will skip; integration runs bash-only)"
+          fi
+          if command -v powershell >/dev/null; then
+            powershell -NoProfile -Command '"powershell " + $PSVersionTable.PSVersion.ToString()'
+          else
+            echo "powershell (Windows PowerShell 5.1): NOT FOUND (interop Test 17 skips; expected on POSIX legs)"
+          fi
+          stat --version 2>/dev/null | head -1 || echo "stat: BSD variant"
+
+      - name: Unit suite (under load)
+        if: ${{ matrix.leg == 'all' || matrix.leg == 'unit' }}
+        timeout-minutes: ${{ matrix.os == 'windows-2025' && 40 || 25 }}   # raised: load + the N=24 sweep stretch wall-clock; a step timeout FAILS the step so the upload still runs
+        env:
+          GCL_TEST_PRESERVE_DIR: ${{ github.workspace }}/test-output/failed-work/unit
+        run: |
+          mkdir -p test-output
+          bash tests/with-load.sh bash tests/git-commit-lock.test.sh 2>&1 | tee test-output/unit-suite.log
+
+      - name: Interop suite (under load; bash + pwsh)
+        if: ${{ !cancelled() && (matrix.leg == 'all' || matrix.leg == 'interop-integration') }}   # run even if an earlier suite failed — every signal is useful
+        timeout-minutes: 30
+        env:
+          GCL_TEST_PRESERVE_DIR: ${{ github.workspace }}/test-output/failed-work/interop
+        run: |
+          mkdir -p test-output
+          bash tests/with-load.sh bash tests/git-commit-lock.interop.test.sh 2>&1 | tee test-output/interop-suite.log
+
+      - name: Integration suite (under load; real concurrent commits)
+        if: ${{ !cancelled() && (matrix.leg == 'all' || matrix.leg == 'interop-integration') }}
+        timeout-minutes: 20           # its internal AGENT_LOCK_MAX_WAIT cap is 240s; load + sweep stretch it
+        env:
+          GCL_TEST_PRESERVE_DIR: ${{ github.workspace }}/test-output/failed-work/integration
+        run: |
+          mkdir -p test-output
+          bash tests/with-load.sh bash tests/git-commit-lock.integration.test.sh 2>&1 | tee test-output/integration-suite.log
+
+      - name: Record this cell's conclusion (ground truth for triage)
+        if: ${{ always() }}   # capture the cell's own status — even on timeout/cancel — into its artifact
+        run: |
+          mkdir -p test-output
+          # job.status here reflects THIS cell's run so far: success | failure | cancelled.
+          # A step timeout fails the step, which makes the job status `failure` by the time
+          # this always() step runs, so a no-FAIL timeout is recorded as `failure`. The triage
+          # script classes any `failure` as correctness, even with no ^FAIL: line; with logs
+          # present, only other non-success conclusions (e.g. `cancelled`) are classed infra.
+          printf '%s' "${{ job.status }}" > test-output/cell-conclusion.txt
+          echo "cell ${{ matrix.id }} conclusion: $(cat test-output/cell-conclusion.txt)"
+
+      - name: Upload cell logs + load-manifest (on success too — we read the positives by the negatives)
+        if: ${{ always() }}   # upload whether the cell passed, failed, or timed out — triage needs every cell's evidence
+        uses: actions/upload-artifact@043fb46d1a93c77aae656e7c1c64a875d1fc6a0a   # v7.0.1, SHA-pinned
+        with:
+          name: nightly-logs-${{ matrix.id }}   # unique per cell; the triage job downloads these by name
+          path: test-output/
+          include-hidden-files: true   # lock logs live under the scratch repo's .git/ (hidden); suite-generated, no secrets
+          if-no-files-found: warn
+          retention-days: 14
+
+  # ── kcov line-coverage gate. Linux-only, no load, strict, unit suite at FULL.
+  #    Build kcov v43 from source (no apt package / prebuilt). Gate at 0.80. ──────
+  kcov:
+    name: kcov coverage (Linux, no load, strict)
+    runs-on: ubuntu-24.04
+    timeout-minutes: 30
+    env:
+      COVERAGE_FLOOR: '0.80'   # tracks achieved (~83%) — RATCHET UP toward ~0.90 as Tier-A tests land; do not let it lead coverage
+    steps:
+      - uses: actions/checkout@9f698171ed81b15d1823a05fc7211befd50c8ae0   # v6.0.3, SHA-pinned
+        with:
+          persist-credentials: false
+
+      - name: Install kcov build dependencies
+        run: |
+          sudo apt-get update
+          sudo apt-get install -y --no-install-recommends \
+            cmake g++ make pkg-config \
+            libdw-dev libelf-dev binutils-dev libcurl4-openssl-dev zlib1g-dev libiberty-dev
+
+      - name: Build kcov v43 from source
+        run: |
+          set -euo pipefail
+          cd /tmp
+          curl -fsSL https://github.com/SimonKagstrom/kcov/archive/refs/tags/v43.tar.gz | tar xz
+          mkdir kcov-build && cd kcov-build
+          cmake ../kcov-43
+          make -j"$(nproc)"
+          ./src/kcov --version
+
+      - name: Run unit suite under kcov (FULL, strict, no load)
+        env:
+          GCL_TEST_FULL: 1
+          # GCL_ENVELOPE_TIER unset => strict (we want a true, clean coverage run; no load applied)
+          GCL_TEST_PRESERVE_DIR: ${{ github.workspace }}/test-output/failed-work/kcov-unit
+        run: |
+          mkdir -p test-output coverage
+          /tmp/kcov-build/src/kcov --include-path="$(pwd)/git-commit-lock.sh" \
+            coverage/kcov-out tests/git-commit-lock.test.sh 2>&1 | tee test-output/kcov-unit-suite.log
+
+      - name: Enforce coverage floor (parse cobertura line-rate)
+        run: |
+          set -euo pipefail
+          # kcov writes a per-binary report under coverage/kcov-out/<binary>.<hash>/ and a
+          # merged top-level coverage/kcov-out/cobertura.xml. For a single-binary run they
+          # are equivalent; pick the one with the highest lines-valid (most complete) so
+          # this is robust either way.
+          cob=""
+          best_valid=-1
+          while IFS= read -r f; do
+            v="$(grep -oE 'lines-valid="[0-9]+"' "$f" 2>/dev/null | head -1 | grep -oE '[0-9]+')"
+            v="${v:-0}"
+            if [ "$v" -gt "$best_valid" ]; then best_valid="$v"; cob="$f"; fi
+          done < <(find coverage/kcov-out -name cobertura.xml 2>/dev/null)
+          if [ -z "$cob" ] || [ ! -f "$cob" ]; then
+            echo "::error::no cobertura.xml found under coverage/kcov-out — kcov produced no report"
+            find coverage/kcov-out -maxdepth 3 -type f 2>/dev/null | sed 's/^/  /'
+            exit 1
+          fi
+          echo "Parsing coverage from: $cob (lines-valid=$best_valid)"
+          # Prefer the precise lines-covered/lines-valid ratio (exact); fall back to the
+          # rounded line-rate attribute. Both live on the top-level <coverage ...> tag.
+          covered="$(grep -oE 'lines-covered="[0-9]+"' "$cob" | head -1 | grep -oE '[0-9]+')"
+          valid="$(grep -oE 'lines-valid="[0-9]+"' "$cob" | head -1 | grep -oE '[0-9]+')"
+          rate="$(grep -oE 'line-rate="[0-9.]+"' "$cob" | head -1 | grep -oE '[0-9.]+')"
+          if [ -n "$covered" ] && [ -n "$valid" ] && [ "$valid" -gt 0 ]; then
+            # exact ratio to 4 dp via awk floating-point division (no bc/python dependency)
+            rate="$(awk -v c="$covered" -v v="$valid" 'BEGIN { printf "%.4f", c / v }')"
+            echo "Line coverage: $covered / $valid = $rate"
+          else
+            echo "Line coverage (from line-rate attribute): $rate (lines-covered/valid unavailable)"
+          fi
+          floor="$COVERAGE_FLOOR"
+          # Compare rate >= floor with awk (float-safe).
+          if awk -v r="$rate" -v f="$floor" 'BEGIN { exit !(r + 0 >= f + 0) }'; then
+            echo "PASS: line coverage $rate >= floor $floor"
+            echo "NOTE: the floor ($floor) tracks the achieved coverage (~0.83); ratchet it up toward ~0.90 as Bucket-2 Tier-A tests land. The Linux ceiling is ~0.94 (~30 lines are platform-gated)."
+          else
+            echo "::error::line coverage $rate is BELOW the floor $floor — coverage regressed"
+            echo "The floor tracks achieved coverage (~0.83) and should only ratchet UP as tests land. A drop means a test stopped exercising lines it used to. Investigate before lowering the floor."
+            exit 1
+          fi
+
+      - name: Upload coverage report (HTML + cobertura)
+        if: ${{ !cancelled() }}
+        uses: actions/upload-artifact@043fb46d1a93c77aae656e7c1c64a875d1fc6a0a   # v7.0.1, SHA-pinned
+        with:
+          name: kcov-coverage
+          path: |
+            coverage/kcov-out/
+            test-output/kcov-unit-suite.log
+          include-hidden-files: true
+          if-no-files-found: warn
+          retention-days: 30
+
+  # ── Auto-triage. Downloads every cell's artifact, classifies (correctness /
+  #    envelope / infra), and files/appends to ONE labelled issue per (date, class).
+  #    Runs always() so a failed/cancelled cell is still triaged; the empty-round
+  #    guard prevents "0 FAIL across 0 logs" being read as green. ─────────────────
+  triage:
+    name: Triage nightly results
+    needs: [stress, kcov]
+    if: ${{ always() }}
+    runs-on: ubuntu-24.04
+    timeout-minutes: 10
+    permissions:
+      issues: write
+      contents: read
+    steps:
+      - uses: actions/checkout@9f698171ed81b15d1823a05fc7211befd50c8ae0   # v6.0.3, SHA-pinned
+        with:
+          persist-credentials: false
+
+      - name: Download all cell artifacts
+        uses: actions/download-artifact@018cc2cf5baa6db3ef3c5f8a56943fffe632ef53   # v6.0.0, SHA-pinned
+        with:
+          path: artifacts
+          # pattern restricts to the per-cell logs (not kcov-coverage); merge-multiple off
+          # so each lands in its own `nightly-logs-<id>/` dir, as the triage script expects.
+          pattern: nightly-logs-*
+        continue-on-error: true   # a totally-missing artifact set must reach the empty-round guard, not error the job
+
+      - name: Ensure triage labels exist (idempotent)
+        env:
+          GH_TOKEN: ${{ github.token }}
+        run: |
+          set -uo pipefail
+          gh label create nightly-correctness -c '#d73a4a' -d 'Nightly stress: a correctness assertion failed — investigate' --force || true
+          gh label create nightly-envelope    -c '#fbca04' -d 'Nightly stress: wall-clock envelope relaxed under load — expected, tracked' --force || true
+          gh label create nightly-infra        -c '#0e8a16' -d 'Nightly stress: infra issue (missing artifact / timeout / errored) — not a product failure' --force || true
+
+      - name: Classify results and file/append issues
+        env:
+          GH_TOKEN: ${{ github.token }}
+          ARTIFACTS_DIR: artifacts
+          EXPECTED_CELLS: 'N1 N2 N3 N4 N5 N6'
+          GITHUB_SERVER_URL: ${{ github.server_url }}
+          GITHUB_REPOSITORY: ${{ github.repository }}
+          GITHUB_RUN_ID: ${{ github.run_id }}
+        run: |
+          set -uo pipefail
+          # Each cell's status is ground truth from its OWN artifact
+          # (nightly-logs-<id>/cell-conclusion.txt, written by the stress job under
+          # always()), so the script never relies on the misleading matrix-aggregate
+          # `needs.stress.result`. The empty-round guard fires if NO cell artifact exists.
+          echo "Artifacts present:"; ls -la artifacts 2>/dev/null || echo "  (none)"
+          bash .github/scripts/nightly-triage.sh
diff --git a/.github/workflows/tests.yml b/.github/workflows/tests.yml
index 2156133..8ebffcc 100644
--- a/.github/workflows/tests.yml
+++ b/.github/workflows/tests.yml
@@ -133,6 +133,7 @@ jobs:
             tests/git-commit-lock.test.sh \
             tests/git-commit-lock.interop.test.sh \
             tests/git-commit-lock.integration.test.sh \
+            .github/scripts/nightly-triage.sh \
             install.sh
 
       - name: PSScriptAnalyzer (gate at warning severity)

From 9cce97d3e5d12d02330167a03213706e77882a9e Mon Sep 17 00:00:00 2001
From: Ben Toner <bentoner@bentoner.com>
Date: Thu, 18 Jun 2026 05:58:04 +1000
Subject: [PATCH 45/76] Bucket 6d: deep-sweep.yml (on-demand deep flake hunt)

New workflow_dispatch-only workflow (never gates): inputs stress_kind /
stress_load / repeat / envelope_tier (default relax). Per-run-unique concurrency
(group: deep-<run_id>, cancel-in-progress: false) so many parallel dispatches
coexist and queue. Matrix mirrors tests.yml's 4 cells with distinct deep-* job
names, each wrapping the suites in with-load.sh at FULL + SWEEP. The `repeat`
input loops the suite N times, sanitized to a positive int, failing fast on the
first bad iteration with the index named. The rc is read from PIPESTATUS; note that
GitHub runs `shell: bash` steps with `-eo pipefail`, so errexit is on unless disabled.
Artifacts are uploaded on success too.

actionlint clean; YAML valid; repeat loop + concurrency + inputs->with-load.sh
reasoned through.
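
For reasoning about the repeat loop outside the workflow, the sanitize + fail-fast
logic described above can be sketched as a standalone script. This is only a
sketch: `sanitize_repeat` and `run_repeated` are hypothetical helper names (the
workflow inlines the logic in each step), and the log-prefix argument is an
assumption made for illustration.

```shell
# Hypothetical helpers (not in the repo); deep-sweep.yml inlines this logic per step.

# Coerce anything that is not a positive decimal integer to 1.
sanitize_repeat() {
  local n="$1"
  case "$n" in ''|*[!0-9]*) n=1 ;; esac
  [ "$n" -lt 1 ] && n=1
  printf '%s\n' "$n"
}

# Run a command N times, teeing each iteration to <prefix>.iter<i>.log and
# stopping at the first failure. The ( ) body is a subshell, so `set +e` and
# `exit` stay local to this function.
run_repeated() (
  set +e
  n="$(sanitize_repeat "$1")"; prefix="$2"; shift 2
  for i in $(seq 1 "$n"); do
    echo "== iteration $i/$n =="
    "$@" 2>&1 | tee "$prefix.iter$i.log"
    rc=${PIPESTATUS[0]}          # status of the command, not of tee
    if [ "$rc" -ne 0 ]; then
      echo "== iteration $i/$n FAILED (rc=$rc), stopping =="
      exit 1
    fi
  done
)
```

Usage would look like `run_repeated 5 test-output/unit-suite bash tests/with-load.sh
bash tests/git-commit-lock.test.sh`. Reading `PIPESTATUS[0]` with errexit off is
what lets the FAILED line name the iteration before the step goes red.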

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 .github/workflows/deep-sweep.yml | 187 +++++++++++++++++++++++++++++++
 1 file changed, 187 insertions(+)
 create mode 100644 .github/workflows/deep-sweep.yml

diff --git a/.github/workflows/deep-sweep.yml b/.github/workflows/deep-sweep.yml
new file mode 100644
index 0000000..7ac74e9
--- /dev/null
+++ b/.github/workflows/deep-sweep.yml
@@ -0,0 +1,187 @@
+# deep-sweep — Tier D of the load-testing strategy (docs/load-testing-strategy.md §9).
+#
+# ON-DEMAND ONLY. This workflow is `workflow_dispatch`-only: it NEVER runs on push
+# or pull_request, and it NEVER gates anything (it is not a required check — this is
+# a single-dev project with no branch protection; see the Phase-2 build plan's
+# Bucket 6 decision box). It exists purely as a deep flake-hunting tool — the
+# "50-clean hunt" instrument from the load-testing strategy: dispatch it (often many
+# times in parallel), pick a stress kind/magnitude, and repeat the full suite N
+# times per job to surface intermittent, scheduling-sensitive flakes that a single
+# zero-load per-PR run would never reproduce.
+#
+# Deep + loaded runs are SLOW (heavy CPU/disk oversubscription stretches every
+# wall-clock-derived step), so timeouts here are deliberately generous and the
+# envelope tier defaults to `relax` (an oversubscribed runner must not turn a
+# latency miss into a red — only a real correctness FAIL should).
+#
+# The job names are intentionally distinct (`deep-*`). With no branch protection
+# there is no required `tests-passed` context to avoid publishing, so this is now
+# only cosmetic / for clarity — but kept so a deep run is never confused with the
+# per-PR `tests` matrix in the checks UI.
+
+name: deep-sweep
+
+on:
+  workflow_dispatch:
+    inputs:
+      stress_kind:
+        description: 'Background load kind to apply via tests/with-load.sh'
+        type: choice
+        options: [none, cpu, disk, both]
+        default: both
+      stress_load:
+        description: 'Raw per-kind hog count override (GCL_STRESS_LOAD). Blank = use the ratio.'
+        type: string
+        default: ''
+      repeat:
+        description: 'How many times to repeat the suite run within each job (intermittent-flake hunt).'
+        type: string
+        default: '1'
+      envelope_tier:
+        description: 'GCL_ENVELOPE_TIER — relax (default) warns on latency misses; strict fails them.'
+        type: string
+        default: relax
+
+# Per-run-unique group so MANY parallel dispatches each get their own group and run
+# concurrently (a fresh dispatch never cancels or is cancelled by an in-flight one).
+# cancel-in-progress: false is a backstop only: no two runs can share a group, since
+# run_id is unique per run, but if they ever did, the later one would queue rather than
+# cancel. Every dispatch is its own run, so sweeps fan out freely and accept queue waves.
+concurrency:
+  group: deep-${{ github.run_id }}
+  cancel-in-progress: false
+
+permissions:
+  contents: read
+
+jobs:
+  deep:
+    name: deep-${{ matrix.os }}${{ matrix.leg != 'all' && format(' ({0})', matrix.leg) || '' }}
+    runs-on: ${{ matrix.os }}
+    strategy:
+      fail-fast: false               # every cell's verdict is a useful deep signal; let the rest finish
+      matrix:
+        # Mirrors the tests.yml 4-cell set (ubuntu all / macos all / windows unit /
+        # windows interop+integration). Windows stays split because the bash-only
+        # unit suite is the wall-clock bottleneck there and the suites must not run
+        # concurrently inside one timing-sensitive 2-core runner. Generous deep
+        # timeouts: deep + loaded + repeated is far slower than the per-PR gate.
+        include:
+          - { os: ubuntu-24.04, leg: all, job_timeout: 180 }
+          - { os: macos-15, leg: all, job_timeout: 180 }
+          - { os: windows-2025, leg: unit, job_timeout: 120 }
+          - { os: windows-2025, leg: interop-integration, job_timeout: 120 }
+    timeout-minutes: ${{ matrix.job_timeout }}   # backstop only: repeat * (loaded suite budgets) + upload headroom
+    defaults:
+      run:
+        shell: bash                  # on windows-2025 this is Git Bash (MINGW) — what the interop suite requires
+    env:
+      GCL_TEST_FULL: 1               # full fan-out — CI runners are dedicated
+      GCL_TEST_SWEEP: 1              # deep runs exercise the Axis-A waiter-count sweep too
+      GCL_ENVELOPE_TIER: ${{ inputs.envelope_tier }}
+      GCL_STRESS_KIND: ${{ inputs.stress_kind }}
+      GCL_STRESS_LOAD: ${{ inputs.stress_load }}   # blank => with-load.sh falls back to the ratio
+    steps:
+      - uses: actions/checkout@9f698171ed81b15d1823a05fc7211befd50c8ae0   # v6.0.3, SHA-pinned
+        with:
+          persist-credentials: false   # no job uses the token after fetch
+
+      - name: Toolchain versions (for reconstructing failures)
+        run: |
+          uname -a
+          bash --version | head -1
+          git --version
+          if command -v pwsh >/dev/null; then
+            pwsh -NoProfile -Command '"pwsh " + $PSVersionTable.PSVersion.ToString()'
+          else
+            echo "pwsh: NOT FOUND (interop suite will skip; integration runs bash-only)"
+          fi
+          if command -v powershell >/dev/null; then
+            powershell -NoProfile -Command '"powershell " + $PSVersionTable.PSVersion.ToString()'
+          else
+            echo "powershell (Windows PowerShell 5.1): NOT FOUND (interop Test 17 skips; expected on POSIX legs)"
+          fi
+          stat --version 2>/dev/null | head -1 || echo "stat: BSD variant"
+          command -v stress-ng >/dev/null && stress-ng --version | head -1 || echo "stress-ng: NOT FOUND (with-load.sh uses the portable spinner)"
+          echo "dispatch inputs: kind=${GCL_STRESS_KIND} load='${GCL_STRESS_LOAD}' repeat=${{ inputs.repeat }} envelope=${GCL_ENVELOPE_TIER}"
+
+      # Each suite is repeated `repeat` times under load. The loop fails fast: the
+      # first failing iteration `exit 1`s the step (so the step, and with it the job,
+      # goes red on the earliest flake), and every iteration names its index in the
+      # log so a failure is attributable to a specific repeat. GitHub runs `shell: bash`
+      # as `bash -eo pipefail`, so the explicit rc check needs errexit off (`set +e`).
+      - name: Unit suite (deep, looped x repeat, under load)
+        if: ${{ matrix.leg == 'all' || matrix.leg == 'unit' }}
+        timeout-minutes: ${{ matrix.os == 'windows-2025' && 100 || 90 }}
+        env:
+          GCL_TEST_PRESERVE_DIR: ${{ github.workspace }}/test-output/failed-work/unit
+        run: |
+          mkdir -p test-output
+          n='${{ inputs.repeat }}'
+          case "$n" in ''|*[!0-9]*) n=1 ;; esac
+          [ "$n" -lt 1 ] && n=1
+          echo "== unit: repeating $n time(s) under load =="
+          for i in $(seq 1 "$n"); do
+            echo "== unit iteration $i/$n =="
+            set +e   # `shell: bash` means -eo pipefail; errexit off so the rc check below runs
+            bash tests/with-load.sh bash tests/git-commit-lock.test.sh 2>&1 | tee "test-output/unit-suite.iter$i.log"
+            rc=${PIPESTATUS[0]}
+            if [ "$rc" -ne 0 ]; then
+              echo "== unit iteration $i/$n FAILED (rc=$rc) — stopping deep sweep =="
+              exit 1
+            fi
+          done
+
+      - name: Interop suite (deep, looped x repeat, under load)
+        if: ${{ !cancelled() && (matrix.leg == 'all' || matrix.leg == 'interop-integration') }}   # run even if an earlier suite failed — every signal is useful
+        timeout-minutes: 90
+        env:
+          GCL_TEST_PRESERVE_DIR: ${{ github.workspace }}/test-output/failed-work/interop
+        run: |
+          mkdir -p test-output
+          n='${{ inputs.repeat }}'
+          case "$n" in ''|*[!0-9]*) n=1 ;; esac
+          [ "$n" -lt 1 ] && n=1
+          echo "== interop: repeating $n time(s) under load =="
+          for i in $(seq 1 "$n"); do
+            echo "== interop iteration $i/$n =="
+            set +e   # `shell: bash` means -eo pipefail; errexit off so the rc check below runs
+            bash tests/with-load.sh bash tests/git-commit-lock.interop.test.sh 2>&1 | tee "test-output/interop-suite.iter$i.log"
+            rc=${PIPESTATUS[0]}
+            if [ "$rc" -ne 0 ]; then
+              echo "== interop iteration $i/$n FAILED (rc=$rc) — stopping deep sweep =="
+              exit 1
+            fi
+          done
+
+      - name: Integration suite (deep, looped x repeat, under load)
+        if: ${{ !cancelled() && (matrix.leg == 'all' || matrix.leg == 'interop-integration') }}
+        timeout-minutes: 60           # its internal AGENT_LOCK_MAX_WAIT cap is 240s; x repeat under load
+        env:
+          GCL_TEST_PRESERVE_DIR: ${{ github.workspace }}/test-output/failed-work/integration
+        run: |
+          mkdir -p test-output
+          n='${{ inputs.repeat }}'
+          case "$n" in ''|*[!0-9]*) n=1 ;; esac
+          [ "$n" -lt 1 ] && n=1
+          echo "== integration: repeating $n time(s) under load =="
+          for i in $(seq 1 "$n"); do
+            echo "== integration iteration $i/$n =="
+            set +e   # `shell: bash` means -eo pipefail; errexit off so the rc check below runs
+            bash tests/with-load.sh bash tests/git-commit-lock.integration.test.sh 2>&1 | tee "test-output/integration-suite.iter$i.log"
+            rc=${PIPESTATUS[0]}
+            if [ "$rc" -ne 0 ]; then
+              echo "== integration iteration $i/$n FAILED (rc=$rc) — stopping deep sweep =="
+              exit 1
+            fi
+          done
+
+      - name: Upload deep-sweep artifacts (logs + load manifests, on success too)
+        if: ${{ always() }}   # deep runs want the negatives to read the positives; upload even when green or cancelled
+        uses: actions/upload-artifact@043fb46d1a93c77aae656e7c1c64a875d1fc6a0a   # v7.0.1, SHA-pinned
+        with:
+          name: deep-logs-${{ matrix.os }}-${{ matrix.leg }}-${{ inputs.stress_kind }}   # unique per (os, leg, kind)
+          path: test-output/
+          include-hidden-files: true   # lock logs + the load-manifest live under the scratch .git/ and test-output/; suite-generated, no secrets
+          if-no-files-found: warn
+          retention-days: 14

From 309cf3912510c1cf2891297fecd49e588250e0aa Mon Sep 17 00:00:00 2001
From: Ben Toner <bentoner@bentoner.com>
Date: Thu, 18 Jun 2026 06:06:02 +1000
Subject: [PATCH 46/76] docs(failure-modes): mark F1/F2/F4/J1/E3 TESTED; F3
 document-only
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The fault-injection + steering tests now exist (Buckets 2A/2B), so flip the §2
coverage table + §3 prose markers from "test planned" to tested, with citations:
- E3 (mtime unreadable -> fail-safe) -> Test 42
- F1 (ENOSPC on create/write) -> Test 50 (Linux+sudo tmpfs; skip elsewhere)
- F2 (failing log path) / J1 (logging failure) -> Test 49 (portable ENOTDIR)
- F4 (unwritable lock dir -> clean 97) -> Test 48 (POSIX chmod 0555)
- F3 (FD/inode exhaustion) -> document-only (no deterministic portable injection)
- D3 row: cite Test 37 (rename-refused / wrong-type-at-path mid-steal)
§4.5 item 5 gets a "Status (done)" block recording the above; Ben's override
rationale preserved.
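
The portable ENOTDIR injection behind the F2/J1 entry can be illustrated with a
minimal sketch. It assumes a best-effort logger of the `|| true` shape the docs
describe; `log_line` and the scratch paths are hypothetical, and this is not the
body of Test 49.

```shell
# Hypothetical illustration only; log_line is not a function in git-commit-lock.sh.

scratch="$(mktemp -d)"
: > "$scratch/not-a-dir"                       # a regular file...
AGENT_LOCK_LOG="$scratch/not-a-dir/lock.log"   # ...used as a directory component

# Best-effort logger: the append cannot open its target (ENOTDIR on POSIX),
# and the failure is swallowed so the caller never sees a non-zero status.
log_line() {
  { printf '%s\n' "$1" >> "$AGENT_LOCK_LOG"; } 2>/dev/null || true
}

log_line "WAITING"
echo "log_line rc=$? (log exists: $([ -e "$AGENT_LOCK_LOG" ] && echo yes || echo no))"
```

Because the failure comes from the path shape rather than from permissions, the
same injection works without chmod and regardless of the user running the suite.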

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 docs/failure-modes.md | 98 +++++++++++++++++++++++++++----------------
 1 file changed, 61 insertions(+), 37 deletions(-)

diff --git a/docs/failure-modes.md b/docs/failure-modes.md
index a187c15..3f54abe 100644
--- a/docs/failure-modes.md
+++ b/docs/failure-modes.md
@@ -121,16 +121,16 @@ robust-by-code-but-unverified · S static/grep check · (plat) platform-gated.
 | C4 | Leaked claim (unverifiable unlink) | Leaked-token memory keeps ownership discoverable | 1 | ✓ U:1549-1758, U:2013-2164 | **In scope.** Keep. |
 | D1 | Atomic rename-over (steal install) | `mv -T` / `File.Move(...,true)` / 5.1 unlink+move | 1 (local FS) | ✓ U:212-346, I:16d S:1141 | **In scope on local FS.** Boundary = D-axis. |
 | D2 | O_EXCL atomic create | `set -C` redirect / `FileMode.CreateNew` | 1 (local FS) | ✓ throughout | **In scope on local FS.** |
-| D3 | Wrong-type at path (dir/symlink/FIFO/dev/socket) | Never stolen/deleted; loud warn; waiters → 97 | 1 (bash + ps1-on-Win) / 2 (ps1-on-POSIX) | ✓ U:818-892/1156-1262, ~(plat) | **In scope.** ps1-on-POSIX residual = accept. |
+| D3 | Wrong-type at path (dir/symlink/FIFO/dev/socket) | Never stolen/deleted; loud warn; waiters → 97 | 1 (bash + ps1-on-Win) / 2 (ps1-on-POSIX) | ✓ U:818-892/1156-1262/Test 37 (rename-refused mid-steal), ~(plat) | **In scope.** ps1-on-POSIX residual = accept. |
 | D4 | Non-lock CONTENT at path (user file) | Never stolen (content guard); warn | 1 | ✓ U:1034-1076 | **In scope.** Two accepted residuals (§D4). |
 | D5 | Case-insensitive FS path collision | Not handled explicitly | 3 | ✗ | **Likely non-issue;** see §D5. Decide. |
 | E1 | Network/shared FS (NFS/SMB/9p/Dropbox) | Outside design guarantees (stated) | 3 | ✗ | **Out of scope** (stated). See §E — decide whether to *enforce*. |
 | E2 | Multi-host clock skew / NTP jump | Implicitly single-clock; **not** addressed in docs | 3 (and a doc gap) | ✗ | **Out of scope** but UNDER-documented. See §E2. |
-| E3 | mtime probe unreadable (staleness clock broken) | Warns loudly once; treats as not-stale → safe, recovery disabled → 97 | 2 | ○ | **Accept** — fails safe + announced. See §E3. |
-| F1 | Disk full (ENOSPC) during create/write | Create fails → wait; torn write ages out | 2/3 | ○ → test planned | **Add test** (§4.5) + document. See §F1. |
-| F2 | ENOSPC during LOG write | Swallowed (`|| true`); silent log loss | 2 | ○ → test planned | **Add test** (§4.5); logging best-effort, lock unaffected. |
-| F3 | Inode / FD exhaustion | Create fails → wait → 97 | 2 | ○ → test planned | **Add test** (§4.5, FD via `ulimit`), document. |
-| F4 | Read-only / unwritable lock dir or parent | `mkdir -p` best-effort; create fails → wait → 97 | 2 | ○ → test planned | **Add test** (§4.5, highest-value). See §F4. |
+| E3 | mtime probe unreadable (staleness clock broken) | Warns loudly once; treats as not-stale → safe, recovery disabled → 97 | 2 | ✓ U:Test 42 | **Accept** — fails safe + announced. See §E3. |
+| F1 | Disk full (ENOSPC) during create/write | Create fails → wait; torn write ages out | 2/3 | ✓ U:Test 50 (Linux+sudo tmpfs; (plat) skip elsewhere) | **Tested** (§4.5) + document. See §F1. |
+| F2 | ENOSPC during LOG write | Swallowed (`|| true`); silent log loss | 2 | ✓ U:Test 49 (portable failing-log path) | **Tested** (§4.5); logging best-effort, lock unaffected. |
+| F3 | Inode / FD exhaustion | Create fails → wait → 97 | 2 | ○ (document-only) | **Document-only**: no deterministic portable injection. See §F3. |
+| F4 | Read-only / unwritable lock dir or parent | `mkdir -p` best-effort; create fails → wait → 97 | 2 | ✓ U:Test 48 (POSIX `chmod 0555`; (plat) skip on Windows) | **Tested** (§4.5, highest-value). See §F4. |
 | G1 | Lock path = a directory / `$HOME` typo | Never stolen/deleted; loud warn; → 97 | 1 | ✓ U:818-840 | **In scope.** Keep. |
 | G2 | Garbage numeric config | Falls back to default + stderr note | 1 | ✓ U:695-703, I:554-608 | **In scope.** Keep. |
 | G3 | `run` outside a git repo, no `AGENT_LOCK_PATH` | Refuses (96) | 1 | ✓ U:705-712 | **In scope.** Keep. |
@@ -141,7 +141,7 @@ robust-by-code-but-unverified · S static/grep check · (plat) platform-gated.
 | H4 | Non-unwinding exit while held (SIGKILL / bash `exec` / `[Environment]::Exit()`) | Skips release → a displaced holder is unwarned (no 98); plain `exit` is safe | 2 | ~ (I:308-334 indirect) | **Document** the no-silent-loss boundary. See §H4. |
 | I1 | bash⇄pwsh wire/format compatibility | Shared format; token grammar tightened to match | 1 | ✓ I:* throughout | **In scope.** Keep. |
 | I2 | Mixed-VERSION tree (old unserialized steal) | Prevention degrades to detection (98); `.dead.*` litter | 3 | ✗ | **Out of scope:** "upgrade both together." Residual 4. |
-| J1 | Logging subsystem failure | All log writes `|| true`; 1 MB self-truncate | 2 | ○ → test planned | **Add test** (§4.5, via F2); logging never blocks the lock. |
+| J1 | Logging subsystem failure | All log writes `|| true`; 1 MB self-truncate | 2 | ✓ U:Test 49 (via F2) | **Tested** (§4.5, via F2); logging never blocks the lock. |
 | K1 | Extreme load / CPU oversubscription / slow FS | Correctness holds; wall-clock bounds stretch | 2 | ~ (CI stress) | **Define the envelope.** See §K — the key analytical section. |
 | K2 | Internal time budgets (poll, MAX_WAIT, read ladder) | Fixed schedules; tunable | 2 | ✓/~ | **In scope** as Tier-2 envelope. See §K. |
 
@@ -452,12 +452,14 @@ stale** — the floor guard `[ "$mt" -gt 946684800 ]` fails closed to "fresh"
 lock whose age it cannot establish, so no premature steal and no corruption — but
 **recovery of a genuinely crashed holder is disabled**, and waiters block to
 MAX_WAIT (97). *Tier 2 (safety held, recovery lost — and loudly announced).*
-Untested (no stat-failure injection). **Recommend: accept and document** — it is a
+Tested: unit Test 42 shadows the inner mtime probe to return empty on a present,
+stale ghost and asserts the fail-safe lane — the "Staleness detection is BROKEN"
+warn-once fires, the ghost is NOT stolen (left in place), and the waiter blocks to
+MAX_WAIT → 97. **Recommend: accept and document** — it is a
 host/FS-health failure the tool already detects and announces, and it fails *safe*
-(no false steal). Fault injection is low-ROI; the loud warning is the right
-behavior. This is also the clean reason recovery is a *Tier-1-within-envelope*
-property, not unconditional (see the tier split under §1): it presumes a readable
-clock.
+(no false steal); the loud warning is the right behavior. This is also the clean
+reason recovery is a *Tier-1-within-envelope* property, not unconditional (see the
+tier split under §1): it presumes a readable clock.
 
 ### F. Resource exhaustion
 
@@ -468,36 +470,47 @@ comment at `:1341-1343`). A created-but-write-failed file is an empty orphan tha
 ages into the steal lane. A torn write *shorter than `tok.`* (e.g. `to`) is the
 accepted residual at `:299-304`: non-empty, non-prefixed → never stolen, loud,
 fixed by one manual `rm`. *Tier 2 (degrades to wait/97) / Tier 3 (the torn-write
-manual-fix residual).* Reasoned from code, **not tested** (no ENOSPC fault
-injection). **Recommend: document + add a fault-injection test (per §4.5).** ENOSPC
-is a host-health failure; the tool degrades safely (no corruption, no false hold)
-and the one sharp edge (sub-`tok.` torn write needing manual `rm`) is already
-documented. Per Ben's §4.5 decision, add an ENOSPC test where it can be injected
-deterministically and portably (e.g. a small dedicated tmpfs/quota); if portable
-injection proves impractical, say so in the plan rather than shipping a flaky test.
+manual-fix residual).* **Tested** (per §4.5): unit Test 50 mounts a small 64k
+tmpfs, fills it to ENOSPC, and asserts the waiter times out at 97 with the wrapped
+command never running. ENOSPC injection needs a genuinely full FS, and mounting a
+tmpfs needs root (`ulimit -f` raises SIGXFSZ instead, the wrong lane), so the test
+runs on **Linux with passwordless sudo** (the Linux CI leg) and skips-with-note
+elsewhere. ENOSPC is a host-health failure; the tool degrades safely (no corruption,
+no false hold) and the one sharp edge (sub-`tok.` torn write needing manual `rm`) is
+already documented.
 
 **F2 — ENOSPC during a LOG write.** All log writes end in `|| true`
 (`git-commit-lock.sh:561`); a failed log write is silently lost. *Tier 2.*
-**Recommend: accept + add a test (per §4.5)** — logging is best-effort by explicit
-design (it must never block or fail the lock); the only downside is reduced
-post-mortem signal under disk pressure. Add a test that an unwritable/failing log
-path leaves the lock fully working (the write is swallowed) — this also covers J1.
+**Tested** (per §4.5): unit Test 49 points `AGENT_LOCK_LOG` at a path *under a
+regular file*, so every open/append fails with ENOTDIR, and asserts the lock still
+acquires + releases cleanly (rc 0), the wrapped command runs, the lock is cleaned
+up, and no log file appears — i.e. the failing log write is swallowed and the lock
+is unaffected. This is a portable injection (no chmod/perms), and it **also covers
+J1**. Logging is best-effort by explicit design (it must never block or fail the
+lock); the only downside is reduced post-mortem signal under disk pressure.
 
 **F3 — Inode / FD exhaustion.** Same shape as F1: a create that can't get an
 inode fails → wait → eventually 97. The tool holds at most a couple of FDs
-briefly. *Tier 2.* Untested. **Recommend: document + add a test (per §4.5)** as
-host-health — an FD-exhaustion test via `ulimit -n` is the deterministic, portable
-one; add inode exhaustion only if it can be injected cleanly.
+briefly. *Tier 2.* **Document-only — no deterministic portable injection.** A
+`ulimit -n` FD cap can't be driven deterministically here: the create needs only
+~1 FD, so an FD-exhaustion test would have to pin the process at *exactly* the
+limit across a poll loop without starving the harness itself — not portable or
+stable. Inode exhaustion needs a full FS the way F1 does (and F1/Test 50 already
+exercises the create-fails → wait → 97 lane that F3 shares). So F3 is recorded as
+a reasoned-but-untested boundary rather than given a flaky test; the safe-degrade
+behaviour is the same as F1, which is tested.
 
 **F4 — Read-only / unwritable lock dir or parent.** `lock_acquire` does a
 best-effort `mkdir -p "$(dirname …)"` (`git-commit-lock.sh:1278`); if the dir is
 unwritable the create fails every poll and the waiter times out at 97. No
 corruption, no false hold. A *release* unlink blocked by an unwritable parent
-routes to the LEFTOVER lane (`:1699-1711`). *Tier 2.* Untested directly.
-**Recommend: add a test (per §4.5 — the highest-value one).** An unwritable lock
-dir → clean 97 is cheap and deterministic to write. A correct, if blunt, outcome
-(97); an *earlier, clearer* error would be nicer but is optional polish, low
-priority.
+routes to the LEFTOVER lane (`:1699-1711`). *Tier 2.* **Tested** (per §4.5 — the
+highest-value one): unit Test 48 `chmod 0555`s the lock-dir parent and asserts the
+waiter times out at 97, the wrapped command never runs, no lock file is created,
+and the WAITING/TIMEOUT lines are logged — no corruption, no false hold. POSIX-only
+(`chmod 0555` is a no-op for writes on Git-Bash/NTFS, so it skips-with-note on
+Windows; the Linux/macOS CI legs exercise it). The outcome (97) is correct, if blunt;
+an *earlier, clearer* error would be nicer but is optional polish, low priority.
 
 **F5 — Memory exhaustion.** The scripts allocate trivially (a few shell vars; the
 leaked-token list is "almost always empty"). Not a meaningful failure surface.
@@ -632,11 +645,12 @@ than rotating (`git-commit-lock.sh:554-562`). A broken log never blocks or fails
 the lock. Under a redirected git dir, log *content* (the owner line) is
 attacker-influenceable — one-line text spoofing, no execution; the tool itself
 writes only its token, owner line, and protocol events, never secrets
-(`docs/git-commit-lock.md:543-551`). *Tier 2.* **Recommend: accept + covered by the
-F2 log-failure test (per §4.5)** — logging is best-effort by design, which is the
-right call for a lock that must keep working when the disk is full or the log path
-is bad. The follow-on (unchanged): don't build automation that *trusts* log text
-from an untrusted repo (already documented).
+(`docs/git-commit-lock.md:543-551`). *Tier 2.* **Tested — covered by the F2
+log-failure test (per §4.5): unit Test 49** proves a failing log path leaves the
+lock fully working. Logging is best-effort by design, which is the right call for a
+lock that must keep working when the disk is full or the log path is bad. The
+follow-on (unchanged): don't build automation that *trusts* log text from an
+untrusted repo (already documented).
 
 ### K. Behavior under extreme load / scheduling pressure, and internal time budgets
 
@@ -787,7 +801,7 @@ edge cases make the tool more maintainable and give future users confidence), ra
 
 5. **Untested-but-robust-by-code lanes (resource exhaustion F1/F3/F4, log-write
    failure F2/J1).** These degrade safely (wait/97, or silent best-effort log
-   loss) but have **no fault-injection tests** — they are reasoned-correct, not
+   loss) but had **no fault-injection tests** — they were reasoned-correct, not
    verified. *Decision (Ben — overrides the prior "accept untested"):* **add test
    coverage** for these lanes. Rationale: actually-tested edge cases make the
    project easier to maintain and give future users confidence, versus
@@ -800,6 +814,16 @@ edge cases make the tool more maintainable and give future users confidence), ra
    FDs). Flag in the plan any lane that proves genuinely impractical to fault-inject
    portably, rather than forcing a flaky test.
 
+   *Status (done):* coverage added — **F4** unit Test 48 (POSIX `chmod 0555`,
+   skip-with-note on Windows), **F2/J1** unit Test 49 (portable failing-log path via
+   ENOTDIR), **F1** unit Test 50 (Linux + passwordless-sudo 64k tmpfs filled to
+   ENOSPC; skip-with-note elsewhere). **F3** (inode/FD exhaustion) proved impractical
+   to fault-inject deterministically and portably — the create needs only ~1 FD, so a
+   `ulimit -n` cap can't be driven deterministically across a poll loop without
+   starving the harness, and inode exhaustion needs a full FS the way F1 does (F1/Test
+   50 already exercises the shared create-fails-→-wait-→-97 lane). Per the "flag any
+   impractical lane" instruction above, F3 stays **document-only**, not a flaky test.
+
 6. **Mixed-version tree (§I2) and case-insensitive FS (§D5) — out of scope,
    confirm.** The first degrades to detection (98), never silent, and is covered
    by the "upgrade both together" note. The second is a non-issue. *Recommendation:*

From d6d643f4e8627de181290e11155f9b88d3158f66 Mon Sep 17 00:00:00 2001
From: Ben Toner <bentoner@bentoner.com>
Date: Thu, 18 Jun 2026 06:24:27 +1000
Subject: [PATCH 47/76] Plan: Windows unit-suite CI sharding subplan (Phase 2,
 under review)

Subplan for splitting the windows-2025 unit leg (the ~2x CI bottleneck) into two
parallel shards via a GCL_TEST_SHARD=i/n round-robin gate in section(). Records
the mechanism, the by-construction partition guarantee + a self-contained
per-shard expected-count guard, alternatives rejected, CI wiring (windows-unit
only; not kcov/nightly/interop), edge cases, phasing, and a logging design.
Awaiting review convergence + Ben's go.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
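Note for reviewers (placed below the cut line, so `git am` does not record it in
the commit message): the round-robin gate and the per-shard expected-count guard
described in the plan below can be sanity-checked in isolation. This is a minimal
sketch under stated assumptions, not the harness code. `TOTAL`, `in_shard`, and
`expected_count` are hypothetical names introduced only for illustration; 57 is the
section count quoted in the plan.

```sh
# Hypothetical sketch: names below are illustrative, not taken from _harness.sh.
TOTAL=57   # section count quoted in the plan

# in_shard K I N: succeeds iff 1-based section index K belongs to shard I of N,
# using the same test as the plan's gate: (K-1) % N == (I-1).
in_shard() {
  [ $(( ($1 - 1) % $3 )) -eq $(( $2 - 1 )) ]
}

# expected_count I N T: how many of sections 1..T land in shard I of N.
expected_count() {
  ec_k=1
  ec_count=0
  while [ "$ec_k" -le "$3" ]; do
    if in_shard "$ec_k" "$1" "$2"; then
      ec_count=$((ec_count + 1))
    fi
    ec_k=$((ec_k + 1))
  done
  echo "$ec_count"
}

# Every index falls in exactly one residue class, so the per-shard counts
# must sum to TOTAL for any n.
for n in 2 3; do
  i=1
  sum=0
  while [ "$i" -le "$n" ]; do
    e=$(expected_count "$i" "$n" "$TOTAL")
    echo "shard $i/$n: expected $e of $TOTAL sections"
    sum=$((sum + e))
    i=$((i + 1))
  done
  if [ "$sum" -ne "$TOTAL" ]; then
    echo "partition broken for n=$n" >&2
  fi
done
```

With 57 sections, the two halves at n=2 come out as 29 and 28, and a shard index
larger than the section count (for example 58/58) yields an expected count of 0,
which is the empty-shard case the plan treats as a misconfiguration.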
 ...06-18-ci-stress-windows-unit-shard-plan.md | 176 ++++++++++++++++++
 1 file changed, 176 insertions(+)
 create mode 100644 .plans/2026-06-18-ci-stress-windows-unit-shard-plan.md

diff --git a/.plans/2026-06-18-ci-stress-windows-unit-shard-plan.md b/.plans/2026-06-18-ci-stress-windows-unit-shard-plan.md
new file mode 100644
index 0000000..1ada89d
--- /dev/null
+++ b/.plans/2026-06-18-ci-stress-windows-unit-shard-plan.md
@@ -0,0 +1,176 @@
+# Subplan: split the Windows unit CI leg into parallel shards
+
+Status: **PROPOSAL (Phase 2) — under review.** A small follow-on to the Bucket-6 CI
+work, building on the `section()`/selector machinery (commit `4ee5899`) and the shared
+`tests/_harness.sh` (`b8e2951`). No implementation until the review converges and Ben
+gives the go.
+
+## Review issues (record at top; do not renumber on resolution)
+*(reviewers: add numbered findings here; resolutions noted inline)*
+
+---
+
+## Motivation
+The `windows-2025 unit` leg is the CI wall-clock bottleneck: a full reduced unit run is
+~4m38s and the Windows leg is ~2× every other leg (interop ~100s, integration ~28s). A
+measured run shows `sys` time > `user` time → the cost is **process-spawn overhead** on
+the 2-core Windows runner (each test spawns `bash $LIB` many times), not compute. So
+running the unit suite as **two parallel shards on two runners ~halves** that leg's
+wall-clock and speeds up the per-PR dev-feedback loop. **CI-only** — local dev runs are
+unaffected (sharding is opt-in via an env var, unset by default).
+
+## Decision context
+- **No branch protection** (Ben, 2026-06-18; single-dev project). So adding a matrix cell
+  has **zero required-context fallout** — no aggregator, no gating concern. `tests.yml`
+  reports per-cell contexts directly.
+- The enabling work is already done: every unit test is a `section "Test N: …"`-gated
+  block, proven individually selectable with no cross-test ordering dependencies (the
+  `GCL_TEST_ONLY` selector work). A shard is just "run the subset of sections assigned to
+  me," which slots into the same `section()` gate.
+
+## Mechanism: `GCL_TEST_SHARD=i/n`, round-robin, inside `section()`
+A new opt-in env var `GCL_TEST_SHARD=<i>/<n>` (e.g. `1/2`) read in `tests/_harness.sh`
+alongside the existing `GCL_TAP`/`GCL_TEST_ONLY`/`GCL_TEST_SWEEP` reads. Implementation
+(~10 lines in `_harness.sh`):
+
+- **A monotonic section index** `SECTION_IDX`, bumped in `section()` on **every** call
+  (every test, in file order), *independent of* whether the test runs. This is the stable
+  shard-assignment key — it does not depend on `GCL_TEST_ONLY`/`GCL_TEST_SWEEP`.
+- **Parse + validate** `GCL_TEST_SHARD` once at suite top: split `i/n`; require `n` a
+  positive integer and `1 ≤ i ≤ n`; on malformed, **bail loudly** (`exit 1`) rather than
+  silently running all/none (same spirit as the zero-match guard).
+- **Shard gate** in `section()`: a test runs iff `(SECTION_IDX-1) % n == (i-1)`
+  (round-robin). Composed with the existing `GCL_TEST_ONLY` gate by **AND** (both must
+  pass to run); `SECTIONS_RUN` still bumps only when the test actually runs.
+
+```sh
+# in _harness.sh, near the GCL_* reads:
+GCL_TEST_SHARD="${GCL_TEST_SHARD:-}"
+SHARD_I=0; SHARD_N=0; SECTION_IDX=0
+if [ -n "$GCL_TEST_SHARD" ]; then
+  case "$GCL_TEST_SHARD" in
+    */*) SHARD_I=${GCL_TEST_SHARD%/*}; SHARD_N=${GCL_TEST_SHARD#*/} ;;
+    *)   echo "Bail out! GCL_TEST_SHARD must be i/n (got '$GCL_TEST_SHARD')" >&2; exit 1 ;;
+  esac
+  case "$SHARD_I$SHARD_N" in *[!0-9]*) echo "Bail out! GCL_TEST_SHARD i/n must be integers" >&2; exit 1 ;; esac
+  if [ "$SHARD_N" -lt 1 ] || [ "$SHARD_I" -lt 1 ] || [ "$SHARD_I" -gt "$SHARD_N" ]; then
+    echo "Bail out! GCL_TEST_SHARD=$GCL_TEST_SHARD out of range (need 1<=i<=n, n>=1)" >&2; exit 1
+  fi
+fi
+
+section() {
+  SECTION_IDX=$((SECTION_IDX + 1))
+  echo "== $1 =="
+  # GCL_TEST_ONLY gate (unchanged)
+  if [ -n "${GCL_TEST_ONLY:-}" ] && ! [[ "$1" =~ $GCL_TEST_ONLY ]]; then return 1; fi
+  # GCL_TEST_SHARD gate (round-robin partition)
+  if [ -n "$GCL_TEST_SHARD" ] && [ $(( (SECTION_IDX - 1) % SHARD_N )) -ne $(( SHARD_I - 1 )) ]; then
+    return 1
+  fi
+  SECTIONS_RUN=$((SECTIONS_RUN + 1)); return 0
+}
+```
+
+## Why round-robin (alternatives rejected)
+- **Round-robin by index (CHOSEN):** auto-balancing and **zero-maintenance** — new tests
+  distribute themselves; nothing to hand-edit. Measured imbalance ~10% at n=2 (well within
+  "roughly halve"). The heavy tests (Test 22 ~34s, 25, 1, 31, 33, 21, 2b, 17d) are
+  scattered through the file, so interleaving balances them naturally.
+- **Contiguous halves:** ~17%+ imbalance (worse, because the heavy tests aren't evenly
+  placed) and still needs the same machinery. Rejected.
+- **Two explicit `GCL_TEST_ONLY` regex lists in the matrix:** works today with no code, but
+  **fails the maintainability bar** — a new test that matches neither list silently runs in
+  *no* shard (a coverage hole). Rejected for the standing config.
+- **Splitting the file:** duplicates shared `clone_fn`/fixtures, doubles shellcheck
+  entries. Rejected.
+
+## Coverage safety (the cardinal risk + the guarantee)
+The risk: a shard scheme that drops a test reads as green → silent coverage hole.
+
+- **Primary guarantee — partition by construction.** Round-robin over a single stable
+  ordering (`SECTION_IDX` in file order) assigns every section index to **exactly one**
+  residue class. So for any `n`, the shards are a true partition: union == full suite, no
+  overlap, no drops — *by construction*, as long as every test goes through `section()`
+  (all 57 do).
+- **Self-contained per-shard guard (belt-and-suspenders).** In the suite verdict (extend
+  `selector_report` in `_harness.sh`), when `GCL_TEST_SHARD` is set, compute the
+  **expected** run-count from the totals the shard already has —
+  `expected = number of k in 1..SECTION_IDX with (k-1)%n == (i-1)` — and assert
+  `SECTIONS_RUN == expected`; **bail loudly** otherwise. This catches a modulo bug or a
+  `section()` regression *within a single shard* (no cross-job artifacts needed). It does
+  not need an unsharded baseline (each shard sees all `SECTION_IDX` section calls).
+- **Existing guards still apply per shard:** the `finish`/`DONE` sentinel (a shard that
+  dies early bails), the `1..$TAPN` plan line (partial-but-correct per shard), and the
+  zero-match-style guard (a shard that legitimately runs 0 sections — only possible when
+  `n` > section count — is a misconfiguration and bails).
+- **Local union proof (build phase, one-time):** run all `n` shards for `n∈{2,3}` and
+  assert the concatenation of run-section labels equals the unsharded run's set, with no
+  duplicates. This validates the implementation before wiring CI. (Belt-and-suspenders on
+  top of the by-construction argument.)
+
+## Interaction with existing machinery
+- **`GCL_TEST_ONLY` + `GCL_TEST_SHARD`:** AND semantics (run iff selected *and* in-shard).
+  Independent gates; `SECTION_IDX` counts all sections regardless, so a sharded selector
+  run is well-defined.
+- **`GCL_TEST_FULL` / reduced:** sharding is orthogonal — it partitions *which* sections
+  run, not *how* each runs. The per-shard expected-count guard uses the shard's own
+  `SECTION_IDX` total, which is identical full vs reduced (same 57 sections), so the guard
+  is mode-independent.
+- **`GCL_TEST_SWEEP` (Axis-A):** orthogonal — a sharded run still sweeps the Axis-A tests
+  *that land in its shard*. Fine for nightly (not sharded; see scope) and harmless if ever
+  combined.
+- **Integration suite:** has no `section()`-wrapped blocks (one indivisible scenario) and
+  already note-and-ignores `GCL_TEST_ONLY`; it must **note-and-ignore `GCL_TEST_SHARD`**
+  the same way (loud stderr note, run the whole scenario). Add `GCL_TEST_SHARD` to that note.
+
+## CI wiring (`.github/workflows/tests.yml`) — Windows unit only
+- Replace the single `{ os: windows-2025, leg: unit, job_timeout: 20 }` matrix cell with
+  **two** cells carrying `shard: 1` / `shard: 2` (same `job_timeout`, or slightly lower
+  since each runs ~half — keep generous to avoid flakiness; a half-run finishes well within
+  20 min).
+- The Unit-suite step sets `GCL_TEST_SHARD: ${{ matrix.shard && format('{0}/2', matrix.shard) || '' }}` (unset on cells without a `shard:` key, so ubuntu/macos `leg: all` and the windows interop-integration cell run the **full** unit suite unchanged).
+- **Artifact name** must include the shard (`test-logs-${{ matrix.os }}-${{ matrix.leg }}${{ matrix.shard && format('-{0}', matrix.shard) || '' }}`) — v4+ rejects duplicate artifact names.
+- The job-name template already includes `leg`; extend it to include the shard so the two
+  Windows-unit jobs are distinguishable in the checks list.
+- **Scope:** Windows unit **only**. Do **not** shard: the fast legs (interop ~100s,
+  integration ~28s, all of ubuntu/macos — not bottlenecks), `nightly.yml` (background, not
+  dev-blocking; optional future), or the **kcov** job (coverage needs the whole suite in
+  one process — sharding would break it).
+- **Runner budget:** today's matrix is ~5 jobs (3 OS legs split into 4 + lint); going to 5
+  test jobs + lint is well under GitHub's concurrency ceiling — no queueing.
+
+## Logging / observability (per engineering practices)
+- Each sharded run logs a single greppable line at the verdict:
+  `GCL_TEST_SHARD=i/n: ran R of T sections (expected E)` — captured in the CI suite log
+  (`tee test-output/unit-suite.log`) and the uploaded artifact, so a future agent can
+  reconstruct which shard ran which tests.
+- The partition guard's failure message is a loud `Bail out! shard i/n ran R, expected E`
+  → the step fails and the artifact (with the per-test `== Test N ==` headers, which
+  `section()` echoes for *every* test, run or skipped) shows exactly which tests landed
+  where. The per-shard CI job name (`… (unit, shard 1)`) makes a red attributable.
+
+## Phasing (implementation)
+1. **`_harness.sh`:** add the `GCL_TEST_SHARD` parse/validate + `SECTION_IDX` + the
+   `section()` shard gate + the `selector_report` expected-count guard. Integration suite:
+   add `GCL_TEST_SHARD` to its note-and-ignore.
+2. **Local union proof:** confirm (a) default (no shard) byte-identical — unit 315/0,
+   interop 141/0; (b) `GCL_TEST_SHARD=1/2` + `=2/2` run disjoint halves whose section sets
+   union to the full 57 and whose assertion counts sum to the unsharded 315; (c) the
+   expected-count guard fires on a deliberately-broken modulo; (d) malformed
+   `GCL_TEST_SHARD` bails; (e) `shellcheck -S style` clean. Also confirm `GCL_TEST_SHARD`
+   composes with `GCL_TEST_ONLY` (AND) and is orthogonal to `GCL_TEST_FULL`/`GCL_TEST_SWEEP`.
+3. **`tests.yml`:** split the windows-unit cell into shard 1/2 (env + artifact name + job
+   name). `actionlint -shellcheck=` clean.
+4. **CI verification:** dispatch `tests.yml`; confirm both Windows-unit shards are green,
+   each runs ~half (~halved wall-clock), artifact names are unique, and the full legs
+   (ubuntu/macos/windows-interop) are unchanged.
+5. Commit incrementally under the lock; this ships with the ci-stress branch and lands on
+   `main` via the same merge PR.
+
+## Out of scope
+- Sharding the interop/integration suites or the nightly/deep-sweep tiers (interop is not
+  the bottleneck; nightly is background). Notable only as a possible future `n>2` or
+  cross-OS extension.
+- Cost-aware (greedy bin-packing) sharding — ~0% imbalance but needs a maintained per-test
+  cost table; round-robin's ~10% is sufficient and maintenance-free.
+- Any product-code change. This is test-harness + CI only.

From 849ed827e69dea16701d56f82be300c17432f299 Mon Sep 17 00:00:00 2001
From: Ben Toner <bentoner@bentoner.com>
Date: Thu, 18 Jun 2026 06:34:44 +1000
Subject: [PATCH 48/76] Plan: fold round-1 review of windows-shard subplan (3
 reviewers)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Round 1 = 2 fresh Claude reviewers + independent Codex. Two blocking defects in
the original snippet, plus two findings Codex caught that the Claude pair missed,
all folded:
- malformed input (empty component / leading-zero octal trap) now rejected by a
  single regex ^([1-9][0-9]*)/([1-9][0-9]*)$;
- GCL_TEST_ONLY and GCL_TEST_SHARD made MUTUALLY EXCLUSIVE (simpler than AND +
  guard-fallback; no real use case combines them);
- GCL_TEST_SHARD parsed LAZILY (first section() call) so the integration suite
  (no section() blocks) never parses/bails — it note-and-ignores;
- union proof + per-shard logging use run-only PASS/FAIL signals, not the
  == Test N == headers (which section() prints before gating, so skipped too);
- guard asserts expected>=1 (catches an empty shard, which needs n>section-count)
  and its rationale reworded (catches a section()-coverage regression, not a
  correlated modulo bug).
Confirm round (fresh reviewer + Codex) pending before declaring converged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
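Note for reviewers (placed below the cut line, so `git am` does not record it in
the commit message): the inputs that the new validation is meant to reject can be
checked in isolation. This is a minimal sketch under stated assumptions, not the
harness code. `valid_shard` is a hypothetical name, and `grep -E` stands in for
the bash `[[ =~ ]]` test used in the plan so that the sketch also runs under plain
`sh`; the regex itself is the one quoted in the plan.

```sh
# Hypothetical sketch: valid_shard is an illustrative name, not harness code.
# grep -E is swapped in for the plan's bash [[ =~ ]] so this runs under sh.
valid_shard() {
  # Shape check: two positive decimals, no leading zero, exactly one slash.
  printf '%s\n' "$1" | grep -Eq '^([1-9][0-9]*)/([1-9][0-9]*)$' || return 1
  # Range check: both halves are now plain decimals, so -le is safe to use.
  [ "${1%/*}" -le "${1#*/}" ]
}

# Classify a few specs, including the malformed ones named in the review.
for spec in 1/2 2/2 1/ /2 / 08/10 1/2/3 3/2 0/2 a/b; do
  if valid_shard "$spec"; then
    echo "accept '$spec'"
  else
    echo "reject '$spec'"
  fi
done
```

Only `1/2` and `2/2` are accepted here. The shape check runs before any numeric
comparison, so an empty or leading-zero component never reaches `[ ... -le ... ]`
or shell arithmetic.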
 ...06-18-ci-stress-windows-unit-shard-plan.md | 311 ++++++++++--------
 1 file changed, 177 insertions(+), 134 deletions(-)

diff --git a/.plans/2026-06-18-ci-stress-windows-unit-shard-plan.md b/.plans/2026-06-18-ci-stress-windows-unit-shard-plan.md
index 1ada89d..dbbe1f2 100644
--- a/.plans/2026-06-18-ci-stress-windows-unit-shard-plan.md
+++ b/.plans/2026-06-18-ci-stress-windows-unit-shard-plan.md
@@ -1,69 +1,106 @@
 # Subplan: split the Windows unit CI leg into parallel shards
 
-Status: **PROPOSAL (Phase 2) — under review.** A small follow-on to the Bucket-6 CI
-work, building on the `section()`/selector machinery (commit `4ee5899`) and the shared
-`tests/_harness.sh` (`b8e2951`). No implementation until the review converges and Ben
-gives the go.
+Status: **PROPOSAL (Phase 2) — round-1 review folded; confirm round pending.** A small
+follow-on to the Bucket-6 CI work, building on the `section()`/selector machinery (commit
+`4ee5899`) and the shared `tests/_harness.sh` (`b8e2951`). No implementation until the review
+converges and Ben gives the go.
 
 ## Review issues (record at top; do not renumber on resolution)
-*(reviewers: add numbered findings here; resolutions noted inline)*
+
+**Round 1 (2026-06-18)** — 2 fresh Claude reviewers (correctness/coverage; CI/simplicity) +
+independent Codex. Dispositions (all FIXED in the body below; a confirm round still follows):
+
+1. **[blocking — FIXED] Malformed `GCL_TEST_SHARD` not rejected → mid-suite crash.** The old
+   combined `case "$SHARD_I$SHARD_N"` digit check passed `1/`/`/2`/`/` (empty component), then
+   `[ "" -lt 1 ]`/`% ""` errored falsy under `set -uo pipefail` (no `set -e`) instead of
+   bailing. Codex also flagged **leading zeros** (`08/10`) as a bash-arithmetic **octal** trap.
+   **Fix:** validate with a single regex `^([1-9][0-9]*)/([1-9][0-9]*)$` (rejects empty,
+   non-digit, leading-zero, extra slashes in one shot), then the `i ≤ n` range check.
+2. **[blocking — FIXED] Guard vs `GCL_TEST_ONLY` composition.** The plan advertised AND
+   semantics, but the exact-count guard ignored the selector → false bail; both Claude-A and
+   Codex flagged it. Codex offered the simpler resolution, adopted: **`GCL_TEST_ONLY` and
+   `GCL_TEST_SHARD` are now mutually exclusive** (bail if both set). There is no real use case
+   for combining them, and it removes the guard-fallback edge case entirely — the exact-count
+   guard then *always* applies in shard mode.
+3. **[blocking (Codex, NEW) — FIXED] Eager parse bails the integration suite.** Parsing/bailing
+   `GCL_TEST_SHARD` at `_harness.sh` source-time runs for *all* suites, including integration
+   (which sources the harness before its note-and-ignore) — so malformed input would `exit 1`
+   integration instead of being ignored. **Fix: parse lazily** on the first `section()` call.
+   Integration never calls `section()`, so it neither parses nor bails; its note-and-ignore
+   just prints a notice if the var is set.
+4. **[non-blocking (Codex, NEW) — FIXED] `== Test N ==` headers are NOT a run-set.**
+   `section()` echoes the header *before* gating, so skipped sections print one too. The union
+   proof / per-shard logging must use **run-only** signals (the `PASS:`/`FAIL:` lines, which a
+   skipped test never emits) — optionally a run-only `RAN:` marker for attribution.
+5. **[FIXED] Guard must assert `expected ≥ 1`** — `n` > section-count (e.g. `58/58`) yields
+   `expected==0` which `0==0` would pass silently green. Also: the *existing* `selector_report`
+   zero-match guard is gated on `GCL_TEST_ONLY` non-empty, so it does NOT cover pure-shard mode
+   — the new guard's `expected ≥ 1` does.
+6. **[FIXED] Unsharded runs stay byte-identical.** All shard logic gated on
+   `[ -n "$GCL_TEST_SHARD" ]`; the interop suite (shares `section()`/`selector_report`, never
+   sharded, on every leg) and unit-on-ubuntu/macos run exactly as today.
+7. **[FIXED] Guard rationale reworded.** It catches a **`section()`-coverage regression** (a
+   test added *outside* the gate), NOT a "modulo bug" (a wrong `%` would be *correlated* between
+   `section()` and the guard). The union proof is a one-time implementation sanity check (n=2),
+   secondary to the by-construction guarantee.
+8. **[FIXED] Job-count prose:** 4 test cells (+`lint`) = 5 jobs → 5 test cells (+`lint`) = 6
+   jobs; well under the concurrency ceiling.
+
+Round-1 verdicts: Reviewer A *needs-changes (1,2)*; Codex *not-sound-yet (1,2,3)*; Reviewer B
+*sound-to-implement*. All folded. **Confirm round (fresh reviewer) pending before declaring
+converged.**
 
 ---
 
 ## Motivation
 The `windows-2025 unit` leg is the CI wall-clock bottleneck: a full reduced unit run is
 ~4m38s and the Windows leg is ~2× every other leg (interop ~100s, integration ~28s). A
-measured run shows `sys` time > `user` time → the cost is **process-spawn overhead** on
-the 2-core Windows runner (each test spawns `bash $LIB` many times), not compute. So
-running the unit suite as **two parallel shards on two runners ~halves** that leg's
-wall-clock and speeds up the per-PR dev-feedback loop. **CI-only** — local dev runs are
-unaffected (sharding is opt-in via an env var, unset by default).
+measured run shows `sys` time > `user` time → the cost is **process-spawn overhead** on the
+2-core Windows runner (each test spawns `bash $LIB` many times), not compute. So running the
+unit suite as **two parallel shards on two runners ~halves** that leg's wall-clock and speeds
+the per-PR dev-feedback loop. **CI-only** — sharding is opt-in via an env var, unset by default,
+so local dev runs are unaffected.
 
 ## Decision context
-- **No branch protection** (Ben, 2026-06-18; single-dev project). So adding a matrix cell
-  has **zero required-context fallout** — no aggregator, no gating concern. `tests.yml`
-  reports per-cell contexts directly.
-- The enabling work is already done: every unit test is a `section "Test N: …"`-gated
-  block, proven individually selectable with no cross-test ordering dependencies (the
-  `GCL_TEST_ONLY` selector work). A shard is just "run the subset of sections assigned to
-  me," which slots into the same `section()` gate.
-
-## Mechanism: `GCL_TEST_SHARD=i/n`, round-robin, inside `section()`
-A new opt-in env var `GCL_TEST_SHARD=<i>/<n>` (e.g. `1/2`) read in `tests/_harness.sh`
-alongside the existing `GCL_TAP`/`GCL_TEST_ONLY`/`GCL_TEST_SWEEP` reads. Implementation
-(~10 lines in `_harness.sh`):
-
-- **A monotonic section index** `SECTION_IDX`, bumped in `section()` on **every** call
-  (every test, in file order), *independent of* whether the test runs. This is the stable
-  shard-assignment key — it does not depend on `GCL_TEST_ONLY`/`GCL_TEST_SWEEP`.
-- **Parse + validate** `GCL_TEST_SHARD` once at suite top: split `i/n`; require `n` a
-  positive integer and `1 ≤ i ≤ n`; on malformed, **bail loudly** (`exit 1`) rather than
-  silently running all/none (same spirit as the zero-match guard).
-- **Shard gate** in `section()`: a test runs iff `(SECTION_IDX-1) % n == (i-1)`
-  (round-robin). Composed with the existing `GCL_TEST_ONLY` gate by **AND** (both must
-  pass to run); `SECTIONS_RUN` still bumps only when the test actually runs.
+- **No branch protection** (Ben, 2026-06-18; single-dev project). So adding a matrix cell has
+  **zero required-context fallout** — no aggregator, no gating concern; `tests.yml` reports
+  per-cell contexts directly.
+- The enabling work is done: every unit test is a `section "Test N: …"`-gated block, proven
+  individually selectable with no cross-test ordering deps (the `GCL_TEST_ONLY` selector work).
+  A shard is just "run the subset of sections assigned to me," which slots into the same gate.
+
+## Mechanism: `GCL_TEST_SHARD=i/n`, round-robin, lazy-parsed in `section()`
+A new opt-in env var `GCL_TEST_SHARD=<i>/<n>` (e.g. `1/2`) handled in `tests/_harness.sh`.
+Key design choices (from review): **lazy parse** (so non-`section()` suites ignore it),
+**mutually exclusive** with `GCL_TEST_ONLY`, **regex-validated** (rejects empty/non-digit/
+leading-zero). ~15 lines:
 
 ```sh
-# in _harness.sh, near the GCL_* reads:
+# declarations near the GCL_* reads (NO eager parse — keeps integration unaffected):
 GCL_TEST_SHARD="${GCL_TEST_SHARD:-}"
-SHARD_I=0; SHARD_N=0; SECTION_IDX=0
-if [ -n "$GCL_TEST_SHARD" ]; then
-  case "$GCL_TEST_SHARD" in
-    */*) SHARD_I=${GCL_TEST_SHARD%/*}; SHARD_N=${GCL_TEST_SHARD#*/} ;;
-    *)   echo "Bail out! GCL_TEST_SHARD must be i/n (got '$GCL_TEST_SHARD')" >&2; exit 1 ;;
-  esac
-  case "$SHARD_I$SHARD_N" in *[!0-9]*) echo "Bail out! GCL_TEST_SHARD i/n must be integers" >&2; exit 1 ;; esac
-  if [ "$SHARD_N" -lt 1 ] || [ "$SHARD_I" -lt 1 ] || [ "$SHARD_I" -gt "$SHARD_N" ]; then
-    echo "Bail out! GCL_TEST_SHARD=$GCL_TEST_SHARD out of range (need 1<=i<=n, n>=1)" >&2; exit 1
+SHARD_I=0; SHARD_N=0; SECTION_IDX=0; SHARD_PARSED=0
+
+_shard_init() {                      # runs once, lazily, on the first section() call
+  SHARD_PARSED=1
+  [ -z "$GCL_TEST_SHARD" ] && return 0
+  if [ -n "${GCL_TEST_ONLY:-}" ]; then           # mutually exclusive (review #2)
+    echo "Bail out! GCL_TEST_ONLY and GCL_TEST_SHARD are mutually exclusive" >&2; exit 1
+  fi
+  if [[ "$GCL_TEST_SHARD" =~ ^([1-9][0-9]*)/([1-9][0-9]*)$ ]]; then   # review #1 (no empty/zero/octal)
+    SHARD_I=${BASH_REMATCH[1]}; SHARD_N=${BASH_REMATCH[2]}
+  else
+    echo "Bail out! GCL_TEST_SHARD must be i/n positive integers (got '$GCL_TEST_SHARD')" >&2; exit 1
   fi
-fi
+  if [ "$SHARD_I" -gt "$SHARD_N" ]; then
+    echo "Bail out! GCL_TEST_SHARD=$GCL_TEST_SHARD out of range (need i<=n)" >&2; exit 1
+  fi
+}
 
 section() {
-  SECTION_IDX=$((SECTION_IDX + 1))
+  [ "$SHARD_PARSED" = 1 ] || _shard_init        # lazy: only suites that call section() parse
+  SECTION_IDX=$((SECTION_IDX + 1))              # file-order index, bumped for EVERY test
   echo "== $1 =="
-  # GCL_TEST_ONLY gate (unchanged)
   if [ -n "${GCL_TEST_ONLY:-}" ] && ! [[ "$1" =~ $GCL_TEST_ONLY ]]; then return 1; fi
-  # GCL_TEST_SHARD gate (round-robin partition)
   if [ -n "$GCL_TEST_SHARD" ] && [ $(( (SECTION_IDX - 1) % SHARD_N )) -ne $(( SHARD_I - 1 )) ]; then
     return 1
   fi
@@ -71,106 +108,112 @@ section() {
 }
 ```
 
+(`SECTION_IDX` bumps unconditionally in file order — independent of `GCL_TEST_ONLY`/
+`GCL_TEST_SWEEP`/`GCL_TEST_FULL` — so it is the stable shard-assignment key.)
+
 ## Why round-robin (alternatives rejected)
-- **Round-robin by index (CHOSEN):** auto-balancing and **zero-maintenance** — new tests
-  distribute themselves; nothing to hand-edit. Measured imbalance ~10% at n=2 (well within
-  "roughly halve"). The heavy tests (Test 22 ~34s, 25, 1, 31, 33, 21, 2b, 17d) are
-  scattered through the file, so interleaving balances them naturally.
-- **Contiguous halves:** ~17%+ imbalance (worse, because the heavy tests aren't evenly
-  placed) and still needs the same machinery. Rejected.
-- **Two explicit `GCL_TEST_ONLY` regex lists in the matrix:** works today with no code, but
-  **fails the maintainability bar** — a new test that matches neither list silently runs in
-  *no* shard (a coverage hole). Rejected for the standing config.
-- **Splitting the file:** duplicates shared `clone_fn`/fixtures, doubles shellcheck
-  entries. Rejected.
+- **Round-robin by index (CHOSEN):** auto-balancing, **zero-maintenance** — new tests
+  distribute themselves. Measured imbalance ~10% at n=2 (well within "roughly halve"); the
+  heavy tests (Test 22 ~34s, 25, 1, 31, 33, 21, 2b, 17d) are scattered, so interleaving
+  balances them.
+- **Contiguous halves:** ~17%+ imbalance (heavy tests unevenly placed), same machinery. Rejected.
+- **Two explicit `GCL_TEST_ONLY` regex lists in the matrix:** a new test matching neither list
+  silently runs in no shard (coverage hole). Rejected.
+- **Splitting the file:** duplicates shared `clone_fn`/fixtures, doubles shellcheck entries. Rejected.
 
 ## Coverage safety (the cardinal risk + the guarantee)
-The risk: a shard scheme that drops a test reads as green → silent coverage hole.
+The risk: a shard scheme that drops a test reads green → silent coverage hole.
 
-- **Primary guarantee — partition by construction.** Round-robin over a single stable
-  ordering (`SECTION_IDX` in file order) assigns every section index to **exactly one**
-  residue class. So for any `n`, the shards are a true partition: union == full suite, no
-  overlap, no drops — *by construction*, as long as every test goes through `section()`
-  (all 57 do).
+- **Primary guarantee — partition by construction.** Round-robin over the single stable
+  `SECTION_IDX` ordering assigns every section index to **exactly one** residue class. For any
+  `n`, the shards are a true partition (union == full, no overlap, no drops) — by construction,
+  as long as every test goes through `section()` (all 57 do).
 - **Self-contained per-shard guard (belt-and-suspenders).** In the suite verdict (extend
-  `selector_report` in `_harness.sh`), when `GCL_TEST_SHARD` is set, compute the
-  **expected** run-count from the totals the shard already has —
-  `expected = number of k in 1..SECTION_IDX with (k-1)%n == (i-1)` — and assert
-  `SECTIONS_RUN == expected`; **bail loudly** otherwise. This catches a modulo bug or a
-  `section()` regression *within a single shard* (no cross-job artifacts needed). It does
-  not need an unsharded baseline (each shard sees all `SECTION_IDX` section calls).
-- **Existing guards still apply per shard:** the `finish`/`DONE` sentinel (a shard that
-  dies early bails), the `1..$TAPN` plan line (partial-but-correct per shard), and the
-  zero-match-style guard (a shard that legitimately runs 0 sections — only possible when
-  `n` > section count — is a misconfiguration and bails).
-- **Local union proof (build phase, one-time):** run all `n` shards for `n∈{2,3}` and
-  assert the concatenation of run-section labels equals the unsharded run's set, with no
-  duplicates. This validates the implementation before wiring CI. (Belt-and-suspenders on
-  top of the by-construction argument.)
+  `selector_report`), when `GCL_TEST_SHARD` is set, compute the **expected** run-count the
+  shard already has all the info for — `expected = #{k in 1..SECTION_IDX : (k-1)%n == (i-1)}`
+  — and assert `SECTIONS_RUN == expected` **and `expected ≥ 1`**; **bail loudly** otherwise.
+  Because `GCL_TEST_ONLY` and `GCL_TEST_SHARD` are mutually exclusive, this exact-count assert
+  *always* applies in shard mode (no selector-composition fallback needed). **What it actually
+  catches:** a **`section()`-coverage regression** — a test added *outside* the `section()`
+  gate, so it stops bumping `SECTION_IDX` (NOT a "modulo bug": a wrong `%` would be *correlated*
+  between `section()` and this guard, which recomputes the same arithmetic). No cross-job
+  artifacts, no unsharded baseline (each shard sees all `SECTION_IDX` calls). The `expected ≥ 1`
+  clause also catches the `n` > section-count misconfiguration (e.g. `58/58`).
+- **Existing guards still apply per shard:** the `finish`/`DONE` sentinel (a shard that dies
+  early bails) and the `1..$TAPN` plan line (partial-but-correct per shard). Note the *existing*
+  `selector_report` zero-match guard is gated on `GCL_TEST_ONLY` non-empty, so it does NOT fire
+  in pure-shard mode — the new `expected ≥ 1` clause is what covers an empty shard.
+- **Local union proof (one-time implementation sanity check; secondary to the by-construction
+  guarantee).** Once during implementation, run `GCL_TEST_SHARD=1/2` and `=2/2` and assert their
+  **`PASS:`/`FAIL:` line sets** (run-only — a *skipped* test emits none; the `== Test N ==`
+  headers do NOT work here because `section()` prints them before gating) union to the full
+  unsharded set with no duplicates. Not a standing CI step.
 
 ## Interaction with existing machinery
-- **`GCL_TEST_ONLY` + `GCL_TEST_SHARD`:** AND semantics (run iff selected *and* in-shard).
-  Independent gates; `SECTION_IDX` counts all sections regardless, so a sharded selector
-  run is well-defined.
-- **`GCL_TEST_FULL` / reduced:** sharding is orthogonal — it partitions *which* sections
-  run, not *how* each runs. The per-shard expected-count guard uses the shard's own
-  `SECTION_IDX` total, which is identical full vs reduced (same 57 sections), so the guard
-  is mode-independent.
-- **`GCL_TEST_SWEEP` (Axis-A):** orthogonal — a sharded run still sweeps the Axis-A tests
-  *that land in its shard*. Fine for nightly (not sharded; see scope) and harmless if ever
-  combined.
-- **Integration suite:** has no `section()`-wrapped blocks (one indivisible scenario) and
-  already note-and-ignores `GCL_TEST_ONLY`; it must **note-and-ignore `GCL_TEST_SHARD`**
-  the same way (loud stderr note, run the whole scenario). Add `GCL_TEST_SHARD` to that note.
+- **`GCL_TEST_ONLY` vs `GCL_TEST_SHARD`: mutually exclusive** (bail if both set). No real use
+  case combines them, and exclusivity removes the guard's hardest edge case.
+- **`GCL_TEST_FULL` / reduced:** orthogonal — sharding partitions *which* sections run, not
+  *how*. The `SECTION_IDX` total (57) is identical full vs reduced, so the partition + guard are
+  mode-independent.
+- **`GCL_TEST_SWEEP` (Axis-A):** orthogonal — a sharded run still sweeps the Axis-A tests in its
+  shard. (Not combined in CI; harmless if ever combined.)
+- **Integration suite:** has no `section()`-wrapped blocks (one indivisible scenario). With
+  **lazy parse**, it never calls `section()` → never parses/bails `GCL_TEST_SHARD`. It should
+  **note-and-ignore** the var the same way it does `GCL_TEST_ONLY` (loud stderr note if set,
+  *without* parsing), using the harness-initialized `GCL_TEST_SHARD` (pre-set `""` so no
+  `set -u` trap).
+- **Unsharded runs stay byte-identical.** All shard logic is gated on `[ -n "$GCL_TEST_SHARD" ]`,
+  so the interop suite (shares the helpers, never sharded — every leg) and unit-on-ubuntu/macos
+  (`leg: all`, full) run exactly as today.
 
 ## CI wiring (`.github/workflows/tests.yml`) — Windows unit only
-- Replace the single `{ os: windows-2025, leg: unit, job_timeout: 20 }` matrix cell with
-  **two** cells carrying `shard: 1` / `shard: 2` (same `job_timeout`, or slightly lower
-  since each runs ~half — keep generous to avoid flakiness; a half-run finishes well within
-  20 min).
-- The Unit-suite step sets `GCL_TEST_SHARD: ${{ matrix.shard && format('{0}/2', matrix.shard) || '' }}` (unset on cells without a `shard:` key, so ubuntu/macos `leg: all` and the windows interop-integration cell run the **full** unit suite unchanged).
-- **Artifact name** must include the shard (`test-logs-${{ matrix.os }}-${{ matrix.leg }}${{ matrix.shard && format('-{0}', matrix.shard) || '' }}`) — v4+ rejects duplicate artifact names.
-- The job-name template already includes `leg`; extend it to include the shard so the two
-  Windows-unit jobs are distinguishable in the checks list.
-- **Scope:** Windows unit **only**. Do **not** shard: the fast legs (interop ~100s,
-  integration ~28s, all of ubuntu/macos — not bottlenecks), `nightly.yml` (background, not
-  dev-blocking; optional future), or the **kcov** job (coverage needs the whole suite in
-  one process — sharding would break it).
-- **Runner budget:** today's matrix is ~5 jobs (3 OS legs split into 4 + lint); going to 5
-  test jobs + lint is well under GitHub's concurrency ceiling — no queueing.
+- Replace the single `{ os: windows-2025, leg: unit, job_timeout: 20 }` cell with **two** cells
+  carrying `shard: 1` / `shard: 2` (same `job_timeout`; keep the existing step timeout — a
+  half-run finishes well within it; generous-over-tight matches the repo's "backstop only"
+  philosophy and avoids flakiness).
+- The Unit-suite step sets `GCL_TEST_SHARD: ${{ matrix.shard && format('{0}/2', matrix.shard) || '' }}` — yields `1/2`/`2/2` on the shard cells and `''` (effectively unset, per the harness's `${GCL_TEST_SHARD:-}`) on every other cell, so ubuntu/macos `leg: all` and the windows interop-integration cell run the **full** unit suite unchanged. (`/2` is hardcoded; the harness is `n`-generic, so only this one CI string ties to 2 — easy to extend later. NB GHA treats `0` as falsy, so keep shard indices 1-based.)
+- **Artifact name** gains the shard: `test-logs-${{ matrix.os }}-${{ matrix.leg }}${{ matrix.shard && format('-{0}', matrix.shard) || '' }}` → `…-unit-1`/`…-unit-2` (v4+ rejects duplicate names); other cells' names are byte-identical to today.
+- The job-name template (already includes `leg`) gains the shard so the two unit jobs are distinguishable.
+- **Scope:** Windows unit **only**. Do NOT shard the fast legs (interop, integration, all of
+  ubuntu/macos), `nightly.yml` (background, not dev-blocking; optional future), or the **kcov**
+  job (coverage needs the whole suite in one process — sharding would break it).
+- **Runner budget:** 4 test cells + `lint` = 5 jobs today → 5 test cells + `lint` = 6 jobs;
+  well under GitHub's concurrency ceiling — no queueing.
 
 ## Logging / observability (per engineering practices)
-- Each sharded run logs a single greppable line at the verdict:
-  `GCL_TEST_SHARD=i/n: ran R of T sections (expected E)` — captured in the CI suite log
-  (`tee test-output/unit-suite.log`) and the uploaded artifact, so a future agent can
-  reconstruct which shard ran which tests.
-- The partition guard's failure message is a loud `Bail out! shard i/n ran R, expected E`
-  → the step fails and the artifact (with the per-test `== Test N ==` headers, which
-  `section()` echoes for *every* test, run or skipped) shows exactly which tests landed
-  where. The per-shard CI job name (`… (unit, shard 1)`) makes a red attributable.
+- Each sharded run logs one greppable verdict line: `GCL_TEST_SHARD=i/n: ran R of T sections
+  (expected E)` — captured in the CI suite log (`tee … unit-suite.log`) and the uploaded
+  artifact, so a future agent can reconstruct which shard ran what.
+- For per-test attribution, `section()` emits a **run-only** marker (e.g. `RAN: <label>` inside
+  the `SECTIONS_RUN++` branch) — needed because the `== Test N ==` headers print for *skipped*
+  tests too (header echoed before gating), so they are not a run-set.
+- The guard's failure is a loud `Bail out! shard i/n ran R, expected E` → the step fails and
+  the per-shard CI job name (`… (unit, shard 1)`) makes the red attributable.
 
 ## Phasing (implementation)
-1. **`_harness.sh`:** add the `GCL_TEST_SHARD` parse/validate + `SECTION_IDX` + the
-   `section()` shard gate + the `selector_report` expected-count guard. Integration suite:
-   add `GCL_TEST_SHARD` to its note-and-ignore.
-2. **Local union proof:** confirm (a) default (no shard) byte-identical — unit 315/0,
-   interop 141/0; (b) `GCL_TEST_SHARD=1/2` + `=2/2` run disjoint halves whose section sets
-   union to the full 57 and whose assertion counts sum to the unsharded 315; (c) the
-   expected-count guard fires on a deliberately-broken modulo; (d) malformed
-   `GCL_TEST_SHARD` bails; (e) `shellcheck -S style` clean. Also confirm `GCL_TEST_SHARD`
-   composes with `GCL_TEST_ONLY` (AND) and is orthogonal to `GCL_TEST_FULL`/`GCL_TEST_SWEEP`.
-3. **`tests.yml`:** split the windows-unit cell into shard 1/2 (env + artifact name + job
-   name). `actionlint -shellcheck=` clean.
-4. **CI verification:** dispatch `tests.yml`; confirm both Windows-unit shards are green,
-   each runs ~half (~halved wall-clock), artifact names are unique, and the full legs
-   (ubuntu/macos/windows-interop) are unchanged.
-5. Commit incrementally under the lock; this ships with the ci-stress branch and lands on
-   `main` via the same merge PR.
+1. **`_harness.sh`:** add the lazy `_shard_init` (regex-validated, mutually-exclusive with
+   `GCL_TEST_ONLY`) + `SECTION_IDX` + the `section()` shard gate + the run-only `RAN:` marker +
+   the `selector_report` expected-count/`expected ≥ 1` guard. Integration suite: add the
+   `GCL_TEST_SHARD` note-and-ignore (no parse).
+2. **Local proof:** confirm (a) default (no shard) byte-identical — unit 315/0, interop 141/0
+   (current counts); (b) `GCL_TEST_SHARD=1/2` + `=2/2` run disjoint halves whose **`PASS:`/`FAIL:`
+   sets** union to the unsharded set (sum to 315) with no dup, and whose section counts sum to
+   57; (c) the guard fires when a test is moved *outside* a `section()` gate, and bails when
+   `expected==0` (`58/58`); (d) malformed `GCL_TEST_SHARD` — `1/0`, `3/2`, `a/b`, `1/`, `/2`,
+   `2/3/4`, `08/10` — each bails cleanly, and `GCL_TEST_ONLY`+`GCL_TEST_SHARD` together bails;
+   `''` is a no-op; (e) integration with `GCL_TEST_SHARD` set prints the ignore note and runs
+   all 12; (f) `shellcheck -S style` clean.
+3. **`tests.yml`:** split the windows-unit cell into shard 1/2 (env + artifact name + job name).
+   `actionlint -shellcheck=` clean.
+4. **CI verification:** dispatch `tests.yml`; confirm both Windows-unit shards green, each ~half
+   wall-clock (~halved leg), artifact names unique, and the full legs (ubuntu/macos/
+   windows-interop) unchanged.
+5. Commit incrementally under the lock; ships with `ci-stress` and lands on `main` via the same
+   merge PR.
 
 ## Out of scope
-- Sharding the interop/integration suites or the nightly/deep-sweep tiers (interop is not
-  the bottleneck; nightly is background). Notable only as a possible future `n>2` or
-  cross-OS extension.
-- Cost-aware (greedy bin-packing) sharding — ~0% imbalance but needs a maintained per-test
-  cost table; round-robin's ~10% is sufficient and maintenance-free.
-- Any product-code change. This is test-harness + CI only.
+- Sharding the interop/integration suites or the nightly/deep-sweep tiers; `n>2` or cross-OS
+  extension (the harness is already `n`-generic — only the CI string is 2-bound).
+- Cost-aware (greedy) sharding — ~0% imbalance but needs a maintained per-test cost table;
+  round-robin's ~10% is sufficient and maintenance-free.
+- Any product-code change. Test-harness + CI only.

From e277de3253d1cec7e19fe43ebe54a6a856c63cba Mon Sep 17 00:00:00 2001
From: Ben Toner <bentoner@bentoner.com>
Date: Thu, 18 Jun 2026 06:43:58 +1000
Subject: [PATCH 49/76] Plan: fold round-2 (confirm) review of windows-shard
 subplan + kcov note
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Round 2 = fresh Claude (sound) + independent Codex (2 accuracy defects). Folded:
- union-proof run-line set corrected to ^(PASS:|FAIL:|PASS\[env\]:|WARN\[env-relaxed\]:)
  (ok_envelope/bad_envelope emit the bracketed forms; a bare PASS:/FAIL: grep
  undercounts the 315) — or use GCL_TAP=1;
- the per-shard guard reframed HONESTLY: it does NOT catch a test added outside a
  section() gate (that bumps neither counter); its value is the empty-shard
  (expected>=1) check + a cheap modulo cross-check. The union proof's no-duplicate
  check is what catches an ungated test (runs in both shards);
- explicit selector_report shard-guard snippet (gated on GCL_TEST_SHARD set);
- RAN: attribution marker gated on GCL_TEST_SHARD set (unsharded stays identical);
- kcov-interaction note (Ben asked): coverage job runs the full suite UNSHARDED;
  sharding code is inert when GCL_TEST_SHARD unset -> no interaction.
Mechanism verified sound by 4 reviewers across 2 rounds; final Codex spot-confirm running.
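
A minimal sketch of the union proof's two checks, for illustration only. The
function names (run_lines, shard_duplicates, union_matches) and the log-file
arguments are hypothetical and not part of the harness. It assumes every
assertion message is unique across the suite; two tests that print an
identical run-line would show up as a false duplicate.

```sh
# Run-line pattern from the plan: only assertion lines, never the
# "== Test N ==" headers (those print for skipped tests too).
RUN_LINE_RE='^(PASS:|FAIL:|PASS\[env\]:|WARN\[env-relaxed\]:)'

# Print the run-lines of one suite log (hypothetical helper).
run_lines() { grep -E "$RUN_LINE_RE" "$1" || true; }

# Print every run-line present in BOTH shard logs; empty output = disjoint.
# An ungated test runs in both shards, so it would be listed here.
shard_duplicates() {
  { run_lines "$1" | sort -u; run_lines "$2" | sort -u; } | sort | uniq -d
}

# Succeed iff shard1 + shard2 run-lines equal the unsharded log's set.
union_matches() {
  [ "$( { run_lines "$1"; run_lines "$2"; } | sort -u )" = "$(run_lines "$3" | sort -u)" ]
}
```

Both checks are needed: union_matches alone would still pass if a test ran in
both shards, which is exactly the ungated-test case.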

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 ...06-18-ci-stress-windows-unit-shard-plan.md | 107 ++++++++++++++----
 1 file changed, 82 insertions(+), 25 deletions(-)

diff --git a/.plans/2026-06-18-ci-stress-windows-unit-shard-plan.md b/.plans/2026-06-18-ci-stress-windows-unit-shard-plan.md
index dbbe1f2..7f5ed91 100644
--- a/.plans/2026-06-18-ci-stress-windows-unit-shard-plan.md
+++ b/.plans/2026-06-18-ci-stress-windows-unit-shard-plan.md
@@ -1,6 +1,6 @@
 # Subplan: split the Windows unit CI leg into parallel shards
 
-Status: **PROPOSAL (Phase 2) — round-1 review folded; confirm round pending.** A small
+Status: **PROPOSAL (Phase 2) — rounds 1-2 review folded; final spot-confirm pending.** A small
 follow-on to the Bucket-6 CI work, building on the `section()`/selector machinery (commit
 `4ee5899`) and the shared `tests/_harness.sh` (`b8e2951`). No implementation until the review
 converges and Ben gives the go.
@@ -47,8 +47,31 @@ independent Codex. Dispositions (all FIXED in the body below; a confirm round st
    jobs; well under the concurrency ceiling.
 
 Round-1 verdicts: Reviewer A *needs-changes (1,2)*; Codex *not-sound-yet (1,2,3)*; Reviewer B
-*sound-to-implement*. All folded. **Confirm round (fresh reviewer) pending before declaring
-converged.**
+*sound-to-implement*. All folded.
+
+**Round 2 — confirm (2026-06-18)** — fresh Claude (*sound-to-implement*) + independent Codex
+(*not-sound-yet*: 2 accuracy defects). All FIXED below:
+
+9. **[Codex — FIXED] Union-proof run-line set understated.** `PASS:`/`FAIL:` alone undercounts:
+   `ok_envelope` emits `PASS[env]:` and (relaxed) `bad_envelope` emits `WARN[env-relaxed]:`,
+   which the "sum to 315" relies on. Run-line set is
+   `^(PASS:|FAIL:|PASS\[env\]:|WARN\[env-relaxed\]:)` (or use `GCL_TAP=1`). (Verification runs
+   already used the full regex; this corrects the prose.)
+10. **[Codex — FIXED] Guard does NOT catch an ungated test (was overclaimed).** A test *outside*
+    a `section()` block bumps neither `SECTION_IDX` nor `SECTIONS_RUN`, so the guard stays
+    balanced. Reframed honestly: the guard's value is the **empty-shard `expected ≥ 1`** check
+    + a cheap modulo cross-check (otherwise near-tautological in shard mode). **An ungated test
+    is caught by the union proof's no-duplicate check** (it runs in *both* shards).
+11. **[Claude — FIXED] `RAN:` marker gated on `GCL_TEST_SHARD` set** (shard logic; an
+    unconditional emit would break unsharded byte-identicality).
+12. **[Claude — FIXED] Explicit `selector_report` shard-guard snippet** added (gated on
+    `[ -n "$GCL_TEST_SHARD" ]`).
+
+Plus a **kcov-interaction** note (Ben asked): the coverage job runs the full suite unsharded;
+the sharding code is inert when `GCL_TEST_SHARD` is unset — no interaction.
+
+**Convergence:** the mechanism is verified sound by 4 reviewers across 2 rounds; round-2 fixes
+are validation-method/accuracy corrections. A final Codex spot-confirm follows before the go.
 
 ---
 
@@ -111,6 +134,23 @@ section() {
 (`SECTION_IDX` bumps unconditionally in file order — independent of `GCL_TEST_ONLY`/
 `GCL_TEST_SWEEP`/`GCL_TEST_FULL` — so it is the stable shard-assignment key.)
 
+The verdict helper `selector_report` (already called by the unit + interop suites) gains a
+shard branch, **gated so unsharded runs are untouched** (no `% SHARD_N=0`):
+
+```sh
+# in selector_report, when sharding is active:
+if [ -n "$GCL_TEST_SHARD" ]; then
+  exp=0; k=1
+  while [ "$k" -le "$SECTION_IDX" ]; do
+    [ $(( (k-1) % SHARD_N )) -eq $(( SHARD_I - 1 )) ] && exp=$((exp+1)); k=$((k+1))
+  done
+  echo "GCL_TEST_SHARD=$SHARD_I/$SHARD_N: ran $SECTIONS_RUN of $SECTION_IDX sections (expected $exp)"
+  if [ "$SECTIONS_RUN" -ne "$exp" ] || [ "$exp" -lt 1 ]; then
+    echo "Bail out! shard $SHARD_I/$SHARD_N ran $SECTIONS_RUN, expected $exp" >&2; exit 1
+  fi
+fi
+```
+
 ## Why round-robin (alternatives rejected)
 - **Round-robin by index (CHOSEN):** auto-balancing, **zero-maintenance** — new tests
   distribute themselves. Measured imbalance ~10% at n=2 (well within "roughly halve"); the
@@ -129,25 +169,31 @@ The risk: a shard scheme that drops a test reads green → silent coverage hole.
   `n`, the shards are a true partition (union == full, no overlap, no drops) — by construction,
   as long as every test goes through `section()` (all 57 do).
 - **Self-contained per-shard guard (belt-and-suspenders).** In the suite verdict (extend
-  `selector_report`), when `GCL_TEST_SHARD` is set, compute the **expected** run-count the
-  shard already has all the info for — `expected = #{k in 1..SECTION_IDX : (k-1)%n == (i-1)}`
-  — and assert `SECTIONS_RUN == expected` **and `expected ≥ 1`**; **bail loudly** otherwise.
-  Because `GCL_TEST_ONLY` and `GCL_TEST_SHARD` are mutually exclusive, this exact-count assert
-  *always* applies in shard mode (no selector-composition fallback needed). **What it actually
-  catches:** a **`section()`-coverage regression** — a test added *outside* the `section()`
-  gate, so it stops bumping `SECTION_IDX` (NOT a "modulo bug": a wrong `%` would be *correlated*
-  between `section()` and this guard, which recomputes the same arithmetic). No cross-job
-  artifacts, no unsharded baseline (each shard sees all `SECTION_IDX` calls). The `expected ≥ 1`
-  clause also catches the `n` > section-count misconfiguration (e.g. `58/58`).
+  `selector_report`), when `GCL_TEST_SHARD` is set, compute
+  `expected = #{k in 1..SECTION_IDX : (k-1)%n == (i-1)}` and assert `SECTIONS_RUN == expected`
+  **and `expected ≥ 1`**; **bail loudly** otherwise. (Mutual exclusion of `GCL_TEST_ONLY`/
+  `GCL_TEST_SHARD` makes this always-valid in shard mode.) **What it actually catches, stated
+  honestly:** the high-value part is the **empty-shard misconfiguration** (`expected==0` when
+  `n` > section-count, e.g. `58/58`) via the `expected ≥ 1` clause; plus a cheap cross-check
+  that the gate's modulo and the verdict's modulo agree. It is otherwise **near-tautological**
+  in pure-shard mode (`SECTIONS_RUN` and `expected` both derive from the same `SECTION_IDX` via
+  the same arithmetic), and it does **NOT** catch a test added *outside* a `section()` block
+  (that bumps neither counter, so the accounting stays balanced) — that case is caught by the
+  union proof's no-duplicate check below. No cross-job artifacts, no unsharded baseline.
 - **Existing guards still apply per shard:** the `finish`/`DONE` sentinel (a shard that dies
   early bails) and the `1..$TAPN` plan line (partial-but-correct per shard). Note the *existing*
   `selector_report` zero-match guard is gated on `GCL_TEST_ONLY` non-empty, so it does NOT fire
   in pure-shard mode — the new `expected ≥ 1` clause is what covers an empty shard.
 - **Local union proof (one-time implementation sanity check; secondary to the by-construction
-  guarantee).** Once during implementation, run `GCL_TEST_SHARD=1/2` and `=2/2` and assert their
-  **`PASS:`/`FAIL:` line sets** (run-only — a *skipped* test emits none; the `== Test N ==`
-  headers do NOT work here because `section()` prints them before gating) union to the full
-  unsharded set with no duplicates. Not a standing CI step.
+  guarantee — and the only thing that catches an ungated test).** Once during implementation,
+  run `GCL_TEST_SHARD=1/2` and `=2/2` and assert their **run-line sets** union to the full
+  unsharded set **with no duplicates**. The run-line set is the assertion lines (run-only — a
+  *skipped* test emits none; the `== Test N ==` headers do NOT work, since `section()` prints
+  them before gating): `^(PASS:|FAIL:|PASS\[env\]:|WARN\[env-relaxed\]:)` — note `ok_envelope`
+  emits `PASS[env]:` and relaxed `bad_envelope` emits `WARN[env-relaxed]:`, so a bare
+  `PASS:`/`FAIL:` grep would undercount the 315 — or simply diff `GCL_TAP=1` TAP counts. The
+  **no-duplicate** half is what catches a test accidentally left *outside* a `section()` gate
+  (it would run in both shards → appear twice). Not a standing CI step.
 
 ## Interaction with existing machinery
 - **`GCL_TEST_ONLY` vs `GCL_TEST_SHARD`: mutually exclusive** (bail if both set). No real use
@@ -175,8 +221,15 @@ The risk: a shard scheme that drops a test reads green → silent coverage hole.
 - **Artifact name** gains the shard: `test-logs-${{ matrix.os }}-${{ matrix.leg }}${{ matrix.shard && format('-{0}', matrix.shard) || '' }}` → `…-unit-1`/`…-unit-2` (v4+ rejects duplicate names); other cells' names are byte-identical to today.
 - The job-name template (already includes `leg`) gains the shard so the two unit jobs are distinguishable.
 - **Scope:** Windows unit **only**. Do NOT shard the fast legs (interop, integration, all of
-  ubuntu/macos), `nightly.yml` (background, not dev-blocking; optional future), or the **kcov**
-  job (coverage needs the whole suite in one process — sharding would break it).
+  ubuntu/macos) or `nightly.yml` (background, not dev-blocking; optional future).
+- **kcov coverage is orthogonal — leave it whole.** The kcov job (`nightly.yml`, Linux) runs
+  the **full unit suite unsharded** in one process, because line coverage of `git-commit-lock.sh`
+  is only meaningful measured across the whole suite in one run, and it's gated on the 0.80
+  floor. It never sets `GCL_TEST_SHARD`, and the sharding code is **inert when `GCL_TEST_SHARD`
+  is unset** (lazy parse → no shard gate), so the kcov run is byte-identical to today — no
+  interaction with this change. (If one ever wanted coverage *from* sharded runs, kcov can merge
+  per-shard output dirs, but that's strictly more machinery for no gain over the single whole
+  run — so we don't.)
 - **Runner budget:** 4 test cells + `lint` = 5 jobs today → 5 test cells + `lint` = 6 jobs;
   well under GitHub's concurrency ceiling — no queueing.
 
@@ -184,9 +237,11 @@ The risk: a shard scheme that drops a test reads green → silent coverage hole.
 - Each sharded run logs one greppable verdict line: `GCL_TEST_SHARD=i/n: ran R of T sections
   (expected E)` — captured in the CI suite log (`tee … unit-suite.log`) and the uploaded
   artifact, so a future agent can reconstruct which shard ran what.
-- For per-test attribution, `section()` emits a **run-only** marker (e.g. `RAN: <label>` inside
-  the `SECTIONS_RUN++` branch) — needed because the `== Test N ==` headers print for *skipped*
-  tests too (header echoed before gating), so they are not a run-set.
+- For per-test attribution in a sharded run, `section()` emits a **run-only** marker
+  (e.g. `RAN: <label>`) **only when `GCL_TEST_SHARD` is set** (it is shard logic — an
+  unconditional emit would add lines to unsharded runs and break byte-identicality) — needed
+  because the `== Test N ==` headers print for *skipped* tests too (echoed before gating), so
+  they are not a run-set.
 - The guard's failure is a loud `Bail out! shard i/n ran R, expected E` → the step fails and
   the per-shard CI job name (`… (unit, shard 1)`) makes the red attributable.
 
@@ -196,9 +251,11 @@ The risk: a shard scheme that drops a test reads green → silent coverage hole.
    the `selector_report` expected-count/`expected ≥ 1` guard. Integration suite: add the
    `GCL_TEST_SHARD` note-and-ignore (no parse).
 2. **Local proof:** confirm (a) default (no shard) byte-identical — unit 315/0, interop 141/0
-   (current counts); (b) `GCL_TEST_SHARD=1/2` + `=2/2` run disjoint halves whose **`PASS:`/`FAIL:`
-   sets** union to the unsharded set (sum to 315) with no dup, and whose section counts sum to
-   57; (c) the guard fires when a test is moved *outside* a `section()` gate, and bails when
+   (current counts); (b) `GCL_TEST_SHARD=1/2` + `=2/2` run disjoint halves whose **run-line
+   sets** (`^(PASS:|FAIL:|PASS\[env\]:|WARN\[env-relaxed\]:)`) union to the unsharded set
+   (sum to 315) with no dup, and whose section counts sum to 57; (c) the **union proof's
+   no-duplicate check** catches a test left *outside* a `section()` gate (it runs in both
+   shards) — the guard does NOT (an ungated test bumps neither counter); the guard bails when
    `expected==0` (`58/58`); (d) malformed `GCL_TEST_SHARD` — `1/0`, `3/2`, `a/b`, `1/`, `/2`,
    `2/3/4`, `08/10` — each bails cleanly, and `GCL_TEST_ONLY`+`GCL_TEST_SHARD` together bails;
    `''` is a no-op; (e) integration with `GCL_TEST_SHARD` set prints the ignore note and runs

From 5095645f00a397f34ca5252795f09c35e5c749b7 Mon Sep 17 00:00:00 2001
From: Ben Toner <bentoner@bentoner.com>
Date: Thu, 18 Jun 2026 06:47:27 +1000
Subject: [PATCH 50/76] Plan: windows-shard subplan CONVERGED (round-3 Codex
 clean, sound-to-implement)

Final independent Codex spot-confirm returned no findings. Mechanism verified
sound across 3 review rounds (Claude x3 + Codex x3). Status -> converged; ready
for implementation on Ben's go.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 .plans/2026-06-18-ci-stress-windows-unit-shard-plan.md | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/.plans/2026-06-18-ci-stress-windows-unit-shard-plan.md b/.plans/2026-06-18-ci-stress-windows-unit-shard-plan.md
index 7f5ed91..075233f 100644
--- a/.plans/2026-06-18-ci-stress-windows-unit-shard-plan.md
+++ b/.plans/2026-06-18-ci-stress-windows-unit-shard-plan.md
@@ -1,6 +1,7 @@
 # Subplan: split the Windows unit CI leg into parallel shards
 
-Status: **PROPOSAL (Phase 2) — rounds 1-2 review folded; final spot-confirm pending.** A small
+Status: **CONVERGED (Phase 2) — 3 review rounds (Claude ×3 + Codex ×3); final Codex clean,
+"sound-to-implement". Ready for Ben's go on implementation.** A small
 follow-on to the Bucket-6 CI work, building on the `section()`/selector machinery (commit
 `4ee5899`) and the shared `tests/_harness.sh` (`b8e2951`). No implementation until the review
 converges and Ben gives the go.
@@ -70,8 +71,11 @@ Round-1 verdicts: Reviewer A *needs-changes (1,2)*; Codex *not-sound-yet (1,2,3)
 Plus a **kcov-interaction** note (Ben asked): the coverage job runs the full suite unsharded;
 the sharding code is inert when `GCL_TEST_SHARD` is unset — no interaction.
 
-**Convergence:** the mechanism is verified sound by 4 reviewers across 2 rounds; round-2 fixes
-are validation-method/accuracy corrections. A final Codex spot-confirm follows before the go.
+**Convergence (REACHED):** round 3 — a final independent Codex spot-confirm — returned **no
+findings, "sound-to-implement"** (verified the run-line regex, the honest guard framing, the
+gated `selector_report` snippet's bash-correctness under `set -uo pipefail`, the shard-only
+`RAN:` marker, and the kcov note). The mechanism is verified sound across 3 rounds (Claude ×3 +
+Codex ×3). **Ready for implementation on Ben's go.**
 
 ---
 

From a01a8e3511f1abcc2446075357175930bbca5b86 Mon Sep 17 00:00:00 2001
From: Ben Toner <bentoner@bentoner.com>
Date: Thu, 18 Jun 2026 08:26:54 +1000
Subject: [PATCH 51/76] Test sharding: GCL_TEST_SHARD=i/n round-robin gate in
 the harness
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Implements the converged windows-unit-shard subplan (Phase 1 — mechanism). Opt-in
GCL_TEST_SHARD=<i>/<n> splits a section()-gated suite into n round-robin shards by
file-order section index:
- tests/_harness.sh: lazy _shard_init (parsed on first section() call, so suites
  that never call section() neither parse nor bail) — mutually exclusive with
  GCL_TEST_ONLY, validated by ^([1-9][0-9]*)/([1-9][0-9]*)$ (rejects empty /
  non-digit / leading-zero octal trap / extra slash), then i<=n. section() bumps
  SECTION_IDX for every test before gating (stable shard key), adds the residue
  gate, and emits a run-only RAN: marker ONLY in shard mode. selector_report gains
  a gated guard: recompute expected from the same residue mapping, log one
  greppable verdict line, bail if SECTIONS_RUN != expected or expected < 1 (catches
  an empty shard, e.g. n>section-count). Unset/empty => no-op, so unsharded runs
  are byte-identical.
- integration suite note-and-ignores GCL_TEST_SHARD (one indivisible scenario, no
  section() blocks).
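
The validation rules above can be sketched standalone as follows. This is an
illustration, not the harness code: the harness uses a single bash regex with
BASH_REMATCH, whereas this sketch uses POSIX case patterns, and the names
valid_component and parse_shard are hypothetical. It only covers a non-empty
spec; the empty/unset no-op is the caller's job.

```sh
# Accept only a positive decimal integer with no leading zero.
valid_component() {
  case "$1" in
    ''|0*|*[!0-9]*) return 1 ;;   # empty, leading zero (or zero), non-digit
  esac
  return 0
}

# parse_shard SPEC: print "i n" and succeed for a well-formed i/n with i <= n;
# print nothing and fail otherwise.
parse_shard() (
  spec=$1
  case "$spec" in
    */*/*) exit 1 ;;              # extra slash, e.g. 2/3/4
    */*) ;;                       # exactly one slash
    *) exit 1 ;;                  # no slash at all
  esac
  shard_i=${spec%/*}
  shard_n=${spec#*/}
  if ! valid_component "$shard_i" || ! valid_component "$shard_n"; then
    exit 1
  fi
  [ "$shard_i" -le "$shard_n" ] || exit 1   # range check: i <= n
  echo "$shard_i $shard_n"
)
```

Rejecting a leading zero up front matters because a value such as 08 would
otherwise reach shell arithmetic, where it is read as an invalid octal number.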

Partition is guaranteed by construction (round-robin over one stable ordering) +
a one-time local union proof. Validated locally: unsharded unit 315/0 + interop
141/0 byte-identical; shard 1/2 + 2/2 disjoint, union == unsharded (no dup),
148+167=315 run-lines / 29+28=57 sections; all malformed inputs bail; integration
prints the note + runs all 12; shellcheck -S style clean.
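
As a cross-check on the section counts above, the guard's expected run-count
can be reproduced outside the harness. A minimal sketch, assuming the same
one-based residue mapping; expected_loop mirrors the loop in selector_report,
while expected_closed is an equivalent closed form that is NOT in the harness
and is shown only for illustration (both names are hypothetical).

```sh
# expected_loop TOTAL I N: count k in 1..TOTAL with (k-1) % N == I-1.
expected_loop() (
  total=$1; idx=$2; n=$3
  count=0; k=1
  while [ "$k" -le "$total" ]; do
    if [ $(( (k - 1) % n )) -eq $(( idx - 1 )) ]; then
      count=$((count + 1))
    fi
    k=$((k + 1))
  done
  echo "$count"
)

# expected_closed TOTAL I N: same count without the loop. The selected
# indices are I, I+N, I+2N, ... so there are floor((TOTAL - I) / N) + 1 of
# them when TOTAL >= I, and none otherwise (the empty-shard case).
expected_closed() (
  total=$1; idx=$2; n=$3
  if [ "$total" -lt "$idx" ]; then
    echo 0
  else
    echo $(( (total - idx) / n + 1 ))
  fi
)
```

For 57 sections this gives 29 for shard 1/2 and 28 for shard 2/2, matching the
29+28=57 split, and 0 for 58/58, which is the case the expected < 1 bail
covers.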

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 tests/_harness.sh                         | 74 ++++++++++++++++++++---
 tests/git-commit-lock.integration.test.sh |  6 ++
 2 files changed, 73 insertions(+), 7 deletions(-)

diff --git a/tests/_harness.sh b/tests/_harness.sh
index 88b344c..34f8483 100644
--- a/tests/_harness.sh
+++ b/tests/_harness.sh
@@ -34,6 +34,14 @@
 PASS=0; FAIL=0; TAPN=0; DONE=0; SECTIONS_RUN=0
 GCL_TAP="${GCL_TAP:-0}"           # CI sets GCL_TAP=1 for machine-readable TAP13 output
 GCL_TEST_ONLY="${GCL_TEST_ONLY:-}"  # if set, run ONLY test blocks whose label REGEX-matches (single-test selector)
+# Opt-in CI shard selector GCL_TEST_SHARD=<i>/<n> (round-robin over file-order
+# section index). Parsed LAZILY on the first section() call (see _shard_init) so
+# non-section() suites (integration) just note-and-ignore it; unset/empty is a
+# no-op so all unsharded runs stay byte-identical. SHARD_I/SHARD_N hold the parsed
+# pair; SECTION_IDX is the stable file-order shard key; SHARD_PARSED is the
+# once-only parse guard.
+GCL_TEST_SHARD="${GCL_TEST_SHARD:-}"
+SHARD_I=0; SHARD_N=0; SECTION_IDX=0; SHARD_PARSED=0
 
 # Axis-A waiter-count sweep (Bucket 6). GCL_TEST_SWEEP=1 (nightly/deep CI) widens
 # the fan-out/contention tests over several waiter counts to wring more coverage
@@ -61,17 +69,54 @@ ok()  { PASS=$((PASS+1)); TAPN=$((TAPN+1)); echo "PASS: $*"
 bad() { FAIL=$((FAIL+1)); TAPN=$((TAPN+1)); echo "FAIL: $*"
         [ "$GCL_TAP" = 1 ] && echo "not ok $TAPN - $*"; return 0; }
 
+# Lazy one-time parse+validate of GCL_TEST_SHARD. Called from section() (NOT at
+# source time) so suites that never call section() (integration) neither parse
+# nor bail — they only note-and-ignore the var. An empty/unset var is a no-op.
+# Validation bails LOUDLY (exit 1) on any malformed input so a typo can never
+# silently run a partial suite green:
+#   * GCL_TEST_ONLY + GCL_TEST_SHARD are mutually exclusive (no real combined use
+#     case, and exclusivity makes the per-shard count guard always valid).
+#   * The single regex ^([1-9][0-9]*)/([1-9][0-9]*)$ rejects empty components,
+#     non-digits, leading zeros (a bash-arithmetic octal trap), and extra slashes
+#     in one shot; BASH_REMATCH then yields i and n.
+#   * i <= n range check.
+_shard_init() {
+  SHARD_PARSED=1
+  [ -z "$GCL_TEST_SHARD" ] && return 0
+  if [ -n "${GCL_TEST_ONLY:-}" ]; then
+    echo "Bail out! GCL_TEST_ONLY and GCL_TEST_SHARD are mutually exclusive" >&2; exit 1
+  fi
+  if [[ "$GCL_TEST_SHARD" =~ ^([1-9][0-9]*)/([1-9][0-9]*)$ ]]; then
+    SHARD_I=${BASH_REMATCH[1]}; SHARD_N=${BASH_REMATCH[2]}
+  else
+    echo "Bail out! GCL_TEST_SHARD must be i/n positive integers (got '$GCL_TEST_SHARD')" >&2; exit 1
+  fi
+  if [ "$SHARD_I" -gt "$SHARD_N" ]; then
+    echo "Bail out! GCL_TEST_SHARD=$GCL_TEST_SHARD out of range (need i<=n)" >&2; exit 1
+  fi
+}
+
 # Per-test gate: echoes the block header (so a normal run is byte-unchanged) and
-# returns success iff GCL_TEST_ONLY is unset/empty OR its regex matches the label.
-# Each top-level `== Test N: <desc> ==` block is wrapped `if section "..."; then ... fi`.
-# Bumps SECTIONS_RUN on a match so the verdict's zero-match guard (selector_report)
-# can catch a selector regex that matched nothing.
+# returns success iff the test is selected. Each top-level `== Test N: <desc> ==`
+# block is wrapped `if section "..."; then ... fi`. SECTION_IDX bumps for EVERY
+# section in file order (before any gating) — it is the stable shard-assignment
+# key, independent of GCL_TEST_ONLY/SWEEP/FULL. SECTIONS_RUN bumps only when the
+# block actually runs, so the verdict guards (selector_report) can catch a
+# zero-match selector or a miscounted shard. Two gates compose: the GCL_TEST_ONLY
+# regex selector, then the GCL_TEST_SHARD round-robin (mutually exclusive, so at
+# most one is active). Both are no-ops when their var is empty, so unsharded /
+# unselected runs are byte-identical (the RAN: marker is emitted ONLY in shard
+# mode for the same reason).
 section() {
+  [ "$SHARD_PARSED" = 1 ] || _shard_init        # lazy: only section()-using suites parse
+  SECTION_IDX=$((SECTION_IDX + 1))              # file-order index, bumped for EVERY test before gating
   echo "== $1 =="
-  if [ -z "${GCL_TEST_ONLY:-}" ] || [[ "$1" =~ $GCL_TEST_ONLY ]]; then
-    SECTIONS_RUN=$((SECTIONS_RUN + 1)); return 0
+  if [ -n "${GCL_TEST_ONLY:-}" ] && ! [[ "$1" =~ $GCL_TEST_ONLY ]]; then return 1; fi
+  if [ -n "$GCL_TEST_SHARD" ] && [ $(( (SECTION_IDX - 1) % SHARD_N )) -ne $(( SHARD_I - 1 )) ]; then
+    return 1
   fi
-  return 1
+  [ -n "$GCL_TEST_SHARD" ] && echo "RAN: $1"    # run-only attribution marker (shard mode only — keeps unsharded byte-identical)
+  SECTIONS_RUN=$((SECTIONS_RUN + 1)); return 0
 }
 
 # Sentinel: the suite reaching its end sets DONE=1. If the EXIT trap fires with
@@ -104,6 +149,21 @@ selector_report() {
     exit 1
   fi
   [ -n "${GCL_TEST_ONLY:-}" ] && echo "selector GCL_TEST_ONLY=\"$GCL_TEST_ONLY\" ran $SECTIONS_RUN test block(s)"
+  # Shard mode (gated so unsharded runs are byte-identical and never hit % SHARD_N=0):
+  # recompute the expected run-count from the SAME one-based residue mapping the
+  # section() gate uses, log one greppable verdict line, and bail loudly if the
+  # actual run-count disagrees OR the shard is empty (expected < 1 — e.g. n >
+  # section-count like 58/58 — which would otherwise pass vacuously green).
+  if [ -n "$GCL_TEST_SHARD" ]; then
+    local exp=0 k=1
+    while [ "$k" -le "$SECTION_IDX" ]; do
+      [ $(( (k - 1) % SHARD_N )) -eq $(( SHARD_I - 1 )) ] && exp=$((exp + 1)); k=$((k + 1))
+    done
+    echo "GCL_TEST_SHARD=$SHARD_I/$SHARD_N: ran $SECTIONS_RUN of $SECTION_IDX sections (expected $exp)"
+    if [ "$SECTIONS_RUN" -ne "$exp" ] || [ "$exp" -lt 1 ]; then
+      echo "Bail out! shard $SHARD_I/$SHARD_N ran $SECTIONS_RUN, expected $exp" >&2; exit 1
+    fi
+  fi
   return 0
 }
 
diff --git a/tests/git-commit-lock.integration.test.sh b/tests/git-commit-lock.integration.test.sh
index 49badf8..7086493 100644
--- a/tests/git-commit-lock.integration.test.sh
+++ b/tests/git-commit-lock.integration.test.sh
@@ -108,6 +108,12 @@ LK_ENV=(AGENT_LOCK_STALE_SECS=300 AGENT_LOCK_POLL_SECS=0.2 AGENT_LOCK_MAX_WAIT=2
 if [ -n "$GCL_TEST_ONLY" ]; then
     echo "NOTE: integration suite ignores GCL_TEST_ONLY=\"$GCL_TEST_ONLY\" — Tests 1-3 are one indivisible scenario (shared repo + ALL_IDS audit); running the whole suite." >&2
 fi
+# Same for the CI shard selector (read by _harness.sh, pre-set to "" so no set -u
+# trap). This suite calls no section(), so the harness never parses/bails the var
+# — note it loudly and run the whole indivisible scenario as normal.
+if [ -n "$GCL_TEST_SHARD" ]; then
+    echo "NOTE: integration suite ignores GCL_TEST_SHARD=\"$GCL_TEST_SHARD\" — Tests 1-3 are one indivisible scenario (no section() blocks to shard); running the whole suite." >&2
+fi
 
 # --- scratch repo ------------------------------------------------------------
 REPO="$WORK/repo"; OUTD="$WORK/out"; NOHOOKS="$WORK/nohooks"

From 2de66ffeaf020ad12774060eb5eb7a6841d864e3 Mon Sep 17 00:00:00 2001
From: Ben Toner <bentoner@bentoner.com>
Date: Thu, 18 Jun 2026 08:26:56 +1000
Subject: [PATCH 52/76] CI: split the windows-unit leg into 2 round-robin
 shards (tests.yml)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The windows-2025 unit leg is the CI wall-clock bottleneck (~2x the others;
process-spawn overhead on the 2-core runner). Split its one matrix cell into two
(shard 1/2) running in parallel to ~halve that leg:
- Unit-suite step sets GCL_TEST_SHARD to matrix.shard && format('{0}/2', matrix.shard) || ''.
  Only the two windows-unit cells carry matrix.shard, so ubuntu/macos (leg:all) and windows
  interop-integration get '' and run the FULL unit suite unchanged.
- artifact name + job-name template gain a shard suffix (upload-artifact v4+ rejects
  duplicate artifact names). No other cell / SHA-pin / step touched.
Coverage (kcov, nightly, Linux) stays whole + unsharded — orthogonal.
actionlint clean; cross-platform CI verification follows.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 .github/workflows/tests.yml | 17 +++++++++++++----
 1 file changed, 13 insertions(+), 4 deletions(-)
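
Reviewer note (below the three-dash line, so not part of the commit message): a
minimal bash sketch of what the GCL_TEST_SHARD expression in this patch yields for
each matrix cell. shard_env and shard_valid are hypothetical helpers, not code from
the repo. The sketch assumes the usual GitHub Actions `a && b || c` behaviour, where
a missing matrix.shard (or 0) is falsy. The regex is the one the harness uses to
validate the value.

```sh
#!/usr/bin/env bash
# Hypothetical stand-in for: ${{ matrix.shard && format('{0}/2', matrix.shard) || '' }}
# An empty argument models a matrix cell that has no `shard` key.
shard_env() {
  local shard="${1:-}"
  if [ -n "$shard" ] && [ "$shard" != 0 ]; then
    printf '%s/2' "$shard"
  fi
}

# Same shape check the harness applies: i/n, positive, no leading zeros, i <= n.
shard_valid() {
  [[ "$1" =~ ^([1-9][0-9]*)/([1-9][0-9]*)$ ]] && [ "${BASH_REMATCH[1]}" -le "${BASH_REMATCH[2]}" ]
}

for cell in 1 2 ""; do
  value="$(shard_env "$cell")"
  if [ -z "$value" ]; then
    echo "shard='${cell}' -> GCL_TEST_SHARD='' (full unit suite)"
  elif shard_valid "$value"; then
    echo "shard='${cell}' -> GCL_TEST_SHARD='${value}'"
  fi
done
```

The empty result is what keeps the ubuntu/macos and interop-integration cells on the
full, unsharded suite.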

diff --git a/.github/workflows/tests.yml b/.github/workflows/tests.yml
index 8ebffcc..d3db772 100644
--- a/.github/workflows/tests.yml
+++ b/.github/workflows/tests.yml
@@ -27,20 +27,23 @@ permissions:
 
 jobs:
   test:
-    name: ${{ matrix.os }}${{ matrix.leg != 'all' && format(' ({0})', matrix.leg) || '' }}
+    name: ${{ matrix.os }}${{ matrix.leg != 'all' && format(' ({0})', matrix.leg) || '' }}${{ matrix.shard && format(', shard {0}', matrix.shard) || '' }}
     runs-on: ${{ matrix.os }}
     strategy:
       fail-fast: false               # an OS-specific failure is the signal we want; let the others finish
       matrix:
         # Windows splits into two parallel jobs — the bash-only unit suite is the
         # wall-clock bottleneck there (~309s vs interop 100s + integration 28s;
-        # process-spawn overhead, not the PowerShell engines). Suites must NOT run
+        # process-spawn overhead, not the PowerShell engines). The unit suite is
+        # further split into two round-robin shards (GCL_TEST_SHARD=i/2) on two
+        # runners to ~halve that leg's wall-clock. Suites must NOT run
         # concurrently inside one runner: they're timing-sensitive on 2-core
         # runners. POSIX legs are fast enough to stay single-job.
         include:
           - { os: ubuntu-24.04, leg: all, job_timeout: 35 }
           - { os: macos-15, leg: all, job_timeout: 35 }
-          - { os: windows-2025, leg: unit, job_timeout: 20 }
+          - { os: windows-2025, leg: unit, shard: 1, job_timeout: 20 }
+          - { os: windows-2025, leg: unit, shard: 2, job_timeout: 20 }
           - { os: windows-2025, leg: interop-integration, job_timeout: 22 }
     timeout-minutes: ${{ matrix.job_timeout }}   # backstop only: sum of the leg's step budgets + upload headroom
     defaults:
@@ -75,6 +78,12 @@ jobs:
         timeout-minutes: ${{ matrix.os == 'windows-2025' && 15 || 10 }}   # a step timeout FAILS the step (not the job), so the upload step reliably runs; sized from run 27325978197 + one internal MAX_WAIT hang
         env:
           GCL_TEST_PRESERVE_DIR: ${{ github.workspace }}/test-output/failed-work/unit
+          # Only the windows-unit cells carry matrix.shard → 1/2 or 2/2; every other
+          # cell (ubuntu/macos leg:all, windows interop-integration) gets '' (unset
+          # per the harness's ${GCL_TEST_SHARD:-}), so they run the FULL unit suite.
+          # /2 is hardcoded here (harness is n-generic); GHA treats 0 as falsy so
+          # shard indices stay 1-based.
+          GCL_TEST_SHARD: ${{ matrix.shard && format('{0}/2', matrix.shard) || '' }}
         run: |
           mkdir -p test-output
           bash tests/git-commit-lock.test.sh 2>&1 | tee test-output/unit-suite.log
@@ -101,7 +110,7 @@ jobs:
         if: ${{ failure() || cancelled() }}   # failure() covers step timeouts (they fail the step); cancelled() is best-effort cover for manual cancels / the job-level backstop
         uses: actions/upload-artifact@043fb46d1a93c77aae656e7c1c64a875d1fc6a0a   # v7.0.1, SHA-pinned
         with:
-          name: test-logs-${{ matrix.os }}-${{ matrix.leg }}   # leg suffix keeps the two windows jobs' artifact names unique (v4+ uploads reject duplicates)
+          name: test-logs-${{ matrix.os }}-${{ matrix.leg }}${{ matrix.shard && format('-{0}', matrix.shard) || '' }}   # leg (+ shard) suffix keeps every windows job's artifact name unique (v4+ uploads reject duplicates); other cells unchanged
           path: test-output/
           include-hidden-files: true   # the integration suite's key diagnostics (lock log, repo state) live under the scratch repo's .git/ — hidden paths are excluded by default since upload-artifact v4.4; contents are suite-generated, no secrets
           if-no-files-found: warn

From 5e881c8cd6ae3c1110da9c19bd382f8bbb59a27a Mon Sep 17 00:00:00 2001
From: Ben Toner <bentoner@bentoner.com>
Date: Thu, 18 Jun 2026 08:34:19 +1000
Subject: [PATCH 53/76] Plan: record windows-shard CI result (33% faster;
 balance underdelivered)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

CI run 27723744798 all green, both shards pass. Overall CI 360s->242s (~33%);
windows-unit no longer the 2x outlier. But round-robin balance was 242 vs 99
(~2.4x), not the planned ~10% — the estimate used reduced-mode timings while CI
runs full mode (the 8x25 canary clusters in shard 1). Recorded the data + the
lesson (estimate balance from the mode CI actually runs) + the decision rationale:
accept as-is (macos 210s is the floor, so re-balancing would gain only ~32s at the
cost of maintaining the cost table the plan rejected). Mechanism is correct + green
regardless.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 ...06-18-ci-stress-windows-unit-shard-plan.md | 22 +++++++++++++++++++
 1 file changed, 22 insertions(+)
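
Reviewer note (not part of the commit message): the 29+28=57 section split quoted in
the Results text follows directly from the round-robin residue mapping. A minimal
sketch, assuming 57 sections in file order; rr_count is a hypothetical helper that
mirrors the harness test `(k - 1) % n == i - 1`, not code from the repo.

```sh
#!/usr/bin/env bash
# Count how many of `total` file-order sections shard i of n owns under the
# one-based round-robin mapping: section k belongs to shard ((k - 1) % n) + 1.
rr_count() {
  local i=$1 n=$2 total=$3 k=1 count=0
  while [ "$k" -le "$total" ]; do
    if [ $(( (k - 1) % n )) -eq $(( i - 1 )) ]; then count=$((count + 1)); fi
    k=$((k + 1))
  done
  echo "$count"
}

total=57
s1=$(rr_count 1 2 "$total")
s2=$(rr_count 2 2 "$total")
echo "shard 1/2 owns $s1 sections, shard 2/2 owns $s2, sum $((s1 + s2))"
# A shard index past the section count owns nothing: the empty-shard case the guard rejects.
echo "shard 58/58 owns $(rr_count 58 58 "$total") sections"
```

Section counts balance almost perfectly here; the wall-clock imbalance comes from
per-section cost, which this mapping never looks at.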

diff --git a/.plans/2026-06-18-ci-stress-windows-unit-shard-plan.md b/.plans/2026-06-18-ci-stress-windows-unit-shard-plan.md
index 075233f..6a41924 100644
--- a/.plans/2026-06-18-ci-stress-windows-unit-shard-plan.md
+++ b/.plans/2026-06-18-ci-stress-windows-unit-shard-plan.md
@@ -272,6 +272,28 @@ The risk: a shard scheme that drops a test reads green → silent coverage hole.
 5. Commit incrementally under the lock; ships with `ci-stress` and lands on `main` via the same
    merge PR.
 
+## Results (CI verification, 2026-06-18 — run 27723744798, all green)
+Implemented in `a01a8e3` (harness mechanism) + `2de66ff` (tests.yml). Local proof passed
+(unsharded byte-identical 315/141; shards disjoint, union==unsharded no-dup, 148+167=315 /
+29+28=57; malformed bails; lint clean). CI cross-platform run **succeeded**, both shards green:
+
+| | windows-unit | macos | ubuntu | win-interop | overall (slowest) |
+|---|---|---|---|---|---|
+| **before** (`27716080146`) | **360s** | 194s | 182s | 140s | **360s** |
+| **after** (`27723744798`) | shard1 **242s** ‖ shard2 **99s** | 210s | 181s | 142s | **242s** |
+
+- **Overall CI 360s → 242s (≈33% faster); windows-unit is no longer the ~2× outlier** (242s ≈
+  macos 210s). The stated goal (windows-unit "twice as long as everything else") is met.
+- **Balance was poor: 242 vs 99 (≈2.4×), NOT the planned ~10%.** Root cause: the ~10% estimate
+  used **reduced-mode** per-section timings, but CI runs **full mode** (`GCL_TEST_FULL=1`), where
+  the full-only 8×25 canary (Test 1 → index 1 → shard 1) and other heavies cluster in shard 1.
+  **Lesson: estimate shard balance from the mode CI actually runs.**
+- **Decision — accept as-is (recommended):** a perfectly balanced split (~170/170) could not beat
+  **macos's 210s**, which becomes the floor, so re-balancing would gain only ~32s more (242→210)
+  while reintroducing the maintained cost-table this plan deliberately rejected. The 118s win is
+  already captured; round-robin's imbalance is an acceptable, zero-maintenance trade. (Mechanism
+  is correct + green regardless of balance.)
+
 ## Out of scope
 - Sharding the interop/integration suites or the nightly/deep-sweep tiers; `n>2` or cross-OS
   extension (the harness is already `n`-generic — only the CI string is 2-bound).

From f45e5d2aaf7765e575ca16c6edee5a8b1f2b1eb5 Mon Sep 17 00:00:00 2001
From: Ben Toner <bentoner@bentoner.com>
Date: Thu, 18 Jun 2026 08:46:08 +1000
Subject: [PATCH 54/76] Plan: fixed shard split = "Test 1" vs "not Test 1"
 (measured, balanced)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Round-robin balanced poorly (242 vs 99) because ONE test dominates: Test 1 (the
8x25 FULL-width canary) is ~151s, ~half the 309s suite; the other 56 tests sum to
~158s (measured from run 27723744798's job-log timestamps). So the balanced fixed
split is simply Test 1 alone on one shard, the rest on the other (~151 vs ~158).
Windows-unit -> ~174s (the slower shard plus ~16s job overhead), so macOS's 210s
becomes the overall floor (242->210).

Replaces the round-robin assignment with a static _shard_of() in _harness.sh
(case "Test 1:"* -> shard 1, else shard 2) — a fixed split derived once from a
measurement, NOT a maintained cost table. New tests default to shard 2; the
per-shard wall-clock log surfaces drift for occasional re-tune. n stays 2 (Test
1's 151s is an irreducible floor). Ben endorsed the Test-1-vs-rest split.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 ...2026-06-18-ci-stress-shard-balance-plan.md | 123 ++++++++++++++++++
 1 file changed, 123 insertions(+)
 create mode 100644 .plans/2026-06-18-ci-stress-shard-balance-plan.md
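
Reviewer note (not part of the commit message): the plan added by this patch derives
per-test durations from the gaps between timestamped `== Test N ==` header lines. A
minimal sketch of that delta computation follows. It assumes each log line starts
with an ISO-8601 UTC timestamp, then a space, then the message; real CI log exports
may carry extra leading columns that would need stripping first. section_durations
and sample_log are hypothetical names, and the sample log is invented for
illustration, not data from the real run.

```sh
#!/usr/bin/env bash
# Print "<seconds> <label>" for each section, from timestamped lines on stdin.
# Assumed line shape: 2026-06-18T08:30:00.0000000Z == Test 1: label ==
# Only the time of day is parsed, so a log that crosses midnight UTC needs more care.
section_durations() {
  awk '
    / == Test .* ==$/ {
      split(substr($1, 12, 8), t, ":")             # HH:MM:SS out of the timestamp
      now = t[1] * 3600 + t[2] * 60 + t[3]
      if (label != "") print now - start, label    # close the previous section
      start = now
      label = $0
      sub(/^[^ ]+ == /, "", label); sub(/ ==$/, "", label)
    }
    { last = $1 }                                   # remember the final timestamp seen
    END {
      if (label != "") {
        split(substr(last, 12, 8), t, ":")
        finish = t[1] * 3600 + t[2] * 60 + t[3]
        print finish - start, label                 # last section ends at the last line
      }
    }
  '
}

# Invented sample log, for illustration only.
sample_log() {
  cat <<'EOF'
2026-06-18T08:30:00.0000000Z == Test 1: canary ==
2026-06-18T08:30:05.0000000Z PASS: something
2026-06-18T08:31:30.0000000Z == Test 2: next ==
2026-06-18T08:31:36.0000000Z PASS: something else
2026-06-18T08:31:39.0000000Z == Test 3: last ==
2026-06-18T08:31:46.0000000Z PASS: done
EOF
}

sample_log | section_durations | sort -rn
```

Sorting the output numerically puts the dominant test first, which is the view that
exposes a single test owning about half the suite.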

diff --git a/.plans/2026-06-18-ci-stress-shard-balance-plan.md b/.plans/2026-06-18-ci-stress-shard-balance-plan.md
new file mode 100644
index 0000000..1cebd44
--- /dev/null
+++ b/.plans/2026-06-18-ci-stress-shard-balance-plan.md
@@ -0,0 +1,123 @@
+# Subplan: balance the Windows-unit shards with a fixed (measured) split
+
+Status: **ENDORSED by Ben (2026-06-18) — split = "Test 1" vs "not Test 1"; implementing.**
+The change is a tiny assignment swap on the already-3-round-reviewed shard mechanism, so the
+local proof + CI run are the gates (no separate review rounds). Follow-on to
+`2026-06-18-ci-stress-windows-unit-shard-plan.md` (the shard *mechanism*, shipped in `a01a8e3`
++ `2de66ff`). That used naive round-robin-by-index and balanced poorly in practice (242s vs
+99s). This plan replaces the *assignment* with a **fixed, measured split** — still a static
+deterministic assignment (no live cost-table maintenance, per Ben), but chosen to balance.
+Landing is gated on the local proof and a green CI run (see Phasing).
+
+## Review issues (record at top; do not renumber on resolution)
+*(reviewers: add numbered findings here)*
+
+---
+
+## The finding that drives the design (measured, not estimated)
+Per-test **full-mode Windows** durations, parsed from the green CI run `27723744798`'s job-log
+timestamps (each `== Test N ==` header line is timestamped; the delta to the next header is that
+test's duration; combined across both shard logs). Method is reproducible from the run log via
+`gh run view <id> --log`; raw table in `.agent-testing/shard-timing/` (gitignored).
+
+- **Test 1 (the 8×25 FULL-width concurrency canary) = ~151s — about HALF of the entire ~309s
+  suite.** It is one indivisible test.
+- The other 56 tests sum to ~158s; the next-largest are Test 22 (~20s), Test 2b (~12s), Test 17
+  (~9s), Test 33 (~8s), then a long tail ≤7s.
+- So the round-robin imbalance (shard1 odd = 226s vs shard2 even = 83s of test time) was **not**
+  "heavies scattered on odd indices" — it was **one dominant test (the canary, index 1 → shard
+  1)** plus the rest happening to land light on shard 2.
+
+**Consequences:**
+- A balanced n=2 split is nearly trivial: **canary alone on one shard (~151s), the other 56
+  tests on the other (~158s).** ~151 vs ~158 — well balanced.
+- Windows-unit leg wall-clock → ~**174s**, set by the slower shard (158 + ~16s job overhead; the
+  canary shard is ~167s). That is **below macOS's 210s**, so macOS becomes the overall CI floor:
+  **overall 242s → ~210s** (the ~32s the previous plan predicted, now explained by measurement).
+- **More shards don't help:** Test 1's 151s is an irreducible per-shard floor; n=3 still yields a
+  ~151s shard. So **n stays 2**.
+
+## Approach: a fixed, measured assignment (NOT round-robin, NOT a live cost table)
+Replace the round-robin gate with a **static per-test→shard assignment**, derived **once** from
+the measured costs by greedy LPT (longest-processing-time: sort tests desc, put each on the
+currently-lighter shard) and **frozen** into a small hard-coded list in `tests/_harness.sh`.
+
+For the current data the greedy result is essentially **shard 1 = {Test 1}; shard 2 = {all
+others}** (151 vs 158). Because shard 1's membership is tiny, encode it as "shard-1 label
+prefixes; everything else → shard 2":
+
+```sh
+# n=2 fixed split (measured 2026-06-18; re-tune if the per-shard wall-clock drifts — see below).
+# Test 1 (the FULL-width canary) is ~half the suite, so it gets its own shard.
+_shard_of() {   # echoes the shard (1..n) that owns the test label "$1"
+  case "$1" in
+    "Test 1:"*) echo 1 ;;
+    *)          echo 2 ;;
+  esac
+}
+```
+
+`section()` (still gated on `GCL_TEST_SHARD=i/n`, lazy-parsed, mutually exclusive with
+`GCL_TEST_ONLY` — all unchanged from the shipped mechanism) runs a block iff
+`[ "$(_shard_of "$1")" = "$SHARD_I" ]` instead of the round-robin residue test. The CI interface
+(`tests.yml` matrix passing `1/2` and `2/2`) is unchanged.
+
+### Why this is a "fixed split," not the rejected "cost-aware split"
+- It is a **static, hand-frozen assignment** set from **one** measurement — no per-run cost
+  computation, no maintained cost table, no dynamic bin-packing in the harness.
+- New/unknown tests fall to the **default shard (2)**: they always run (never dropped), and a
+  new *light* test just nudges shard 2 (already ~7s heavier than shard 1, but ~36s under the
+  macOS floor). Only a new *heavy* test (or the canary changing) would need a re-tune, which the
+  drift log surfaces (below). That is occasional manual re-tuning, not continuous cost tracking.
+
+## Coverage-safety
+- **Partition by construction:** `_shard_of` is a total function returning exactly one shard per
+  label, so every test belongs to exactly one shard — union == full suite, no overlap, for any
+  membership list. (Same guarantee the round-robin had, via a different total function.)
+- **Empty-shard guard** (keep): in shard mode, `selector_report` bails if `SECTIONS_RUN < 1`
+  (a misconfigured shard with no members). The exact-count guard is dropped as near-tautological
+  (it recomputes `_shard_of`, the same function the gate uses — established in the mechanism
+  plan's round-2 review).
+- **One-time union proof** (the real partition check): run `GCL_TEST_SHARD=1/2` + `=2/2`, assert
+  their run-line sets (`^(PASS:|FAIL:|PASS\[env\]:|WARN\[env-relaxed\]:)`) union to the unsharded
+  set with **no duplicates** — catches any assignment bug (a label in both/neither shard).
+
+## Maintenance / drift (the low-maintenance story)
+- Each sharded run already logs `GCL_TEST_SHARD=i/n: ran R of T sections` and the CI job
+  duration is visible. If the two shards' wall-clock skews materially (say >25%), re-measure
+  (parse a fresh run log the same way) and adjust the `_shard_of` list. Expected cadence:
+  rarely — only when the canary's cost changes or a new ≥~30s test lands.
+- The measurement method is recorded above so a successor can regenerate the cost table.
+
+## Phasing (implementation)
+1. **`tests/_harness.sh`:** replace the round-robin residue gate in `section()` with
+   `_shard_of`; add the static `_shard_of` (current measured assignment). Drop the now-unused
+   round-robin residue arithmetic + the exact-count guard branch (keep the empty-shard guard).
+   `SECTION_IDX` is no longer needed for assignment; keep it only as the total `T` in the
+   verdict line (`ran R of T sections`). The `RAN:` marker stays shard-gated.
+2. **Local proof:** (a) unsharded byte-identical (315/0, 141/0); (b) `1/2` runs only Test 1
+   (1 section, the canary), `2/2` runs the other 56; union == unsharded, no dup; (c) empty/
+   malformed/mutual-exclusion bails unchanged; (d) `shellcheck -S style` clean.
+3. **CI verify:** dispatch `tests.yml`; confirm shard 1 ≈ shard 2 (~167s / ~174s incl. overhead),
+   overall CI ≈ 210s (macOS-gated), both green, full legs unchanged.
+4. Commit incrementally under the lock; ships on `ci-stress`, lands via the merge PR.
+
+`tests.yml` needs **no change** (the matrix already passes `1/2`/`2/2`); the assignment swap is
+entirely in the harness.
+
+## Logging / observability
+- Keep the per-shard verdict line (`ran R of T sections`) + the shard-gated `RAN:` marker.
+- The CI job-log timestamp method (above) is the standing way to re-measure per-test cost — no
+  permanent timing instrumentation needed (kept out to avoid output churn).
+
+## Related observation (out of scope here — flagging for a separate decision)
+The canary (Test 1) being **~50% of the whole suite** is the real cost driver; sharding only
+works *around* it. If its FULL width (8×25) could be reduced without losing meaningful
+concurrency coverage, that would lower the ~151s floor and help more than sharding — but that's
+a **test-design change** (the width *is* its coverage), so it's deliberately out of scope for
+this balance plan. Worth raising separately.
+
+## Out of scope
+- `n > 2` (Test 1's 151s floor makes more shards pointless), cost-aware/dynamic bin-packing
+  (rejected — this is the fixed alternative), sharding other legs/suites/kcov, or changing the
+  canary itself (above).

From 89de803367d5e340c0c929c6aad10e1f01d60617 Mon Sep 17 00:00:00 2001
From: Ben Toner <bentoner@bentoner.com>
Date: Thu, 18 Jun 2026 09:32:36 +1000
Subject: [PATCH 55/76] Revert "CI: split the windows-unit leg into 2
 round-robin shards (tests.yml)"

This reverts commit 2de66ffeaf020ad12774060eb5eb7a6841d864e3.
---
 .github/workflows/tests.yml | 17 ++++-------------
 1 file changed, 4 insertions(+), 13 deletions(-)

diff --git a/.github/workflows/tests.yml b/.github/workflows/tests.yml
index d3db772..8ebffcc 100644
--- a/.github/workflows/tests.yml
+++ b/.github/workflows/tests.yml
@@ -27,23 +27,20 @@ permissions:
 
 jobs:
   test:
-    name: ${{ matrix.os }}${{ matrix.leg != 'all' && format(' ({0})', matrix.leg) || '' }}${{ matrix.shard && format(', shard {0}', matrix.shard) || '' }}
+    name: ${{ matrix.os }}${{ matrix.leg != 'all' && format(' ({0})', matrix.leg) || '' }}
     runs-on: ${{ matrix.os }}
     strategy:
       fail-fast: false               # an OS-specific failure is the signal we want; let the others finish
       matrix:
         # Windows splits into two parallel jobs — the bash-only unit suite is the
         # wall-clock bottleneck there (~309s vs interop 100s + integration 28s;
-        # process-spawn overhead, not the PowerShell engines). The unit suite is
-        # further split into two round-robin shards (GCL_TEST_SHARD=i/2) on two
-        # runners to ~halve that leg's wall-clock. Suites must NOT run
+        # process-spawn overhead, not the PowerShell engines). Suites must NOT run
         # concurrently inside one runner: they're timing-sensitive on 2-core
         # runners. POSIX legs are fast enough to stay single-job.
         include:
           - { os: ubuntu-24.04, leg: all, job_timeout: 35 }
           - { os: macos-15, leg: all, job_timeout: 35 }
-          - { os: windows-2025, leg: unit, shard: 1, job_timeout: 20 }
-          - { os: windows-2025, leg: unit, shard: 2, job_timeout: 20 }
+          - { os: windows-2025, leg: unit, job_timeout: 20 }
           - { os: windows-2025, leg: interop-integration, job_timeout: 22 }
     timeout-minutes: ${{ matrix.job_timeout }}   # backstop only: sum of the leg's step budgets + upload headroom
     defaults:
@@ -78,12 +75,6 @@ jobs:
         timeout-minutes: ${{ matrix.os == 'windows-2025' && 15 || 10 }}   # a step timeout FAILS the step (not the job), so the upload step reliably runs; sized from run 27325978197 + one internal MAX_WAIT hang
         env:
           GCL_TEST_PRESERVE_DIR: ${{ github.workspace }}/test-output/failed-work/unit
-          # Only the windows-unit cells carry matrix.shard → 1/2 or 2/2; every other
-          # cell (ubuntu/macos leg:all, windows interop-integration) gets '' (unset
-          # per the harness's ${GCL_TEST_SHARD:-}), so they run the FULL unit suite.
-          # /2 is hardcoded here (harness is n-generic); GHA treats 0 as falsy so
-          # shard indices stay 1-based.
-          GCL_TEST_SHARD: ${{ matrix.shard && format('{0}/2', matrix.shard) || '' }}
         run: |
           mkdir -p test-output
           bash tests/git-commit-lock.test.sh 2>&1 | tee test-output/unit-suite.log
@@ -110,7 +101,7 @@ jobs:
         if: ${{ failure() || cancelled() }}   # failure() covers step timeouts (they fail the step); cancelled() is best-effort cover for manual cancels / the job-level backstop
         uses: actions/upload-artifact@043fb46d1a93c77aae656e7c1c64a875d1fc6a0a   # v7.0.1, SHA-pinned
         with:
-          name: test-logs-${{ matrix.os }}-${{ matrix.leg }}${{ matrix.shard && format('-{0}', matrix.shard) || '' }}   # leg (+ shard) suffix keeps every windows job's artifact name unique (v4+ uploads reject duplicates); other cells unchanged
+          name: test-logs-${{ matrix.os }}-${{ matrix.leg }}   # leg suffix keeps the two windows jobs' artifact names unique (v4+ uploads reject duplicates)
           path: test-output/
           include-hidden-files: true   # the integration suite's key diagnostics (lock log, repo state) live under the scratch repo's .git/ — hidden paths are excluded by default since upload-artifact v4.4; contents are suite-generated, no secrets
           if-no-files-found: warn

From 143e280fe6637c05084a866ed9bc1e24d06d0fb8 Mon Sep 17 00:00:00 2001
From: Ben Toner <bentoner@bentoner.com>
Date: Thu, 18 Jun 2026 09:32:37 +1000
Subject: [PATCH 56/76] Revert "Test sharding: GCL_TEST_SHARD=i/n round-robin
 gate in the harness"

This reverts commit a01a8e3511f1abcc2446075357175930bbca5b86.
---
 tests/_harness.sh                         | 74 +++--------------------
 tests/git-commit-lock.integration.test.sh |  6 --
 2 files changed, 7 insertions(+), 73 deletions(-)

diff --git a/tests/_harness.sh b/tests/_harness.sh
index 34f8483..88b344c 100644
--- a/tests/_harness.sh
+++ b/tests/_harness.sh
@@ -34,14 +34,6 @@
 PASS=0; FAIL=0; TAPN=0; DONE=0; SECTIONS_RUN=0
 GCL_TAP="${GCL_TAP:-0}"           # CI sets GCL_TAP=1 for machine-readable TAP13 output
 GCL_TEST_ONLY="${GCL_TEST_ONLY:-}"  # if set, run ONLY test blocks whose label REGEX-matches (single-test selector)
-# Opt-in CI shard selector GCL_TEST_SHARD=<i>/<n> (round-robin over file-order
-# section index). Parsed LAZILY on the first section() call (see _shard_init) so
-# non-section() suites (integration) just note-and-ignore it; unset/empty is a
-# no-op so all unsharded runs stay byte-identical. SHARD_I/SHARD_N hold the parsed
-# pair; SECTION_IDX is the stable file-order shard key; SHARD_PARSED is the
-# once-only parse guard.
-GCL_TEST_SHARD="${GCL_TEST_SHARD:-}"
-SHARD_I=0; SHARD_N=0; SECTION_IDX=0; SHARD_PARSED=0
 
 # Axis-A waiter-count sweep (Bucket 6). GCL_TEST_SWEEP=1 (nightly/deep CI) widens
 # the fan-out/contention tests over several waiter counts to wring more coverage
@@ -69,54 +61,17 @@ ok()  { PASS=$((PASS+1)); TAPN=$((TAPN+1)); echo "PASS: $*"
 bad() { FAIL=$((FAIL+1)); TAPN=$((TAPN+1)); echo "FAIL: $*"
         [ "$GCL_TAP" = 1 ] && echo "not ok $TAPN - $*"; return 0; }
 
-# Lazy one-time parse+validate of GCL_TEST_SHARD. Called from section() (NOT at
-# source time) so suites that never call section() (integration) neither parse
-# nor bail — they only note-and-ignore the var. An empty/unset var is a no-op.
-# Validation bails LOUDLY (exit 1) on any malformed input so a typo can never
-# silently run a partial suite green:
-#   * GCL_TEST_ONLY + GCL_TEST_SHARD are mutually exclusive (no real combined use
-#     case, and exclusivity makes the per-shard count guard always valid).
-#   * The single regex ^([1-9][0-9]*)/([1-9][0-9]*)$ rejects empty components,
-#     non-digits, leading zeros (a bash-arithmetic octal trap), and extra slashes
-#     in one shot; BASH_REMATCH then yields i and n.
-#   * i <= n range check.
-_shard_init() {
-  SHARD_PARSED=1
-  [ -z "$GCL_TEST_SHARD" ] && return 0
-  if [ -n "${GCL_TEST_ONLY:-}" ]; then
-    echo "Bail out! GCL_TEST_ONLY and GCL_TEST_SHARD are mutually exclusive" >&2; exit 1
-  fi
-  if [[ "$GCL_TEST_SHARD" =~ ^([1-9][0-9]*)/([1-9][0-9]*)$ ]]; then
-    SHARD_I=${BASH_REMATCH[1]}; SHARD_N=${BASH_REMATCH[2]}
-  else
-    echo "Bail out! GCL_TEST_SHARD must be i/n positive integers (got '$GCL_TEST_SHARD')" >&2; exit 1
-  fi
-  if [ "$SHARD_I" -gt "$SHARD_N" ]; then
-    echo "Bail out! GCL_TEST_SHARD=$GCL_TEST_SHARD out of range (need i<=n)" >&2; exit 1
-  fi
-}
-
 # Per-test gate: echoes the block header (so a normal run is byte-unchanged) and
-# returns success iff the test is selected. Each top-level `== Test N: <desc> ==`
-# block is wrapped `if section "..."; then ... fi`. SECTION_IDX bumps for EVERY
-# section in file order (before any gating) — it is the stable shard-assignment
-# key, independent of GCL_TEST_ONLY/SWEEP/FULL. SECTIONS_RUN bumps only when the
-# block actually runs, so the verdict guards (selector_report) can catch a
-# zero-match selector or a miscounted shard. Two gates compose: the GCL_TEST_ONLY
-# regex selector, then the GCL_TEST_SHARD round-robin (mutually exclusive, so at
-# most one is active). Both are no-ops when their var is empty, so unsharded /
-# unselected runs are byte-identical (the RAN: marker is emitted ONLY in shard
-# mode for the same reason).
+# returns success iff GCL_TEST_ONLY is unset/empty OR its regex matches the label.
+# Each top-level `== Test N: <desc> ==` block is wrapped `if section "..."; then ... fi`.
+# Bumps SECTIONS_RUN on a match so the verdict's zero-match guard (selector_report)
+# can catch a selector regex that matched nothing.
 section() {
-  [ "$SHARD_PARSED" = 1 ] || _shard_init        # lazy: only section()-using suites parse
-  SECTION_IDX=$((SECTION_IDX + 1))              # file-order index, bumped for EVERY test before gating
   echo "== $1 =="
-  if [ -n "${GCL_TEST_ONLY:-}" ] && ! [[ "$1" =~ $GCL_TEST_ONLY ]]; then return 1; fi
-  if [ -n "$GCL_TEST_SHARD" ] && [ $(( (SECTION_IDX - 1) % SHARD_N )) -ne $(( SHARD_I - 1 )) ]; then
-    return 1
+  if [ -z "${GCL_TEST_ONLY:-}" ] || [[ "$1" =~ $GCL_TEST_ONLY ]]; then
+    SECTIONS_RUN=$((SECTIONS_RUN + 1)); return 0
   fi
-  [ -n "$GCL_TEST_SHARD" ] && echo "RAN: $1"    # run-only attribution marker (shard mode only — keeps unsharded byte-identical)
-  SECTIONS_RUN=$((SECTIONS_RUN + 1)); return 0
+  return 1
 }
 
 # Sentinel: the suite reaching its end sets DONE=1. If the EXIT trap fires with
@@ -149,21 +104,6 @@ selector_report() {
     exit 1
   fi
   [ -n "${GCL_TEST_ONLY:-}" ] && echo "selector GCL_TEST_ONLY=\"$GCL_TEST_ONLY\" ran $SECTIONS_RUN test block(s)"
-  # Shard mode (gated so unsharded runs are byte-identical and never hit % SHARD_N=0):
-  # recompute the expected run-count from the SAME one-based residue mapping the
-  # section() gate uses, log one greppable verdict line, and bail loudly if the
-  # actual run-count disagrees OR the shard is empty (expected < 1 — e.g. n >
-  # section-count like 58/58 — which would otherwise pass vacuously green).
-  if [ -n "$GCL_TEST_SHARD" ]; then
-    local exp=0 k=1
-    while [ "$k" -le "$SECTION_IDX" ]; do
-      [ $(( (k - 1) % SHARD_N )) -eq $(( SHARD_I - 1 )) ] && exp=$((exp + 1)); k=$((k + 1))
-    done
-    echo "GCL_TEST_SHARD=$SHARD_I/$SHARD_N: ran $SECTIONS_RUN of $SECTION_IDX sections (expected $exp)"
-    if [ "$SECTIONS_RUN" -ne "$exp" ] || [ "$exp" -lt 1 ]; then
-      echo "Bail out! shard $SHARD_I/$SHARD_N ran $SECTIONS_RUN, expected $exp" >&2; exit 1
-    fi
-  fi
   return 0
 }
 
diff --git a/tests/git-commit-lock.integration.test.sh b/tests/git-commit-lock.integration.test.sh
index 7086493..49badf8 100644
--- a/tests/git-commit-lock.integration.test.sh
+++ b/tests/git-commit-lock.integration.test.sh
@@ -108,12 +108,6 @@ LK_ENV=(AGENT_LOCK_STALE_SECS=300 AGENT_LOCK_POLL_SECS=0.2 AGENT_LOCK_MAX_WAIT=2
 if [ -n "$GCL_TEST_ONLY" ]; then
     echo "NOTE: integration suite ignores GCL_TEST_ONLY=\"$GCL_TEST_ONLY\" — Tests 1-3 are one indivisible scenario (shared repo + ALL_IDS audit); running the whole suite." >&2
 fi
-# Same for the CI shard selector (read by _harness.sh, pre-set to "" so no set -u
-# trap). This suite calls no section(), so the harness never parses/bails the var
-# — note it loudly and run the whole indivisible scenario as normal.
-if [ -n "$GCL_TEST_SHARD" ]; then
-    echo "NOTE: integration suite ignores GCL_TEST_SHARD=\"$GCL_TEST_SHARD\" — Tests 1-3 are one indivisible scenario (no section() blocks to shard); running the whole suite." >&2
-fi
 
 # --- scratch repo ------------------------------------------------------------
 REPO="$WORK/repo"; OUTD="$WORK/out"; NOHOOKS="$WORK/nohooks"

From 80268f43cd40af9399ad94c1cdb4ff838ce335f2 Mon Sep 17 00:00:00 2001
From: Ben Toner <bentoner@bentoner.com>
Date: Thu, 18 Jun 2026 09:36:57 +1000
Subject: [PATCH 57/76] Plan: extract the concurrency canary (Test 1) into its
 own file; supersede sharding
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

After unwinding the GCL_TEST_SHARD sharding (reverts 89de803 + 143e280), plan the
simpler approach: move Test 1 (the ~151s full-width concurrency canary, ~half the
Windows unit-suite runtime) into tests/git-commit-lock.canary.test.sh, so it runs
as a naturally-parallel CI job — zero sharding machinery, same wall-clock win.

The canary runs as its own cell on ALL arches (Ben's call: uniform, cheap extra
POSIX job). New 7-cell matrix (canary leg on ubuntu/macos/windows). Test 1 moves
verbatim (sources _harness.sh; tiny preamble: LIB, WORK/cleanup/trap, T1_* knobs,
INCR). Coverage-safe (union across cells == the original 57; counts reconcile).

Marked the two shard plans (windows-unit-shard, shard-balance) SUPERSEDED.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 .../2026-06-18-ci-stress-canary-split-plan.md | 108 ++++++++++++++++++
 ...2026-06-18-ci-stress-shard-balance-plan.md |   5 +
 ...06-18-ci-stress-windows-unit-shard-plan.md |   4 +
 3 files changed, 117 insertions(+)
 create mode 100644 .plans/2026-06-18-ci-stress-canary-split-plan.md

diff --git a/.plans/2026-06-18-ci-stress-canary-split-plan.md b/.plans/2026-06-18-ci-stress-canary-split-plan.md
new file mode 100644
index 0000000..aaa58b7
--- /dev/null
+++ b/.plans/2026-06-18-ci-stress-canary-split-plan.md
@@ -0,0 +1,108 @@
+# Plan: extract the concurrency canary (Test 1) into its own suite file
+
+Status: **PROPOSAL (Phase 2) — for Ben's review.** Supersedes the sharding approach (the
+`GCL_TEST_SHARD` mechanism + the fixed-split balance plan), which has been **unwound** via
+explicit `git revert` (`89de803` + `143e280`; verified byte-identical to the pre-shard tree).
+No implementation until Ben's go.
+
+## Why
+The Windows-unit CI leg is the wall-clock bottleneck (~360s, ~2× the others) and **one test
+drives ~half of it**: Test 1, the full-width concurrency **canary** (25 workers × 8 rounds
+racing the lock), measures **~151s on the Windows runner** (the other 56 unit tests sum to
+~158s). It is *cheap* on Linux/macOS (fast process spawn) — pathological only on Windows.
+
+Rather than shard one file across runners (assignment machinery, a maintained split, a guard),
+**move the canary into its own file** so it runs as a naturally-parallel CI job. Same wall-clock
+win (~360s → macOS-gated ~210s or better) with **zero sharding machinery**. Test 1 is genuinely
+a *different kind* of test — a statistical concurrency canary ("repetition at width is its
+coverage") vs the targeted unit/steering tests — so the seam is natural, not arbitrary.
+
+## The extraction (mechanically clean — feasibility confirmed by exploration)
+**New file `tests/git-commit-lock.canary.test.sh`** — sources `tests/_harness.sh` like the other
+suites; copies the minimal preamble the canary needs and the Test 1 block **verbatim**:
+- Preamble to copy from `tests/git-commit-lock.test.sh`: the `set -uo pipefail` + shellcheck
+  disables; the `_HARNESS_DIR`/source idiom; `DIR`/`ROOT`/`LIB`; the `GCL_TEST_FULL` →
+  `GCL_MODE`/`T1_ROUNDS`/`T1_N` width block (only the `T1_*` knobs are needed); `WORK` +
+  `cleanup()` + `trap finish EXIT`; the `INCR` critical-section string (**used by Test 1 only**).
+- The **Test 1 `if section "Test 1: …"; then … fi` block moves verbatim** (it namespaces all its
+  files under `$WORK`; zero cross-test coupling — nothing else reads/produces its state).
+- Tail: `selector_report` + `DONE=1` + the `RESULT`/`1..$TAPN` lines + `[ "$FAIL" = 0 ]` (copy
+  from the unit suite's end). (`GCL_TEST_ONLY` is near-pointless in a one-test file but the call
+  is zero-cost and keeps the `finish`/zero-match scaffolding uniform.)
+- **Do NOT copy** the unit-file-local helpers the canary doesn't use: `clone_fn`+`export -f`,
+  `wait_for_file`, the `ok_envelope`/`bad_envelope` envelope tier, `T_AXIS_A`/sweep. (Verified
+  unused by Test 1.)
+
+**`tests/git-commit-lock.test.sh`:** delete the Test 1 block (lines of the `if section "Test 1:
+…"; then … fi`). The suite's count self-adjusts — `TAPN` is a running counter, so the `1..N`
+plan line and `RESULT` drop by Test 1's assertions automatically (no hardcoded total to edit);
+`DONE`/`finish`/`selector_report` are count-agnostic. `INCR` moves out with Test 1 (confirmed no
+other unit test uses it).
+
+## CI wiring (`.github/workflows/tests.yml`) — canary as its own cell on ALL arches
+Per Ben: run the canary in parallel on every arch (uniform; the extra POSIX job is cheap). Four
+suite files now; the `canary` leg is a separate cell on ubuntu, macOS, and Windows.
+
+Proposed `matrix.include` (7 test cells + `lint`):
+```yaml
+- { os: ubuntu-24.04,  leg: all,                  job_timeout: 35 }   # unit+interop+integration (NOT canary)
+- { os: ubuntu-24.04,  leg: canary,               job_timeout: 15 }
+- { os: macos-15,      leg: all,                  job_timeout: 35 }
+- { os: macos-15,      leg: canary,               job_timeout: 15 }
+- { os: windows-2025,  leg: unit,                 job_timeout: 20 }   # unit minus canary
+- { os: windows-2025,  leg: interop-integration,  job_timeout: 22 }
+- { os: windows-2025,  leg: canary,               job_timeout: 15 }
+```
+Step gating (so the canary runs in exactly one cell per arch, never doubled):
+- **New "Canary suite" step:** `if: ${{ matrix.leg == 'canary' }}` → `bash tests/git-commit-lock.canary.test.sh` (own `GCL_TEST_PRESERVE_DIR=…/failed-work/canary`; step `timeout-minutes` ~7 Windows / ~6 POSIX, sized from ~151s Windows + headroom).
+- **Unit step:** `if: ${{ matrix.leg == 'all' || matrix.leg == 'unit' }}` (unchanged form) → unit suite (now minus canary). So `leg: all` runs unit+interop+integration but **not** canary (its step only fires on `leg: canary`).
+- **Interop / Integration steps:** unchanged (`!cancelled() && (matrix.leg == 'all' || matrix.leg == 'interop-integration')`).
+- Job-name template + artifact name already key on `matrix.leg` → the `canary` leg gets a unique name/artifact for free (no shard suffix needed).
+
+Other CI bookkeeping:
+- Add `tests/git-commit-lock.canary.test.sh` to the **shellcheck file list** in the `lint` job.
+- Update the "Sourced by all three suites" comment in `_harness.sh` (and any "three suites" prose) → **four**.
+
+## Coverage-safety
+- **No test is lost or doubled:** Test 1 runs in exactly the `canary` cell on each arch; the
+  other 56 run in the `all`/`unit` cells. Union across cells == the original 57 on every arch.
+  (The canary step gates only on `leg == 'canary'`; the unit step never runs canary.)
+- **Verification (local proof, Phase-2 of impl):** (a) the new canary file runs standalone green
+  (Test 1's same assertions); (b) the unit suite runs green minus Test 1 (count = old 315 − Test
+  1's assertions); (c) canary-count + unit-count == the old 315 (no assertion lost); (d) interop
+  141/0, integration 12/0 unchanged; (e) `shellcheck -S style` clean (incl. the new file);
+  `actionlint` clean.
+- **Cross-platform CI** is the authoritative gate: all 7 cells green; the canary runs on each arch.
+
+## Predicted timings
+- Windows: `unit` (minus canary) ~158s ‖ `canary` ~151s ‖ `interop-integration` ~140s → Windows
+  wall-clock ~max ≈ **~174s** (incl. overhead), down from ~360s.
+- ubuntu/macOS: `all` (minus the now-tiny canary) ≈ unchanged-to-slightly-lower (~180/~190s) ‖
+  `canary` cheap (~tens of s).
+- **Overall CI gated by the slowest cell ≈ macOS `all` (~190–210s)** — the same win as sharding,
+  with no sharding machinery. (Exact numbers confirmed by the post-implementation CI run.)
+
+## Phasing (implementation — on Ben's go)
+1. Create `tests/git-commit-lock.canary.test.sh` (preamble + Test 1 verbatim + tail); delete the
+   Test 1 block from `tests/git-commit-lock.test.sh`; add the canary file to the shellcheck list;
+   fix the "three suites" → "four" comment.
+2. **Local proof** (the coverage-safety checks above) — canary standalone green, unit-minus-canary
+   green, counts reconcile to the old 315, lint clean.
+3. Rewire `tests.yml` matrix (the 7 cells + the canary step).
+4. Push + **CI verify** cross-platform (all 7 cells green; the ~174s Windows / macOS-gated overall).
+5. Commit incrementally under the lock; ships on `ci-stress`, lands via the merge PR.
+
+## Logging / observability
+- The canary file keeps the standard `RESULT`/`1..$TAPN`/`finish`-sentinel output, so its CI job
+  log is self-describing. Per-test timing (if ever re-measured) uses the CI job-log timestamps,
+  as before.
+
+## Supersedes
+- `.plans/2026-06-18-ci-stress-windows-unit-shard-plan.md` (the `GCL_TEST_SHARD` mechanism) and
+  `.plans/2026-06-18-ci-stress-shard-balance-plan.md` (the fixed Test-1-vs-rest split) — both
+  obsoleted by this file-extraction approach; the sharding was unwound (`89de803`+`143e280`).
+  (Leave those plan files in place per "leave history be"; add a superseded-by pointer at their top.)
+
+## Out of scope
+- Reducing the canary's own ~151s width (a test-design change — the width *is* its coverage;
+  worth a separate look, not here). Sharding/`GCL_TEST_SHARD` (removed). `n>2` (N/A — files, not shards).
diff --git a/.plans/2026-06-18-ci-stress-shard-balance-plan.md b/.plans/2026-06-18-ci-stress-shard-balance-plan.md
index 1cebd44..9188417 100644
--- a/.plans/2026-06-18-ci-stress-shard-balance-plan.md
+++ b/.plans/2026-06-18-ci-stress-shard-balance-plan.md
@@ -1,5 +1,10 @@
 # Subplan: balance the Windows-unit shards with a fixed (measured) split
 
+**SUPERSEDED 2026-06-18** by `.plans/2026-06-18-ci-stress-canary-split-plan.md`. The "Test 1 vs
+rest" insight here was right, but the cleaner realization is to make Test 1 its own *file* (no
+sharding at all) — so the `GCL_TEST_SHARD` machinery was unwound (`89de803` + `143e280`) and the
+canary is extracted instead. Original status retained below for record.
+
 Status: **ENDORSED by Ben (2026-06-18) — split = "Test 1" vs "not Test 1"; implementing.**
 The change is a tiny assignment swap on the already-3-round-reviewed shard mechanism, so the
 local proof + CI run are the gates (no separate review rounds). Follow-on to
diff --git a/.plans/2026-06-18-ci-stress-windows-unit-shard-plan.md b/.plans/2026-06-18-ci-stress-windows-unit-shard-plan.md
index 6a41924..3e5be59 100644
--- a/.plans/2026-06-18-ci-stress-windows-unit-shard-plan.md
+++ b/.plans/2026-06-18-ci-stress-windows-unit-shard-plan.md
@@ -1,5 +1,9 @@
 # Subplan: split the Windows unit CI leg into parallel shards
 
+**SUPERSEDED 2026-06-18** by `.plans/2026-06-18-ci-stress-canary-split-plan.md`. The sharding
+was unwound via explicit revert (`89de803` + `143e280`); we extract the canary (Test 1) to its
+own file instead — same CI win, zero sharding machinery. Original status retained below for record.
+
 Status: **CONVERGED (Phase 2) — 3 review rounds (Claude ×3 + Codex ×3); final Codex clean,
 "sound-to-implement". Ready for Ben's go on implementation.** A small
 follow-on to the Bucket-6 CI work, building on the `section()`/selector machinery (commit

From 57ade631945fcb7192e03f5f917ebd12a00f1afc Mon Sep 17 00:00:00 2001
From: Ben Toner <bentoner@bentoner.com>
Date: Thu, 18 Jun 2026 09:47:41 +1000
Subject: [PATCH 58/76] Plan: fold single Codex review of the canary-split plan
 (2 real catches)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Codex (independent single review) caught two gaps, both folded:
- ENV_WARN under set -u: the unit RESULT line expands $ENV_WARN (defined in the
  envelope section we're not copying) -> the canary file must define ENV_WARN=0.
- The other workflow callers: nightly.yml (stress cells + the kcov coverage job)
  and deep-sweep.yml reach Test 1 only via git-commit-lock.test.sh; after
  extraction they must also run the new canary file (preserve canary-under-load
  stress + the canary's kcov coverage contribution, ~3pp from the 0.80 floor).
  Principle: the canary is a 4th suite file — every suite-runner must include it.
Updated phasing to rewire all three workflows + the kcov merged-coverage run.
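
The ENV_WARN catch can be reproduced in isolation. The sketch below is
illustrative only: `print_result` is a hypothetical stand-in for the suite's
RESULT echo, and the PASS/FAIL values are made up. It assumes nothing about
the harness beyond `set -u` being active when the line is printed.

```shell
#!/usr/bin/env bash
# Hypothetical stand-in for the suite tail's RESULT echo (shortened).
print_result() {
  echo "==== RESULT: $PASS passed, $FAIL failed, $ENV_WARN envelope warning(s) ===="
}

PASS=2; FAIL=0   # made-up counts, for illustration only

# Without a definition, the unbound expansion aborts the subshell under set -u.
if ( set -u; unset ENV_WARN; print_result ) 2>/dev/null; then
  echo "ENV_WARN unset: RESULT line printed"
else
  echo "ENV_WARN unset: aborted before printing"
fi

# With ENV_WARN=0 defined up front, the same line prints normally.
( set -u; ENV_WARN=0; print_result )
```

The first branch reports the abort; the second prints the RESULT line, which
is the behaviour the `ENV_WARN=0` init in the canary file is meant to secure.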

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 .../2026-06-18-ci-stress-canary-split-plan.md | 33 +++++++++++++++++--
 1 file changed, 31 insertions(+), 2 deletions(-)

diff --git a/.plans/2026-06-18-ci-stress-canary-split-plan.md b/.plans/2026-06-18-ci-stress-canary-split-plan.md
index aaa58b7..7ec5985 100644
--- a/.plans/2026-06-18-ci-stress-canary-split-plan.md
+++ b/.plans/2026-06-18-ci-stress-canary-split-plan.md
@@ -29,6 +29,10 @@ suites; copies the minimal preamble the canary needs and the Test 1 block **verb
 - Tail: `selector_report` + `DONE=1` + the `RESULT`/`1..$TAPN` lines + `[ "$FAIL" = 0 ]` (copy
   from the unit suite's end). (`GCL_TEST_ONLY` is near-pointless in a one-test file but the call
   is zero-cost and keeps the `finish`/zero-match scaffolding uniform.)
+  - **`ENV_WARN` (review catch):** the unit suite's `RESULT` line expands `$ENV_WARN`, which is
+    defined in the envelope section we are NOT copying — so under `set -u` the canary's RESULT
+    line would crash. Fix: define `ENV_WARN=0` near the canary's inits (the canary uses plain
+    `ok`/`bad`, no envelope), so the standard RESULT line works unchanged.
 - **Do NOT copy** the unit-file-local helpers the canary doesn't use: `clone_fn`+`export -f`,
   `wait_for_file`, the `ok_envelope`/`bad_envelope` envelope tier, `T_AXIS_A`/sweep. (Verified
   unused by Test 1.)
@@ -63,6 +67,25 @@ Other CI bookkeeping:
 - Add `tests/git-commit-lock.canary.test.sh` to the **shellcheck file list** in the `lint` job.
 - Update the "Sourced by all three suites" comment in `_harness.sh` (and any "three suites" prose) → **four**.
 
+## Other workflow callers (review catch — the canary is now a 4th suite file)
+The canary currently runs **only** via `tests/git-commit-lock.test.sh`, which three other CI
+spots invoke. After extraction each must also run `tests/git-commit-lock.canary.test.sh`, or it
+silently loses the canary:
+- **`nightly.yml` stress cells** (run the unit suite under load): add the canary so it's still
+  stress-tested under oversubscription (concurrency + load is the highest-value canary scenario).
+  Run it in the relevant cells (sequentially after the unit suite is fine — nightly isn't
+  dev-blocking; no separate parallel cell needed there).
+- **`nightly.yml` kcov job** (measures `git-commit-lock.sh` line coverage from the unit suite,
+  gated at the **0.80** floor with only ~3pp headroom): **run the unit suite AND the canary file
+  under kcov (merged output)** so the canary's coverage contribution is preserved — otherwise
+  the floor could regress. (kcov merges multiple runs into one `--include-path` output dir.)
+- **`deep-sweep.yml`** (on-demand deep flake hunt under load+repeat): add the canary file — the
+  concurrency canary is exactly what a deep hunt should exercise.
+Principle: treat the canary like any new suite file — every workflow/job that enumerates the
+suites (and the shellcheck lint list) must include it. (`tests.yml` is the only one that gets the
+*parallel-cell* treatment, for the per-PR wall-clock win; the others just add the file to what
+they already run.)
+
 ## Coverage-safety
 - **No test is lost or doubled:** Test 1 runs in exactly the `canary` cell on each arch; the
   other 56 run in the `all`/`unit` cells. Union across cells == the original 57 on every arch.
@@ -88,8 +111,14 @@ Other CI bookkeeping:
    fix the "three suites" → "four" comment.
 2. **Local proof** (the coverage-safety checks above) — canary standalone green, unit-minus-canary
    green, counts reconcile to the old 315, lint clean.
-3. Rewire `tests.yml` matrix (the 7 cells + the canary step).
-4. Push + **CI verify** cross-platform (all 7 cells green; the ~174s Windows / macOS-gated overall).
+3. Rewire **all** workflows to include the canary file: `tests.yml` (7-cell matrix + the canary
+   step — the parallel-cell win); `nightly.yml` (add the canary to the stress cells + make the
+   kcov job run unit **and** canary under kcov, merged); `deep-sweep.yml` (add the canary to its
+   cells). `actionlint` clean on all three.
+4. Push + **CI verify** cross-platform: all 7 `tests.yml` cells green; the ~174s Windows /
+   macOS-gated overall. (nightly/deep-sweep can't dispatch until on `main`, but their canary
+   wiring is statically validated; the kcov merged-coverage stays ≥ the 0.80 floor since the
+   same tests run, just split across two files.)
 5. Commit incrementally under the lock; ships on `ci-stress`, lands via the merge PR.
 
 ## Logging / observability

From 5fe15c927e48dd0af8efc5ad791f12c1c9564d7e Mon Sep 17 00:00:00 2001
From: Ben Toner <bentoner@bentoner.com>
Date: Thu, 18 Jun 2026 10:13:42 +1000
Subject: [PATCH 59/76] Extract the concurrency canary (Test 1) into its own
 suite file
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Test 1 (the full-width 25x8 concurrency canary) is ~half the Windows unit-suite
wall-clock and is a different KIND of test (a statistical concurrency canary;
repetition-at-width is its coverage) from the targeted unit/steering tests. Move
it verbatim into tests/git-commit-lock.canary.test.sh so it can run as a
naturally-parallel CI job — replacing the (now-reverted) GCL_TEST_SHARD sharding
with a far simpler file split.

- New tests/git-commit-lock.canary.test.sh: sources _harness.sh; minimal preamble
  (LIB, the GCL_TEST_FULL->T1_* width knobs, WORK/cleanup/trap finish, INCR), the
  Test 1 block verbatim, standard RESULT/TAP tail. Defines ENV_WARN=0 (the RESULT
  line expands it but it lives in the envelope tier we don't copy).
- tests/git-commit-lock.test.sh: Test 1 block + the now-orphaned INCR removed
  (no other unit test uses INCR); count self-adjusts (TAPN is dynamic).
- _harness.sh: "three suites" -> "four".

Validated: canary standalone 2/0; unit-minus-canary 313/0; 313+2=315 (the
pre-extraction count), with the two assertion sets disjoint, free of duplicates,
and their union equal to the original; interop 141/0, integration 12/0;
shellcheck -S style clean (incl. the new file).
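
The count reconciliation can be re-checked mechanically from the suites'
RESULT lines. A minimal sketch, assuming both suites print the RESULT format
used in the canary file's tail; `sum_passed` is a hypothetical helper (not in
the repo), and the sample lines (including the FULL mode and zero envelope
warnings) are illustrative, built from the counts quoted above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: sum the "N passed" counts from RESULT lines on stdin.
sum_passed() {
  local total=0 n line
  while IFS= read -r line; do
    case "$line" in
      "==== RESULT: "*" passed, "*)
        n="${line#==== RESULT: }"   # strip the fixed prefix
        n="${n%% *}"                # keep the first word: the pass count
        total=$((total + n))
        ;;
    esac
  done
  echo "$total"
}

# Illustrative input: 313 (unit minus canary) + 2 (canary).
printf '%s\n' \
  '==== RESULT: 313 passed, 0 failed, 0 envelope warning(s) (fan-out: FULL) ====' \
  '==== RESULT: 2 passed, 0 failed, 0 envelope warning(s) (fan-out: FULL) ====' \
  | sum_passed
```

In practice the input would be the two suites' captured logs piped through
the helper, with the total compared against the pre-extraction 315.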

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 tests/_harness.sh                    |   5 +-
 tests/git-commit-lock.canary.test.sh | 124 +++++++++++++++++++++++++++
 tests/git-commit-lock.test.sh        |  44 ++--------
 3 files changed, 135 insertions(+), 38 deletions(-)
 create mode 100644 tests/git-commit-lock.canary.test.sh

diff --git a/tests/_harness.sh b/tests/_harness.sh
index 88b344c..9ff10ca 100644
--- a/tests/_harness.sh
+++ b/tests/_harness.sh
@@ -1,8 +1,9 @@
 # shellcheck shell=bash
 # tests/_harness.sh — shared test harness for the git-commit-lock suites.
 #
-# Sourced by all three suites (git-commit-lock.test.sh, .interop.test.sh,
-# .integration.test.sh) to share the bits they all copy-pasted: the PASS/FAIL/
+# Sourced by all four suites (git-commit-lock.test.sh, .canary.test.sh,
+# .interop.test.sh, .integration.test.sh) to share the bits they all
+# copy-pasted: the PASS/FAIL/
 # TAP counters, the GCL_TAP / GCL_TEST_ONLY reads, ok()/bad(), section(), the
 # end-of-suite DONE sentinel (finish), and the per-test selector verdict helper.
 # Pure deduplication — ZERO behaviour change vs the inline copies it replaces.
diff --git a/tests/git-commit-lock.canary.test.sh b/tests/git-commit-lock.canary.test.sh
new file mode 100644
index 0000000..0f30461
--- /dev/null
+++ b/tests/git-commit-lock.canary.test.sh
@@ -0,0 +1,124 @@
+#!/usr/bin/env bash
+# git-commit-lock.canary.test.sh — the concurrency CANARY, extracted from the
+# unit suite (git-commit-lock.test.sh) into its own file so it runs as a
+# naturally-parallel CI job.
+#
+# Runs entirely against throwaway temp dirs, so it never touches the repo you
+# launch it from. Exit 0 == pass.
+#   bash tests/git-commit-lock.canary.test.sh
+#
+# This is a STATISTICAL concurrency canary — N workers race the lock over
+# repeated rounds; repetition at width is its coverage. It is cheap on
+# Linux/macOS (fast process spawn) but pathological on Windows (~half the
+# Windows unit wall-clock), which is exactly why it lives in its own cell.
+#
+# Fan-out: defaults to REDUCED width so routine dev runs don't lag a live shared
+# machine; set GCL_TEST_FULL=1 (CI does) for the full-strength 8x25 canary. The
+# file prints which mode ran — a reduced pass must never masquerade as the full one.
+#
+# On failure the work dir is PRESERVED (path printed) for post-mortem; set
+# GCL_TEST_PRESERVE_DIR=<dir> to additionally copy all logs/outputs there
+# regardless of outcome (used by CI).
+#
+# shellcheck disable=SC2015  # The pervasive `<assert> && ok ... || bad ...`
+# idiom is deliberate throughout: ok/bad are echo+counter helpers that cannot
+# fail, so the classic A && B || C pitfall (C running after B fails) is moot.
+# shellcheck disable=SC2310,SC2312  # info-level, deliberate: helper functions
+# and command substitutions run inside conditions all over a test suite; the
+# suite runs WITHOUT errexit (set -uo only) and asserts on values, not on
+# implicit exit propagation.
+# shellcheck disable=SC2016  # $INCR is single-quoted on purpose: it expands
+# inside the worker's `bash -c`, not here.
+set -uo pipefail
+
+# Shared harness: PASS/FAIL/TAP counters, GCL_TAP/GCL_TEST_ONLY reads, ok/bad,
+# section, the finish EXIT-trap sentinel (calls our cleanup below). Resolved from
+# THIS script's own dir so it sources regardless of CWD; sourced EARLY (before any
+# use of the inits/helpers below).
+_HARNESS_DIR="$(CDPATH='' cd -- "$(dirname -- "${BASH_SOURCE[0]}")" && pwd)"
+# shellcheck source=tests/_harness.sh
+. "$_HARNESS_DIR/_harness.sh"
+
+DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
+ROOT="$(cd "$DIR/.." && pwd)"   # the implementations live at the repo root
+LIB="$ROOT/git-commit-lock.sh"
+
+if [ "${GCL_TEST_FULL:-0}" = 1 ]; then
+  GCL_MODE="FULL"; T1_ROUNDS=8; T1_N=25
+else
+  GCL_MODE="REDUCED"; T1_ROUNDS=3; T1_N=8
+fi
+echo "fan-out mode: $GCL_MODE (T1 ${T1_ROUNDS} rounds x ${T1_N} workers)"
+[ "$GCL_MODE" = REDUCED ] && echo "  (set GCL_TEST_FULL=1 for the full-strength 8x25 canary — CI runs it)"
+
+WORK="$(mktemp -d 2>/dev/null || echo "${TMPDIR:-/tmp}/git-commit-lock-test.$$")"
+mkdir -p "$WORK"
+cleanup() {
+  if [ -n "${GCL_TEST_PRESERVE_DIR:-}" ]; then
+    mkdir -p "$GCL_TEST_PRESERVE_DIR" 2>/dev/null || true
+    cp -R "$WORK"/. "$GCL_TEST_PRESERVE_DIR"/ 2>/dev/null || true
+    echo "note: copied test artifacts to $GCL_TEST_PRESERVE_DIR"
+  fi
+  if [ "${FAIL:-0}" -gt 0 ]; then
+    echo "note: failures detected — work dir preserved for post-mortem: $WORK"
+  else
+    rm -rf "$WORK" 2>/dev/null || true
+  fi
+}
+# The finish EXIT-trap sentinel (defined in _harness.sh) calls the cleanup()
+# above and fails loudly if the suite died before setting DONE=1.
+trap finish EXIT
+
+# The RESULT line below expands $ENV_WARN, which in the unit suite is maintained
+# by the envelope-tier assertions (ok_envelope/bad_envelope). The canary uses
+# only plain ok/bad (no envelope assertions), so define it to 0 here so the
+# standard RESULT line works unchanged under set -u.
+ENV_WARN=0
+
+# Critical section that loses updates without a mutex: read, gap, write+1.
+INCR='n="$(cat "$1")"; sleep 0.03; echo $((n+1)) > "$1"'
+
+if section "Test 1: concurrent workers, mutual exclusion (repeated rounds, $GCL_MODE width)"; then
+# A single pass is too weak to trust a rare exclusion race (the release-steal
+# bug found 2026-05-30 lost ~1 update per 25 only intermittently). Repeat
+# several rounds; ANY lost update across ALL rounds fails the test.
+# MAX_WAIT caps a regression at 180s per worker instead of the 420s default;
+# STALE stays comfortably above any realistic hold so nothing is ever stolen.
+N=$T1_N; ROUNDS=$T1_ROUNDS; t1_fail=0; T1ERR="$WORK/excl.err"; : > "$T1ERR"
+for r in $(seq 1 "$ROUNDS"); do
+  COUNTER="$WORK/counter.$r"; echo 0 > "$COUNTER"
+  LOCK="$WORK/excl.$r.lock"; LOG="$WORK/excl.$r.log"; : > "$LOG"; pids=()
+  for _ in $(seq 1 "$N"); do
+    AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" AGENT_LOCK_STALE_SECS=120 \
+      AGENT_LOCK_POLL_SECS=0.05 AGENT_LOCK_MAX_WAIT=180 \
+      bash "$LIB" run -- bash -c "$INCR" _ "$COUNTER" 2>> "$T1ERR" &
+    pids+=($!)
+  done
+  for p in "${pids[@]}"; do wait "$p"; done
+  c="$(cat "$COUNTER")"; a="$(grep -c ACQUIRED "$LOG")"; rl="$(grep -c RELEASED "$LOG")"
+  if [ "$c" != "$N" ] || [ "$a" != "$N" ] || [ "$rl" != "$N" ] || [ -e "$LOCK" ]; then
+    t1_fail=1; echo "  round $r: counter=$c acquired=$a released=$rl leftover=$([ -e "$LOCK" ] && echo yes || echo no)"
+  fi
+done
+[ "$t1_fail" = 0 ] && ok "$ROUNDS rounds x $N workers ($GCL_MODE): no lost updates, balanced acquire/release, no leftover lock" \
+                    || bad "mutual-exclusion failure in at least one round (see above)"
+# Regression: under contention the lock file routinely vanishes mid-mtime-probe;
+# that must NOT be misdiagnosed as "staleness detection broken" (false WARNING
+# observed 2026-06-10 before the probe got its retry loop).
+grep -q "Staleness detection is BROKEN" "$T1ERR" \
+  && bad "spurious mtime-probe WARNING under contention (see $T1ERR)" \
+  || ok "no spurious mtime-probe warnings under contention"
+fi
+
+# Zero-match guard + selector-report line (shared helper in _harness.sh): a
+# set-but-non-matching GCL_TEST_ONLY ran NO test block, which without the guard
+# would fall through to a vacuous PASS=0 FAIL=0 "green". Near-pointless in a
+# one-test file, but zero-cost and keeps the finish/zero-match scaffolding
+# uniform with the other suites.
+selector_report
+
+DONE=1
+echo
+echo "==== RESULT: $PASS passed, $FAIL failed, $ENV_WARN envelope warning(s) (fan-out: $GCL_MODE) ===="
+[ "$GCL_TAP" = 1 ] && echo "1..$TAPN"
+[ "$FAIL" = 0 ]
diff --git a/tests/git-commit-lock.test.sh b/tests/git-commit-lock.test.sh
index 3bffabd..ea2cc67 100755
--- a/tests/git-commit-lock.test.sh
+++ b/tests/git-commit-lock.test.sh
@@ -21,8 +21,10 @@
 # and command substitutions run inside conditions all over a test suite; the
 # suite runs WITHOUT errexit (set -uo only) and asserts on values, not on
 # implicit exit propagation.
-# shellcheck disable=SC2016  # $INCR is single-quoted on purpose: it expands
-# inside the worker's `bash -c`, not here.
+# shellcheck disable=SC2016  # Single-quoted strings carrying `$…` on purpose —
+# steering-shell bodies (the T*_INNER `bash -c` programs) and grep patterns that
+# match literal `$_LOCK_*` text in the library — expand in their own context, not
+# here.
 set -uo pipefail
 
 # Shared harness: PASS/FAIL/TAP counters, GCL_TAP/GCL_TEST_ONLY reads, ok/bad,
@@ -112,40 +114,10 @@ wait_for_file() {
   [ -e "$f" ]
 }
 
-# Critical section that loses updates without a mutex: read, gap, write+1.
-INCR='n="$(cat "$1")"; sleep 0.03; echo $((n+1)) > "$1"'
-
-if section "Test 1: concurrent workers, mutual exclusion (repeated rounds, $GCL_MODE width)"; then
-# A single pass is too weak to trust a rare exclusion race (the release-steal
-# bug found 2026-05-30 lost ~1 update per 25 only intermittently). Repeat
-# several rounds; ANY lost update across ALL rounds fails the test.
-# MAX_WAIT caps a regression at 180s per worker instead of the 420s default;
-# STALE stays comfortably above any realistic hold so nothing is ever stolen.
-N=$T1_N; ROUNDS=$T1_ROUNDS; t1_fail=0; T1ERR="$WORK/excl.err"; : > "$T1ERR"
-for r in $(seq 1 "$ROUNDS"); do
-  COUNTER="$WORK/counter.$r"; echo 0 > "$COUNTER"
-  LOCK="$WORK/excl.$r.lock"; LOG="$WORK/excl.$r.log"; : > "$LOG"; pids=()
-  for _ in $(seq 1 "$N"); do
-    AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" AGENT_LOCK_STALE_SECS=120 \
-      AGENT_LOCK_POLL_SECS=0.05 AGENT_LOCK_MAX_WAIT=180 \
-      bash "$LIB" run -- bash -c "$INCR" _ "$COUNTER" 2>> "$T1ERR" &
-    pids+=($!)
-  done
-  for p in "${pids[@]}"; do wait "$p"; done
-  c="$(cat "$COUNTER")"; a="$(grep -c ACQUIRED "$LOG")"; rl="$(grep -c RELEASED "$LOG")"
-  if [ "$c" != "$N" ] || [ "$a" != "$N" ] || [ "$rl" != "$N" ] || [ -e "$LOCK" ]; then
-    t1_fail=1; echo "  round $r: counter=$c acquired=$a released=$rl leftover=$([ -e "$LOCK" ] && echo yes || echo no)"
-  fi
-done
-[ "$t1_fail" = 0 ] && ok "$ROUNDS rounds x $N workers ($GCL_MODE): no lost updates, balanced acquire/release, no leftover lock" \
-                    || bad "mutual-exclusion failure in at least one round (see above)"
-# Regression: under contention the lock file routinely vanishes mid-mtime-probe;
-# that must NOT be misdiagnosed as "staleness detection broken" (false WARNING
-# observed 2026-06-10 before the probe got its retry loop).
-grep -q "Staleness detection is BROKEN" "$T1ERR" \
-  && bad "spurious mtime-probe WARNING under contention (see $T1ERR)" \
-  || ok "no spurious mtime-probe warnings under contention"
-fi
+# NB: Test 1 (the full-width concurrency CANARY) now lives in its own suite file,
+# tests/git-commit-lock.canary.test.sh, so it runs as a naturally-parallel CI job
+# (it is ~half the Windows unit wall-clock). The $INCR critical-section string it
+# used moved out with it (no other unit test uses it).
 
 if section "Test 2: stale lock (old file mtime) is stolen; holder comes from line 2"; then
 LOCK="$WORK/steal.lock"; LOG="$WORK/steal.log"; : > "$LOG"; MARKER="$WORK/steal-marker"

From b1eb0a8ad5eb630a92e99ef035096733c58ec5f2 Mon Sep 17 00:00:00 2001
From: Ben Toner <bentoner@bentoner.com>
Date: Thu, 18 Jun 2026 10:13:44 +1000
Subject: [PATCH 60/76] CI: run the canary as its own parallel cell (all
 arches) + wire nightly/deep-sweep/kcov
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The canary is now a 4th suite file; every workflow that ran the suites must run it.
- tests.yml: a `canary` cell per arch (ubuntu/macos/windows) running
  canary.test.sh in parallel (the per-PR wall-clock win — Windows unit ~360s ->
  ~174s, overall macOS-gated). Canary step gates on `leg == 'canary'` only, so
  `leg: all` still runs unit+interop+integration but NOT canary (no double-run).
  Canary file added to the shellcheck lint list.
- nightly.yml: canary-under-load step in the stress cells; the kcov job runs the
  unit suite AND the canary under kcov into one merged outdir (filtered by
  --include-path), so the canary's coverage of git-commit-lock.sh is preserved
  (the 0.80 floor can't regress from the split). Canary kcov log uploaded.
- deep-sweep.yml: canary step with the same repeat-loop / PIPESTATUS fail-fast /
  under-load wrapping as the unit suite.

actionlint clean on all three workflows.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 .github/workflows/deep-sweep.yml | 25 +++++++++++++++++++++++++
 .github/workflows/nightly.yml    | 24 +++++++++++++++++++++++-
 .github/workflows/tests.yml      | 27 ++++++++++++++++++++++-----
 3 files changed, 70 insertions(+), 6 deletions(-)

diff --git a/.github/workflows/deep-sweep.yml b/.github/workflows/deep-sweep.yml
index 7ac74e9..3f55b68 100644
--- a/.github/workflows/deep-sweep.yml
+++ b/.github/workflows/deep-sweep.yml
@@ -132,6 +132,31 @@ jobs:
             fi
           done
 
+      - name: Canary suite (deep, looped x repeat, under load)
+        # The concurrency canary moved into its own file; the deep flake hunt should
+        # exercise it (a concurrency canary is exactly what a deep+loaded+repeated hunt
+        # is for). Same legs as the unit suite, same loop/fail-fast wrapping.
+        if: ${{ !cancelled() && (matrix.leg == 'all' || matrix.leg == 'unit') }}
+        timeout-minutes: ${{ matrix.os == 'windows-2025' && 100 || 90 }}
+        env:
+          GCL_TEST_PRESERVE_DIR: ${{ github.workspace }}/test-output/failed-work/canary
+        run: |
+          mkdir -p test-output
+          n='${{ inputs.repeat }}'
+          case "$n" in ''|*[!0-9]*) n=1 ;; esac
+          [ "$n" -lt 1 ] && n=1
+          echo "== canary: repeating $n time(s) under load =="
+          for i in $(seq 1 "$n"); do
+            echo "== canary iteration $i/$n =="
+            bash tests/with-load.sh bash tests/git-commit-lock.canary.test.sh 2>&1 \
+              | tee "test-output/canary-suite.iter$i.log"
+            rc=${PIPESTATUS[0]}
+            if [ "$rc" -ne 0 ]; then
+              echo "== canary iteration $i/$n FAILED (rc=$rc) — stopping deep sweep =="
+              exit 1
+            fi
+          done
+
       - name: Interop suite (deep, looped x repeat, under load)
         if: ${{ !cancelled() && (matrix.leg == 'all' || matrix.leg == 'interop-integration') }}   # run even if an earlier suite failed — every signal is useful
         timeout-minutes: 90
diff --git a/.github/workflows/nightly.yml b/.github/workflows/nightly.yml
index 6c72d6a..f48b9fe 100644
--- a/.github/workflows/nightly.yml
+++ b/.github/workflows/nightly.yml
@@ -95,6 +95,19 @@ jobs:
           mkdir -p test-output
           bash tests/with-load.sh bash tests/git-commit-lock.test.sh 2>&1 | tee test-output/unit-suite.log
 
+      - name: Canary suite (under load)
+        # The concurrency canary moved out of the unit suite into its own file; still
+        # exercise it under oversubscription here (concurrency + load is the highest-value
+        # canary scenario). Runs in the same legs the unit suite does (sequentially after
+        # it — nightly is non-blocking, so no separate parallel cell is warranted).
+        if: ${{ !cancelled() && (matrix.leg == 'all' || matrix.leg == 'unit') }}
+        timeout-minutes: ${{ matrix.os == 'windows-2025' && 20 || 12 }}   # load + sweep stretch the full-width canary; a step timeout FAILS the step so the upload still runs
+        env:
+          GCL_TEST_PRESERVE_DIR: ${{ github.workspace }}/test-output/failed-work/canary
+        run: |
+          mkdir -p test-output
+          bash tests/with-load.sh bash tests/git-commit-lock.canary.test.sh 2>&1 | tee test-output/canary-suite.log
+
       - name: Interop suite (under load; bash + pwsh)
         if: ${{ !cancelled() && (matrix.leg == 'all' || matrix.leg == 'interop-integration') }}   # run even if an earlier suite failed — every signal is useful
         timeout-minutes: 30
@@ -165,15 +178,23 @@ jobs:
           make -j"$(nproc)"
           ./src/kcov --version
 
-      - name: Run unit suite under kcov (FULL, strict, no load)
+      - name: Run unit + canary suites under kcov (FULL, strict, no load)
         env:
           GCL_TEST_FULL: 1
           # GCL_ENVELOPE_TIER unset => strict (we want a true, clean coverage run; no load applied)
           GCL_TEST_PRESERVE_DIR: ${{ github.workspace }}/test-output/failed-work/kcov-unit
         run: |
           mkdir -p test-output coverage
+          # The concurrency canary now lives in its own file; run BOTH the unit suite
+          # and the canary under kcov into the SAME --include-path outdir. kcov
+          # ACCUMULATES coverage across multiple runs that share one output dir (it
+          # merges into the top-level cobertura.xml), so the canary's coverage of
+          # git-commit-lock.sh is preserved and the 0.80 floor cannot regress from
+          # the split.
           /tmp/kcov-build/src/kcov --include-path="$(pwd)/git-commit-lock.sh" \
             coverage/kcov-out tests/git-commit-lock.test.sh 2>&1 | tee test-output/kcov-unit-suite.log
+          /tmp/kcov-build/src/kcov --include-path="$(pwd)/git-commit-lock.sh" \
+            coverage/kcov-out tests/git-commit-lock.canary.test.sh 2>&1 | tee test-output/kcov-canary-suite.log
 
       - name: Enforce coverage floor (parse cobertura line-rate)
         run: |
@@ -226,6 +247,7 @@ jobs:
           path: |
             coverage/kcov-out/
             test-output/kcov-unit-suite.log
+            test-output/kcov-canary-suite.log
           include-hidden-files: true
           if-no-files-found: warn
           retention-days: 30
diff --git a/.github/workflows/tests.yml b/.github/workflows/tests.yml
index 8ebffcc..1ee9d0a 100644
--- a/.github/workflows/tests.yml
+++ b/.github/workflows/tests.yml
@@ -32,16 +32,23 @@ jobs:
     strategy:
       fail-fast: false               # an OS-specific failure is the signal we want; let the others finish
       matrix:
-        # Windows splits into two parallel jobs — the bash-only unit suite is the
-        # wall-clock bottleneck there (~309s vs interop 100s + integration 28s;
-        # process-spawn overhead, not the PowerShell engines). Suites must NOT run
-        # concurrently inside one runner: they're timing-sensitive on 2-core
-        # runners. POSIX legs are fast enough to stay single-job.
+        # The concurrency CANARY (Test 1, full-width 25x8) is its OWN suite file and
+        # runs as a separate parallel `canary` cell on EVERY arch — it is ~half the
+        # Windows unit wall-clock (process-spawn overhead, not the PowerShell engines)
+        # and cheap on POSIX, so parallelising it is the per-PR wall-clock win.
+        # Windows otherwise splits unit vs interop-integration. Suites must NOT run
+        # concurrently inside one runner: they're timing-sensitive on 2-core runners.
+        # `leg: all` runs unit+interop+integration but NOT canary (the canary step
+        # gates on `leg == 'canary'` only). The job-name + artifact-name templates
+        # already key on matrix.leg, so the `canary` leg is named/uploaded uniquely.
         include:
           - { os: ubuntu-24.04, leg: all, job_timeout: 35 }
+          - { os: ubuntu-24.04, leg: canary, job_timeout: 15 }
           - { os: macos-15, leg: all, job_timeout: 35 }
+          - { os: macos-15, leg: canary, job_timeout: 15 }
           - { os: windows-2025, leg: unit, job_timeout: 20 }
           - { os: windows-2025, leg: interop-integration, job_timeout: 22 }
+          - { os: windows-2025, leg: canary, job_timeout: 15 }
     timeout-minutes: ${{ matrix.job_timeout }}   # backstop only: sum of the leg's step budgets + upload headroom
     defaults:
       run:
@@ -70,6 +77,15 @@ jobs:
           fi
           stat --version 2>/dev/null | head -1 || echo "stat: BSD variant"
 
+      - name: Canary suite (full-width concurrency canary)
+        if: ${{ matrix.leg == 'canary' }}
+        timeout-minutes: ${{ matrix.os == 'windows-2025' && 7 || 6 }}   # ~151s on Windows + headroom; a step timeout FAILS the step (not the job) so the upload still runs
+        env:
+          GCL_TEST_PRESERVE_DIR: ${{ github.workspace }}/test-output/failed-work/canary
+        run: |
+          mkdir -p test-output
+          bash tests/git-commit-lock.canary.test.sh 2>&1 | tee test-output/canary-suite.log
+
       - name: Unit suite
         if: ${{ matrix.leg == 'all' || matrix.leg == 'unit' }}
         timeout-minutes: ${{ matrix.os == 'windows-2025' && 15 || 10 }}   # a step timeout FAILS the step (not the job), so the upload step reliably runs; sized from run 27325978197 + one internal MAX_WAIT hang
@@ -131,6 +147,7 @@ jobs:
             git-commit-lock.sh \
             tests/_harness.sh \
             tests/git-commit-lock.test.sh \
+            tests/git-commit-lock.canary.test.sh \
             tests/git-commit-lock.interop.test.sh \
             tests/git-commit-lock.integration.test.sh \
             .github/scripts/nightly-triage.sh \

From 76bed1aa10519a795bc860861788712c9c5a787f Mon Sep 17 00:00:00 2001
From: Ben Toner <bentoner@bentoner.com>
Date: Thu, 18 Jun 2026 10:18:29 +1000
Subject: [PATCH 61/76] Plan: record canary-split CI result (~50% faster, all
 green, simpler than sharding)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

CI run 27728088150: all 8 jobs green. Overall CI 360s -> ~181s (~50%), gated by
the windows unit-minus-canary cell (181s) with the windows canary (165s)
balanced beside it; macOS 194->167s (canary peeled into a cheap 33s cell). Beats
the sharding (242s, imbalanced) and is far simpler — zero GCL_TEST_SHARD
machinery; the canary is just its own suite file.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 .../2026-06-18-ci-stress-canary-split-plan.md | 21 +++++++++++++++++++
 1 file changed, 21 insertions(+)
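The "~50%" figure is critical-path arithmetic: parallel cells cost only as much as the slowest one. A small sketch using the per-cell timings quoted in the results table (the lint job's time is not given there and is left out; the variable names are illustrative):

```shell
#!/usr/bin/env bash
# Per-cell wall-clock seconds for run 27728088150, as quoted in the plan:
# windows unit / canary / interop, macOS all / canary, ubuntu all / canary.
cells=(181 165 130 167 33 165 18)
baseline=360                      # pre-shard overall (windows unit)

slowest=0
for s in "${cells[@]}"; do
  if [ "$s" -gt "$slowest" ]; then slowest="$s"; fi
done

# Integer percentage reduction; bash arithmetic truncates (49.7 -> 49).
pct=$(( (baseline - slowest) * 100 / baseline ))
echo "slowest cell: ${slowest}s; reduction vs ${baseline}s: ${pct}%"
```

The same calculation on the sharding row (242s) gives roughly a one-third reduction, which is why the split compares favourably.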

diff --git a/.plans/2026-06-18-ci-stress-canary-split-plan.md b/.plans/2026-06-18-ci-stress-canary-split-plan.md
index 7ec5985..c90254b 100644
--- a/.plans/2026-06-18-ci-stress-canary-split-plan.md
+++ b/.plans/2026-06-18-ci-stress-canary-split-plan.md
@@ -132,6 +132,27 @@ they already run.)
   obsoleted by this file-extraction approach; the sharding was unwound (`89de803`+`143e280`).
   (Leave those plan files in place per "leave history be"; add a superseded-by pointer at their top.)
 
+## Results (CI verified — run 27728088150, all 8 jobs green)
+Implemented in `5fe15c9` (canary file + unit removal + harness) + `b1eb0a8` (CI wiring). Local
+proof passed (canary 2/0 standalone; unit-minus-canary 313/0; 313+2=315 disjoint/union==original;
+interop 141/0, integration 12/0; shellcheck + actionlint clean). Cross-platform CI **succeeded**:
+
+| | windows | macOS | ubuntu | overall (slowest) |
+|---|---|---|---|---|
+| **pre-shard** (`27716080146`) | unit **360s** | 194s | 182s | **360s** |
+| **sharding** (`27723744798`) | unit 242s ‖ 99s (imbalanced) | 210s | 181s | 242s |
+| **canary split** (`27728088150`) | unit **181s** ‖ canary 165s ‖ interop 130s | all 167s ‖ canary 33s | all 165s ‖ canary 18s | **~181s** |
+
+- **Overall CI 360s → ~181s (~50% faster)** — gated by the windows unit-minus-canary cell (181s),
+  with the windows canary (165s) well-balanced beside it.
+- macOS dropped 194→167s (the canary moved out of its `leg: all` into a cheap 33s cell); the
+  POSIX canary cells are cheap (ubuntu 18s, macOS 33s) and off the critical path.
+- **Beats the sharding (242s, imbalanced) AND is far simpler** — zero `GCL_TEST_SHARD` machinery;
+  the canary is just its own suite file. The sharding was unwound (`89de803`+`143e280`).
+- The kcov merged-coverage run and the nightly/deep-sweep canary steps are statically validated
+  (actionlint-clean); their first live exercise is post-merge (those workflows dispatch only from
+  the default branch).
+
 ## Out of scope
 - Reducing the canary's own ~151s width (a test-design change — the width *is* its coverage;
   worth a separate look, not here). Sharding/`GCL_TEST_SHARD` (removed). `n>2` (N/A — files, not shards).

From 7a242ee2044f9eb597888298a561522784c567a0 Mon Sep 17 00:00:00 2001
From: Ben Toner <bentoner@bentoner.com>
Date: Thu, 18 Jun 2026 10:36:02 +1000
Subject: [PATCH 62/76] Make the canary suite executable (kcov invokes it
 directly)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

The nightly kcov job runs `kcov … tests/git-commit-lock.canary.test.sh` directly
(not via `bash`), exactly as it does the unit suite, which works because the
unit suite is mode 100755. The canary was added as mode 100644, so the direct
kcov exec would fail with "permission denied". Set it executable to match the
unit suite (the precedent for a directly-invoked suite). Caught by Codex's
implementation review.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 tests/git-commit-lock.canary.test.sh | 0
 1 file changed, 0 insertions(+), 0 deletions(-)
 mode change 100644 => 100755 tests/git-commit-lock.canary.test.sh
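Why the mode matters can be reproduced in isolation. The sketch below assumes a POSIX filesystem that honours mode bits (Git Bash on Windows emulates them differently) and uses a throwaway stand-in script, not the real suite:

```shell
#!/usr/bin/env bash
# Sketch: direct execution needs the executable bit; `bash <file>` does not.
work="$(mktemp -d)"
suite="$work/demo.test.sh"        # hypothetical stand-in for a suite file
printf '#!/usr/bin/env bash\necho suite-ran\n' > "$suite"

chmod 644 "$suite"
via_bash="$(bash "$suite")"       # works: bash only needs to read the file
direct_rc=0
"$suite" >/dev/null 2>&1 || direct_rc=$?   # non-zero: permission denied

chmod 755 "$suite"
direct_out="$("$suite")"          # works once the file is executable

echo "via bash: $via_bash; direct rc at 644: $direct_rc; direct at 755: $direct_out"
rm -rf "$work"
```

This is the gap between the `bash tests/…` invocations in tests.yml and the direct `kcov … tests/…` invocation in nightly.yml. One way to record the bit in git without relying on the working tree's permissions is `git update-index --chmod=+x <path>`.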

diff --git a/tests/git-commit-lock.canary.test.sh b/tests/git-commit-lock.canary.test.sh
old mode 100644
new mode 100755

From 07309250b53260a70faac17c98156f26147fc4c4 Mon Sep 17 00:00:00 2001
From: Ben Toner <bentoner@bentoner.com>
Date: Thu, 18 Jun 2026 11:05:05 +1000
Subject: [PATCH 63/76] nightly kcov: select merged report by lines-covered +
 pipefail the kcov run
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Round-2 review fixes (Codex + Claude, independent) for the canary split's nightly
kcov job:
- Floor-parse picked the cobertura with the highest lines-VALID. The unit +
  canary runs into one outdir produce per-suite reports plus a merged union at
  coverage/kcov-out/kcov-merged/cobertura.xml. ALL cover the same source, so
  all have identical lines-valid, and the tie-break could grab a single-suite
  report and measure partial coverage (spurious floor failure). Select by
  highest lines-COVERED instead (the merged union has the most), which robustly
  picks the union. Comment corrected (kcov writes no top-level cobertura.xml).
- The kcov-run step lacked pipefail (the kcov job has no shell:bash default), so a
  failing `kcov ... | tee` returned tee's 0. Added `set -euo pipefail`.
- Dropped an inaccurate "+ sweep" from the canary step comment (the canary reads
  only GCL_TEST_FULL, not the Axis-A sweep).

The kcov job dispatches only from the default branch, so this path is exercised
post-merge; verified statically against the kcov v43 output layout.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 .github/workflows/nightly.yml | 25 +++++++++++++++----------
 1 file changed, 15 insertions(+), 10 deletions(-)
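The selection rule can be exercised against a fabricated outdir. In this sketch `pick_most_covered` is a hypothetical name, the directory names and numbers are invented for the demo, and the `|| true` is an addition that the workflow step does not have (it lets a report lacking the attribute fall back to 0 even under `set -euo pipefail`):

```shell
#!/usr/bin/env bash
# Sketch: choose the cobertura.xml with the most covered lines.
pick_most_covered() {
  local dir="$1" f c best=-1 cob=""
  while IFS= read -r f; do
    # First lines-covered attribute in the file; empty if absent.
    c="$(grep -oE 'lines-covered="[0-9]+"' "$f" 2>/dev/null | head -1 | grep -oE '[0-9]+' || true)"
    c="${c:-0}"
    if [ "$c" -gt "$best" ]; then best="$c"; cob="$f"; fi
  done < <(find "$dir" -name cobertura.xml 2>/dev/null)
  [ -n "$cob" ] && echo "$cob"
}

# Fabricated layout: two per-suite reports and a merged union, all with the
# same lines-valid, so a lines-valid comparison could not tell them apart.
out="$(mktemp -d)"
mkdir -p "$out/unit.aaaa" "$out/canary.bbbb" "$out/kcov-merged"
echo '<coverage lines-covered="300" lines-valid="400">' > "$out/unit.aaaa/cobertura.xml"
echo '<coverage lines-covered="120" lines-valid="400">' > "$out/canary.bbbb/cobertura.xml"
echo '<coverage lines-covered="340" lines-valid="400">' > "$out/kcov-merged/cobertura.xml"

picked="$(pick_most_covered "$out")"
echo "picked: ${picked#"$out"/}"
rm -rf "$out"
```

Because a union can never cover fewer lines than either of its inputs, taking the maximum of lines-covered selects the merged report regardless of the order `find` returns the files in.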

diff --git a/.github/workflows/nightly.yml b/.github/workflows/nightly.yml
index f48b9fe..e2bcc87 100644
--- a/.github/workflows/nightly.yml
+++ b/.github/workflows/nightly.yml
@@ -101,7 +101,7 @@ jobs:
         # canary scenario). Runs in the same legs the unit suite does (sequentially after
         # it — nightly is non-blocking, so no separate parallel cell is warranted).
         if: ${{ !cancelled() && (matrix.leg == 'all' || matrix.leg == 'unit') }}
-        timeout-minutes: ${{ matrix.os == 'windows-2025' && 20 || 12 }}   # load + sweep stretch the full-width canary; a step timeout FAILS the step so the upload still runs
+        timeout-minutes: ${{ matrix.os == 'windows-2025' && 20 || 12 }}   # load stretches the full-width canary; a step timeout FAILS the step so the upload still runs
         env:
           GCL_TEST_PRESERVE_DIR: ${{ github.workspace }}/test-output/failed-work/canary
         run: |
@@ -184,6 +184,7 @@ jobs:
           # GCL_ENVELOPE_TIER unset => strict (we want a true, clean coverage run; no load applied)
           GCL_TEST_PRESERVE_DIR: ${{ github.workspace }}/test-output/failed-work/kcov-unit
         run: |
+          set -euo pipefail
           mkdir -p test-output coverage
           # The concurrency canary now lives in its own file; run BOTH the unit suite
           # and the canary under kcov into the SAME --include-path outdir. kcov
@@ -199,23 +200,27 @@ jobs:
       - name: Enforce coverage floor (parse cobertura line-rate)
         run: |
           set -euo pipefail
-          # kcov writes a per-binary report under coverage/kcov-out/<binary>.<hash>/ and a
-          # merged top-level coverage/kcov-out/cobertura.xml. For a single-binary run they
-          # are equivalent; pick the one with the highest lines-valid (most complete) so
-          # this is robust either way.
+          # kcov does NOT write a top-level coverage/kcov-out/cobertura.xml. The two runs
+          # (unit + canary) into one outdir produce per-binary reports under
+          # coverage/kcov-out/<binary>.<hash>/cobertura.xml and a merged union at
+          # coverage/kcov-out/kcov-merged/cobertura.xml. All cover the same source
+          # git-commit-lock.sh, so they share an identical lines-valid — a lines-valid
+          # tie-break would keep whatever find returns first (a single-suite report).
+          # Pick the highest lines-COVERED instead: the merged union has the most covered
+          # lines, so this robustly selects it (for a single run there's just one report).
           cob=""
-          best_valid=-1
+          best_covered=-1
           while IFS= read -r f; do
-            v="$(grep -oE 'lines-valid="[0-9]+"' "$f" 2>/dev/null | head -1 | grep -oE '[0-9]+')"
-            v="${v:-0}"
-            if [ "$v" -gt "$best_valid" ]; then best_valid="$v"; cob="$f"; fi
+            c="$(grep -oE 'lines-covered="[0-9]+"' "$f" 2>/dev/null | head -1 | grep -oE '[0-9]+')"
+            c="${c:-0}"
+            if [ "$c" -gt "$best_covered" ]; then best_covered="$c"; cob="$f"; fi
           done < <(find coverage/kcov-out -name cobertura.xml 2>/dev/null)
           if [ -z "$cob" ] || [ ! -f "$cob" ]; then
             echo "::error::no cobertura.xml found under coverage/kcov-out — kcov produced no report"
             find coverage/kcov-out -maxdepth 3 -type f 2>/dev/null | sed 's/^/  /'
             exit 1
           fi
-          echo "Parsing coverage from: $cob (lines-valid=$best_valid)"
+          echo "Parsing coverage from: $cob (lines-covered=$best_covered)"
           # Prefer the precise lines-covered/lines-valid ratio (exact); fall back to the
           # rounded line-rate attribute. Both live on the top-level <coverage ...> tag.
           covered="$(grep -oE 'lines-covered="[0-9]+"' "$cob" | head -1 | grep -oE '[0-9]+')"

From 9d00d449e76980cb99cfbc7fff16b51f96168ca8 Mon Sep 17 00:00:00 2001
From: Ben Toner <bentoner@bentoner.com>
Date: Thu, 18 Jun 2026 11:05:07 +1000
Subject: [PATCH 64/76] docs: reflect the canary as a 4th suite (file table,
 run cmds, re-attribution)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Round-2 review caught that the canary extraction didn't update the user-facing
docs:
- docs/git-commit-lock.md: add the canary row to the suite table + run-commands;
  "three"->"four" (temp dirs / CI / log-copy); and move the "mutual exclusion
  under many concurrent workers" description from the unit suite to a new canary
  paragraph (only Test 1 moved — crash-recovery/claim-contention stay in the unit
  suite and remain in its list). Also note the canary's counts are exact.
- README.md: "Three suites" -> "Four suites" (+ the bash concurrency canary).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 README.md               |  7 ++++---
 docs/git-commit-lock.md | 19 +++++++++++++------
 2 files changed, 17 insertions(+), 9 deletions(-)

diff --git a/README.md b/README.md
index 9c7d595..2027cd0 100644
--- a/README.md
+++ b/README.md
@@ -256,9 +256,10 @@ knobs and how staleness and stealing work.
 
 ## Tests
 
-Three suites — bash unit, bash + PowerShell interop, and an end-to-end
-integration run of concurrent real commits — cover the tool, and CI runs
-them on Linux, macOS, and Windows. How to run them and what each covers:
+Four suites — bash unit, a bash concurrency canary, bash + PowerShell
+interop, and an end-to-end integration run of concurrent real commits —
+cover the tool, and CI runs them on Linux, macOS, and Windows. How to run
+them and what each covers:
 [`docs/git-commit-lock.md#tests`](docs/git-commit-lock.md#tests).
 
 ## Licence
diff --git a/docs/git-commit-lock.md b/docs/git-commit-lock.md
index f47fbb8..515e24f 100644
--- a/docs/git-commit-lock.md
+++ b/docs/git-commit-lock.md
@@ -577,6 +577,7 @@ unavailable):
 | `git-commit-lock.sh`                  | the mutex (bash; the authoritative implementation): source for `lock_acquire/lock_release/lock_run`, or `git-commit-lock.sh run -- <cmd>` |
 | `git-commit-lock.ps1`                 | wire-compatible PowerShell port (see [The PowerShell port](#the-powershell-port-git-commit-lockps1) above): `git-commit-lock.ps1 run "<pwsh cmd>"`, or dot-source for `Lock-Acquire`/`Lock-Release` |
 | `tests/git-commit-lock.test.sh`             | self-contained bash tests (throwaway temp dirs); exit 0 == all pass |
+| `tests/git-commit-lock.canary.test.sh`      | bash concurrency canary: mutual exclusion under many concurrent workers, plus the contended crash-recovery / claim-serialization scenarios (throwaway temp dirs) |
 | `tests/git-commit-lock.interop.test.sh`     | cross-impl tests: pwsh + bash workers share one lock and serialise; run from MINGW/Git-Bash |
 | `tests/git-commit-lock.integration.test.sh` | end-to-end: many concurrent workers make real commits into one shared repo; the history is audited for the tool's guarantees |
 
@@ -587,20 +588,25 @@ Run the suites from a clone of this repository (they are not installed to
 
 ```sh
 bash tests/git-commit-lock.test.sh             # bash implementation
+bash tests/git-commit-lock.canary.test.sh      # bash concurrency canary (mutual exclusion + crash recovery under load)
 bash tests/git-commit-lock.interop.test.sh     # bash + PowerShell interop (skips if pwsh is absent)
 bash tests/git-commit-lock.integration.test.sh # end-to-end: concurrent real commits into one repo (pwsh half skips if absent)
 ```
 
 Each suite prints a result summary line and exits 0 when everything passes.
-All three use throwaway temp dirs and never touch the repo you launch them
+All four use throwaway temp dirs and never touch the repo you launch them
 from. The heavy fan-out tests run at a REDUCED width by default, so a routine
 run doesn't lag a shared development machine; each suite prints a
 `fan-out mode:` line at the start and tags its result line with the mode, so
 check those say `FULL` when you ran `GCL_TEST_FULL=1` for the full-strength
 canary (CI does).
 
-`tests/git-commit-lock.test.sh` covers the bash implementation: mutual exclusion
-under many concurrent workers (clean acquire/release path), stale-lock theft,
+`tests/git-commit-lock.canary.test.sh` is the concurrency canary: mutual
+exclusion under many concurrent workers (clean acquire/release path) over
+repeated rounds — the statistical scenario that needs the full 8×25 fan-out
+(`GCL_TEST_FULL=1`, which CI runs) to trust a rare exclusion race.
+
+`tests/git-commit-lock.test.sh` covers the bash implementation: stale-lock theft,
 crash recovery under contention (several waiters racing one dead lock —
 claim-serialized: exactly one steal, zero displacements, zero spurious 98s,
 and no move-aside file ever created), claim contention (many concurrent
@@ -664,7 +670,7 @@ is audited for the guarantees this document claims — every commit lands,
 history stays linear, no commit sweeps up another worker's file, no
 `index.lock` races, no stolen leases, and a clean tree at the end.
 
-The same three suites run in CI on Linux, macOS, and Windows
+The same four suites run in CI on Linux, macOS, and Windows
 (`.github/workflows/tests.yml`), at full fan-out strength, alongside a
 shellcheck + PSScriptAnalyzer lint job. The POSIX legs exercise the
 PowerShell implementation purely as cross-implementation protocol
@@ -680,9 +686,10 @@ heavy process fan-out is environmental, not a lock failure — but only the
 interop suite's exclusion test tolerates it (scoring by violations/steals,
 with a minimum-acquired floor so a collapsed fan-out cannot pass vacuously);
 the integration suite is deliberately strict per worker (every worker must
-launch and commit), and the unit suite's counts are exact.
+launch and commit), and the unit and canary suites' counts are exact (the
+canary requires every worker to acquire and release in each round).
 
-For debugging, all three suites copy their logs and work dirs to
+For debugging, all four suites copy their logs and work dirs to
 `$GCL_TEST_PRESERVE_DIR` when it is set, and keep the work dir on disk on any
 failure.
 

From 89e25d6317033eeb4923a56931af1803aa8b0b2a Mon Sep 17 00:00:00 2001
From: Ben Toner <bentoner@bentoner.com>
Date: Thu, 18 Jun 2026 11:10:38 +1000
Subject: [PATCH 65/76] docs: don't overstate the canary suite (mutual
 exclusion only, not crash-recovery)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Round-3 review catch: the canary suite TABLE ROW + RUN-COMMAND comment claimed
the canary also covers "crash-recovery / claim-serialization", but those (unit
Tests 2b/20) stayed in the unit suite — the canary file contains only Test 1
(mutual exclusion over repeated rounds + its mtime-probe-warning guard). Align
the table row + run comment with the (already-correct) detailed paragraph.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 docs/git-commit-lock.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/docs/git-commit-lock.md b/docs/git-commit-lock.md
index 515e24f..c8dc29b 100644
--- a/docs/git-commit-lock.md
+++ b/docs/git-commit-lock.md
@@ -577,7 +577,7 @@ unavailable):
 | `git-commit-lock.sh`                  | the mutex (bash; the authoritative implementation): source for `lock_acquire/lock_release/lock_run`, or `git-commit-lock.sh run -- <cmd>` |
 | `git-commit-lock.ps1`                 | wire-compatible PowerShell port (see [The PowerShell port](#the-powershell-port-git-commit-lockps1) above): `git-commit-lock.ps1 run "<pwsh cmd>"`, or dot-source for `Lock-Acquire`/`Lock-Release` |
 | `tests/git-commit-lock.test.sh`             | self-contained bash tests (throwaway temp dirs); exit 0 == all pass |
-| `tests/git-commit-lock.canary.test.sh`      | bash concurrency canary: mutual exclusion under many concurrent workers, plus the contended crash-recovery / claim-serialization scenarios (throwaway temp dirs) |
+| `tests/git-commit-lock.canary.test.sh`      | bash concurrency canary: mutual exclusion under many concurrent workers over repeated rounds — the statistical full-fan-out scenario (throwaway temp dirs) |
 | `tests/git-commit-lock.interop.test.sh`     | cross-impl tests: pwsh + bash workers share one lock and serialise; run from MINGW/Git-Bash |
 | `tests/git-commit-lock.integration.test.sh` | end-to-end: many concurrent workers make real commits into one shared repo; the history is audited for the tool's guarantees |
 
@@ -588,7 +588,7 @@ Run the suites from a clone of this repository (they are not installed to
 
 ```sh
 bash tests/git-commit-lock.test.sh             # bash implementation
-bash tests/git-commit-lock.canary.test.sh      # bash concurrency canary (mutual exclusion + crash recovery under load)
+bash tests/git-commit-lock.canary.test.sh      # bash concurrency canary (mutual exclusion under many concurrent workers)
 bash tests/git-commit-lock.interop.test.sh     # bash + PowerShell interop (skips if pwsh is absent)
 bash tests/git-commit-lock.integration.test.sh # end-to-end: concurrent real commits into one repo (pwsh half skips if absent)
 ```

From 86302c5ebf233692b31cab672638aa22d4b2337e Mon Sep 17 00:00:00 2001
From: Ben Toner <bentoner@bentoner.com>
Date: Thu, 18 Jun 2026 11:17:47 +1000
Subject: [PATCH 66/76] Fix split-introduced staleness: unit suite + CI
 comments no longer claim the canary

Round-4 review catches (all split-introduced, swept the whole class):
- tests/git-commit-lock.test.sh: Test 1 moved to the canary file, but the suite
  still computed orphaned T1_ROUNDS/T1_N and its `fan-out mode:` echo + header
  advertised "the full-strength 8x25 canary". Removed the dead T1_* vars; the
  mode-echo now reports the suite's actual heavy fan-out (Test 2b rounds, Test 20
  workers); header/echo say "full-strength fan-out", not "canary".
- deep-sweep.yml: "Mirrors the tests.yml 4-cell set" -> tests.yml now has canary
  cells; reworded to "per-OS legs" + noted the canary runs as a step here.
- nightly.yml: clarified the leg comment (all/unit/interop-integration as in
  tests.yml; the canary is a step here, not a leg).

bash -n + shellcheck + actionlint clean; mode-echo verified.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 .github/workflows/deep-sweep.yml |  5 +++--
 .github/workflows/nightly.yml    |  5 +++--
 tests/git-commit-lock.test.sh    | 13 ++++++++-----
 3 files changed, 14 insertions(+), 9 deletions(-)

diff --git a/.github/workflows/deep-sweep.yml b/.github/workflows/deep-sweep.yml
index 3f55b68..64eb5e2 100644
--- a/.github/workflows/deep-sweep.yml
+++ b/.github/workflows/deep-sweep.yml
@@ -61,8 +61,9 @@ jobs:
     strategy:
       fail-fast: false               # every cell's verdict is a useful deep signal; let the rest finish
       matrix:
-        # Mirrors the tests.yml 4-cell set (ubuntu all / macos all / windows unit /
-        # windows interop+integration). Windows stays split because the bash-only
+        # Mirrors tests.yml's per-OS legs (ubuntu all / macos all / windows unit /
+        # windows interop+integration); the canary runs as a step within the unit/all
+        # legs here, not a separate cell. Windows stays split because the bash-only
         # unit suite is the wall-clock bottleneck there and the suites must not run
         # concurrently inside one timing-sensitive 2-core runner. Generous deep
         # timeouts: deep + loaded + repeated is far slower than the per-PR gate.
diff --git a/.github/workflows/nightly.yml b/.github/workflows/nightly.yml
index e2bcc87..08e0c9e 100644
--- a/.github/workflows/nightly.yml
+++ b/.github/workflows/nightly.yml
@@ -42,8 +42,9 @@ env:
 
 jobs:
   # ── The 6 stress cells. Each runs the relevant suite(s) wrapped in with-load.sh
-  #    under one GCL_STRESS_KIND. `leg` selects which suites run (mirrors tests.yml):
-  #    ubuntu/macos run the full set; windows splits unit vs interop-integration. ──
+  #    under one GCL_STRESS_KIND. `leg` selects which suites run (the all/unit/
+  #    interop-integration legs as in tests.yml; the canary runs as a step here, not a
+  #    leg): ubuntu/macos run the full set; windows splits unit vs interop-integration. ──
   stress:
     name: ${{ matrix.id }} ${{ matrix.os }} (${{ matrix.kind }}${{ matrix.leg != 'all' && format(', {0}', matrix.leg) || '' }})
     runs-on: ${{ matrix.os }}
diff --git a/tests/git-commit-lock.test.sh b/tests/git-commit-lock.test.sh
index ea2cc67..a464fee 100755
--- a/tests/git-commit-lock.test.sh
+++ b/tests/git-commit-lock.test.sh
@@ -7,7 +7,7 @@
 #
 # Fan-out: heavy concurrency tests default to REDUCED width so routine dev
 # runs don't lag a live shared machine; set GCL_TEST_FULL=1 (CI does) for the
-# full-strength canary. The suite prints which mode ran — a reduced pass must
+# full-strength fan-out. The suite prints which mode ran — a reduced pass must
 # never masquerade as the full one.
 #
 # On failure the work dir is PRESERVED (path printed) for post-mortem; set
@@ -42,12 +42,15 @@ ROOT="$(cd "$DIR/.." && pwd)"   # the implementations live at the repo root
 LIB="$ROOT/git-commit-lock.sh"
 
 if [ "${GCL_TEST_FULL:-0}" = 1 ]; then
-  GCL_MODE="FULL"; T1_ROUNDS=8; T1_N=25; T2B_ROUNDS=4; T20_N=10
+  GCL_MODE="FULL"; T2B_ROUNDS=4; T20_N=10
 else
-  GCL_MODE="REDUCED"; T1_ROUNDS=3; T1_N=8; T2B_ROUNDS=2; T20_N=5
+  GCL_MODE="REDUCED"; T2B_ROUNDS=2; T20_N=5
 fi
-echo "fan-out mode: $GCL_MODE (T1 ${T1_ROUNDS} rounds x ${T1_N} workers)"
-[ "$GCL_MODE" = REDUCED ] && echo "  (set GCL_TEST_FULL=1 for the full-strength 8x25 canary — CI runs it)"
+# (The full-width concurrency canary, formerly Test 1, now lives in its own file
+# tests/git-commit-lock.canary.test.sh; this suite's heavy fan-out is Test 2b /
+# Test 20.)
+echo "fan-out mode: $GCL_MODE (Test 2b ${T2B_ROUNDS} rounds, Test 20 ${T20_N} concurrent workers)"
+[ "$GCL_MODE" = REDUCED ] && echo "  (set GCL_TEST_FULL=1 for full-strength fan-out — CI runs it)"
 
 WORK="$(mktemp -d 2>/dev/null || echo "${TMPDIR:-/tmp}/git-commit-lock-test.$$")"
 mkdir -p "$WORK"

From a662ce77c1d508fa9e8490c58ca5021abca42868 Mon Sep 17 00:00:00 2001
From: Ben Toner <bentoner@bentoner.com>
Date: Thu, 18 Jun 2026 11:27:06 +1000
Subject: [PATCH 67/76] comments: fix two stale/inaccurate canary-split
 comments
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Round-5 review catches (comment-only, no behaviour change):
- nightly.yml kcov-run comment said kcov "merges into the top-level cobertura.xml"
  — contradicts the round-2-corrected enforcement comment. kcov writes the union
  to coverage/kcov-out/kcov-merged/cobertura.xml (no top-level file); the
  enforcement step selects it by highest lines-covered. Comment made consistent.
- deep-sweep.yml loop comment claimed "set -e is NOT in effect (default bash here)"
  — but deep-sweep sets shell: bash (-eo pipefail), so set -e + pipefail ARE on; a
  failing pipeline already trips the step. Reworded: the explicit PIPESTATUS check
  is a defensive backstop that names the failing iteration. (Pre-existing Bucket-6d
  wording the canary loop shares; behaviour was always correct.)
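
For readers unfamiliar with the pattern, a minimal sketch of such a backstop
follows. It is illustrative only and is not the deep-sweep.yml loop: `run_suite`,
the temporary log, and the three-iteration count are hypothetical stand-ins, and
the sketch breaks out of the loop where a real step would `exit 1`.

```bash
#!/usr/bin/env bash
# Illustrative sketch only; run_suite is a hypothetical stand-in for a suite command.
set -eo pipefail

run_suite() { [ "$1" -ne 2 ]; }   # pretend iteration 2 flakes

log="$(mktemp)"
failed_iter=0
for i in 1 2 3; do
  rc=0
  # The `||` keeps errexit from ending the script on the failing pipeline,
  # so the loop can record the suite's own rc and name the iteration.
  run_suite "$i" 2>&1 | tee -a "$log" >/dev/null || rc="${PIPESTATUS[0]}"
  if [ "$rc" -ne 0 ]; then
    echo "iteration $i failed (rc=$rc)"
    failed_iter="$i"
    break   # a real CI step would `exit 1` here
  fi
done
rm -f "$log"
```

Without the `||`, the same failing pipeline would still stop the script under
`-eo pipefail`, but the log would not say which repeat failed.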

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 .github/workflows/deep-sweep.yml |  5 +++--
 .github/workflows/nightly.yml    | 11 ++++++-----
 2 files changed, 9 insertions(+), 7 deletions(-)

diff --git a/.github/workflows/deep-sweep.yml b/.github/workflows/deep-sweep.yml
index 64eb5e2..2fff157 100644
--- a/.github/workflows/deep-sweep.yml
+++ b/.github/workflows/deep-sweep.yml
@@ -109,8 +109,9 @@ jobs:
       # Each suite is repeated `repeat` times under load. The loop fails fast: the
       # first failing iteration `exit 1`s the step (so the step — and job — go red on
       # the earliest flake), and every iteration names its index in the log so a
-      # failure is attributable to a specific repeat. `set -e` is NOT in effect
-      # (default bash here), so we check with-load.sh's propagated rc explicitly.
+      # failure is attributable to a specific repeat. Under `shell: bash` (-eo
+      # pipefail) a failing suite pipeline already trips the step; the explicit
+      # PIPESTATUS check is a defensive backstop that also names the failing iteration.
       - name: Unit suite (deep, looped x repeat, under load)
         if: ${{ matrix.leg == 'all' || matrix.leg == 'unit' }}
         timeout-minutes: ${{ matrix.os == 'windows-2025' && 100 || 90 }}
diff --git a/.github/workflows/nightly.yml b/.github/workflows/nightly.yml
index 08e0c9e..3d8fe3b 100644
--- a/.github/workflows/nightly.yml
+++ b/.github/workflows/nightly.yml
@@ -188,11 +188,12 @@ jobs:
           set -euo pipefail
           mkdir -p test-output coverage
           # The concurrency canary now lives in its own file; run BOTH the unit suite
-          # and the canary under kcov into the SAME --include-path outdir. kcov
-          # ACCUMULATES coverage across multiple runs that share one output dir (it
-          # merges into the top-level cobertura.xml), so the canary's coverage of
-          # git-commit-lock.sh is preserved and the 0.80 floor cannot regress from
-          # the split.
+          # and the canary under kcov into the SAME output dir. kcov ACCUMULATES
+          # coverage across multiple runs that share one output dir, writing the union
+          # to coverage/kcov-out/kcov-merged/cobertura.xml (NOT a top-level
+          # cobertura.xml — the enforcement step below reads the merged union by
+          # selecting the highest lines-covered), so the canary's coverage of
+          # git-commit-lock.sh is preserved and the 0.80 floor cannot regress from the split.
           /tmp/kcov-build/src/kcov --include-path="$(pwd)/git-commit-lock.sh" \
             coverage/kcov-out tests/git-commit-lock.test.sh 2>&1 | tee test-output/kcov-unit-suite.log
           /tmp/kcov-build/src/kcov --include-path="$(pwd)/git-commit-lock.sh" \

From c10aca045cbae49b490dfca4bfd2522658193e97 Mon Sep 17 00:00:00 2001
From: Ben Toner <bentoner@bentoner.com>
Date: Thu, 18 Jun 2026 11:33:35 +1000
Subject: [PATCH 68/76] comments: sweep remaining canary-split staleness (kcov
 section header; Test 42 xref)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Comprehensive sweep of all touched files for stale post-split references (one had
surfaced per review round); fixed the two genuine remainders (comment-only):
- nightly.yml kcov section header said "unit suite at FULL" — the job now runs
  unit + canary under kcov (the step name already says so). -> "unit + canary at FULL".
- tests/git-commit-lock.test.sh Test 42 comment cross-referenced "Test 1" (which a
  reader would seek in the unit suite) — Test 1 moved to the canary file. Clarified:
  "the concurrency canary (formerly Test 1, now git-commit-lock.canary.test.sh)".

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 .github/workflows/nightly.yml | 2 +-
 tests/git-commit-lock.test.sh | 3 ++-
 2 files changed, 3 insertions(+), 2 deletions(-)

diff --git a/.github/workflows/nightly.yml b/.github/workflows/nightly.yml
index 3d8fe3b..0f375da 100644
--- a/.github/workflows/nightly.yml
+++ b/.github/workflows/nightly.yml
@@ -149,7 +149,7 @@ jobs:
           if-no-files-found: warn
           retention-days: 14
 
-  # ── kcov line-coverage gate. Linux-only, no load, strict, unit suite at FULL.
+  # ── kcov line-coverage gate. Linux-only, no load, strict, unit + canary at FULL.
   #    Build kcov v43 from source (no apt package / prebuilt). Gate at 0.80. ──────
   kcov:
     name: kcov coverage (Linux, no load, strict)
diff --git a/tests/git-commit-lock.test.sh b/tests/git-commit-lock.test.sh
index a464fee..925466c 100755
--- a/tests/git-commit-lock.test.sh
+++ b/tests/git-commit-lock.test.sh
@@ -2578,7 +2578,8 @@ if section "Test 42: mtime unreadable — staleness disabled, fail-safe (no stea
 # PRESENT file), staleness detection is BROKEN. The mtime floor fails closed to
 # "fresh": _lock_verify_stale returns state=fresh, so a crashed/stale holder is
 # NEVER stolen — recovery is disabled and waiters block to MAX_WAIT (97). The
-# tool must say so LOUDLY, exactly once per process. Test 1 only asserts the
+# tool must say so LOUDLY, exactly once per process. The concurrency canary
+# (formerly Test 1, now tests/git-commit-lock.canary.test.sh) only asserts the
 # NEGATIVE (the warning must NOT fire under healthy contention); this drives the
 # positive lane.
 #

From 6a9053b556a7b8ff4bb502ccffe9f2b73e328a11 Mon Sep 17 00:00:00 2001
From: Ben Toner <bentoner@bentoner.com>
Date: Thu, 18 Jun 2026 12:58:39 +1000
Subject: [PATCH 69/76] Phase-4 round-1: fix test + CI review findings
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

From the Phase-4 area reviews (A unit+canary, C CI):
- A1: Test 42 `grep -c … || echo 0` appended a 2nd "0" on zero matches ("0\n0"),
  garbling the warn-once assertion. Drop `|| echo 0`: grep -c prints 0 itself, and
  `${t42_warns:-0}` covers the missing-file case. This only bit the zero-firings
  case (not a false-green); the count is now a clean integer. Suite still 313/0.
- B8: reflow an awkward comment wrap in tests/_harness.sh (cosmetic).
- C1: nightly kcov job inherited the workflow-level GCL_ENVELOPE_TIER=relax despite
  its "strict" intent — set GCL_ENVELOPE_TIER=strict on the kcov step (a true clean
  coverage run; the floor enforces the envelope assertions).
- C2: nightly-triage.sh classified ANY job failure as `nightly-correctness`, so a
  step TIMEOUT (no ^FAIL: line) was misclassified as correctness. Correctness now
  requires a real ^FAIL: line; failure/timeout/cancel without one falls to `nightly-infra`.
- C3: add tests/with-load.sh to the required tests.yml shellcheck gate (nightly/
  deep-sweep depend on it; it was unlinted per-PR).
- Drop a stale `steering-coverage.md` reference in a Test 46 comment.
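
The A1 pitfall can be reproduced in isolation. The sketch below is illustrative
only: the temp file and the "BROKEN" pattern are hypothetical stand-ins for the
Test 42 stderr capture, and the variable names are not the suite's.

```bash
#!/usr/bin/env bash
# Illustrative sketch; the file contents and pattern are hypothetical.
err_file="$(mktemp)"
printf 'some unrelated stderr line\n' > "$err_file"

# Old form: on zero matches grep -c prints "0" AND exits 1,
# so `|| echo 0` appends a second "0" to the captured output.
old="$(grep -c "BROKEN" "$err_file" 2>/dev/null || echo 0)"

# Fixed form: keep grep's own count; default only when the output is empty.
new="$(grep -c "BROKEN" "$err_file" 2>/dev/null)"; new="${new:-0}"

# Missing file: grep prints nothing, so the default supplies the 0.
missing="$(grep -c "BROKEN" "$err_file.absent" 2>/dev/null)"; missing="${missing:-0}"

printf 'old=%q new=%q missing=%q\n' "$old" "$new" "$missing"
rm -f "$err_file"
```

Here `old` holds two lines, which `[ "$old" -le 1 ]` cannot parse as an integer,
while `new` and `missing` are each a single integer.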

shellcheck -S style + actionlint clean; unit suite 313/0; triage dry-run confirms
timeout→infra.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 .github/scripts/nightly-triage.sh | 12 +++++++-----
 .github/workflows/nightly.yml     |  5 ++++-
 .github/workflows/tests.yml       |  1 +
 tests/_harness.sh                 |  6 +++---
 tests/git-commit-lock.test.sh     |  4 ++--
 5 files changed, 17 insertions(+), 11 deletions(-)

diff --git a/.github/scripts/nightly-triage.sh b/.github/scripts/nightly-triage.sh
index 485764d..d9bab14 100644
--- a/.github/scripts/nightly-triage.sh
+++ b/.github/scripts/nightly-triage.sh
@@ -110,14 +110,16 @@ for cell in $EXPECTED_CELLS; do
     fi
   done
 
-  if [ "$cell_fail" -eq 1 ] || [ "$concl" = "failure" ]; then
-    correctness_evidence+="- ${cell}: job='${concl}'"
-    [ "$cell_fail" -eq 1 ] && correctness_evidence+=", FAIL lines present:"$'\n'"${fail_lines}" || correctness_evidence+=" (job failed; no ^FAIL: in logs — see job log)"$'\n'
+  if [ "$cell_fail" -eq 1 ]; then
+    # A real `^FAIL:` assertion line ⇒ correctness, regardless of job conclusion.
+    correctness_evidence+="- ${cell}: job='${concl}', FAIL lines present:"$'\n'"${fail_lines}"
     log "[$cell] CORRECTNESS (cell_fail=$cell_fail conclusion=$concl)"
   elif [ "$concl" != "success" ]; then
     # Logs exist but the job did not cleanly succeed and there is no assertion FAIL:
-    # timeout / cancelled / errored late ⇒ infra, not green.
-    infra_evidence+="- ${cell}: logs present but job conclusion='${concl}' (timeout/cancel/late error)"$'\n'
+    # failure-without-^FAIL / timeout / cancelled / errored late ⇒ infra, not
+    # correctness and not green (a failure WITHOUT a FAIL line is a step
+    # timeout/late error, which is infra per the Bucket 6 design).
+    infra_evidence+="- ${cell}: logs present but job conclusion='${concl}' (failure/timeout/cancel without ^FAIL: line)"$'\n'
     log "[$cell] INFRA (conclusion=$concl, no FAIL)"
   elif [ "$cell_envwarn" -eq 1 ]; then
     envelope_evidence+="- ${cell}: succeeded with WARN[env-relaxed] (envelope assertion(s) stretched under load — expected)"$'\n'
diff --git a/.github/workflows/nightly.yml b/.github/workflows/nightly.yml
index 0f375da..ba5c9ad 100644
--- a/.github/workflows/nightly.yml
+++ b/.github/workflows/nightly.yml
@@ -182,7 +182,10 @@ jobs:
       - name: Run unit + canary suites under kcov (FULL, strict, no load)
         env:
           GCL_TEST_FULL: 1
-          # GCL_ENVELOPE_TIER unset => strict (we want a true, clean coverage run; no load applied)
+          # Set strict EXPLICITLY here to override the workflow-level GCL_ENVELOPE_TIER: relax
+          # (which this step would otherwise inherit) — we want a true, clean coverage run with
+          # the wall-clock envelope assertions enforced, no load applied.
+          GCL_ENVELOPE_TIER: strict
           GCL_TEST_PRESERVE_DIR: ${{ github.workspace }}/test-output/failed-work/kcov-unit
         run: |
           set -euo pipefail
diff --git a/.github/workflows/tests.yml b/.github/workflows/tests.yml
index 1ee9d0a..1b579e2 100644
--- a/.github/workflows/tests.yml
+++ b/.github/workflows/tests.yml
@@ -150,6 +150,7 @@ jobs:
             tests/git-commit-lock.canary.test.sh \
             tests/git-commit-lock.interop.test.sh \
             tests/git-commit-lock.integration.test.sh \
+            tests/with-load.sh \
             .github/scripts/nightly-triage.sh \
             install.sh
 
diff --git a/tests/_harness.sh b/tests/_harness.sh
index 9ff10ca..71eed3a 100644
--- a/tests/_harness.sh
+++ b/tests/_harness.sh
@@ -3,9 +3,9 @@
 #
 # Sourced by all four suites (git-commit-lock.test.sh, .canary.test.sh,
 # .interop.test.sh, .integration.test.sh) to share the bits they all
-# copy-pasted: the PASS/FAIL/
-# TAP counters, the GCL_TAP / GCL_TEST_ONLY reads, ok()/bad(), section(), the
-# end-of-suite DONE sentinel (finish), and the per-test selector verdict helper.
+# copy-pasted: the PASS/FAIL/TAP counters, the GCL_TAP / GCL_TEST_ONLY reads,
+# ok()/bad(), section(), the end-of-suite DONE sentinel (finish), and the
+# per-test selector verdict helper.
 # Pure deduplication — ZERO behaviour change vs the inline copies it replaces.
 #
 # Contract for sourcing suites:
diff --git a/tests/git-commit-lock.test.sh b/tests/git-commit-lock.test.sh
index 925466c..c722886 100755
--- a/tests/git-commit-lock.test.sh
+++ b/tests/git-commit-lock.test.sh
@@ -2638,7 +2638,7 @@ g42="$(head -n 1 -- "$T42_LOCK" 2>/dev/null | tr -d '\r')"
   && ok "mtime-unreadable: waiter blocked to MAX_WAIT and exited 97" \
   || bad "mtime-unreadable: waiter rc=$t42_rc (want 97 — was the stale ghost stolen?)"
 # (4) Warn-once: the broken-staleness warning fires EXACTLY once per process.
-t42_warns="$(grep -c "Staleness detection is BROKEN" "$T42_ERR" 2>/dev/null || echo 0)"
+t42_warns="$(grep -c "Staleness detection is BROKEN" "$T42_ERR" 2>/dev/null)"; t42_warns="${t42_warns:-0}"
 [ "$t42_warns" -le 1 ] \
   && ok "mtime-unreadable: broken-staleness warning fired at most once on stderr ($t42_warns)" \
   || bad "mtime-unreadable: warning repeated ($t42_warns times — warn-once broken)"
@@ -2845,7 +2845,7 @@ rm -f "$LOCK" "$LOG"
 fi
 
 if section "Test 46: EXIT while waiting (no hold) — no-hold trap arc, no spurious release"; then
-# A10 (steering-coverage.md): _lock_on_exit's no-hold arc-end (:1009,1017-1018).
+# Covers _lock_on_exit's no-hold arc-end (sh:1009,1017-1018).
 # A sourced waiter, blocked in the wait loop against a LIVE held lock, exits 0
 # while still parked — the EXIT trap is STILL '_lock_on_exit' (the timeout's
 # trap-restore has NOT run, because we never time out), so EXIT fires the

From e8f192ba18d77408adcf0df233e821c877c2f522 Mon Sep 17 00:00:00 2001
From: Ben Toner <bentoner@bentoner.com>
Date: Thu, 18 Jun 2026 12:58:41 +1000
Subject: [PATCH 70/76] Phase-4 round-1: documentation accuracy + merge scope
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

From the Phase-4 docs review (D) + Ben's merge-scope decision:
- D1: guarantees.md still said "coverage planned" for BE-3/BE-4/G-S2 resource lanes
  that are now TESTED + passing — flip to cite Tests 42/48/49/50 (F3 stays
  document-only), reconciling with failure-modes.md §2 (flipped earlier in 309cf39).
- D2: re-attribute the concurrency canary from "unit Test 1" to
  tests/git-commit-lock.canary.test.sh (the branch's central change) in guarantees.md
  (G-S3) + failure-modes.md (A1); add a `C:` witness legend. Crash-recovery/
  claim-contention stay unit Tests 2b/20.
- guarantees.md OOS-5: the bash `exec` §H4 lane is no longer "a coverage gap" — cite
  unit Test 40 (the exec-bypass boundary test); drop the steering-coverage.md ref.
- load-testing-strategy.md: rewritten from a stale "RECOMMENDATION / not-built-yet"
  plan into a present-tense rationale for why the CI is shaped as it is (Ben's call:
  must describe current state, not read as a stale plan). 346 -> 236 lines.
- steering-coverage.md: DELETED. It was a point-in-time gap list whose every gap is
  filled (Tests 37-50) + had a broken .plans/ link — a finished working artifact, not
  main-worthy (Ben's decision). The kcov baseline lives in the nightly kcov gate.
- Update 2 parallel stale README notes in guarantees.md/failure-modes.md, which
  still described caveats as missing from the README (the README now carries them).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 docs/failure-modes.md         |  22 +-
 docs/guarantees.md            |  53 ++--
 docs/load-testing-strategy.md | 572 ++++++++++++++--------------------
 docs/steering-coverage.md     | 288 -----------------
 4 files changed, 276 insertions(+), 659 deletions(-)
 delete mode 100644 docs/steering-coverage.md

diff --git a/docs/failure-modes.md b/docs/failure-modes.md
index 3f54abe..f4c0ef7 100644
--- a/docs/failure-modes.md
+++ b/docs/failure-modes.md
@@ -107,7 +107,7 @@ robust-by-code-but-unverified · S static/grep check · (plat) platform-gated.
 
 | # | Failure mode | Current behavior | Tier | Tested | Recommendation |
 |---|---|---|---|---|---|
-| A1 | Clean high contention (N workers, no crashes) | Serialized; no lost update | 1 | ✓ U:166-195, I:227-261/341-386, integ | **In scope.** Keep. |
+| A1 | Clean high contention (N workers, no crashes) | Serialized; no lost update | 1 | ✓ C:81-111 (canary), I:227-261/341-386, integ | **In scope.** Keep. |
 | A2 | Thundering herd recovering one dead lock | Claim serializes; exactly one steal, zero displacement | 1 | ✓ U:212-346, I:884-1015 | **In scope.** Keep. |
 | A3 | Many concurrent stealers on one ghost | One O_EXCL claim winner | 1 | ✓ U:1095-1128, I:1017-1088 | **In scope.** Keep. |
 | B1 | Holder dies (crash/SIGKILL/power) mid-hold | Lease ages out; stolen after STALE | 1 (recovery) / 2 (latency) | ✓ U:197-210/348-361 | **In scope** (recovery). Latency = Tier 2. |
@@ -146,6 +146,7 @@ robust-by-code-but-unverified · S static/grep check · (plat) platform-gated.
 | K2 | Internal time budgets (poll, MAX_WAIT, read ladder) | Fixed schedules; tunable | 2 | ✓/~ | **In scope** as Tier-2 envelope. See §K. |
 
 U = `tests/git-commit-lock.test.sh`, I = `tests/git-commit-lock.interop.test.sh`,
+C = `tests/git-commit-lock.canary.test.sh` (the concurrency canary),
 integ = `tests/git-commit-lock.integration.test.sh`.
 
 ---
@@ -161,10 +162,12 @@ exactly one creator wins, the rest poll and take turns
 token (read-back verification, `git-commit-lock.sh:1352-1361`) before claiming
 the hold — so even a create that "won" but whose file was concurrently
 clobbered does not produce a false hold.
-*Tier 1.* Tested heavily: unit Test 1 (8 rounds × 25 workers at FULL,
-`U:166-195`), interop Test 1/Test 6 mixed bash+pwsh (`I:227-261`, the strict
-deterministic counter `I:341-386`), and the integration suite's real-commit
-swarm. **Recommend: in scope, keep.** This is the tool's whole reason to exist.
+*Tier 1.* Tested heavily: the concurrency canary — mutual exclusion under many
+concurrent workers, 8 rounds × 25 at FULL (`tests/git-commit-lock.canary.test.sh`
+Test 1, `C:81-111`) — interop Test 1/Test 6 mixed bash+pwsh (`I:227-261`, the
+strict deterministic counter `I:341-386`), and the integration suite's real-commit
+swarm. (Crash-recovery / claim-contention witnesses stay in the unit suite: A2's
+Test 2b, A3's Test 20.) **Recommend: in scope, keep.** This is the tool's whole reason to exist.
 
 **A2 — Thundering herd recovering one dead lock.** After a holder dies, *every*
 waiter judges the same lock stale off the same mtime in the same poll window —
@@ -394,8 +397,9 @@ keep out of scope — but consider making it harder to *fall into* accidentally.
 The current failure mode on a bad FS is *silent* (the tool runs, exclusion may
 just not hold). Options, in increasing cost: (i) leave as-is, documented — the
 default lock lives in `.git`, which is almost always local, so accidental
-network use is rare; (ii) a one-line caveat in `README.md` (currently only in the
-deeper design doc); (iii) an optional best-effort startup probe of the lock dir's
+network use is rare; (ii) a one-line caveat in `README.md` (since done —
+`README.md:60-64`; previously only in the deeper design doc); (iii) an optional
+best-effort startup probe of the lock dir's
 FS type with a stderr warning on a known-network type (cheap on Linux via
 `stat -f`, awkward cross-platform, and inherently incomplete). **My
 recommendation: (ii) now** (surface the boundary in the README, where an operator
@@ -634,8 +638,8 @@ mixed tree degrades prevention to detection (98) and can leave `.dead.*` litter
 current versions don't clean (residual 4, `git-commit-lock.sh:261-265`). *Tier
 3.* Untested (would require shipping an old version into the suite). **Recommend:
 out of scope; keep the "upgrade both implementations together" deployment note**
-— currently in the design doc only (`docs/git-commit-lock.md:251-255`), **not** in
-`README.md`; surface it there too, where operators actually look. Acceptable
+— in the design doc (`docs/git-commit-lock.md:251-255`) and now also surfaced in
+`README.md:101-106`, where operators actually look. Acceptable
 because the degraded mode is still *detected* (98), never silent.
 
 ### J. Logging subsystem failure
diff --git a/docs/guarantees.md b/docs/guarantees.md
index dca88b1..5f839a8 100644
--- a/docs/guarantees.md
+++ b/docs/guarantees.md
@@ -124,7 +124,9 @@ bug.
   `tok.` is a non-lock-shaped residual, never stolen, that needs manual removal
   (`failure-modes.md` §F1 — an accepted residual). *Witness:* the read-back-failure lanes —
   create-path Test 32, steal-path Test 32b (`U:1760-1855`); resource lanes —
-  coverage planned (Bucket 2 / `failure-modes.md` §4.5). *Basis:* §1, §A1, §F.
+  unwritable lock dir Test 48 (F4), ENOSPC Test 50 (F1, Linux+sudo; skip-with-note
+  elsewhere) (`failure-modes.md` §4.5); FD/inode exhaustion (F3) is document-only
+  (no portable injection). *Basis:* §1, §A1, §F.
 
 - **G-S3 — Strict mutual exclusion within the staleness window, with no
   displacement during crash recovery.** Within `AGENT_LOCK_STALE_SECS` no steal
@@ -139,8 +141,10 @@ bug.
   *Condition:* holds complete within the window (E4); a stable clock (E1) — a local
   clock jump preserves *no silent loss* (G-S1) but can break *strict exclusion* by
   making a live lock look stale (a premature, but detected, steal); and matching
-  version (E5). *Witness:* unit Tests 1/2b/20, interop Tests 1/6/16/16b, integration suite
-  (`U:166-195,212-346,1095-1128`; `I:227-261,341-386,884-1088`). *Basis:*
+  version (E5). *Witness:* the concurrency canary (mutual exclusion under many
+  concurrent workers, 8 rounds × 25 at FULL, `C:81-111`), unit Tests 2b/20
+  (claim-recovery and many-stealers), interop Tests 1/6/16/16b, integration suite
+  (`U:212-346,1095-1128`; `I:227-261,341-386,884-1088`). *Basis:*
   §A1/§A2/§A3.
 
 - **G-S4 — Never destroys a non-lock-shaped object.** A directory, symlink, FIFO,
@@ -268,13 +272,17 @@ split, `failure-modes.md` §4.1 / D-c).
   warn loudly once per process and treat the lock as **not** stale (the mtime floor
   fails closed to "fresh"): no premature steal, no corruption — but recovery of a
   genuinely crashed holder is *disabled* and waiters block to `MAX_WAIT` (97).
-  Safety is preserved; recovery is lost and announced. *Coverage planned* (Bucket
-  2 / §4.5). *Basis:* §E3.
+  Safety is preserved; recovery is lost and announced. *Witness:* unit Test 42
+  (shadows the mtime probe to return empty on a present stale ghost; the
+  "Staleness detection is BROKEN" warn-once fires, the ghost is left in place,
+  the waiter blocks to 97). *Basis:* §E3.
 
 - **BE-4 — Logging is best-effort and never blocks the lock.** Every log write
   ends `|| true`; a failed or unwritable log write is swallowed and the lock works
-  unaffected (the log self-truncates past ~1 MB). *Coverage planned* (Bucket 2 /
-  §4.5, the F2/J1 test). *Basis:* §F2/§J1.
+  unaffected (the log self-truncates past ~1 MB). *Witness:* unit Test 49 (points
+  `AGENT_LOCK_LOG` under a regular file so every append fails ENOTDIR; the lock
+  still acquires + releases cleanly with the log write swallowed — also covers
+  J1). *Basis:* §F2/§J1.
 
 - **BE-5 — The PowerShell 5.1 steal is claim-guarded, not atomic.** Windows
   PowerShell 5.1 lacks the 3-arg `File.Move` overload, so its steal is
@@ -304,7 +312,7 @@ of §2 are **not** claimed here.
   the no-displacement prevention (G-S3) degrades to detection (98), and old-style
   stealers can leave `.dead.*` litter. Never silent, but the prevention property is
   not guaranteed. Deployment rule: **upgrade both implementations together**
-  (`git-commit-lock.md:251-256`; to be surfaced in the README too — Bucket 3).
+  (`git-commit-lock.md:251-256`; also surfaced in `README.md:101-106`).
   *Basis:* §I2.
 
 - **OOS-4 — PowerShell port on POSIX.** Supported on Windows only; on POSIX it runs
@@ -330,9 +338,9 @@ of §2 are **not** claimed here.
   or hard-exits the lock-holding shell **and** returns 0 **while displaced**. The
   *next* holder still recovers via staleness; only the abruptly-exiting one is
   unwarned. No code change closes this without the handle-based ops the design
-  rejected. *Witness (boundary exercised indirectly):* interop Test 5 (`I:308-334`,
-  ps1 `[Environment]::Exit()`); the bash `exec` lane is a coverage gap
-  (`steering-coverage.md` A4). *Basis:* §H4.
+  rejected. *Witness:* the §H4 non-unwinding-exit boundary is pinned by interop
+  Test 5 (`I:308-334`, ps1 `[Environment]::Exit()`) and unit Test 40 (bash `exec`
+  in the lock-holding shell, OOS-5). *Basis:* §H4.
 
 - **OOS-6 — Adversarial / hostile local processes.** The lock is advisory. Against
   a process actively trying to break it (deleting/overwriting the lock file, or a
@@ -384,16 +392,18 @@ The design considered and rejected each of these; they are not roadmap items
 
 Each guarantee → its witnessing test(s) and the failure-modes section. `U` =
 `tests/git-commit-lock.test.sh`, `I` = `tests/git-commit-lock.interop.test.sh`,
-`integ` = `tests/git-commit-lock.integration.test.sh`. "Coverage planned" marks a
-guarantee that is currently reasoned-correct-but-untested and slated for a
-fault-injection test under Bucket 2 (`failure-modes.md` §4.5, Ben's override to
-add coverage); the *guarantee* is made now, the *test* lands in Phase 3.
+`C` = `tests/git-commit-lock.canary.test.sh` (the concurrency canary), `integ` =
+`tests/git-commit-lock.integration.test.sh`. The former resource-exhaustion and
+diagnostic-clock coverage gaps are now closed by the fault-injection tests
+(`failure-modes.md` §4.5): F4 (Test 48), F2/J1 (Test 49), F1 (Test 50), and the
+unreadable-mtime fail-safe (Test 42). The one remaining document-only lane is F3
+(FD/inode exhaustion), which has no portable deterministic injection.
 
 | Guarantee | Witness | failure-modes § |
 |---|---|---|
 | G-S1 no silent lost update | U Test 4b + Test 16 (unverifiable lane); I Test 8 (both dirs) | §1, §B5 |
-| G-S2 no corruption / no false hold | U Tests 32/32b (read-back failure); **resource lanes: coverage planned** (F1/F3/F4) | §1, §A1, §F |
-| G-S3 strict exclusion in window + no displacement | U Tests 1/2b/20; I Tests 1/6/16/16b; integ | §A1/§A2/§A3 |
+| G-S2 no corruption / no false hold | U Tests 32/32b (read-back failure); **resource lanes: Test 48 (F4), Test 50 (F1); F3 document-only** | §1, §A1, §F |
+| G-S3 strict exclusion in window + no displacement | C Test 1 (8×25 canary); U Tests 2b/20; I Tests 1/6/16/16b; integ | §A1/§A2/§A3 |
 | G-S4 never destroys non-lock-shaped | U Tests 17/17d/18/22 | §D3/§D4/§G1 |
 | G-S5 truthful exit codes | U Tests 7/8/4b/5/16; I run-verdict tests | §1, §H4 |
 | G-R1 lock-shaped orphans reclaimed | U Tests 2/3/21 | §B1/§C1/§C2/§C3 |
@@ -401,8 +411,9 @@ add coverage); the *guarantee* is made now, the *test* lands in Phase 3.
 | G-R3 no busy-spin; bounded wait | I Test 14b | §K(4) |
 | G-R4 no unowned lock left behind | U Tests 31/35/36 | §C4 |
 | G-I1 bash⇄pwsh same lock | I suite throughout | §I1 |
-| BE-3 unreadable mtime fails safe | **coverage planned** (E3) | §E3 |
-| BE-4 logging best-effort | **coverage planned** (F2/J1) | §F2/§J1 |
+| BE-3 unreadable mtime fails safe | U Test 42 | §E3 |
+| BE-4 logging best-effort | U Test 49 (F2/J1) | §F2/§J1 |
 
-The "coverage planned" rows are exactly the lanes Phase 1c (the steering-coverage
-audit) and Bucket 2 (the new fault-injection tests) exist to close.
+The resource-exhaustion and diagnostic-clock lanes (Tests 42/48/49/50) are the
+fault-injection coverage added per `failure-modes.md` §4 item 5; F3 (FD/inode
+exhaustion) stays document-only for want of a portable deterministic injection.
diff --git a/docs/load-testing-strategy.md b/docs/load-testing-strategy.md
index e26d68c..459a04c 100644
--- a/docs/load-testing-strategy.md
+++ b/docs/load-testing-strategy.md
@@ -1,346 +1,236 @@
-# Load & matrix testing strategy — recommendation
+# git-commit-lock: CI & load-testing strategy
 
-**Status: RECOMMENDATION for Ben's decision — not an implementation.** Produced by a
-considered, first-principles process (three parallel research agents — load fidelity, CI
-matrix, test parametrization — synthesized and cross-checked against the code), deliberately
-**not anchored** on the current `tests/with-load.sh` approach (which was thrown together from a
-few lines of discussion). It answers: are we injecting load the right way / of the right
-kinds; how to use the free public GitHub runners for a load×config matrix; and how to get more
-from the existing tests routinely — while staying **considered, not maximalist**.
-
-Grounded in `docs/failure-modes.md` (esp. §K and the correctness-vs-liveness split) and the
-product/test code. Where it cites a fact about GitHub Actions limits, treat the number as
-"current as of writing, confirm against GitHub docs before relying on it."
-
----
-
-## 0. Headline recommendations (skim)
-
-1. **Reframe load's job.** Correctness here is *load-independent* (O_EXCL + atomic rename +
-   per-attempt tokens never consult the clock for a correctness decision). So load can't break
-   exclusion or cause a silent lost update. Load has exactly two jobs: **(J1)** perturb
-   scheduling so the protocol's multi-syscall sequences get preempted at adversarial points
-   (race-surfacing), and **(J2)** broaden configs to exercise different code paths. Load
-   *magnitude* past ~2× CPU oversubscription mostly manufactures *harness wall-clock flakes*,
-   not bugs.
-2. **The biggest race-coverage lever is NOT external load — it's deterministic steering.** The
-   genuinely dangerous windows are reachable *deterministically* only by the in-process
-   function-interposition the suite already uses. Invest there first; external load is a
-   secondary, probabilistic complement for the few windows it can actually move.
-3. **Three-tier CI:** a **Required** per-PR gate with **no artificial load** (so a red gate is
-   never a stress-manufactured wall-clock flake — it's actionable); a **Nightly** non-blocking
-   tier that adds calibrated
-   load × kind and the parametrization sweeps, with wall-clock assertions relaxed to warnings;
-   and an on-demand **Deep sweep** (the current stress design) for the 50-clean hunt.
-4. **Fix the injection: calibrate, target, record.** Express load as an *oversubscription
-   ratio* relative to core count (not an absolute hog count); prefer calibrated mechanisms
-   (`stress-ng`, Linux cgroup `cpu.max`/`io.max`) over free-running spinners; write a per-run
-   load-manifest artifact so a flake is reproducible.
-5. **Embrace platform asymmetry** instead of a uniform injection layer: steering everywhere
-   (portable); calibrated latency on the Linux leg only; plain CPU oversubscription as the
-   macOS/Windows fallback — and record per-leg which regime actually ran.
-6. **Get more from existing tests** via a *bounded* parametrization of a named handful (waiter
-   count, fail-open ratio, poll cadence) — with strict correctness assertions kept
-   config-independent and wall-clock assertions moved to the envelope tier.
-
----
-
-## 1. What load testing is FOR here (the reframe that drives everything)
-
-This is **not** a throughput-bound system whose correctness degrades under load. Per
-`failure-modes.md` §1/§K, safety/exclusion rest on structural primitives (atomic
-create/rename, per-attempt-token discovery) that never reference the clock for a *correctness*
-decision. No amount of CPU/IO pressure makes `rename(2)` non-atomic or lets two O_EXCL creates
-both win on a local FS.
-
-So load's honest purpose is narrow: **make the protocol's multi-syscall sequences (which are
-not individually atomic) get preempted at adversarial points, so the inter-process
-interleavings the code claims to handle are actually exercised** — plus widen the few
-genuinely timing-derived decisions (mtime staleness, the FILETIME-zero floor, empty-read
-retries). The right metric for a load regime is *"does it raise the probability that process A
-is suspended between syscall N and N+1 while process B advances?"* — **not** *"does it consume
-the box?"*
-
-**Direct consequence (the most important single point):** beyond ~2× CPU oversubscription,
-more load does not find new correctness bugs — it only stretches wall-clock latency and starts
-blowing the suite's *Tier-2* wall-clock assertions (Test 21's ≤20s recovery, Test 22a's
-warning timing, Test 29's poll-count), which `failure-modes.md` §K already identifies as
-Tier-1-bound-on-a-Tier-2-quantity. The fix for those is to **scope the bound**, not pile on
-load. This is why the strategy below puts load in non-blocking tiers and keeps the gate clean.
-
----
-
-## 2. The biggest lever is deterministic steering, not load
-
-The protocol's scary windows — and whether *external load* can even reach them:
-
-| Window | Code | Reachable by external load? |
-|---|---|---|
-| create → read-back verify | `git-commit-lock.sh:1336-1357` | Only probabilistically (1 command-sub wide); deterministically via steering |
-| **claim recheck → touch → re-verify → rename** (residual 1/2 — THE delicate path) | `:1092-1168` | Probabilistically via CPU preemption; deterministically only via steering |
-| rename-over → read-back (steal install) | `:1168-1179` | Same — steering for determinism |
-| **mtime staleness / fail-open boundary (B5)** | `:1408-1410`, `:928` | **Yes** — CPU/IO load stretches cadence and can push a contended holder past STALE → exercises the 98-detect lane. The most realistic "load surfaces a real lane" case. |
-| two-poll wrong-type confirmation (ghosts) | `:1518-1567` | **Yes, but mostly the bad way** — oversubscription *starves* the poll headroom → manufactures the Test 22a-style flake rather than finding a bug |
-| FILETIME-zero floor (Windows) | `:925`, `:1408` | **No** — a *create-churn* artifact, not load-driven |
-| empty-read retry ladder (AV/create→write) | `:668-684` | Realistic trigger is Windows AV/filter-drivers, not synthetic load |
-
-**Takeaway:** the windows where a *wrong interleaving could actually corrupt state*
-(create→readback, claim→rename, rename→readback, release boundary) are reached *deterministically*
-only by the in-process function-interposition steering the suite already does (`clone_fn`,
-`tests/git-commit-lock.test.sh:127-136`). External load merely raises the background
-probability of hitting an interleaving nobody scripted. **So the primary race-coverage
-investment is MORE STEERED SCENARIOS** (portable, deterministic, attributable) — e.g. steered
-cases that park the claimant between recheck and rename, and between touch and rename, firing a
-clearer + rival. External load is a *secondary, probabilistic* complement, valuable mainly for
-the staleness/fail-open boundary (B5) it can genuinely move.
-
-A corollary for triage: because external load *cannot* break correctness, a load run that
-produces a *correctness* failure is surfacing either (a) a real logic bug in a steering-only
-window (high value) or (b) a *test-harness* setup race (`sync_waiting_fresh`/`backdate_ghost`
-losing its race under load) — a harness fix, not a code fix. Prefer deterministic mechanisms so
-an observed failure is *attributable*.
+This is the rationale for *why the CI is shaped the way it is* — the principles
+behind the three workflows (`tests.yml`, `nightly.yml`, `deep-sweep.yml`), the load
+wrapper (`tests/with-load.sh`), and the two test-level levers (the Axis-A sweep and
+the envelope tier). It describes the system as it stands; for the correctness
+guarantees the suites assert against, see `docs/guarantees.md` and
+`docs/failure-modes.md`.
 
 ---
 
-## 3. Fix the load injection: calibrate, target, record
-
-**Critique of the current `tests/with-load.sh`** (N bare CPU spinners + N `dd … conv=fsync`
-create/write/delete loops): it is a *reasonable background-jitter generator* and adequate for
-"run the whole suite under generic pressure," but from first principles it is:
-- **Uncalibrated / non-reproducible:** `LOAD=N` spinners produce wildly different real
-  preemption pressure on a 2-core vs 4-core runner, so "we tested at load N" doesn't mean a
-  fixed thing — violating the reproducible-experiments requirement.
-- **Untargeted:** a box-wide hog perturbs *everyone uniformly* (including the rival you wanted
-  to advance), so it adds jitter but doesn't *bias* the interleaving toward the adversarial
-  order. The high-value windows need a *scalpel* (slow one syscall in one process), which it
-  can't do.
-- **Blind to two windows:** it can't widen the create→write gap (the lock create is one
-  redirect, no fsync to delay) and can't *produce* the Windows delete-pending ghost (it churns
-  unrelated files); its main effect on those is the *poll-starvation false-flake* direction.
-- **Self-defeating at high N:** on a 2-core runner it pushes wall-clock far enough to blow the
-  harness's own timeouts (the workflow already had to raise every step timeout 2–3×) — load
-  manufacturing churn, not findings.
-
-**Recommendations:**
-- **Express load as an oversubscription ratio `R = stressors / nproc`** (e.g. R ∈ {0, 1, 2}),
-  not an absolute hog count, so a level is runner-independent. Note `R` is **per kind**: the
-  current wrapper's `GCL_STRESS_LOAD=N` spawns N hogs per selected kind, so `both` doubles total
-  hogs — define and cap `R_total`, and record cpu- and disk-stressor counts separately.
-- **Prefer calibrated mechanisms:** `stress-ng --cpu $((R*nproc)) --cpu-load … --metrics`
-  (defined, measurable) over bare spinners. On **Linux**, calibrated **CPU** throttling is the
-  cleanest *envelope-validation* tool — `sudo systemd-run --scope -p CPUQuota=10%` gives a
-  runner-independent quota (a 10% quota means the same everywhere; "8 hogs" does not). **Treat
-  this as a probe-required Linux-only option, not a turnkey fact:** it needs cgroup v2 +
-  controller delegation + a usable systemd manager on the GitHub `ubuntu-24.04` runner, so gate
-  it behind a CI capability probe with the `stress-ng`/ratio path as the fallback. **IO** cgroup
-  throttling is *experimental* here — it is not a simple `systemd-run -p io.max`; systemd
-  exposes it as `IOReadBandwidthMax=`/`IOWriteBandwidthMax=` with device/path caveats — so don't
-  rely on it until proven on the runner.
-- **Record a per-run `load-manifest`** artifact next to the suite logs: `{kind, R, nproc,
-  achieved-slowdown, tool versions, runner os/arch, git sha}`, uploaded on *success too* (you
-  need the negatives to interpret the positives). Optionally probe achieved slowdown with a
-  fixed micro-benchmark before/during load.
-- **Cap routine load at ~2× oversubscription;** higher R only on the deep-sweep flake-hunt leg
-  (whose *correctness* assertions stay strict but *wall-clock* assertions are relaxed).
-
----
-
-## 4. Embrace platform asymmetry (don't build a uniform injection layer)
-
-The platforms diverge too much for a "uniform" *calibrated/targeted* load layer (cgroup
-throttling and FUSE fault-injection filesystems are Linux-only for this CI plan; `strace`
-inject is Linux-only; `DYLD_INSERT_LIBRARIES` injection is unreliable on macOS for
-SIP-protected Apple/system binaries like `mv`/`git` — possible only for non-protected helper
-binaries). Don't fight it — structure around it and **record which regime ran per leg**:
-
-- **Deterministic steering** — *everywhere* (portable bash; pwsh equivalent). The real
-  race-coverage tool.
-- **Calibrated / targeted latency** (cgroup CPU quota; optionally `strace -e inject` to slow one
-  syscall in one process; a FUSE fsync-delay shim — charybdefs-style — only if window W7 is
-  prioritized) — **Linux leg only** (probe-gated, per §3).
-- **Uncalibrated oversubscription — the macOS/Windows fallback.** Both **CPU** (`stress-ng` or
-  the bash-spinner fallback) **and the simple disk-churn hog** (the current
-  `dd`/create-write-fsync-delete wrapper) run cross-platform; they are *low-fidelity and
-  uncalibrated* but real metadata-op pressure, which is why the Tier-N macOS/Windows `disk`
-  cells (§5) use them. Document the asymmetry: calibrated latency only on Linux; everywhere else
-  it's blunt oversubscription.
-
-Low-yield, **avoid:** memory/swap pressure (trivial allocation surface; risks OOM-killing the
-harness), raw disk-bandwidth saturation (doesn't touch metadata-op latency), de-prioritizing
-the background hogs. `ulimit`/inode/FD exhaustion belong to the *fault-injection tests* (the
-§4.5 work), not the timing-load regime.
-
----
-
-## 5. The three-tier CI structure (the matrix)
-
-The organizing recommendation. It maps directly onto the already-decided correctness/envelope
-test split (D-c).
-
-### Tier R — Required / per-PR (blocking) — KEEP the existing 4 cells, STRIP the load
-| Cell | OS | Engines | Buys |
+## 1. The principle: correctness is load-independent
+
+This is not a throughput-bound system whose correctness degrades under load. Safety
+and exclusion rest on structural primitives — `O_EXCL` create, atomic `rename(2)`,
+per-attempt token discovery — that never consult the clock for a *correctness*
+decision (`guarantees.md` §E, BE-1; `failure-modes.md` §K). No amount of CPU or IO
+pressure makes a rename non-atomic or lets two `O_EXCL` creates both win on a local
+filesystem.
+
+So load does not *change what is correct* — it only *surfaces races*. Its sole job
+is to widen the timing windows in the protocol's multi-syscall sequences (which are
+not individually atomic) so that the inter-process interleavings the code claims to
+handle are actually exercised. The right question to ask of a load regime is "does
+this raise the probability that process A is suspended between syscall N and N+1
+while process B advances?" — not "does it consume the box?". Past roughly 2× CPU
+oversubscription, more load finds no new correctness bugs; it only stretches
+wall-clock latency and starts tripping the suite's best-effort timing assertions.
+
+Two consequences shape the whole design:
+
+- **The per-PR gate runs no load** (strict, fast). A red required check is then
+  always actionable — a real correctness bug or genuine infra drift, never a
+  stress-manufactured wall-clock flake.
+- **Load lives in non-blocking tiers** (nightly, deep-sweep), where the
+  load-sensitive timing assertions are relaxed to warnings so an oversubscribed
+  runner cannot turn a latency stretch into a red.
+
+## 2. Deterministic steering is the primary race-coverage lever
+
+The protocol's genuinely dangerous windows — create → read-back verify; the claim
+recheck → touch → re-verify → rename residual; rename-over → read-back on a steal;
+the release boundary — are ones where a *wrong interleaving could actually corrupt
+state*. External load can only reach those windows *probabilistically*: it raises
+the background chance of hitting an interleaving nobody scripted.
+
+The suite reaches them *deterministically* instead, by in-process function
+interposition. `clone_fn` (`tests/_harness.sh`) clones a library internal (or
+shadows a command like `mv`/`rm`/`touch` with a shell function) so a steering test
+can land "the rival's rename" at an exact protocol position, then call the original
+through the clone (the Test 23–36 steered scenarios in
+`tests/git-commit-lock.test.sh`). This hits the exact protocol window every run,
+attributably — which is why it, not external load, is the primary race-coverage
+investment.
+
+External load is the secondary, broad-net lever. It earns its place mainly on the
+one window it can genuinely move: the mtime-staleness / fail-open boundary, where
+CPU/IO pressure stretches a contended holder past the STALE threshold and exercises
+the detected-98 lane. A corollary for triage: because external load *cannot* break
+correctness, a load run that produces a *correctness* failure is surfacing either a
+real logic bug in a steering-reachable window (high value) or a test-harness setup
+race (a harness fix, not a code fix).
+
+## 3. The three tiers
+
+### Tier R — required, per-PR (`tests.yml`)
+
+The blocking gate. It runs every suite (unit, interop, integration, and the
+full-width concurrency canary as its own parallel cell) at full fan-out
+(`GCL_TEST_FULL=1`) with **no load** and the **strict** envelope tier (the default —
+the workflow sets no `GCL_ENVELOPE_TIER`, so every timing assertion is hard). The
+matrix is:
+
+| Cell | OS | Engines / leg | Buys |
 |---|---|---|---|
-| R1 | ubuntu | bash + pwsh7 (all suites) | Linux correctness + interop baseline |
-| R2 | macos | bash + pwsh7 (all suites) | BSD `stat`/`mv` lanes (D1/E3) — *only* place these run |
-| R3 | windows (unit leg) | bash (MINGW) | delete-pending ghosts, FILETIME floor |
-| R4 | windows (interop+integration leg) | bash + pwsh7 + **PowerShell 5.1** | the 5.1 non-atomic-fallback path (D1) + real NTFS commit swarm |
-
-This is exactly today's matrix **minus the stress env**. Running it at **`none` load** means it
-only ever asserts Tier-1 correctness — it *cannot* flake on a Tier-2 wall-clock bound, so **a
-red required check is never stress-manufactured envelope noise.** It's always actionable — a
-real bug, or at worst runner-image/action-download/infra drift (which is also worth knowing) —
-never a "load was too high" false alarm. Target < ~8 min. (Also: flip the concurrency group
-back to `${{ github.workflow }}-${{ github.ref }}` + `cancel-in-progress: true` — the current
-per-run-unique group is a *deep-sweep* setting, which is exactly why the stress branch is marked
-"do NOT merge to main.")
-
-### Tier N — Nightly / scheduled (non-blocking, triaged)
-~6 cells adding load **kind** (cpu / disk / both) at **one** oversubscribed level (R≈2), plus
-the §6 parametrization sweeps. Run with **`GCL_ENVELOPE_TIER=relax`** so the three known
-load-sensitive assertions (Test 21 ≤20s, Test 22a warning, Test 29 poll-count) **downgrade to
-warnings** while correctness assertions stay hard. Example cells: ubuntu×{disk, both, cpu},
-macos×disk, windows×{disk on the interop+5.1 leg — highest-value, both on the unit leg}.
-Auto-file a triaged issue on failure tagged `correctness` (investigate) vs `envelope-flake`
-(expected). macOS gets one harsh cell only (it's the scarce/slow runner); ubuntu absorbs the
-extra kinds (cheapest).
-
-### Tier D — On-demand deep sweep (`workflow_dispatch`, never gates)
-The current stress-branch design *is* this tier — keep its `stress_kind`/`stress_load` inputs
-and per-run-unique concurrency (many parallel dispatches), add `repeat` (run a cell K times)
-and `width` inputs. This is the "50-clean under both/8-hog" hunt: informational, time-boxed by
-choice, never a contract.
-
-**Why this is the linchpin:** keeping artificial load *off the required gate* is what makes the
-gate trustworthy; putting all load in non-blocking tiers with the envelope assertions relaxed is
-what stops load from manufacturing flakes that erode trust. The split needs a small product/test
-change: a `GCL_ENVELOPE_TIER=relax` env that downgrades the wall-clock assertions — nightly/deep
-set it, required never does.
-
----
-
-## 6. Get more from existing tests: bounded parametrization
-
-Today there are only two coarse knobs: `GCL_TEST_FULL` (global fan-out) and per-case
-hard-coded `AGENT_LOCK_*` values (never swept). Add **one** mechanism — a per-axis sweep over a
-**named handful** of tests (sum the axes, do **not** cross-product):
-
-- **Axis A — waiter/stealer count (highest value):** T2b (frozen at 4), T20, interop T16. Sweep
-  N ∈ {4, 12, 24}. Widens the thundering-herd/claim-serialization and displacement windows that
-  re-running N=4 never will.
-- **Axis B — fail-open ratio (hold ÷ STALE):** a parametrized T4b/T1 variant running hold ≪
-  STALE / hold ≈ STALE / hold > STALE, asserting the *correct verdict per regime* (clean → 0
-  steals; over → exactly one steal + a 98).
-- **Axis C — poll cadence:** {fast 0.05, **default 2s**}. The shipped 2s default is currently
-  never exercised under contention.
-- **Axis D — CLAIM_STALE depth (lower value):** {2, 60} on T21.
-
-**Do not sweep:** round count (keep as the nightly *soak* dial, not a coverage axis), MAX_WAIT
-(timeout-only), the deterministic steered protocol tests (T23–T36 — re-running reruns the same
-steered path), or the integration suite's worker count beyond FULL/REDUCED (it's strict in both
-modes by design and wall-clock-bound by serialized commits).
-
-**Flakiness discipline (critical):** keep correctness assertions **config-independent** — when
-sweeping N, hold STALE ≫ hold so "zero-98 / one-steal" stays a pure correctness statement, and
-**scale MAX_WAIT with N** (more waiters = more serialized turns) so a large-N run doesn't time
-out and *look* like a product failure. Move wall-clock/poll-count assertions to the envelope
-tier. Keep the existing `sync_waiting_fresh`/`backdate_ghost` scaffolding — at higher N it
-matters more.
-
-**Cadence:** per-PR runs the floor point of each axis (today's behavior, deterministic);
-nightly runs the sweeps under a `GCL_TEST_SWEEP=1` gate. The sweep (per-suite fan-out/knobs) is
-*orthogonal* to the OS/leg matrix — compose additively (per-PR = matrix × floor; nightly =
-matrix × sweep), never multiply everything on every PR.
-
----
-
-## 7. GitHub Actions realities (the real constraints — confirm against current docs)
-
-- **Minutes are free on public repos; concurrency is the real ceiling.** Free-plan accounts cap
-  concurrent jobs at **20 total, with a 5-job macOS sub-limit** (confirm against GitHub's
-  current limits page). A matrix past that **queues** (serialises into waves), it doesn't fail.
-  Design any single triggered workflow to ≤ ~15–20 jobs to run in one wave; the deep sweep
-  intentionally exceeds this and accepts waves.
-- **Cost-weight is separate from queue scarcity (don't conflate).** On a public repo standard
-  runners are *free* — the per-minute rates don't consume credits or set queue priority. They do
-  signal relative runner *cost/scarcity*: roughly Linux 1×, **Windows ~1.7×** ($0.010 vs
-  $0.006/min), **macOS ~10×** ($0.062/min). The real constraint on macOS is the **5-job
-  sub-limit** above, plus it being the slowest pool. → keep macOS cells **sparse**, ubuntu
-  liberal.
-- **`strategy.matrix`:** `fail-fast: false` (keep — an OS-specific failure is the signal).
-  **`max-parallel` only limits parallelism *within a single matrix run*** — it does **not**
-  reserve capacity across separate workflow runs or the deep sweep's many `workflow_dispatch`
-  invocations. To stop a sweep starving the required gate, **bound the deep/nightly tiers with a
-  workflow-level `concurrency` group (and cap the dispatcher width)**, not `max-parallel` alone.
-  256-job hard cap per workflow run (irrelevant at our scale).
-- **Triggers:** required on `pull_request` + `push: main`; nightly on `schedule` (cron,
-  off-peak minute) + `workflow_dispatch`; deep on `workflow_dispatch` only — heavy load never
-  sits in a PR's critical path. (Note: `schedule` triggers are auto-disabled after ~60 days of
-  repo inactivity.)
-- **`paths-ignore` gotcha on a *required* check.** A workflow skipped by path filtering leaves
-  its checks **Pending**, which *blocks merge* if those checks are required. So **don't** put
-  `paths-ignore` on the workflow whose jobs are the required checks and expect doc-only PRs to
-  merge. Instead either (a) keep the required workflow always-running with a tiny always-green
-  job and path-filter only the expensive test jobs, or (b) make a separate cheap job the
-  required check. (Doc-only-skip is still worth doing — just not on the required-check workflow
-  naively.)
-- **Artifacts:** keep the existing `upload-artifact` (with `include-hidden-files` for the
-  `.git/`-buried lock logs); name uniquely per (os, leg, kind, level) so parallel cells don't
-  collide.
-
----
-
-## 8. Considered, not maximalist — the decision rule
-
-> **A cell enters the routine matrix (R or N) only if it can surface a bug class no other
-> routine cell can. Otherwise it's a deep-sweep cell, or it doesn't exist.**
-
-- Cap the routine matrix: **R ≤ 4, N ≤ ~8.** New routine cells must *displace* one, forcing the
-  "does this find something the others can't?" question.
-- **Earn the slot:** a config/cell graduates deep → nightly only after the deep sweep actually
-  caught a distinct failure there (mirrors the project's own "tested edge cases earn confidence"
-  philosophy). Demote a cell that's been green for ~60 days and whose window is a subset of
-  another green cell's.
-- Prefer *one* oversubscribed level over a level sweep; prefer *attributable* single-kind cells
-  over `both`-only when you want to localise a flake.
-- **Trustworthiness invariant:** required = always-meaningful-red; nightly = triaged-amber-
-  tolerant; deep = noise-by-design. Don't retry-mask the required tier (a retry that hides a
-  1-in-20 real race is exactly the silent-loss class this tool exists to prevent).
-
----
-
-## 9. Open decisions for Ben (what to pick before Phase 2 plans the build)
-
-1. **Nightly aggressiveness:** ~6 cells, cron daily vs weekly? (rec: ~6 cells, daily off-peak;
-   start smaller and grow by the earn-the-slot rule.)
-2. **Linux load mechanism:** adopt calibrated cgroup `cpu.max`/`io.max` throttling on the Linux
-   leg (reproducible, the right envelope-validation tool) vs keep the simple wrapper but
-   calibrate it by oversubscription ratio? (rec: cgroup on Linux for the envelope leg; keep a
-   ratio-calibrated `stress-ng`/spinner as the cross-platform race-jitter lane.)
-3. **`stress-ng` dependency:** add an install step (apt/brew) vs keep a pure bash spinner
-   (zero-dep, uncalibrated)? (rec: `stress-ng` where available + spinner fallback on Windows.)
-4. **Parametrization scope now:** Axis A (waiter count) only, or A+B+C? (rec: A first — highest
-   value, lowest flake risk — then B, then C.)
-5. **The envelope-tier switch** (`GCL_ENVELOPE_TIER=relax`): confirm this is how we implement the
-   D-c correctness/envelope split (a small test-harness change downgrading the 3 wall-clock
-   assertions to warnings under load). (rec: yes — it's the cleanest implementation of D-c.)
-6. **Nightly triage channel:** auto-file/track issues on nightly failure, tagged correctness vs
-   envelope? (rec: yes — otherwise scheduled-run reds are invisible.)
-
-These choices feed **Phase 2** (the implementation plan). This doc is a recommendation only —
-no code, no workflow changes, until you've decided.
-
----
-
-## Appendix — provenance
-Synthesized from three parallel first-principles research passes (load fidelity & injection
-mechanisms; CI matrix on free public runners; existing-test parametrization), each grounded in
-`git-commit-lock.sh`/`.ps1`, the three suites, `tests/with-load.sh`, `.github/workflows/tests.yml`,
-and `docs/failure-modes.md`, and cross-checked against the code (one agent's claim that
-`tests/with-load.sh` was absent was verified false — it exists and is tracked). A foreign-model
-(Codex, web-grounded) review has been applied: it confirmed the §2 window→load reachability
-table against the code and the core GitHub-Actions facts (20-total / 5-macOS free-plan
-concurrency, 256-job matrix cap, 60-day schedule auto-disable, `cancel-in-progress`, `stress-ng`
-availability), and its corrections are folded in — the cgroup mechanism is now marked
-**probe-required** (CPU quota only; IO throttling experimental), the `max-parallel` and
-`paths-ignore`-on-required caveats added, billing-weight separated from queue-scarcity, and the
-FUSE/SIP claims hedged.
+| ubuntu-24.04 `all` + `canary` | Linux | bash + pwsh7 | Linux correctness + interop baseline |
+| macos-15 `all` + `canary` | macOS | bash + pwsh7 | BSD `stat`/`mv` lanes |
+| windows-2025 `unit` | Windows | bash (MINGW) | delete-pending ghosts, FILETIME floor |
+| windows-2025 `interop-integration` | Windows | bash + pwsh7 + **PowerShell 5.1** | the 5.1 non-atomic-fallback path + real NTFS commit swarm |
+| windows-2025 `canary` | Windows | bash (MINGW) | full-width concurrency under process-spawn overhead |
+
+The canary runs as a separate parallel cell on every OS because it is about half
+the Windows unit wall-clock; suites must *not* run concurrently inside one runner
+(they are timing-sensitive on 2-core runners). Triggers: `pull_request` and
+`push: main` (both `paths-ignore` docs/`.plans`/license), a weekly `schedule` to
+catch runner-image and tool drift, and `workflow_dispatch`. The concurrency group is
+`${{ github.workflow }}-${{ github.ref }}` with `cancel-in-progress: true`, so rapid
+pushes coalesce. A separate `lint` job gates shellcheck (pinned v0.11.0, `-S style`)
+and PSScriptAnalyzer (warning severity).
+
+### Tier N — nightly, scheduled (`nightly.yml`)
+
+A non-blocking scheduled stress run (08:23 UTC daily, plus `workflow_dispatch`).
+This project has **no branch protection** (single-dev, decision 2026-06-18), so
+nightly never gates a PR; its job is to catch the load-sensitive flakes and coverage
+regressions the no-load per-PR gate cannot.
+
+Six `stress` cells run the suites wrapped in `tests/with-load.sh` at one
+oversubscription level (`GCL_STRESS_RATIO=2`, R≈2), one `GCL_STRESS_KIND` each:
+ubuntu×{cpu, disk, both}, macos×disk, windows interop-integration×disk, windows
+unit×both. macOS gets a single cell (it is the scarce, slow pool); ubuntu absorbs
+the extra kinds (cheapest). The whole workflow runs with two test-level levers
+turned on (§4): `GCL_ENVELOPE_TIER=relax` (the three load-sensitive timing
+assertions warn instead of failing; correctness assertions stay hard) and
+`GCL_TEST_SWEEP=1` (the Axis-A waiter-count sweep). Each cell writes its own
+`cell-conclusion.txt` (ground truth, captured under `always()`) and uploads its logs
+plus the load-manifest on success too — the negatives are needed to read the
+positives.
+
+A separate `kcov` job runs the unit + canary suites under kcov v43 (built from
+source) on Linux, **no load, strict envelope, full fan-out**, and gates line
+coverage of `git-commit-lock.sh` at a 0.80 floor (tracks ~0.83 achieved; ratchets up
+as tests land). It explicitly overrides the workflow-level `relax` back to `strict`
+so coverage is measured on a clean run.
+
+A `triage` job (`always()`) downloads every cell's artifact and classifies each into
+one labelled issue per (date, class): `nightly-correctness` (a correctness assertion
+failed — investigate), `nightly-envelope` (a relaxed timing miss — expected,
+tracked), or `nightly-infra` (missing artifact / timeout / errored — not a product
+failure). An empty-round guard prevents "0 FAIL across 0 logs" being misread as
+green when an artifact set is entirely missing.
+
+### Tier D — on-demand deep sweep (`deep-sweep.yml`)
+
+`workflow_dispatch`-only; it never runs on push/PR and never gates anything. This is
+the deep flake-hunting instrument — the "50-clean hunt". A dispatch picks a
+`stress_kind`, an optional raw `stress_load` override, a `repeat` count, and an
+`envelope_tier` (defaults to `relax`). Each suite is run `repeat` times under load in a
+fail-fast loop that names the failing iteration. The concurrency group is per-run
+(`deep-${{ github.run_id }}`) so many parallel dispatches fan out freely and accept
+queue waves rather than cancelling each other. Timeouts are deliberately generous
+(deep + loaded + repeated is far slower than the gate).
+
+## 4. The two test-level levers
+
+These let the existing tests yield more under load without touching the per-PR
+gate's behaviour.
+
+**The Axis-A waiter-count sweep** (`GCL_TEST_SWEEP`, `T_AXIS_A` in
+`tests/_harness.sh`). By default `T_AXIS_A="4"`, so per-PR and plain dev runs are
+byte-identical to the historical behaviour. Under `GCL_TEST_SWEEP=1` (nightly and
+deep only) it becomes `"4 12 24"`, and the fan-out/contention tests iterate over it —
+unit Test 2b, unit Test 20 (which composes its own list from its mode-driven floor
+plus the sweep's higher counts), and interop Test 16 — each naming N in every
+assertion message so a sweep failure says which N broke. This widens the
+thundering-herd / claim-serialization and displacement windows that re-running N=4
+never will. Correctness assertions are kept config-independent (e.g. STALE ≫ hold so
+"zero-98 / one-steal" stays a pure correctness statement) and MAX_WAIT scales with N,
+so a large-N run doesn't time out and *look* like a product failure.
+
+**The envelope tier** (`GCL_ENVELOPE_TIER`, default `strict`, in
+`tests/git-commit-lock.test.sh`). A wall-clock or poll-count bound is a best-effort
+liveness property (`guarantees.md` BE-1), not a correctness one. The `ok_envelope` /
+`bad_envelope` assertion helpers behave exactly like the hard `ok`/`bad` under
+`strict`; under `relax` a `bad_envelope` becomes a `WARN` that does not increment
+FAIL. Three assertions are tiered this way — recovery latency ≤20s (Test 21), the
+claim-path config warning firing (Test 22a), and the failed-steal's claim being
+re-created rather than left to age out (Test 29). Nightly and deep set `relax`;
+per-PR and the kcov job never do. So an oversubscribed runner can stretch wall-clock
+to a warning without reddening correctness, while correctness assertions stay hard in
+both tiers.
+
+## 5. How load is calibrated (`tests/with-load.sh`)
+
+The wrapper runs a command under a calibrated, reproducible background load, then
+tears it down by *exact spawned PIDs* (never by name — safe on a shared box and on an
+ephemeral runner) and propagates the wrapped command's exit code.
+
+- **Load is an oversubscription ratio**, not an absolute hog count:
+  `GCL_STRESS_RATIO` (R, default 1) gives stressors-per-kind = `round(R × nproc)`,
+  floored at 1 for a selected kind. "R=2" means the same pressure on a 2-core and a
+  32-core runner, where a raw hog count would not.
+- **The total ratio is capped** by `GCL_STRESS_RATIO_MAX` (default 2). `both` runs
+  cpu + disk, so its total would be 2R; the cap scales each kind down proportionally
+  so the runner is never wedged. The deep-sweep flake hunt can raise it deliberately.
+- **`GCL_STRESS_KIND`** selects `none` (clean pass-through, zero added load),
+  `cpu`, `disk`, or `both`. **`GCL_STRESS_LOAD`** is a back-compat raw per-kind
+  count override (kept so the deep-sweep `stress_load` input keeps working); empty
+  ⇒ use the ratio.
+- **CPU stressor:** `stress-ng --cpu` when available (calibrated, measurable), else a
+  portable bash spin loop. **Disk stressor:** a tight create / write+fsync / delete
+  loop over a small file on the test scratch volume — real metadata + write-back
+  pressure that contends with the lock-file create/delete the suite itself does
+  (always the portable shell hog; cross-platform, low-fidelity but real).
+- **A per-run `load-manifest` JSON** is written next to the suite logs (on success
+  too): `{kind, R, ratio_max, raw-load override, nproc, cpu/disk/total stressor
+  counts, capped?, cpu mechanism, cgroup probe, baseline/loaded ms, achieved
+  slowdown, tool versions, os/arch, git sha, command}`, so any flake is reproducible.
+  A cheap fixed bash micro-benchmark, timed unloaded then mid-load, records a coarse
+  achieved-slowdown figure (only when load is actually applied).
+
+### Platform asymmetry (current operating facts)
+
+The platforms diverge too much for a uniform calibrated injection layer, so the
+wrapper is honest about which regime ran:
+
+- Deterministic steering is portable (bash everywhere; pwsh equivalent) — the real
+  race-coverage tool, on every leg.
+- Calibrated CPU throttling via a cgroup v2 quota is **Linux-only and probe-gated**:
+  `GCL_STRESS_CGROUP=1` makes the wrapper *probe* for a writable cgroup v2 cpu
+  controller and record the result in the manifest (`writable` /
+  `present-not-delegated` / `no-cpu-controller` / `no-cgroup-v2`); it does not create
+  scopes here (that needs a usable systemd manager). IO cgroup throttling is
+  experimental and intentionally not attempted.
+- Everywhere else (macOS, Windows) load is blunt CPU/disk oversubscription —
+  uncalibrated but real pressure.
+
+## 6. GitHub Actions operating facts
+
+- **Minutes are free on public repos; concurrency is the real ceiling.** Free-plan
+  accounts cap concurrent jobs (~20 total, with a smaller macOS sub-limit). A matrix
+  past that *queues* into waves; it doesn't fail. The required gate stays small
+  enough to run in one wave; the deep sweep intentionally exceeds it and accepts
+  waves. macOS is the slowest and scarcest pool, so it is kept sparse across all
+  tiers; ubuntu (cheapest) is used liberally.
+- **`fail-fast: false`** on every matrix — an OS-specific failure is exactly the
+  signal we want, so the other legs finish.
+- **`paths-ignore` and required checks:** `tests.yml` filters docs/`.plans`/license
+  paths. A workflow whose jobs are *required* checks would leave those checks
+  Pending (blocking merge) when skipped by a path filter — but this project has no
+  branch protection, so the filter just saves runner minutes on doc-only pushes
+  without that hazard.
+- **Artifacts** are uploaded with `include-hidden-files: true` (the integration
+  suite's key diagnostics — lock log, repo state — live under the scratch repo's
+  `.git/`) and named uniquely per cell so parallel uploads never collide.
+- All actions are SHA-pinned.
+
+## 7. The discipline: required = always-meaningful-red
+
+The invariant that ties it together: **required is always-meaningful-red; nightly is
+triaged-amber-tolerant; deep is noise-by-design.** Keeping artificial load off the
+required gate is what makes a red gate trustworthy; putting all load in non-blocking
+tiers with the envelope assertions relaxed is what stops load from manufacturing
+flakes that erode that trust. The required tier is never retry-masked — a retry that
+hid a 1-in-20 real race would let through the silent-loss class this tool exists to
+prevent.
diff --git a/docs/steering-coverage.md b/docs/steering-coverage.md
deleted file mode 100644
index 8abaa03..0000000
--- a/docs/steering-coverage.md
+++ /dev/null
@@ -1,288 +0,0 @@
-# Deterministic-steering coverage: audit and gap list
-
-**Status: analysis / work-scoping.** This document maps the protocol's
-race-critical windows and branches to their deterministic-steering tests (or
-gaps), and scopes the test work that closes the gaps. It is the output of Phase
-1c of the [guarantees-and-coverage plan](../.plans/2026-06-17-ci-stress-guarantees-and-coverage-plan.md)
-(Bucket 7). Gap-*filling* is Phase 3 (bundled with the Bucket 2 fault-injection
-tests); this doc decides *what* to fill and *how*.
-
-**Why steering, not load.** As [`load-testing-strategy.md`](load-testing-strategy.md)
-establishes, the protocol's correctness rests on structural properties (O_EXCL
-create + atomic rename + per-attempt tokens), so the primary coverage lever is
-**in-process function interposition** — the test suite's `clone_fn` mechanism
-shadows internal `_lock_*` functions (and `mv`/`rm`/`touch`) to force an exact
-interleaving deterministically. External load only *probabilistically* widens
-the same windows. This audit therefore measures *steering* coverage, with an
-objective `kcov` line-coverage pass as a cross-check.
-
----
-
-## 1. Method and headline numbers
-
-Three independent inputs, reconciled below:
-
-1. **Manual window audit — acquire + steal paths.** Every branch/residual mapped
-   to its steering test or a gap.
-2. **Manual window audit — hold + release + discovery + staleness/mtime paths.**
-3. **`kcov` objective line coverage** (the mechanical cross-check) — built from
-   source (kcov v43; no apt package / prebuilt binary exists) and run on the unit
-   suite at FULL fan-out under WSL Ubuntu-24.04. Artifacts (gitignored):
-   `.agent-testing/kcov/` (`cobertura.xml`, merged unit+integration, line-by-line
-   HTML). Repro commands in [§5](#5-kcov-reproduction).
-
-**kcov result: 83.1% line coverage — 451 / 543 instrumented lines; 92 never
-executed.** (kcov does not do real branch coverage on bash — its branch numbers
-are trivially 1.0 and must be ignored.) The integration suite added **zero** lines
-over the unit suite, so the unit suite is the comprehensive measurement.
-
-Of the 92 uncovered lines:
-
-- **~30 are platform-gated and *correctly* unreachable on Linux** — ~23 in the
-  Windows no-delete-share handle lanes (an open handle blocking `unlink`/`rename`,
-  which never happens on POSIX), plus 3 in the macOS/BSD `mv` fallback. These are
-  covered on the **Windows** CI leg (interop Tests 13/31d/33c) and would need a
-  **macOS/BSD** leg for the `mv` fallback. They are **not** Linux gaps. The
-  practical Linux line-coverage ceiling is therefore ~94% ((543−30)/543), not
-  100%.
-- **~62 are Linux-reachable** — the real targets, prioritized in [§3](#3-the-gap-list-prioritized).
-
-**The cross-check earned its place.** kcov objectively corrected **three
-over-credits** in the manual audit — branches the manual reasoning inferred were
-covered, but which `kcov` shows were never executed:
-
-| Branch | Manual audit said | kcov (objective) | Reconciled |
-|---|---|---|---|
-| step-3.3 pre-rename CLAIM-ABORT block (`:1151-1160`) | covered via the step-2 / `deletion-gone` matrix positions | **hits=0** | **GAP** — the step-2 twin is steered, the near-identical step-3.3 twin is not |
-| `foreign` claim-recheck branch (`:1103-1106`) | covered via Test 33b + the matrix | **hits=0** | **GAP** — only the `gone` recheck leg is steered |
-| EXIT-trap no-hold arc-end (`:1009,1017-1018`) | transitively covered | **hits=0** | **GAP** — only the *signal* (TERM) no-hold twin is steered, not the EXIT-while-waiting one |
-
-This is the value of a mechanical pass over correlated manual reasoning: trust the
-instance, verify the output against the tool. Where this doc and a manual claim
-disagree, **kcov's `hits=0` wins**.
-
-(Line numbers below are anchors against the current `ci-stress` tree and may drift
-a few lines; the manual audits re-located everything and found the
-failure-modes.md anchors had moved ~9 lines.)
-
----
-
-## 2. What is already well covered (for confidence)
-
-The audit confirms the protocol's *delicate* paths are strongly steered, so the
-gaps are at the edges, not the core:
-
-- **The two read-back "twins"** are each independently steered with opposite
-  claim-token gates: the create-path "I twin" (`acquire verification FAILED`,
-  `:1354-1361`) by **Test 32**, and the steal-path "F2 twin" (`steal rename
-  completed but read-back`, `:1171-1179`) by **Test 32b**.
-- **The discovery rule** — the ownership-discovery read on every non-rename exit —
-  by **Test 25**'s 7-position matrix (`step2-fresh`, `recheck-gone`, `touch-gone`,
-  `lock-gone`, `contested`, `deletion-gone`, `source-gone`), each steering a rival
-  install to an exact protocol point.
-- **The two discovery routes** (direct `_lock_discover` vs the per-poll
-  leaked-token-memory check) each independently steered (Test 25 vs Test 31b),
-  with Test 31a deliberately accepting *either* route on the genuine scheduling
-  race between them.
-- **The claim re-verify / touch / lease-reset lane** (Tests 23/24/26/27), the
-  leaked-claim family (Tests 31/35/36), the never-steal guards for dir/symlink/FIFO
-  at both lock and claim paths (Tests 17/22), and the trap-time claim cleanup
-  (Test 33).
-
----
-
-## 3. The gap list, prioritized
-
-Each gap: location, what it is, how to steer it, and a priority. "Portable
-interposition" = a `clone_fn`/shadow test that runs on every OS (the cheapest,
-most valuable kind). "Fault injection" = needs a real resource/IO failure. "Platform"
-= only reachable / only meaningful on a specific OS leg.
-
-### Tier A — Portable deterministic steering (do these first; no fault injection)
-
-These are new `clone_fn`/shadow tests in the unit suite, runnable on every leg.
-
-- **A1 — `CLAIM-ABORT (rename-refused)`: wrong-type object at the lock path
-  mid-steal** (`:1195-1202`). *Headline gap.* The only acquire/steal **verdict**
-  branch with no steering test, and it has its own log string. (This is the
-  F2-audit #7 lane; the strategy doc's §2 reachability table missed it.) *Steer:*
-  `clone_fn _lock_verify_stale` (or shadow `mv`) to `mkdir` a directory onto the
-  lock path immediately before the rename; assert `rename-refused` + claim deleted
-  + discovery + no false hold. **Highest value.**
-
-- **A2 — step-3.3 pre-rename CLAIM-ABORT block** (`:1151-1160`; kcov-corrected
-  over-credit). The `gone`/`wrongtype`/`fresh` reason map + claim-delete +
-  discovery + `return 1`, near-identical to the step-2 block but separately
-  reachable. *Steer:* a `_lock_verify_stale` shadow with a call-counter that flips
-  to not-stale on the **second** call (step-3.3), the first call (step-2) passing.
-  **High value** (a whole unexercised abort lane).
-
-- **A3 — `foreign` claim-recheck branch** (`:1103-1106`; kcov-corrected
-  over-credit). A clearer removed our claim and a rival re-claimed → leave it,
-  discovery read, back off. *Steer:* shadow the claim read at recheck to return a
-  foreign token. **Medium-high.**
-
-- **A4 — `exec`-bypass of release / the §H4 no-silent-loss boundary** (`lock_run`
-  runs the wrapped command vector in the wrapper shell, `:1733`). No test exercises
-  the bash bypass; the ps1 `[Environment]::Exit()` twin *is* (interop Test 5).
-  **Empirically verified (2026-06-17):** the bypass needs the exec to run in the
-  **lock-holding shell itself** — `run -- exec true` (the wrapped command *is* an
-  exec), or a sourced `lock_acquire; exec true` — **not** `run -- bash -c 'exec
-  true'`, which execs a *child* and lets the wrapper release normally (so that
-  recipe would silently pass without testing anything). *Steer, two parts:* (a)
-  benign — `run -- exec true` (or sourced `lock_acquire; exec …`) and assert no
-  `RELEASED` line / lock left held; (b) the silent-loss — backdate the lease + park
-  a contender so the holder is *displaced*, then exec a 0-exit and assert the caller
-  sees 0 with **no** 98 (pinning [`guarantees.md`](guarantees.md) OOS-5). **High
-  value** — the one interleaving that can silently lose an update. *Note:* this
-  corrected the original audit recipe, which used the non-bypassing `bash -c 'exec'`
-  form — a foreign-model (Codex) review + a 4-line empirical check caught it; the
-  manual audit and a same-model reviewer both had it wrong.
-
-- **A5 — forward clock-jump → premature steal of a live lock** (§E2; age = now −
-  mtime, `:928,1409`). Code-safe (degrades to the detected-98 lane) but untested.
-  *Steer:* `clone_fn _lock_now` to return now+offset on the poll while the real
-  holder's mtime stays current, forcing age ≥ STALE on a live lock; assert the
-  victim's release hits 98 (a clock-driven analogue of Test 4b). **Medium.**
-
-- **A6 — mtime-unreadable fail-safe** (§E3; `:639-645` warn, `:912-926` consume).
-  Only a *negative* assertion exists (the warning must NOT fire under normal
-  contention, Test 1). *Steer:* `clone_fn` the mtime helper (`_lock_path_mtime` /
-  the `stat` shadow) to return empty on a present file; assert the warn-once fires,
-  no steal occurs, and a waiter reaches 97. **Medium** (it is the clean reason
-  recovery is Tier-1-*within-envelope*, so worth pinning).
-
-- **A7 — malformed/unreadable content classification tails** (the `_lock_verify_stale`
-  tail `:940-949`; the in-acquire steal content guard `:1429-1443`; the
-  `_lock_claim_stale_check` content tail `:1240-1249`). The `tok.`-prefixed and
-  empty-orphan lanes are covered; the **non-empty-blank-line-1** (`#18`),
-  **unreadable-content steal-skip** (`#17`), and **vanished-mid-check** sibling
-  branches are not. *Steer:* fabricate a line-1-whitespace file and a
-  read-fault shadow; backdate; assert no-steal + the right warning. **Low-medium,
-  cheap** (several branches per small test).
-
-- **A8 — socket & device-node wrong-type arms** (`:1474-1475` claim path,
-  `:1561-1562` lock path; kcov-new). The dir/symlink/FIFO arms are tested; the
-  socket (`-S`) and device (`-b/-c`) arms are not. *Steer:* bind a unix socket /
-  reference a device node (`/dev/null`) at the path; assert refusal. **Low, cheap**
-  (sibling arms of a tested guard; both creatable on Linux).
-
-- **A9 — log rotation past 1 MB** (`:558-559`; kcov-new). *Steer:* pre-write a
-  >1 MB log, trigger a log call, assert truncate-restart. **Low, trivial** (no
-  fault injection).
-
-- **A10 — EXIT-trap no-hold arc-end** (`:1009,1017-1018`; kcov-corrected
-  over-credit). EXIT while *waiting* without a hold or in-flight claim. *Steer:* a
-  sourced `lock_acquire` that exits while still blocked; assert the no-hold
-  cleanup/restore path runs. **Low.**
-
-- **A11 — `mv -T` fallback forced on** (`:969,976-977`). Naturally hit only on
-  BSD/macOS, but **made Linux-steerable** by forcing `_LOCK_MVT=0` (or shadowing
-  the probe's `mv -T` to fail) in a sourced steering shell, then running a steal —
-  and a steal-into-a-directory to hit the `[ -d ]` guard (dovetails with A1).
-  **Low-medium** (closes a real engine lane on the common leg instead of waiting
-  for a BSD runner).
-
-### Tier B — Fault injection (real resource/IO failures; mostly POSIX-only)
-
-These are the [`failure-modes.md`](failure-modes.md) §4.5 lanes (Ben's override to
-add coverage) plus the read-fault siblings. They need a real failure, not
-interposition; guard by platform and **flag any that can't be injected portably
-rather than shipping a flake** (per the §4.5 decision).
-
-- **B1 — Unwritable lock dir/parent → clean 97** (F4). `chmod` the dir.
-  POSIX; the cheapest and highest-value fault-injection test. **High.**
-- **B2 — Unwritable/failing log path → lock still works, log swallowed** (F2/J1).
-  *Phase-2 feasibility:* use the **ENOTDIR trick** (`AGENT_LOCK_LOG` under a regular
-  file) — **portable, no chmod/guard**. **First cut.**
-- **B3 — ENOSPC during claim/lock create+write** (F1; the create write-fail branch
-  `#5` and the read-fault lanes `:848,871-873`). *Phase-2 feasibility:* real injection
-  needs `mount` (Linux **root**); `ulimit -f` is a SIGXFSZ trap (wrong lane). **Second
-  cut — Linux + `sudo -n` probe-gated, or document-only.**
-- **B4 — FD exhaustion via `ulimit -n`** (F3). **Corrected (Phase-2 feasibility,
-  supersedes the earlier "portable POSIX" rating):** NOT portably/deterministically
-  injectable — the create needs only ~1 FD, so any `ulimit -n` low enough to fail it
-  first starves bash's own startup (machine-dependent harness corruption); inode
-  exhaustion needs root. **Document-only.**
-
-### Tier C — Platform-only (verify off-Linux; not a Linux gap)
-
-- **C1 — Windows no-delete-share handle lanes** (~23 lines: `:881-890,993,
-  1639-1647,1700-1712`). Already covered by interop Tests 13/31d/33c on the Windows
-  CI leg. *Action:* confirm the Windows leg's coverage exercises them (it does by
-  construction); no Linux work. Consider a kcov-equivalent on Windows is
-  impractical — rely on the explicit interop tests.
-- **C2 — macOS/BSD `mv` fallback real path** (`:969,976-977`). A11 makes this
-  Linux-steerable by forcing the probe off; a *genuine* BSD `mv` exercise needs a
-  macOS leg. *Action:* prefer A11 (portable) and treat a macOS leg as optional
-  per the load-strategy matrix.
-
-### Tier D — Bounded residuals: document, don't test
-
-Low-value, bounded, detected, or self-healing; the manual audits rate these
-not worth a dedicated test. *Action:* ensure each is named in the code header /
-`guarantees.md` as an accepted residual; fold into a broader test opportunistically
-if cheap, but do not build bespoke tests.
-
-- **D1 — residual-1** (verify→rename: our rename clobbers a freshly-created rival
-  lock → victim detects 98). Detection is covered structurally; the specific
-  interleaving is bounded + detected.
-- **D2 — residual-3** (claimant suspended between touch and rename installs an
-  aged-mtime lock). Bounded shortfall, self-healing; the *positive* lease-reset is
-  covered (Test 26).
-- **D3 — leaked-resolve rare arc-end legs** (`:755-758,1260-1262`) and the
-  release boundary-re-read in isolation (`R2`). Reachable only with a non-empty
-  leaked set; transitively exercised.
-
----
-
-## 4. Scoping summary for Phase 2
-
-- **Tier A (11 tests, portable interposition)** is the bulk of the value and the
-  bulk of the work — all runnable on every CI leg, no fault-injection fragility.
-  A1, A2, A4 are the high-value three (a real verdict branch, a whole unexercised
-  abort lane, and the single silent-loss boundary). Bundle these into the unit
-  suite alongside the Bucket-2 work.
-- **Tier B (4 tests, fault injection)** is the failure-modes §4.5 set; platform-gate
-  them and flag any non-portable lane in the Phase-2 plan rather than shipping a
-  flake.
-- **Tier C** is verification on the Windows leg (already covered) + an optional
-  macOS leg; **Tier D** is documentation, not tests.
-- **Expected effect:** closing Tier A + the Linux-injectable parts of Tier B should
-  take Linux line coverage from 83.1% toward the ~94% platform ceiling; the
-  remaining ~6% is the Windows/BSD platform-gated lanes covered on their own legs.
-- **Harness ergonomics (Bucket 8)** pay off here: a `GCL_TEST_ONLY=<regex>`
-  selector and TAP output make iterating on ~15 new steered tests far cheaper —
-  schedule them before/with the test build.
-
----
-
-## 5. kcov reproduction
-
-For re-running the objective coverage measurement (per the reproducible-experiments
-principle). All from Git Bash; `MSYS_NO_PATHCONV=1` stops Git Bash mangling a
-leading `/tmp` arg into a Windows path before WSL sees it.
-
-```bash
-# Build kcov v43 (no apt package; upstream ships no prebuilt binary):
-wsl.exe -d Ubuntu-24.04 -e bash -c 'sudo apt-get install -y cmake libdw-dev libelf-dev \
-  binutils-dev libcurl4-openssl-dev zlib1g-dev libiberty-dev'
-wsl.exe -d Ubuntu-24.04 -e bash -c '
-  cd /tmp && curl -fsSL https://github.com/SimonKagstrom/kcov/archive/refs/tags/v43.tar.gz \
-    | tar xz && mkdir kcov-build && cd kcov-build && cmake ../kcov-43 && make -j"$(nproc)"'
-
-# Run the unit suite under kcov (FULL fan-out) and list never-executed lines:
-MSYS_NO_PATHCONV=1 wsl.exe -d Ubuntu-24.04 -e bash -c '
-  cd /mnt/c/agent_data/commit-lock/worktrees/ci-stress &&
-  GCL_TEST_FULL=1 /tmp/kcov-build/src/kcov --include-path=git-commit-lock.sh \
-    /tmp/gcl-cov tests/git-commit-lock.test.sh'
-MSYS_NO_PATHCONV=1 wsl.exe -d Ubuntu-24.04 -e bash -c '
-  F=/tmp/gcl-cov/git-commit-lock.test.sh.*/cobertura.xml;
-  grep -oE "<line number=\"[0-9]+\" hits=\"[0-9]+\"/>" $F |
-    sed -E "s/.*number=\"([0-9]+)\" hits=\"([0-9]+)\".*/\1 \2/" |
-    awk "\$2==0 {print \$1}" | sort -n'
-```
-
-When the kcov pass becomes a permanent CI leg (Phase 3 / Bucket 7), it runs on the
-Linux runner against the unit suite at FULL, and the platform-gated ~30 lines (§1)
-are expected-uncovered there by design.

From beaf943f87a27279f4221d33c9e4e4c31824f75e Mon Sep 17 00:00:00 2001
From: Ben Toner <bentoner@bentoner.com>
Date: Thu, 18 Jun 2026 13:10:46 +1000
Subject: [PATCH 71/76] Phase-4 round-2: fix dangling load-testing ref +
 guarantees line-number disclaimer
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Round-2 re-review (tests + CI clean; Codex docs review found 3 issues):
- deep-sweep.yml referenced docs/load-testing-strategy.md "§9", a section the
  rewrite removed (Tier D is now under §3) — point to the doc generically.
- guarantees.md lacked the line-number disclaimer its sibling failure-modes.md
  carries; the branch's interop/unit reformatting drifted some `file:line` anchors
  (the test NAMES/NUMBERS stay correct). Added the "anchors, not exact addresses"
  disclaimer rather than re-derive every anchor (a transient, re-drifting fix the
  disclaimer exists to avoid; reviewer D round-1 judged the drift non-blocking).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 .github/workflows/deep-sweep.yml | 2 +-
 docs/guarantees.md               | 5 ++++-
 2 files changed, 5 insertions(+), 2 deletions(-)

diff --git a/.github/workflows/deep-sweep.yml b/.github/workflows/deep-sweep.yml
index 2fff157..d83bb0f 100644
--- a/.github/workflows/deep-sweep.yml
+++ b/.github/workflows/deep-sweep.yml
@@ -1,4 +1,4 @@
-# deep-sweep — Tier D of the load-testing strategy (docs/load-testing-strategy.md §9).
+# deep-sweep — Tier D of the load-testing strategy (see docs/load-testing-strategy.md).
 #
 # ON-DEMAND ONLY. This workflow is `workflow_dispatch`-only: it NEVER runs on push
 # or pull_request, and it NEVER gates anything (it is not a required check — this is
diff --git a/docs/guarantees.md b/docs/guarantees.md
index 5f839a8..f5194bc 100644
--- a/docs/guarantees.md
+++ b/docs/guarantees.md
@@ -15,7 +15,10 @@ reference* (why the protocol is shaped this way and how it works). Where they
 appear to disagree, the **code and tests are authoritative**, then this contract,
 then the analysis, then the design narrative. Each guarantee below cites its
 witnessing test(s) and the failure-modes section that justifies it; the
-[Verification map](#7-verification-map) collects those pointers.
+[Verification map](#7-verification-map) collects those pointers. (Test and
+`file:line` citations are **anchors, not exact addresses**: find a test by its
+name/number — the line numbers reflect the tree when written and drift as files
+move.)
 
 This contract makes **no new claims** about behavior — it is a re-statement of
 the decisions recorded in `failure-modes.md` §4 as commitments. It does not

From 1b0bf7dea7a40f8d15bf45ce39d6bac8bb118cf6 Mon Sep 17 00:00:00 2001
From: Ben Toner <bentoner@bentoner.com>
Date: Thu, 18 Jun 2026 13:21:16 +1000
Subject: [PATCH 72/76] Phase-4 round-3: fix stale "recommend/doc-gap" framing
 in failure-modes.md
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Round-3 docs re-review (Codex) found the E2 row + §4 items 1-2 still framing the
load/timing envelope and single-clock docs as future recommendations, though both
are implemented (git-commit-lock.md "operating envelope" / "One time source";
load-testing-strategy.md §1). Fixed the whole class so the doc describes current
state — verified each target exists before de-flagging:

- E2 + K1 table rows: drop "not addressed in docs" / "UNDER-documented" /
  "Define the envelope" -> state it's documented.
- §E2 detail and §K "Net K": "Recommend: document X" -> present-tense
  "documented (where)".
- §4 items 1/2/3: add *Status (done):* markers (matching item 5's style) and
  neutralize the two imperative headings.
- §B1 latency-bound, §E3, §H4 "Recommend: document" -> "documented (where)"
  (targets verified: guarantees.md BE-3 for E3, OOS-5/G-S1 for H4, §K for B1).

Tests + CI legs were clean in round 2 and are untouched here (docs-only round).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 docs/failure-modes.md | 58 ++++++++++++++++++++++++++-----------------
 1 file changed, 35 insertions(+), 23 deletions(-)

diff --git a/docs/failure-modes.md b/docs/failure-modes.md
index f4c0ef7..09f4275 100644
--- a/docs/failure-modes.md
+++ b/docs/failure-modes.md
@@ -125,7 +125,7 @@ robust-by-code-but-unverified · S static/grep check · (plat) platform-gated.
 | D4 | Non-lock CONTENT at path (user file) | Never stolen (content guard); warn | 1 | ✓ U:1034-1076 | **In scope.** Two accepted residuals (§D4). |
 | D5 | Case-insensitive FS path collision | Not handled explicitly | 3 | ✗ | **Likely non-issue;** see §D5. Decide. |
 | E1 | Network/shared FS (NFS/SMB/9p/Dropbox) | Outside design guarantees (stated) | 3 | ✗ | **Out of scope** (stated). See §E — decide whether to *enforce*. |
-| E2 | Multi-host clock skew / NTP jump | Implicitly single-clock; **not** addressed in docs | 3 (and a doc gap) | ✗ | **Out of scope** but UNDER-documented. See §E2. |
+| E2 | Multi-host clock skew / NTP jump | Single-clock assumption; documented (local jump → detected-98, safe) | 3 | ✗ | **Out of scope**; single-clock assumption documented. See §E2. |
 | E3 | mtime probe unreadable (staleness clock broken) | Warns loudly once; treats as not-stale → safe, recovery disabled → 97 | 2 | ✓ U:Test 42 | **Accept** — fails safe + announced. See §E3. |
 | F1 | Disk full (ENOSPC) during create/write | Create fails → wait; torn write ages out | 2/3 | ✓ U:Test 50 (Linux+sudo tmpfs; (plat) skip elsewhere) | **Tested** (§4.5) + document. See §F1. |
 | F2 | ENOSPC during LOG write | Swallowed (`|| true`); silent log loss | 2 | ✓ U:Test 49 (portable failing-log path) | **Tested** (§4.5); logging best-effort, lock unaffected. |
@@ -142,7 +142,7 @@ robust-by-code-but-unverified · S static/grep check · (plat) platform-gated.
 | I1 | bash⇄pwsh wire/format compatibility | Shared format; token grammar tightened to match | 1 | ✓ I:* throughout | **In scope.** Keep. |
 | I2 | Mixed-VERSION tree (old unserialized steal) | Prevention degrades to detection (98); `.dead.*` litter | 3 | ✗ | **Out of scope:** "upgrade both together." Residual 4. |
 | J1 | Logging subsystem failure | All log writes `|| true`; 1 MB self-truncate | 2 | ✓ U:Test 49 (via F2) | **Tested** (§4.5, via F2); logging never blocks the lock. |
-| K1 | Extreme load / CPU oversubscription / slow FS | Correctness holds; wall-clock bounds stretch | 2 | ~ (CI stress) | **Define the envelope.** See §K — the key analytical section. |
+| K1 | Extreme load / CPU oversubscription / slow FS | Correctness holds; wall-clock bounds stretch | 2 | ~ (CI stress) | **Envelope defined** (design doc + envelope tier). See §K — the key analytical section. |
 | K2 | Internal time budgets (poll, MAX_WAIT, read ladder) | Fixed schedules; tunable | 2 | ✓/~ | **In scope** as Tier-2 envelope. See §K. |
 
 U = `tests/git-commit-lock.test.sh`, I = `tests/git-commit-lock.interop.test.sh`,
@@ -203,7 +203,7 @@ ghost, cross-parsing each other's claim files, `I:1017-1088`).
 file's mtime is older than `STALE_SECS`, a waiter steals it. *Recovery is Tier
 1; recovery latency is Tier 2* (bounded by STALE + poll cadence under normal
 load). Tested via the stale-lock and empty-orphan steals (`U:197-210, 348-361`).
-**Recommend: in scope (recovery). Document the latency bound (§K).**
+**Recommend: in scope (recovery); latency bound documented (§K).**
 
 **B2 — Trappable death mid-claim (INT/TERM).** The EXIT/INT/TERM handlers are
 armed at acquire *start*, not at hold, in "claim-window mode"
@@ -436,13 +436,14 @@ principles about what can go wrong:
   (`git-commit-lock.sh:439-449`, `git-commit-lock.ps1:448-451`), never local
   time.
 
-*Tier 3 for cross-host (rides on E1); Tier 2 for a local NTP jump.* Untested.
-**Recommend:** (a) **document explicitly** that the tool assumes a single time
-source — i.e. single-host use (the common case) or a shared FS with a single
-server clock — and that this is *why* network/multi-host is out of scope; the
-current docs imply it but never say "one clock." (b) Note the reassuring part: a
-*local* clock jump is correctness-safe (degrades to the detected-98 lane), so no
-code change is warranted. This is a **doc gap, not a code gap.**
+*Tier 3 for cross-host (rides on E1); Tier 2 for a local NTP jump.* Untested — and
+no code change is warranted (see below). **Documented:** the design doc now states
+explicitly that the tool assumes a single time source — single-host use (the common
+case) or a shared FS with a single server clock — and that this is *why*
+network/multi-host is out of scope (`git-commit-lock.md`, "One time source"). It
+also records the reassuring part: a *local* clock jump is correctness-safe — a
+forward jump can prematurely steal a still-live lock, but that degrades to the
+detected exit-98 lane, never a silent double-commit. A doc matter, not a code gap.
 
 **E3 — mtime probe fails entirely (the staleness clock is unreadable).** Distinct
 from a *wrong* clock (E2): here the lock file's mtime cannot be read at all. Both
@@ -459,7 +460,7 @@ MAX_WAIT (97). *Tier 2 (safety held, recovery lost — and loudly announced).*
 Tested: unit Test 42 shadows the inner mtime probe to return empty on a present,
 stale ghost and asserts the fail-safe lane — the "Staleness detection is BROKEN"
 warn-once fires, the ghost is NOT stolen (left in place), and the waiter blocks to
-MAX_WAIT → 97. **Recommend: accept and document** — it is a
+MAX_WAIT → 97. **Recommend: accept; documented (§E3, `guarantees.md` BE-3)** — it is a
 host/FS-health failure the tool already detects and announces, and it fails *safe*
 (no false steal); the loud warning is the right behavior. This is also the clean
 reason recovery is a *Tier-1-within-envelope* property, not unconditional (see the
@@ -612,8 +613,8 @@ hard-exits the process **and** returns 0 **while displaced**. The *next* holder
 still recovers via staleness; only the abruptly-exiting one is unwarned. *Tier 2 —
 the residual edge of the fail-open lease.* Exercised indirectly: interop Test 5
 *uses* `[Environment]::Exit()` to fabricate a no-release orphan, confirming the
-bypass (`I:308-334`). **Recommend: document this as the explicit boundary of the
-no-silent-loss guarantee**, alongside the "commits must be fast" golden rule — a
+bypass (`I:308-334`). **Recommend: accept; documented as the explicit boundary of the
+no-silent-loss guarantee** (`guarantees.md` OOS-5 / G-S1), alongside the "commits must be fast" golden rule — a
 command that replaces/hard-exits the process mid-critical-section *after being
 displaced* is exactly the fail-open case the STALE budget exists to make rare. No
 code change closes it without the handle-based ops the design rejected (§H3).
@@ -748,14 +749,17 @@ concern:**
    the regression guard (`I:746-817`). **Recommend: keep that test; treat any
    regression here as Tier 1.**
 
-**Net K recommendation:** adopt the explicit envelope — *"correctness holds under
-any load; wall-clock recovery/timeout latency scales with poll cadence and
-scheduling, bounded by the configured knobs."* Put that sentence in the design
-doc. Then audit the suite's wall-clock assertions and **scope each to the load
-level it's meant to run at** (the stress branch's extreme `both/8-hog` mode is a
-flake-hunting tool, not a contract the product must meet on a 2-core runner).
-This is the cleanest way to stop chasing "flakes" that are really the test
-asserting a Tier-1 bound on a Tier-2 quantity.
+**Net K — the envelope, now adopted.** The explicit envelope — *"correctness holds
+under any load; wall-clock recovery/timeout latency scales with poll cadence and
+scheduling, bounded by the configured knobs"* — is stated in the design doc
+(`git-commit-lock.md`, "operating envelope") and in `load-testing-strategy.md` §1.
+The suite's wall-clock assertions are scoped to a load level via the envelope tier
+(`GCL_ENVELOPE_TIER` strict/relax, `ok_envelope`/`bad_envelope`): an oversubscribed
+runner's latency miss warns rather than fails, while the correctness asserts stay
+strict. So the stress branch's extreme `both/8-hog` mode is a flake-hunting tool,
+not a contract the product must meet on a 2-core runner. That structurally ends
+the chase after "flakes" that are really a test asserting a Tier-1 bound on a
+Tier-2 quantity.
 
 ---
 
@@ -769,7 +773,7 @@ Item 3 (network FS) is **document-only**: do not build the FS-type probe. Item 5
 edge cases make the tool more maintainable and give future users confidence), rather than
 "accept untested". Every other recommendation is accepted as written.
 
-1. **Define and document the load/timing envelope (§K) — highest value.**
+1. **The load/timing envelope (§K) — highest value.**
    *Recommendation:* state in `docs/git-commit-lock.md` that correctness
    (exclusion, no silent loss, eventual recovery) is load-independent, while all
    wall-clock bounds (recovery latency, MAX_WAIT, the read ladder) are
@@ -779,14 +783,20 @@ edge cases make the tool more maintainable and give future users confidence), ra
    envelope misses, not product regressions. *This resolves the recurring
    "flake" question structurally.* Cost: doc + a test-bound audit; no product
    change.
+   *Status (done):* the envelope is stated in `docs/git-commit-lock.md` ("operating
+   envelope" — correctness load-independent, wall-clock bounds best-effort) and
+   `docs/load-testing-strategy.md` §1, and the suite's wall-clock assertions are
+   scoped to a load level via the envelope tier (`GCL_ENVELOPE_TIER`).
 
-2. **Multi-host / clock-skew assumption is under-documented (§E2) — doc gap, not
+2. **Multi-host / clock-skew assumption (§E2) — a doc matter, not a
    code gap.** The tool implicitly assumes a single time source; a *local* NTP
    jump is correctness-safe (degrades to the detected-98 lane), and cross-host
    skew only bites on a network FS that's already out of scope. *Recommendation:*
    add one explicit sentence — "assumes a single clock, i.e. single-host (the
    common case) or a shared FS with one server clock" — and the reassurance that
    a local clock jump cannot cause a silent double-commit. No code change.
+   *Status (done):* the single-clock sentence + local-jump reassurance are in
+   `docs/git-commit-lock.md` ("One time source").
 
 3. **Network/shared FS is out of scope but fails *silently* if entered (§E1).**
    The boundary is correctly stated in the design doc but only there.
@@ -795,6 +805,8 @@ edge cases make the tool more maintainable and give future users confidence), ra
    **not** attempt to *support* network FS, and **do not build** the optional
    FS-type startup probe — just document. (It would be cross-platform-awkward and
    incomplete anyway; Ben: "don't do the polish, just document.")
+   *Status (done):* the network/sync-FS boundary is stated in `README.md` (the
+   "local filesystems only" note); the FS-type probe was deliberately not built.
 
 4. **ps1-on-POSIX FIFO/device residual (§D3) and ps1 `-File` exit backstop gap
    (§H3) — accept as documented.** Both are real but confined to an unsupported

From 2b6f8de295f54b705f6d9885d583054eacad6d47 Mon Sep 17 00:00:00 2001
From: Ben Toner <bentoner@bentoner.com>
Date: Thu, 18 Jun 2026 13:31:00 +1000
Subject: [PATCH 73/76] Phase-4 round-4: de-reference deleted .plans/ content +
 fix stale section anchors
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Round-4 docs re-review (Codex) flagged a cross-repo ref and stale plan vocabulary;
swept the whole class so the merged tree is self-contained (.plans/ + AGENTS.md are
deleted at merge, so any reference to their content would dangle on main):

- failure-modes.md: drop the cross-repo `agents/600-claude.md` citation; the
  "§4.5"/"§4.1" anchors meant "§4 item 5"/"§4 item 1" (no such subsections exist) —
  fixed across the doc.
- guarantees.md: same "§4.5"/"§4.1" fix; drop "Bucket 4 / D-c" plan vocabulary.
- deep-sweep.yml / nightly-triage.sh / nightly.yml: drop "Phase-2 build plan",
  "Bucket 6 spec/design", "Bucket-2 Tier-A" — the decisions they cited are stated
  inline.
- tests (test.sh / interop.test.sh / _harness.sh): "Bucket 4 / D-c" -> "see
  failure-modes.md §K / §4 item 1"; "Bucket 6" -> "see load-testing-strategy.md";
  "failure-modes.md §4.5" -> "§4 item 5". Comment-only; bash -n clean.

Validated: actionlint (deep-sweep, nightly), shellcheck (nightly-triage), bash -n
(all edited shell files).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 .github/scripts/nightly-triage.sh     |  4 ++--
 .github/workflows/deep-sweep.yml      |  4 ++--
 .github/workflows/nightly.yml         |  2 +-
 docs/failure-modes.md                 | 18 +++++++++---------
 docs/guarantees.md                    | 12 ++++++------
 tests/_harness.sh                     |  2 +-
 tests/git-commit-lock.interop.test.sh |  2 +-
 tests/git-commit-lock.test.sh         | 12 ++++++------
 8 files changed, 28 insertions(+), 28 deletions(-)

diff --git a/.github/scripts/nightly-triage.sh b/.github/scripts/nightly-triage.sh
index d9bab14..f0f8dd4 100644
--- a/.github/scripts/nightly-triage.sh
+++ b/.github/scripts/nightly-triage.sh
@@ -8,7 +8,7 @@
 # JSON file. It reads only files on disk + `gh`; it makes no test decisions of its
 # own beyond parsing the preserved logs.
 #
-# CLASSIFICATION (per the Bucket 6 spec):
+# CLASSIFICATION:
 #   correctness  — any `^FAIL:` line in a suite log, OR a cell job concluded
 #                  `failure`. Files/append a `nightly-correctness` issue. The one
 #                  class that demands investigation.
@@ -118,7 +118,7 @@ for cell in $EXPECTED_CELLS; do
     # Logs exist but the job did not cleanly succeed and there is no assertion FAIL:
     # failure-without-^FAIL / timeout / cancelled / errored late ⇒ infra, not
     # correctness and not green (a failure WITHOUT a FAIL line is a step
-    # timeout/late error, which is infra per the Bucket 6 design).
+    # timeout/late error, which is infra).
     infra_evidence+="- ${cell}: logs present but job conclusion='${concl}' (failure/timeout/cancel without ^FAIL: line)"$'\n'
     log "[$cell] INFRA (conclusion=$concl, no FAIL)"
   elif [ "$cell_envwarn" -eq 1 ]; then
diff --git a/.github/workflows/deep-sweep.yml b/.github/workflows/deep-sweep.yml
index d83bb0f..2877b52 100644
--- a/.github/workflows/deep-sweep.yml
+++ b/.github/workflows/deep-sweep.yml
@@ -2,8 +2,8 @@
 #
 # ON-DEMAND ONLY. This workflow is `workflow_dispatch`-only: it NEVER runs on push
 # or pull_request, and it NEVER gates anything (it is not a required check — this is
-# a single-dev project with no branch protection; see the Phase-2 build plan's
-# Bucket 6 decision box). It exists purely as a deep flake-hunting tool — the
+# a single-dev project with no branch protection). It exists purely as a deep
+# flake-hunting tool — the
 # "50-clean hunt" instrument from the load-testing strategy: dispatch it (often many
 # times in parallel), pick a stress kind/magnitude, and repeat the full suite N
 # times per job to surface intermittent, scheduling-sensitive flakes that a single
diff --git a/.github/workflows/nightly.yml b/.github/workflows/nightly.yml
index ba5c9ad..238d234 100644
--- a/.github/workflows/nightly.yml
+++ b/.github/workflows/nightly.yml
@@ -242,7 +242,7 @@ jobs:
           # Compare rate >= floor with awk (float-safe).
           if awk -v r="$rate" -v f="$floor" 'BEGIN { exit !(r + 0 >= f + 0) }'; then
             echo "PASS: line coverage $rate >= floor $floor"
-            echo "NOTE: the floor ($floor) tracks the achieved coverage (~0.83); ratchet it up toward ~0.90 as Bucket-2 Tier-A tests land. The Linux ceiling is ~0.94 (~30 lines are platform-gated)."
+            echo "NOTE: the floor ($floor) tracks the achieved coverage (~0.83); ratchet it up toward ~0.90 as more Linux-coverable tests land. The Linux ceiling is ~0.94 (~30 lines are platform-gated)."
           else
             echo "::error::line coverage $rate is BELOW the floor $floor — coverage regressed"
             echo "The floor tracks achieved coverage (~0.83) and should only ratchet UP as tests land. A drop means a test stopped exercising lines it used to. Investigate before lowering the floor."
diff --git a/docs/failure-modes.md b/docs/failure-modes.md
index 09f4275..adc7eff 100644
--- a/docs/failure-modes.md
+++ b/docs/failure-modes.md
@@ -127,10 +127,10 @@ robust-by-code-but-unverified · S static/grep check · (plat) platform-gated.
 | E1 | Network/shared FS (NFS/SMB/9p/Dropbox) | Outside design guarantees (stated) | 3 | ✗ | **Out of scope** (stated). See §E — decide whether to *enforce*. |
 | E2 | Multi-host clock skew / NTP jump | Single-clock assumption; documented (local jump → detected-98, safe) | 3 | ✗ | **Out of scope**; single-clock assumption documented. See §E2. |
 | E3 | mtime probe unreadable (staleness clock broken) | Warns loudly once; treats as not-stale → safe, recovery disabled → 97 | 2 | ✓ U:Test 42 | **Accept** — fails safe + announced. See §E3. |
-| F1 | Disk full (ENOSPC) during create/write | Create fails → wait; torn write ages out | 2/3 | ✓ U:Test 50 (Linux+sudo tmpfs; (plat) skip elsewhere) | **Tested** (§4.5) + document. See §F1. |
-| F2 | ENOSPC during LOG write | Swallowed (`|| true`); silent log loss | 2 | ✓ U:Test 49 (portable failing-log path) | **Tested** (§4.5); logging best-effort, lock unaffected. |
+| F1 | Disk full (ENOSPC) during create/write | Create fails → wait; torn write ages out | 2/3 | ✓ U:Test 50 (Linux+sudo tmpfs; (plat) skip elsewhere) | **Tested** (§4 item 5) + document. See §F1. |
+| F2 | ENOSPC during LOG write | Swallowed (`|| true`); silent log loss | 2 | ✓ U:Test 49 (portable failing-log path) | **Tested** (§4 item 5); logging best-effort, lock unaffected. |
 | F3 | Inode / FD exhaustion | Create fails → wait → 97 | 2 | ○ (document-only) | **Document-only**: no deterministic portable injection. See §F3. |
-| F4 | Read-only / unwritable lock dir or parent | `mkdir -p` best-effort; create fails → wait → 97 | 2 | ✓ U:Test 48 (POSIX `chmod 0555`; (plat) skip on Windows) | **Tested** (§4.5, highest-value). See §F4. |
+| F4 | Read-only / unwritable lock dir or parent | `mkdir -p` best-effort; create fails → wait → 97 | 2 | ✓ U:Test 48 (POSIX `chmod 0555`; (plat) skip on Windows) | **Tested** (§4 item 5, highest-value). See §F4. |
 | G1 | Lock path = a directory / `$HOME` typo | Never stolen/deleted; loud warn; → 97 | 1 | ✓ U:818-840 | **In scope.** Keep. |
 | G2 | Garbage numeric config | Falls back to default + stderr note | 1 | ✓ U:695-703, I:554-608 | **In scope.** Keep. |
 | G3 | `run` outside a git repo, no `AGENT_LOCK_PATH` | Refuses (96) | 1 | ✓ U:705-712 | **In scope.** Keep. |
@@ -141,7 +141,7 @@ robust-by-code-but-unverified · S static/grep check · (plat) platform-gated.
 | H4 | Non-unwinding exit while held (SIGKILL / bash `exec` / `[Environment]::Exit()`) | Skips release → a displaced holder is unwarned (no 98); plain `exit` is safe | 2 | ~ (I:308-334 indirect) | **Document** the no-silent-loss boundary. See §H4. |
 | I1 | bash⇄pwsh wire/format compatibility | Shared format; token grammar tightened to match | 1 | ✓ I:* throughout | **In scope.** Keep. |
 | I2 | Mixed-VERSION tree (old unserialized steal) | Prevention degrades to detection (98); `.dead.*` litter | 3 | ✗ | **Out of scope:** "upgrade both together." Residual 4. |
-| J1 | Logging subsystem failure | All log writes `|| true`; 1 MB self-truncate | 2 | ✓ U:Test 49 (via F2) | **Tested** (§4.5, via F2); logging never blocks the lock. |
+| J1 | Logging subsystem failure | All log writes `|| true`; 1 MB self-truncate | 2 | ✓ U:Test 49 (via F2) | **Tested** (§4 item 5, via F2); logging never blocks the lock. |
 | K1 | Extreme load / CPU oversubscription / slow FS | Correctness holds; wall-clock bounds stretch | 2 | ~ (CI stress) | **Envelope defined** (design doc + envelope tier). See §K — the key analytical section. |
 | K2 | Internal time budgets (poll, MAX_WAIT, read ladder) | Fixed schedules; tunable | 2 | ✓/~ | **In scope** as Tier-2 envelope. See §K. |
 
@@ -475,7 +475,7 @@ comment at `:1341-1343`). A created-but-write-failed file is an empty orphan tha
 ages into the steal lane. A torn write *shorter than `tok.`* (e.g. `to`) is the
 accepted residual at `:299-304`: non-empty, non-prefixed → never stolen, loud,
 fixed by one manual `rm`. *Tier 2 (degrades to wait/97) / Tier 3 (the torn-write
-manual-fix residual).* **Tested** (per §4.5): unit Test 50 mounts a small 64k
+manual-fix residual).* **Tested** (per §4 item 5): unit Test 50 mounts a small 64k
 tmpfs, fills it to ENOSPC, and asserts the waiter times out at 97 with the wrapped
 command never running — no corruption, no false hold. ENOSPC injection needs a full
 FS (root via a tmpfs; `ulimit -f` raises SIGXFSZ — the wrong lane), so the test runs
@@ -486,7 +486,7 @@ documented.
 
 **F2 — ENOSPC during a LOG write.** All log writes end in `|| true`
 (`git-commit-lock.sh:561`); a failed log write is silently lost. *Tier 2.*
-**Tested** (per §4.5): unit Test 49 points `AGENT_LOCK_LOG` at a path *under a
+**Tested** (per §4 item 5): unit Test 49 points `AGENT_LOCK_LOG` at a path *under a
 regular file*, so every open/append fails ENOTDIR, and asserts the lock still
 acquires + releases cleanly (rc 0), the wrapped command runs, the lock is cleaned
 up, and no log file appears — i.e. the failing log write is swallowed and the lock
@@ -509,7 +509,7 @@ behaviour is the same as F1, which is tested.
 best-effort `mkdir -p "$(dirname …)"` (`git-commit-lock.sh:1278`); if the dir is
 unwritable the create fails every poll and the waiter times out at 97. No
 corruption, no false hold. A *release* unlink blocked by an unwritable parent
-routes to the LEFTOVER lane (`:1699-1711`). *Tier 2.* **Tested** (per §4.5 — the
+routes to the LEFTOVER lane (`:1699-1711`). *Tier 2.* **Tested** (per §4 item 5 — the
 highest-value one): unit Test 48 `chmod 0555`s the lock-dir parent and asserts the
 waiter times out at 97, the wrapped command never runs, no lock file is created,
 and the WAITING/TIMEOUT lines are logged — no corruption, no false hold. POSIX-only
@@ -651,7 +651,7 @@ the lock. Under a redirected git dir, log *content* (the owner line) is
 attacker-influenceable — one-line text spoofing, no execution; the tool itself
 writes only its token, owner line, and protocol events, never secrets
 (`docs/git-commit-lock.md:543-551`). *Tier 2.* **Tested — covered by the F2
-log-failure test (per §4.5): unit Test 49** proves a failing log path leaves the
+log-failure test (per §4 item 5): unit Test 49** proves a failing log path leaves the
 lock fully working. Logging is best-effort by design, which is the right call for a
 lock that must keep working when the disk is full or the log path is bad. The
 follow-on (unchanged): don't build automation that *trusts* log text from an
@@ -678,7 +678,7 @@ flakes are real gaps vs harness concerns.
   one timing-sensitive input (mtime, and transient empty reads) cannot corrupt a
   correctness decision: a sub-floor or unsettled reading is treated as "wait,"
   never "steal." A 25-worker round can go 3s → 41s under load
-  (`agents/600-claude.md` observation) and *still* lose no update.
+  and *still* lose no update.
 
 - **Load-dependent (Tier 2, best-effort in an envelope):** every wall-clock bound.
   - **Recovery latency** ≈ STALE (+ CLAIM_STALE if a claimant also crashed) +
diff --git a/docs/guarantees.md b/docs/guarantees.md
index f5194bc..1fa9595 100644
--- a/docs/guarantees.md
+++ b/docs/guarantees.md
@@ -128,7 +128,7 @@ bug.
   (`failure-modes.md` §F1 — an accepted residual). *Witness:* the read-back-failure lanes —
   create-path Test 32, steal-path Test 32b (`U:1760-1855`); resource lanes —
   unwritable lock dir Test 48 (F4), ENOSPC Test 50 (F1, Linux+sudo; skip-with-note
-  elsewhere) (`failure-modes.md` §4.5); FD/inode exhaustion (F3) is document-only
+  elsewhere) (`failure-modes.md` §4 item 5); FD/inode exhaustion (F3) is document-only
   (no portable injection). *Basis:* §1, §A1, §F.
 
 - **G-S3 — Strict mutual exclusion within the staleness window, with no
@@ -251,9 +251,9 @@ BE-4). Logging is best-effort by design; correctness is not.
 
 These hold under normal conditions and degrade *gracefully and detectably* under
 pathological scheduling or host-health failures. **Correctness (§2) is preserved
-throughout; only liveness/latency degrades.** This tier is the reference Bucket 4
-scopes the suite's wall-clock test assertions against (the strict/envelope test
-split, `failure-modes.md` §4.1 / D-c).
+throughout; only liveness/latency degrades.** This tier is what the suite's
+wall-clock test assertions are scoped against (the strict/envelope test split; see
+`failure-modes.md` §K and §4 item 1).
 
 - **BE-1 — Wall-clock latency bounds are in poll-count, not seconds.** Recovery
   latency (≈ `STALE` + poll cadence), the `MAX_WAIT` timeout, and the ~1.26s
@@ -263,7 +263,7 @@ split, `failure-modes.md` §4.1 / D-c).
   poll-count number (Test 21's ≤20s, Test 22a's warning timing, Test 29's ≥2-CLAIM
   count) assert an *envelope* bound, not a correctness bound, and may be relaxed or
   gated to a defined load level (`GCL_ENVELOPE_TIER=relax`) without any product
-  change. *Basis:* `failure-modes.md` §K, §4.1.
+  change. *Basis:* `failure-modes.md` §K and §4 item 1.
 
 - **BE-2 — Diagnostic warnings are best-effort.** The wrong-type config warning
   and the claim-path warning rely on poll headroom that an oversubscribed runner
@@ -398,7 +398,7 @@ Each guarantee → its witnessing test(s) and the failure-modes section. `U` =
 `C` = `tests/git-commit-lock.canary.test.sh` (the concurrency canary), `integ` =
 `tests/git-commit-lock.integration.test.sh`. The former resource-exhaustion and
 diagnostic-clock coverage gaps are now closed by the fault-injection tests
-(`failure-modes.md` §4.5): F4 (Test 48), F2/J1 (Test 49), F1 (Test 50), and the
+(`failure-modes.md` §4 item 5): F4 (Test 48), F2/J1 (Test 49), F1 (Test 50), and the
 unreadable-mtime fail-safe (Test 42). The one remaining document-only lane is F3
 (FD/inode exhaustion), which has no portable deterministic injection.
 
diff --git a/tests/_harness.sh b/tests/_harness.sh
index 71eed3a..7529cca 100644
--- a/tests/_harness.sh
+++ b/tests/_harness.sh
@@ -36,7 +36,7 @@ PASS=0; FAIL=0; TAPN=0; DONE=0; SECTIONS_RUN=0
 GCL_TAP="${GCL_TAP:-0}"           # CI sets GCL_TAP=1 for machine-readable TAP13 output
 GCL_TEST_ONLY="${GCL_TEST_ONLY:-}"  # if set, run ONLY test blocks whose label REGEX-matches (single-test selector)
 
-# Axis-A waiter-count sweep (Bucket 6). GCL_TEST_SWEEP=1 (nightly/deep CI) widens
+# Axis-A waiter-count sweep (see load-testing-strategy.md). GCL_TEST_SWEEP=1 (nightly/deep CI) widens
 # the fan-out/contention tests over several waiter counts to wring more coverage
 # from the existing tests; unset/0 (per-PR default + plain dev) keeps the floor so
 # default runs are byte-identical to today. T_AXIS_A is the shared waiter-count
diff --git a/tests/git-commit-lock.interop.test.sh b/tests/git-commit-lock.interop.test.sh
index 0244d1a..bfb0e44 100644
--- a/tests/git-commit-lock.interop.test.sh
+++ b/tests/git-commit-lock.interop.test.sh
@@ -862,7 +862,7 @@ if section "Test 16: crash recovery under CONTENTION, mixed impls — claim-seri
 # lock), so the run is discarded and retried (bounded) instead of failing
 # assertions the protocol never violated.
 #
-# Waiter count is swept over $T_AXIS_A (Bucket 6): one iteration at N=4 by
+# Waiter count is swept over $T_AXIS_A (see load-testing-strategy.md): one iteration at N=4 by
 # default (2 bash + 2 pwsh — byte-identical to today) and at N=4,12,24 under
 # GCL_TEST_SWEEP=1. N is split into a bash half (N/2) and a pwsh half (the
 # remainder); at N=4 that is 2+2 exactly. The correctness invariants stay strict
diff --git a/tests/git-commit-lock.test.sh b/tests/git-commit-lock.test.sh
index c722886..3a41419 100755
--- a/tests/git-commit-lock.test.sh
+++ b/tests/git-commit-lock.test.sh
@@ -70,7 +70,7 @@ cleanup() {
 # above and fails loudly if the suite died before setting DONE=1.
 trap finish EXIT
 
-# Envelope-tier assertions (Bucket 4 / decision D-c). A wall-clock or poll-count
+# Envelope-tier assertions (see failure-modes.md §K / §4 item 1). A wall-clock or poll-count
 # bound is a Tier-2 (best-effort latency) property, NOT a correctness one (see
 # guarantees.md BE-1). In the default 'strict' tier these behave exactly like
 # ok/bad. Under GCL_ENVELOPE_TIER=relax (nightly/deep stress runs) an envelope FAIL
@@ -160,7 +160,7 @@ if section "Test 2b: crash recovery under CONTENTION — claim-serialized: zero
 # otherwise discarded and retried (bounded), instead of failing assertions
 # the protocol never violated.
 #
-# Waiter count is swept over $T_AXIS_A (Bucket 6): one iteration at N=4 by
+# Waiter count is swept over $T_AXIS_A (see load-testing-strategy.md): one iteration at N=4 by
 # default (byte-identical to today) and at N=4,12,24 under GCL_TEST_SWEEP=1.
 # Every sweep iteration's assertions carry an " at N=<count>" tag so a sweep
 # failure says which N broke; that tag is SUPPRESSED in the default (non-sweep)
@@ -1090,7 +1090,7 @@ if section "Test 20: claim contention — N concurrent stealers, ONE claim winne
 # STALE must scale with N too (see t20_stale below), keeping "exactly one
 # steal" a strict, config-independent correctness invariant at every N.
 #
-# Waiter count is swept (Bucket 6). Unlike Test 2b/16, this test's floor is NOT
+# Waiter count is swept (see load-testing-strategy.md). Unlike Test 2b/16, this test's floor is NOT
 # 4 — it is the MODE-driven $T20_N (5 REDUCED / 10 FULL), the count CI already
 # stresses. So instead of iterating the shared T_AXIS_A ("4 ...") it builds its
 # own list: just $T20_N by default (byte-identical), and $T20_N plus the sweep's
@@ -3069,7 +3069,7 @@ fi
 
 
 if section "Test 48: unwritable lock dir -> clean 97, command never runs, no false hold (F4)"; then
-# F4 (failure-modes.md §4.5): a read-only / unwritable lock-dir parent makes the
+# F4 (failure-modes.md §4 item 5): a read-only / unwritable lock-dir parent makes the
 # O_EXCL create fail every poll, so the waiter times out at 97 — no corruption, no
 # false hold, and the wrapped command never runs. POSIX-only: chmod 0555 is a no-op
 # for writes on Git-Bash/NTFS (the create would wrongly succeed), so skip-with-note
@@ -3099,7 +3099,7 @@ esac
 fi
 
 if section "Test 49: failing log path -> lock still works, the log write is swallowed (F2/J1)"; then
-# F2/J1 (failure-modes.md §4.5): logging is best-effort (every write ends || true).
+# F2/J1 (failure-modes.md §4 item 5): logging is best-effort (every write ends || true).
 # Point AGENT_LOCK_LOG under a REGULAR FILE so every append/open fails ENOTDIR — the
 # lock must still acquire+release cleanly (rc 0) with the log write swallowed.
 # Portable (no chmod/perms). NOTE: bash's redirection-OPEN failure leaks to stderr
@@ -3122,7 +3122,7 @@ rm -f "$T49P" "$WORK/t49.lock"
 fi
 
 if section "Test 50: ENOSPC on lock create/write -> wait then 97, no false hold (F1)"; then
-# F1 (failure-modes.md §4.5): a full filesystem makes the create's write fail
+# F1 (failure-modes.md §4 item 5): a full filesystem makes the create's write fail
 # (ENOSPC); the created-but-write-failed file is an empty orphan and the waiter
 # times out at 97 — no corruption, no false hold. Real ENOSPC needs a full FS, which
 # needs root (a small tmpfs); `ulimit -f` is NOT usable (it raises SIGXFSZ and kills

From 1351edb4714a729864ad1e292cbb6a61d142df18 Mon Sep 17 00:00:00 2001
From: Ben Toner <bentoner@bentoner.com>
Date: Thu, 18 Jun 2026 13:41:45 +1000
Subject: [PATCH 74/76] =?UTF-8?q?Phase-4=20round-5:=20socket/device=20witn?=
 =?UTF-8?q?ess,=20dangling=20=C2=A7E=20ref,=20four=20suites,=20triage=20he?=
 =?UTF-8?q?ader?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Round-5 docs re-review (Codex), 4 findings, all verified against the code:
1. G-S4 / D3 claim socket+device coverage but their witness lists cited only Tests
   17/17d/18/22 — added Test 44 (the socket & device-node arms of the wrong-type
   classifier) to the G-S4 prose witness, the §7 verification-map row, and the D3
   table row + detail.
2. load-testing-strategy.md cited `guarantees.md §E`, which does not exist
   (guarantees.md is §1-§7); the correctness-rests-on-structure point is §2A. Fixed.
3. failure-modes.md "Sources of truth" listed only three test suites — added the
   canary suite (the doc later uses `C Test 1` as a witness).
4. nightly-triage.sh header was stale: it claimed a CONCLUSIONS JSON file (the code
   reads per-cell cell-conclusion.txt) and that a `failure` conclusion is
   correctness (the code classifies failure-without-^FAIL: as infra). Header now
   matches the implementation.

Validated: shellcheck + bash -n (nightly-triage.sh). Docs are markdown.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 .github/scripts/nightly-triage.sh | 23 ++++++++++++-----------
 docs/failure-modes.md             |  8 +++++---
 docs/guarantees.md                |  3 ++-
 docs/load-testing-strategy.md     |  2 +-
 4 files changed, 20 insertions(+), 16 deletions(-)

diff --git a/.github/scripts/nightly-triage.sh b/.github/scripts/nightly-triage.sh
index f0f8dd4..b7e94db 100644
--- a/.github/scripts/nightly-triage.sh
+++ b/.github/scripts/nightly-triage.sh
@@ -4,14 +4,15 @@
 #
 # Invoked by the `triage` job in .github/workflows/nightly.yml AFTER it has
 # downloaded every matrix cell's `test-output/` artifact (each into a directory
-# named `nightly-logs-<cell-id>/`) and written the per-cell job conclusions to a
-# JSON file. It reads only files on disk + `gh`; it makes no test decisions of its
-# own beyond parsing the preserved logs.
+# named `nightly-logs-<cell-id>/`, each carrying that cell's own
+# `cell-conclusion.txt`). It reads only files on disk + `gh`; it makes no test
+# decisions of its own beyond parsing the preserved logs.
 #
 # CLASSIFICATION:
-#   correctness  — any `^FAIL:` line in a suite log, OR a cell job concluded
-#                  `failure`. Files/append a `nightly-correctness` issue. The one
-#                  class that demands investigation.
+#   correctness  — any `^FAIL:` line in a suite log (a genuine assertion failure).
+#                  Files/append a `nightly-correctness` issue. The one class that
+#                  demands investigation. (A job that concluded `failure`/timed out
+#                  WITHOUT a `^FAIL:` line is infra, not correctness — see below.)
 #   envelope     — no FAIL anywhere, but at least one `WARN[env-relaxed]` line in a
 #                  log of a cell that *succeeded*. Tracked (`nightly-envelope`); the
 #                  three wall-clock envelope assertions stretched under load — by
@@ -33,11 +34,11 @@
 # Inputs (environment):
 #   ARTIFACTS_DIR   dir holding the downloaded per-cell artifact directories
 #                   (default: ./artifacts). Each cell dir is `nightly-logs-<id>/`.
-#   CONCLUSIONS     path to a JSON object { "<cell-id>": "<conclusion>", ... } of
-#                   each matrix cell job's `result` (success|failure|cancelled|
-#                   skipped). Read from `<cell-dir>/cell-conclusion.txt`, which each
-#                   stress cell writes (always()) into its own artifact — so the
-#                   conclusion is ground truth PER CELL, never a matrix aggregate.
+#   (Per-cell job conclusions are read from FILES, not env: each stress cell writes
+#                   its own `result` — success|failure|cancelled|skipped — to
+#                   `<cell-dir>/cell-conclusion.txt` under always(), and the script
+#                   reads that file directly. Ground truth PER CELL, never a matrix
+#                   aggregate.)
 #   EXPECTED_CELLS  space-separated list of cell ids that were supposed to run
 #                   (default: the six N1..N6 ids). Lets the empty-round / missing-
 #                   artifact guard know what to expect.
diff --git a/docs/failure-modes.md b/docs/failure-modes.md
index adc7eff..e82810c 100644
--- a/docs/failure-modes.md
+++ b/docs/failure-modes.md
@@ -8,7 +8,8 @@ we guarantee this" or "no, out of scope."
 
 **Sources of truth, in order:** the product code
 (`git-commit-lock.sh`, `git-commit-lock.ps1`) and the test suites
-(`tests/git-commit-lock.test.sh`, `tests/git-commit-lock.interop.test.sh`,
+(`tests/git-commit-lock.test.sh`, `tests/git-commit-lock.canary.test.sh`,
+`tests/git-commit-lock.interop.test.sh`,
 `tests/git-commit-lock.integration.test.sh`). Every claim below cites
 `file:line`. The narrative docs (`README.md`, `docs/git-commit-lock.md`) and
 the implementation header comments are corroborating, not authoritative — where
@@ -121,7 +122,7 @@ robust-by-code-but-unverified · S static/grep check · (plat) platform-gated.
 | C4 | Leaked claim (unverifiable unlink) | Leaked-token memory keeps ownership discoverable | 1 | ✓ U:1549-1758, U:2013-2164 | **In scope.** Keep. |
 | D1 | Atomic rename-over (steal install) | `mv -T` / `File.Move(...,true)` / 5.1 unlink+move | 1 (local FS) | ✓ U:212-346, I:16d S:1141 | **In scope on local FS.** Boundary = D-axis. |
 | D2 | O_EXCL atomic create | `set -C` redirect / `FileMode.CreateNew` | 1 (local FS) | ✓ throughout | **In scope on local FS.** |
-| D3 | Wrong-type at path (dir/symlink/FIFO/dev/socket) | Never stolen/deleted; loud warn; waiters → 97 | 1 (bash + ps1-on-Win) / 2 (ps1-on-POSIX) | ✓ U:818-892/1156-1262/Test 37 (rename-refused mid-steal), ~(plat) | **In scope.** ps1-on-POSIX residual = accept. |
+| D3 | Wrong-type at path (dir/symlink/FIFO/dev/socket) | Never stolen/deleted; loud warn; waiters → 97 | 1 (bash + ps1-on-Win) / 2 (ps1-on-POSIX) | ✓ U:818-892/1156-1262/Test 37 (rename-refused mid-steal)/Test 44 (socket+device), ~(plat) | **In scope.** ps1-on-POSIX residual = accept. |
 | D4 | Non-lock CONTENT at path (user file) | Never stolen (content guard); warn | 1 | ✓ U:1034-1076 | **In scope.** Two accepted residuals (§D4). |
 | D5 | Case-insensitive FS path collision | Not handled explicitly | 3 | ✗ | **Likely non-issue;** see §D5. Decide. |
 | E1 | Network/shared FS (NFS/SMB/9p/Dropbox) | Outside design guarantees (stated) | 3 | ✗ | **Out of scope** (stated). See §E — decide whether to *enforce*. |
@@ -335,7 +336,8 @@ guards apply to the claim path with independent per-path warn-once state
 noclobber `>` onto a FIFO blocks in `open(2)` before any timeout logic — a hang,
 not a warning. *Tier 1 on bash, and on ps1-on-Windows.* Tested: Test 17
 (dir/symlink/FIFO at lock path), Test 22 (claim path), Test 17d (churn must not
-false-warn) (`U:818-892, 1156-1262, 894-1032`).
+false-warn), and Test 44 (the socket & device-node arms of the same classifier,
+bash; POSIX CI legs) (`U:818-892, 1156-1262, 894-1032`).
 
 > **The one real D3 boundary — ps1 on POSIX (Tier 2, accepted).** The .NET API
 > exposes no portable type bit for FIFO/device/socket on Unix; they stat as size
diff --git a/docs/guarantees.md b/docs/guarantees.md
index 1fa9595..d27aab0 100644
--- a/docs/guarantees.md
+++ b/docs/guarantees.md
@@ -162,6 +162,7 @@ bug.
   *Two accepted residuals* bound this and are documented, not bugs: a stale
   *empty* user file, and a stale file whose line 1 happens to start `tok.`, are
   stolen (`git-commit-lock.sh:298-311`). *Witness:* unit Tests 17/17d/18/22
+  (dir/symlink/FIFO/content) and Test 44 (socket & device-node, bash; POSIX CI)
   (`U:818-892,894-1032,1034-1076,1156-1262`). *Basis:* §D3/§D4/§G1. *Scoped
   exception:* ps1-on-POSIX has no .NET type probe for FIFO/device/socket (§5,
   OOS-4).
@@ -407,7 +408,7 @@ unreadable-mtime fail-safe (Test 42). The one remaining document-only lane is F3
 | G-S1 no silent lost update | U Test 4b + Test 16 (unverifiable lane); I Test 8 (both dirs) | §1, §B5 |
 | G-S2 no corruption / no false hold | U Tests 32/32b (read-back failure); **resource lanes: Test 48 (F4), Test 50 (F1); F3 document-only** | §1, §A1, §F |
 | G-S3 strict exclusion in window + no displacement | C Test 1 (8×25 canary); U Tests 2b/20; I Tests 1/6/16/16b; integ | §A1/§A2/§A3 |
-| G-S4 never destroys non-lock-shaped | U Tests 17/17d/18/22 | §D3/§D4/§G1 |
+| G-S4 never destroys non-lock-shaped | U Tests 17/17d/18/22 (dir/symlink/FIFO) + Test 44 (socket/device) | §D3/§D4/§G1 |
 | G-S5 truthful exit codes | U Tests 7/8/4b/5/16; I run-verdict tests | §1, §H4 |
 | G-R1 lock-shaped orphans reclaimed | U Tests 2/3/21 | §B1/§C1/§C2/§C3 |
 | G-R2 one stuck agent can't wedge | stale-steal + crashed-claimant lanes | §1 |
diff --git a/docs/load-testing-strategy.md b/docs/load-testing-strategy.md
index 459a04c..22ea393 100644
--- a/docs/load-testing-strategy.md
+++ b/docs/load-testing-strategy.md
@@ -14,7 +14,7 @@ guarantees the suites assert against, see `docs/guarantees.md` and
 This is not a throughput-bound system whose correctness degrades under load. Safety
 and exclusion rest on structural primitives — `O_EXCL` create, atomic `rename(2)`,
 per-attempt token discovery — that never consult the clock for a *correctness*
-decision (`guarantees.md` §E, BE-1; `failure-modes.md` §K). No amount of CPU or IO
+decision (`guarantees.md` §2A, BE-1; `failure-modes.md` §K). No amount of CPU or IO
 pressure makes a rename non-atomic or lets two `O_EXCL` creates both win on a local
 filesystem.
 

From f7363e7b1c4676f46e362b507388bf45d2a76a36 Mon Sep 17 00:00:00 2001
From: Ben Toner <bentoner@bentoner.com>
Date: Thu, 18 Jun 2026 13:45:56 +1000
Subject: [PATCH 75/76] Phase-4 round-6: fix stale correctness-issue body in
 nightly-triage.sh
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Round-6 docs re-review (Codex): all named docs clean; one finding — the GENERATED
correctness-issue body still said "a FAIL: assertion and/or a cell job concluded
failure", contradicting the classifier (a failed/timed-out cell WITHOUT ^FAIL: is
infra, not correctness). The correctness body now cites only a ^FAIL: line; the
infra body's list is completed to name the failure-without-FAIL case (same class,
fixed pre-emptively).

Validated: shellcheck + bash -n.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 .github/scripts/nightly-triage.sh | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/.github/scripts/nightly-triage.sh b/.github/scripts/nightly-triage.sh
index b7e94db..0a3964b 100644
--- a/.github/scripts/nightly-triage.sh
+++ b/.github/scripts/nightly-triage.sh
@@ -179,7 +179,7 @@ run_url="${GITHUB_SERVER_URL:-https://github.com}/${GITHUB_REPOSITORY:-}/actions
 filed=0
 
 if [ -n "$correctness_evidence" ]; then
-  body="Nightly stress run on **${RUN_DATE}** has CORRECTNESS failures (a \`FAIL:\` assertion and/or a cell job concluded \`failure\`). **Investigate.**
+  body="Nightly stress run on **${RUN_DATE}** has CORRECTNESS failures (a \`^FAIL:\` assertion line in a suite log). **Investigate.**
 
 $correctness_evidence
 Run: ${run_url}
@@ -190,7 +190,7 @@ Run: ${run_url}
 fi
 
 if [ -n "$infra_evidence" ]; then
-  body="Nightly stress run on **${RUN_DATE}** had INFRA issues (missing artifact / timeout / cancel / errored before suites ran). Not a product failure, but the run did not produce trustworthy results — re-dispatch or investigate the runner.
+  body="Nightly stress run on **${RUN_DATE}** had INFRA issues (missing artifact / timeout / cancel / a cell job that failed or errored WITHOUT any \`^FAIL:\` line). Not a product failure, but the run did not produce trustworthy results — re-dispatch or investigate the runner.
 
 $infra_evidence
 Run: ${run_url}

From 5e46b206b6c1a543f47a2ed1c06f61da4cbd62a9 Mon Sep 17 00:00:00 2001
From: Ben Toner <bentoner@bentoner.com>
Date: Thu, 18 Jun 2026 15:14:03 +1000
Subject: [PATCH 76/76] Merge-prep: remove branch-local agent-workflow
 artifacts (not needed on main)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

ci-stress was the CI flakiness stress-hunt campaign; its agent-workflow scaffolding
is branch-local, not part of the product:
- AGENTS.md — the flake-hunt runbook for this campaign.
- .plans/ (9 files) — the zformal plan + subplans (full history preserved in git).
- .agent/.gitkeep + the /.agent/* .gitignore entries — the per-worktree
  comment-review queue scaffolding.

Kept: the product (git-commit-lock.sh/.ps1), install.sh, README, docs/, the four
test suites + harness/load wrapper, the CI workflows, and the .agent-testing/
ignore (general testing convention).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 .agent/.gitkeep                               |   0
 .gitignore                                    |   5 -
 ...2026-06-16-ci-stress-test17d-flake-plan.md | 186 -------
 ...-ci-stress-guarantees-and-coverage-plan.md | 245 ---------
 ...6-17-ci-stress-interop-test5-flake-plan.md | 119 -----
 .../2026-06-17-ci-stress-phase2-build-plan.md | 477 ------------------
 ...6-06-17-ci-stress-test-f2-coverage-plan.md |  97 ----
 ...2026-06-17-ci-stress-test31a-flake-plan.md | 135 -----
 .../2026-06-18-ci-stress-canary-split-plan.md | 158 ------
 ...2026-06-18-ci-stress-shard-balance-plan.md | 128 -----
 ...06-18-ci-stress-windows-unit-shard-plan.md | 306 -----------
 AGENTS.md                                     | 137 -----
 12 files changed, 1993 deletions(-)
 delete mode 100644 .agent/.gitkeep
 delete mode 100644 .plans/2026-06-16-ci-stress-test17d-flake-plan.md
 delete mode 100644 .plans/2026-06-17-ci-stress-guarantees-and-coverage-plan.md
 delete mode 100644 .plans/2026-06-17-ci-stress-interop-test5-flake-plan.md
 delete mode 100644 .plans/2026-06-17-ci-stress-phase2-build-plan.md
 delete mode 100644 .plans/2026-06-17-ci-stress-test-f2-coverage-plan.md
 delete mode 100644 .plans/2026-06-17-ci-stress-test31a-flake-plan.md
 delete mode 100644 .plans/2026-06-18-ci-stress-canary-split-plan.md
 delete mode 100644 .plans/2026-06-18-ci-stress-shard-balance-plan.md
 delete mode 100644 .plans/2026-06-18-ci-stress-windows-unit-shard-plan.md
 delete mode 100644 AGENTS.md

diff --git a/.agent/.gitkeep b/.agent/.gitkeep
deleted file mode 100644
index e69de29..0000000
diff --git a/.gitignore b/.gitignore
index abf679e..6ab470c 100644
--- a/.gitignore
+++ b/.gitignore
@@ -6,11 +6,6 @@
 .DS_Store
 Thumbs.db
 *.stackdump
-/.agent/review-queue
-/.agent/review-queue.lock
-/.agent/review-queue.lock.*
-/.agent/last-opened
-/.agent/.tmp.*
 
 # Test/CI artifact output (manifests, suite logs); created at runtime, never committed.
 test-output/
diff --git a/.plans/2026-06-16-ci-stress-test17d-flake-plan.md b/.plans/2026-06-16-ci-stress-test17d-flake-plan.md
deleted file mode 100644
index c2f4bb8..0000000
--- a/.plans/2026-06-16-ci-stress-test17d-flake-plan.md
+++ /dev/null
@@ -1,186 +0,0 @@
-# Plan: de-flake Test 17d (`got97 >= 1`) in the unit suite
-
-Status: **DONE** (implemented + reviewed clean by Claude and Codex; local unit suite
-214/0; awaiting CI-stress confirmation toward 50 clean in a row).
-
-## Reviewer notes (add at top; do not renumber)
-Round 1 — fresh Claude reviewer + Codex (both independent), findings verified by me
-against the product code:
-
-1. **[BLOCKING — fixed in plan v2] rc-set `{0,97,98}` is not exhaustive of correct
-   outcomes → must be `{0,1,97,98}`.** Under this churn a clean `true` whose release
-   reads the held lock EMPTY (the churner's create→write window) gets release rc 2,
-   which `lock_run` maps to **rc 1** (`git-commit-lock.sh:1739-1744`). rc 1 is the
-   documented "ownership unverifiable, successful command demoted" outcome — correct,
-   not a defect. Verified. The original `{0,97,98}` was the *same class* of
-   timing-fragile assumption as the bug being fixed. Fixed below.
-2. **[BLOCKING — fixed in plan v2] the `WAITING` canary must not read the SHARED log.**
-   Plan v1 grepped `WAITING` from the single shared `churn.log` (line 916), but the
-   suite itself documents `# per-waiter logs: concurrent appends to one log drop lines`
-   (`tests/git-commit-lock.test.sh:258`) and uses per-waiter logs elsewhere for exactly
-   this reason. A shared-log `WAITING` count can under-count under concurrency and the
-   canary would itself flake. Fixed: give each waiter its OWN `AGENT_LOCK_LOG`
-   (single-writer ⇒ drop-free), count `WAITING` across those, and concatenate them into
-   `churn.log` afterwards so the preserved artifact is unchanged.
-3. **[disposition] Secondary hardenings DROPPED.** Reviewers flagged the
-   start-marker-after-first-cycle and alive-at-reap hardenings as needing care (the
-   alive check can false-fail if the churner's iteration cap is ever hit; both add
-   machinery to a delicate timing path). They are also largely redundant with the
-   drop-free `WAITING>=1` canary, which already proves the churner produced contention.
-   To keep the change minimal and the timing path untouched, v2 drops both. The
-   load-bearing fix is assertions 1-3.
-4. **[non-blocking, adopted] observability buckets** updated to `rc0/rc1/rc97/rc98/other`
-   and emitted unconditionally (pass and fail), so a drift toward an edge is visible.
-
-Round 2 — confirming review (fresh Claude + Codex, both independent): **CONVERGED, ok to
-implement.** Both verified against the product code that the rc-set {0,1,97,98} is
-exhaustive and tight (release rc 2 is remapped to 1, never leaks; acquire exposes only
-0/97; reentrant-1 unreachable from a fresh CLI process), per-waiter `AGENT_LOCK_LOG`
-auto-creates and breaks nothing, and `WAITING>=1` is a sound non-flaky floor. Two
-implementation reminders adopted: (a) `bad` is a function — name the "other" rc bucket
-something else (e.g. `nother`) and an offenders string; (b) avoid `cat … | grep -c`
-(ShellCheck SC2002 fires at the CI style gate). Resolution for (b): rebuild churn.log via
-`cat "$WORK"/t17d.*.log > "$LOG"` (a redirect, not a pipe — no SC2002), then
-`grep -c 'WAITING for lock' "$LOG"` on the single rebuilt file.
-
-## Context
-CI stress test (ci-stress branch, 2026-06-16): 29 identical green runs, then run
-27616343269 failed only on `windows-2025 (unit)` with one assertion in
-`tests/git-commit-lock.test.sh` Test 17d:
-
-```
-PASS: 12 waiters polled through churn with ZERO spurious non-lock warnings
-FAIL: no waiter reached 97 under churn (got97=0/12) — timeout lane bypassed?
-```
-
-Diagnosis (Claude subagent) + independent review (Codex) — both in
-`.agent-testing/failures/27616343269/{DIAGNOSIS.md,codex-diag-review.md}`:
-
-- **Root cause.** The Windows pwsh churner (`tests/git-commit-lock.test.sh:925-931`)
-  does `WriteAllText → Delete` with **no present-hold**, unlike the POSIX perl churner
-  which sleeps 2ms present each iteration (`:944-947`). On the loaded 2-core
-  windows-2025 VM, per-iteration pwsh/.NET overhead widened the *absent*
-  (Delete→next-Write) window past the 20ms poll interval, so all 12 waiters won an
-  ordinary `O_EXCL` create-race in an absent window (`git-commit-lock.sh:1323-1356`)
-  and exited rc 0 — none reached the `MAX_WAIT=2` timeout, so `got97=0`. Proof: every
-  waiter in `churn.log` carries its **own** `tok.<pid>...` token (not the churner's
-  `tok.churn.1.1`) and there are no steal/TIMEOUT lines; the leg ran 17d in 4.4s
-  (too short for twelve 2s timeouts).
-- **Classification: test-flake, not a product bug.** Acquiring during a genuinely
-  absent window is correct behavior. `got97 >= 1` is a *self-validation* guard (was
-  the timeout lane exercised?), not a product requirement. In this test shape rc ∈
-  {0 (create-win), 97 (timeout), 98 (churner overwrote the hold before release —
-  designed theft detection; present in this run, waiter 36836 / `t17d.3.3.err`)} are
-  **all** correct outcomes. Which one occurs is machine-speed luck.
-
-The real regression Test 17d guards — `warn17d == 0`, the per-poll non-lock-warning
-TOCTOU guard — PASSED and is untouched by this plan.
-
-## Goal
-Make Test 17d non-flaky across fast and slow runners **without weakening the
-`warn17d == 0` regression guard**, while keeping a real anti-vacuous-pass canary so a
-dead/absent churner can't let the test pass without exercising the guarded poll path.
-
-## Fix (v2) — replaces the single `got97 >= 1` assertion; keeps everything else
-**Structural A — per-waiter lock logs (drop-free).** Today all 12 waiters share
-`AGENT_LOCK_LOG="$LOG"` (`$LOG=churn.log`, line 916). Change each waiter to its OWN log
-`AGENT_LOCK_LOG="$WORK/t17d.$r.$i.log"` (the churner writes only the lock *file*, never
-the log, so per-waiter logs lose nothing). After the 3 rounds,
-`cat "$WORK"/t17d.*.log > "$LOG"` to rebuild the consolidated `churn.log` artifact.
-`warn17d` is unaffected — it greps the per-waiter `.err` STDERR files, not the log.
-
-Then replace the `got97` accumulation + its assertion with three assertions:
-
-1. **Regression guard — unchanged.** `warn17d == 0` ("12 waiters polled through churn
-   with ZERO spurious non-lock warnings"). Keep verbatim.
-
-2. **Every waiter reaches a designed terminal state.** Accumulate each waiter's rc;
-   require all 12 ∈ **{0, 1, 97, 98}**. For `bash -c 'true'` under this churn: `0`
-   acquired+clean release; `1` acquired but release read the held lock EMPTY (churner's
-   create→write window) ⇒ release rc 2 ⇒ `lock_run` demotes the clean command to 1
-   (`git-commit-lock.sh:1739-1744`), ownership-unverifiable/correct; `97` timed out;
-   `98` churner overwrote the hold before release (designed theft detection). Any OTHER
-   rc (crash/139, 96 config error, 99, …) ⇒ `bad`, listing the offending `round.idx=rc`.
-   Stricter than the old test (which ignored every rc but 97) and is the real new
-   product-regression check. Comment must name why rc 1 is correct so a successor does
-   not "tighten" the set back and re-introduce the flake.
-
-3. **Anti-vacuity: contention actually happened (the guarded path ran).** Require
-   `cat "$WORK"/t17d.*.log | grep -c 'WAITING for lock' >= 1` (counted from the
-   single-writer per-waiter logs ⇒ drop-free; see reviewer note 2). `WAITING` is logged **only** after a
-   waiter's create was blocked by a present file (`git-commit-lock.sh:1363-1370`),
-   immediately before the per-poll type-guard loop (`:1388-1570`) that `warn17d`
-   guards — so ≥1 `WAITING` proves at least one waiter entered the exact path under
-   test. A dead/absent-only churner produces 0 `WAITING` and fails this. Threshold is
-   **≥1** (the weakest non-vacuous signal) to stay robust on absent-dominant runners;
-   the failing run already had 9 `WAITING` lines, so ≥1 has wide margin both ways.
-
-### Why ≥1 WAITING is robust (not a new flake)
-`WAITING` count is machine-dependent in the *opposite* direction to `got97`: a
-present-dominant (fast) runner blocks most waiters (lots of WAITING, got97 high); an
-absent-dominant (slow) runner lets waiters acquire (fewer WAITING, got97 low) — but
-even the worst observed case (this failure) still logged 9 WAITING. The only way to
-get 0 WAITING is no contention at all (churner never ran / always absent), which is
-exactly the vacuity we want to fail on. So ≥1 has margin on both ends; no threshold
-near the machine-variance band is introduced.
-
-### Secondary hardening — DROPPED (reviewer note 3)
-v1 proposed two extra hardenings (move the start-marker after the churner's first
-write+delete cycle; assert the churner is alive at reap). Both are dropped in v2: they
-add machinery to a delicate timing path, the alive-check can false-fail if the churner's
-iteration cap is ever hit, and both are largely redundant with the drop-free
-`WAITING>=1` canary (which already proves the churner produced real contention — a
-waiter can only log `WAITING` if the churner had the lock file present). The
-load-bearing fix is the per-waiter logs + assertions 1-3.
-
-## Observability (per logging practice)
-Keep the data that made this diagnosable: emit a `note:` line with the rc distribution
-and the WAITING count **unconditionally** (both pass and fail paths), e.g.
-`note: T17d outcomes rc0=$n0 rc1=$n1 rc97=$n97 rc98=$n98 other=$nother; WAITING=$waited`
-— so a future failure (or a pass drifting toward an edge) can be classified from the
-suite log without re-deriving it. (The old test discarded this.)
-
-## Out of scope / explicitly NOT changed
-- The `warn17d`/TOCTOU regression logic and its assertion.
-- The churner shapes' core (pwsh on Windows, perl elsewhere) — unchanged in v2.
-- Product code (`git-commit-lock.sh`) — no product defect found.
-- The `.ps1` port and other suites — Test 17d is bash-unit-only.
-
-## Testing
-1. **Static:** `bash -n tests/git-commit-lock.test.sh`; shellcheck `-S style` (the CI
-   lint gate) on the test file — must stay clean.
-2. **Local sanity (Windows, this box):** run Test 17d in isolation a handful of times via
-   the suite's single-test selector if present, else the whole unit suite once, in
-   `.agent-testing/` — confirm it passes and the new `note:` line shows a sane rc/WAITING
-   mix. (Local box is faster/less loaded, so it will likely be present-dominant — expect
-   high got97; that's fine, the test no longer asserts on it.)
-3. **Real proof = CI stress.** The genuine signal is the GitHub windows-2025 (unit) leg
-   under load. After implementing, resume the stress driver (streak reset to 0) and
-   require the previously-flaky path to survive the run to 50 clean. If 17d flakes again
-   we re-open.
-
-## Rollout
-Commit the test fix to `ci-stress` (under the git commit lock). This is a normal,
-mergeable fix (unlike the stress-only concurrency commit 980856b). Reset
-`clean_count`, relaunch the driver, continue toward 50 clean in a row.
-
-## Changelog (implementation)
-- Implemented exactly the Fix v2 design in `tests/git-commit-lock.test.sh` Test 17d
-  (the `if wait_for_file "$START" 60` block): per-waiter `AGENT_LOCK_LOG`, rc `case`
-  bucketing into `n0/n1/n97/n98/nother` + `rc_bad` offender list, `cat glob > "$LOG"`
-  rebuild, `grep -c 'WAITING for lock' "$LOG"` count, unconditional `note:` line, and
-  the three assertions (warn17d==0 kept verbatim; rc∈{0,1,97,98}; WAITING>=1). Removed
-  `got97`. No product code or other test touched.
-- Static: `bash -n` clean; `shellcheck -S style` v0.11.0 (the CI-pinned gate version)
-  clean.
-- Local run (Windows, this box, REDUCED fan-out — Test 17d is not fan-out-gated so it
-  runs identically): full unit suite **214 passed / 0 failed**. Test 17d emitted
-  `note: T17d outcomes rc0=0 rc1=0 rc97=12 rc98=0 other=0; WAITING=12` and all three
-  assertions PASS. (Idle box ⇒ present-dominant ⇒ all 12 timed out at 97 — the opposite
-  extreme to the CI failure's rc0-heavy distribution; both now accepted.)
-- Implementation review: fresh Claude reviewer — "IMPLEMENTATION OK" (confirmed
-  set -uo pipefail / no errexit so `grep -c` exit-1 is harmless; empty-glob rebuild
-  handled; no `bad`/`rc_bad` collision; `warn17d` guard intact). Codex
-  `exec review --uncommitted` — no blocking bug. Both in `.agent-testing/`.
-- Real proof pending: the windows-2025 (unit) leg under CI load. Resuming the stress
-  driver with the streak reset to 0.
diff --git a/.plans/2026-06-17-ci-stress-guarantees-and-coverage-plan.md b/.plans/2026-06-17-ci-stress-guarantees-and-coverage-plan.md
deleted file mode 100644
index 523118a..0000000
--- a/.plans/2026-06-17-ci-stress-guarantees-and-coverage-plan.md
+++ /dev/null
@@ -1,245 +0,0 @@
-# Plan proposal: guarantees spec + close the failure-modes follow-ups
-
-Status: **PROPOSAL — awaiting Ben's review.** No implementation until approved.
-This is the action list + proposed workflow Ben asked for after the `/c` pass on
-`docs/failure-modes.md` (his comments converged at commit a5df9d9; recorded 534a0073).
-
-## Where this comes from
-`docs/failure-modes.md` is the **analysis / decision-support** doc (current behavior,
-3-tier classification, recommendations). Ben has now decided on its §4 (agree, with two
-overrides). The follow-ups below turn those decisions into work, and add the new doc Ben
-asked for: a **normative spec** ("what we guarantee / what's out of scope") — distinct from
-the analysis doc.
-
-## Action list (requirements / things to do)
-
-### Bucket 1 — NEW normative guarantees spec (Ben's explicit ask)
-- **A1.** Create a normative spec doc — *what the tool guarantees* and *what is out of
-  scope* — derived from `failure-modes.md`'s tiers but written as a contract, not analysis.
-  - Guarantees: the Tier-1 **safety** properties (no silent lost update given cooperative
-    unwind; strict mutual exclusion within the staleness window; no corruption) and the
-    Tier-1 **recovery** properties (lock-shaped orphans reclaimed), each with their stated
-    conditions/envelope.
-  - Out of scope: network/shared FS, multi-host/clock-skew, mixed-version trees,
-    ps1-on-POSIX, the non-unwinding-exit boundary (§H4) — the documented boundaries.
-  - Defines the **operating envelope** precisely (the load/timing envelope from §4.1) — the
-    reference Bucket 4 scopes tests against.
-  - *Open decision D-a:* location/name — `docs/guarantees.md` (new), or a normative section
-    inside `docs/git-commit-lock.md`? (Recommend a dedicated `docs/guarantees.md` — a crisp
-    contract is easier to point users/CI at than a section.)
-
-### Bucket 2 — Test coverage for the untested-but-robust lanes (§4.5, Ben's override)
-Decision (Ben): tested edge cases > reasoned-correct-but-untested. Add deterministic,
-**portable**, fault-injection tests; flag any lane that can't be injected portably rather
-than shipping a flake. **All test execution via CI** (local runs are banned — they lag
-Ben's box).
-- **B-F4.** Unwritable lock dir/parent → clean 97 (cheapest, highest-value; `chmod`).
-- **B-F2/J1.** Unwritable / failing log path → lock still works, the log write is swallowed.
-- **B-F1.** ENOSPC during claim/lock create+write (small dedicated tmpfs or quota).
-- **B-F3.** FD exhaustion via `ulimit -n` (portable); inode exhaustion only if cleanly
-  injectable.
-- **B-E3 (candidate).** mtime probe unreadable → staleness-detection-disabled, fail-safe
-  (no steal), 97 + the once-per-process warning. (Also a ○ untested lane; fits the same
-  decision — include unless Ben says skip.)
-- *Open decision D-b:* scope — just the §4.5 set (F1-F4, J1) + E3, or also fold in the two
-  **deferred F2-audit gaps**: #7 wrong-type object appearing *at the lock path mid-steal*
-  (A2/G2 — `CLAIM-ABORT (wrong-type)`/`(rename-refused)`), and #8 the Windows-only
-  blocked-unlink legs? (Recommend: do F4/F2/J1/F3 now; treat F1-ENOSPC, E3, and #7/#8 as a
-  second tier to confirm.)
-- Platform reality: several lanes are POSIX-only (tmpfs, `ulimit`, chmod semantics) — guard
-  by platform like the existing suite does; Windows-specific lanes (no-delete-share) already
-  have their own gated tests.
-
-### Bucket 3 — Documentation gaps (all "document" decisions: §4.1-4.3, §4.6, §I2)
-- **C-envelope (§4.1).** Document the load/timing envelope in `docs/git-commit-lock.md`:
-  "correctness is load-independent; wall-clock bounds (recovery latency, MAX_WAIT, the read
-  ladder) are best-effort and scale with scheduling."
-- **C-clock (§4.2).** One sentence: the tool assumes a single time source (single-host, or a
-  shared FS with one server clock); a local clock jump is correctness-safe.
-- **C-netfs (§4.3).** Surface the network/shared-FS boundary in `README.md` (document-only,
-  **no** FS-type probe).
-- **C-mixedver (§I2).** Add the "upgrade both implementations together" note to `README.md`
-  (currently design-doc-only).
-- **C-misc (§4.6, optional).** One-line each for mixed-version + case-insensitive FS in the
-  design doc.
-
-### Bucket 4 — Scope the wall-clock test bounds (§4.1 — the Test 21/22a resolution)
-- **S1.** Relax / scope the wall-clock assertions that flake only under extreme artificial
-  load — **Test 21** (≤20s recovery), **Test 22a** (claim-warning timing), **Test 29**
-  (≥2-CLAIM poll count) — to the envelope Bucket 1 defines, so the protocol's correctness
-  assertions in those tests stay strict while the latency/poll-count bounds get headroom (or
-  are gated to a defined load level). *Depends on Bucket 1's envelope.*
-- *Open decision D-c:* relax the numbers in place, or split the suite into a
-  "correctness" tier (always strict) and a "latency/envelope" tier the extreme-stress runs
-  don't hard-fail on? (Recommend the latter — it makes the envelope explicit and stops
-  future stress runs re-raising these as "flakes".)
-
-### Bucket 5 — Merge-to-`main` strategy (**D-d REOPENED 2026-06-18**)
-Ben reopened this: cherry-picking may not be the best path — "tidying up and preserving
-history" is a live alternative. **Git facts (verified 2026-06-18):**
-- **`main` has not diverged** — `merge-base(main, ci-stress) == main HEAD (fa43f30)`. So
-  ci-stress is strictly **34 commits ahead**, and a cleaned-up branch can **fast-forward**
-  onto `main` (no merge commit).
-- The 34 commits are a mix: genuine product/test/doc work; **pure stress-only scaffolding**
-  (`980856b` concurrency tweak; `b430d73`'s `tests.yml` load-wiring + raised timeouts — *but
-  `b430d73` also adds `tests/with-load.sh`, which graduates*, so it is a **mixed** commit);
-  intermediate **plan / AGENTS.md churn**; and the **`/c` commit+revert pairs**
-  (`534a007` → `959cca9` → `a5df9d9`).
-- **Bucket 6 itself rewrites the CI workflows** (3 new files) and reverts the stress wiring.
-  So after Bucket 6 lands, **ci-stress's final *tree* is already main-worthy** — the
-  stress-only commits are a *history* concern, not a tree concern. **The decision is therefore
-  mostly about what history `main` should carry, not about keeping bad code out of the tree.**
-
-Options:
-- **(A) Cherry-pick a curated subset** onto `main` (the prior plan). Surgical, but ~20
-  interdependent picks (later commits edit the same test file repeatedly → conflict-prone),
-  new SHAs disconnected from the branch, and `b430d73` must be split by hand. Drops the
-  review/decision narrative.
-- **(B) Tidy-rebase `ci-stress`, then `--ff-only` merge** ("tidy up + preserve history").
-  Interactively rewrite the branch: squash the `/c` commit+revert pairs and the intermediate
-  plan/changelog churn into their content commits, excise the pure scaffolding (or rely on
-  Bucket 6 having already removed the wiring from the tree), curate messages; then `git -C
-  <main> merge ci-stress --ff-only` lands a clean linear history in one operation. Keeps a
-  curated narrative; **rewrites history** — gotcha: `rebase.updateRefs=true` moves any branch
-  pointing into the range, so back up with a **raw SHA/tag, never a branch**.
-- **(C) Squash-merge** to one (or a few) curated commit(s). Cleanest `main` log, trivially
-  excludes scaffolding (final tree only), but discards all granular history.
-
-*Recommendation:* **(B)** — enabled cleanly by `main` not having diverged; gives a
-curated-but-real history (which (C) discards and (A) reconstructs laboriously) and matches
-"tidy up and preserve."
-
-**RESOLVED (Ben, 2026-06-18): (B) — a *mild* tidy-up, then merge via a GitHub pull request**
-(ci-stress → main), **not** a local ff-merge. Refinements:
-- **Extent of tidy-up is Ben's call.** Keep it mild. Before any history rewrite, propose the
-  specific tidy (candidates: drop the pure scaffolding commits `980856b` + `b430d73`'s
-  required-job wiring; squash the obvious `/c` commit+revert noise `534a007`→`959cca9`→
-  `a5df9d9`; leave the rest) and get Ben's sign-off on the extent — do not decide it autonomously.
-- **Merge via a GitHub PR**, so the PR's CI is the gate and the merge is reviewable. `main`
-  has not diverged, so the PR stays clean.
-- Still the **last** step of Phase 3/4; not a blocker for the harness/CI work.
-
-### Bucket 6 — Principled load-&-matrix testing STRATEGY (Ben "f", 2026-06-17) — RECOMMENDATION DOC, not code
-The current load injection (`tests/with-load.sh`: N CPU spin-loops + N disk write/fsync/delete
-loops) was thrown together from a few lines of discussion. Ben wants a **considered,
-first-principles rethink** — explicitly **not anchored on the existing approach** — whose
-**deliverable is a recommendation doc for Ben, NOT an implementation.** Scope:
-- **Is the load injection right?** From first principles: which KINDS of load actually stress
-  *this* tool's timing-critical windows (claim→rename, read-back, discovery, mtime/staleness,
-  fsync durability, scheduler preemption at critical points)? Are CPU-spin + disk-fsync the
-  right proxies, or are better mechanisms warranted (cgroup CPU throttling, `taskset`/`nice`,
-  `ionice`, `stress-ng` stressors, FUSE/FS-latency injection, memory pressure)? Faithfulness,
-  reproducibility, and calibration (load relative to runner core count).
-- **Expand the CI matrix** on free public GitHub runners: run the suite across
-  {OS} × {load level} × {load kind} × {config} in parallel. How many cells is *considered* vs
-  *blowing it up* — diminishing returns, signal-per-cell, GitHub concurrency limits, a small
-  per-PR tier vs a larger nightly tier.
-- **Get more from EXISTING tests, routinely:** parametrize the fan-out/timing tests across
-  waiter counts and knob values (STALE / CLAIM_STALE / POLL / MAX_WAIT) so each run exercises
-  more surface — without adding flakiness. Which tests benefit most.
-- **Considered, not maximalist:** principles for choosing the matrix + a routine cadence.
-Output: `docs/load-testing-strategy.md` (recommendation). Runs EARLY (Phase 1b) because it
-shapes Buckets 2 & 4 and the Phase-2 plan. **§9 open decisions: all accepted by Ben (2026-06-17)
-with the doc's recommendations** — daily ~6-cell nightly (start smaller, grow by earn-the-slot);
-Linux cgroup CPU quota (probe-gated) for the envelope leg + ratio-calibrated stress-ng/spinner
-as the cross-platform race-jitter lane; stress-ng with a Windows spinner fallback;
-parametrization Axis A (waiter count) first; `GCL_ENVELOPE_TIER=relax` as the D-c
-correctness/envelope-split implementation; nightly issue auto-triage (correctness vs envelope).
-
-### Bucket 7 — Complete deterministic-steering coverage (Ben raised 2026-06-17)
-The load-strategy doc establishes deterministic STEERING (in-process function-interposition) —
-not external load — as the primary lever for the protocol's race-critical windows, and "more
-steered scenarios" as the #1 coverage investment. We have **not** scoped what *complete*
-steering coverage requires.
-- **Audit (Phase 1c):** enumerate every window/branch/residual across acquire / steal / hold /
-  release and map each to its deterministic-steering test or a GAP. Inputs: `failure-modes.md`,
-  the load-strategy §2 reachability table, the earlier F2 audit. Known gaps already: residual-
-  1/2/3 (claimant parked between recheck / touch and rename), and the F2-audit #7/#8 (wrong-type
-  appearing at the lock path mid-steal — A2/G2; Windows blocked-unlink legs). Add a **mechanical
-  branch-coverage pass (kcov for bash, on the Linux CI leg)** to find never-executed lines
-  objectively, as an input to the manual window audit.
-- **Output:** a coverage gap-list doc that scopes the steering-test work.
-- **Fill (Phase 3):** write the missing steered tests, bundled with Bucket 2.
-
-### Bucket 8 — Test-harness ergonomics (research done 2026-06-17; small, zero-dep)
-A subagent researched "big bash files vs alternatives." Verdict: **keep the plain-bash, zero-dep,
-custom-harness, steering-friendly design** — do NOT adopt bats-core (its forced `set -e` fights
-the suite's deliberate `set -uo` + exit-code assertions; its Windows/MINGW path quirks add risk
-on this project's most fragile axis) or shunit2 (lateral move, weaker Windows story). But the
-*monolith* (not the harness) costs a single-test selector + machine-readable reporting.
-Recommended incremental, **zero-dependency** additions, priority order:
-  1. **TAP output** from `ok`/`bad` + a `1..N` plan line (~15 lines) — machine-readable CI
-     reporting AND closes the silent-undercount gap (an early `exit`/crash currently drops every
-     later assertion from the count, total still prints "passed").
-  2. **A single-test selector** (`GCL_TEST_ONLY=<regex>`) — the biggest day-to-day pain (today
-     you run all 36 unit tests to iterate on one, on the slowest leg).
-  3. **Extract the duplicated helpers** into `tests/_harness.sh` (ok/bad/backdate/clone_fn/
-     wait — copy-pasted verbatim across all three files).
-  4. (Optional) split the two large files by concern; leave the integration suite whole (its
-     cross-test repo-state audit is an intentional dependency).
-Fold into the Phase-2 plan / Phase-3 build; items 1–2 are an afternoon and pay off every
-iteration (esp. given the local-test ban → faster CI triage from machine-readable output).
-
-## Workflow (settled: spec → plan → implement → review)
-
-Each phase ends with **Claude + Codex review rounds to convergence** and a **Ben gate**.
-Test execution is **CI-only** throughout (local runs lag Ben's box).
-
-**Phase 1a — Guarantees spec.** Write `docs/guarantees.md` (D-a) — what we guarantee / what's
-out of scope, as a normative contract + the precise operating envelope. Review (Claude +
-Codex) against the code + `failure-modes.md`. → Ben gate.
-
-**Phase 1b — Load-&-matrix testing STRATEGY recommendation (Bucket 6 / Ben "f").** Run a
-considered, first-principles process (parallel research agents on distinct facets: the tool's
-timing-window→load-type mapping + critique of the current wrapper; CI-matrix design on free
-runners; existing-test parametrization), synthesize into `docs/load-testing-strategy.md`,
-review (Claude + Codex). **Recommendation only — NO implementation.** → Ben reviews; his chosen
-recommendations feed Phase 2. Runs early because it shapes Buckets 2 & 4. (1a and 1b are
-independent and can run in parallel.)
-
-**Phase 1c — Complete-steering-coverage audit (Bucket 7 / Ben raised 2026-06-17).**
-Systematically map every protocol window/branch/residual to its deterministic-steering test or
-a GAP, plus a mechanical kcov branch-coverage pass to find never-executed lines objectively;
-output a coverage gap-list doc. Analytical (read-only), parallel to 1a/1b; its gap-list is a
-major input to the Phase-2 test plan (steering is the #1 race-coverage lever per the
-load-strategy doc). **Audit now; gap-filling is Phase 3.** → Ben gate. (Also folds in the
-Bucket-8 harness-ergonomics items, which the new tests will want.)
-
-**Phase 2 — Plan.** Concrete implementation plan for Buckets 2-4, incorporating Ben's chosen
-load/matrix recommendations: per-test injection method (tmpfs / `ulimit` / chmod) + platform
-guard + CI wiring; the matrix/parametrization to adopt; exact doc edits; the
-correctness/envelope test split (D-c); a logging/observability note. Record in `.plans/`,
-review. → Ben gate.
-
-**Phase 3 — Implementation.** Build the fault-injection tests (Bucket 2, tiered per D-b), apply
-the doc edits (Bucket 3), scope the wall-clock bounds + split the tiers (Bucket 4 / D-c), wire
-the agreed CI matrix (Bucket 6). Commit incrementally under the commit-lock. **Verify via CI**
-(dispatch `tests.yml` on `ci-stress`) — never locally.
-
-**Phase 4 — Review.** Review the diff (Claude + Codex); run the full suite via CI **across the
-agreed matrix** to confirm new tests pass + are non-flaky, the scoped bounds hold, and the
-matrix surfaces no new flakes. Iterate to clean. → Ben's final review. Then land on `main`
-per **D-d** (resolved 2026-06-18: a mild tidy-up — extent is Ben's call — then merge via a
-GitHub PR; see Bucket 5).
-
-## Decisions (settled 2026-06-17)
-- **D-a → new `docs/guarantees.md`** (dedicated normative doc).
-- **D-b → accept rec:** F4 / F2-J1 / F3 first tier; F1-ENOSPC, E3, and the deferred F2-audit
-  gaps (#7 wrong-type-mid-steal, #8 Windows blocked-unlink) as a second tier.
-- **D-c → split the suite** into a strict-correctness tier (always enforced) and a
-  latency/envelope tier (not hard-failed by extreme-stress runs).
-- **D-d → RESOLVED 2026-06-18: (B) mild tidy-up, then merge via a GitHub PR** (ci-stress →
-  main), not a local ff-merge. **Extent of tidy-up is Ben's call** — propose the specific
-  commits to drop/squash and get his sign-off before rewriting history. (Was briefly reopened
-  2026-06-18 across cherry-pick (A) / tidy-rebase (B) / squash (C); see **Bucket 5**.) Still the
-  last step.
-- **D-e → my choice:** hand-run Phases 1-2; decide Phase 3-4 (hand vs Workflow) once the
-  test/matrix count is known.
-- **"f" → Bucket 6**, above: a considered, first-principles load-&-matrix testing
-  **recommendation doc** (not implementation), run early as Phase 1b.
-
-## Out of scope for this plan
-- Anything the design already rejected (heartbeat, two-rename CAS, `File.Replace`, supporting
-  network FS) — see `failure-modes.md` §4 "Things explicitly NOT to do".
-- No product *behavior* changes are implied by any of the above — these are tests + docs +
-  test-bound scoping. (If a new test surfaces a real product bug, that's a separate loop.)
diff --git a/.plans/2026-06-17-ci-stress-interop-test5-flake-plan.md b/.plans/2026-06-17-ci-stress-interop-test5-flake-plan.md
deleted file mode 100644
index a6f9e8d..0000000
--- a/.plans/2026-06-17-ci-stress-interop-test5-flake-plan.md
+++ /dev/null
@@ -1,119 +0,0 @@
-# Plan: de-flake interop Test 5 (genuine-pwsh-orphan steal) under load
-
-Status: **DONE** — diagnosis + fix D validated by Claude subagent + Codex; implemented;
-implementation reviewed clean by fresh Claude reviewer ("IMPLEMENTATION OK") + Codex ("no
-correctness issues"); local interop suite 141/0 with a genuine `tok.ps.*` token. Awaiting
-CI-under-load confirmation.
-
-## Reviewer notes (top; do not renumber)
-_(none yet)_
-
-## Context
-CI stress under CPU load (load=4, 4-core Windows runner) reproducibly fails the **interop
-suite Test 5** ("bash steals a STALE lock GENUINELY created by pwsh (holder killed
-mid-hold)"), `tests/git-commit-lock.interop.test.sh:308-334`:
-```
-FAIL: expected a tok.ps.* token on line 1 of the orphan lock, got ''
-PASS: bash run exited 0 after stealing pwsh's stale lock   (+2 more PASS)
-```
-Diagnosis (Claude subagent) + independent Codex review — both in
-`.agent-testing/failures/interop-test5/{DIAGNOSIS.md,b5.log}` and
-`.agent-testing/codex-t5-diag-review.txt`. Agreed mechanism (high confidence,
-triple-corroborated by b5.log):
-
-- The holder is `pwsh ... Lock-Acquire; write READY; Start-Sleep 60 &`, with `hpid=$!`.
-  bash waits READY then `kill -9 "$hpid"`. **That kill does not terminate the native
-  pwsh** (MSYS `$!` names a shim, not `pwsh.exe`; under load it misses). Proof: b5.log
-  shows ACQUIRED 13:42:45 → RELEASED 13:43:45 = **exactly 60s = the full Start-Sleep**,
-  and the release reason is **`engine-event backstop at process exit`** which fires ONLY
-  on graceful exit (`git-commit-lock.ps1:1299-1322`), never on a hard kill.
-- That graceful-exit backstop **deletes the lock file** (`git-commit-lock.ps1:1319-1321`)
-  before bash reads it, so `head -n 1 "$LOCK"` (:320) returns `''` — a **gone file**, not
-  a slow-to-appear token. `backdate "$LOCK" 9999` (:325 = `touch`, no `-c`, :107-115)
-  then **re-creates it empty+ancient**, and bash steals THAT empty orphan (`ghost=?`,
-  b5.log). So the 3 downstream PASSes are **vacuous** (they steal an empty file, not a
-  genuine `tok.ps.*` orphan); the only assertion checking the real premise correctly FAILed.
-- **Classification: test bug, product correct.** Every product action in b5.log is right.
-- **Why load:** unloaded, the kill lands by timing luck before the sleep ends; under load
-  the kill misses and the holder self-releases.
-
-Scope: this kill-a-holder-then-read-its-orphan pattern is unique to Test 5. The other
-interop kill (`:787`, `w14b`) is cleanup of a *hung waiter* after a regression `bad` — no
-orphan read depends on it — so it is NOT affected.
-
-## Fix (Option D — make the orphan deterministic; remove the unreliable kill)
-Both reviewers recommend D over hardening the kill (B/C): it eliminates the flaky
-mechanism instead of making it reliable, and is the smaller, more deterministic change.
-
-Have the pwsh holder **acquire, signal READY, then self-exit via
-`[Environment]::Exit(0)`** — the product's *documented* hard-exit that bypasses BOTH
-`Lock-Release` and the `PowerShell.Exiting` backstop (`git-commit-lock.ps1:221-224`,
-`:1299-1301`), so it leaves a genuine token'd orphan every time, with no external kill and
-no timing dependence. `Lock-Acquire` writes+flushes+closes the token before returning
-(`git-commit-lock.ps1:650-664`) and READY is written only after acquire, so the moment
-bash sees READY the `tok.ps.*` token is already durably on disk.
-
-Concretely in `tests/git-commit-lock.interop.test.sh` Test 5:
-1. Holder command (`:314-315`): replace
-   `. '$PS1WIN'; Lock-Acquire | Out-Null; [IO.File]::WriteAllText('$READY','r'); Start-Sleep 60`
-   with
-   `. '$PS1WIN'; if (-not (Lock-Acquire)) { [Environment]::Exit(3) }; [IO.File]::WriteAllText('$READY','r'); [Environment]::Exit(0)`
-   (`Lock-Acquire` returns `$false` on failure, `git-commit-lock.ps1:1350`; guard it so a
-   failed acquire never writes READY → the existing else-branch "never readied" fires.)
-2. Success branch (`:317-324`): drop the unreliable `kill -9 "$hpid"; wait "$hpid"; sleep
-   0.3` and replace with just `wait "$hpid" 2>/dev/null` (reap the self-exited holder).
-   Keep the token read + `case tok.ps.*` assertion + `backdate` + the steal asserts
-   unchanged — but now the orphan deterministically carries the genuine pwsh token, so the
-   `tok.ps.*` assertion (and the downstream steal) are no longer vacuous.
-3. Comment (`:309-311`): rewrite to describe the new mechanism honestly — the holder
-   acquires, signals ready, then exits via `[Environment]::Exit(0)`, a CLR hard-exit that
-   bypasses release (no `PowerShell.Exiting` event), leaving a genuine no-release token'd
-   orphan; deterministically equivalent (same on-disk state) to a holder killed mid-hold,
-   without depending on a scheduler-raced external kill.
-4. else branch (`:331-333`): keep its `kill -9 "$hpid"` cleanup (harmless; the holder may
-   still be starting if it never readied).
-
-### Why D is faithful (not a weakening)
-Test 5 verifies **bash stealing a genuine stale pwsh-created lock cross-impl**. What
-matters is the on-disk state at steal time: a live lock file whose line 1 is a real
-`tok.ps.*` token, with the holder gone and no release performed. D produces exactly that
-state deterministically. The literal "killed by external TerminateProcess" flavor is only
-test *setup*, not the product behavior under test; D's CLR hard-exit leaves the identical
-artifact. The fix makes the long-vacuous downstream PASSes actually meaningful.
-
-## Also
-- Correct the `AGENTS.md` Test 5 progress-log note (it currently states the wrong
-  mechanism — "token not-yet-visible under load"); replace with the missed-kill /
-  graceful-release-deleted-the-file mechanism.
-
-## Out of scope / NOT changed
-- Product code (`git-commit-lock.ps1` / `.sh`) — no product defect.
-- The bash-worker kills in the unit suite (they kill native bash where `$!` is correct and
-  no orphan-read depends on them; they passed under load).
-- Other interop tests.
-
-## Testing
-1. Static: `bash -n` + `shellcheck -S style` (v0.11.0, the CI gate) on the interop test.
-2. Local: run the interop suite once on this box (pwsh present) — Test 5 must pass and the
-   token assertion must see a real `tok.ps.*` token. (Unloaded local box can't reproduce
-   the original miss, but confirms the rewrite is correct.)
-3. Real proof = CI under load: dispatch ci-stress with stress_kind=cpu/both several times;
-   the interop leg must stay green where it previously failed deterministically.
-
-## Changelog (implementation)
-- Implemented Fix D in `tests/git-commit-lock.interop.test.sh` Test 5: holder command now
-  `if (-not (Lock-Acquire)) { [Environment]::Exit(3) }; write READY; [Environment]::Exit(0)`
-  (was `Lock-Acquire | Out-Null; write READY; Start-Sleep 60`); success branch drops
-  `kill -9 "$hpid"; sleep 0.3`, keeps `wait "$hpid"` to reap; ok-message + comment updated.
-  No product code, no other test touched. `Lock-Acquire` returns a strict boolean
-  (git-commit-lock.ps1:1350 etc.) so the `-not` guard is valid; the token is flushed+closed
-  during acquire (before READY) so it is durably visible before `[Environment]::Exit`.
-- Static: `bash -n` + `shellcheck -S style` (v0.11.0) clean.
-- Local (Windows, pwsh 7.5.5): interop suite **141 passed / 0 failed**; Test 5 token
-  assertion now PASSes with a real `tok.ps.*` token (e.g. `tok.ps.76676.…`) — no longer the
-  vacuous empty-orphan steal.
-- Review: fresh Claude reviewer "IMPLEMENTATION OK" (verified Lock-Acquire boolean contract,
-  no pipeline pollution from dropping Out-Null, token durability, race-free `wait`, quoting);
-  Codex `exec review --uncommitted` "no correctness issues." Both in `.agent-testing/`.
-- AGENTS.md Test 5 progress note corrected (was the wrong "token not-yet-visible" mechanism).
-- Real proof pending: CI interop leg under CPU load where it previously failed 3/3.
diff --git a/.plans/2026-06-17-ci-stress-phase2-build-plan.md b/.plans/2026-06-17-ci-stress-phase2-build-plan.md
deleted file mode 100644
index 8da0f2a..0000000
--- a/.plans/2026-06-17-ci-stress-phase2-build-plan.md
+++ /dev/null
@@ -1,477 +0,0 @@
-# Phase 2 plan: implement the guarantees-and-coverage build (Buckets 2/3/4/6/8)
-
-Status: **PROPOSAL — Phase 2 of the [guarantees-and-coverage
-plan](2026-06-17-ci-stress-guarantees-and-coverage-plan.md).** Awaiting Ben's
-gate. No implementation (Phase 3) until approved.
-
-## What this plans
-The concrete build that follows from the (committed, queued) Phase 1 outputs:
-- `docs/guarantees.md` — the normative contract (Phase 1a).
-- `docs/steering-coverage.md` — the prioritized steering-coverage gap list (Phase 1c).
-- `docs/failure-modes.md` §4 — the accepted scope decisions (incl. Ben's §4.5
-  override to add fault-injection coverage).
-- `docs/load-testing-strategy.md` §9 — accepted load/matrix recommendations.
-
-It turns those into: new tests (Bucket 2 — the Tier-A steering + Tier-B
-fault-injection gaps), documentation edits (Bucket 3), the correctness/envelope
-test split (Bucket 4 / D-c, via `GCL_ENVELOPE_TIER=relax`), the CI matrix wiring
-(Bucket 6), and harness ergonomics (Bucket 8). **Verification is CI-first** (the
-new tests run across the matrix); local runs are allowed but the box lags under
-heavy fan-out.
-
-Each section gives per-item designs concrete enough for Phase 3 to implement
-directly. Three sections (Bucket 2 Tier-B, Bucket 6, Bucket 8) were
-feasibility-validated by parallel design agents; their results are integrated below.
-
----
-
-## Bucket 2A — Tier-A steering tests (portable, deterministic; the bulk of the value)
-
-From `steering-coverage.md` §3 Tier A. All are new `clone_fn`/shadow tests in
-`tests/git-commit-lock.test.sh` (unit suite), runnable on every CI leg — no
-fault-injection fragility. The audit already established each steering technique;
-line anchors are current-tree and may drift (re-locate at build).
-
-| ID | Gap (location) | Steering mechanism | Asserts | Platform | Priority |
-|---|---|---|---|---|---|
-| **A1** | `CLAIM-ABORT (rename-refused)` — wrong-type object at the lock path mid-steal (`:1195-1202`) | `clone_fn _lock_verify_stale` (or shadow `mv`) to `mkdir` a directory onto `$AGENT_LOCK_PATH` immediately before the rename | `CLAIM-ABORT (rename-refused)` + "non-file at the lock path" log; claim deleted; discovery read; **no false hold**; ghost handled | all | **HIGH** — the only acquire/steal *verdict* branch with no test; its own log string |
-| **A2** | step-3.3 pre-rename CLAIM-ABORT block (`:1151-1160`; kcov hits=0) | `_lock_verify_stale` shadow with a **call-counter**: pass on call 1 (step-2), flip to `not stale` (gone/wrongtype/fresh) on call 2 (step-3.3) | the step-3.3 abort reason-map fires; claim-delete + discovery + `return 1`; no false hold | all | **HIGH** — a whole unexercised abort lane |
-| **A3** | `foreign` claim-recheck branch (`:1103-1106`; kcov hits=0) | shadow the claim read at recheck to return a *foreign* token (a clearer removed our claim, a rival re-claimed) | leave the foreign claim; discovery read; back off; no 98-on-mere-claim | all | MED-HIGH |
-| **A4** | `exec`-bypass / §H4 no-silent-loss boundary (`lock_run` runs `"$@"` in the wrapper shell, `:1733`) | **(corrected, verified empirically)** the exec must run in the lock-holding shell: `run -- exec true` or sourced `lock_acquire; exec true` — **NOT** `run -- bash -c 'exec true'` (that execs a child, releases normally) | (a) benign: no `RELEASED` line / lock left held; (b) displaced (backdated lease + parked contender) + exec 0 → caller sees 0 with **no** 98 — pins `guarantees.md` OOS-5 | all (bash) | **HIGH** — the one silent-loss boundary |
-| **A5** | forward clock-jump → premature steal of a live lock (§E2; `:928,1409`) | `clone_fn _lock_now` to return now+offset on the poll while the live holder's mtime stays current | the live lock is judged stale and stolen; the victim's release hits **98** (clock-driven analogue of Test 4b) | all | MED |
-| **A6** | mtime-unreadable fail-safe (§E3; `:639-645`, consumed `:912-926`) | `clone_fn _lock_stat_mtime` (the **inner** stat probe at `:606`) to return empty on a *present* file — **NOT** `_lock_path_mtime`, which is the function that *emits* the warn-once (`:639-643`); shadowing it would defeat the assertion | warn-once "Staleness detection is BROKEN"; **no steal**; waiter → 97; (closes BE-3's "coverage planned") | all (bash; + ps1 parity if feasible) | MED |
-| **A7** | malformed/unreadable content classification tails (`_lock_verify_stale` `:940-949`; in-acquire steal guard `:1429-1443`; claim-stale-check `:1240-1249`) | fabricate a line-1-whitespace file (non-empty blank line 1 = `#18`); shadow a read-fault (`#17`) | no steal; the right `not a lock/claim file` / `unreadable` warning; covers several sibling branches per test | all | LOW-MED (cheap, multi-branch) |
-| **A8** | socket & device-node wrong-type arms (`:1474-1475` claim, `:1561-1562` lock; kcov-new) | bind a unix socket / reference a device node (`/dev/null`) at the path | refusal (never stolen/deleted); the `-S`/`-b`/`-c` arms execute | POSIX | LOW (cheap; sibling of tested guard) |
-| **A9** | log rotation past 1 MB (`:558-559`; kcov-new) | pre-write a >1 MB `$AGENT_LOCK_LOG`, trigger a log call | truncate-restart (log shrinks; lock unaffected) | all | LOW (trivial, no injection) |
-| **A10** | EXIT-trap no-hold arc-end (`:1009,1017-1018`; kcov hits=0) | a sourced `lock_acquire` that `exit`s while still *waiting* (no hold, no in-flight claim) | the no-hold cleanup/restore path runs (vs the TERM twin already tested) | all | LOW |
-| **A11** | `mv -T` fallback forced on (`:969,976-977`) | pre-set `_LOCK_MVT=0` (or shadow the probe's `mv -T` to fail) in a sourced steering shell, then run a steal + a steal-into-a-directory | the BSD/macOS unlink+bare-`mv` lane + the `[ -d ]` last-instant guard execute on Linux/MINGW | all (forces the lane) | LOW-MED (closes an engine lane on the common leg) |
-
-**Sequencing:** A1/A2/A4 first (high value, real verdict/abort/silent-loss lanes);
-A3/A5/A6 next; A7-A11 as a cheap batch. Each is a self-contained unit test using
-the existing fabricate + backdate + `clone_fn` idioms.
-
----
-
-## Bucket 2B — Tier-B fault-injection tests (empirically feasibility-validated)
-
-Each injection was prototyped against the real `git-commit-lock.sh` (Git Bash + WSL).
-The §4.5 discipline applies: **ship only lanes that inject portably/deterministically;
-flag the rest rather than ship a flake.** This **refines the original D-b** (which had
-F3 in the first cut) based on the feasibility results.
-
-| Lane | Injection | Asserts | Guard | Status |
-|---|---|---|---|---|
-| **F4 — unwritable lock dir → 97** | `chmod 0555` the lock dir; create fails O_EXCL every poll. Cap `MAX_WAIT=1-2`, `POLL=0.1`. | `rc==97`; command never ran (no marker); no lock created; log `WAITING` then `TIMEOUT after Ns` | **POSIX-only** (guard is **load-bearing**: `chmod 0555` is a *no-op for writes* on Git Bash/NTFS → would falsely pass rc=0; skip-with-note like Test 17's symlink branch) | **First cut.** Deterministic (5/5 rc=97 on WSL). The §F4 highest-value lane (most likely real misconfig). |
-| **F2/J1 — failing log → lock works, write swallowed** | Point `AGENT_LOCK_LOG` at `<regular-file>/x.log` so every append fails **ENOTDIR** (portable; no chmod/perms). | `rc==0`; command ran (marker); lock cleaned up (gone); log **not written** (`[ ! -s "$LOG" ]` / uncreated). Covers F2 **and** J1 in one test. | **Portable — no guard.** | **First cut.** Deterministic, both platforms. **Caveat:** bash's redirection-open failure leaks to stderr (the `||true` is on the write, not the open) — do **not** assert clean stderr, and do **not** `grep RELEASED "$LOG"` (nothing is written). |
-| **F1 — ENOSPC on create/write** | Real full FS only: `sudo mount -t tmpfs -o size=400k` + `dd` fill, point the lock there. | `rc==97`; command never ran; an **empty-orphan lock left behind** (create 0-byte, write failed — matches §F1) | **Linux-only AND needs root/sudo** | **Second cut — gated, or document-only.** Behavior validated end-to-end on WSL. **`ulimit -f 0` is a trap** — it raises SIGXFSZ (rc=153) killing the *wrapper*, not the create. **No portable injection.** |
-| **F3 — FD / inode exhaustion** | (intended `ulimit -n` / small-inode FS) | (intended `rc==97`, create-fail→wait) | Linux-only; inode→root | **Document-only.** **Cannot inject deterministically:** the create uses **~1 FD**, so any `ulimit -n` low enough to fail *it* first starves bash's own startup (machine-/load-dependent harness corruption, not the lib's 97 lane). Inode exhaustion needs root. §F3 is already reasoned-correct (same shape as F1). |
-
-**D-b tier split (refined by feasibility):**
-- **First cut (implement now):** F4 (POSIX-guarded) + F2/J1 (portable). Both deterministic,
-  single-shot (no fan-out), ~3-4 s total. These close the resource-lane coverage on every
-  leg with zero flake risk.
-- **Second cut:** F1 — **recommend** a Linux-only test gated behind both `uname`==Linux
-  **and** a `sudo -n true` capability probe that **skips-with-note** when sudo is
-  unavailable (never fails the suite), with `sudo umount` in cleanup (GitHub `ubuntu-*`
-  runners have passwordless sudo). *Alternative:* document-only, since the behavior is
-  validated. *(Decision point for Ben — see Open decisions.)*
-- **Document-only:** F3 (and F1 if Ben prefers zero root in the suite). Note the validated
-  behavior in `failure-modes.md` §F1/§F3 (the empty-orphan→97 path) rather than shipping a
-  flaky/non-portable test. **This supersedes `steering-coverage.md` §3 B4's "portable POSIX"
-  rating and the failure-modes §4.5/Q5 "`ulimit -n` for FDs" suggestion** — the empirical
-  check shows the create needs ~1 FD, so no `ulimit -n` fails it without first starving
-  bash's own startup (harness corruption). `steering-coverage.md` B4 is corrected to match.
-
-**Implementation notes (match existing idioms):** use the `LOCK`/`LOG`/`AGENT_LOCK_*` env
-vocabulary and the `rc=$?; [ "$rc" = 97 ] && ok … || bad …` + `grep -q "TIMEOUT after"`
-pattern; mirror Test 17's `2> "$WORK/tNN.err"` capture and skip-with-note. **F4 cleanup is
-load-bearing:** a `chmod 0555` dir blocks `rm -rf` of its *contents* — keep that lock dir
-**empty** (nothing is created in it) so the suite's `cleanup()` `rm -rf "$WORK"` succeeds.
-**F2 assertion polarity** is inverted: assert the log was **not** written; the lock-success
-signal is `rc==0` + the command's marker + lock-file-gone, not a log line.
-
----
-
-## Bucket 3 — Documentation edits (exact text)
-
-Small, concrete edits surfacing the boundaries the analysis decided to document.
-
-### C-envelope (§4.1) → `docs/git-commit-lock.md`
-Add, near the staleness/clock discussion (after the "One caveat on the mtime
-clock" block, ~`:283-293`), a short **operating-envelope** statement:
-> **Correctness is load-independent; latency is not.** Exclusion, no-silent-loss,
-> and eventual recovery rest on atomic create/rename + per-attempt tokens and hold
-> under any load. The wall-clock bounds — recovery latency (≈ STALE + poll
-> cadence), the `MAX_WAIT` timeout, and the ~1.3 s read-retry ladder — are
-> best-effort and scale with scheduling: under CPU oversubscription or a slow FS
-> they stretch, but the protocol still recovers and never loses an update.
-
-### C-clock (§4.2) → `docs/git-commit-lock.md`
-One sentence in the same caveat block:
-> The tool assumes a **single time source** — single-host use (the common case,
-> all contenders share one checkout hence one clock), or a shared FS with one
-> server clock. A local clock jump is correctness-safe: a forward jump can make a
-> live lock look stale and be prematurely stolen, but that degrades to the
-> detected exit-98 lane, never a silent double-commit.
-
-### C-netfs (§4.3) → `README.md`
-The boundary is in the design doc (`git-commit-lock.md:122-126`) but not the
-README, where operators look. Add to "How it works" (after the atomic-create
-sentence, ~`README.md:57`):
-> The protocol's correctness rests on these operations being atomic, which holds
-> on local filesystems (ext4, APFS, NTFS, and kin) but **not** on network or
-> sync-backed storage — NFS, SMB shares, Dropbox/OneDrive-synced directories —
-> where exclusion may silently fail. Keep the repo (and so its `.git/`) on a local
-> disk. (The default lock lives in `.git`, which almost always is.)
-
-### C-mixedver (§I2) → `README.md`
-The "upgrade both together" rule is design-doc-only (`git-commit-lock.md:251-256`).
-Add to the two-implementations section (~`README.md:82-95`):
-> **Upgrade both implementations together.** Older releases stole with an
-> unserialized move-aside instead of the claim protocol, so the
-> no-displacement-during-recovery guarantee holds only when every party in a tree
-> runs a current version; a mixed-version tree degrades that prevention to
-> detection (exit 98) and can leave `.dead.*` files current versions don't clean.
-
-### C-misc (§4.6, optional) → `docs/git-commit-lock.md`
-One line each (low priority): case-insensitive FS is a non-issue (the lock/claim
-paths never collide under case folding); the mixed-version `.dead.*` litter note
-cross-referenced.
-
----
-
-## Bucket 4 — Correctness/envelope test split (D-c; `GCL_ENVELOPE_TIER=relax`)
-
-D-c is implemented as a **tagged assertion downgrade**, not a physical file split
-(a file split would duplicate Test 21/29's heavy `clone_fn` setup and break the
-single-suite kcov measurement). Add an `ok`/`bad`-adjacent helper pair (in
-`tests/_harness.sh` once Bucket 8 item 3 lands; inline in the unit suite until
-then — same signature, so the later move is mechanical):
-
-```bash
-ENVELOPE_TIER="${GCL_ENVELOPE_TIER:-strict}"   # default strict; nightly/deep set relax
-ENV_WARN=0
-# TAP-aware (Bucket 8 item 1 lands FIRST, so TAPN/GCL_TAP already exist — review catch).
-# An envelope PASS is a normal `ok`; an envelope FAIL is a hard `bad` in strict, but in
-# relax it is a TAP-passing line with a `# env-relaxed` directive — it counts toward the
-# 1..N plan and bumps ENV_WARN (for triage), and NEVER reds the run.
-ok_envelope()  { PASS=$((PASS+1)); TAPN=$((TAPN+1)); echo "PASS[env]: $*"
-                 [ "${GCL_TAP:-0}" = 1 ] && echo "ok $TAPN - $*"; return 0; }
-bad_envelope() {
-  if [ "$ENVELOPE_TIER" = relax ]; then
-    ENV_WARN=$((ENV_WARN+1)); TAPN=$((TAPN+1)); echo "WARN[env-relaxed]: $*"
-    [ "${GCL_TAP:-0}" = 1 ] && echo "ok $TAPN - $* # env-relaxed"
-  else
-    FAIL=$((FAIL+1)); TAPN=$((TAPN+1)); echo "FAIL: $*"
-    [ "${GCL_TAP:-0}" = 1 ] && echo "not ok $TAPN - $*"
-  fi; return 0; }
-```
-
-- **`ok`/`bad` = the strict-correctness tier** (always hard, both tiers);
-  **`ok_envelope`/`bad_envelope` = the latency/envelope tier** (hard in `strict`,
-  warn-only in `relax`). Exit code is driven by real `FAIL` only — `ENV_WARN` never
-  reds a run; the summary prints the `ENV_WARN` count so it's visible.
-- **The three (and only three) downgraded call sites** — swap `ok`/`bad` →
-  `*_envelope` on the *wall-clock* assertion only; every neighbouring correctness
-  assertion (rc=97, no-steal, dir-untouched, STOLE-BY-CLAIM, …) **keeps `ok`/`bad`**:
-  - **Test 21** `:1144` — recovery latency `≤20s`.
-  - **Test 22a** — downgrade ONLY the *warning-fired-at-all* assertion (`:1167`,
-    `grep -q "is not a claim file"`, i.e. count `≥1`), which depends on two-poll-confirm
-    headroom under load. Keep the warn-once **correctness** strict: **split** the current
-    `n==1` check (`:1170`) into `n≥1` (→ `bad_envelope`, timing) **+** `n≤1` (→ `bad`,
-    strict — the dedup property: never warns twice), and **guard** "names the type"
-    (`:1168`) on a warning having fired (assert strictly only when `n≥1`). So a real
-    warn-once regression (n≥2, or wrong type) stays a hard FAIL even under `relax`.
-    (Mapping `:1167`/`:1168`/`:1170` verified against the current tree — a reviewer's
-    alternate line numbers were a mislocation; re-confirm at build.) The never-steal /
-    never-delete assertions (`:1171`/`:1172`) stay strict.
-  - **Test 29** `:1531` — `≥2` CLAIM lines (poll-count).
-- **Required CI sets `strict` (or leaves it unset)** — at zero artificial load the
-  three pass comfortably, so the gate behavior is unchanged; **nightly/deep set
-  `relax`** so an oversubscribed runner can't turn an envelope miss into a red.
-- Anchors are current-tree; re-locate the three sites at build (each is the single
-  `-le 20` / warning-count / `-ge 2` line).
-
----
-
-## Bucket 6 — CI matrix wiring (the accepted load-strategy §9 decisions)
-
-> **DECISION (Ben, 2026-06-18): NO branch protection — single-dev project.** We will not
-> enforce required status checks. Consequences for this bucket:
-> 1. **The `tests-passed` aggregator and the per-job doc-only path-filter (the `changes`
->    job) are DROPPED.** Both existed only to make a *required* check behave well (one
->    green context to require; doc-only PRs not blocked by it). With nothing required,
->    `tests.yml` keeps the simple **workflow-level `paths-ignore`** and reports the per-cell
->    matrix contexts directly. So **Bucket 6a = the de-stress revert only** (revert
->    `980856b` + `b430d73`'s `tests.yml` half; restore original concurrency/timeouts; drop
->    the stress `workflow_dispatch` inputs; suites run un-wrapped).
-> 2. The 3-workflow file split (`tests.yml` / `nightly.yml` / `deep-sweep.yml`) is **kept**,
->    but now purely for separation of concerns (per-PR no-load gate vs scheduled load vs
->    on-demand deep) — not to stop `workflow_dispatch` publishing gating contexts (moot
->    without protection). The "distinct `deep-*` job names" detail is likewise now cosmetic.
-> The paragraphs below that describe the aggregator / path-filter / required-context gotchas
-> are **SUPERSEDED** by this note; keep them only as the rationale for why they're unneeded.
-
-**Three-workflow structure** (revised after review — a `workflow_dispatch` run
-publishes check contexts on the head SHA, so keeping Deep in `tests.yml` under shared
-job names risks a failed Deep run gating a PR; separate files + a stable required
-aggregator remove that risk *and* the event-conditional concurrency):
-- **`tests.yml`** — Tier R (required): the 4-cell `test` matrix + `lint` + a single
-  stable **`tests-passed` aggregator** (`needs: [test, lint]`, `if: always()`, succeeds
-  iff every needed job *succeeded or was skipped*). **Branch protection requires ONLY
-  `tests-passed`**, not the per-cell matrix contexts. Concurrency: `group: ${{
-  github.workflow }}-${{ github.ref }}` + `cancel-in-progress`.
-- **`nightly.yml`** — Tier N + the kcov job + triage (`issues: write`, `schedule`, its
-  own `concurrency: nightly`).
-- **`deep-sweep.yml`** — Tier D (`workflow_dispatch` only), with **distinct job names**
-  (`deep-*`) so it never publishes the `tests-passed` context, and per-run-unique
-  concurrency.
-This also fixes the **`paths-ignore`-on-required gotcha** cleanly: path-filter the
-expensive `test`/`lint` jobs (they *skip* on doc-only PRs) while `tests-passed` always
-runs and reports green (its needs were skipped, not failed) — so a doc-only PR satisfies
-the one required context without the expensive jobs running.
-
-**Tier R — Required / per-PR (blocking), `tests.yml`.** The current 4 cells
-unchanged (ubuntu all / macos all / windows unit / windows interop+integration),
-**no load**, `GCL_ENVELOPE_TIER=strict` (default — the 3 wall-clock assertions pass
-comfortably at zero load), `GCL_TEST_FULL=1`. Diff from today: **revert** the
-per-run-unique concurrency group (`980856b`) → `group: ${{ github.workflow }}-${{
-github.ref }}` + `cancel-in-progress`; **drop** the `GCL_STRESS_*` env + `with-load.sh`
-wrap + raised timeouts from the required job (`b430d73`'s workflow half); restore the
-original step/job timeouts. Target < ~8 min. A red here is therefore never a
-stress-manufactured flake.
-
-**Tier N — Nightly (non-blocking, triaged), new `nightly.yml`.** `schedule` (daily,
-off-peak) + `workflow_dispatch`; one oversubscribed level **R≈2**;
-`GCL_ENVELOPE_TIER=relax` + `GCL_TEST_SWEEP=1`; `concurrency: nightly` + cancel
-(one run at a time). **6 explicit cells** (`matrix.include`): N1 ubuntu/cpu, N2
-ubuntu/disk, N3 ubuntu/both, N4 macos/disk (the single harsh macOS cell — scarce/slow/
-5-job sub-limit), N5 windows interop+integration/disk (highest-value: delete-pending
-ghosts + 5.1 unlink-then-move under churn), N6 windows unit/both. 6 cells + kcov +
-triage ≈ 8 jobs → one wave under the ~20/5 ceiling. Nightly steps keep the raised
-timeouts (correct here).
-
-**Tier D — Deep sweep (`deep-sweep.yml`, `workflow_dispatch` only, never gates).**
-Inputs `stress_kind`/`stress_load`/**`repeat`**/`envelope_tier` (default relax). Its
-jobs use **distinct names** (`deep-*`) so a failed dispatch never publishes the
-`tests-passed` required context (the review catch), with per-run-unique concurrency
-(`group: deep-${{ github.run_id }}`, `cancel-in-progress: false`) so many parallel
-dispatches each run and accept queue waves. Living in its own file removes any need for
-an event-conditional concurrency expression.
-
-**Axis-A waiter-count sweep {4,12,24}** under `GCL_TEST_SWEEP=1` (nightly/deep only;
-unset per-PR → today's floor `N=4`, deterministic). A `T_AXIS_A` list read at suite
-top; each of **Test 2b / Test 20 / interop Test 16** loops `N` over it, naming `N` in
-every message. Anti-flake discipline baked into the loop: keep correctness assertions
-config-independent (hold `STALE ≫ hold` so "zero-98 / one-steal" holds at every N —
-these stay `ok`/`bad` strict, *not* `_envelope`), and **scale `MAX_WAIT` with N** so a
-large-N run doesn't time out and look like a product failure. Mechanism generalizes to
-Axis B/C later (deferred per §9.4).
-
-**kcov coverage job** (nightly.yml, Linux-only): build kcov v43 from source (no
-apt/prebuilt), run the **unit suite at FULL, strict, no-load** (`--include-path=git-
-commit-lock.sh`), upload HTML + cobertura (30-day retention), and gate on a
-**conservative line-coverage floor of 0.80** (below the current 83.1%, above noise;
-the Linux ceiling is ~94% because ~30 lines are platform-gated). **Ratchet the floor up
-toward ~0.90 as Bucket-2 lands the Tier-A tests** — the floor tracks achieved coverage,
-it doesn't lead it.
-
-**Nightly issue auto-triage** (nightly.yml, `if: always()`, `issues: write`): parse the
-preserved logs — `^FAIL:` and/or job `failure` → **correctness** (file/append a
-labelled issue, investigate); no FAIL but `WARN[env-relaxed]` and job `success` →
-**envelope-flake** (tracked, no action); timeout/checkout failure → **infra**.
-Idempotent (search-then-append, one issue per (date, class); no all-green spam).
-**Empty-round guard (learned-once):** every cell's artifact missing / workflow errored
-before any suite ran is an **infra** failure — do NOT read "0 FAIL across 0 logs" as
-green. Upload nightly logs on success too (need the negatives to read the positives).
-
-**Load calibration** (`with-load.sh` graduates from scaffolding): express load as
-oversubscription ratio `R = stressors/nproc` (cap `R_total`), prefer `stress-ng`
-(Windows spinner fallback) and a **probe-gated** Linux cgroup CPU-quota path for the
-calibrated envelope leg (IO throttling experimental — don't rely on it); emit a per-run
-**load-manifest** artifact (`{kind, R, nproc, achieved-slowdown, tool versions, os/arch,
-sha}`) uploaded on success too.
-
-**What lands on `main` vs stays scaffolding (refines Bucket 5 / D-d):** *(This lists the
-mergeable **content**; it is mechanism-agnostic. The merge **mechanism** — cherry-pick vs
-tidy-rebase+ff-merge vs squash — was **reopened 2026-06-18**; see the guarantees-and-coverage
-plan's Bucket 5. Note that after this Bucket 6 lands, ci-stress's tree already excludes the
-stress wiring, so "what graduates" is mostly a history-curation question, not a tree one.)*
-- **Graduate to `main`:** the calibrated `with-load.sh` (strip the do-not-merge banner;
-  add ratio calibration + load-manifest); `ok_envelope`/`bad_envelope` + the 3
-  reassigned assertions; `GCL_TEST_SWEEP` + Axis-A loop (default-off → per-PR identical
-  to today); the new `nightly.yml`; the `tests.yml` event-conditional-concurrency edit +
-  dispatch inputs. So `b430d73` is **not** wholly do-not-merge — its `with-load.sh`
-  payload graduates; only its *required-job wiring* is dropped.
-- **Revert / drop:** `980856b` (flat per-run-unique group); `b430d73`'s load-wrap +
-  raised-timeouts **on the required job** (they move to nightly.yml).
-
-**§7 GitHub-Actions gotchas the diff MUST honor:**
-- **`paths-ignore` on a *required* check blocks doc-only PRs** (skipped workflow → checks
-  Pending → merge blocked). **Fixed** by the `tests-passed` aggregator above: it is the
-  sole required context and always runs (green when the path-filtered `test`/`lint` jobs
-  skip), so doc-only PRs merge. Branch protection must require **`tests-passed`**, NOT the
-  per-cell matrix contexts (else skipped cells sit Pending).
-- **`max-parallel` is intra-matrix only** — bound Deep/Nightly with workflow-level
-  `concurrency` groups (done), never `max-parallel`.
-- **`schedule` auto-disables after ~60 days of repo inactivity** — note in `nightly.yml`;
-  rely on `workflow_dispatch` to re-trigger. A successor should know an empty nightly
-  history may mean "disabled," not "passing."
-- **Artifact names** unique per `(os, leg, kind)`; keep `include-hidden-files: true`
-  (the lock logs live under the scratch `.git/`). `fail-fast: false` stays (per-OS
-  signal + triage needs every cell's verdict). 256-job cap irrelevant at this scale.
-
----
-
-## Bucket 8 — Harness ergonomics (zero-dep; prototype-validated)
-
-Tests are straight-line `echo "== Test N: … =="` blocks (no registry): **43** in the
-unit suite (the "~36" figure was stale), 25 interop, 2+1 integration. Sequencing is
-**TAP → selector → extract** (each its own commit).
-
-**Item 1 — TAP + `1..N` plan line + the undercount fix (do FIRST, ~20 lines/suite).**
-The bug: under `set -uo pipefail` (no `-e`), an early `exit`/crash terminates the
-suite before the final `echo RESULT` + `[ "$FAIL" = 0 ]`, dropping later assertions
-from the count — and a stray `exit 0` after a recorded FAIL exits **0 with no RESULT
-line** (a *silent green*). Fix, three parts (all prototype-validated):
-- Make `ok`/`bad` TAP-aware, gated by `GCL_TAP=1` (dev runs byte-unchanged): bump a
-  running `TAPN` and emit `ok N - desc` / `not ok N - desc`; keep the `return 0` that
-  the `A && ok || bad` idiom needs.
-- Emit a **trailing `1..$TAPN`** plan line before the verdict — a consumer fails on a
-  short count.
-- A **"reached-the-end" sentinel**: `DONE=0` set to `1` as the last action before the
-  verdict; a `finish` EXIT trap (wrapping the existing per-suite `cleanup`) that, if it
-  fires with `DONE!=1`, prints `Bail out!` and **`exit 1`**. (Key validated detail: a
-  bare trap *return* is ignored — the script keeps its pre-trap code — so the guard
-  needs an explicit `exit 1`; this is what converts the silent early-`exit 0`-after-FAIL
-  into a red.) No hand-maintained expected-count constant — the sentinel catches *any*
-  premature termination with zero upkeep. Apply to all three suites.
-
-**Item 2 — `GCL_TEST_ONLY=<regex>` selector (SECOND; 43 mechanical header rewrites).**
-Wrap each block: `echo "== Test N: … =="` → `if section "Test N: …"; then … fi`, where
-`section` echoes the header and returns success iff `GCL_TEST_ONLY` is unset or its
-regex matches the label. **Care point:** a few blocks do trailing cleanup *after* the
-last assertion before the next header — those lines must move *inside* the `fi`.
-**Integration is EXCLUDED by design:** its Tests 1-3 share one repo + `ALL_IDS`
-accumulator (Test 3 audits 1+2's output), so it is one indivisible scenario — it
-must *note-and-ignore* `GCL_TEST_ONLY` (loud stderr note), never per-block select.
-Unit first; interop the same treatment (lower priority). Anchoring tip for docs:
-`'Test 2'` also matches `Test 2b/20/25` — use `'Test 2:'` / `'Test 2b'`. **Zero-match
-guard (review catch):** `section` bumps a `SECTIONS_RUN` counter when it runs a block;
-at the end, if `GCL_TEST_ONLY` is set and `SECTIONS_RUN==0`, fail loudly — a typo'd regex
-must not report a vacuous `PASS=0 FAIL=0` green (same spirit as the undercount sentinel).
-
-**Item 3 — extract `tests/_harness.sh` (LAST; pure dedup, largest diff).** Source one
-shared file from each suite. Tier 1 (all three): the `PASS/FAIL/TAPN/DONE` inits +
-`GCL_TAP`/`GCL_TEST_ONLY` reads, `ok`/`bad`, `section`, the `finish`/sentinel helper,
-and the shared shellcheck disables. Tier 2 (unit+interop only — integration uses none):
-`epoch_to_stamp`, `backdate`, `backdate_ghost`, `sync_waiting_fresh`, `fabricate_lock`,
-`wait_for_grep`, `clone_fn` + its `export -f` line. Tier 3: keep **both** poll helpers
-under their existing names/semantics (`wait_for_file` `$2`=seconds, interop's `wait_for`
-`$2`=50ms-iterations) — do *not* unify signatures this pass (would touch every call site
-on the most fragile timing axis). **Do NOT extract `cleanup`** — it closes over each
-suite's `$WORK` and interop's body genuinely differs; the shared `finish` just calls the
-suite-local `cleanup`. Do it last so the final TAP/selector code is extracted once.
-Verify byte-identical behavior by diffing a FULL run's sorted `PASS:`/`FAIL:` set
-before/after (CI or local).
-
-Prototypes (gitignored, `.agent-testing/bucket8-proto/`) validate TAP emission, the
-trailing plan, selector matching, TAP+selector composition, and the sentinel closing
-the exact silent-green bug.
-
----
-
-## Phasing for Phase 3 (the build)
-
-Order chosen so cheap, enabling work lands first and each step is CI-verifiable:
-
-1. **Bucket 8 items 1-2 first** (TAP + `GCL_TEST_ONLY`) — they make iterating on
-   ~15 new tests far cheaper and give machine-readable CI output to read the new
-   tests' results back from. (Per the harness design's safe-increment order.)
-2. **Bucket 3 doc edits** — independent, low-risk, can land anytime; do early so
-   the docs match the contract.
-3. **Bucket 4 envelope switch** (`GCL_ENVELOPE_TIER`) — needed before the nightly
-   CI tier and before scoping Test 21/22a/29.
-4. **Bucket 2A steering tests** (A1/A2/A4 first, then the rest) — the coverage core.
-5. **Bucket 2B fault-injection tests** (the feasible D-b first cut; flag/defer any
-   non-portable lane).
-6. **Bucket 8 item 3** (`_harness.sh` extraction) — after the new tests exist, so
-   the shared helpers are settled.
-7. **Bucket 6 CI matrix** — wire the three tiers + kcov leg + parametrization last,
-   once the tests and the envelope switch exist for it to orchestrate.
-
-Each step commits incrementally under the commit-lock; verification dispatches
-`tests.yml` on `ci-stress`. **Build vs Workflow:** decide hand-run vs a Claude Code
-Workflow once the final test count is known (plan D-e) — likely a Workflow for the
-~15 steering tests (fan-out write + per-test CI verify).
-
-## Logging / observability design (per engineering practices)
-- **New tests** assert on the product's existing protocol log strings (the coverage
-  proxy the audit used) — every new steering test greps a specific log line, so a
-  silent behavior change is caught.
-- **TAP output** (Bucket 8) makes each assertion's pass/fail individually visible in
-  CI logs, and the `1..N` plan line makes a truncated run fail loudly (closing the
-  silent-undercount gap).
-- **The load-manifest artifact** (Bucket 6) records `{kind, R, nproc,
-  achieved-slowdown, tool versions, runner os/arch, git sha}` per nightly/deep run,
-  uploaded on success too, so any flake is reproducible (the reproducible-experiments
-  requirement).
-- **kcov coverage artifact** (Bucket 6) uploaded per Linux run; the gap list in
-  `steering-coverage.md` is the baseline to diff against.
-- **Nightly auto-triage** tags a failing scheduled run `correctness` (investigate)
-  vs `envelope` (expected under load), so scheduled reds are visible, not silent.
-
-## Open decisions for Ben
-- **D-b tiering (confirm):** build all of Tier A (A1-A11) + the Tier-B first cut
-  (F4, F2/J1) now? The original D-b's "second tier" items are all accounted for —
-  E3 → **A6** (steering, not fault-injection), F2-audit #7 (rename-refused) → **A1**,
-  #8 (Windows blocked-unlink) → **Tier C** (platform-only, verified on the Windows
-  leg); only **F1/F3** are genuinely not portably injectable. (Recommend: yes — Tier A
-  is all portable; defer only F1/F3.)
-- **F1 (ENOSPC) — gated test vs document-only:** F1's behavior is validated but its
-  injection needs Linux root (`mount`). Ship as a Linux-only test gated behind a
-  `sudo -n` capability probe (skip-with-note elsewhere, `sudo umount` in cleanup), or
-  document-only? (Recommend: the **gated test** — GitHub `ubuntu-*` runners have
-  passwordless sudo so it actually runs there and skips cleanly everywhere else; falls
-  back to document-only if you'd rather keep zero root in the suite.) **F3 is
-  document-only either way** (no deterministic injection exists — the create needs ~1 FD).
-- **Build mechanism (D-e):** hand-run Phase 3, or a Claude Code Workflow for the test
-  fan-out? (Recommend: decide once the count is final — ~13 steering + 2-3 fault tests;
-  lean Workflow for the steering batch, hand-run the CI/doc edits.)
-- Anything else needing a call is surfaced inline in the integrated sections.
-
-## Changelog (Phase 3 implementation)
-- **Step 1 (commit `3789be9`) — Bucket 8 item 1 done.** TAP + `1..N` + the
-  `DONE`/`finish` undercount sentinel in all three suites. Unit validated locally
-  (220/220 REDUCED + matching plan line, exit 0, sentinel does not false-fire);
-  interop/integration syntax-checked, full runs via CI.
-- **Deviation — defer Bucket 8 item 2 (the `GCL_TEST_ONLY` selector).** Wrapping 43
-  blocks in `if section …; then … fi` is a large, boundary-sensitive change whose only
-  benefit is per-test iteration speed; for this batch the steering tests are validated
-  by a full-suite run, so it doesn't justify front-loading its risk. Bundled with item 3
-  (`_harness.sh` extraction — also a large harness change) into one validated
-  harness-restructure step near the end. **Revised phasing: 8.1 → 3 → 4 → 2A → 2B →
-  (8.2 + 8.3 together) → 6.**
-- **Step (commit `4ee5899`) — Bucket 8 item 2 done** (`GCL_TEST_ONLY` selector). Each
-  top-level `== Test N: … ==` header in unit + interop became `if section "Test N: …";
-  then … fi` (each `fi` before the next `if section`, so trailing cleanup stays inside);
-  `section` runs a block iff `GCL_TEST_ONLY` is unset/empty or its regex matches, bumping
-  `SECTIONS_RUN`. Zero-match guard bails loudly (exit 1) on a set-but-non-matching regex
-  (no vacuous green). Integration note-and-ignores (one indivisible scenario). Built by 3
-  parallel sub-agents (one per suite), each self-validating byte-identical + selector
-  precision + the guard; orchestrator re-verified independently. Validated reduced: unit
-  315/0, interop 141/0, integration 12/0; selector precision proven (regex, trailing-colon
-  anchoring); `shellcheck -S style` clean.
-- **Step (commit `b8e2951`) — Bucket 8 item 3 done** (`tests/_harness.sh` extraction, 177
-  lines, net −42). Tier 1 (all three): inits + `GCL_TAP`/`GCL_TEST_ONLY` reads + `ok`/`bad`
-  + `section` + the `finish` sentinel + shared shellcheck disables + a unified
-  `selector_report` (so unit/interop match). Tier 2 (unit+interop, byte-identical-verified
-  first): `epoch_to_stamp`, `backdate`, `backdate_ghost`, `sync_waiting_fresh`,
-  `fabricate_lock`, `wait_for_grep`. Left per-suite: `cleanup` (closes over `$WORK`),
-  `clone_fn`+`export -f` (unit-only), `ok_envelope`/`bad_envelope` (unit-only), both poll
-  helpers (`wait_for_file` secs vs `wait_for` 50ms-iters — Tier 3, not unified), verdict
-  lines. CWD-independent sourcing (`BASH_SOURCE`) + `# shellcheck source=` directive;
-  `tests/_harness.sh` added to the CI lint list. Byte-identical (315/141/12), `shellcheck`
-  clean, selector/guard/integration-note all intact; orchestrator re-verified independently.
-- **(8.2 + 8.3 COMPLETE.) Next: Bucket 6 (CI matrix wiring).** Cross-platform CI verification
-  of these two commits pending (dispatch `tests.yml` on `ci-stress`).
diff --git a/.plans/2026-06-17-ci-stress-test-f2-coverage-plan.md b/.plans/2026-06-17-ci-stress-test-f2-coverage-plan.md
deleted file mode 100644
index e1d9f4e..0000000
--- a/.plans/2026-06-17-ci-stress-test-f2-coverage-plan.md
+++ /dev/null
@@ -1,97 +0,0 @@
-# Plan: cover F2 — steal rename WON but read-back verification FAILED (coverage gap)
-
-Status: **DONE** — implemented; reviewed clean (see changelog). Test-only addition; product
-untouched.
-
-## Reviewer notes (top; do not renumber)
-_(none yet)_
-
-## Context
-A coverage audit (subagent + my own verification against the code) found that the product's
-two acquire read-back-verification failure lanes are asymmetrically covered:
-- **Create path (outcome I)** — `git-commit-lock.sh:1354-1360`: O_EXCL create wins, the path
-  read-back ≠ our token → `WARNING: acquire verification FAILED — create won but read-back
-  found ...` → re-enter wait. **Covered** by Test 32 (`tests/git-commit-lock.test.sh:1760`),
-  whose `_lock_cur_token` shadow is gated `[ -z "$_LOCK_CLAIM_TOKEN" ]` (fires only at the
-  create read-back).
-- **Steal path (outcome F2)** — `git-commit-lock.sh:1168-1179`: the stealer WON the claim
-  race AND won the rename-over (`STOLE-BY-CLAIM` already logged, ghost destroyed), but the
-  post-rename read-back ≠ our token → `WARNING: acquire verification FAILED — steal rename
-  completed but read-back found ...` → clear `_LOCK_CLAIM_TOKEN`, return 1, re-enter wait.
-  **UNCOVERED.** Verified: no test greps the F2 string; Test 32's gate excludes it (at the
-  steal read-back `_LOCK_CLAIM_TOKEN` is set); on the success-rename path `:1171` is the only
-  `_lock_cur_token` call with the claim token set (`_lock_rename_over` `:961-979` makes none).
-
-F2 is the higher-stakes twin: it fires AFTER `STOLE-BY-CLAIM` (ghost already destroyed), so a
-future regression here (wrongly taking the hold on a mismatched read-back, or failing to clear
-`_LOCK_CLAIM_TOKEN`) would be a silent false-hold / mis-attributed release. The code reads
-correctly today — this is a missing-test (regression exposure), not a present bug.
-
-The suite's closing NOTE (`:2119-2121`) says "lock_acquire's read-back-verification failure
-lane … not suite-covered", but Test 32 already covers the create lane — the note is stale and
-does not distinguish F2.
-
-## Change (test-only)
-1. Add **Test 32b** immediately after Test 32, mirroring Test 32 with the INVERSE token gate
-   so the fault injection lands at the STEAL read-back:
-   - Set up a stale ghost (`fabricate_lock` + `backdate 9999`) so a steal is attempted.
-   - In a sourced subshell, `clone_fn _lock_cur_token _ct_orig`; shadow it to fire ONCE
-     (flag FILE `$SF1`, subshell-safe) when `[ ! -e "$SF1" ] && [ "${_LOCK_HELD:-0}" = 0 ]
-     && [ -n "$_LOCK_CLAIM_TOKEN" ]` — i.e. at the steal read-back (`:1171`), where the claim
-     token is set and the hold is not yet taken. On firing: `backdate "$AGENT_LOCK_PATH"
-     9999` (so the just-installed abandoned lock is immediately re-stealable — same trick as
-     Test 32, keeps it fast/deterministic), `printf ""` (blank read-back → F2), `return 0`.
-   - `lock_acquire || exit 72; lock_release || exit 74; exit 0`.
-   - Flow: attempt 1 — claim won, rename won (`STOLE-BY-CLAIM`), read-back blanked → F2
-     WARNING → re-enter wait; the abandoned lock is stale → attempt 2 steals it, read-back now
-     real (SF1 set) → HOLD → `ACQUIRED` → release rc 0.
-   - Assertions: rc 0; the **F2-specific** string `steal rename completed but read-back`
-     fired (else `bad "F2 lane never ran"` — guards vacuity / proves the steering reached
-     `:1171`); the WARNING precedes the final `ACQUIRED` (no false-hold on attempt 1);
-     `STOLE-BY-CLAIM` count ≥ 2 (re-stole after the failed read-back); no leftover lock/claim
-     after release.
-2. Update the stale NOTE (`:2119-2121`): both read-back lanes are now suite-covered — create
-   by Test 32, steal by Test 32b — via `_lock_cur_token` fault injection.
-
-## Why deterministic / load-robust
-Internal steering (no scheduling race); the backdate-9999 trick removes any aging wait so the
-re-steal is immediate; `MAX_WAIT=30`, `POLL=0.1` give ample headroom under CI load. Same shape
-as the already-load-robust Test 32.
-
-## Logging
-No product logging change. The new test asserts on existing product log lines (the F2 WARNING,
-`STOLE-BY-CLAIM`, `ACQUIRED`).
-
-## Out of scope / NOT changed
-- Product code (`git-commit-lock.sh`, `.ps1`) — no defect; F2 reads correct.
-- Lower-priority gaps from the audit (A2/G2 wrong-type appearing at the lock path mid-steal;
-  platform-only feeder #3) — left for a separate decision.
-
-## Testing
-1. Static: `bash -n` + `shellcheck -S style` (v0.11.0, the CI gate).
-2. Local: run the new test (and the full suite); it MUST exercise the F2 string (the
-   `bad "F2 lane never ran"` guard fails loudly if the steering misses `:1171`).
-3. Real proof: CI under load (the hunt) stays green with the new test.
-
-## Changelog (implementation)
-- Added Test 32b to `tests/git-commit-lock.test.sh` (after Test 32) and updated the closing
-  NOTE so both read-back lanes read as covered (create by Test 32, steal/F2 by Test 32b).
-  Product untouched.
-- Verified the steering empirically: a standalone extract of Test 32b (suite header + the
-  Test 32b block, `LIB` pinned absolute) passed 6/6 with the F2-specific line
-  `the steal-path read-back-verification failure lane ran (F2)` firing — proving the fault
-  lands at `git-commit-lock.sh:1171` (`_LOCK_CLAIM_TOKEN` set there; `_lock_rename_over`
-  makes no read; the create read-back at :1353 has it empty).
-- Static: `bash -n` + `shellcheck -S style` (v0.11.0) clean.
-- Local: full unit suite **220 passed / 0 failed** (count varies run-to-run via the fan-out
-  tests; 0 failed is the invariant). Test 32b: rc 0, F2 string fired, STOLE-BY-CLAIM x2,
-  WARNING-before-ACQUIRED, no leftovers.
-- Impl review (2 independent, both clean): fresh Claude reviewer ("VERDICT: CORRECT … No
-  defects") — independently ran the suite twice (220/0), grepped every `_LOCK_CLAIM_TOKEN`
-  set/clear and `_lock_cur_token` call site, confirmed gate precision (all `_lock_discover`
-  branches clear the claim token first, so the `-n` gate excludes :820; release excluded via
-  `_lock_take_hold`), determinism, non-vacuity, termination. Codex `exec` read-only ("No
-  findings … correct and non-vacuous"), confirming the same with file:line cites. Two minor
-  non-blocking notes (the SF1 flag file lives in the throwaway WORK dir; `_ct_orig "$@"` is
-  harmless) — no action.
-- Real proof: CI under load (the hunt) with Test 32b in the tree.
diff --git a/.plans/2026-06-17-ci-stress-test31a-flake-plan.md b/.plans/2026-06-17-ci-stress-test31a-flake-plan.md
deleted file mode 100644
index be8d801..0000000
--- a/.plans/2026-06-17-ci-stress-test31a-flake-plan.md
+++ /dev/null
@@ -1,135 +0,0 @@
-# Plan: de-flake unit Test 31(a) (leaked-claim discovery-route race) under load
-
-Status: **DONE** — diagnosis converged across 4 independent reviews (my code-read +
-leak.log + a fresh-context Claude subagent that did NOT read the prior diagnosis + Codex
-foreign-model review); fix implemented; implementation reviewed clean (see changelog).
-Test-only change; product untouched. Awaiting CI-under-load confirmation.
-
-## Reviewer notes (top; do not renumber)
-_(none yet)_
-
-## Context
-CI stress under both/load=2 (moderate, 4 hogs on a 4-core ubuntu runner — NOT the
-8-hog oversubscription regime) failed ONE assertion in unit **Test 31 sub-leg (a)**
-(`tests/git-commit-lock.test.sh:1582`), run 27626826865:
-```
-FAIL: no leaked-token-memory DISCOVERY-HOLD
-```
-Every other (a) assertion passed (recheck-unreadable feeder fired; rc 0; lock released
-cleanly; no claim/lock leftover); sub-legs (b)(c)(d) passed.
-
-### Mechanism (test-orchestration race; product correct)
-The product has TWO valid, equally-correct ways to adopt a leaked claim that a rival has
-installed at the lock path, and both log a `DISCOVERY-HOLD` line:
-- **D1 — inline ownership-discovery read.** `_lock_discover` (`git-commit-lock.sh:819`,
-  log at `:822` `DISCOVERY-HOLD: our claim ... installed ... by a rival's rename`) is the
-  unconditional final act of every post-claim non-rename exit. In (a) the steered
-  recheck-unreadable exit runs `_lock_leaked_add` (`:1112`, the `LEAKED-CLAIM` log) and
-  then **immediately, one statement later**, `_lock_discover "$tok"` (`:1114`).
-- **D2 — per-poll leaked-token-memory check.** `git-commit-lock.sh:1382`
-  (`DISCOVERY-HOLD (leaked-token memory): ...`) fires on a LATER blocked poll while the
-  memory list is non-empty.
-
-Sub-leg (a)'s harness is open-loop: it `wait_for_grep`s the `LEAKED-CLAIM` line
-(`:1574`) then does `mv -f -- "$LOCK.next" "$LOCK"` (`:1576`, the rival install). That
-`mv` races the leaver's inline `_lock_discover` at `:1114`:
-- mv lands **before** the inline discover → **D1** wins (the `:822` line). ← failing run
-- mv lands **after** the inline discover (it misses; later poll) → **D2** wins (`:1382`).
-
-The assertion at `:1582` hard-pins **D2** (`grep -q "DISCOVERY-HOLD (leaked-token
-memory)"`). Under load the leaver was descheduled between `:1112` and `:1114`, the
-harness `mv` landed first, D1 fired, D2 never logged → the assertion failed. The product
-behaved correctly in BOTH cases (token remembered, same token observed installed,
-adopted, rc 0, clean release, no residue). Classification: **test flake, product
-correct** — the assertion over-specified an implementation-incidental, scheduler-chosen
-route rather than the contract (a leaked claim installed by a rival is adopted and
-cleaned up).
-
-### Coverage (why relaxing (a) loses nothing)
-- **D2 (memory route)** is covered DETERMINISTICALLY by **sub-leg (b)** (`:1592-1627`):
-  it drives the rival install from inside `_lock_new_token` at NTC=2 so the leaver runs a
-  full aborting claim attempt and adopts only on the per-poll memory check; it asserts
-  `DISCOVERY-HOLD (leaked-token memory)` and the `leak < abort < adoption` ordering.
-- **D1 (direct route)** is covered DETERMINISTICALLY by **Test 25** (`:1323-1425`), the
-  discovery-position matrix: 7 internally-steered positions, each asserting the generic
-  `grep -q "DISCOVERY-HOLD"` + rc 0 + no orphan. (Test 25 already uses the generic grep
-  idiom this fix adopts for (a).)
-
-So (a)'s distinct, irreplaceable job is the END-TO-END "external rival installs a
-recheck-unreadable leaked claim → adopted & cleaned up" scenario, where either route is a
-correct outcome.
-
-## Fix (Option A — accept either discovery route; recommended by all four reviews)
-Test-only, in `tests/git-commit-lock.test.sh` sub-leg (a):
-1. Replace the single D2-pinning assertion (`:1582-1583`) with a three-way check that
-   accepts EITHER route, records WHICH fired (telemetry for the load hunt), and only
-   fails if NEITHER `DISCOVERY-HOLD` route adopted the claim:
-   ```sh
-   if grep -q "DISCOVERY-HOLD (leaked-token memory)" "$LOG"; then
-     ok "... per-poll memory route ..."
-   elif grep -q "DISCOVERY-HOLD:" "$LOG"; then
-     ok "... inline direct-discovery route ... (memory route pinned by sub-leg (b)) ..."
-   else
-     bad "no DISCOVERY-HOLD adoption of the leaked claim by EITHER route"
-   fi
-   ```
-   `"DISCOVERY-HOLD:"` (immediate colon) matches ONLY D1; D2's text is
-   `DISCOVERY-HOLD (leaked-token memory):` (space+paren after the dash), so the two
-   patterns are disjoint and D2 is checked first regardless.
-2. Update sub-leg (a)'s header comment (`:1550-1552`) to state honestly that adoption may
-   go through either route, that the choice is a load-sensitive scheduling race, and that
-   the memory route is pinned deterministically by (b) and the direct route by Test 25.
-
-### Why A (not B/C)
-- **A** matches (a)'s real intent; not vacuous — still requires the recheck-unreadable
-  feeder (`:1574`), rc 0 (`:1581`), clean release + no leftover (`:1584-1585`), AND a
-  `DISCOVERY-HOLD` adoption (the log line only appears when `_lock_take_hold` runs via a
-  discovery path). No new timing introduced. Keeps (a) as the load-tolerant main leg.
-- **B** (force the memory route via internal steering) duplicates (b).
-- **C** (force the direct route) duplicates Test 25; also `_lock_discover` direct
-  coverage is already comprehensive there. (NB: the subagent's specific C steering — do
-  the mv inside the fire-once read shadow before returning empty — would actually
-  mis-classify the claim as `gone` not `unreadable`, killing the leak feeder; another
-  reason to avoid C. Verified against `_lock_claim_state`, `git-commit-lock.sh:840-850`.)
-
-## Out of scope / NOT changed
-- Product code (`git-commit-lock.sh`, `.ps1`) — no defect.
-- Sub-legs (b)(c)(d), Test 25, any other test.
-
-## Logging
-No product logging change. The new three-way `ok` line records which discovery route
-adopted the claim each run — a small telemetry win making the previously-hidden route
-choice visible in every (a) run's output (helps confirm load is exercising both routes).
-
-## Testing
-1. Static: `bash -n` + `shellcheck -S style` (v0.11.0, the CI gate) on the test file.
-2. Local: run the unit suite on this box; Test 31 (all sub-legs) must pass; confirm the
-   new `ok` line reports a route. Run Test 31 in a loop to confirm no regression.
-3. Real proof: CI under both/load=2 where (a) previously failed — the unit leg must stay
-   green and report a route each run.
-
-## Changelog (implementation)
-- Implemented Fix A in `tests/git-commit-lock.test.sh` sub-leg (a): the single
-  D2-pinning assertion became a three-way `if/elif/else` (memory route → ok; direct route
-  via `grep "DISCOVERY-HOLD:"` → ok; neither → bad). Rewrote (a)'s header comment to
-  document both routes, the load-sensitive race, and the deterministic coverage of each
-  (sub-leg (b) for memory, Test 25 for direct). No product code, no other test touched.
-- Static: `bash -n` + `shellcheck -S style` (v0.11.0, the CI gate) clean.
-- Local (Windows MSYS bash, pwsh 7.5.5): full unit suite **207 passed / 0 failed**
-  (fan-out auto-REDUCED under the box load). Sub-leg (a) passed via the memory route on
-  this UNLOADED box (`adoption went through the leaked-token memory (per-poll route ...)`),
-  confirming the normal path still fires and the new assertion accepts it; (b)(c)(d) green.
-- Diagnosis review (4 independent, all converged: test flake / product correct / Fix A):
-  my code-read + the verbatim leak.log, a fresh-context Claude subagent that did NOT read
-  the prior diagnosis, and a Codex foreign-model review. Codex additionally noted D1 is
-  already covered by Test 25's discovery-position matrix → option C (a new D1 sub-leg) is
-  redundant. (I verified Test 25 covers all 7 positions deterministically myself.)
-- Implementation review (2 independent, both clean / no findings): a fresh Claude reviewer
-  ("the change is correct ... no defect found") and Codex `exec` read-only ("None. The fix
-  is correct."). Both verified: grep patterns disjoint (BRE parens literal; `DISCOVERY-HOLD:`
-  needs an immediate colon, absent from the memory line), non-vacuity (a `DISCOVERY-HOLD`
-  line is logged one statement before the pure-assignment `_lock_take_hold`, so it reliably
-  implies a taken hold; backstopped by rc 0 + no-leftover + the feeder assertion), no new
-  race (greps run only after `wait "$w31"`), `$LOG` leg-dedicated (no cross-talk), and the
-  comment's sh:822/1382/1112/1114 line refs accurate.
-- Real proof pending: CI under both/load=2 where (a) previously failed (run 27626826865).
diff --git a/.plans/2026-06-18-ci-stress-canary-split-plan.md b/.plans/2026-06-18-ci-stress-canary-split-plan.md
deleted file mode 100644
index c90254b..0000000
--- a/.plans/2026-06-18-ci-stress-canary-split-plan.md
+++ /dev/null
@@ -1,158 +0,0 @@
-# Plan: extract the concurrency canary (Test 1) into its own suite file
-
-Status: **PROPOSAL (Phase 2) — for Ben's review.** Supersedes the sharding approach (the
-`GCL_TEST_SHARD` mechanism + the fixed-split balance plan), which has been **unwound** via
-explicit `git revert` (`89de803` + `143e280`; verified byte-identical to the pre-shard tree).
-No implementation until Ben's go.
-
-## Why
-The Windows-unit CI leg is the wall-clock bottleneck (~360s, ~2× the others) and **one test
-drives ~half of it**: Test 1, the full-width concurrency **canary** (25 workers × 8 rounds
-racing the lock), measures **~151s on the Windows runner** (the other 56 unit tests sum to
-~158s). It is *cheap* on Linux/macOS (fast process spawn) — pathological only on Windows.
-
-Rather than shard one file across runners (assignment machinery, a maintained split, a guard),
-**move the canary into its own file** so it runs as a naturally-parallel CI job. Same wall-clock
-win (~360s → macOS-gated ~210s or better) with **zero sharding machinery**. Test 1 is genuinely
-a *different kind* of test — a statistical concurrency canary ("repetition at width is its
-coverage") vs the targeted unit/steering tests — so the seam is natural, not arbitrary.
-
-## The extraction (mechanically clean — feasibility confirmed by exploration)
-**New file `tests/git-commit-lock.canary.test.sh`** — sources `tests/_harness.sh` like the other
-suites; copies the minimal preamble the canary needs and the Test 1 block **verbatim**:
-- Preamble to copy from `tests/git-commit-lock.test.sh`: the `set -uo pipefail` + shellcheck
-  disables; the `_HARNESS_DIR`/source idiom; `DIR`/`ROOT`/`LIB`; the `GCL_TEST_FULL` →
-  `GCL_MODE`/`T1_ROUNDS`/`T1_N` width block (only the `T1_*` knobs are needed); `WORK` +
-  `cleanup()` + `trap finish EXIT`; the `INCR` critical-section string (**used by Test 1 only**).
-- The **Test 1 `if section "Test 1: …"; then … fi` block moves verbatim** (it namespaces all its
-  files under `$WORK`; zero cross-test coupling — nothing else reads/produces its state).
-- Tail: `selector_report` + `DONE=1` + the `RESULT`/`1..$TAPN` lines + `[ "$FAIL" = 0 ]` (copy
-  from the unit suite's end). (`GCL_TEST_ONLY` is near-pointless in a one-test file but the call
-  is zero-cost and keeps the `finish`/zero-match scaffolding uniform.)
-  - **`ENV_WARN` (review catch):** the unit suite's `RESULT` line expands `$ENV_WARN`, which is
-    defined in the envelope section we are NOT copying — so under `set -u` the canary's RESULT
-    line would crash. Fix: define `ENV_WARN=0` near the canary's inits (the canary uses plain
-    `ok`/`bad`, no envelope), so the standard RESULT line works unchanged.
-- **Do NOT copy** the unit-file-local helpers the canary doesn't use: `clone_fn`+`export -f`,
-  `wait_for_file`, the `ok_envelope`/`bad_envelope` envelope tier, `T_AXIS_A`/sweep. (Verified
-  unused by Test 1.)
-
-**`tests/git-commit-lock.test.sh`:** delete the Test 1 block (lines of the `if section "Test 1:
-…"; then … fi`). The suite's count self-adjusts — `TAPN` is a running counter, so the `1..N`
-plan line and `RESULT` drop by Test 1's assertions automatically (no hardcoded total to edit);
-`DONE`/`finish`/`selector_report` are count-agnostic. `INCR` moves out with Test 1 (confirmed no
-other unit test uses it).
-
-## CI wiring (`.github/workflows/tests.yml`) — canary as its own cell on ALL arches
-Per Ben: run the canary in parallel on every arch (uniform; the extra POSIX job is cheap). Four
-suite files now; the `canary` leg is a separate cell on ubuntu, macOS, and Windows.
-
-Proposed `matrix.include` (7 test cells + `lint`):
-```yaml
-- { os: ubuntu-24.04,  leg: all,                  job_timeout: 35 }   # unit+interop+integration (NOT canary)
-- { os: ubuntu-24.04,  leg: canary,               job_timeout: 15 }
-- { os: macos-15,      leg: all,                  job_timeout: 35 }
-- { os: macos-15,      leg: canary,               job_timeout: 15 }
-- { os: windows-2025,  leg: unit,                 job_timeout: 20 }   # unit minus canary
-- { os: windows-2025,  leg: interop-integration,  job_timeout: 22 }
-- { os: windows-2025,  leg: canary,               job_timeout: 15 }
-```
-Step gating (so the canary runs in exactly one cell per arch, never doubled):
-- **New "Canary suite" step:** `if: ${{ matrix.leg == 'canary' }}` → `bash tests/git-commit-lock.canary.test.sh` (own `GCL_TEST_PRESERVE_DIR=…/failed-work/canary`; step `timeout-minutes` ~7 Windows / ~6 POSIX, sized from ~151s Windows + headroom).
-- **Unit step:** `if: ${{ matrix.leg == 'all' || matrix.leg == 'unit' }}` (unchanged form) → unit suite (now minus canary). So `leg: all` runs unit+interop+integration but **not** canary (its step only fires on `leg: canary`).
-- **Interop / Integration steps:** unchanged (`!cancelled() && (matrix.leg == 'all' || matrix.leg == 'interop-integration')`).
-- Job-name template + artifact name already key on `matrix.leg` → the `canary` leg gets a unique name/artifact for free (no shard suffix needed).
-
-Other CI bookkeeping:
-- Add `tests/git-commit-lock.canary.test.sh` to the **shellcheck file list** in the `lint` job.
-- Update the "Sourced by all three suites" comment in `_harness.sh` (and any "three suites" prose) → **four**.
-
-## Other workflow callers (review catch — the canary is now a 4th suite file)
-The canary currently runs **only** via `tests/git-commit-lock.test.sh`, which three other CI
-spots invoke. After extraction each must also run `tests/git-commit-lock.canary.test.sh`, or it
-silently loses the canary:
-- **`nightly.yml` stress cells** (run the unit suite under load): add the canary so it's still
-  stress-tested under oversubscription (concurrency + load is the highest-value canary scenario).
-  Run it in the relevant cells (sequentially after the unit suite is fine — nightly isn't
-  dev-blocking; no separate parallel cell needed there).
-- **`nightly.yml` kcov job** (measures `git-commit-lock.sh` line coverage from the unit suite,
-  gated at the **0.80** floor with only ~3pp headroom): **run the unit suite AND the canary file
-  under kcov (merged output)** so the canary's coverage contribution is preserved — otherwise
-  the floor could regress. (kcov merges multiple runs into one `--include-path` output dir.)
-- **`deep-sweep.yml`** (on-demand deep flake hunt under load+repeat): add the canary file — the
-  concurrency canary is exactly what a deep hunt should exercise.
-Principle: treat the canary like any new suite file — every workflow/job that enumerates the
-suites (and the shellcheck lint list) must include it. (`tests.yml` is the only one that gets the
-*parallel-cell* treatment, for the per-PR wall-clock win; the others just add the file to what
-they already run.)
-
-## Coverage-safety
-- **No test is lost or doubled:** Test 1 runs in exactly the `canary` cell on each arch; the
-  other 56 run in the `all`/`unit` cells. Union across cells == the original 57 on every arch.
-  (The canary step gates only on `leg == 'canary'`; the unit step never runs canary.)
-- **Verification (local proof, Phase-2 of impl):** (a) the new canary file runs standalone green
-  (Test 1's same assertions); (b) the unit suite runs green minus Test 1 (count = old 315 − Test
-  1's assertions); (c) canary-count + unit-count == the old 315 (no assertion lost); (d) interop
-  141/0, integration 12/0 unchanged; (e) `shellcheck -S style` clean (incl. the new file);
-  `actionlint` clean.
-- **Cross-platform CI** is the authoritative gate: all 7 cells green; the canary runs on each arch.
-
-## Predicted timings
-- Windows: `unit` (minus canary) ~158s ‖ `canary` ~151s ‖ `interop-integration` ~140s → Windows
-  wall-clock ≈ the slowest cell, **~174s** (incl. overhead), down from ~360s.
-- ubuntu/macOS: `all` (minus the now-tiny canary) ≈ unchanged-to-slightly-lower (~180/~190s) ‖
-  `canary` cheap (~tens of s).
-- **Overall CI gated by the slowest cell ≈ macOS `all` (~190–210s)** — the same win as sharding,
-  with no sharding machinery. (Exact numbers confirmed by the post-implementation CI run.)
-
-## Phasing (implementation — on Ben's go)
-1. Create `tests/git-commit-lock.canary.test.sh` (preamble + Test 1 verbatim + tail); delete the
-   Test 1 block from `tests/git-commit-lock.test.sh`; add the canary file to the shellcheck list;
-   fix the "three suites" → "four" comment.
-2. **Local proof** (the coverage-safety checks above) — canary standalone green, unit-minus-canary
-   green, counts reconcile to the old 315, lint clean.
-3. Rewire **all** workflows to include the canary file: `tests.yml` (7-cell matrix + the canary
-   step — the parallel-cell win); `nightly.yml` (add the canary to the stress cells + make the
-   kcov job run unit **and** canary under kcov, merged); `deep-sweep.yml` (add the canary to its
-   cells). `actionlint` clean on all three.
-4. Push + **CI verify** cross-platform: all 7 `tests.yml` cells green; confirm the ~174s Windows
-   leg and the macOS-gated overall time. (nightly/deep-sweep can't dispatch until on `main`, but
-   their canary wiring is statically validated; the kcov merged-coverage stays ≥ the 0.80 floor
-   since the same tests run, just split across two files.)
-5. Commit incrementally under the lock; ships on `ci-stress`, lands via the merge PR.
-
-## Logging / observability
-- The canary file keeps the standard `RESULT`/`1..$TAPN`/`finish`-sentinel output, so its CI job
-  log is self-describing. Per-test timing (if ever re-measured) uses the CI job-log timestamps,
-  as before.
-
-## Supersedes
-- `.plans/2026-06-18-ci-stress-windows-unit-shard-plan.md` (the `GCL_TEST_SHARD` mechanism) and
-  `.plans/2026-06-18-ci-stress-shard-balance-plan.md` (the fixed Test-1-vs-rest split) — both
-  obsoleted by this file-extraction approach; the sharding was unwound (`89de803`+`143e280`).
-  (Leave those plan files in place per "leave history be"; add a superseded-by pointer at their top.)
-
-## Results (CI verified — run 27728088150, all 8 jobs green)
-Implemented in `5fe15c9` (canary file + unit removal + harness) + `b1eb0a8` (CI wiring). Local
-proof passed (canary 2/0 standalone; unit-minus-canary 313/0; 313+2=315 disjoint/union==original;
-interop 141/0, integration 12/0; shellcheck + actionlint clean). Cross-platform CI **succeeded**:
-
-| | windows | macOS | ubuntu | overall (slowest) |
-|---|---|---|---|---|
-| **pre-shard** (`27716080146`) | unit **360s** | 194s | 182s | **360s** |
-| **sharding** (`27723744798`) | unit 242s ‖ 99s (imbalanced) | 210s | 181s | 242s |
-| **canary split** (`27728088150`) | unit **181s** ‖ canary 165s ‖ interop 130s | all 167s ‖ canary 33s | all 165s ‖ canary 18s | **~181s** |
-
-- **Overall CI 360s → ~181s (~50% faster)** — gated by the windows unit-minus-canary cell (181s),
-  with the windows canary (165s) well-balanced beside it.
-- macOS dropped 194→167s (the canary moved out of its `leg: all` into a cheap 33s cell); the
-  POSIX canary cells are cheap (ubuntu 18s, macOS 33s) and off the critical path.
-- **Beats the sharding (242s, imbalanced) AND is far simpler** — zero `GCL_TEST_SHARD` machinery;
-  the canary is just its own suite file. The sharding was unwound (`89de803`+`143e280`).
-- The kcov merged-coverage run and the nightly/deep-sweep canary steps are statically validated
-  (actionlint-clean); their first live exercise is post-merge (those workflows dispatch only from
-  the default branch).
-
-## Out of scope
-- Reducing the canary's own ~151s width (a test-design change — the width *is* its coverage;
-  worth a separate look, not here). Sharding/`GCL_TEST_SHARD` (removed). `n>2` (N/A — files, not shards).
diff --git a/.plans/2026-06-18-ci-stress-shard-balance-plan.md b/.plans/2026-06-18-ci-stress-shard-balance-plan.md
deleted file mode 100644
index 9188417..0000000
--- a/.plans/2026-06-18-ci-stress-shard-balance-plan.md
+++ /dev/null
@@ -1,128 +0,0 @@
-# Subplan: balance the Windows-unit shards with a fixed (measured) split
-
-**SUPERSEDED 2026-06-18** by `.plans/2026-06-18-ci-stress-canary-split-plan.md`. The "Test 1 vs
-rest" insight here was right, but the cleaner realization is to make Test 1 its own *file* (no
-sharding at all) — so the `GCL_TEST_SHARD` machinery was unwound (`89de803` + `143e280`) and the
-canary is extracted instead. Original status retained below for record.
-
-Status: **ENDORSED by Ben (2026-06-18) — split = "Test 1" vs "not Test 1"; implementing.**
-The change is a tiny assignment swap on the already-3-round-reviewed shard mechanism, so the
-local proof + CI run are the gates (no separate review rounds). Follow-on to
-`2026-06-18-ci-stress-windows-unit-shard-plan.md` (the shard *mechanism*, shipped in `a01a8e3`
-+ `2de66ff`). That used naive round-robin-by-index and balanced poorly in practice (242s vs
-99s). This plan replaces the *assignment* with a **fixed, measured split** — still a static
-deterministic assignment (no live cost-table maintenance, per Ben), but chosen to balance.
-No implementation until review converges + Ben's go.
-
-## Review issues (record at top; do not renumber on resolution)
-*(reviewers: add numbered findings here)*
-
----
-
-## The finding that drives the design (measured, not estimated)
-Per-test **full-mode Windows** durations, parsed from the green CI run `27723744798`'s job-log
-timestamps (each `== Test N ==` header line is timestamped; the delta to the next header is that
-test's duration; combined across both shard logs). Method is reproducible from the run log via
-`gh run view <id> --log`; raw table in `.agent-testing/shard-timing/` (gitignored).
-
-- **Test 1 (the 8×25 FULL-width concurrency canary) = ~151s — about HALF of the entire ~309s
-  suite.** It is one indivisible test.
-- The other 56 tests sum to ~158s; the next-largest are Test 22 (~20s), Test 2b (~12s), Test 17
-  (~9s), Test 33 (~8s), then a long tail ≤7s.
-- So the round-robin imbalance (shard1 odd = 226s vs shard2 even = 83s of test time) was **not**
-  "heavies scattered on odd indices" — it was **one dominant test (the canary, index 1 → shard
-  1)** plus the rest happening to land light on shard 2.
-
-**Consequences:**
-- A balanced n=2 split is nearly trivial: **canary alone on one shard (~151s), the other 56
-  tests on the other (~158s).** ~151 vs ~158 — well balanced.
-- Windows-unit leg wall-clock → ~**174s** (the slower shard: 158 + ~16s job overhead; the canary
-  shard is ~167s). That is **below macOS's 210s**, so macOS becomes the overall CI floor:
-  **overall 242s → ~210s** (the ~32s the previous plan predicted, now confirmed and explained).
-- **More shards don't help:** Test 1's 151s is an irreducible per-shard floor; n=3 still yields a
-  ~151s shard. So **n stays 2**.
-
-## Approach: a fixed, measured assignment (NOT round-robin, NOT a live cost table)
-Replace the round-robin gate with a **static per-test→shard assignment**, derived **once** from
-the measured costs by greedy LPT (longest-processing-time: sort tests desc, put each on the
-currently-lighter shard) and **frozen** into a small hard-coded list in `tests/_harness.sh`.
-
-For the current data the greedy result is essentially **shard 1 = {Test 1}; shard 2 = {all
-others}** (151 vs 158). Because shard 1's membership is tiny, encode it as "shard-1 label
-prefixes; everything else → shard 2":
-
-```sh
-# n=2 fixed split (measured 2026-06-18; re-tune if the per-shard wall-clock drifts — see below).
-# Test 1 (the FULL-width canary) is ~half the suite, so it gets its own shard.
-_shard_of() {   # echoes the shard (1..n) that owns the test label "$1"
-  case "$1" in
-    "Test 1:"*) echo 1 ;;
-    *)          echo 2 ;;
-  esac
-}
-```
-
-`section()` (still gated on `GCL_TEST_SHARD=i/n`, lazy-parsed, mutually exclusive with
-`GCL_TEST_ONLY` — all unchanged from the shipped mechanism) runs a block iff
-`[ "$(_shard_of "$1")" = "$SHARD_I" ]` instead of the round-robin residue test. The CI interface
-(`tests.yml` matrix passing `1/2` and `2/2`) is unchanged.
-
-### Why this is a "fixed split," not the rejected "cost-aware split"
-- It is a **static, hand-frozen assignment** set from **one** measurement — no per-run cost
-  computation, no maintained cost table, no dynamic bin-packing in the harness.
-- New/unknown tests fall to the **default shard (2)** — they always run (never dropped), and a
-  new *light* test just nudges shard 2 (already the heavier side by ~7s, but still well under
-  macOS's 210s). Only a new *heavy* test (or the canary changing) would need a re-tune, which the
-  drift log surfaces (below). That is occasional manual re-tuning, not continuous cost tracking.
-
-## Coverage-safety
-- **Partition by construction:** `_shard_of` is a total function returning exactly one shard per
-  label, so every test belongs to exactly one shard — union == full suite, no overlap, for any
-  membership list. (Same guarantee the round-robin had, via a different total function.)
-- **Empty-shard guard** (keep): in shard mode, `selector_report` bails if `SECTIONS_RUN < 1`
-  (a misconfigured shard with no members). The exact-count guard is dropped as near-tautological
-  (it recomputes `_shard_of`, the same function the gate uses — established in the mechanism
-  plan's round-2 review).
-- **One-time union proof** (the real partition check): run `GCL_TEST_SHARD=1/2` + `=2/2`, assert
-  their run-line sets (`^(PASS:|FAIL:|PASS\[env\]:|WARN\[env-relaxed\]:)`) union to the unsharded
-  set with **no duplicates** — catches any assignment bug (a label in both/neither shard).
-
-## Maintenance / drift (the low-maintenance story)
-- Each sharded run already logs `GCL_TEST_SHARD=i/n: ran R of T sections` and the CI job
-  duration is visible. If the two shards' wall-clock skews materially (say >25%), re-measure
-  (parse a fresh run log the same way) and adjust the `_shard_of` list. Expected cadence:
-  rarely — only when the canary's cost changes or a new ≥~30s test lands.
-- The measurement method is recorded above so a successor can regenerate the cost table.
-
-## Phasing (implementation)
-1. **`tests/_harness.sh`:** replace the round-robin residue gate in `section()` with
-   `_shard_of`; add the static `_shard_of` (current measured assignment). Drop the now-unused
-   round-robin residue arithmetic + the exact-count guard branch (keep the empty-shard guard).
-   `SECTION_IDX` is no longer needed for assignment — keep it only if still used elsewhere
-   (it isn't, post-change), else remove it and the `RAN:` marker stays shard-gated.
-2. **Local proof:** (a) unsharded byte-identical (315/0, 141/0); (b) `1/2` runs only Test 1
-   (1 section, ~the canary), `2/2` runs the other 56; union == unsharded, no dup; (c) empty/
-   malformed/mutual-exclusion bails unchanged; (d) `shellcheck -S style` clean.
-3. **CI verify:** dispatch `tests.yml`; confirm shard 1 ≈ shard 2 (~167s / ~174s incl. overhead),
-   overall CI ≈ 210s (macOS-gated), both green, full legs unchanged.
-4. Commit incrementally under the lock; ships on `ci-stress`, lands via the merge PR.
-
-`tests.yml` needs **no change** (the matrix already passes `1/2`/`2/2`); the assignment swap is
-entirely in the harness.
-
-## Logging / observability
-- Keep the per-shard verdict line (`ran R of T sections`) + the shard-gated `RAN:` marker.
-- The CI job-log timestamp method (above) is the standing way to re-measure per-test cost — no
-  permanent timing instrumentation needed (kept out to avoid output churn).
-
-## Related observation (out of scope here — flagging for a separate decision)
-The canary (Test 1) being **~50% of the whole suite** is the real cost driver; sharding only
-works *around* it. If its FULL width (8×25) could be reduced without losing meaningful
-concurrency coverage, that would lower the ~151s floor and help more than sharding — but that's
-a **test-design change** (the width *is* its coverage), so it's deliberately out of scope for
-this balance plan. Worth raising separately.
-
-## Out of scope
-- `n > 2` (Test 1's 151s floor makes more shards pointless), cost-aware/dynamic bin-packing
-  (rejected — this is the fixed alternative), sharding other legs/suites/kcov, or changing the
-  canary itself (above).
diff --git a/.plans/2026-06-18-ci-stress-windows-unit-shard-plan.md b/.plans/2026-06-18-ci-stress-windows-unit-shard-plan.md
deleted file mode 100644
index 3e5be59..0000000
--- a/.plans/2026-06-18-ci-stress-windows-unit-shard-plan.md
+++ /dev/null
@@ -1,306 +0,0 @@
-# Subplan: split the Windows unit CI leg into parallel shards
-
-**SUPERSEDED 2026-06-18** by `.plans/2026-06-18-ci-stress-canary-split-plan.md`. The sharding
-was unwound via explicit revert (`89de803` + `143e280`); we extract the canary (Test 1) to its
-own file instead — same CI win, zero sharding machinery. Original status retained below for record.
-
-Status: **CONVERGED (Phase 2) — 3 review rounds (Claude ×3 + Codex ×3); final Codex clean,
-"sound-to-implement". Ready for Ben's go on implementation.** A small
-follow-on to the Bucket-6 CI work, building on the `section()`/selector machinery (commit
-`4ee5899`) and the shared `tests/_harness.sh` (`b8e2951`). No implementation until the review
-converges and Ben gives the go.
-
-## Review issues (record at top; do not renumber on resolution)
-
-**Round 1 (2026-06-18)** — 2 fresh Claude reviewers (correctness/coverage; CI/simplicity) +
-independent Codex. Dispositions (all FIXED in the body below; a confirm round still follows):
-
-1. **[blocking — FIXED] Malformed `GCL_TEST_SHARD` not rejected → mid-suite crash.** The old
-   combined `case "$SHARD_I$SHARD_N"` digit check passed `1/`/`/2`/`/` (empty component), then
-   `[ "" -lt 1 ]`/`% ""` errored falsy under `set -uo pipefail` (no `set -e`) instead of
-   bailing. Codex also flagged **leading zeros** (`08/10`) as a bash-arithmetic **octal** trap.
-   **Fix:** validate with a single regex `^([1-9][0-9]*)/([1-9][0-9]*)$` (rejects empty,
-   non-digit, leading-zero, extra slashes in one shot), then the `i ≤ n` range check.
-2. **[blocking — FIXED] Guard vs `GCL_TEST_ONLY` composition.** The plan advertised AND
-   semantics, but the exact-count guard ignored the selector → false bail; both Claude-A and
-   Codex flagged it. Codex offered the simpler resolution, adopted: **`GCL_TEST_ONLY` and
-   `GCL_TEST_SHARD` are now mutually exclusive** (bail if both set). There is no real use case
-   for combining them, and it removes the guard-fallback edge case entirely — the exact-count
-   guard then *always* applies in shard mode.
-3. **[blocking (Codex, NEW) — FIXED] Eager parse bails the integration suite.** Parsing/bailing
-   `GCL_TEST_SHARD` at `_harness.sh` source-time runs for *all* suites, including integration
-   (which sources the harness before its note-and-ignore) — so malformed input would `exit 1`
-   integration instead of being ignored. **Fix: parse lazily** on the first `section()` call.
-   Integration never calls `section()`, so it neither parses nor bails; its note-and-ignore
-   just prints a notice if the var is set.
-4. **[non-blocking (Codex, NEW) — FIXED] `== Test N ==` headers are NOT a run-set.**
-   `section()` echoes the header *before* gating, so skipped sections print one too. The union
-   proof / per-shard logging must use **run-only** signals (the `PASS:`/`FAIL:` lines, which a
-   skipped test never emits) — optionally a run-only `RAN:` marker for attribution.
-5. **[FIXED] Guard must assert `expected ≥ 1`** — `n` > section-count (e.g. `58/58`) yields
-   `expected==0`, so the `0==0` check would pass silently green. Also: the *existing* `selector_report`
-   zero-match guard is gated on `GCL_TEST_ONLY` non-empty, so it does NOT cover pure-shard mode
-   — the new guard's `expected ≥ 1` does.
-6. **[FIXED] Unsharded runs stay byte-identical.** All shard logic gated on
-   `[ -n "$GCL_TEST_SHARD" ]`; the interop suite (shares `section()`/`selector_report`, never
-   sharded, on every leg) and unit-on-ubuntu/macos run exactly as today.
-7. **[FIXED] Guard rationale reworded.** It catches a **`section()`-coverage regression** (a
-   test added *outside* the gate), NOT a "modulo bug" (a wrong `%` would be *correlated* between
-   `section()` and the guard). The union proof is a one-time implementation sanity check (n=2),
-   secondary to the by-construction guarantee.
-8. **[FIXED] Job-count prose:** 4 test cells (+`lint`) = 5 jobs → 5 test cells (+`lint`) = 6
-   jobs; well under the concurrency ceiling.
-
-Round-1 verdicts: Reviewer A *needs-changes (1,2)*; Codex *not-sound-yet (1,2,3)*; Reviewer B
-*sound-to-implement*. All folded.
-
-**Round 2 — confirm (2026-06-18)** — fresh Claude (*sound-to-implement*) + independent Codex
-(*not-sound-yet*: 2 accuracy defects). All FIXED below:
-
-9. **[Codex — FIXED] Union-proof run-line set understated.** `PASS:`/`FAIL:` alone undercounts:
-   `ok_envelope` emits `PASS[env]:` and (relaxed) `bad_envelope` emits `WARN[env-relaxed]:`,
-   which the "sum to 315" relies on. Run-line set is
-   `^(PASS:|FAIL:|PASS\[env\]:|WARN\[env-relaxed\]:)` (or use `GCL_TAP=1`). (Verification runs
-   already used the full regex; this corrects the prose.)
-10. **[Codex — FIXED] Guard does NOT catch an ungated test (was overclaimed).** A test *outside*
-    a `section()` block bumps neither `SECTION_IDX` nor `SECTIONS_RUN`, so the guard stays
-    balanced. Reframed honestly: the guard's value is the **empty-shard `expected ≥ 1`** check
-    + a cheap modulo cross-check (otherwise near-tautological in shard mode). **An ungated test
-    is caught by the union proof's no-duplicate check** (it runs in *both* shards).
-11. **[Claude — FIXED] `RAN:` marker gated on `GCL_TEST_SHARD` set** (shard logic; an
-    unconditional emit would break unsharded byte-identicality).
-12. **[Claude — FIXED] Explicit `selector_report` shard-guard snippet** added (gated on
-    `[ -n "$GCL_TEST_SHARD" ]`).
-
-Plus a **kcov-interaction** note (Ben asked): the coverage job runs the full suite unsharded;
-the sharding code is inert when `GCL_TEST_SHARD` is unset — no interaction.
-
-**Convergence (REACHED):** round 3 — a final independent Codex spot-confirm — returned **no
-findings, "sound-to-implement"** (verified the run-line regex, the honest guard framing, the
-gated `selector_report` snippet's bash-correctness under `set -uo pipefail`, the shard-only
-`RAN:` marker, and the kcov note). The mechanism is verified sound across 3 rounds (Claude ×3 +
-Codex ×3). **Ready for implementation on Ben's go.**
-
----
-
-## Motivation
-The `windows-2025 unit` leg is the CI wall-clock bottleneck: a complete reduced-mode unit run
-is ~4m38s and the Windows leg is ~2× every other leg (interop ~100s, integration ~28s). A
-measured run shows `sys` time > `user` time → the cost is **process-spawn overhead** on the
-2-core Windows runner (each test spawns `bash $LIB` many times), not compute. So running the
-unit suite as **two parallel shards on two runners ~halves** that leg's wall-clock and speeds
-the per-PR dev-feedback loop. **CI-only** — sharding is opt-in via an env var, unset by default,
-so local dev runs are unaffected.
-
-## Decision context
-- **No branch protection** (Ben, 2026-06-18; single-dev project). So adding a matrix cell has
-  **zero required-context fallout** — no aggregator, no gating concern; `tests.yml` reports
-  per-cell contexts directly.
-- The enabling work is done: every unit test is a `section "Test N: …"`-gated block, proven
-  individually selectable with no cross-test ordering deps (the `GCL_TEST_ONLY` selector work).
-  A shard is just "run the subset of sections assigned to me," which slots into the same gate.
-
-## Mechanism: `GCL_TEST_SHARD=i/n`, round-robin, lazy-parsed in `section()`
-A new opt-in env var `GCL_TEST_SHARD=<i>/<n>` (e.g. `1/2`) handled in `tests/_harness.sh`.
-Key design choices (from review): **lazy parse** (so non-`section()` suites ignore it),
-**mutually exclusive** with `GCL_TEST_ONLY`, **regex-validated** (rejects empty/non-digit/
-leading-zero). ~15 lines:
-
-```sh
-# declarations near the GCL_* reads (NO eager parse — keeps integration unaffected):
-GCL_TEST_SHARD="${GCL_TEST_SHARD:-}"
-SHARD_I=0; SHARD_N=0; SECTION_IDX=0; SHARD_PARSED=0
-
-_shard_init() {                      # runs once, lazily, on the first section() call
-  SHARD_PARSED=1
-  [ -z "$GCL_TEST_SHARD" ] && return 0
-  if [ -n "${GCL_TEST_ONLY:-}" ]; then           # mutually exclusive (review #2)
-    echo "Bail out! GCL_TEST_ONLY and GCL_TEST_SHARD are mutually exclusive" >&2; exit 1
-  fi
-  if [[ "$GCL_TEST_SHARD" =~ ^([1-9][0-9]*)/([1-9][0-9]*)$ ]]; then   # review #1 (no empty/zero/octal)
-    SHARD_I=${BASH_REMATCH[1]}; SHARD_N=${BASH_REMATCH[2]}
-  else
-    echo "Bail out! GCL_TEST_SHARD must be i/n positive integers (got '$GCL_TEST_SHARD')" >&2; exit 1
-  fi
-  if [ "$SHARD_I" -gt "$SHARD_N" ]; then
-    echo "Bail out! GCL_TEST_SHARD=$GCL_TEST_SHARD out of range (need i<=n)" >&2; exit 1
-  fi
-}
-
-section() {
-  [ "$SHARD_PARSED" = 1 ] || _shard_init        # lazy: only suites that call section() parse
-  SECTION_IDX=$((SECTION_IDX + 1))              # file-order index, bumped for EVERY test
-  echo "== $1 =="
-  if [ -n "${GCL_TEST_ONLY:-}" ] && ! [[ "$1" =~ $GCL_TEST_ONLY ]]; then return 1; fi
-  if [ -n "$GCL_TEST_SHARD" ] && [ $(( (SECTION_IDX - 1) % SHARD_N )) -ne $(( SHARD_I - 1 )) ]; then
-    return 1
-  fi
-  SECTIONS_RUN=$((SECTIONS_RUN + 1)); return 0
-}
-```
-
-(`SECTION_IDX` bumps unconditionally in file order — independent of `GCL_TEST_ONLY`/
-`GCL_TEST_SWEEP`/`GCL_TEST_FULL` — so it is the stable shard-assignment key.)
-
-The verdict helper `selector_report` (already called by the unit + interop suites) gains a
-shard branch, **gated so unsharded runs are untouched** (no `% SHARD_N=0`):
-
-```sh
-# in selector_report, when sharding is active:
-if [ -n "$GCL_TEST_SHARD" ]; then
-  exp=0; k=1
-  while [ "$k" -le "$SECTION_IDX" ]; do
-    [ $(( (k-1) % SHARD_N )) -eq $(( SHARD_I - 1 )) ] && exp=$((exp+1)); k=$((k+1))
-  done
-  echo "GCL_TEST_SHARD=$SHARD_I/$SHARD_N: ran $SECTIONS_RUN of $SECTION_IDX sections (expected $exp)"
-  if [ "$SECTIONS_RUN" -ne "$exp" ] || [ "$exp" -lt 1 ]; then
-    echo "Bail out! shard $SHARD_I/$SHARD_N ran $SECTIONS_RUN, expected $exp" >&2; exit 1
-  fi
-fi
-```
-
-## Why round-robin (alternatives rejected)
-- **Round-robin by index (CHOSEN):** auto-balancing, **zero-maintenance** — new tests
-  distribute themselves. Measured imbalance ~10% at n=2 (well within "roughly halve"); the
-  heavy tests (Test 22 ~34s, 25, 1, 31, 33, 21, 2b, 17d) are scattered, so interleaving
-  balances them.
-- **Contiguous halves:** ~17%+ imbalance (heavy tests unevenly placed), same machinery. Rejected.
-- **Two explicit `GCL_TEST_ONLY` regex lists in the matrix:** a new test matching neither list
-  silently runs in no shard (coverage hole). Rejected.
-- **Splitting the file:** duplicates shared `clone_fn`/fixtures, doubles shellcheck entries. Rejected.
-
-## Coverage safety (the cardinal risk + the guarantee)
-The risk: a shard scheme that drops a test reads green → silent coverage hole.
-
-- **Primary guarantee — partition by construction.** Round-robin over the single stable
-  `SECTION_IDX` ordering assigns every section index to **exactly one** residue class. For any
-  `n`, the shards are a true partition (union == full, no overlap, no drops) — by construction,
-  as long as every test goes through `section()` (all 57 do).
-- **Self-contained per-shard guard (belt-and-suspenders).** In the suite verdict (extend
-  `selector_report`), when `GCL_TEST_SHARD` is set, compute
-  `expected = #{k in 1..SECTION_IDX : (k-1)%n == (i-1)}` and assert `SECTIONS_RUN == expected`
-  **and `expected ≥ 1`**; **bail loudly** otherwise. (Mutual exclusion of `GCL_TEST_ONLY`/
-  `GCL_TEST_SHARD` makes this always-valid in shard mode.) **What it actually catches, stated
-  honestly:** the high-value part is the **empty-shard misconfiguration** (`expected==0` when
-  `n` > section-count, e.g. `58/58`) via the `expected ≥ 1` clause; plus a cheap cross-check
-  that the gate's modulo and the verdict's modulo agree. It is otherwise **near-tautological**
-  in pure-shard mode (`SECTIONS_RUN` and `expected` both derive from the same `SECTION_IDX` via
-  the same arithmetic), and it does **NOT** catch a test added *outside* a `section()` block
-  (that bumps neither counter, so the accounting stays balanced) — that case is caught by the
-  union proof's no-duplicate check below. No cross-job artifacts, no unsharded baseline.
-- **Existing guards still apply per shard:** the `finish`/`DONE` sentinel (a shard that dies
-  early bails) and the `1..$TAPN` plan line (partial-but-correct per shard). Note the *existing*
-  `selector_report` zero-match guard is gated on `GCL_TEST_ONLY` non-empty, so it does NOT fire
-  in pure-shard mode — the new `expected ≥ 1` clause is what covers an empty shard.
-- **Local union proof (one-time implementation sanity check; secondary to the by-construction
-  guarantee — and the only thing that catches an ungated test).** Once during implementation,
-  run `GCL_TEST_SHARD=1/2` and `=2/2` and assert their **run-line sets** union to the full
-  unsharded set **with no duplicates**. The run-line set is the set of assertion lines matching
-  `^(PASS:|FAIL:|PASS\[env\]:|WARN\[env-relaxed\]:)`; these are run-only (a *skipped* test emits
-  none), whereas the `== Test N ==` headers do NOT work, since `section()` prints them before
-  gating. Note that `ok_envelope` emits `PASS[env]:` and relaxed `bad_envelope` emits
-  `WARN[env-relaxed]:`, so a bare `PASS:`/`FAIL:` grep would undercount the 315. Alternatively,
-  simply diff the `GCL_TAP=1` TAP counts. The **no-duplicate** half is what catches a test
-  accidentally left *outside* a `section()` gate (it would run in both shards → appear twice). Not a standing CI step.
-
-## Interaction with existing machinery
-- **`GCL_TEST_ONLY` vs `GCL_TEST_SHARD`: mutually exclusive** (bail if both set). No real use
-  case combines them, and exclusivity removes the guard's hardest edge case.
-- **`GCL_TEST_FULL` / reduced:** orthogonal — sharding partitions *which* sections run, not
-  *how*. The `SECTION_IDX` total (57) is identical full vs reduced, so the partition + guard are
-  mode-independent.
-- **`GCL_TEST_SWEEP` (Axis-A):** orthogonal — a sharded run still sweeps the Axis-A tests in its
-  shard. (Not combined in CI; harmless if ever combined.)
-- **Integration suite:** has no `section()`-wrapped blocks (one indivisible scenario). With
-  **lazy parse**, it never calls `section()` → never parses/bails `GCL_TEST_SHARD`. It should
-  **note-and-ignore** the var the same way it does `GCL_TEST_ONLY` (loud stderr note if set,
-  *without* parsing), using the harness-initialized `GCL_TEST_SHARD` (pre-set `""` so no
-  `set -u` trap).
-- **Unsharded runs stay byte-identical.** All shard logic is gated on `[ -n "$GCL_TEST_SHARD" ]`,
-  so the interop suite (shares the helpers, never sharded — every leg) and unit-on-ubuntu/macos
-  (`leg: all`, full) run exactly as today.
-
-## CI wiring (`.github/workflows/tests.yml`) — Windows unit only
-- Replace the single `{ os: windows-2025, leg: unit, job_timeout: 20 }` cell with **two** cells
-  carrying `shard: 1` / `shard: 2` (same `job_timeout`; keep the existing step timeout — a
-  half-run finishes well within it; generous-over-tight matches the repo's "backstop only"
-  philosophy and avoids flakiness).
-- The Unit-suite step sets `GCL_TEST_SHARD: ${{ matrix.shard && format('{0}/2', matrix.shard) || '' }}` — yields `1/2`/`2/2` on the shard cells and `''` (effectively unset, per the harness's `${GCL_TEST_SHARD:-}`) on every other cell, so ubuntu/macos `leg: all` and the windows interop-integration cell run the **full** unit suite unchanged. (`/2` is hardcoded; the harness is `n`-generic, so only this one CI string ties to 2 — easy to extend later. NB GHA treats `0` as falsy, so keep shard indices 1-based.)
-- **Artifact name** gains the shard: `test-logs-${{ matrix.os }}-${{ matrix.leg }}${{ matrix.shard && format('-{0}', matrix.shard) || '' }}` → `…-unit-1`/`…-unit-2` (v4+ rejects duplicate names); other cells' names are byte-identical to today.
-- The job-name template (already includes `leg`) gains the shard so the two unit jobs are distinguishable.
-- **Scope:** Windows unit **only**. Do NOT shard the fast legs (interop, integration, all of
-  ubuntu/macos) or `nightly.yml` (background, not dev-blocking; optional future).
-- **kcov coverage is orthogonal — leave it whole.** The kcov job (`nightly.yml`, Linux) runs
-  the **full unit suite unsharded** in one process, because line coverage of `git-commit-lock.sh`
-  is only meaningful measured across the whole suite in one run, and it's gated on the 0.80
-  floor. It never sets `GCL_TEST_SHARD`, and the sharding code is **inert when `GCL_TEST_SHARD`
-  is unset** (lazy parse → no shard gate), so the kcov run is byte-identical to today — no
-  interaction with this change. (If one ever wanted coverage *from* sharded runs, kcov can merge
-  per-shard output dirs, but that's strictly more machinery for no gain over the single whole
-  run — so we don't.)
-- **Runner budget:** 4 test cells + `lint` = 5 jobs today → 5 test cells + `lint` = 6 jobs;
-  well under GitHub's concurrency ceiling — no queueing.
-
-## Logging / observability (per engineering practices)
-- Each sharded run logs one greppable verdict line: `GCL_TEST_SHARD=i/n: ran R of T sections
-  (expected E)` — captured in the CI suite log (`tee … unit-suite.log`) and the uploaded
-  artifact, so a future agent can reconstruct which shard ran what.
-- For per-test attribution in a sharded run, `section()` emits a **run-only** marker
-  (e.g. `RAN: <label>`) **only when `GCL_TEST_SHARD` is set** (it is shard logic — an
-  unconditional emit would add lines to unsharded runs and break byte-identicality) — needed
-  because the `== Test N ==` headers print for *skipped* tests too (echoed before gating), so
-  they are not a run-set.
-- The guard's failure is a loud `Bail out! shard i/n ran R, expected E` → the step fails and
-  the per-shard CI job name (`… (unit, shard 1)`) makes the red attributable.
-
-## Phasing (implementation)
-1. **`_harness.sh`:** add the lazy `_shard_init` (regex-validated, mutually-exclusive with
-   `GCL_TEST_ONLY`) + `SECTION_IDX` + the `section()` shard gate + the run-only `RAN:` marker +
-   the `selector_report` expected-count/`expected ≥ 1` guard. Integration suite: add the
-   `GCL_TEST_SHARD` note-and-ignore (no parse).
-2. **Local proof:** confirm (a) default (no shard) byte-identical — unit 315/0, interop 141/0
-   (current counts); (b) `GCL_TEST_SHARD=1/2` + `=2/2` run disjoint halves whose **run-line
-   sets** (`^(PASS:|FAIL:|PASS\[env\]:|WARN\[env-relaxed\]:)`) union to the unsharded set
-   (sum to 315) with no dup, and whose section counts sum to 57; (c) the **union proof's
-   no-duplicate check** catches a test left *outside* a `section()` gate (it runs in both
-   shards) — the guard does NOT (an ungated test bumps neither counter); the guard bails when
-   `expected==0` (`58/58`); (d) malformed `GCL_TEST_SHARD` — `1/0`, `3/2`, `a/b`, `1/`, `/2`,
-   `2/3/4`, `08/10` — each bails cleanly, and `GCL_TEST_ONLY`+`GCL_TEST_SHARD` together bails;
-   `''` is a no-op; (e) integration with `GCL_TEST_SHARD` set prints the ignore note and runs
-   all 12; (f) `shellcheck -S style` clean.
-3. **`tests.yml`:** split the windows-unit cell into shard 1/2 (env + artifact name + job name).
-   `actionlint -shellcheck=` clean.
-4. **CI verification:** dispatch `tests.yml`; confirm both Windows-unit shards are green, each at
-   roughly half the unsharded leg's wall-clock, artifact names unique, and the full legs
-   (ubuntu/macos/windows-interop) unchanged.
-5. Commit incrementally under the lock; ships with `ci-stress` and lands on `main` via the same
-   merge PR.
-
-## Results (CI verification, 2026-06-18 — run 27723744798, all green)
-Implemented in `a01a8e3` (harness mechanism) + `2de66ff` (tests.yml). Local proof passed
-(unsharded byte-identical 315/141; shards disjoint, union==unsharded no-dup, 148+167=315 /
-29+28=57; malformed bails; lint clean). CI cross-platform run **succeeded**, both shards green:
-
-| | windows-unit | macos | ubuntu | win-interop | overall (slowest) |
-|---|---|---|---|---|---|
-| **before** (`27716080146`) | **360s** | 194s | 182s | 140s | **360s** |
-| **after** (`27723744798`) | shard1 **242s** ‖ shard2 **99s** | 210s | 181s | 142s | **242s** |
-
-- **Overall CI 360s → 242s (≈33% faster); windows-unit is no longer the ~2× outlier** (242s ≈
-  macos 210s). The stated goal (fixing windows-unit taking "twice as long as everything else") is met.
-- **Balance was poor: 242 vs 99 (≈2.4×), NOT the planned ~10%.** Root cause: the ~10% estimate
-  used **reduced-mode** per-section timings, but CI runs **full mode** (`GCL_TEST_FULL=1`), where
-  the full-only 8×25 canary (Test 1 → index 1 → shard 1) and other heavies cluster in shard 1.
-  **Lesson: estimate shard balance from the mode CI actually runs.**
-- **Decision — accept as-is (recommended):** a perfectly balanced split (~170/170) could not beat
-  **macos's 210s**, which becomes the floor, so re-balancing would gain only ~32s more (242→210)
-  while reintroducing the maintained cost-table this plan deliberately rejected. The 118s win is
-  already captured; round-robin's imbalance is an acceptable, zero-maintenance trade. (Mechanism
-  is correct + green regardless of balance.)
-
-## Out of scope
-- Sharding the interop/integration suites or the nightly/deep-sweep tiers; `n>2` or cross-OS
-  extension (the harness is already `n`-generic — only the CI string is 2-bound).
-- Cost-aware (greedy) sharding: ~0% imbalance but needs a maintained per-test cost table;
-  round-robin's measured imbalance (242s vs 99s in full mode) is accepted as the maintenance-free trade.
-- Any product-code change. Test-harness + CI only.
diff --git a/AGENTS.md b/AGENTS.md
deleted file mode 100644
index c9186db..0000000
--- a/AGENTS.md
+++ /dev/null
@@ -1,137 +0,0 @@
-# AGENTS.md — CI flakiness stress hunt (branch `ci-stress`)
-
-> This branch exists to **flush out CI flakiness** in the test suites by running them
-> on GitHub Actions many times, under artificial load, and fixing every flake found via
-> a formal loop. Written 2026-06-16 so the mission + process survive context compaction.
-> A successor instance: read this top-to-bottom, then check `.agent-testing/` for live state.
-
-## Mission (Ben, 2026-06-16)
-Run the `tests` workflow on `ci-stress` repeatedly until **50 clean runs in a row**. Each
-time a run fails, fix the flake with the formal loop below, reset the streak to 0 (we want
-50 clean on the *fixed* code), and resume. Ben also asked to run under **CPU + disk load**
-to surface load-sensitive flakes faster.
-
-**NO CREDITS / NO BUDGET LIMIT — DON'T PAUSE FOR "CREDITS".** Ben (2026-06-17, explicit):
-there are no credits to worry about — this is a PUBLIC repo, so we can run UNLIMITED CI,
-capped only by GitHub concurrency (excess just queues — throughput, not cost). Keep going;
-dispatch freely; run full review loops. Only surface a genuine blocker or a real decision
-for Ben.
-
-## The formal diagnosis→fix loop (run on EVERY failure)
-1. **Capture** the failure: which leg/suite/test, the assertion, logs + preserved
-   artifacts. Save under `.agent-testing/failures/<run_id>/` (or `interop-fail-*.log`).
-2. **Diagnose** — spawn a subagent (fresh context) to root-cause from the evidence + the
-   code. Give it the evidence, WITHHOLD your own conclusion (let it reason independently).
-3. **Independent review of the diagnosis** — get a *foreign model* (Codex) to verify the
-   diagnosis against the code (uncorrelated with Claude). `codex exec --sandbox read-only
-   -c service_tier=default - < prompt > out.md` (NO `-o` — it corrupts output; capture stdout).
-4. **Classify**: test-flake (timing assumption breaks; product is correct) vs product bug.
-5. **Plan** the fix in `.plans/YYYY-MM-DD-ci-stress-<task>-plan.md`; commit it.
-6. **Plan review/fix rounds until clean** — fresh Claude reviewer AND Codex each round;
-   block ONLY on real design defects (not plan-doc pedantry); iterate until both CONVERGE.
-   Verify every reviewer finding against the actual code yourself (reviewers are fallible
-   and Claude-correlated).
-7. **Implement** the fix (test or product). `bash -n` + `shellcheck -S style` (v0.11.0 —
-   the CI gate) must stay clean. Run the affected suite locally to confirm.
-8. **Implementation review/fix rounds** — fresh Claude reviewer + Codex on the diff; clean.
-9. **Commit** to `ci-stress` under the git commit lock (`~/.local/bin/git-commit-lock.sh
-   run -- ...`, stage only your paths), **push**, mark the plan DONE + changelog.
-10. **Reset** the streak (`rm .agent-testing/clean_count`) and **resume** the driver.
-
-Quality bar (Ben): "I'm intending this library to be great" — spend tokens on rigor;
-don't cap review rounds for cost; a wrong fix that resurfaces is worse than slow.
-
-## Mechanics (all under the `ci-stress` worktree)
-- Worktree: `C:/agent_data/commit-lock/worktrees/ci-stress`. Repo public: `bentoner/git-commit-lock`.
-- **Auth**: `GH_TOKEN=$(printf 'protocol=https\nhost=github.com\n\n' | git credential fill | grep '^password=' | cut -d= -f2-)`. `gh` is at `~/scoop/shims` (add to PATH).
-- **Stress-only commits — DO NOT MERGE to main**: the workflow `concurrency` tweak
-  (unique-per-run group, so parallel dispatches don't cancel) and `tests/with-load.sh` +
-  the workflow's load wiring (inputs `stress_kind`/`stress_load`, wrapped suite steps,
-  raised timeouts). Any *test/product fixes* ARE normal mergeable commits.
-- **Driver**: `.agent-testing/driver.sh` — keeps `MAXC=5` runs in flight via
-  `workflow_dispatch` (with `-f stress_kind=$STRESS_KIND`), polls, records
-  `results.tsv`/`clean_count`/`status.txt`, and EXITS on the first failure (sentinel
-  `FAIL:<id>`, captures diagnostics) or at `TARGET` (sentinel `DONE`). Launch:
-  `cd .agent-testing && rm -f clean_count sentinel STOP && STRESS_KIND=both TARGET=50 bash ./driver.sh` (background).
-- **Load**: `tests/with-load.sh` wraps each suite, spawning N CPU spin-loops and/or N disk
-  create/write+fsync/delete loops (`GCL_STRESS_KIND`, `GCL_STRESS_LOAD`). Hogs reaped by
-  exact PID. The runner is 4-core; `load=4` saturates it.
-- **Flake-condition meter**: Test 17d's `note: T17d outcomes rc0=.. rc1=.. rc97=.. rc98=..
-  ; WAITING=..` line (in each unit-leg log) shows how hard load is biting (rc97 dropping /
-  rc0 rising == the original flake condition). Read it to confirm load is effective.
-
-## Process hygiene (LEARNED THE HARD WAY 2026-06-16)
-- **`TaskStop` does NOT kill a background bash script** — it keeps running and dispatching.
-  After stopping, VERIFY via `powershell Get-CimInstance Win32_Process -Filter
-  "Name='bash.exe'"` (match CommandLine on `driver.sh`/`calibrate.sh`) and
-  `taskkill //F //T //PID <winpid>` the SPECIFIC pid. The driver also honors a graceful
-  **STOP file**: `touch .agent-testing/STOP` → it cancels inflight and exits (sentinel STOPPED).
-- **Exactly ONE dispatcher alive at a time.** A surviving zombie + a relaunch = two
-  dispatchers racing on `ci-stress` (this corrupted a calibration run-id correlation).
-- **NEVER blanket-kill** by name (`Stop-Process -Name`, `taskkill /IM`, `pkill`) — Ben's
-  box is shared; kill only specific PIDs you spawned.
-
-## Progress log
-- **Test 17d (unit, `git-commit-lock.test.sh`)** — `got97>=1` was timing-fragile
-  (windows-unit flaked at normal load, run 27616343269). FIXED (commit 58c3741): replaced
-  with rc∈{0,1,97,98} + drop-free `WAITING>=1` anti-vacuity canary + `note:` meter.
-  Diagnosis+plan+impl all reviewed clean by Claude+Codex. See the plan in `.plans/`.
-- **Test 5 (interop, `git-commit-lock.interop.test.sh`)** — FOUND under CPU load (3/3 cpu
-  runs): `FAIL: expected a tok.ps.* token on line 1 of the orphan lock, got ''`. Mechanism
-  (diagnosis + Codex, NOT "token not-yet-visible"): `kill -9 "$hpid"` missed the native
-  pwsh (MSYS `$!` is a shim), so pwsh ran its full `Start-Sleep 60` and exited gracefully,
-  firing the `PowerShell.Exiting` backstop that DELETED its own lock — so the read hit a
-  gone file; `backdate`(touch) then re-created it empty, making the 3 "steal" PASSes
-  vacuous. Test bug, product correct. FIXED (commit 06c6d8e): holder now self-exits
-  via `[Environment]::Exit(0)` (bypasses release + backstop) leaving a deterministic
-  token'd orphan — no kill. Reviewed clean Claude+Codex; local interop 141/0.
-- **Calibration finding (load=4 on a 4-core runner):** `cpu` reliably breaks interop Test 5
-  (above) and otherwise the unit suite is fine. `disk` shifts Test 17d toward the acquire
-  regime (rc0 up to 4/12 — Ben's disk instinct was apt) but nothing fails. `both` (8 hogs
-  on 4 cores) is the most extreme and additionally trips TWO unit tests only under that
-  pathological oversubscription: `recovery took 33s (>20s)` (+ "rc=97 behind a crashed
-  claim" / "no STOLE-BY-CLAIM") and `claim-path warning fired 0 times (want 1)`. These two
-  are SUSPECTED load-too-high artifacts (tight internal budgets exceeded by 2x CPU
-  oversubscription + heavy disk), NOT yet confirmed genuine. STATUS: to classify before the
-  50-clean hunt — decide hunt load level (cpu-only vs moderate both) and whether to harden
-  those two budgets. Data: `.agent-testing/calibration.tsv`.
-- **Test 31(a) (unit, `git-commit-lock.test.sh`)** — FOUND on **ubuntu** under both/load=2
-  (moderate, genuine), run 27626826865: `FAIL: no leaked-token-memory DISCOVERY-HOLD`.
-  Mechanism: the product has two valid DISCOVERY-HOLD adoption paths — direct
-  `_lock_discover` (sh:822) and the per-poll leaked-token-memory check (sh:1382). 31(a)'s
-  external `mv` (installs the leaked claim at the lock path) RACES the leaver's inline
-  `_lock_discover` (called one statement after the leak-add: sh:1112 -> sh:1114); under
-  load the mv landed first, so 822 adopted instead of the 1382 memory path the assertion
-  pinned. Product correct (rc 0, clean release, no leftover all PASSed); test-orchestration
-  race. **FIXED (commit 51a1753):** Fix A — sub-leg (a)'s assertion now accepts EITHER
-  DISCOVERY-HOLD route and records which fired (memory route still pinned deterministically
-  by 31(b); direct route by Test 25's 7-position discovery matrix, so no coverage lost).
-  Diagnosis converged across 4 independent reviews (code-read + leak.log + fresh Claude
-  subagent + Codex); impl reviewed clean by Claude + Codex; local unit suite 207/0. See
-  `.plans/2026-06-17-ci-stress-test31a-flake-plan.md`. Real proof pending: CI under load.
-
-## Coverage work (not a flake — Ben asked, 2026-06-17)
-- **F2 read-back lane (commit 19a28fd):** a coverage audit (subagent + my code verification)
-  found the steal-path acquire read-back-verification failure lane uncovered — the stealer
-  WINS the claim race AND the rename-over (`STOLE-BY-CLAIM` logged, ghost destroyed) but the
-  post-rename read-back (`git-commit-lock.sh:1171`) reads the wrong token → must re-enter wait,
-  not false-hold. Its create-path twin (`:1358`) was covered by Test 32; F2 was not. Added
-  **Test 32b** (deterministic; mirrors Test 32 with the inverse `[ -n "$_LOCK_CLAIM_TOKEN" ]`
-  gate to land the fault at the steal read-back). Reviewed clean by fresh Claude + Codex;
-  suite 0-failed; F2 lane empirically exercised. Plan:
-  `.plans/2026-06-17-ci-stress-test-f2-coverage-plan.md`. Product unchanged (F2 reads correct;
-  this was regression-exposure, not a bug). Audit also flagged LOWER-priority gaps left for
-  Ben: A2/G2 (a non-file appearing AT the lock path mid-steal — `CLAIM-ABORT (wrong-type)` /
-  `(rename-refused)`), and that feeder-#3/blocked-unlink legs are Windows+pwsh-only.
-
-## Hunt status (as of 2026-06-17 ~03:20 local)
-- The `both`/load=2 hunt reached **40/50 clean** on the post-31(a)-fix tree (810ee41) with
-  ZERO failures, then I gracefully STOPped it to fold in the Test 32b coverage addition.
-  Restarted at **0/50 on the final tree 19a28fd** (with Test 32b) — a test-only change resets
-  the streak per the "50 clean on the current tree" rule, so the contiguous-50 is measured on
-  the final suite. Load=2 (4 hogs/4 cores) avoids the 8-hog budget artifacts (Test 21/22a).
-- To resume after any halt: `cd .agent-testing && rm -f clean_count sentinel STOP &&
-  STRESS_KIND=both STRESS_LOAD=2 TARGET=50 bash ./driver.sh` (background). First verify no
-  stray dispatcher + current HEAD (see Process hygiene).
-- THREE flakes fixed & pushed this session: Test 17d (58c3741), interop Test 5 (06c6d8e),
-  Test 31(a) (51a1753). Plus one coverage addition: Test 32b / F2 (19a28fd).