From 980856bc1488e8266507579bac427910c7028093 Mon Sep 17 00:00:00 2001 From: Ben Toner Date: Tue, 16 Jun 2026 21:32:45 +1000 Subject: [PATCH 01/76] CI(stress-branch): unique-per-run concurrency group for parallel dispatch Stress-test-only change so many workflow_dispatch runs execute in parallel on this single branch without cancel-in-progress killing each other. Do NOT merge to main. Co-Authored-By: Claude Opus 4.8 (1M context) --- .github/workflows/tests.yml | 8 ++++++-- 1 file changed, 6 insertions(+), 2 deletions(-) diff --git a/.github/workflows/tests.yml b/.github/workflows/tests.yml index 3d1424c..c9a99da 100644 --- a/.github/workflows/tests.yml +++ b/.github/workflows/tests.yml @@ -19,8 +19,12 @@ on: workflow_dispatch: concurrency: - group: ${{ github.workflow }}-${{ github.ref }} - cancel-in-progress: true + # STRESS-BRANCH ONLY — do NOT merge to main. The per-run-unique group + no + # cancellation lets many workflow_dispatch runs execute in parallel on this one + # branch (flakiness stress test). On main the group is + # `${{ github.workflow }}-${{ github.ref }}` with cancel-in-progress: true. + group: ${{ github.workflow }}-${{ github.ref }}-${{ github.run_id }} + cancel-in-progress: false permissions: contents: read From 5dec024e18e572e442991b09a1fa98542ee6fc47 Mon Sep 17 00:00:00 2001 From: Ben Toner Date: Tue, 16 Jun 2026 22:33:45 +1000 Subject: [PATCH 02/76] Plan: de-flake Test 17d got97 assertion (CI stress find) Diagnosis+Codex review: windows-2025 unit flake is a timing-fragile self-validation assertion, not a product bug. Plan replaces got97>=1 with rc-in-{0,97,98} + a WAITING- based anti-vacuity canary; keeps the warn17d TOCTOU regression guard untouched. 
Co-Authored-By: Claude Opus 4.8 (1M context) --- ...2026-06-16-ci-stress-test17d-flake-plan.md | 123 ++++++++++++++++++ 1 file changed, 123 insertions(+) create mode 100644 .plans/2026-06-16-ci-stress-test17d-flake-plan.md diff --git a/.plans/2026-06-16-ci-stress-test17d-flake-plan.md b/.plans/2026-06-16-ci-stress-test17d-flake-plan.md new file mode 100644 index 0000000..87b31dc --- /dev/null +++ b/.plans/2026-06-16-ci-stress-test17d-flake-plan.md @@ -0,0 +1,123 @@ +# Plan: de-flake Test 17d (`got97 >= 1`) in the unit suite + +Status: DRAFT — awaiting review (Claude reviewer + Codex), then implement. + +## Reviewer notes (add at top; do not renumber) +_(none yet)_ + +## Context +CI stress test (ci-stress branch, 2026-06-16): 29 identical green runs, then run +27616343269 failed only on `windows-2025 (unit)` with one assertion in +`tests/git-commit-lock.test.sh` Test 17d: + +``` +PASS: 12 waiters polled through churn with ZERO spurious non-lock warnings +FAIL: no waiter reached 97 under churn (got97=0/12) — timeout lane bypassed? +``` + +Diagnosis (Claude subagent) + independent review (Codex) — both in +`.agent-testing/failures/27616343269/{DIAGNOSIS.md,codex-diag-review.md}`: + +- **Root cause.** The Windows pwsh churner (`tests/git-commit-lock.test.sh:925-931`) + does `WriteAllText → Delete` with **no present-hold**, unlike the POSIX perl churner + which sleeps 2ms present each iteration (`:944-947`). On the loaded 2-core + windows-2025 VM, per-iteration pwsh/.NET overhead widened the *absent* + (Delete→next-Write) window past the 20ms poll interval, so all 12 waiters won an + ordinary `O_EXCL` create-race in an absent window (`git-commit-lock.sh:1323-1356`) + and exited rc 0 — none reached the `MAX_WAIT=2` timeout, so `got97=0`. Proof: every + waiter in `churn.log` carries its **own** `tok....` token (not the churner's + `tok.churn.1.1`) and there are no steal/TIMEOUT lines; the leg ran 17d in 4.4s + (too short for twelve 2s timeouts). 
+- **Classification: test-flake, not a product bug.** Acquiring during a genuinely + absent window is correct behavior. `got97 >= 1` is a *self-validation* guard (was + the timeout lane exercised?), not a product requirement. In this test shape rc ∈ + {0 (create-win), 97 (timeout), 98 (churner overwrote the hold before release — + designed theft detection; present in this run, waiter 36836 / `t17d.3.3.err`)} are + **all** correct outcomes. Which one occurs is machine-speed luck. + +The real regression Test 17d guards — `warn17d == 0`, the per-poll non-lock-warning +TOCTOU guard — PASSED and is untouched by this plan. + +## Goal +Make Test 17d non-flaky across fast and slow runners **without weakening the +`warn17d == 0` regression guard**, while keeping a real anti-vacuous-pass canary so a +dead/absent churner can't let the test pass without exercising the guarded poll path. + +## Fix (replaces the single `got97 >= 1` assertion; keeps everything else) +Within the `for r in 1 2 3` waiter loop, replace the `got97` accumulation and its +assertion with three assertions: + +1. **Regression guard — unchanged.** `warn17d == 0` ("12 waiters polled through churn + with ZERO spurious non-lock warnings"). Keep verbatim. + +2. **Every waiter reaches a designed terminal state.** Accumulate each waiter's rc; + require all 12 ∈ {0, 97, 98}. Any other rc (crash, 96 config error, 99, …) ⇒ `bad`, + listing the offending `round.idx=rc`. This is *stricter* than the old test, which + ignored every rc except 97. + +3. **Anti-vacuity: contention actually happened (the guarded path ran).** Require + `grep -c 'WAITING for lock' "$LOG" >= 1`. `WAITING` is logged **only** after a + waiter's create was blocked by a present file (`git-commit-lock.sh:1363-1370`), + immediately before the per-poll type-guard loop (`:1388-1570`) that `warn17d` + guards — so ≥1 `WAITING` proves at least one waiter entered the exact path under + test. A dead/absent-only churner produces 0 `WAITING` and fails this. 
Threshold is + **≥1** (the weakest non-vacuous signal) to stay robust on absent-dominant runners; + the failing run already had 9 `WAITING` lines, so ≥1 has wide margin both ways. + +### Why ≥1 WAITING is robust (not a new flake) +`WAITING` count is machine-dependent in the *opposite* direction to `got97`: a +present-dominant (fast) runner blocks most waiters (lots of WAITING, got97 high); an +absent-dominant (slow) runner lets waiters acquire (fewer WAITING, got97 low) — but +even the worst observed case (this failure) still logged 9 WAITING. The only way to +get 0 WAITING is no contention at all (churner never ran / always absent), which is +exactly the vacuity we want to fail on. So ≥1 has margin on both ends; no threshold +near the machine-variance band is introduced. + +### Secondary hardening (cheap, include if clean) +- **Churner readiness proves churn began.** Today the start marker is written *before* + the loop (`:926`), so "started" doesn't prove a single cycle ran. Move the start-marker + write to *after* the churner's first successful write+delete cycle (both pwsh and perl + branches) so `wait_for_file "$START"` implies the churn loop is actually turning over. +- **Churner alive at reap.** Capture `kill -0 "$churn_pid"` right before `touch "$STOP"`; + assert it was alive ⇒ catches a churner that crashed mid-test (another vacuity route). + This is non-flaky: the churner loops 2,000,000× and the test lasts ~4-6s, so it is + always alive at reap unless it actually crashed. + +If either hardening proves fiddly or risks its own flake, the plan's load-bearing fix +is assertions 1-3 alone; the start-marker move and alive-check are defense-in-depth and +can be dropped without losing the de-flake. (Decide during implementation; record in +changelog.) + +## Observability (per logging practice) +Keep the data that made this diagnosable: emit a `note:` line with the rc distribution +and the WAITING count every run, e.g. 
+`note: T17d outcomes rc0=$n0 rc97=$n97 rc98=$n98 other=$nother; WAITING=$waited` — so a +future failure can be classified from the suite log without re-deriving it. (The old +test discarded this.) + +## Out of scope / explicitly NOT changed +- The `warn17d`/TOCTOU regression logic and its assertion. +- The churner shapes' core (pwsh on Windows, perl elsewhere) beyond the start-marker move. +- Product code (`git-commit-lock.sh`) — no product defect found. +- The `.ps1` port and other suites — Test 17d is bash-unit-only. + +## Testing +1. **Static:** `bash -n tests/git-commit-lock.test.sh`; shellcheck `-S style` (the CI + lint gate) on the test file — must stay clean. +2. **Local sanity (Windows, this box):** run Test 17d in isolation a handful of times via + the suite's single-test selector if present, else the whole unit suite once, in + `.agent-testing/` — confirm it passes and the new `note:` line shows a sane rc/WAITING + mix. (Local box is faster/less loaded, so it will likely be present-dominant — expect + high got97; that's fine, the test no longer asserts on it.) +3. **Real proof = CI stress.** The genuine signal is the GitHub windows-2025 (unit) leg + under load. After implementing, resume the stress driver (streak reset to 0) and + require the previously-flaky path to survive the run to 50 clean. If 17d flakes again + we re-open. + +## Rollout +Commit the test fix to `ci-stress` (under the git commit lock). This is a normal, +mergeable fix (unlike the stress-only concurrency commit 980856b). Reset +`clean_count`, relaunch the driver, continue toward 50 clean in a row. 
+ +## Changelog (implementation) +_(to be appended during implementation)_ From 9f76c292f117d26df8ebec2d4b366c0e961e46ff Mon Sep 17 00:00:00 2001 From: Ben Toner Date: Tue, 16 Jun 2026 22:48:20 +1000 Subject: [PATCH 03/76] Plan v2: address review round 1 (rc-set {0,1,97,98}; drop-free WAITING canary) Round-1 review (Claude + Codex, verified in code): rc-set must include 1 (lock_run demotes a clean command to 1 on unverifiable-empty release); WAITING canary must read per-waiter logs, not the shared churn.log (concurrent appends drop lines). Secondary hardenings dropped. See reviewer notes at top of plan. Co-Authored-By: Claude Opus 4.8 (1M context) --- ...2026-06-16-ci-stress-test17d-flake-plan.md | 87 +++++++++++++------ 1 file changed, 60 insertions(+), 27 deletions(-) diff --git a/.plans/2026-06-16-ci-stress-test17d-flake-plan.md b/.plans/2026-06-16-ci-stress-test17d-flake-plan.md index 87b31dc..c5fcbe7 100644 --- a/.plans/2026-06-16-ci-stress-test17d-flake-plan.md +++ b/.plans/2026-06-16-ci-stress-test17d-flake-plan.md @@ -3,7 +3,33 @@ Status: DRAFT — awaiting review (Claude reviewer + Codex), then implement. ## Reviewer notes (add at top; do not renumber) -_(none yet)_ +Round 1 — fresh Claude reviewer + Codex (both independent), findings verified by me +against the product code: + +1. **[BLOCKING — fixed in plan v2] rc-set `{0,97,98}` is not exhaustive of correct + outcomes → must be `{0,1,97,98}`.** Under this churn a clean `true` whose release + reads the held lock EMPTY (the churner's create→write window) gets release rc 2, + which `lock_run` maps to **rc 1** (`git-commit-lock.sh:1739-1744`). rc 1 is the + documented "ownership unverifiable, successful command demoted" outcome — correct, + not a defect. Verified. The original `{0,97,98}` was the *same class* of + timing-fragile assumption as the bug being fixed. Fixed below. +2. 
**[BLOCKING — fixed in plan v2] the `WAITING` canary must not read the SHARED log.** + Plan v1 grepped `WAITING` from the single shared `churn.log` (line 916), but the + suite itself documents `# per-waiter logs: concurrent appends to one log drop lines` + (`tests/git-commit-lock.test.sh:258`) and uses per-waiter logs elsewhere for exactly + this reason. A shared-log `WAITING` count can under-count under concurrency and the + canary would itself flake. Fixed: give each waiter its OWN `AGENT_LOCK_LOG` + (single-writer ⇒ drop-free), count `WAITING` across those, and concatenate them into + `churn.log` afterwards so the preserved artifact is unchanged. +3. **[disposition] Secondary hardenings DROPPED.** Reviewers flagged the + start-marker-after-first-cycle and alive-at-reap hardenings as needing care (the + alive check can false-fail if the churner's iteration cap is ever hit; both add + machinery to a delicate timing path). They are also largely redundant with the + drop-free `WAITING>=1` canary, which already proves the churner produced contention. + To keep the change minimal and the timing path untouched, v2 drops both. The + load-bearing fix is assertions 1-3. +4. **[non-blocking, adopted] observability buckets** updated to `rc0/rc1/rc97/rc98/other` + and emitted unconditionally (pass and fail), so a drift toward an edge is visible. ## Context CI stress test (ci-stress branch, 2026-06-16): 29 identical green runs, then run @@ -43,20 +69,33 @@ Make Test 17d non-flaky across fast and slow runners **without weakening the `warn17d == 0` regression guard**, while keeping a real anti-vacuous-pass canary so a dead/absent churner can't let the test pass without exercising the guarded poll path. 
-## Fix (replaces the single `got97 >= 1` assertion; keeps everything else) -Within the `for r in 1 2 3` waiter loop, replace the `got97` accumulation and its -assertion with three assertions: +## Fix (v2) — replaces the single `got97 >= 1` assertion; keeps everything else +**Structural A — per-waiter lock logs (drop-free).** Today all 12 waiters share +`AGENT_LOCK_LOG="$LOG"` (`$LOG=churn.log`, line 916). Change each waiter to its OWN log +`AGENT_LOCK_LOG="$WORK/t17d.$r.$i.log"` (the churner writes only the lock *file*, never +the log, so per-waiter logs lose nothing). After the 3 rounds, +`cat "$WORK"/t17d.*.log > "$LOG"` to rebuild the consolidated `churn.log` artifact. +`warn17d` is unaffected — it greps the per-waiter `.err` STDERR files, not the log. + +Then replace the `got97` accumulation + its assertion with three assertions: 1. **Regression guard — unchanged.** `warn17d == 0` ("12 waiters polled through churn with ZERO spurious non-lock warnings"). Keep verbatim. 2. **Every waiter reaches a designed terminal state.** Accumulate each waiter's rc; - require all 12 ∈ {0, 97, 98}. Any other rc (crash, 96 config error, 99, …) ⇒ `bad`, - listing the offending `round.idx=rc`. This is *stricter* than the old test, which - ignored every rc except 97. + require all 12 ∈ **{0, 1, 97, 98}**. For `bash -c 'true'` under this churn: `0` + acquired+clean release; `1` acquired but release read the held lock EMPTY (churner's + create→write window) ⇒ release rc 2 ⇒ `lock_run` demotes the clean command to 1 + (`git-commit-lock.sh:1739-1744`), ownership-unverifiable/correct; `97` timed out; + `98` churner overwrote the hold before release (designed theft detection). Any OTHER + rc (crash/139, 96 config error, 99, …) ⇒ `bad`, listing the offending `round.idx=rc`. + Stricter than the old test (which ignored every rc but 97) and is the real new + product-regression check. 
Comment must name why rc 1 is correct so a successor does + not "tighten" the set back and re-introduce the flake. 3. **Anti-vacuity: contention actually happened (the guarded path ran).** Require - `grep -c 'WAITING for lock' "$LOG" >= 1`. `WAITING` is logged **only** after a + `cat "$WORK"/t17d.*.log | grep -c 'WAITING for lock' >= 1` (counted from the + single-writer per-waiter logs ⇒ drop-free; see reviewer note 2). `WAITING` is logged **only** after a waiter's create was blocked by a present file (`git-commit-lock.sh:1363-1370`), immediately before the per-poll type-guard loop (`:1388-1570`) that `warn17d` guards — so ≥1 `WAITING` proves at least one waiter entered the exact path under @@ -73,31 +112,25 @@ get 0 WAITING is no contention at all (churner never ran / always absent), which exactly the vacuity we want to fail on. So ≥1 has margin on both ends; no threshold near the machine-variance band is introduced. -### Secondary hardening (cheap, include if clean) -- **Churner readiness proves churn began.** Today the start marker is written *before* - the loop (`:926`), so "started" doesn't prove a single cycle ran. Move the start-marker - write to *after* the churner's first successful write+delete cycle (both pwsh and perl - branches) so `wait_for_file "$START"` implies the churn loop is actually turning over. -- **Churner alive at reap.** Capture `kill -0 "$churn_pid"` right before `touch "$STOP"`; - assert it was alive ⇒ catches a churner that crashed mid-test (another vacuity route). - This is non-flaky: the churner loops 2,000,000× and the test lasts ~4-6s, so it is - always alive at reap unless it actually crashed. - -If either hardening proves fiddly or risks its own flake, the plan's load-bearing fix -is assertions 1-3 alone; the start-marker move and alive-check are defense-in-depth and -can be dropped without losing the de-flake. (Decide during implementation; record in -changelog.) 
+### Secondary hardening — DROPPED (reviewer note 3) +v1 proposed two extra hardenings (move the start-marker after the churner's first +write+delete cycle; assert the churner is alive at reap). Both are dropped in v2: they +add machinery to a delicate timing path, the alive-check can false-fail if the churner's +iteration cap is ever hit, and both are largely redundant with the drop-free +`WAITING>=1` canary (which already proves the churner produced real contention — a +waiter can only log `WAITING` if the churner had the lock file present). The +load-bearing fix is the per-waiter logs + assertions 1-3. ## Observability (per logging practice) Keep the data that made this diagnosable: emit a `note:` line with the rc distribution -and the WAITING count every run, e.g. -`note: T17d outcomes rc0=$n0 rc97=$n97 rc98=$n98 other=$nother; WAITING=$waited` — so a -future failure can be classified from the suite log without re-deriving it. (The old -test discarded this.) +and the WAITING count **unconditionally** (both pass and fail paths), e.g. +`note: T17d outcomes rc0=$n0 rc1=$n1 rc97=$n97 rc98=$n98 other=$nother; WAITING=$waited` +— so a future failure (or a pass drifting toward an edge) can be classified from the +suite log without re-deriving it. (The old test discarded this.) ## Out of scope / explicitly NOT changed - The `warn17d`/TOCTOU regression logic and its assertion. -- The churner shapes' core (pwsh on Windows, perl elsewhere) beyond the start-marker move. +- The churner shapes' core (pwsh on Windows, perl elsewhere) — unchanged in v2. - Product code (`git-commit-lock.sh`) — no product defect found. - The `.ps1` port and other suites — Test 17d is bash-unit-only. 
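The two new checks described in plan v2 can be sketched in isolation. This is a minimal illustration under stated assumptions, not the suite's code: `fake_waiter`, the `w.N.log` file names, and the hard-coded exit codes are hypothetical stand-ins for the real waiters. Only the designed exit-code set {0, 1, 97, 98} and the `WAITING for lock` marker are taken from the plan.

```shell
#!/usr/bin/env bash
# Sketch only: stand-in waiters, NOT the real git-commit-lock.sh.
set -uo pipefail   # deliberately no errexit: grep -c exits 1 on zero matches

work="$(mktemp -d)"
trap 'rm -rf "$work"' EXIT

# Hypothetical waiter: appends to its OWN log (single writer), then
# returns the exit code it was told to return.
fake_waiter() {  # usage: fake_waiter <log> <rc> <blocked: yes|no>
  [ "$3" = yes ] && echo "WAITING for lock (stand-in line)" >> "$1"
  echo "finished rc=$2" >> "$1"
  return "$2"
}

# One stand-in waiter per designed outcome: "<rc> <was it blocked?>".
specs=("0 no" "1 yes" "97 yes" "98 yes")
pids=()
for i in "${!specs[@]}"; do
  read -r rc blocked <<< "${specs[$i]}"
  fake_waiter "$work/w.$i.log" "$rc" "$blocked" &
  pids+=("$!")
done

# Bucket every exit code; anything outside the designed set is an offender.
n_ok=0; n_other=0; offenders=""
for i in "${!pids[@]}"; do
  wait "${pids[$i]}"; rc=$?
  case "$rc" in
    0|1|97|98) n_ok=$((n_ok + 1)) ;;
    *)         n_other=$((n_other + 1)); offenders="$offenders $i=$rc" ;;
  esac
done

# Rebuild one consolidated log from the per-waiter logs, then count the marker.
cat "$work"/w.*.log > "$work/all.log"
waiting="$(grep -c 'WAITING for lock' "$work/all.log")"

echo "designed=$n_ok other=$n_other offenders=[$offenders ] WAITING=$waiting"
```

Each stand-in waiter writes only to its own file, so the marker count cannot be affected by interleaved appends to a shared log. The rc buckets and the marker count are independent signals: the first checks that every waiter ended in a designed state, the second that at least one waiter was actually blocked.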
From 58c3741b5af4c6986f49eec9985785982a8ab46c Mon Sep 17 00:00:00 2001 From: Ben Toner Date: Tue, 16 Jun 2026 23:18:32 +1000 Subject: [PATCH 04/76] Test 17d: de-flake got97>=1 (CI stress find on windows-2025 unit) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The got97>=1 self-validation assertion was timing-fragile: on a loaded 2-core Windows runner all 12 waiters won create-races in the churner absent windows (rc 0) instead of timing out, so got97=0 and the test failed though the product was correct (run 27616343269; 29 prior identical runs were green). Replace it (diagnosis + Codex review confirmed test-flake, not a product bug): - per-waiter AGENT_LOCK_LOG (single-writer => drop-free; shared logs drop lines under concurrent appends), rebuilt into churn.log for the artifact; - assert all 12 waiters reach a DESIGNED terminal rc in {0,1,97,98} (rc 1 = clean command demoted on unverifiable-empty release) — catches real product regressions; - anti-vacuity: require >=1 WAITING line (proves the churn produced real contention and the guarded per-poll type-guard path ran); - unconditional note: rc distribution + WAITING count for future triage. The warn17d==0 TOCTOU regression guard is unchanged. Local: unit suite 214/0; shellcheck -S style (v0.11.0) + bash -n clean. Co-Authored-By: Claude Opus 4.8 (1M context) --- ...2026-06-16-ci-stress-test17d-flake-plan.md | 34 ++++++++++++++- tests/git-commit-lock.test.sh | 42 ++++++++++++++++--- 2 files changed, 69 insertions(+), 7 deletions(-) diff --git a/.plans/2026-06-16-ci-stress-test17d-flake-plan.md b/.plans/2026-06-16-ci-stress-test17d-flake-plan.md index c5fcbe7..c2f4bb8 100644 --- a/.plans/2026-06-16-ci-stress-test17d-flake-plan.md +++ b/.plans/2026-06-16-ci-stress-test17d-flake-plan.md @@ -1,6 +1,7 @@ # Plan: de-flake Test 17d (`got97 >= 1`) in the unit suite -Status: DRAFT — awaiting review (Claude reviewer + Codex), then implement. 
+Status: **DONE** (implemented + reviewed clean by Claude and Codex; local unit suite +214/0; awaiting CI-stress confirmation toward 50 clean in a row). ## Reviewer notes (add at top; do not renumber) Round 1 — fresh Claude reviewer + Codex (both independent), findings verified by me @@ -31,6 +32,17 @@ against the product code: 4. **[non-blocking, adopted] observability buckets** updated to `rc0/rc1/rc97/rc98/other` and emitted unconditionally (pass and fail), so a drift toward an edge is visible. +Round 2 — confirming review (fresh Claude + Codex, both independent): **CONVERGED, ok to +implement.** Both verified against the product code that the rc-set {0,1,97,98} is +exhaustive and tight (release rc 2 is remapped to 1, never leaks; acquire exposes only +0/97; reentrant-1 unreachable from a fresh CLI process), per-waiter `AGENT_LOCK_LOG` +auto-creates and breaks nothing, and `WAITING>=1` is a sound non-flaky floor. Two +implementation reminders adopted: (a) `bad` is a function — name the "other" rc bucket +something else (e.g. `nother`) and an offenders string; (b) avoid `cat … | grep -c` +(ShellCheck SC2002 fires at the CI style gate). Resolution for (b): rebuild churn.log via +`cat "$WORK"/t17d.*.log > "$LOG"` (a redirect, not a pipe — no SC2002), then +`grep -c 'WAITING for lock' "$LOG"` on the single rebuilt file. + ## Context CI stress test (ci-stress branch, 2026-06-16): 29 identical green runs, then run 27616343269 failed only on `windows-2025 (unit)` with one assertion in @@ -153,4 +165,22 @@ mergeable fix (unlike the stress-only concurrency commit 980856b). Reset `clean_count`, relaunch the driver, continue toward 50 clean in a row. 
## Changelog (implementation) -_(to be appended during implementation)_ +- Implemented exactly the Fix v2 design in `tests/git-commit-lock.test.sh` Test 17d + (the `if wait_for_file "$START" 60` block): per-waiter `AGENT_LOCK_LOG`, rc `case` + bucketing into `n0/n1/n97/n98/nother` + `rc_bad` offender list, `cat glob > "$LOG"` + rebuild, `grep -c 'WAITING for lock' "$LOG"` count, unconditional `note:` line, and + the three assertions (warn17d==0 kept verbatim; rc∈{0,1,97,98}; WAITING>=1). Removed + `got97`. No product code or other test touched. +- Static: `bash -n` clean; `shellcheck -S style` v0.11.0 (the CI-pinned gate version) + clean. +- Local run (Windows, this box, REDUCED fan-out — Test 17d is not fan-out-gated so it + runs identically): full unit suite **214 passed / 0 failed**. Test 17d emitted + `note: T17d outcomes rc0=0 rc1=0 rc97=12 rc98=0 other=0; WAITING=12` and all three + assertions PASS. (Idle box ⇒ present-dominant ⇒ all 12 timed out at 97 — the opposite + extreme to the CI failure's rc0-heavy distribution; both now accepted.) +- Implementation review: fresh Claude reviewer — "IMPLEMENTATION OK" (confirmed + set -uo pipefail / no errexit so `grep -c` exit-1 is harmless; empty-glob rebuild + handled; no `bad`/`rc_bad` collision; `warn17d` guard intact). Codex + `exec review --uncommitted` — no blocking bug. Both in `.agent-testing/`. +- Real proof pending: the windows-2025 (unit) leg under CI load. Resuming the stress + driver with the streak reset to 0. diff --git a/tests/git-commit-lock.test.sh b/tests/git-commit-lock.test.sh index 021ea22..57265a9 100755 --- a/tests/git-commit-lock.test.sh +++ b/tests/git-commit-lock.test.sh @@ -962,26 +962,58 @@ if [ -n "$churn_pid" ]; then # never churned, so bash sees it reliably. Budget 60s: pwsh cold start on # a loaded box can take >15s. if wait_for_file "$START" 60; then - warn17d=0; got97=0 + # Per-waiter lock logs (single-writer => drop-free): a SHARED log drops lines + # under concurrent appends (cf. 
the per-waiter logs at Test 2B), which would make + # the WAITING anti-vacuity count below unreliable. Rebuilt into $LOG after the runs. + warn17d=0; n0=0; n1=0; n97=0; n98=0; nother=0; rc_bad="" for r in 1 2 3; do pids=() for i in 1 2 3 4; do - AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" AGENT_LOCK_STALE_SECS=300 \ + AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$WORK/t17d.$r.$i.log" AGENT_LOCK_STALE_SECS=300 \ AGENT_LOCK_POLL_SECS=0.02 AGENT_LOCK_MAX_WAIT=2 \ bash "$LIB" run -- bash -c 'true' 2> "$WORK/t17d.$r.$i.err" & pids+=($!) done for i in 1 2 3 4; do wait "${pids[$((i-1))]}"; rc=$? - [ "$rc" = 97 ] && got97=$((got97+1)) + # A CLEAN command ('true') under this churn has exactly FOUR correct terminal + # codes — do NOT tighten this set: rc 1 is the real catch that made the old + # got97>=1 assertion flaky (see the Test 17d de-flake plan). + # 0 acquired in an absent window, clean release + # 1 acquired, but release read the held lock EMPTY (the churner's + # create->write window) -> release rc 2 -> lock_run demotes the clean + # command to 1 (ownership unverifiable; correct, not a defect) + # 97 never won an absent window within MAX_WAIT -> timed out + # 98 churner overwrote the hold before release -> designed theft detection + case "$rc" in + 0) n0=$((n0+1)) ;; + 1) n1=$((n1+1)) ;; + 97) n97=$((n97+1)) ;; + 98) n98=$((n98+1)) ;; + *) nother=$((nother+1)); rc_bad="$rc_bad $r.$i=$rc" ;; + esac n="$(grep -c 'is not a lock file' "$WORK/t17d.$r.$i.err")" warn17d=$((warn17d+n)) done done + # Rebuild the consolidated churn.log artifact from the drop-free per-waiter logs. + # 'cat glob > file' is a redirect, not a pipe (no SC2002); then count WAITING from + # the single rebuilt file. 
+ cat "$WORK"/t17d.*.log > "$LOG" 2>/dev/null || : + waited="$(grep -c 'WAITING for lock' "$LOG")" + echo "note: T17d outcomes rc0=$n0 rc1=$n1 rc97=$n97 rc98=$n98 other=$nother; WAITING=$waited" [ "$warn17d" = 0 ] && ok "12 waiters polled through churn with ZERO spurious non-lock warnings" \ || bad "churned regular file fired $warn17d non-lock warning(s) — per-poll guard TOCTOU regression!" - [ "$got97" -ge 1 ] && ok "waiters still timed out at 97 under churn ($got97/12)" \ - || bad "no waiter reached 97 under churn (got97=$got97/12) — timeout lane bypassed?" + # Replaces the old got97>=1 assertion (timeout is only ONE of the correct outcomes; + # which one occurs is machine-speed luck). Assert each waiter reached a DESIGNED + # terminal state instead — catches a real product regression (crash/139, 96, …). + [ "$nother" = 0 ] && ok "all 12 waiters reached a designed terminal state (rc in {0,1,97,98})" \ + || bad "waiter(s) hit an undesigned rc under churn:$rc_bad (rc0=$n0 rc1=$n1 rc97=$n97 rc98=$n98)" + # Anti-vacuity: WAITING is logged only after a create was blocked by a PRESENT lock, + # immediately before the per-poll type guard that warn17d guards — so >=1 proves the + # churn produced real contention and the guarded path ran. 0 => dead/absent churner. + [ "$waited" -ge 1 ] && ok "churn exercised the blocked-poll type-guard lane ($waited WAITING line(s))" \ + || bad "no WAITING logged under churn — contention never happened; test ran vacuously" else bad "T17d churner never signalled its start marker" echo " diag: churner pid=$churn_pid alive=$(kill -0 "$churn_pid" 2>/dev/null && echo yes || echo no)" From b430d739e96a7b913908cb6de306f0b869c58f53 Mon Sep 17 00:00:00 2001 From: Ben Toner Date: Tue, 16 Jun 2026 23:37:23 +1000 Subject: [PATCH 05/76] CI(stress-branch): add CPU/disk load wrapper to surface timing flakes STRESS-BRANCH ONLY (do not merge). 
tests/with-load.sh runs each suite while N CPU spin-loops and/or N disk create/write+fsync/delete loops saturate the runner, to widen the timing windows that latency/race flakes depend on (Test 17d absent window is driven by both CPU descheduling and slow file IO). Selected via new workflow_dispatch inputs stress_kind (none|cpu|disk|both, default both) and stress_load (blank=core count); empty on push/schedule => none. Step/job timeouts raised so load slowness does not trip a timeout and look like a flake. Hogs reaped by exact PID (never by name). Co-Authored-By: Claude Opus 4.8 (1M context) --- .github/workflows/tests.yml | 34 +++++++++++------ tests/with-load.sh | 73 +++++++++++++++++++++++++++++++++++++ 2 files changed, 96 insertions(+), 11 deletions(-) create mode 100644 tests/with-load.sh diff --git a/.github/workflows/tests.yml b/.github/workflows/tests.yml index c9a99da..52961e6 100644 --- a/.github/workflows/tests.yml +++ b/.github/workflows/tests.yml @@ -17,6 +17,13 @@ on: schedule: - cron: '17 3 * * 1' # weekly Monday run: catches runner-image/tool drift workflow_dispatch: + inputs: + stress_kind: + description: 'STRESS BRANCH: artificial load during suites — none|cpu|disk|both' + default: both + stress_load: + description: 'STRESS BRANCH: hogs per kind (blank = runner core count)' + default: '' concurrency: # STRESS-BRANCH ONLY — do NOT merge to main. The per-run-unique group + no @@ -41,17 +48,22 @@ jobs: # process-spawn overhead, not the PowerShell engines). Suites must NOT run # concurrently inside one runner: they're timing-sensitive on 2-core # runners. POSIX legs are fast enough to stay single-job. 
- include: - - { os: ubuntu-24.04, leg: all, job_timeout: 35 } - - { os: macos-15, leg: all, job_timeout: 35 } - - { os: windows-2025, leg: unit, job_timeout: 20 } - - { os: windows-2025, leg: interop-integration, job_timeout: 22 } + include: # STRESS BRANCH: job_timeouts raised to clear the summed step budgets under artificial load + - { os: ubuntu-24.04, leg: all, job_timeout: 80 } + - { os: macos-15, leg: all, job_timeout: 80 } + - { os: windows-2025, leg: unit, job_timeout: 40 } + - { os: windows-2025, leg: interop-integration, job_timeout: 50 } timeout-minutes: ${{ matrix.job_timeout }} # backstop only: sum of the leg's step budgets + upload headroom defaults: run: shell: bash # on windows-2025 this is Git Bash (MINGW) — what the interop suite requires env: GCL_TEST_FULL: 1 # full fan-out — CI runners are dedicated; the reduced default protects live dev boxes (TODO 58) + # STRESS-BRANCH ONLY (do not merge): artificial CPU/disk load wrapped around each + # suite (tests/with-load.sh) to widen timing windows and surface latency/race + # flakes. From the workflow_dispatch inputs; empty on push/schedule => 'none'. 
+ GCL_STRESS_KIND: ${{ inputs.stress_kind || 'none' }} + GCL_STRESS_LOAD: ${{ inputs.stress_load }} steps: - uses: actions/checkout@9f698171ed81b15d1823a05fc7211befd50c8ae0 # v6.0.3, SHA-pinned with: @@ -76,30 +88,30 @@ jobs: - name: Unit suite if: ${{ matrix.leg == 'all' || matrix.leg == 'unit' }} - timeout-minutes: ${{ matrix.os == 'windows-2025' && 15 || 10 }} # a step timeout FAILS the step (not the job), so the upload step reliably runs; sized from run 27325978197 + one internal MAX_WAIT hang + timeout-minutes: ${{ matrix.os == 'windows-2025' && 30 || 25 }} # STRESS BRANCH: raised (15->30 / 10->25) so artificial load slowness doesn't trip the step timeout and masquerade as a flake env: GCL_TEST_PRESERVE_DIR: ${{ github.workspace }}/test-output/failed-work/unit run: | mkdir -p test-output - bash tests/git-commit-lock.test.sh 2>&1 | tee test-output/unit-suite.log + bash tests/with-load.sh bash tests/git-commit-lock.test.sh 2>&1 | tee test-output/unit-suite.log - name: Interop suite (bash + pwsh) if: ${{ !cancelled() && (matrix.leg == 'all' || matrix.leg == 'interop-integration') }} # run even if an earlier suite failed — every signal is useful - timeout-minutes: 10 + timeout-minutes: 25 # STRESS BRANCH: raised 10->25 for artificial load env: GCL_TEST_PRESERVE_DIR: ${{ github.workspace }}/test-output/failed-work/interop run: | mkdir -p test-output - bash tests/git-commit-lock.interop.test.sh 2>&1 | tee test-output/interop-suite.log + bash tests/with-load.sh bash tests/git-commit-lock.interop.test.sh 2>&1 | tee test-output/interop-suite.log - name: Integration suite (real concurrent commits) if: ${{ !cancelled() && (matrix.leg == 'all' || matrix.leg == 'interop-integration') }} - timeout-minutes: 7 # its internal AGENT_LOCK_MAX_WAIT cap is 240s + timeout-minutes: 20 # STRESS BRANCH: raised 7->20 for artificial load (internal AGENT_LOCK_MAX_WAIT cap is 240s) env: GCL_TEST_PRESERVE_DIR: ${{ github.workspace }}/test-output/failed-work/integration run: | mkdir -p 
test-output - bash tests/git-commit-lock.integration.test.sh 2>&1 | tee test-output/integration-suite.log + bash tests/with-load.sh bash tests/git-commit-lock.integration.test.sh 2>&1 | tee test-output/integration-suite.log - name: Upload failure diagnostics if: ${{ failure() || cancelled() }} # failure() covers step timeouts (they fail the step); cancelled() is best-effort cover for manual cancels / the job-level backstop diff --git a/tests/with-load.sh b/tests/with-load.sh new file mode 100644 index 0000000..e19ae5f --- /dev/null +++ b/tests/with-load.sh @@ -0,0 +1,73 @@ +#!/usr/bin/env bash +# STRESS-BRANCH ONLY — do NOT merge to main. +# +# Run "$@" while artificial CPU and/or disk load saturates the runner, to widen the +# timing windows that latency/race flakes depend on (e.g. Test 17d's churn "absent +# window" — driven by both CPU descheduling of the churner AND slow file create/delete +# IO). Hogs are reaped by their EXACT PIDs afterward (never by name), so this is safe on +# a shared machine; on an ephemeral CI runner it is doubly safe. +# +# GCL_STRESS_KIND = none | cpu | disk | both (default: both) +# GCL_STRESS_LOAD = N hogs of EACH selected kind (default: detected core count) +# +# CPU hog = a bare bash spin loop (one core each). +# Disk hog = a tight create / write+fsync / delete loop of a small file on the same +# volume as the test's scratch dir (TMPDIR) — metadata + write-back pressure +# that contends with the lock-file create/delete the suite itself does. +set -uo pipefail + +kind="${GCL_STRESS_KIND:-both}" +cores="$(nproc 2>/dev/null || sysctl -n hw.ncpu 2>/dev/null || echo 4)" +load="${GCL_STRESS_LOAD:-$cores}" +case "$load" in ''|*[!0-9]*) load="$cores" ;; esac # guard non-numeric / empty + +hogdir="${TMPDIR:-/tmp}/gcl-stress.$$" +mkdir -p "$hogdir" 2>/dev/null || hogdir="." 
+ +hogs=() +spawn_cpu() { + local i + for ((i = 0; i < load; i++)); do + bash -c 'while :; do :; done' & + hogs+=("$!") + done +} +spawn_disk() { + local i + for ((i = 0; i < load; i++)); do + bash -c ' + d="$1"; j=0 + while :; do + f="$d/dh.$$.$((j % 24))" + dd if=/dev/zero of="$f" bs=32k count=8 conv=fsync 2>/dev/null + rm -f "$f" + j=$((j + 1)) + done' _ "$hogdir" & + hogs+=("$!") + done +} +cleanup() { + local p + for p in "${hogs[@]:-}"; do + [ -n "$p" ] && kill "$p" 2>/dev/null + done + rm -rf "$hogdir" 2>/dev/null +} +trap cleanup EXIT INT TERM + +case "$kind" in + cpu) spawn_cpu ;; + disk) spawn_disk ;; + both) spawn_cpu; spawn_disk ;; + none) : ;; + *) echo "with-load: unknown GCL_STRESS_KIND='$kind' — running with NO load" >&2 ;; +esac +echo "stress: kind=$kind load=$load cores=$cores hogs=${#hogs[@]} :: $*" + +"$@" +rc=$? + +cleanup +hogs=() +echo "stress: hogs reaped; wrapped command rc=$rc" +exit "$rc" From 2e483de058cb2f9084a141e75c4057881b56b000 Mon Sep 17 00:00:00 2001 From: Ben Toner Date: Wed, 17 Jun 2026 00:00:31 +1000 Subject: [PATCH 06/76] AGENTS.md: record the CI flake-hunt mission + formal diagnosis loop So the process (dispatch -> on failure: subagent diagnose -> Codex review -> plan -> review/fix rounds -> implement -> review/fix rounds -> commit -> reset streak -> resume), the mechanics, the process-hygiene lessons, and the progress log survive context compaction. Co-Authored-By: Claude Opus 4.8 (1M context) --- AGENTS.md | 79 +++++++++++++++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 79 insertions(+) create mode 100644 AGENTS.md diff --git a/AGENTS.md b/AGENTS.md new file mode 100644 index 0000000..9f10699 --- /dev/null +++ b/AGENTS.md @@ -0,0 +1,79 @@ +# AGENTS.md — CI flakiness stress hunt (branch `ci-stress`) + +> This branch exists to **flush out CI flakiness** in the test suites by running them +> on GitHub Actions many times, under artificial load, and fixing every flake found via +> a formal loop. 
Written 2026-06-16 so the mission + process survive context compaction. +> A successor instance: read this top-to-bottom, then check `.agent-testing/` for live state. + +## Mission (Ben, 2026-06-16) +Run the `tests` workflow on `ci-stress` repeatedly until **50 clean runs in a row**, or +until agent credits run out (tell Ben; GitHub minutes are FREE — public repo — so the +only budget is agent compute). Each time a run fails, fix the flake with the formal loop +below, reset the streak to 0 (we want 50 clean on the *fixed* code), and resume. Ben also +asked to run under **CPU + disk load** to surface load-sensitive flakes faster. + +## The formal diagnosis→fix loop (run on EVERY failure) +1. **Capture** the failure: which leg/suite/test, the assertion, logs + preserved + artifacts. Save under `.agent-testing/failures//` (or `interop-fail-*.log`). +2. **Diagnose** — spawn a subagent (fresh context) to root-cause from the evidence + the + code. Give it the evidence, WITHHOLD your own conclusion (let it reason independently). +3. **Independent review of the diagnosis** — get a *foreign model* (Codex) to verify the + diagnosis against the code (uncorrelated with Claude). `codex exec --sandbox read-only + -c service_tier=default - < prompt > out.md` (NO `-o` — it corrupts output; capture stdout). +4. **Classify**: test-flake (timing assumption breaks; product is correct) vs product bug. +5. **Plan** the fix in `.plans/YYYY-MM-DD-ci-stress--plan.md`; commit it. +6. **Plan review/fix rounds until clean** — fresh Claude reviewer AND Codex each round; + block ONLY on real design defects (not plan-doc pedantry); iterate until both CONVERGE. + Verify every reviewer finding against the actual code yourself (reviewers are fallible + and Claude-correlated). +7. **Implement** the fix (test or product). `bash -n` + `shellcheck -S style` (v0.11.0 — + the CI gate) must stay clean. Run the affected suite locally to confirm. +8. 
**Implementation review/fix rounds** — fresh Claude reviewer + Codex on the diff; clean. +9. **Commit** to `ci-stress` under the git commit lock (`~/.local/bin/git-commit-lock.sh + run -- ...`, stage only your paths), **push**, mark the plan DONE + changelog. +10. **Reset** the streak (`rm .agent-testing/clean_count`) and **resume** the driver. + +Quality bar (Ben): "I'm intending this library to be great" — spend tokens on rigor; +don't cap review rounds for cost; a wrong fix that resurfaces is worse than slow. + +## Mechanics (all under the `ci-stress` worktree) +- Worktree: `C:/agent_data/commit-lock/worktrees/ci-stress`. Repo public: `bentoner/git-commit-lock`. +- **Auth**: `GH_TOKEN=$(printf 'protocol=https\nhost=github.com\n\n' | git credential fill | grep '^password=' | cut -d= -f2-)`. `gh` is at `~/scoop/shims` (add to PATH). +- **Stress-only commits — DO NOT MERGE to main**: the workflow `concurrency` tweak + (unique-per-run group, so parallel dispatches don't cancel) and `tests/with-load.sh` + + the workflow's load wiring (inputs `stress_kind`/`stress_load`, wrapped suite steps, + raised timeouts). Any *test/product fixes* ARE normal mergeable commits. +- **Driver**: `.agent-testing/driver.sh` — keeps `MAXC=5` runs in flight via + `workflow_dispatch` (with `-f stress_kind=$STRESS_KIND`), polls, records + `results.tsv`/`clean_count`/`status.txt`, and EXITS on the first failure (sentinel + `FAIL:`, captures diagnostics) or at `TARGET` (sentinel `DONE`). Launch: + `cd .agent-testing && rm -f clean_count sentinel STOP && STRESS_KIND=both TARGET=50 bash ./driver.sh` (background). +- **Load**: `tests/with-load.sh` wraps each suite, spawning N CPU spin-loops and/or N disk + create/write+fsync/delete loops (`GCL_STRESS_KIND`, `GCL_STRESS_LOAD`). Hogs reaped by + exact PID. The runner is 4-core; `load=4` saturates it. +- **Flake-condition meter**: Test 17d's `note: T17d outcomes rc0=.. rc1=.. rc97=.. rc98=.. 
+ ; WAITING=..` line (in each unit-leg log) shows how hard load is biting (rc97 dropping / + rc0 rising == the original flake condition). Read it to confirm load is effective. + +## Process hygiene (LEARNED THE HARD WAY 2026-06-16) +- **`TaskStop` does NOT kill a background bash script** — it keeps running and dispatching. + After stopping, VERIFY via `powershell Get-CimInstance Win32_Process -Filter + "Name='bash.exe'"` (match CommandLine on `driver.sh`/`calibrate.sh`) and + `taskkill //F //T //PID ` the SPECIFIC pid. The driver also honors a graceful + **STOP file**: `touch .agent-testing/STOP` → it cancels inflight and exits (sentinel STOPPED). +- **Exactly ONE dispatcher alive at a time.** A surviving zombie + a relaunch = two + dispatchers racing on `ci-stress` (this corrupted a calibration run-id correlation). +- **NEVER blanket-kill** by name (`Stop-Process -Name`, `taskkill /IM`, `pkill`) — Ben's + box is shared; kill only specific PIDs you spawned. + +## Progress log +- **Test 17d (unit, `git-commit-lock.test.sh`)** — `got97>=1` was timing-fragile + (windows-unit flaked at normal load, run 27616343269). FIXED (commit 58c3741): replaced + with rc∈{0,1,97,98} + drop-free `WAITING>=1` anti-vacuity canary + `note:` meter. + Diagnosis+plan+impl all reviewed clean by Claude+Codex. See the plan in `.plans/`. +- **Test 5 (interop, `git-commit-lock.interop.test.sh`)** — FOUND under CPU load + (load=4): `FAIL: expected a tok.ps.* token on line 1 of the orphan lock, got ''`. The + precondition read (`head -n 1 "$LOCK"` after killing the pwsh holder) is a single + non-retrying read that catches the token not-yet-visible under load; the actual + cross-impl steal asserts PASS. Looks like a test-flake (fragile precondition read). + STATUS: in the formal loop (diagnosis stage) as of this writing. 
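
Note on the streak bookkeeping described in the AGENTS.md above: the driver counts consecutive clean runs in `clean_count`, stops on the first failure, and stops at `TARGET`. The following is a minimal sketch of that rule, not the real `.agent-testing/driver.sh` (which is not part of this series). The function name `record_run`, the `state_dir` variable, and the use of a plain integer in `clean_count` are assumptions made for illustration only.

```sh
# Hypothetical sketch: NOT the real .agent-testing/driver.sh.
# State lives in plain files so it survives a restart of the driver.
state_dir="$(mktemp -d)"
target=3   # the hunt uses TARGET=50; 3 keeps the demo short

# record_run <conclusion>: update clean_count and write a sentinel when the
# hunt should stop. Returns 0 to keep dispatching, 1 to stop.
record_run() {
  local conclusion="$1" count=0
  if [ -f "$state_dir/clean_count" ]; then
    count="$(cat "$state_dir/clean_count")"
  fi
  if [ "$conclusion" != "success" ]; then
    # First failure: stop, and leave clean_count alone (it is reset by hand
    # with rm once the flake is fixed).
    printf 'FAIL:\n' > "$state_dir/sentinel"
    return 1
  fi
  count=$((count + 1))
  printf '%s\n' "$count" > "$state_dir/clean_count"
  if [ "$count" -ge "$target" ]; then
    printf 'DONE\n' > "$state_dir/sentinel"
    return 1
  fi
  return 0
}
```

Keeping the counter in a file rather than in a shell variable is what makes "reset the streak" a one-line `rm`, which matches the resume command quoted above.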
From 06c6d8e614262da42abd8254145b048eb94ec54f Mon Sep 17 00:00:00 2001 From: Ben Toner Date: Wed, 17 Jun 2026 00:30:39 +1000 Subject: [PATCH 07/76] Interop Test 5: de-flake via deterministic pwsh orphan (CPU-load find) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Under CPU load the kill -9 of the pwsh holder missed the native pwsh.exe (MSYS dollar-bang is a shim), so pwsh ran its full Start-Sleep 60 and exited gracefully — its PowerShell.Exiting backstop DELETED the lock, so the precondition read got an empty/gone file and the 3 steal asserts were vacuous (stole a backdate-recreated empty file). Diagnosis + Codex agreed: test bug, product correct. Fix (Option D): the holder now does if (-not (Lock-Acquire)) { Exit 3 }; write READY; [Environment]::Exit(0) Environment.Exit bypasses BOTH release and the backstop, leaving a deterministic token-bearing orphan with no external kill. bash drops the kill and just reaps. The tok.ps token assertion is now genuine every run, not vacuous. Local interop suite 141/0. Reviewed clean by fresh Claude reviewer + Codex. shellcheck -S style + bash -n clean. Found via the CI load stress test. 
Co-Authored-By: Claude Opus 4.8 (1M context) --- ...6-17-ci-stress-interop-test5-flake-plan.md | 119 ++++++++++++++++++ AGENTS.md | 25 +++- tests/git-commit-lock.interop.test.sh | 19 +-- 3 files changed, 150 insertions(+), 13 deletions(-) create mode 100644 .plans/2026-06-17-ci-stress-interop-test5-flake-plan.md diff --git a/.plans/2026-06-17-ci-stress-interop-test5-flake-plan.md b/.plans/2026-06-17-ci-stress-interop-test5-flake-plan.md new file mode 100644 index 0000000..a6f9e8d --- /dev/null +++ b/.plans/2026-06-17-ci-stress-interop-test5-flake-plan.md @@ -0,0 +1,119 @@ +# Plan: de-flake interop Test 5 (genuine-pwsh-orphan steal) under load + +Status: **DONE** — diagnosis + fix D validated by Claude subagent + Codex; implemented; +implementation reviewed clean by fresh Claude reviewer ("IMPLEMENTATION OK") + Codex ("no +correctness issues"); local interop suite 141/0 with a genuine `tok.ps.*` token. Awaiting +CI-under-load confirmation. + +## Reviewer notes (top; do not renumber) +_(none yet)_ + +## Context +CI stress under CPU load (load=4, 4-core Windows runner) reproducibly fails the **interop +suite Test 5** ("bash steals a STALE lock GENUINELY created by pwsh (holder killed +mid-hold)"), `tests/git-commit-lock.interop.test.sh:308-334`: +``` +FAIL: expected a tok.ps.* token on line 1 of the orphan lock, got '' +PASS: bash run exited 0 after stealing pwsh's stale lock (+2 more PASS) +``` +Diagnosis (Claude subagent) + independent Codex review — both in +`.agent-testing/failures/interop-test5/{DIAGNOSIS.md,b5.log}` and +`.agent-testing/codex-t5-diag-review.txt`. Agreed mechanism (high confidence, +triple-corroborated by b5.log): + +- The holder is `pwsh ... Lock-Acquire; write READY; Start-Sleep 60 &`, with `hpid=$!`. + bash waits READY then `kill -9 "$hpid"`. **That kill does not terminate the native + pwsh** (MSYS `$!` names a shim, not `pwsh.exe`; under load it misses). 
Proof: b5.log + shows ACQUIRED 13:42:45 → RELEASED 13:43:45 = **exactly 60s = the full Start-Sleep**, + and the release reason is **`engine-event backstop at process exit`** which fires ONLY + on graceful exit (`git-commit-lock.ps1:1299-1322`), never on a hard kill. +- That graceful-exit backstop **deletes the lock file** (`git-commit-lock.ps1:1319-1321`) + before bash reads it, so `head -n 1 "$LOCK"` (:320) returns `''` — a **gone file**, not + a slow-to-appear token. `backdate "$LOCK" 9999` (:325 = `touch`, no `-c`, :107-115) + then **re-creates it empty+ancient**, and bash steals THAT empty orphan (`ghost=?`, + b5.log). So the 3 downstream PASSes are **vacuous** (they steal an empty file, not a + genuine `tok.ps.*` orphan); the only assertion checking the real premise correctly FAILed. +- **Classification: test bug, product correct.** Every product action in b5.log is right. +- **Why load:** unloaded, the kill lands by timing luck before the sleep ends; under load + the kill misses and the holder self-releases. + +Scope: this kill-a-holder-then-read-its-orphan pattern is unique to Test 5. The other +interop kill (`:787`, `w14b`) is cleanup of a *hung waiter* after a regression `bad` — no +orphan read depends on it — so it is NOT affected. + +## Fix (Option D — make the orphan deterministic; remove the unreliable kill) +Both reviewers recommend D over hardening the kill (B/C): it eliminates the flaky +mechanism instead of making it reliable, and is the smaller, more deterministic change. + +Have the pwsh holder **acquire, signal READY, then self-exit via +`[Environment]::Exit(0)`** — the product's *documented* hard-exit that bypasses BOTH +`Lock-Release` and the `PowerShell.Exiting` backstop (`git-commit-lock.ps1:221-224`, +`:1299-1301`), so it leaves a genuine token'd orphan every time, with no external kill and +no timing dependence. 
`Lock-Acquire` writes+flushes+closes the token before returning +(`git-commit-lock.ps1:650-664`) and READY is written only after acquire, so the moment +bash sees READY the `tok.ps.*` token is already durably on disk. + +Concretely in `tests/git-commit-lock.interop.test.sh` Test 5: +1. Holder command (`:314-315`): replace + `. '$PS1WIN'; Lock-Acquire | Out-Null; [IO.File]::WriteAllText('$READY','r'); Start-Sleep 60` + with + `. '$PS1WIN'; if (-not (Lock-Acquire)) { [Environment]::Exit(3) }; [IO.File]::WriteAllText('$READY','r'); [Environment]::Exit(0)` + (`Lock-Acquire` returns `$false` on failure, `git-commit-lock.ps1:1350`; guard it so a + failed acquire never writes READY → the existing else-branch "never readied" fires.) +2. Success branch (`:317-324`): drop the unreliable `kill -9 "$hpid"; wait "$hpid"; sleep + 0.3` and replace with just `wait "$hpid" 2>/dev/null` (reap the self-exited holder). + Keep the token read + `case tok.ps.*` assertion + `backdate` + the steal asserts + unchanged — but now the orphan deterministically carries the genuine pwsh token, so the + `tok.ps.*` assertion (and the downstream steal) are no longer vacuous. +3. Comment (`:309-311`): rewrite to describe the new mechanism honestly — the holder + acquires, signals ready, then exits via `[Environment]::Exit(0)`, a CLR hard-exit that + bypasses release (no `PowerShell.Exiting` event), leaving a genuine no-release token'd + orphan; deterministically equivalent (same on-disk state) to a holder killed mid-hold, + without depending on a scheduler-raced external kill. +4. else branch (`:331-333`): keep its `kill -9 "$hpid"` cleanup (harmless; the holder may + still be starting if it never readied). + +### Why D is faithful (not a weakening) +Test 5 verifies **bash stealing a genuine stale pwsh-created lock cross-impl**. What +matters is the on-disk state at steal time: a live lock file whose line 1 is a real +`tok.ps.*` token, with the holder gone and no release performed. 
D produces exactly that +state deterministically. The literal "killed by external TerminateProcess" flavor is only +test *setup*, not the product behavior under test; D's CLR hard-exit leaves the identical +artifact. The fix makes the long-vacuous downstream PASSes actually meaningful. + +## Also +- Correct the `AGENTS.md` Test 5 progress-log note (it currently states the wrong + mechanism — "token not-yet-visible under load"); replace with the missed-kill / + graceful-release-deleted-the-file mechanism. + +## Out of scope / NOT changed +- Product code (`git-commit-lock.ps1` / `.sh`) — no product defect. +- The bash-worker kills in the unit suite (they kill native bash where `$!` is correct and + no orphan-read depends on them; they passed under load). +- Other interop tests. + +## Testing +1. Static: `bash -n` + `shellcheck -S style` (v0.11.0, the CI gate) on the interop test. +2. Local: run the interop suite once on this box (pwsh present) — Test 5 must pass and the + token assertion must see a real `tok.ps.*` token. (Unloaded local box can't reproduce + the original miss, but confirms the rewrite is correct.) +3. Real proof = CI under load: dispatch ci-stress with stress_kind=cpu/both several times; + the interop leg must stay green where it previously failed deterministically. + +## Changelog (implementation) +- Implemented Fix D in `tests/git-commit-lock.interop.test.sh` Test 5: holder command now + `if (-not (Lock-Acquire)) { [Environment]::Exit(3) }; write READY; [Environment]::Exit(0)` + (was `Lock-Acquire | Out-Null; write READY; Start-Sleep 60`); success branch drops + `kill -9 "$hpid"; sleep 0.3`, keeps `wait "$hpid"` to reap; ok-message + comment updated. + No product code, no other test touched. `Lock-Acquire` returns a strict boolean + (git-commit-lock.ps1:1350 etc.) so the `-not` guard is valid; the token is flushed+closed + during acquire (before READY) so it is durably visible before `[Environment]::Exit`. 
+- Static: `bash -n` + `shellcheck -S style` (v0.11.0) clean. +- Local (Windows, pwsh 7.5.5): interop suite **141 passed / 0 failed**; Test 5 token + assertion now PASSes with a real `tok.ps.*` token (e.g. `tok.ps.76676.…`) — no longer the + vacuous empty-orphan steal. +- Review: fresh Claude reviewer "IMPLEMENTATION OK" (verified Lock-Acquire boolean contract, + no pipeline pollution from dropping Out-Null, token durability, race-free `wait`, quoting); + Codex `exec review --uncommitted` "no correctness issues." Both in `.agent-testing/`. +- AGENTS.md Test 5 progress note corrected (was the wrong "token not-yet-visible" mechanism). +- Real proof pending: CI interop leg under CPU load where it previously failed 3/3. diff --git a/AGENTS.md b/AGENTS.md index 9f10699..07a06bd 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -71,9 +71,22 @@ don't cap review rounds for cost; a wrong fix that resurfaces is worse than slow (windows-unit flaked at normal load, run 27616343269). FIXED (commit 58c3741): replaced with rc∈{0,1,97,98} + drop-free `WAITING>=1` anti-vacuity canary + `note:` meter. Diagnosis+plan+impl all reviewed clean by Claude+Codex. See the plan in `.plans/`. -- **Test 5 (interop, `git-commit-lock.interop.test.sh`)** — FOUND under CPU load - (load=4): `FAIL: expected a tok.ps.* token on line 1 of the orphan lock, got ''`. The - precondition read (`head -n 1 "$LOCK"` after killing the pwsh holder) is a single - non-retrying read that catches the token not-yet-visible under load; the actual - cross-impl steal asserts PASS. Looks like a test-flake (fragile precondition read). - STATUS: in the formal loop (diagnosis stage) as of this writing. +- **Test 5 (interop, `git-commit-lock.interop.test.sh`)** — FOUND under CPU load (3/3 cpu + runs): `FAIL: expected a tok.ps.* token on line 1 of the orphan lock, got ''`. 
Mechanism + (diagnosis + Codex, NOT "token not-yet-visible"): `kill -9 "$hpid"` missed the native + pwsh (MSYS `$!` is a shim), so pwsh ran its full `Start-Sleep 60` and exited gracefully, + firing the `PowerShell.Exiting` backstop that DELETED its own lock — so the read hit a + gone file; `backdate`(touch) then re-created it empty, making the 3 "steal" PASSes + vacuous. Test bug, product correct. FIXED (commit ): holder now self-exits + via `[Environment]::Exit(0)` (bypasses release + backstop) leaving a deterministic + token'd orphan — no kill. Reviewed clean Claude+Codex; local interop 141/0. +- **Calibration finding (load=4 on a 4-core runner):** `cpu` reliably breaks interop Test 5 + (above) and otherwise the unit suite is fine. `disk` shifts Test 17d toward the acquire + regime (rc0 up to 4/12 — Ben's disk instinct was apt) but nothing fails. `both` (8 hogs + on 4 cores) is the most extreme and additionally trips TWO unit tests only under that + pathological oversubscription: `recovery took 33s (>20s)` (+ "rc=97 behind a crashed + claim" / "no STOLE-BY-CLAIM") and `claim-path warning fired 0 times (want 1)`. These two + are SUSPECTED load-too-high artifacts (tight internal budgets exceeded by 2x CPU + oversubscription + heavy disk), NOT yet confirmed genuine. STATUS: to classify before the + 50-clean hunt — decide hunt load level (cpu-only vs moderate both) and whether to harden + those two budgets. Data: `.agent-testing/calibration.tsv`. 
diff --git a/tests/git-commit-lock.interop.test.sh b/tests/git-commit-lock.interop.test.sh index 06fe746..8d2a566 100644 --- a/tests/git-commit-lock.interop.test.sh +++ b/tests/git-commit-lock.interop.test.sh @@ -306,20 +306,25 @@ grep -q "holder=pid=99999 host=ghost" "$LOG" \ || bad "holder from line 2 missing in pwsh's STALE log line" echo "== Test 5: bash steals a STALE lock GENUINELY created by pwsh (holder killed mid-hold) ==" -# The stale lock really is pwsh's: a pwsh process dot-sources the lock, acquires, -# signals ready, then is hard-killed by PID mid-hold (TerminateProcess — no -# release, no exit event), leaving its live lock FILE (token line 1) behind. +# The stale lock really is pwsh's: a pwsh process dot-sources the lock, acquires (writing +# its tok.ps.* token to line 1 and flushing+closing the file), signals ready, then +# SELF-EXITS via [Environment]::Exit(0) — the port's documented hard-exit that bypasses +# BOTH Lock-Release AND the PowerShell.Exiting backstop — leaving its live token'd lock +# FILE behind with no release. This is DETERMINISTIC: the same on-disk state as a holder +# killed mid-hold, but without an external kill. (An MSYS `kill -9 "$!"` does NOT reliably +# terminate the native pwsh.exe under load — it survived, ran to completion, and its +# graceful-exit backstop DELETED the lock, leaving an empty file to steal; observed under +# CPU load, run 27621668323. See the Test 5 de-flake plan.) LOCK="$WORK/b5.lock"; LOG="$WORK/b5.log"; : > "$LOG"; MARK="$WORK/b5.mark"; printf '%s' before > "$MARK" READY="$WORK/b5.ready"; rm -f "$READY" AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" AGENT_LOCK_STALE_SECS=300 \ - pwsh -NoProfile -Command ". '$PS1WIN'; Lock-Acquire | Out-Null; [IO.File]::WriteAllText('$READY','r'); Start-Sleep 60" & + pwsh -NoProfile -Command ". '$PS1WIN'; if (-not (Lock-Acquire)) { [Environment]::Exit(3) }; [IO.File]::WriteAllText('$READY','r'); [Environment]::Exit(0)" & hpid=$! 
if wait_for "$READY"; then - kill -9 "$hpid" 2>/dev/null; wait "$hpid" 2>/dev/null - sleep 0.3 + wait "$hpid" 2>/dev/null # holder self-exited via [Environment]::Exit (no release); reap it tok="$(head -n 1 "$LOCK" 2>/dev/null | tr -d '\r\n')" case "$tok" in - tok.ps.*) ok "dead pwsh holder left its own lock file behind (token $tok)" ;; + tok.ps.*) ok "self-exited pwsh holder left its own token'd lock behind (token $tok)" ;; *) bad "expected a tok.ps.* token on line 1 of the orphan lock, got '$tok'" ;; esac backdate "$LOCK" 9999 # age the orphan past any stale window From 3270fbd37db19221d6419a3bb60ed2c9f2df19eb Mon Sep 17 00:00:00 2001 From: Ben Toner Date: Wed, 17 Jun 2026 01:21:45 +1000 Subject: [PATCH 08/76] AGENTS.md: record Test 31(a) diagnosis + hunt status (16/50, halted on T31a) Third flake found by the load hunt (ubuntu, both/load=2): Test 31(a) leaked-token-memory DISCOVERY-HOLD assertion races the external mv install vs the leaver _lock_discover; under load the direct-discover path (sh:822) adopts the claim instead of the memory path (sh:1382) the assertion pins. Product correct; test-orchestration race; 31(b) covers the memory path deterministically. Diagnosed, fix pending the formal loop. Hunt at 16/50 clean, halted here. Co-Authored-By: Claude Opus 4.8 (1M context) --- AGENTS.md | 20 ++++++++++++++++++++ 1 file changed, 20 insertions(+) diff --git a/AGENTS.md b/AGENTS.md index 07a06bd..381f331 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -90,3 +90,23 @@ don't cap review rounds for cost; a wrong fix that resurfaces is worse than slow oversubscription + heavy disk), NOT yet confirmed genuine. STATUS: to classify before the 50-clean hunt — decide hunt load level (cpu-only vs moderate both) and whether to harden those two budgets. Data: `.agent-testing/calibration.tsv`. 
+- **Test 31(a) (unit, `git-commit-lock.test.sh:1582`)** — FOUND on **ubuntu** under + both/load=2 (moderate, likely genuine), run 27626826865: `FAIL: no leaked-token-memory + DISCOVERY-HOLD`. DIAGNOSED (not yet fixed): the product has two valid DISCOVERY-HOLD + paths — direct `_lock_discover` (sh:822) and the per-poll leaked-token-memory check + (sh:1382). 31(a)'s external `mv` (installs the leaked claim at the lock path) RACES the + leaver's `_lock_discover`; under load the mv landed first, so 822 adopted the claim + instead of the 1382 memory path the assertion pins. Product correct (rc 0, clean + release, no leftover all PASSed); test-orchestration race. Sibling 31(b) already covers + the memory path DETERMINISTICALLY (internal steering) and passed. Fix options + recommend + in `.agent-testing/failures/unit-test31/DIAGNOSIS.md` (recommend A: relax 31a to accept + generic DISCOVERY-HOLD since 31b covers memory — but VERIFY via the formal loop it's not + vacuous). NEEDS: subagent-diagnosis confirm + Codex review + plan + impl review. + +## Hunt status (as of 2026-06-17 ~01:15 local) +- `both`/load=2 hunt reached **16/50 clean** then halted on Test 31(a) above. The driver + exited cleanly (sentinel FAIL); no stray dispatcher; no in-flight runs. +- To RESUME after fixing Test 31(a): `cd .agent-testing && rm -f clean_count sentinel STOP + && STRESS_KIND=both STRESS_LOAD=2 TARGET=50 bash ./driver.sh` (background). Expect it to + surface further flakes (each is a fresh loop). Load=2 avoids the 8-hog budget artifacts. +- TWO flakes fixed & pushed this session: Test 17d (58c3741), interop Test 5 (06c6d8e). 
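
Note on the Test 31(a) entry above: the two adoption routes log different `DISCOVERY-HOLD` prefixes, so a log can be classified by route with two disjoint `grep` patterns. The sketch below assumes only the two prefixes quoted in the Test 31(a) de-flake plan (`DISCOVERY-HOLD:` for the direct route and `DISCOVERY-HOLD (leaked-token memory):` for the memory route). The function name `classify_route` and the sample log lines are made up for the demo and are not taken from the suite.

```sh
# classify_route <logfile>: print which DISCOVERY-HOLD route a log shows.
# The memory pattern is checked first; the direct pattern needs a colon
# immediately after DISCOVERY-HOLD, so it cannot match a memory-route line.
classify_route() {
  if grep -q "DISCOVERY-HOLD (leaked-token memory)" "$1"; then
    echo memory
  elif grep -q "DISCOVERY-HOLD:" "$1"; then
    echo direct
  else
    echo none
  fi
}

demo="$(mktemp -d)"
# Made-up sample lines: only the DISCOVERY-HOLD prefixes follow the plan text.
printf '%s\n' "DISCOVERY-HOLD: sample direct-route line" > "$demo/direct.log"
printf '%s\n' "DISCOVERY-HOLD (leaked-token memory): sample memory-route line" > "$demo/memory.log"
printf '%s\n' "LEAKED-CLAIM: sample line, no adoption" > "$demo/none.log"

classify_route "$demo/direct.log"
classify_route "$demo/memory.log"
classify_route "$demo/none.log"
```

Reporting the route rather than asserting one of them keeps the check independent of which side of the scheduling race won, while still failing when neither route adopted the claim.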
From b7af8102123cb455065898514a470e1c1e889182 Mon Sep 17 00:00:00 2001 From: Ben Toner Date: Wed, 17 Jun 2026 01:29:26 +1000 Subject: [PATCH 09/76] =?UTF-8?q?AGENTS.md:=20no=20CI=20credit=20limit=20(?= =?UTF-8?q?public=20repo,=20unlimited=20CI)=20=E2=80=94=20keep=20going?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Ben clarified: there are no credits to worry about; public repo => unlimited CI capped only by GitHub concurrency. Corrects the earlier misleading agent-credits framing that prompted a premature budget pause. Resume guidance: dispatch freely, keep going to 50 clean. Co-Authored-By: Claude Opus 4.8 (1M context) --- AGENTS.md | 15 ++++++++++----- 1 file changed, 10 insertions(+), 5 deletions(-) diff --git a/AGENTS.md b/AGENTS.md index 381f331..b15e5d2 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -6,11 +6,16 @@ > A successor instance: read this top-to-bottom, then check `.agent-testing/` for live state. ## Mission (Ben, 2026-06-16) -Run the `tests` workflow on `ci-stress` repeatedly until **50 clean runs in a row**, or -until agent credits run out (tell Ben; GitHub minutes are FREE — public repo — so the -only budget is agent compute). Each time a run fails, fix the flake with the formal loop -below, reset the streak to 0 (we want 50 clean on the *fixed* code), and resume. Ben also -asked to run under **CPU + disk load** to surface load-sensitive flakes faster. +Run the `tests` workflow on `ci-stress` repeatedly until **50 clean runs in a row**. Each +time a run fails, fix the flake with the formal loop below, reset the streak to 0 (we want +50 clean on the *fixed* code), and resume. Ben also asked to run under **CPU + disk load** +to surface load-sensitive flakes faster. 
+ +**NO CREDITS / NO BUDGET LIMIT — DON'T PAUSE FOR "CREDITS".** Ben (2026-06-17, explicit): +there are no credits to worry about — this is a PUBLIC repo, so we can run UNLIMITED CI, +capped only by GitHub concurrency (excess just queues — throughput, not cost). Keep going; +dispatch freely; run full review loops. Only surface a genuine blocker or a real decision +for Ben. ## The formal diagnosis→fix loop (run on EVERY failure) 1. **Capture** the failure: which leg/suite/test, the assertion, logs + preserved From 51a1753b83da1f55d728d47e5d44756d293d9bcc Mon Sep 17 00:00:00 2001 From: Ben Toner Date: Wed, 17 Jun 2026 01:59:01 +1000 Subject: [PATCH 10/76] Test 31(a): de-flake the leaked-claim discovery-route race (CI load find) Sub-leg (a) installs a recheck-unreadable leaked claim at the lock path via an external mv, then asserted adoption went through the per-poll leaked-token-memory route ("DISCOVERY-HOLD (leaked-token memory)"). But the product can adopt the claim via EITHER of two correct routes: the inline ownership-discovery read (git-commit-lock.sh:822) if the mv lands before it, or the per-poll memory check (git-commit-lock.sh:1382) on a later poll if it lands after. Which fires is a pure scheduling race -- the external mv vs the leaver's inline discover one statement after the leak-add (sh:1112 -> sh:1114). Under both/load=2 on ubuntu the mv won and the direct route fired, so the memory-pinned assertion failed spuriously (run 27626826865). The product behaved correctly in both cases (token remembered, same token observed installed, adopted, rc 0, clean release, no residue). Fix is test-only: sub-leg (a) now accepts EITHER DISCOVERY-HOLD route and records which fired, failing only if neither adopted the claim. No coverage is lost -- the memory route stays pinned deterministically by sub-leg (b), and the direct route by Test 25's 7-position discovery-position matrix. 
Diagnosis converged across four independent reviews (my code-read + the verbatim leak.log, a fresh-context Claude subagent that did not read the prior diagnosis, and a Codex foreign-model review). Implementation reviewed clean by a fresh Claude reviewer and by Codex. Static checks (bash -n + shellcheck -S style v0.11.0) clean; local unit suite 207 passed / 0 failed. Co-Authored-By: Claude Opus 4.8 (1M context) --- ...2026-06-17-ci-stress-test31a-flake-plan.md | 135 ++++++++++++++++++ AGENTS.md | 26 ++-- tests/git-commit-lock.test.sh | 28 +++- 3 files changed, 173 insertions(+), 16 deletions(-) create mode 100644 .plans/2026-06-17-ci-stress-test31a-flake-plan.md diff --git a/.plans/2026-06-17-ci-stress-test31a-flake-plan.md b/.plans/2026-06-17-ci-stress-test31a-flake-plan.md new file mode 100644 index 0000000..be8d801 --- /dev/null +++ b/.plans/2026-06-17-ci-stress-test31a-flake-plan.md @@ -0,0 +1,135 @@ +# Plan: de-flake unit Test 31(a) (leaked-claim discovery-route race) under load + +Status: **DONE** — diagnosis converged across 4 independent reviews (my code-read + +leak.log + a fresh-context Claude subagent that did NOT read the prior diagnosis + Codex +foreign-model review); fix implemented; implementation reviewed clean (see changelog). +Test-only change; product untouched. Awaiting CI-under-load confirmation. + +## Reviewer notes (top; do not renumber) +_(none yet)_ + +## Context +CI stress under both/load=2 (moderate, 4 hogs on a 4-core ubuntu runner — NOT the +8-hog oversubscription regime) failed ONE assertion in unit **Test 31 sub-leg (a)** +(`tests/git-commit-lock.test.sh:1582`), run 27626826865: +``` +FAIL: no leaked-token-memory DISCOVERY-HOLD +``` +Every other (a) assertion passed (recheck-unreadable feeder fired; rc 0; lock released +cleanly; no claim/lock leftover); sub-legs (b)(c)(d) passed. 
+ +### Mechanism (test-orchestration race; product correct) +The product has TWO valid, equally-correct ways to adopt a leaked claim that a rival has +installed at the lock path, and both log a `DISCOVERY-HOLD` line: +- **D1 — inline ownership-discovery read.** `_lock_discover` (`git-commit-lock.sh:819`, + log at `:822` `DISCOVERY-HOLD: our claim ... installed ... by a rival's rename`) is the + unconditional final act of every post-claim non-rename exit. In (a) the steered + recheck-unreadable exit runs `_lock_leaked_add` (`:1112`, the `LEAKED-CLAIM` log) and + then **immediately, one statement later**, `_lock_discover "$tok"` (`:1114`). +- **D2 — per-poll leaked-token-memory check.** `git-commit-lock.sh:1382` + (`DISCOVERY-HOLD (leaked-token memory): ...`) fires on a LATER blocked poll while the + memory list is non-empty. + +Sub-leg (a)'s harness is open-loop: it `wait_for_grep`s the `LEAKED-CLAIM` line +(`:1574`) then does `mv -f -- "$LOCK.next" "$LOCK"` (`:1576`, the rival install). That +`mv` races the leaver's inline `_lock_discover` at `:1114`: +- mv lands **before** the inline discover → **D1** wins (the `:822` line). ← failing run +- mv lands **after** the inline discover (it misses; later poll) → **D2** wins (`:1382`). + +The assertion at `:1582` hard-pins **D2** (`grep -q "DISCOVERY-HOLD (leaked-token +memory)"`). Under load the leaver was descheduled between `:1112` and `:1114`, the +harness `mv` landed first, D1 fired, D2 never logged → the assertion failed. The product +behaved correctly in BOTH cases (token remembered, same token observed installed, +adopted, rc 0, clean release, no residue). Classification: **test flake, product +correct** — the assertion over-specified an implementation-incidental, scheduler-chosen +route rather than the contract (a leaked claim installed by a rival is adopted and +cleaned up). 
+ +### Coverage (why relaxing (a) loses nothing) +- **D2 (memory route)** is covered DETERMINISTICALLY by **sub-leg (b)** (`:1592-1627`): + it drives the rival install from inside `_lock_new_token` at NTC=2 so the leaver runs a + full aborting claim attempt and adopts only on the per-poll memory check; it asserts + `DISCOVERY-HOLD (leaked-token memory)` and the `leak < abort < adoption` ordering. +- **D1 (direct route)** is covered DETERMINISTICALLY by **Test 25** (`:1323-1425`), the + discovery-position matrix: 7 internally-steered positions, each asserting the generic + `grep -q "DISCOVERY-HOLD"` + rc 0 + no orphan. (Test 25 already uses the generic grep + idiom this fix adopts for (a).) + +So (a)'s distinct, irreplaceable job is the END-TO-END "external rival installs a +recheck-unreadable leaked claim → adopted & cleaned up" scenario, where either route is a +correct outcome. + +## Fix (Option A — accept either discovery route; recommended by all four reviews) +Test-only, in `tests/git-commit-lock.test.sh` sub-leg (a): +1. Replace the single D2-pinning assertion (`:1582-1583`) with a three-way check that + accepts EITHER route, records WHICH fired (telemetry for the load hunt), and only + fails if NEITHER `DISCOVERY-HOLD` route adopted the claim: + ```sh + if grep -q "DISCOVERY-HOLD (leaked-token memory)" "$LOG"; then + ok "... per-poll memory route ..." + elif grep -q "DISCOVERY-HOLD:" "$LOG"; then + ok "... inline direct-discovery route ... (memory route pinned by sub-leg (b)) ..." + else + bad "no DISCOVERY-HOLD adoption of the leaked claim by EITHER route" + fi + ``` + `"DISCOVERY-HOLD:"` (immediate colon) matches ONLY D1; D2's text is + `DISCOVERY-HOLD (leaked-token memory):` (space+paren after the dash), so the two + patterns are disjoint and D2 is checked first regardless. +2. 
Update sub-leg (a)'s header comment (`:1550-1552`) to state honestly that adoption may + go through either route, that the choice is a load-sensitive scheduling race, and that + the memory route is pinned deterministically by (b) and the direct route by Test 25. + +### Why A (not B/C) +- **A** matches (a)'s real intent; not vacuous — still requires the recheck-unreadable + feeder (`:1574`), rc 0 (`:1581`), clean release + no leftover (`:1584-1585`), AND a + `DISCOVERY-HOLD` adoption (the log line only appears when `_lock_take_hold` runs via a + discovery path). No new timing introduced. Keeps (a) as the load-tolerant main leg. +- **B** (force the memory route via internal steering) duplicates (b). +- **C** (force the direct route) duplicates Test 25; also `_lock_discover` direct + coverage is already comprehensive there. (NB: the subagent's specific C steering — do + the mv inside the fire-once read shadow before returning empty — would actually + mis-classify the claim as `gone` not `unreadable`, killing the leak feeder; another + reason to avoid C. Verified against `_lock_claim_state`, `git-commit-lock.sh:840-850`.) + +## Out of scope / NOT changed +- Product code (`git-commit-lock.sh`, `.ps1`) — no defect. +- Sub-legs (b)(c)(d), Test 25, any other test. + +## Logging +No product logging change. The new three-way `ok` line records which discovery route +adopted the claim each run — a small telemetry win making the previously-hidden route +choice visible in every (a) run's output (helps confirm load is exercising both routes). + +## Testing +1. Static: `bash -n` + `shellcheck -S style` (v0.11.0, the CI gate) on the test file. +2. Local: run the unit suite on this box; Test 31 (all sub-legs) must pass; confirm the + new `ok` line reports a route. Run Test 31 in a loop to confirm no regression. +3. Real proof: CI under both/load=2 where (a) previously failed — the unit leg must stay + green and report a route each run. 
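The three-way check in the Fix section can be tried in isolation. The sketch below is a minimal, hypothetical helper (`classify_route` is not part of the suite, and the sample log lines are invented apart from their `DISCOVERY-HOLD` prefixes). It only illustrates that the two grep patterns are disjoint: in a basic regular expression the parentheses are literal, and `DISCOVERY-HOLD:` requires a colon immediately after `HOLD`, which the memory-route line does not have.

```sh
#!/usr/bin/env bash
# Hypothetical helper, not part of the suite: report which DISCOVERY-HOLD
# route a log shows. The memory pattern is checked first, as in the plan.
classify_route() {
  # $1: path to the leg's log file
  if grep -q "DISCOVERY-HOLD (leaked-token memory)" "$1"; then
    echo memory     # per-poll leaked-token-memory route (D2)
  elif grep -q "DISCOVERY-HOLD:" "$1"; then
    echo direct     # inline ownership-discovery route (D1)
  else
    echo none       # neither route logged an adoption
  fi
}

# Illustrative log lines only: the text after each prefix is made up.
demo_dir="$(mktemp -d)"
printf '%s\n' 'DISCOVERY-HOLD (leaked-token memory): tok.example adopted' > "$demo_dir/d2.log"
printf '%s\n' 'DISCOVERY-HOLD: our claim tok.example installed' > "$demo_dir/d1.log"
: > "$demo_dir/empty.log"

for name in d2 d1 empty; do
  printf '%s -> %s\n' "$name" "$(classify_route "$demo_dir/$name.log")"
done
rm -rf "$demo_dir"
```

In the real test the same branches call `ok` or `bad` instead of echoing a label. The label form is used here only so the classification can be inspected directly.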
+ +## Changelog (implementation) +- Implemented Fix A in `tests/git-commit-lock.test.sh` sub-leg (a): the single + D2-pinning assertion became a three-way `if/elif/else` (memory route → ok; direct route + via `grep "DISCOVERY-HOLD:"` → ok; neither → bad). Rewrote (a)'s header comment to + document both routes, the load-sensitive race, and the deterministic coverage of each + (sub-leg (b) for memory, Test 25 for direct). No product code, no other test touched. +- Static: `bash -n` + `shellcheck -S style` (v0.11.0, the CI gate) clean. +- Local (Windows MSYS bash, pwsh 7.5.5): full unit suite **207 passed / 0 failed** + (fan-out auto-REDUCED under the box load). Sub-leg (a) passed via the memory route on + this UNLOADED box (`adoption went through the leaked-token memory (per-poll route ...)`), + confirming the normal path still fires and the new assertion accepts it; (b)(c)(d) green. +- Diagnosis review (4 independent, all converged: test flake / product correct / Fix A): + my code-read + the verbatim leak.log, a fresh-context Claude subagent that did NOT read + the prior diagnosis, and a Codex foreign-model review. Codex additionally noted D1 is + already covered by Test 25's discovery-position matrix → option C (a new D1 sub-leg) is + redundant. (I verified Test 25 covers all 7 positions deterministically myself.) +- Implementation review (2 independent, both clean / no findings): a fresh Claude reviewer + ("the change is correct ... no defect found") and Codex `exec` read-only ("None. The fix + is correct."). 
Both verified: grep patterns disjoint (BRE parens literal; `DISCOVERY-HOLD:` + needs an immediate colon, absent from the memory line), non-vacuity (a `DISCOVERY-HOLD` + line is logged one statement before the pure-assignment `_lock_take_hold`, so it reliably + implies a taken hold; backstopped by rc 0 + no-leftover + the feeder assertion), no new + race (greps run only after `wait "$w31"`), `$LOG` leg-dedicated (no cross-talk), and the + comment's sh:822/1382/1112/1114 line refs accurate. +- Real proof pending: CI under both/load=2 where (a) previously failed (run 27626826865). diff --git a/AGENTS.md b/AGENTS.md index b15e5d2..7d93530 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -95,18 +95,20 @@ don't cap review rounds for cost; a wrong fix that resurfaces is worse than slow oversubscription + heavy disk), NOT yet confirmed genuine. STATUS: to classify before the 50-clean hunt — decide hunt load level (cpu-only vs moderate both) and whether to harden those two budgets. Data: `.agent-testing/calibration.tsv`. -- **Test 31(a) (unit, `git-commit-lock.test.sh:1582`)** — FOUND on **ubuntu** under - both/load=2 (moderate, likely genuine), run 27626826865: `FAIL: no leaked-token-memory - DISCOVERY-HOLD`. DIAGNOSED (not yet fixed): the product has two valid DISCOVERY-HOLD - paths — direct `_lock_discover` (sh:822) and the per-poll leaked-token-memory check - (sh:1382). 31(a)'s external `mv` (installs the leaked claim at the lock path) RACES the - leaver's `_lock_discover`; under load the mv landed first, so 822 adopted the claim - instead of the 1382 memory path the assertion pins. Product correct (rc 0, clean - release, no leftover all PASSed); test-orchestration race. Sibling 31(b) already covers - the memory path DETERMINISTICALLY (internal steering) and passed. 
Fix options + recommend - in `.agent-testing/failures/unit-test31/DIAGNOSIS.md` (recommend A: relax 31a to accept - generic DISCOVERY-HOLD since 31b covers memory — but VERIFY via the formal loop it's not - vacuous). NEEDS: subagent-diagnosis confirm + Codex review + plan + impl review. +- **Test 31(a) (unit, `git-commit-lock.test.sh`)** — FOUND on **ubuntu** under both/load=2 + (moderate, genuine), run 27626826865: `FAIL: no leaked-token-memory DISCOVERY-HOLD`. + Mechanism: the product has two valid DISCOVERY-HOLD adoption paths — direct + `_lock_discover` (sh:822) and the per-poll leaked-token-memory check (sh:1382). 31(a)'s + external `mv` (installs the leaked claim at the lock path) RACES the leaver's inline + `_lock_discover` (called one statement after the leak-add: sh:1112 -> sh:1114); under + load the mv landed first, so 822 adopted instead of the 1382 memory path the assertion + pinned. Product correct (rc 0, clean release, no leftover all PASSed); test-orchestration + race. **FIXED (commit ):** Fix A — sub-leg (a)'s assertion now accepts EITHER + DISCOVERY-HOLD route and records which fired (memory route still pinned deterministically + by 31(b); direct route by Test 25's 7-position discovery matrix, so no coverage lost). + Diagnosis converged across 4 independent reviews (code-read + leak.log + fresh Claude + subagent + Codex); impl reviewed clean by Claude + Codex; local unit suite 207/0. See + `.plans/2026-06-17-ci-stress-test31a-flake-plan.md`. Real proof pending: CI under load. ## Hunt status (as of 2026-06-17 ~01:15 local) - `both`/load=2 hunt reached **16/50 clean** then halted on Test 31(a) above. 
The driver diff --git a/tests/git-commit-lock.test.sh b/tests/git-commit-lock.test.sh index 57265a9..26fe69d 100755 --- a/tests/git-commit-lock.test.sh +++ b/tests/git-commit-lock.test.sh @@ -1548,8 +1548,18 @@ bad_touch="$(grep 'touch ' "$LIB" | grep '_LOCK_CLAIM_PATH' | grep -v -- '-c')" echo "== Test 31: LEAKED-claim discovery — the leaked-token memory closes the unverified-claim lanes ==" # (a) main leg: a recheck-unreadable exit leaks the claim token; a rival -# later installs that claim as the lock; the leaver's per-poll memory check -# adopts it (HOLD) and release returns 0. +# (the external mv below) then installs that claim as the lock; the leaver +# adopts it (HOLD) and release returns 0. Adoption may go through EITHER of +# the product's two discovery routes — both correct: the inline +# ownership-discovery read that is the unreadable branch's final act +# (git-commit-lock.sh:822, "DISCOVERY-HOLD: ...") if the external mv lands +# before it, or the per-poll leaked-token-memory check +# (git-commit-lock.sh:1382, "DISCOVERY-HOLD (leaked-token memory)") on a later +# poll if it lands after. Which wins is a pure scheduling race — the external +# mv vs the leaver's inline discover ONE statement later (sh:1112 leak-add -> +# sh:1114 discover) — and is load-sensitive, so this leg accepts either and +# records which fired. The memory route is pinned DETERMINISTICALLY by +# sub-leg (b) below; the direct route by Test 25's discovery-position matrix. # NB: _lock_read_tok / _lock_cur_token shadows run inside COMMAND # SUBSTITUTIONS (subshells), so their fire-once state must live in flag # FILES — a variable assignment would be lost when the subshell exits. @@ -1579,8 +1589,18 @@ else fi wait "$w31"; rc=$? 
[ "$rc" = 0 ] && ok "leaver discovered its installed leaked claim and released rc 0" || bad "leaked-discovery harness rc=$rc" -grep -q "DISCOVERY-HOLD (leaked-token memory)" "$LOG" && ok "adoption went through the leaked-token memory" \ - || bad "no leaked-token-memory DISCOVERY-HOLD" +# Either discovery route is correct here (see the leg comment); accept both, +# record which fired, fail only if NEITHER adopted the leaked claim. ("$LOG" +# is dedicated to this leg, so there is no cross-talk.) "DISCOVERY-HOLD:" +# (immediate colon) matches ONLY the direct route; the memory route reads +# "DISCOVERY-HOLD (leaked-token memory):" — disjoint, and checked first. +if grep -q "DISCOVERY-HOLD (leaked-token memory)" "$LOG"; then + ok "adoption went through the leaked-token memory (per-poll route; the mv landed after the inline discover)" +elif grep -q "DISCOVERY-HOLD:" "$LOG"; then + ok "adoption went through the inline ownership-discovery read (direct route; the mv landed first) — memory route pinned by sub-leg (b)" +else + bad "no DISCOVERY-HOLD adoption of the leaked claim by EITHER route" +fi [ -e "$LOCK" ] && bad "lock leftover after leaked-claim adoption" || ok "lock released cleanly after adoption" [ -e "$LOCK.next" ] && bad "claim leftover after leaked-claim adoption" || ok "no claim leftover" # Hmm wait: STALE=300 — the ghost is backdated 9999 so it IS stale; fine. From 810ee415f398d02b920ffd274d68d206d138a24a Mon Sep 17 00:00:00 2001 From: Ben Toner Date: Wed, 17 Jun 2026 02:00:27 +1000 Subject: [PATCH 11/76] AGENTS.md: mark Test 31(a) fixed (51a1753); resume hunt; ignore *.stackdump Fill the real fix SHA into the Test 31(a) progress entry, update the hunt status (clean_count reset; both/load=2 hunt resumed toward 50 clean on the fixed tree; three flakes fixed this session). Add *.stackdump to .gitignore so the suite's transient Cygwin crash dumps stop cluttering git status during the hunt. 
Co-Authored-By: Claude Opus 4.8 (1M context) --- .gitignore | 1 + AGENTS.md | 19 +++++++++++-------- 2 files changed, 12 insertions(+), 8 deletions(-) diff --git a/.gitignore b/.gitignore index be293f3..9bdb6bd 100644 --- a/.gitignore +++ b/.gitignore @@ -5,6 +5,7 @@ # OS / editor cruft .DS_Store Thumbs.db +*.stackdump /.agent/review-queue /.agent/review-queue.lock /.agent/review-queue.lock.* diff --git a/AGENTS.md b/AGENTS.md index 7d93530..1a9f4ae 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -103,17 +103,20 @@ don't cap review rounds for cost; a wrong fix that resurfaces is worse than slow `_lock_discover` (called one statement after the leak-add: sh:1112 -> sh:1114); under load the mv landed first, so 822 adopted instead of the 1382 memory path the assertion pinned. Product correct (rc 0, clean release, no leftover all PASSed); test-orchestration - race. **FIXED (commit ):** Fix A — sub-leg (a)'s assertion now accepts EITHER + race. **FIXED (commit 51a1753):** Fix A — sub-leg (a)'s assertion now accepts EITHER DISCOVERY-HOLD route and records which fired (memory route still pinned deterministically by 31(b); direct route by Test 25's 7-position discovery matrix, so no coverage lost). Diagnosis converged across 4 independent reviews (code-read + leak.log + fresh Claude subagent + Codex); impl reviewed clean by Claude + Codex; local unit suite 207/0. See `.plans/2026-06-17-ci-stress-test31a-flake-plan.md`. Real proof pending: CI under load. -## Hunt status (as of 2026-06-17 ~01:15 local) -- `both`/load=2 hunt reached **16/50 clean** then halted on Test 31(a) above. The driver - exited cleanly (sentinel FAIL); no stray dispatcher; no in-flight runs. -- To RESUME after fixing Test 31(a): `cd .agent-testing && rm -f clean_count sentinel STOP - && STRESS_KIND=both STRESS_LOAD=2 TARGET=50 bash ./driver.sh` (background). Expect it to - surface further flakes (each is a fresh loop). Load=2 avoids the 8-hog budget artifacts. 
-- TWO flakes fixed & pushed this session: Test 17d (58c3741), interop Test 5 (06c6d8e). +## Hunt status (as of 2026-06-17 ~02:30 local) +- Test 31(a) FIXED (51a1753) via the full formal loop; clean_count reset to 0 and the + `both`/load=2 hunt RESUMED toward 50 clean (the prior 16/50 streak was on pre-fix code, + so it does not count — we want 50 clean on the FIXED tree). Expect more flakes; each is a + fresh loop. Load=2 (4 hogs/4 cores) avoids the 8-hog budget artifacts (Test 21/22a). +- To resume after any halt: `cd .agent-testing && rm -f clean_count sentinel STOP && + STRESS_KIND=both STRESS_LOAD=2 TARGET=50 bash ./driver.sh` (background). First verify no + stray dispatcher + current HEAD (see Process hygiene). +- THREE flakes fixed & pushed this session: Test 17d (58c3741), interop Test 5 (06c6d8e), + Test 31(a) (51a1753). From 19a28fd294ee5fb663ba4472b64aed49ae78fcdd Mon Sep 17 00:00:00 2001 From: Ben Toner Date: Wed, 17 Jun 2026 03:15:18 +1000 Subject: [PATCH 12/76] =?UTF-8?q?Test=2032b:=20cover=20F2=20=E2=80=94=20st?= =?UTF-8?q?eal=20rename=20WON=20but=20read-back=20verification=20FAILED?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit A coverage audit (subagent + my own verification against the code) found the product's two acquire read-back-verification failure lanes were asymmetrically covered. The create-path lane (O_EXCL create wins, path reads back the wrong token, git-commit-lock.sh:1354-1360) is covered by Test 32. Its steal-path twin "F2" (git-commit-lock.sh:1168-1179) was NOT: the stealer wins the claim race AND wins the rename-over (STOLE-BY-CLAIM logged, ghost destroyed), but the mandatory post-rename read-back at :1171 reads back the wrong token, so the product must clear its claim token and re-enter the wait loop rather than take the hold. 
After a STOLE-BY-CLAIM a silent false-hold there would be a mis-attributed hold of a destroyed-ghost path, so this is the higher-stakes twin — and nothing exercised it. Test 32b closes the gap. It mirrors Test 32 with the INVERSE token gate: a one-shot _lock_cur_token shadow gated on [ -n "$_LOCK_CLAIM_TOKEN" ] lands the read-back fault at the STEAL read-back (:1171), not the create one (:1353, where the claim token is empty). On firing it backdates the just-installed abandoned lock stale so the re-steal is immediate (same trick as Test 32 — keeps it fast and deterministic); the second attempt (shadow spent) reads back the real token and acquires, releasing rc 0. The test asserts the F2-specific log line (not the shared "acquire verification FAILED" prefix), STOLE-BY-CLAIM x2, the WARNING preceding the eventual ACQUIRED (no false-hold), and no leftovers. The stale closing NOTE that called the read-back lanes "not suite-covered" is corrected (create by Test 32, F2 by Test 32b). Product code is unchanged; F2 reads correct today — this is a missing-test (regression exposure), not a present bug. Diagnosis from a coverage-audit subagent, verified by me against the code. Test-only; no product change. Static checks clean; local suite 0 failed; Test 32b verified to exercise the F2 lane (standalone + full suite). Implementation reviewed clean by a fresh Claude reviewer and by Codex. 
Co-Authored-By: Claude Opus 4.8 (1M context) --- ...6-06-17-ci-stress-test-f2-coverage-plan.md | 97 +++++++++++++++++++ tests/git-commit-lock.test.sh | 61 +++++++++++- 2 files changed, 155 insertions(+), 3 deletions(-) create mode 100644 .plans/2026-06-17-ci-stress-test-f2-coverage-plan.md diff --git a/.plans/2026-06-17-ci-stress-test-f2-coverage-plan.md b/.plans/2026-06-17-ci-stress-test-f2-coverage-plan.md new file mode 100644 index 0000000..e1d9f4e --- /dev/null +++ b/.plans/2026-06-17-ci-stress-test-f2-coverage-plan.md @@ -0,0 +1,97 @@ +# Plan: cover F2 — steal rename WON but read-back verification FAILED (coverage gap) + +Status: **DONE** — implemented; reviewed clean (see changelog). Test-only addition; product +untouched. + +## Reviewer notes (top; do not renumber) +_(none yet)_ + +## Context +A coverage audit (subagent + my own verification against the code) found that the product's +two acquire read-back-verification failure lanes are asymmetrically covered: +- **Create path (outcome I)** — `git-commit-lock.sh:1354-1360`: O_EXCL create wins, the path + read-back ≠ our token → `WARNING: acquire verification FAILED — create won but read-back + found ...` → re-enter wait. **Covered** by Test 32 (`tests/git-commit-lock.test.sh:1760`), + whose `_lock_cur_token` shadow is gated `[ -z "$_LOCK_CLAIM_TOKEN" ]` (fires only at the + create read-back). +- **Steal path (outcome F2)** — `git-commit-lock.sh:1168-1179`: the stealer WON the claim + race AND won the rename-over (`STOLE-BY-CLAIM` already logged, ghost destroyed), but the + post-rename read-back ≠ our token → `WARNING: acquire verification FAILED — steal rename + completed but read-back found ...` → clear `_LOCK_CLAIM_TOKEN`, return 1, re-enter wait. 
+ **UNCOVERED.** Verified: no test greps the F2 string; Test 32's gate excludes it (at the + steal read-back `_LOCK_CLAIM_TOKEN` is set); on the success-rename path `:1171` is the only + `_lock_cur_token` call with the claim token set (`_lock_rename_over` `:961-979` makes none). + +F2 is the higher-stakes twin: it fires AFTER `STOLE-BY-CLAIM` (ghost already destroyed), so a +future regression here (wrongly taking the hold on a mismatched read-back, or failing to clear +`_LOCK_CLAIM_TOKEN`) would be a silent false-hold / mis-attributed release. The code reads +correctly today — this is a missing-test (regression exposure), not a present bug. + +The suite's closing NOTE (`:2119-2121`) says "lock_acquire's read-back-verification failure +lane … not suite-covered", but Test 32 already covers the create lane — the note is stale and +does not distinguish F2. + +## Change (test-only) +1. Add **Test 32b** immediately after Test 32, mirroring Test 32 with the INVERSE token gate + so the fault injection lands at the STEAL read-back: + - Set up a stale ghost (`fabricate_lock` + `backdate 9999`) so a steal is attempted. + - In a sourced subshell, `clone_fn _lock_cur_token _ct_orig`; shadow it to fire ONCE + (flag FILE `$SF1`, subshell-safe) when `[ ! -e "$SF1" ] && [ "${_LOCK_HELD:-0}" = 0 ] + && [ -n "$_LOCK_CLAIM_TOKEN" ]` — i.e. at the steal read-back (`:1171`), where the claim + token is set and the hold is not yet taken. On firing: `backdate "$AGENT_LOCK_PATH" + 9999` (so the just-installed abandoned lock is immediately re-stealable — same trick as + Test 32, keeps it fast/deterministic), `printf ""` (blank read-back → F2), `return 0`. + - `lock_acquire || exit 72; lock_release || exit 74; exit 0`. + - Flow: attempt 1 — claim won, rename won (`STOLE-BY-CLAIM`), read-back blanked → F2 + WARNING → re-enter wait; the abandoned lock is stale → attempt 2 steals it, read-back now + real (SF1 set) → HOLD → `ACQUIRED` → release rc 0. 
+ - Assertions: rc 0; the **F2-specific** string `steal rename completed but read-back` + fired (else `bad "F2 lane never ran"` — guards vacuity / proves the steering reached + `:1171`); the WARNING precedes the final `ACQUIRED` (no false-hold on attempt 1); + `STOLE-BY-CLAIM` count ≥ 2 (re-stole after the failed read-back); no leftover lock/claim + after release. +2. Update the stale NOTE (`:2119-2121`): both read-back lanes are now suite-covered — create + by Test 32, steal by Test 32b — via `_lock_cur_token` fault injection. + +## Why deterministic / load-robust +Internal steering (no scheduling race); the backdate-9999 trick removes any aging wait so the +re-steal is immediate; `MAX_WAIT=30`, `POLL=0.1` give ample headroom under CI load. Same shape +as the already-load-robust Test 32. + +## Logging +No product logging change. The new test asserts on existing product log lines (the F2 WARNING, +`STOLE-BY-CLAIM`, `ACQUIRED`). + +## Out of scope / NOT changed +- Product code (`git-commit-lock.sh`, `.ps1`) — no defect; F2 reads correct. +- Lower-priority gaps from the audit (A2/G2 wrong-type appearing at the lock path mid-steal; + platform-only feeder #3) — left for a separate decision. + +## Testing +1. Static: `bash -n` + `shellcheck -S style` (v0.11.0, the CI gate). +2. Local: run the new test (and the full suite); it MUST exercise the F2 string (the + `bad "F2 lane never ran"` guard fails loudly if the steering misses `:1171`). +3. Real proof: CI under load (the hunt) stays green with the new test. + +## Changelog (implementation) +- Added Test 32b to `tests/git-commit-lock.test.sh` (after Test 32) and updated the closing + NOTE so both read-back lanes read as covered (create by Test 32, steal/F2 by Test 32b). + Product untouched. 
+- Verified the steering empirically: a standalone extract of Test 32b (suite header + the + Test 32b block, `LIB` pinned absolute) passed 6/6 with the F2-specific line + `the steal-path read-back-verification failure lane ran (F2)` firing — proving the fault + lands at `git-commit-lock.sh:1171` (`_LOCK_CLAIM_TOKEN` set there; `_lock_rename_over` + makes no read; the create read-back at :1353 has it empty). +- Static: `bash -n` + `shellcheck -S style` (v0.11.0) clean. +- Local: full unit suite **220 passed / 0 failed** (count varies run-to-run via the fan-out + tests; 0 failed is the invariant). Test 32b: rc 0, F2 string fired, STOLE-BY-CLAIM x2, + WARNING-before-ACQUIRED, no leftovers. +- Impl review (2 independent, both clean): fresh Claude reviewer ("VERDICT: CORRECT … No + defects") — independently ran the suite twice (220/0), grepped every `_LOCK_CLAIM_TOKEN` + set/clear and `_lock_cur_token` call site, confirmed gate precision (all `_lock_discover` + branches clear the claim token first, so the `-n` gate excludes :820; release excluded via + `_lock_take_hold`), determinism, non-vacuity, termination. Codex `exec` read-only ("No + findings … correct and non-vacuous"), confirming the same with file:line cites. Two minor + non-blocking notes (the SF1 flag file lives in the throwaway WORK dir; `_ct_orig "$@"` is + harmless) — no action. +- Real proof: CI under load (the hunt) with Test 32b in the tree. 
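The plan keeps the shadow's fire-once state in a flag file because the shadow runs inside command substitutions. The sketch below is a standalone illustration of that point under assumed names (`shadow_with_var`, `shadow_with_file`, and `real-token` are hypothetical and do not appear in the product or the suite). A variable set inside `$(...)` is lost when the subshell exits, so a variable-based gate fires on every call, while a file-based gate fires once.

```sh
#!/usr/bin/env bash
# Hypothetical stand-ins for a fire-once shadow; all names are illustrative.
state_dir="$(mktemp -d)"
flag="$state_dir/fired"
fired=0

shadow_with_var() {
  if [ "$fired" = 0 ]; then
    fired=1          # lost: this assignment happens inside the subshell
    printf ''        # simulate the blank read-back
    return 0
  fi
  printf 'real-token'
}

shadow_with_file() {
  if [ ! -e "$flag" ]; then
    : > "$flag"      # survives: the filesystem is shared with the parent
    printf ''        # simulate the blank read-back
    return 0
  fi
  printf 'real-token'
}

# Each call runs in its own command substitution, as the shadows do.
var_first="$(shadow_with_var)";   var_second="$(shadow_with_var)"
file_first="$(shadow_with_file)"; file_second="$(shadow_with_file)"

printf 'variable flag: first=[%s] second=[%s]\n' "$var_first" "$var_second"
printf 'file flag:     first=[%s] second=[%s]\n' "$file_first" "$file_second"
rm -rf "$state_dir"
```

With the variable gate both reads come back blank, so a fault injected this way would never stop firing. With the file gate only the first read is blank, which is the one-shot behaviour the test steering relies on.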
diff --git a/tests/git-commit-lock.test.sh b/tests/git-commit-lock.test.sh index 26fe69d..b5ca5ee 100755 --- a/tests/git-commit-lock.test.sh +++ b/tests/git-commit-lock.test.sh @@ -1801,6 +1801,59 @@ grep -q "DISCOVERY-HOLD" "$LOG" && bad "FALSE discovery-HOLD on the abandoned ow grep -q "STOLE-BY-CLAIM" "$LOG" && ok "the abandoned lock was then reclaimed by a normal steal" \ || bad "no STOLE-BY-CLAIM of the abandoned lock" +echo "== Test 32b: steal-path read-back FAILED — rename-over WON but the lock did not read back our token (F2) ==" +# The steal-path twin of Test 32. Here the stealer WINS the claim race AND wins +# the rename-over (STOLE-BY-CLAIM is logged, the ghost is destroyed), but the +# mandatory post-rename read-back verification (git-commit-lock.sh:1171) comes +# back wrong. The product must NOT take the hold: it clears its claim token and +# re-enters the wait loop (git-commit-lock.sh:1176-1179) — never a silent +# false-hold (which, after a STOLE-BY-CLAIM, would mean a mis-attributed hold of +# a destroyed-ghost path). We fault-inject the read-back with a one-shot +# _lock_cur_token shadow gated on the claim token being SET (the INVERSE of Test +# 32's `-z` gate), so it lands at the STEAL read-back (claim token live, not yet +# held), not the create one. On firing we also backdate the just-installed +# abandoned lock stale so the re-steal is immediate (same trick as Test 32 — +# keeps it fast and deterministic). Attempt 2 (shadow spent) reads back the real +# token and acquires normally. 
+LOCK="$WORK/stealrb.lock"; LOG="$WORK/stealrb.log"; : > "$LOG" +fabricate_lock "$LOCK" "tok.ghost.t32b" "pid=9 host=ghost"; backdate "$LOCK" 9999 +AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" AGENT_LOCK_STALE_SECS=5 \ + AGENT_LOCK_CLAIM_STALE_SECS=60 AGENT_LOCK_POLL_SECS=0.1 AGENT_LOCK_MAX_WAIT=30 \ + bash -c ' + source "$1" || exit 70 + clone_fn _lock_cur_token _ct_orig + SF1="$AGENT_LOCK_PATH.steer1" # flag FILE: the cur_token shadow runs in subshells + _lock_cur_token() { + if [ ! -e "$SF1" ] && [ "${_LOCK_HELD:-0}" = 0 ] && [ -n "$_LOCK_CLAIM_TOKEN" ]; then + : > "$SF1" + backdate "$AGENT_LOCK_PATH" 9999 2>/dev/null || true + printf "" + return 0 + fi + _ct_orig "$@" + } + lock_acquire || exit 72 + lock_release || exit 74 + exit 0 + ' _ "$LIB" 2>/dev/null; rc=$? +[ "$rc" = 0 ] && ok "steal read-back failure re-entered wait; a later steal acquired and released rc 0" \ + || bad "steal-readback harness rc=$rc" +grep -q "steal rename completed but read-back" "$LOG" \ + && ok "the steal-path read-back-verification failure lane ran (F2)" \ + || bad "F2 lane never ran (the read-back fault did not land at the steal read-back)" +nstole="$(grep -c "STOLE-BY-CLAIM" "$LOG")" +[ "$nstole" -ge 2 ] && ok "re-stole after the failed read-back (STOLE-BY-CLAIM x$nstole)" \ + || bad "expected >=2 STOLE-BY-CLAIM (won-rename then re-steal), got $nstole" +warn_line="$(grep -n "steal rename completed but read-back" "$LOG" | head -1 | cut -d: -f1)" +acq_line="$(grep -n "ACQUIRED " "$LOG" | tail -1 | cut -d: -f1)" +if [ -n "$warn_line" ] && [ -n "$acq_line" ] && [ "$warn_line" -lt "$acq_line" ]; then + ok "no false-hold: the read-back WARNING preceded the eventual ACQUIRED" +else + bad "ordering: expected the F2 WARNING (line $warn_line) before ACQUIRED (line $acq_line)" +fi +[ -e "$LOCK" ] && bad "lock leftover after the steal-readback walk" || ok "lock released cleanly" +[ -e "$LOCK.next" ] && bad "claim leftover after the steal-readback walk" || ok "no claim leftover" + echo "== 
Test 33: TERM mid-claim — the trap deletes the claim (token-checked), no 98, no ageout penalty ==" # (a) main: claimant paused inside its claim window (at the touch), TERM'd. # The trap must delete OUR claim, run the discovery read (miss: the ghost is @@ -2116,9 +2169,11 @@ rm -f "$LOCK" "$LOCK.next" # blocker is most naturally a pwsh FileShare.Read holder, so the interop # suite owns that test (on POSIX, unlink never blocks on open handles and # the lane is unreachable). -# * lock_acquire's read-back-verification failure lane needs fault injection -# to make a winning create read back wrong; it is defence in depth (see the -# ACQUIRE VERIFICATION header section), not suite-covered. +# * lock_acquire's read-back-verification failure lanes (defence in depth; see +# the ACQUIRE VERIFICATION header section) are covered via _lock_cur_token +# fault injection: the create-path lane (create won, read-back wrong) by +# Test 32, the steal-path lane (F2 — rename-over won, read-back wrong) by +# Test 32b. echo echo "==== RESULT: $PASS passed, $FAIL failed (fan-out: $GCL_MODE) ====" From c762899aa99597f343f9c39bce5eca4d7099b82f Mon Sep 17 00:00:00 2001 From: Ben Toner Date: Wed, 17 Jun 2026 03:19:24 +1000 Subject: [PATCH 13/76] AGENTS.md: record F2 coverage addition (Test 32b, 19a28fd) and hunt restart on final tree --- AGENTS.md | 27 +++++++++++++++++++++------ 1 file changed, 21 insertions(+), 6 deletions(-) diff --git a/AGENTS.md b/AGENTS.md index 1a9f4ae..c9186db 100644 --- a/AGENTS.md +++ b/AGENTS.md @@ -110,13 +110,28 @@ don't cap review rounds for cost; a wrong fix that resurfaces is worse than slow subagent + Codex); impl reviewed clean by Claude + Codex; local unit suite 207/0. See `.plans/2026-06-17-ci-stress-test31a-flake-plan.md`. Real proof pending: CI under load. 
-## Hunt status (as of 2026-06-17 ~02:30 local) -- Test 31(a) FIXED (51a1753) via the full formal loop; clean_count reset to 0 and the - `both`/load=2 hunt RESUMED toward 50 clean (the prior 16/50 streak was on pre-fix code, - so it does not count — we want 50 clean on the FIXED tree). Expect more flakes; each is a - fresh loop. Load=2 (4 hogs/4 cores) avoids the 8-hog budget artifacts (Test 21/22a). +## Coverage work (not a flake — Ben asked, 2026-06-17) +- **F2 read-back lane (commit 19a28fd):** a coverage audit (subagent + my code verification) + found the steal-path acquire read-back-verification failure lane uncovered — the stealer + WINS the claim race AND the rename-over (`STOLE-BY-CLAIM` logged, ghost destroyed) but the + post-rename read-back (`git-commit-lock.sh:1171`) reads the wrong token → must re-enter wait, + not false-hold. Its create-path twin (`:1358`) was covered by Test 32; F2 was not. Added + **Test 32b** (deterministic; mirrors Test 32 with the inverse `[ -n "$_LOCK_CLAIM_TOKEN" ]` + gate to land the fault at the steal read-back). Reviewed clean by fresh Claude + Codex; + suite 0-failed; F2 lane empirically exercised. Plan: + `.plans/2026-06-17-ci-stress-test-f2-coverage-plan.md`. Product unchanged (F2 reads correct; + this was regression-exposure, not a bug). Audit also flagged LOWER-priority gaps left for + Ben: A2/G2 (a non-file appearing AT the lock path mid-steal — `CLAIM-ABORT (wrong-type)` / + `(rename-refused)`), and that feeder-#3/blocked-unlink legs are Windows+pwsh-only. + +## Hunt status (as of 2026-06-17 ~03:20 local) +- The `both`/load=2 hunt reached **40/50 clean** on the post-31(a)-fix tree (810ee41) with + ZERO failures, then I gracefully STOPped it to fold in the Test 32b coverage addition. + Restarted at **0/50 on the final tree 19a28fd** (with Test 32b) — a test-only change resets + the streak per the "50 clean on the current tree" rule, so the contiguous-50 is measured on + the final suite. 
Load=2 (4 hogs/4 cores) avoids the 8-hog budget artifacts (Test 21/22a). - To resume after any halt: `cd .agent-testing && rm -f clean_count sentinel STOP && STRESS_KIND=both STRESS_LOAD=2 TARGET=50 bash ./driver.sh` (background). First verify no stray dispatcher + current HEAD (see Process hygiene). - THREE flakes fixed & pushed this session: Test 17d (58c3741), interop Test 5 (06c6d8e), - Test 31(a) (51a1753). + Test 31(a) (51a1753). Plus one coverage addition: Test 32b / F2 (19a28fd). From 9438da0c85040ce4855f515e5fc976504e356dea Mon Sep 17 00:00:00 2001 From: Ben Toner Date: Wed, 17 Jun 2026 12:16:51 +1000 Subject: [PATCH 14/76] docs: add failure-modes design map for scope decisions A decision-support document classifying every failure mode into three robustness tiers (correctness / best-effort-in-envelope / out-of-scope), each grounded in product code + tests with file:line citations, with a recommendation on whether it should be an in-scope guarantee. Co-Authored-By: Claude Opus 4.8 (1M context) --- docs/failure-modes.md | 685 ++++++++++++++++++++++++++++++++++++++++++ 1 file changed, 685 insertions(+) create mode 100644 docs/failure-modes.md diff --git a/docs/failure-modes.md b/docs/failure-modes.md new file mode 100644 index 0000000..199e9da --- /dev/null +++ b/docs/failure-modes.md @@ -0,0 +1,685 @@ +# git-commit-lock: failure-mode map and scope decisions + +**Status:** decision-support document. For each failure mode it states the +tool's *current* behavior (grounded in the product code and tests), classifies +it into one of three robustness tiers, and recommends whether it should be an +in-scope guarantee. The owner uses this to deliberately decide, per mode, "yes, +we guarantee this" or "no, out of scope." + +**Sources of truth, in order:** the product code +(`git-commit-lock.sh`, `git-commit-lock.ps1`) and the test suites +(`tests/git-commit-lock.test.sh`, `tests/git-commit-lock.interop.test.sh`, +`tests/git-commit-lock.integration.test.sh`). 
Every claim below cites +`file:line`. The narrative docs (`README.md`, `docs/git-commit-lock.md`) and +the implementation header comments are corroborating, not authoritative — where +this document relies on a header comment it has verified the comment against the +code. (Cited line numbers are against the tree at commit `c762899`; treat them +as anchors, not exact addresses, if the files move.) + +A note on epistemics: the bash file's header (`git-commit-lock.sh:1-426`) is +itself an exhaustive design narrative and the ps1 header +(`git-commit-lock.ps1:41-177`) mirrors it. They are unusually trustworthy as +documentation *because* the tests pin the behaviors they describe. This document +does not re-derive the protocol; it re-classifies it for a scope decision and +flags the boundaries the headers state but a reader might skip. + +--- + +## 1. The core guarantee (what must hold under ANY conditions) + +**Mutual exclusion + detectable failure.** At most one process at a time +believes it holds the lock *and* is right about it. The lock cannot be silently +lost: a holder whose lease was taken from it learns so — `lock_release` returns +**98** and logs a loud WARNING — rather than reporting a serialized commit that +wasn't (`git-commit-lock.sh:1607-1688`; `git-commit-lock.ps1:1700-1845`). The +two reserved failure codes mean the wrapped command was provably *not* run +(96 usage, 97 timeout) or provably *not serialized* (98) +(`git-commit-lock.sh:392-415`). There is no fourth outcome in which two +processes both believe they hold an exclusive lock and both are wrong. + +This is a **lease, not a kernel lock** (`docs/git-commit-lock.md:60-126` +explains why no OS primitive spans bash-on-MINGW and PowerShell/.NET). The +deliberate consequence: a hold longer than the staleness window (default 300s) +*can* be stolen mid-work — "fail-open." That is accepted by design and made +*detectable* (the 98 path), not prevented (`git-commit-lock.sh:213-227`). 
So the +core guarantee is precisely: **no silent lost update.** Liveness (eventual +recovery from any crash) and bounded stalls are best-effort within an operating +envelope (Tier 2), not absolute. + +The integration suite is the end-to-end witness for this guarantee on the real +use case: many workers committing into one repo, audited for "every commit +lands, history linear, no sweep-up, no `index.lock` races, no stolen leases, +clean tree" (`tests/git-commit-lock.integration.test.sh:10-12, 226-283`). + +### The three tiers used throughout + +1. **Correctness guarantee** — must hold under *any* conditions (load, slow FS, + adversarial scheduling): mutual exclusion, no corruption, no silent loss, + eventual recovery. If one of these can break, it is a bug. +2. **Best-effort within a stated envelope** — holds under normal/expected + conditions, degrades gracefully (and *detectably*) under pathological ones. + Everything wall-clock-bounded lives here, because wall-clock bounds depend on + scheduling: timeouts, recovery latency, the diagnostic warnings that depend + on timing. Correctness is preserved; only liveness/latency degrades. +3. **Out of scope** — explicitly not handled; the operating envelope excludes + it. Damage, if any, is bounded and documented. + +--- + +## 2. Summary table + +Legend — **Tier:** 1 correctness / 2 best-effort-in-envelope / 3 out-of-scope. +**Tested:** ✓ deterministic test · ~ load/timing-sensitive or partial · ○ +robust-by-code-but-unverified · S static/grep check · (plat) platform-gated. + +| # | Failure mode | Current behavior | Tier | Tested | Recommendation | +|---|---|---|---|---|---| +| A1 | Clean high contention (N workers, no crashes) | Serialized; no lost update | 1 | ✓ U:166-195, I:227-261/341-386, integ | **In scope.** Keep. | +| A2 | Thundering herd recovering one dead lock | Claim serializes; exactly one steal, zero displacement | 1 | ✓ U:212-346, I:884-1015 | **In scope.** Keep. 
| +| A3 | Many concurrent stealers on one ghost | One O_EXCL claim winner | 1 | ✓ U:1095-1128, I:1017-1088 | **In scope.** Keep. | +| B1 | Holder dies (crash/SIGKILL/power) mid-hold | Lease ages out; stolen after STALE | 1 (recovery) / 2 (latency) | ✓ U:197-210/348-361 | **In scope** (recovery). Latency = Tier 2. | +| B2 | Holder dies mid-CLAIM (trappable: INT/TERM) | Trap deletes claim, token-checked; discovery read | 1 | ✓ U:1857-1928, I:1151-1244 | **In scope.** Keep. | +| B3 | Holder dies mid-claim (untrappable: SIGKILL) | Claim ages out ≤ CLAIM_STALE; rival rename can install unowned lock, recovered ≤ STALE | 2 | ✓ U:1648-1677 (forensics) | **Accept** (residual 5). Bounded, no false success. | +| B4 | Slow but UNCONTENDED holder overruns STALE | Keeps its lock (nothing moved it) | 1 | ✓ U:419-429, I:494-499 | **In scope.** Keep. | +| B5 | Slow CONTENDED holder overruns STALE | Stolen; robbed holder detects at release → 98 | 1 (detection) | ✓ U:387-417, I:460-492 | **In scope.** This *is* fail-open-but-detectable. | +| C1 | Orphaned/stale lock | mtime-stale → stolen via claim | 1 | ✓ U:197-210 | **In scope.** Keep. | +| C2 | Empty lock (crash between create+write) | Empty + stale → stealable | 1 | ✓ U:348-361 | **In scope.** Keep. | +| C3 | Crashed-claimant / empty claim orphan | Ages out ≤ CLAIM_STALE; cleared | 1 (recovery) / 2 (latency) | ✓ U:1130-1154 | **In scope.** Keep. | +| C4 | Leaked claim (unverifiable unlink) | Leaked-token memory keeps ownership discoverable | 1 | ✓ U:1549-1758, U:2013-2164 | **In scope.** Keep. | +| D1 | Atomic rename-over (steal install) | `mv -T` / `File.Move(...,true)` / 5.1 unlink+move | 1 (local FS) | ✓ U:212-346, I:16d S:1141 | **In scope on local FS.** Boundary = D-axis. 
| +| D2 | O_EXCL atomic create | `set -C` redirect / `FileMode.CreateNew` | 1 (local FS) | ✓ throughout | **In scope on local FS.** | +| D3 | Wrong-type at path (dir/symlink/FIFO/dev/socket) | Never stolen/deleted; loud warn; waiters → 97 | 1 (bash + ps1-on-Win) / 2 (ps1-on-POSIX) | ✓ U:818-892/1156-1262, ~(plat) | **In scope.** ps1-on-POSIX residual = accept. | +| D4 | Non-lock CONTENT at path (user file) | Never stolen (content guard); warn | 1 | ✓ U:1034-1076 | **In scope.** Two accepted residuals (§D4). | +| D5 | Case-insensitive FS path collision | Not handled explicitly | 3 | ✗ | **Likely non-issue;** see §D5. Decide. | +| E1 | Network/shared FS (NFS/SMB/9p/Dropbox) | Outside design guarantees (stated) | 3 | ✗ | **Out of scope** (stated). See §E — decide whether to *enforce*. | +| E2 | Multi-host clock skew / NTP jump | Implicitly single-clock; **not** addressed in docs | 3 (and a doc gap) | ✗ | **Out of scope** but UNDER-documented. See §E2. | +| F1 | Disk full (ENOSPC) during create/write | Create fails → wait; torn write ages out | 2/3 | ○ (reasoned, not tested) | **Accept**, document. See §F1. | +| F2 | ENOSPC during LOG write | Swallowed (`|| true`); silent log loss | 2 | ○ | **Accept;** logging is best-effort by design. | +| F3 | Inode / FD exhaustion | Create fails → wait → 97 | 2 | ○ | **Accept**, document. | +| F4 | Read-only / unwritable lock dir or parent | `mkdir -p` best-effort; create fails → wait → 97 | 2 | ○ | **Accept**, document. See §F4. | +| G1 | Lock path = a directory / `$HOME` typo | Never stolen/deleted; loud warn; → 97 | 1 | ✓ U:818-840 | **In scope.** Keep. | +| G2 | Garbage numeric config | Falls back to default + stderr note | 1 | ✓ U:695-703, I:554-608 | **In scope.** Keep. | +| G3 | `run` outside a git repo, no `AGENT_LOCK_PATH` | Refuses (96) | 1 | ✓ U:705-712 | **In scope.** Keep. | +| G4 | `MAX_WAIT ≤ STALE + CLAIM_STALE` (default MW) | Startup warning | 2 | ✓ U:497-522 | **In scope.** Keep. 
| +| H1 | SIGINT/SIGTERM mid-hold | Release + re-raise (143); traps restored | 1 | ✓ U:577-600/1989-2011 | **In scope.** Keep (bash). ps1 = §H. | +| H2 | EXIT-while-holding | Release + chain caller's EXIT trap | 1 | ✓ U:633-648 | **In scope.** Keep. | +| H3 | ps1 process death under `-File` | `PowerShell.Exiting` does NOT fire; relies on stale window | 2 | ○ (limit documented) | **Accept;** `run` path is covered. See §H. | +| I1 | bash⇄pwsh wire/format compatibility | Shared format; token grammar tightened to match | 1 | ✓ I:* throughout | **In scope.** Keep. | +| I2 | Mixed-VERSION tree (old unserialized steal) | Prevention degrades to detection (98); `.dead.*` litter | 3 | ✗ | **Out of scope:** "upgrade both together." Residual 4. | +| J1 | Logging subsystem failure | All log writes `|| true`; 1 MB self-truncate | 2 | ○ | **Accept;** logging never blocks the lock. | +| K1 | Extreme load / CPU oversubscription / slow FS | Correctness holds; wall-clock bounds stretch | 2 | ~ (CI stress) | **Define the envelope.** See §K — the key analytical section. | +| K2 | Internal time budgets (poll, MAX_WAIT, read ladder) | Fixed schedules; tunable | 2 | ✓/~ | **In scope** as Tier-2 envelope. See §K. | + +U = `tests/git-commit-lock.test.sh`, I = `tests/git-commit-lock.interop.test.sh`, +integ = `tests/git-commit-lock.integration.test.sh`. + +--- + +## 3. Per-mode detail + +### A. High contention / thundering herd + +**A1 — Clean contention, no crashes.** N processes race to acquire a free or +held-then-released lock. The acquire loop is one O_EXCL create attempt per poll; +exactly one creator wins, the rest poll and take turns +(`git-commit-lock.sh:1312-1361`). After winning, the acquirer re-reads its own +token (read-back verification, `git-commit-lock.sh:1352-1361`) before claiming +the hold — so even a create that "won" but whose file was concurrently +clobbered does not produce a false hold. 
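
A minimal bash sketch of the create-then-read-back step described above. `try_create` and its arguments are hypothetical names chosen for illustration, not the functions in `git-commit-lock.sh`; the real acquire loop also polls, times out, and guards against wrong-type paths, none of which is shown here.

```shell
#!/usr/bin/env bash
# Sketch only: try_create is a hypothetical helper, not the tool's real code.

# Attempt one exclusive create of lock file $1 containing token $2.
# Returns 0 only if the create won AND line 1 still reads back as our token.
try_create() {
  local lock=$1 token=$2 line1=
  # noclobber (set -C) makes '>' fail when the path already exists, which
  # gives O_EXCL create-or-fail; the subshell keeps the option from leaking.
  ( set -C; printf '%s\n' "$token" > "$lock" ) 2>/dev/null || return 1
  # Read-back verification: claim the hold only if line 1 is our own token.
  { IFS= read -r line1 < "$lock"; } 2>/dev/null || return 1
  [ "$line1" = "$token" ]
}
```

A caller would invoke it once per poll, for example `try_create "$lock_path" "tok.$$" && echo held`, and fall through to waiting on any non-zero return.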
+*Tier 1.* Tested heavily: unit Test 1 (8 rounds × 25 workers at FULL, +`U:166-195`), interop Test 1/Test 6 mixed bash+pwsh (`I:227-261`, the strict +deterministic counter `I:341-386`), and the integration suite's real-commit +swarm. **Recommend: in scope, keep.** This is the tool's whole reason to exist. + +**A2 — Thundering herd recovering one dead lock.** After a holder dies, *every* +waiter judges the same lock stale off the same mtime in the same poll window — +the worst case for displacement. The **claim protocol** is the answer: to steal, +a waiter must first win an O_EXCL claim file (the lock path plus `.next`), re-verify staleness +under the claim, then install by one atomic rename-over +(`git-commit-lock.sh:1070-1218`, the steps narrated at `:82-115`). This +*prevents* the straggler-robs-recovery-winner race rather than detecting and +repairing it. *Tier 1.* Tested: unit Test 2b asserts zero spurious 98s, exactly +one `STOLE-BY-CLAIM` per round, and — via a background sampler — that **no +move-aside `.dead.*` file ever exists** (`U:212-346`); interop Test 16 proves +the same across mixed impls (`I:884-1015`). The header records the unserialized +baseline was probed to displace 5/5 with 4 waiters (`git-commit-lock.sh:233-234`). +**Recommend: in scope, keep — this is a load-bearing correctness property.** + +**A3 — Many concurrent stealers.** Distilled A2: N stealers, one O_EXCL claim +winner, the rest wait and acquire in sequence. *Tier 1.* Tested: unit Test 20 +(`U:1095-1128`), interop Test 16b (one bash claimant vs one ps1 claimant on one +ghost, cross-parsing each other's claim files, `I:1017-1088`). +**Recommend: in scope, keep.** + +> **Load caveat on A2/A3 (see §K):** *correctness* is load-independent (it rests +> on O_EXCL + atomic rename, not timing).
What stretches under load is the +> *latency* to recover, and the *test harness's* ability to set up the race +> deterministically — Test 2b/16 carry heavy sync scaffolding and bounded +> discard-and-retry precisely because a fast waiter can complete an entire steal +> before the harness finishes backdating the ghost (`U:70-104, 285-336`). That +> is a test-harness envelope concern, not a protocol gap. + +### B. Holder death + +**B1 — Crash/SIGKILL/power loss mid-hold.** The lease ages out: once the lock +file's mtime is older than `STALE_SECS`, a waiter steals it. *Recovery is Tier +1; recovery latency is Tier 2* (bounded by STALE + poll cadence under normal +load). Tested via the stale-lock and empty-orphan steals (`U:197-210, 348-361`). +**Recommend: in scope (recovery). Document the latency bound (§K).** + +**B2 — Trappable death mid-claim (INT/TERM).** The EXIT/INT/TERM handlers are +armed at acquire *start*, not at hold, in "claim-window mode" +(`git-commit-lock.sh:1299-1310, 987-997`). A trappable exit while a claim is in +flight runs the token-checked claim deletion (one bounded retry) and a final +discovery read; it never runs lock-release (98) semantics on a *mere claim*. +*Tier 1.* Tested: unit Test 33 — TERM mid-claim deletes our claim, leaves a +*foreign* claim intact, no 98, no ageout penalty (`U:1857-1928`); the matching +ps1 lane is interop Test 16e (`I:1151-1244`). **Recommend: in scope, keep.** + +**B3 — Untrappable death mid-claim (SIGKILL between claim and rename).** +Deliberately **accepted, not prevented** (residual 5, +`git-commit-lock.sh:266-282`). The orphaned claim normally just ages out at +CLAIM_STALE; the rare bad case is a suspended rival's rename installing it as an +*unowned* lock that stalls waiters ≤ STALE before the lease recovers it. Crucial +property: **no false success anywhere** — nobody believes they hold; the only +cost is a bounded stall, same class as B1 at far lower probability. 
The preventing +alternative (a two-rename compare-and-swap) was evaluated and rejected because it +reintroduces crash litter (`git-commit-lock.sh:276-282`). *Tier 2.* Tested for +forensics/recovery via the crashed-leaver leg of Test 31 (`U:1648-1677`). +**Recommend: accept as a documented bounded residual. Do not build the +two-rename CAS** — the cure is worse than the disease and the failure is already +false-success-free. + +**B4 — Slow but uncontended holder.** With no waiter, nothing moves the file; +the token still matches at release; success. *Tier 1.* Tested: unit Test 4c, +interop Test 9 (`U:419-429`, `I:494-499`). **Recommend: in scope, keep** — this +is what stops the lock punishing every slow-but-safe hold. + +**B5 — Slow CONTENDED holder (the fail-open ceiling).** A hold past STALE *with* +a contender gets stolen; the robbed holder detects it at release (file gone, or +a foreign token — both definitive because acquire's read-back proved our token +was at the path) and returns exactly **98** plus a WARNING +(`git-commit-lock.sh:1620-1688`). *Tier 1 for detection.* Tested: unit Test 4b, +interop Test 8 both directions (`U:387-417`, `I:460-492`). **Recommend: in +scope, keep.** This is the deliberate fail-open-but-detectable contract; the +mitigation is operational — "commits must be fast" (the golden rule, +`docs/git-commit-lock.md:433-458`), and raise STALE for a genuinely slow hold. + +### C. Orphaned / stale locks and claims + +**C1/C2 — Stale or empty lock.** Staleness is judged by the lock file's own +mtime; a lock older than STALE and *lock-shaped* (empty, or line 1 starts +`tok.`) is stealable (`git-commit-lock.sh:1408-1446`). The empty case is the +crash-between-create-and-write orphan and is explicitly stealable. *Tier 1.* +Tested: Test 2 (stale), Test 3 (empty orphan regression) (`U:197-210, 348-361`). 
+**Recommend: in scope, keep.** + +**C3 — Crashed-claimant / empty-claim orphan.** A claim older than CLAIM_STALE +(default 60s; claims are normally held for ms) is cleared by any waiter, which +re-races the claim create (`git-commit-lock.sh:1228-1267`). A crashed claimant +therefore delays only *steals*, only by ≤ the claim window; a free lock path is +never blocked by a claim. *Recovery Tier 1, latency Tier 2.* Tested: Test 21 +(aged foreign claim and empty claim both age out and recovery completes, +`U:1130-1154`). **Recommend: in scope, keep.** + +> **Test 21's `≤20s` latency assertion is Tier 2, not Tier 1.** `U:1144` asserts +> wall-clock recovery `≤20s` with STALE=1, CLAIM_STALE=2, MAX_WAIT=30. The +> *protocol* recovers correctly regardless; the 20s number is a generous +> envelope bound that a sufficiently oversubscribed runner (e.g. 8 CPU hogs on a +> 2-core box under the stress wrapper) can blow without any protocol defect. +> This is exactly the kind of bound §K says to treat as a test-harness envelope: +> if it flakes under extreme artificial load, **relax the test's bound or scope +> the stress level — do not harden the code.** + +**C4 — Leaked claim.** A few exits must leave a claim behind without a verifiable +unlink (an unreadable claim; an unlink blocked by a foreign handle — exactly +three feeders, `git-commit-lock.sh:138-157`). These append the attempt token to +an in-process **leaked-token memory**. While non-empty, every poll (and a pass +at release/timeout) also reads the lock's line 1: a listed token there means a +rival's rename installed *our* leaked claim as the lock → adopt the hold, or, at +release, recognise our real hold was displaced, clean the leaked file +best-effort, and report 98. The result is structural: **no process inside an +acquire/hold/release arc can leave an *unowned* lock** (per-attempt tokens make +the discovery read conclusive). 
*Tier 1.* Tested extensively: Test 31 (the four +leaked lanes, including a real Windows no-delete-share feeder), Test 35 +(release-time cleanup of a leak installed over a held hold → 98), Test 36 +(inconclusive-read keeps the entry) (`U:1549-1758, 2013-2164`); ps1 parity in +interop Test 16e. **Recommend: in scope, keep.** This is the most intricate +machinery in the tool and the most thoroughly tested. + +### D. Filesystem semantics the protocol depends on + +These are the **load-bearing FS assumptions**. Where one does not hold, that is a +real robustness boundary, not a bug to fix. + +**D1 — Atomic rename-over.** The steal installs by replacing the lock in one +`rename(2)` with no path-absent window. bash uses GNU `mv -T` where available, +probed once, with a guarded `[ -d ]` + bare-`mv` fallback on BSD/macOS +(`git-commit-lock.sh:954-979`); pwsh 7 uses the 3-arg `File.Move(src,dst,true)`, +**Windows PowerShell 5.1 has no such overload** and falls back to unlink-then- +2-arg-Move (`git-commit-lock.ps1:941-982`). `File.Replace` is *deliberately +never used* (throws on read-only dest; partial-failure states) — pinned by a +static grep in interop Test 16d (`I:1141-1149`). **Boundary:** atomic-replace +rename is guaranteed on local POSIX FS and NTFS (probe R1: 400 replaces, zero +absent reads, `git-commit-lock.sh:380-382`); it is *not* guaranteed on some +network filesystems (see §E). The 5.1 unlink+move lane has a real absent window, +making it the one engine where a rival's create can win the recovered path — +documented as a fairness loss, never a clobber (`docs/git-commit-lock.md:471-476`). +*Tier 1 on local FS.* **Recommend: in scope on local FS; the network-FS boundary +is §E.** + +**D2 — O_EXCL atomic create.** `set -C` noclobber redirect (bash) / +`FileMode.CreateNew` with `FileShare.ReadWrite|Delete` (ps1, +`git-commit-lock.ps1:650-670`). Atomic create-or-fail on local POSIX and NTFS; +exactly one creator wins. 
*Tier 1 on local FS.* **Recommend: in scope on local +FS.** Boundary: O_EXCL is the classic NFS weak spot (§E). + +**D3 — Wrong-type object at the lock or claim path.** A directory, symlink, FIFO, +socket, or device at the path is **never stolen or deleted**. bash has a +pre-create type guard (`[ -f ] && ! [ -L ]`) plus a per-poll wrong-type +classifier with two-consecutive-poll confirmation to survive Windows +delete-pending ghosts (`git-commit-lock.sh:1322-1327, 1518-1570`); the same +guards apply to the claim path with independent per-path warn-once state +(`:1458-1487`). The FIFO case is *why the pre-create guard is mandatory*: a +noclobber `>` onto a FIFO blocks in `open(2)` before any timeout logic — a hang, +not a warning. *Tier 1 on bash, and on ps1-on-Windows.* Tested: Test 17 +(dir/symlink/FIFO at lock path), Test 22 (claim path), Test 17d (churn must not +false-warn) (`U:818-892, 1156-1262, 894-1032`). + +> **The one real D3 boundary — ps1 on POSIX (Tier 2, accepted).** The .NET API +> exposes no portable type bit for FIFO/device/socket on Unix; they stat as size +> 0 and take the **empty-orphan steal lane** (lock path) or empty-claim clear +> lane (`git-commit-lock.ps1:62-78, 520-525`; `docs/git-commit-lock.md:215-222`). +> Damage is capped at the one misconfigured inode (consumed by the rename). This +> is an **unsupported configuration** (ps1 is Windows-only; POSIX runs it solely +> as cross-impl protocol verification, `README.md:91-95`). **Recommend: accept, +> as documented.** Closing it would need a `stat(2)` shell-out the port avoids; +> not worth it for an unsupported config. + +**D4 — Non-lock CONTENT at the path.** An age-gated content guard steals only +empty or `tok.`-prefixed line-1 content; a real user file at a typo'd path +survives forever (`git-commit-lock.sh:1411-1444`). *Tier 1.* Tested: Test 18 +(user file untouched; sub-prefix torn write `to` never stolen; `tok.`-prefixed +torn write *is* stolen) (`U:1034-1076`). 
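
The content guard can be illustrated with a short bash sketch. `is_lock_shaped` is a hypothetical name, and the sketch covers only the shape test (empty, or line 1 prefixed `tok.`); the age gate and the type guards the real code applies first are assumed to have run already.

```shell
#!/usr/bin/env bash
# Sketch only: is_lock_shaped is a hypothetical helper, not the tool's real code.

# Returns 0 if regular file $1 is "lock-shaped": empty, or line 1 starts "tok.".
is_lock_shaped() {
  local path=$1 line1=
  [ -f "$path" ] || return 1      # missing or not a regular file: never lock-shaped
  [ -s "$path" ] || return 0      # empty file: the crash-between-create-and-write orphan
  # read returns non-zero on a final line without a newline but still fills line1.
  { IFS= read -r line1 < "$path"; } 2>/dev/null || true
  case $line1 in
    tok.*) return 0 ;;            # wire-format prefix present
    *)     return 1 ;;            # user content, or a torn write shorter than the prefix
  esac
}
```

Under this sketch a file holding only `to` is not lock-shaped, while a torn write that got as far as `tok.` is, matching the two Test 18 cases.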
**Two accepted residuals** make the +guarantee precise (`git-commit-lock.sh:298-311`): (a) a stale **empty** user +file is indistinguishable from the crash orphan and *is* stolen; (b) a stale +user file whose line 1 happens to start `tok.` passes the wire test and *is* +stolen. Both are deliberate (a fuller shape check buys near-zero protection for a +harder-bound wire format). **Recommend: in scope, keep, with the two residuals +documented** (already are). + +**D5 — Case-insensitive filesystem.** Not handled explicitly. The lock and claim +paths differ only by the `.next` suffix (the bare lock path vs the same path plus `.next`), which never +collide under case folding, and the token content is case-exact regardless of FS +case sensitivity. The only theoretical exposure is two *different* configured +`AGENT_LOCK_PATH` values that differ only in case resolving to one file on +NTFS/APFS — but that would be a single shared lock, which is *correct* behavior +(they'd serialize), not a break. *Tier 3 (non-issue).* **Recommend: out of +scope as a non-issue; no action.** (Cheap to add one sentence to the design doc +if desired.) + +### E. Network / shared filesystems and clocks + +**E1 — Network/shared FS (NFS, SMB/CIFS, 9p, Dropbox/OneDrive sync).** The design +doc states this plainly: the repo must live on a **local FS with atomic +create/rename and sane mtimes**; "repos on network or sync-backed storage … are +outside the design's guarantees" (`docs/git-commit-lock.md:122-126`). This is the +honest boundary, because the protocol's *correctness* rests on D1 (atomic +rename-over) and D2 (O_EXCL create), and both are exactly the operations network +filesystems weaken: +- **NFS:** `O_EXCL` create is famously unreliable on older NFS (the client can't + guarantee exclusive create across the network); `rename` atomicity and mtime + granularity vary by version/server.
On such a mount, **D2 can let two creators + both "win"** → two live holders, and the read-back verification + (`:1352-1361`) is the only backstop (it would catch *some* but not all + interleavings). +- **SMB/CIFS:** delete/rename semantics and the no-delete-share handle behavior + differ from both POSIX and local NTFS; mtime resolution and clock source may be + the *server's*, not the client's. +- **Sync folders (Dropbox/OneDrive):** asynchronous replication means the lock + file's existence and content are *not* globally consistent — two machines can + both create "the" lock locally before sync reconciles. Fundamentally broken; + not a tunable. + +*Tier 3 (out of scope, stated).* Untested (CI runs local FS only). **Recommend: +keep out of scope — but consider making it harder to *fall into* accidentally.** +The current failure mode on a bad FS is *silent* (the tool runs, exclusion may +just not hold). Options, in increasing cost: (i) leave as-is, documented — the +default lock lives in `.git`, which is almost always local, so accidental +network use is rare; (ii) a one-line caveat in `README.md` (currently only in the +deeper design doc); (iii) an optional best-effort startup probe of the lock dir's +FS type with a stderr warning on a known-network type (cheap on Linux via +`stat -f`, awkward cross-platform, and inherently incomplete). **My +recommendation: (ii) now** (surface the boundary in the README, where an operator +actually looks), and treat (iii) as optional polish — do *not* try to *support* +network FS. + +**E2 — Multi-host clock skew / NTP jumps / timezone.** *This is the one place +the documentation is genuinely thin, and it deserves a deliberate decision.* +Staleness is mtime-vs-`now` arithmetic (`git-commit-lock.sh:928, 1409`). 
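
The mtime-vs-`now` comparison can be sketched in bash as below. `is_stale` is a hypothetical helper, and two details are assumptions rather than facts taken from the code: that "stale" means strictly older than the window, and that an mtime below the 2000-01-01 floor (946684800) is treated as an untrustworthy reading and never as stale.

```shell
#!/usr/bin/env bash
# Sketch only: is_stale is a hypothetical helper; all values are epoch seconds.

MTIME_FLOOR=946684800   # 2000-01-01T00:00:00Z

# is_stale MTIME NOW STALE_SECS -> returns 0 if the lock looks stale.
is_stale() {
  local mtime=$1 now=$2 stale=$3
  # Assumption: readings below the floor (e.g. a FILETIME-zero transient)
  # are ignored rather than judged stale.
  [ "$mtime" -ge "$MTIME_FLOOR" ] || return 1
  # Assumption: strictly older than the window counts as stale. A negative
  # age (mtime ahead of now) therefore never does.
  [ $(( now - mtime )) -gt "$stale" ]
}
```

Because both operands are epoch seconds, timezone never enters the arithmetic; what matters is only that `mtime` and `now` come from the same clock.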
The +lock file records `host=` (`:519`), which *suggests* cross-host use — +but the staleness math implicitly assumes **the mtime and the comparing +process's clock come from the same time source.** Reasoning from first +principles about what can go wrong: +- On a **single host** (the actual supported case — all contenders share one + checkout, hence one machine), mtime and `now` are the same clock; skew is a + non-issue, and the **mtime floor** (946684800 / 2000-01-01, + `git-commit-lock.sh:925`) already absorbs the only real local clock glitch: + the Windows FILETIME-zero (1601) transient on fresh files + (`docs/git-commit-lock.md:283-293`, probed at 0.04–0.5% of readings). +- A **backward NTP step / large clock correction** on the one host could make a + live lock look stale (premature steal) or a stale lock look fresh (delayed + recovery). The first is the dangerous one — but it degrades into the *already + handled* B5 lane: a premature steal of a still-live hold is detected at release + as 98, never a silent double-commit. So even a local clock jump is + **correctness-safe, liveness-degraded** — Tier 2. +- **Cross-host** use over a shared FS (already E1-out-of-scope) is where skew + would actually bite: host A's mtime compared against host B's `now` with + minutes of skew could steal live locks wholesale. But this only arises *on a + network FS*, which is already excluded. +- **Timezone** is a non-factor: all arithmetic is in epoch seconds + (`git-commit-lock.sh:439-449`, `git-commit-lock.ps1:448-451`), never local + time. + +*Tier 3 for cross-host (rides on E1); Tier 2 for a local NTP jump.* Untested. +**Recommend:** (a) **document explicitly** that the tool assumes a single time +source — i.e. single-host use (the common case) or a shared FS with a single +server clock — and that this is *why* network/multi-host is out of scope; the +current docs imply it but never say "one clock." 
(b) Note the reassuring part: a +*local* clock jump is correctness-safe (degrades to the detected-98 lane), so no +code change is warranted. This is a **doc gap, not a code gap.** + +### F. Resource exhaustion + +**F1 — Disk full (ENOSPC) during a claim/lock create or write.** The create is +one open+write+close in a subshell; if the write fails (ENOSPC), the subshell +fails and the acquirer falls through to wait (`git-commit-lock.sh:1336-1361`, +comment at `:1341-1343`). A created-but-write-failed file is an empty orphan that +ages into the steal lane. A torn write *shorter than `tok.`* (e.g. `to`) is the +accepted residual at `:299-304`: non-empty, non-prefixed → never stolen, loud, +fixed by one manual `rm`. *Tier 2 (degrades to wait/97) / Tier 3 (the torn-write +manual-fix residual).* Reasoned from code, **not tested** (no ENOSPC fault +injection). **Recommend: accept and document.** ENOSPC is a host-health failure; +the tool degrades safely (no corruption, no false hold) and the one sharp edge +(sub-`tok.` torn write needing manual `rm`) is already documented. Not worth +fault-injection tests. + +**F2 — ENOSPC during a LOG write.** All log writes end in `|| true` +(`git-commit-lock.sh:561`); a failed log write is silently lost. *Tier 2.* +**Recommend: accept** — logging is best-effort by explicit design (it must never +block or fail the lock). The only downside is reduced post-mortem signal under +disk pressure, which is acceptable. + +**F3 — Inode / FD exhaustion.** Same shape as F1: a create that can't get an +inode fails → wait → eventually 97. The tool holds at most a couple of FDs +briefly. *Tier 2.* Untested. **Recommend: accept, document as host-health.** + +**F4 — Read-only / unwritable lock dir or parent.** `lock_acquire` does a +best-effort `mkdir -p "$(dirname …)"` (`git-commit-lock.sh:1278`); if the dir is +unwritable the create fails every poll and the waiter times out at 97. No +corruption, no false hold. 
A *release* unlink blocked by an unwritable parent +routes to the LEFTOVER lane (`:1699-1711`). *Tier 2.* Untested directly. +**Recommend: accept, document.** A correct, if blunt, outcome (97); arguably an +*earlier, clearer* error would be nicer — optional polish, low priority. + +**F5 — Memory exhaustion.** The scripts allocate trivially (a few shell vars; the +leaked-token list is "almost always empty"). Not a meaningful failure surface. +*Tier 3 / non-issue.* **Recommend: no action.** + +### G. Misconfiguration + +**G1 — Lock path is a directory / `$HOME` / a real file.** Covered by D3/D4: +never stolen or deleted, loud one-time warning, waiters reach 97 +(`U:818-840`). *Tier 1.* The security note (`docs/git-commit-lock.md:530-541`) +bounds the worst case even for a *hostile* repo redirecting the git dir: the tool +only ever creates its own small set of files at its own names and never deletes +recursively. **Recommend: in scope, keep.** + +**G2 — Garbage numeric config.** Each knob is validated at source time; invalid +values fall back to default with a stderr note (`git-commit-lock.sh:481-500`). +The ps1 port *tightens* .NET's permissive parser to bash's grammar so the same +env var configures the same value on both impls — e.g. rejecting `"1e3"`, +trailing newlines, whitespace (`git-commit-lock.ps1:327-359`). *Tier 1.* Tested: +unit Test 13, interop Test 12 (cross-impl parity, including `1e3`/`+2`/`' '`/ +trailing-newline) (`U:695-703`, `I:554-608`). **Recommend: in scope, keep.** + +**G3 — `run` outside a git repo, no `AGENT_LOCK_PATH`.** Refused with 96 — a +CWD-scoped lock would serialize against nobody (`git-commit-lock.sh:1768-1773`). +Sourcing keeps a CWD fallback with a stderr warning and creates no files +(`:570-572`; unit Test 14/14b). *Tier 1.* **Recommend: in scope, keep.** + +**G4 — `MAX_WAIT ≤ STALE + CLAIM_STALE`.** A startup warning, gated on MAX_WAIT +being left at its default (a caller who set it chose the relationship). 
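
A small bash sketch of the G4 gate, under stated assumptions: `warn_if_short_wait` and its fourth argument (whether the caller set MAX_WAIT) are hypothetical, the knobs are taken to be whole seconds, and the warning text is illustrative rather than the tool's actual message.

```shell
#!/usr/bin/env bash
# Sketch only: warn_if_short_wait is a hypothetical helper, not the tool's code.

# warn_if_short_wait MAX_WAIT STALE CLAIM_STALE USER_SET
#   USER_SET is 1 if the caller chose MAX_WAIT explicitly, 0 if it is the default.
warn_if_short_wait() {
  local max_wait=$1 stale=$2 claim_stale=$3 user_set=$4
  # A caller who set MAX_WAIT chose the relationship; stay quiet.
  [ "$user_set" -eq 0 ] || return 0
  if [ "$max_wait" -le $(( stale + claim_stale )) ]; then
    printf 'warning: MAX_WAIT (%s) <= STALE + CLAIM_STALE (%s)\n' \
      "$max_wait" "$(( stale + claim_stale ))" >&2
  fi
}
```

With the documented defaults of STALE=300 and CLAIM_STALE=60, any defaulted MAX_WAIT of 360 or less would trigger the warning in this sketch.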
The +relation is the stacked worst-case recovery: a crashed holder *plus* a crashed +claimant (`git-commit-lock.sh:502-514`). *Tier 2 (advisory).* Tested: Test 8 +exercises the gate and the stacking (`U:497-522`). **Recommend: in scope, +keep.** + +### H. Signals, interrupts, cleanup-on-exit + +**H1/H2 — bash INT/TERM/EXIT.** Handlers armed at acquire start; on a held lock +they release and re-raise the signal (wrapper dies 143, what a watchdog needs); +they restore the caller's pre-acquire traps exactly (`git-commit-lock.sh:1037- +1054, 1002-1023, 780-784`). *Tier 1.* Tested: Test 11 (TERM mid-hold → 143, +released), Test 12c (exit-while-holding chains the caller's EXIT trap), Test 12d/e +(trap restoration), Test 34 (TERM on a *steal*-acquired hold behaves identically +— all acquisition paths funnel through one hold helper) (`U:577-600, 633-693, +1989-2011`). One documented caveat: a SIGINT delivered to the `run` wrapper alone +while its foreground child survives is discarded by bash before any trap +(`git-commit-lock.sh:1030-1036`) — a real Ctrl+C hits the whole group and does +take the path. **Recommend: in scope, keep.** + +**H3 — ps1 process death.** PowerShell has no `trap SIGTERM`. The port substitutes +(a) `try/finally` inside `Lock-Acquire`, which runs on Ctrl+C/pipeline-stop/ +terminating errors and does the claim-window cleanup + discovery read +(`git-commit-lock.ps1:1378, 1672-1683, 1240-1295`); and (b) a `PowerShell.Exiting` +engine-event backstop for a *held* lock (`:704, 1303-1324`). **Documented limit:** +`PowerShell.Exiting` fires under `-Command` and interactively but **NOT under +`-File`**, and not on hard kill / `[Environment]::Exit()` +(`git-commit-lock.ps1:241-245, 1298-1302`). So a held lock abandoned by a +forgetful dot-source `-File` caller relies on the stale window, not the backstop. +The **`run` contract path is unaffected** — it pairs Acquire/Release in +try/finally (`:1928-1979`). 
*Tier 2 (for the dot-source `-File` gap).* The happy +path and trap-time claim cleanup are tested (interop Test 16e); the `-File` +non-firing is documented, not test-pinned. **Recommend: accept the `-File` +backstop gap as documented** — the stale window recovers it, and the supported +`run`/try-finally paths are covered. If you want to close it, the documented +option is handle-based ops (`git-commit-lock.ps1:146-151`), a larger change not +worth it for a forgetful-caller edge. + +### I. Cross-implementation + +**I1 — Wire/format compatibility.** One on-disk format (token line 1, owner line +2, `tok.` prefix as wire contract), one read-retry schedule (8 attempts, +20/40/80/160/320/320/320 ms — verified byte-identical between +`git-commit-lock.sh:670` and `git-commit-lock.ps1:597-629`), one set of release +verdicts, one config grammar. *Tier 1.* The interop suite is built to break this: +mixed bash+pwsh exclusion (T1/T6), each side steals the other's genuine stale +lock (T4/T5), robbed-holder 98 both directions (T8), release-classification +agreement (T11), cross-impl claim staleness clearing (T16c), and a Windows +PowerShell 5.1 smoke lane (T17). **Recommend: in scope, keep — and keep the +interop suite as the guard.** Two independent implementations hammering one lock +is the cheap adversarial verification (`README.md:92-95`). + +**I2 — Mixed-version tree.** Prevention (the claim protocol) holds only when +*all* parties run it; older releases stole with an unserialized move-aside, so a +mixed tree degrades prevention to detection (98) and can leave `.dead.*` litter +current versions don't clean (residual 4, `git-commit-lock.sh:261-265`). *Tier +3.* Untested (would require shipping an old version into the suite). **Recommend: +out of scope; keep the "upgrade both implementations together" deployment note** +(it's in `README.md` and the design doc). Acceptable because the degraded mode is +still *detected* (98), never silent. + +### J. 
Logging subsystem failure + +**J1.** Every log write is `|| true`; the log self-truncates past ~1 MB rather +than rotating (`git-commit-lock.sh:554-562`). A broken log never blocks or fails +the lock. Under a redirected git dir, log *content* (the owner line) is +attacker-influenceable — one-line text spoofing, no execution; the tool itself +writes only its token, owner line, and protocol events, never secrets +(`docs/git-commit-lock.md:543-551`). *Tier 2.* **Recommend: accept** — logging +is best-effort by design, which is the right call for a lock that must keep +working when the disk is full or the log path is bad. The only follow-on: don't +build automation that *trusts* log text from an untrusted repo (already +documented). + +### K. Behavior under extreme load / scheduling pressure, and internal time budgets + +**This is the most important analytical section** — it separates "must hold under +any load" from "holds within an envelope," and tells the owner which apparent +flakes are real gaps vs harness concerns. + +**The clean split: correctness is load-independent; liveness/latency is not.** + +- **Load-independent (Tier 1, must always hold):** mutual exclusion, no silent + lost update, no corruption, eventual recovery. These rest on O_EXCL create + + atomic rename + per-attempt-token discovery — *structural* properties that do + not reference the clock for their *correctness*. The mtime floor + (`:925`) and the read-retry ladder (`:668-684`) exist precisely so that the + one timing-sensitive input (mtime, and transient empty reads) cannot corrupt a + correctness decision: a sub-floor or unsettled reading is treated as "wait," + never "steal." A 25-worker round can go 3s → 41s under load + (`agents/600-claude.md` observation) and *still* lose no update. + +- **Load-dependent (Tier 2, best-effort in an envelope):** every wall-clock bound. + - **Recovery latency** ≈ STALE (+ CLAIM_STALE if a claimant also crashed) + + poll cadence. 
Under CPU oversubscription or a slow FS, polls stretch, so + recovery takes longer — but still completes. + - **`MAX_WAIT` timeout (97):** a waiter on a genuinely squatted/blocked lock + gives up at MAX_WAIT. Under load the *real* time to MAX_WAIT stretches with + poll cadence; the guarantee is "bounded by MAX_WAIT polls," not "exactly + MAX_WAIT seconds." Interop Test 14b explicitly checks that a blocked steal + **never busy-spins past MAX_WAIT** and logs in a damped, bounded way + (`I:746-817`) — a real correctness-adjacent property (no busy-spin), with a + timing-dependent upper bound on the STALE-line count (`[1,8]`). + - **The read-retry ladder (~1.26s budget):** sized to ride out a sub-second + transient (AV scanner handle, probe-F create→write gap). Under pathological + load a transient *longer* than ~1.26s would surface as the unverifiable-2 / + run-1 verdict (a detected, non-corrupting outcome), not a wrong hold. Test + 16c pins that a 0.4s transient is ridden out (`U:784-817`). + +**Internal time budgets, enumerated** (all tunable via `AGENT_LOCK_*`): + +| Budget | Default | Role | Load sensitivity | +|---|---|---|---| +| `STALE_SECS` | 300s | steal threshold (the lease length) | the fail-open ceiling; raise for slow holds | +| `CLAIM_STALE_SECS` | 60s | crashed-claimant ageout | delays only steals | +| `POLL_SECS` | 2s | poll interval | cadence stretches under load | +| `MAX_WAIT` | 420s | total wait cap → 97 | real wall-clock stretches with cadence | +| read-retry ladder | ~1.26s | ride out transient empty reads | a longer transient → detected-2, not wrong hold | +| mtime floor | 2000-01-01 | reject FILETIME-zero | static, not load-sensitive | + +**Judgments on the load-sensitive behaviors — gap, degradation, or harness +concern:** + +1. 
**Protocol correctness under load — (c) non-issue / already guaranteed.** + The stress branch wraps every suite in artificial CPU+disk load + (`tests/with-load.sh`) specifically to widen timing windows and surface + *latency/race flakes*, and the protocol assertions (exclusion, one-steal, + zero-98) are written to hold regardless. **Recommend: nothing to harden.** + +2. **Wall-clock test *bounds* under extreme load — (b) acceptable degradation; + fix the TEST, not the code.** Two examples surfaced by the prior stress + effort (which I verified independently against the code, not adopted): + - *Test 21's `≤20s` recovery-latency assertion* (`U:1144`) and + - *Test 22(a)'s claim-warning timing* (which needs ≥2 blocked polls before + MAX_WAIT to fire the two-consecutive-poll-confirmed warning, `U:1162-1168`), + - and *Test 29's `≥2 CLAIM lines` discriminator* (explicitly given `MAX_WAIT=6` + headroom, `U:1514-1518`). + + Each asserts a wall-clock or poll-count bound that an oversubscribed runner + (e.g. 8 hogs on 2 cores) can blow *without any protocol defect* — the + protocol still recovers/warns correctly, just slower. **Recommend: where these + flake only under extreme artificial load, relax the bound or scope the stress + level for that test; do NOT change product code.** The correctness assertions + in the same tests must stay strict. + +3. **Test-*harness* race setup under load — (c) harness concern, already + mitigated.** Tests 2b/16/16b carry heavy sync scaffolding (`sync_waiting_fresh`, + token-guarded `backdate_ghost`, bounded discard-and-retry, `U:70-151`) because + a fast waiter can complete an entire steal before the harness finishes setting + up the race. This is purely about *constructing* the scenario deterministically; + the protocol is fine. **Recommend: keep the scaffolding; it is the right fix.** + +4. 
**No-busy-spin under a permanently blocked lock — (a) a real property, and + it's guarded.** A failed-steal lane that `continue`d past the timeout+sleep + would busy-spin and never reach 97 — a genuine bug class. Interop Test 14b is + the regression guard (`I:746-817`). **Recommend: keep that test; treat any + regression here as Tier 1.** + +**Net K recommendation:** adopt the explicit envelope — *"correctness holds under +any load; wall-clock recovery/timeout latency scales with poll cadence and +scheduling, bounded by the configured knobs."* Put that sentence in the design +doc. Then audit the suite's wall-clock assertions and **scope each to the load +level it's meant to run at** (the stress branch's extreme `both/8-hog` mode is a +flake-hunting tool, not a contract the product must meet on a 2-core runner). +This is the cleanest way to stop chasing "flakes" that are really the test +asserting a Tier-1 bound on a Tier-2 quantity. + +--- + +## 4. Open questions / recommended scope decisions + +Ordered by how much they need an explicit owner decision. + +1. **Define and document the load/timing envelope (§K) — highest value.** + *Recommendation:* state in `docs/git-commit-lock.md` that correctness + (exclusion, no silent loss, eventual recovery) is load-independent, while all + wall-clock bounds (recovery latency, MAX_WAIT, the read ladder) are + best-effort and scale with scheduling. Then **scope the suite's wall-clock + assertions to a defined load level** so extreme-stress flakes (Test 21's 20s, + Test 22a's warning timing, Test 29's poll count) are recognised as Tier-2 + envelope misses, not product regressions. *This resolves the recurring + "flake" question structurally.* Cost: doc + a test-bound audit; no product + change. + +2. 
**Multi-host / clock-skew assumption is under-documented (§E2) — doc gap, not + code gap.** The tool implicitly assumes a single time source; a *local* NTP + jump is correctness-safe (degrades to the detected-98 lane), and cross-host + skew only bites on a network FS that's already out of scope. *Recommendation:* + add one explicit sentence — "assumes a single clock, i.e. single-host (the + common case) or a shared FS with one server clock" — and the reassurance that + a local clock jump cannot cause a silent double-commit. No code change. + +3. **Network/shared FS is out of scope but fails *silently* if entered (§E1).** + The boundary is correctly stated in the design doc but only there. + *Recommendation:* surface it in `README.md` (where operators look), since the + failure on a bad FS is silent loss of exclusion. Do **not** attempt to + *support* network FS. An optional best-effort FS-type startup probe is + possible but cross-platform-awkward and incomplete — treat as low-priority + polish, not a requirement. + +4. **ps1-on-POSIX FIFO/device residual (§D3) and ps1 `-File` exit backstop gap + (§H3) — accept as documented.** Both are real but confined to an unsupported + config (ps1-on-POSIX) or a forgetful-caller edge that the stale window + recovers. *Recommendation:* no code change; confirm they stay documented. + Reconsider only if PowerShell-on-POSIX ever becomes supported (it isn't, + `README.md:91-95`). + +5. **Untested-but-robust-by-code lanes (resource exhaustion F1/F3/F4, log-write + failure F2/J1).** These degrade safely (wait/97, or silent best-effort log + loss) but have **no fault-injection tests** — they are reasoned-correct, not + verified. *Recommendation:* accept without adding ENOSPC/EMFILE injection + tests (low ROI; the degradation is structurally safe). 
If the owner wants one + belt-and-braces test, the highest-value single one is an **unwritable lock dir + → clean 97** (cheap to write deterministically; F4), since that's the most + likely real-world misconfiguration of the set. + +6. **Mixed-version tree (§I2) and case-insensitive FS (§D5) — out of scope, + confirm.** The first degrades to detection (98), never silent, and is covered + by the "upgrade both together" note. The second is a non-issue. *Recommendation:* + leave both out of scope; optionally one sentence each in the design doc. + +### Things explicitly NOT to do (the design already considered and rejected them) + +- **A background heartbeat** to refresh the lease — would make the tool more than + a single synchronous script; the fail-open-but-detectable lease is the + deliberate alternative (`git-commit-lock.sh:217-218`). +- **A two-rename compare-and-swap** to prevent residual 5 (B3) — reintroduces + crash litter + a sweep, for a failure that is already bounded and + false-success-free (`git-commit-lock.sh:276-282`). +- **`File.Replace` in the ps1 port** — pinned out by interop Test 16d for good + reasons (read-only-dest throw, partial-failure states). +- **Trying to support network/shared filesystems** — the protocol's correctness + rests on local-FS atomic create/rename; this is a boundary to *document*, not + to engineer around. 
From 402dc1e538ac7ca48fcbe7297050470d8069530a Mon Sep 17 00:00:00 2001 From: Ben Toner Date: Wed, 17 Jun 2026 12:33:38 +1000 Subject: [PATCH 15/76] =?UTF-8?q?docs(failure-modes):=20review=20round=201?= =?UTF-8?q?=20=E2=80=94=20sharpen=20core=20guarantee;=20fix=20clock=20dir;?= =?UTF-8?q?=20add=20E3/H4?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Address findings from a foreign-model (Codex) + fresh-Claude review of the failure-modes map, each verified against the code: - Core-guarantee precision (the doc's central thesis): the unconditional safety property is "no silent lost update, given cooperative wrapper unwind", NOT unconditional mutual exclusion. Strict mutual exclusion holds only within the staleness window; beyond it the lease is fail-open-but-detectable. Split Tier 1 into safety (unconditional) vs recovery (lock-shaped orphans only, under a readable-clock / supported-FS envelope; foreign objects at the path are deliberately never auto-removed). - Add H4: hard kill (SIGKILL) or a wrapped command's [Environment]::Exit() while holding bypasses release-time detection -> the explicit boundary of the no-silent-loss guarantee. - Add E3: mtime probe entirely unreadable -> staleness detection disabled; fails SAFE (never steals a lock whose age it cannot establish), recovery lost, loudly announced once per process (both ports). - Fix E2 clock-jump direction (age = now - mtime: a FORWARD jump makes a live lock look stale -> premature steal -> detected-98; a BACKWARD jump delays recovery). - D1: separate the atomic-overwrite engines (mv -T / 3-arg File.Move) from the non-atomic Windows PowerShell 5.1 unlink-then-Move fallback (claim-guarded; fairness loss, never a clobber). 
- Note the leaked-token memory is process-local (ties the "no unowned lock" framing to residual 5); correct the README-location claim for the mixed-version note (it is in the design doc only); minor citation fixes (README quote/line, Test 22a over-attribution). Reviewers confirmed the central thesis (correctness load-independent; only latency degrades) holds against every interleaving attacked on a local FS. Co-Authored-By: Claude Opus 4.8 (1M context) --- docs/failure-modes.md | 203 +++++++++++++++++++++++++++++++----------- 1 file changed, 152 insertions(+), 51 deletions(-) diff --git a/docs/failure-modes.md b/docs/failure-modes.md index 199e9da..d078453 100644 --- a/docs/failure-modes.md +++ b/docs/failure-modes.md @@ -27,24 +27,41 @@ flags the boundaries the headers state but a reader might skip. ## 1. The core guarantee (what must hold under ANY conditions) -**Mutual exclusion + detectable failure.** At most one process at a time -believes it holds the lock *and* is right about it. The lock cannot be silently -lost: a holder whose lease was taken from it learns so — `lock_release` returns -**98** and logs a loud WARNING — rather than reporting a serialized commit that -wasn't (`git-commit-lock.sh:1607-1688`; `git-commit-lock.ps1:1700-1845`). The -two reserved failure codes mean the wrapped command was provably *not* run -(96 usage, 97 timeout) or provably *not serialized* (98) -(`git-commit-lock.sh:392-415`). There is no fourth outcome in which two -processes both believe they hold an exclusive lock and both are wrong. - -This is a **lease, not a kernel lock** (`docs/git-commit-lock.md:60-126` -explains why no OS primitive spans bash-on-MINGW and PowerShell/.NET). The -deliberate consequence: a hold longer than the staleness window (default 300s) -*can* be stolen mid-work — "fail-open." That is accepted by design and made -*detectable* (the 98 path), not prevented (`git-commit-lock.sh:213-227`). 
So the -core guarantee is precisely: **no silent lost update.** Liveness (eventual -recovery from any crash) and bounded stalls are best-effort within an operating -envelope (Tier 2), not absolute. +**No silent lost update — given cooperative wrapper unwind.** The absolute safety +property is that the tool never reports a *serialized* critical section that +wasn't: a holder whose lease was taken from it learns so — `lock_release` returns +**98** and logs a loud WARNING — rather than exiting success +(`git-commit-lock.sh:1607-1688`; `git-commit-lock.ps1:1717-1837`). The two +reserved failure codes mean the wrapped command was provably *not* run (96 usage, +97 timeout) or provably *not serialized* (98) (`git-commit-lock.sh:392-415`). + +Two honest qualifications make this a precise property rather than a slogan, and +both matter for the scope decision: + +- **It is a lease, not a kernel lock** (`docs/git-commit-lock.md:60-126` explains + why no OS primitive spans bash-on-MINGW and PowerShell/.NET). **Strict mutual + exclusion holds only *within* the staleness window** (default 300s): a hold that + overruns it *can* be stolen mid-work — "fail-open" — so two processes can + briefly *both* believe they hold the lock. That overlap is accepted by design + and made *detectable* (the displaced holder's 98 at release), not prevented + (`git-commit-lock.sh:213-227`). At most one process is ever the *legitimate* + holder; a displaced believer finds out at release. So "mutual exclusion" is a + Tier-1 guarantee **within the envelope (commits faster than STALE)**, not an + unconditional one. +- **Detection requires the wrapper to actually reach release.** The 98 path fires + on normal return and on trapped signals. 
It does **not** fire if the held + process is *hard-killed* (SIGKILL) or if the wrapped command terminates the + process abruptly — notably PowerShell `[Environment]::Exit()`, which bypasses + both `Lock-Release` and the `PowerShell.Exiting` backstop + (`git-commit-lock.ps1:221-245`). Such an abrupt exit can report success without + the 98 (see **§H4**). The *next* holder still recovers via staleness, but the + abruptly-exiting one is not warned. Hence the precise statement: **no silent + lost update, provided the wrapper unwinds cooperatively.** + +Liveness (eventual recovery) and bounded stalls are best-effort within an +operating envelope (Tier 2), not absolute — and "recovery" means lock-shaped +orphans get reclaimed, **not** that every bad state self-heals (a foreign object +at the path is deliberately never auto-removed; see the tier split). The integration suite is the end-to-end witness for this guarantee on the real use case: many workers committing into one repo, audited for "every commit @@ -54,8 +71,22 @@ clean tree" (`tests/git-commit-lock.integration.test.sh:10-12, 226-283`). ### The three tiers used throughout 1. **Correctness guarantee** — must hold under *any* conditions (load, slow FS, - adversarial scheduling): mutual exclusion, no corruption, no silent loss, - eventual recovery. If one of these can break, it is a bug. + adversarial scheduling). Two kinds, and the distinction matters: + - **Safety (unconditional):** no corruption, and **no silent lost update** — + the displaced holder detects the loss (98) *provided its wrapper reaches + release* (§1's hard-kill/`Exit()` caveat). Strict **mutual exclusion holds + within the staleness window**; beyond it the lease is + fail-open-but-detectable. + - **Recovery (for lock-shaped stale state, under the supported FS/clock/tooling + envelope):** a crashed holder's stale lock, an orphaned claim, and an empty + crash-orphan are eventually reclaimed. 
This does **not** extend to *foreign* + objects at the path — a directory, a real user file, or non-`tok.` junk + content are deliberately *never* auto-removed; they wait at 97 for an + operator. "Eventual recovery" means lock-shaped orphans self-clear, not that + every bad state self-heals. + If a *safety* property can break, it is a bug; a *recovery* property failing + outside its envelope (e.g. a foreign object, an unreadable clock) is a + classified Tier-2/3 degradation, not a Tier-1 violation. 2. **Best-effort within a stated envelope** — holds under normal/expected conditions, degrades gracefully (and *detectably*) under pathological ones. Everything wall-clock-bounded lives here, because wall-clock bounds depend on @@ -93,6 +124,7 @@ robust-by-code-but-unverified · S static/grep check · (plat) platform-gated. | D5 | Case-insensitive FS path collision | Not handled explicitly | 3 | ✗ | **Likely non-issue;** see §D5. Decide. | | E1 | Network/shared FS (NFS/SMB/9p/Dropbox) | Outside design guarantees (stated) | 3 | ✗ | **Out of scope** (stated). See §E — decide whether to *enforce*. | | E2 | Multi-host clock skew / NTP jump | Implicitly single-clock; **not** addressed in docs | 3 (and a doc gap) | ✗ | **Out of scope** but UNDER-documented. See §E2. | +| E3 | mtime probe unreadable (staleness clock broken) | Warns loudly once; treats as not-stale → safe, recovery disabled → 97 | 2 | ○ | **Accept** — fails safe + announced. See §E3. | | F1 | Disk full (ENOSPC) during create/write | Create fails → wait; torn write ages out | 2/3 | ○ (reasoned, not tested) | **Accept**, document. See §F1. | | F2 | ENOSPC during LOG write | Swallowed (`|| true`); silent log loss | 2 | ○ | **Accept;** logging is best-effort by design. | | F3 | Inode / FD exhaustion | Create fails → wait → 97 | 2 | ○ | **Accept**, document. | @@ -104,6 +136,7 @@ robust-by-code-but-unverified · S static/grep check · (plat) platform-gated. 
| H1 | SIGINT/SIGTERM mid-hold | Release + re-raise (143); traps restored | 1 | ✓ U:577-600/1989-2011 | **In scope.** Keep (bash). ps1 = §H. | | H2 | EXIT-while-holding | Release + chain caller's EXIT trap | 1 | ✓ U:633-648 | **In scope.** Keep. | | H3 | ps1 process death under `-File` | `PowerShell.Exiting` does NOT fire; relies on stale window | 2 | ○ (limit documented) | **Accept;** `run` path is covered. See §H. | +| H4 | Hard kill / `[Environment]::Exit()` while held | Bypasses release → a displaced holder is unwarned (no 98) | 2 | ~ (I:308-334 indirect) | **Document** the no-silent-loss boundary. See §H4. | | I1 | bash⇄pwsh wire/format compatibility | Shared format; token grammar tightened to match | 1 | ✓ I:* throughout | **In scope.** Keep. | | I2 | Mixed-VERSION tree (old unserialized steal) | Prevention degrades to detection (98); `.dead.*` litter | 3 | ✗ | **Out of scope:** "upgrade both together." Residual 4. | | J1 | Logging subsystem failure | All log writes `|| true`; 1 MB self-truncate | 2 | ○ | **Accept;** logging never blocks the lock. | @@ -240,7 +273,14 @@ rival's rename installed *our* leaked claim as the lock → adopt the hold, or, release, recognise our real hold was displaced, clean the leaked file best-effort, and report 98. The result is structural: **no process inside an acquire/hold/release arc can leave an *unowned* lock** (per-attempt tokens make -the discovery read conclusive). *Tier 1.* Tested extensively: Test 31 (the four +the discovery read conclusive). One scope nuance worth stating, because the +memory is **process-local**: only the leaking process can *adopt* its own +installed claim. If that process exits the arc first — times out (97), releases +cleanly, or dies — *before* adopting, the installed claim becomes an unowned lock +recovered by the ordinary staleness lane, never adopted by another process (this +is exactly residual 5 / §B3). 
Per-attempt-token uniqueness still guarantees that +lock can never be *mistaken* for owned by anyone, so there is **no false +success** — the only cost is a bounded stall. *Tier 1.* Tested extensively: Test 31 (the four leaked lanes, including a real Windows no-delete-share feeder), Test 35 (release-time cleanup of a leak installed over a held hold → 98), Test 36 (inconclusive-read keeps the entry) (`U:1549-1758, 2013-2164`); ps1 parity in @@ -252,21 +292,27 @@ machinery in the tool and the most thoroughly tested. These are the **load-bearing FS assumptions**. Where one does not hold, that is a real robustness boundary, not a bug to fix. -**D1 — Atomic rename-over.** The steal installs by replacing the lock in one -`rename(2)` with no path-absent window. bash uses GNU `mv -T` where available, -probed once, with a guarded `[ -d ]` + bare-`mv` fallback on BSD/macOS -(`git-commit-lock.sh:954-979`); pwsh 7 uses the 3-arg `File.Move(src,dst,true)`, -**Windows PowerShell 5.1 has no such overload** and falls back to unlink-then- -2-arg-Move (`git-commit-lock.ps1:941-982`). `File.Replace` is *deliberately -never used* (throws on read-only dest; partial-failure states) — pinned by a -static grep in interop Test 16d (`I:1141-1149`). **Boundary:** atomic-replace -rename is guaranteed on local POSIX FS and NTFS (probe R1: 400 replaces, zero -absent reads, `git-commit-lock.sh:380-382`); it is *not* guaranteed on some -network filesystems (see §E). The 5.1 unlink+move lane has a real absent window, -making it the one engine where a rival's create can win the recovered path — -documented as a fairness loss, never a clobber (`docs/git-commit-lock.md:471-476`). -*Tier 1 on local FS.* **Recommend: in scope on local FS; the network-FS boundary -is §E.** +**D1 — Steal install: atomic overwrite vs. the 5.1 fallback.** The steal installs +its lock at the path by replacing whatever is there. 
There are two engine classes +and they differ in atomicity — so this row is *not* uniformly "atomic rename": +- **Atomic overwrite (the guaranteed lane):** one `rename(2)`-class replace with + no path-absent window. bash uses GNU `mv -T` where available, probed once, with + a guarded `[ -d ]` + bare-`mv` fallback on BSD/macOS + (`git-commit-lock.sh:954-979`); pwsh 7 uses the 3-arg `File.Move(src,dst,true)` + (`git-commit-lock.ps1:941-982`). Atomic replace is guaranteed on local POSIX FS + and NTFS (probe R1: 400 replaces, zero absent reads, + `git-commit-lock.sh:380-382`); *not* guaranteed on some network FS (§E). +- **Windows PowerShell 5.1 fallback (NOT atomic, but claim-guarded):** 5.1 has no + 3-arg overload, so it unlinks then does a 2-arg `Move` (`git-commit-lock.ps1:941-982`). + This lane has a real path-absent window in which a rival's *create* can win the + recovered path — a **fairness loss, never a clobber** (claim serialization still + admits one stealer; the loser re-polls), documented at + `docs/git-commit-lock.md:471-476`. +`File.Replace` is *deliberately never used* (throws on read-only dest; +partial-failure states) — pinned by a static grep in interop Test 16d +(`I:1141-1149`). *The atomic lane is Tier 1 on local FS; the 5.1 fallback is Tier +1 for safety (no clobber) but gives up rename atomicity (fairness only).* +**Recommend: in scope on local FS; the network-FS boundary is §E.** **D2 — O_EXCL atomic create.** `set -C` noclobber redirect (bash) / `FileMode.CreateNew` with `FileShare.ReadWrite|Delete` (ps1, @@ -367,12 +413,15 @@ principles about what can go wrong: `git-commit-lock.sh:925`) already absorbs the only real local clock glitch: the Windows FILETIME-zero (1601) transient on fresh files (`docs/git-commit-lock.md:283-293`, probed at 0.04–0.5% of readings). -- A **backward NTP step / large clock correction** on the one host could make a - live lock look stale (premature steal) or a stale lock look fresh (delayed - recovery). 
The first is the dangerous one — but it degrades into the *already - handled* B5 lane: a premature steal of a still-live hold is detected at release - as 98, never a silent double-commit. So even a local clock jump is - **correctness-safe, liveness-degraded** — Tier 2. +- A **large local clock correction** on the one host splits by sign, because + staleness is `age = now - mtime` (`git-commit-lock.sh:928, 1409`): a **forward** + jump (now leaps ahead) inflates the computed age, so a *live* lock can look + stale → premature steal; a **backward** jump (NTP steps back) shrinks the age, + so a genuinely *stale* lock can look fresh → delayed recovery. The + forward/premature-steal case is the only worrying one — and it degrades into the + *already handled* B5 lane: a premature steal of a still-live hold is detected at + release as 98 (given cooperative unwind), never a silent double-commit. So even + a local clock jump is **correctness-safe, liveness-degraded** — Tier 2. - **Cross-host** use over a shared FS (already E1-out-of-scope) is where skew would actually bite: host A's mtime compared against host B's `now` with minutes of skew could steal live locks wholesale. But this only arises *on a @@ -389,6 +438,25 @@ current docs imply it but never say "one clock." (b) Note the reassuring part: a *local* clock jump is correctness-safe (degrades to the detected-98 lane), so no code change is warranted. This is a **doc gap, not a code gap.** +**E3 — mtime probe fails entirely (the staleness clock is unreadable).** Distinct +from a *wrong* clock (E2): here the lock file's mtime cannot be read at all. 
Both +ports retry three times on a *present* file, then warn loudly once per process — +bash via `stat -c %Y` / `stat -f %m` / `date -r` (`git-commit-lock.sh:629-645`), +pwsh via `Get-Item.LastWriteTimeUtc` (`git-commit-lock.ps1:531-560`): *"Staleness +detection is BROKEN: stale locks will never be stolen, so a crashed holder wedges +waiters until MAX_WAIT."* The stale check then treats an unreadable mtime as **not +stale** — the floor guard `[ "$mt" -gt 946684800 ]` fails closed to "fresh" +(`git-commit-lock.sh:925-927`). **Safety is preserved**: the tool never steals a +lock whose age it cannot establish, so no premature steal and no corruption — but +**recovery of a genuinely crashed holder is disabled**, and waiters block to +MAX_WAIT (97). *Tier 2 (safety held, recovery lost — and loudly announced).* +Untested (no stat-failure injection). **Recommend: accept and document** — it is a +host/FS-health failure the tool already detects and announces, and it fails *safe* +(no false steal). Fault injection is low-ROI; the loud warning is the right +behavior. This is also the clean reason recovery is a *Tier-1-within-envelope* +property, not unconditional (see the tier split under §1): it presumes a readable +clock. + ### F. Resource exhaustion **F1 — Disk full (ENOSPC) during a claim/lock create or write.** The create is @@ -487,6 +555,30 @@ backstop gap as documented** — the stale window recovers it, and the supported option is handle-based ops (`git-commit-lock.ps1:146-151`), a larger change not worth it for a forgetful-caller edge. +**H4 — Hard process termination / `[Environment]::Exit()` while holding (the +no-silent-loss boundary).** §1's safety guarantee — a displaced holder reports 98 +rather than a false success — relies on the wrapper *reaching its release path*. 
+Two ways that doesn't happen while a lease is held: (a) the held process is +SIGKILL'd (untrappable; no handler runs in either port); (b) the wrapped command +itself ends the process abruptly, the sharpest case being PowerShell +`[Environment]::Exit(n)`, which bypasses `Lock-Release`, the `finally`, *and* the +`PowerShell.Exiting` backstop (`git-commit-lock.ps1:221-245`). If such a process +was *already displaced* (its lease stolen past STALE) and exits **0**, its caller +sees success with no 98 — the one interleaving that defeats "no silent lost +update." Two bounds keep it narrow: a SIGKILL yields a non-zero wait status, so a +caller that checks exit codes does *not* see success; and the `run` contract pairs +acquire/release in `try/finally`, so only a command that *itself* hard-exits the +process (or an external SIGKILL) skips release — a normal-returning or +signal-trapped command always reaches it. The *next* holder still recovers via +staleness; only the abruptly-exiting one is unwarned. *Tier 2 — the residual edge +of the fail-open lease.* Exercised indirectly: interop Test 5 *uses* +`[Environment]::Exit()` to fabricate a no-release orphan, confirming the bypass +(`I:308-334`). **Recommend: document this as the explicit boundary of the +no-silent-loss guarantee**, alongside the "commits must be fast" golden rule — a +command that hard-exits mid-critical-section *after being displaced* is exactly +the fail-open case the STALE budget exists to make rare. No code change closes it +without the handle-based ops the design rejected (§H3). + ### I. Cross-implementation **I1 — Wire/format compatibility.** One on-disk format (token line 1, owner line @@ -499,7 +591,7 @@ lock (T4/T5), robbed-holder 98 both directions (T8), release-classification agreement (T11), cross-impl claim staleness clearing (T16c), and a Windows PowerShell 5.1 smoke lane (T17). 
**Recommend: in scope, keep — and keep the interop suite as the guard.** Two independent implementations hammering one lock -is the cheap adversarial verification (`README.md:92-95`). +is "cheap adversarial verification of the protocol" (`README.md:94`). **I2 — Mixed-version tree.** Prevention (the claim protocol) holds only when *all* parties run it; older releases stole with an unserialized move-aside, so a @@ -507,8 +599,9 @@ mixed tree degrades prevention to detection (98) and can leave `.dead.*` litter current versions don't clean (residual 4, `git-commit-lock.sh:261-265`). *Tier 3.* Untested (would require shipping an old version into the suite). **Recommend: out of scope; keep the "upgrade both implementations together" deployment note** -(it's in `README.md` and the design doc). Acceptable because the degraded mode is -still *detected* (98), never silent. +— currently in the design doc only (`docs/git-commit-lock.md:251-255`), **not** in +`README.md`; surface it there too, where operators actually look. Acceptable +because the degraded mode is still *detected* (98), never silent. ### J. Logging subsystem failure @@ -531,10 +624,15 @@ flakes are real gaps vs harness concerns. **The clean split: correctness is load-independent; liveness/latency is not.** -- **Load-independent (Tier 1, must always hold):** mutual exclusion, no silent - lost update, no corruption, eventual recovery. These rest on O_EXCL create + - atomic rename + per-attempt-token discovery — *structural* properties that do - not reference the clock for their *correctness*. The mtime floor +- **Load-independent (Tier 1 *safety*, must always hold):** no silent lost update + (given cooperative unwind, §1/§H4), no corruption, and strict mutual exclusion + *within the staleness window*. These rest on O_EXCL create + atomic rename + + per-attempt-token discovery — *structural* properties that do not reference the + clock for their *correctness*. 
(Recovery of lock-shaped orphans is also + load-independent in *correctness* — only its latency degrades — but it presumes + a readable clock, §E3, and does not extend to foreign objects, per the tier + split under §1.) The mtime + floor (`:925`) and the read-retry ladder (`:668-684`) exist precisely so that the one timing-sensitive input (mtime, and transient empty reads) cannot corrupt a correctness decision: a sub-floor or unsettled reading is treated as "wait," @@ -582,8 +680,11 @@ concern:** fix the TEST, not the code.** Two examples surfaced by the prior stress effort (which I verified independently against the code, not adopted): - *Test 21's `≤20s` recovery-latency assertion* (`U:1144`) and - - *Test 22(a)'s claim-warning timing* (which needs ≥2 blocked polls before - MAX_WAIT to fire the two-consecutive-poll-confirmed warning, `U:1162-1168`), + - *Test 22(a)'s claim-path warning* — the warning relies on the + two-consecutive-poll confirmation (the mechanism Test 17d pins for the lock + path) having poll *headroom* before MAX_WAIT, which an oversubscribed runner + can starve (`U:1156-1172`); the test asserts the warning fires, not a specific + poll count, - and *Test 29's `≥2 CLAIM lines` discriminator* (explicitly given `MAX_WAIT=6` headroom, `U:1514-1518`). From 57b141854a0d2c180bf5653dce75c3ec25a7500a Mon Sep 17 00:00:00 2001 From: Ben Toner Date: Wed, 17 Jun 2026 12:40:47 +1000 Subject: [PATCH 16/76] =?UTF-8?q?docs(failure-modes):=20review=20round=202?= =?UTF-8?q?=20=E2=80=94=20generalize=20the=20no-silent-loss=20bypass=20cla?= =?UTF-8?q?ss?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Address the round-2 reviews (Codex + fresh Claude), verified against the code: - Codex (blocking): §H4's bypass list was incomplete for bash. 
lock_run runs the wrapped command in the wrapper shell itself (git-commit-lock.sh:1733), so a wrapped `exec` replaces that shell and skips BOTH lock_release and the EXIT trap — the same silent-loss boundary as the pwsh [Environment]::Exit() case. Generalize §H4 and the §1 bullet from an enumeration to the class "termination/replacement without wrapper unwind" (external SIGKILL / bash exec / [Environment]::Exit()), and add the contrast that a plain `exit` is safe (it unwinds: bash EXIT trap, pwsh finally). - Claude (nit): §H4 had attributed try/finally to the bash run path; corrected to bash EXIT trap vs pwsh try/finally. Both rounds confirmed the central thesis holds (no two believed-legitimate holders; no UNdetected lost update on a local FS within the envelope) and that round 1's revisions 1-7 are factually correct and internally consistent. Co-Authored-By: Claude Opus 4.8 (1M context) --- docs/failure-modes.md | 72 ++++++++++++++++++++++++------------------- 1 file changed, 41 insertions(+), 31 deletions(-) diff --git a/docs/failure-modes.md b/docs/failure-modes.md index d078453..5bb0f4f 100644 --- a/docs/failure-modes.md +++ b/docs/failure-modes.md @@ -49,14 +49,16 @@ both matter for the scope decision: Tier-1 guarantee **within the envelope (commits faster than STALE)**, not an unconditional one. - **Detection requires the wrapper to actually reach release.** The 98 path fires - on normal return and on trapped signals. It does **not** fire if the held - process is *hard-killed* (SIGKILL) or if the wrapped command terminates the - process abruptly — notably PowerShell `[Environment]::Exit()`, which bypasses - both `Lock-Release` and the `PowerShell.Exiting` backstop - (`git-commit-lock.ps1:221-245`). Such an abrupt exit can report success without - the 98 (see **§H4**). The *next* holder still recovers via staleness, but the - abruptly-exiting one is not warned. 
Hence the precise statement: **no silent - lost update, provided the wrapper unwinds cooperatively.** + on normal return and on trapped signals. It does **not** fire if the held process + is terminated or *replaced* without unwinding — an external SIGKILL, a bash + `exec` in the wrapped command (which replaces the holding shell, so neither + `lock_release` nor the EXIT trap runs), or PowerShell `[Environment]::Exit()` + (bypasses `Lock-Release`, the `finally`, and the `PowerShell.Exiting` backstop, + `git-commit-lock.ps1:221-245`). A *plain* `exit` is safe — it unwinds. A + non-unwinding exit returning 0 *while displaced* can report success without the + 98 (see **§H4**). The *next* holder still recovers via staleness, but the + abruptly-exiting one is not warned. Hence the precise statement: **no silent lost + update, provided the wrapper unwinds cooperatively.** Liveness (eventual recovery) and bounded stalls are best-effort within an operating envelope (Tier 2), not absolute — and "recovery" means lock-shaped @@ -136,7 +138,7 @@ robust-by-code-but-unverified · S static/grep check · (plat) platform-gated. | H1 | SIGINT/SIGTERM mid-hold | Release + re-raise (143); traps restored | 1 | ✓ U:577-600/1989-2011 | **In scope.** Keep (bash). ps1 = §H. | | H2 | EXIT-while-holding | Release + chain caller's EXIT trap | 1 | ✓ U:633-648 | **In scope.** Keep. | | H3 | ps1 process death under `-File` | `PowerShell.Exiting` does NOT fire; relies on stale window | 2 | ○ (limit documented) | **Accept;** `run` path is covered. See §H. | -| H4 | Hard kill / `[Environment]::Exit()` while held | Bypasses release → a displaced holder is unwarned (no 98) | 2 | ~ (I:308-334 indirect) | **Document** the no-silent-loss boundary. See §H4. | +| H4 | Non-unwinding exit while held (SIGKILL / bash `exec` / `[Environment]::Exit()`) | Skips release → a displaced holder is unwarned (no 98); plain `exit` is safe | 2 | ~ (I:308-334 indirect) | **Document** the no-silent-loss boundary. See §H4. 
| | I1 | bash⇄pwsh wire/format compatibility | Shared format; token grammar tightened to match | 1 | ✓ I:* throughout | **In scope.** Keep. | | I2 | Mixed-VERSION tree (old unserialized steal) | Prevention degrades to detection (98); `.dead.*` litter | 3 | ✗ | **Out of scope:** "upgrade both together." Residual 4. | | J1 | Logging subsystem failure | All log writes `|| true`; 1 MB self-truncate | 2 | ○ | **Accept;** logging never blocks the lock. | @@ -555,29 +557,37 @@ backstop gap as documented** — the stale window recovers it, and the supported option is handle-based ops (`git-commit-lock.ps1:146-151`), a larger change not worth it for a forgetful-caller edge. -**H4 — Hard process termination / `[Environment]::Exit()` while holding (the -no-silent-loss boundary).** §1's safety guarantee — a displaced holder reports 98 -rather than a false success — relies on the wrapper *reaching its release path*. -Two ways that doesn't happen while a lease is held: (a) the held process is -SIGKILL'd (untrappable; no handler runs in either port); (b) the wrapped command -itself ends the process abruptly, the sharpest case being PowerShell -`[Environment]::Exit(n)`, which bypasses `Lock-Release`, the `finally`, *and* the -`PowerShell.Exiting` backstop (`git-commit-lock.ps1:221-245`). If such a process -was *already displaced* (its lease stolen past STALE) and exits **0**, its caller -sees success with no 98 — the one interleaving that defeats "no silent lost -update." Two bounds keep it narrow: a SIGKILL yields a non-zero wait status, so a -caller that checks exit codes does *not* see success; and the `run` contract pairs -acquire/release in `try/finally`, so only a command that *itself* hard-exits the -process (or an external SIGKILL) skips release — a normal-returning or -signal-trapped command always reaches it. The *next* holder still recovers via -staleness; only the abruptly-exiting one is unwarned. 
*Tier 2 — the residual edge -of the fail-open lease.* Exercised indirectly: interop Test 5 *uses* -`[Environment]::Exit()` to fabricate a no-release orphan, confirming the bypass -(`I:308-334`). **Recommend: document this as the explicit boundary of the +**H4 — Process termination/replacement *without wrapper unwind* (the no-silent-loss +boundary).** §1's safety guarantee — a displaced holder reports 98 rather than a +false success — relies on the wrapper *reaching its release path*. The bypass class +is any termination or replacement of the holding process that skips that unwind; +crucially it is **not** triggered by a normal `exit`. The instances: +- **External SIGKILL** — untrappable; no handler runs in either port. +- **bash `exec` in the wrapped command** — `run` executes `"$@"` *in the wrapper + shell itself* (`git-commit-lock.sh:1733`), so an `exec` replaces that shell's + process image and *neither* the trailing `lock_release` *nor* the `EXIT` trap + (`git-commit-lock.sh:1002-1013`, armed at `:1308`) runs. +- **PowerShell `[Environment]::Exit(n)`** — a CLR hard-exit that bypasses + `Lock-Release`, the `finally`, *and* the `PowerShell.Exiting` backstop + (`git-commit-lock.ps1:221-245`). + +The useful contrast: a **plain `exit` is safe** — bash `exit` fires the EXIT trap +(which releases), and a plain `exit` inside the pwsh `run` body unwinds its +`finally` (`git-commit-lock.ps1:1928-1979`). Only *non-unwinding* termination or +replacement escapes. If such a process was *already displaced* (its lease stolen +past STALE) and exits **0**, its caller sees success with no 98 — the one +interleaving that defeats "no silent lost update." What keeps it narrow: an external +SIGKILL yields a non-zero wait status (`128+9`), so a caller checking exit codes does +*not* see success; the leak needs a command that *deliberately* replaces or +hard-exits the process **and** returns 0 **while displaced**. 
The *next* holder +still recovers via staleness; only the abruptly-exiting one is unwarned. *Tier 2 — +the residual edge of the fail-open lease.* Exercised indirectly: interop Test 5 +*uses* `[Environment]::Exit()` to fabricate a no-release orphan, confirming the +bypass (`I:308-334`). **Recommend: document this as the explicit boundary of the no-silent-loss guarantee**, alongside the "commits must be fast" golden rule — a -command that hard-exits mid-critical-section *after being displaced* is exactly -the fail-open case the STALE budget exists to make rare. No code change closes it -without the handle-based ops the design rejected (§H3). +command that replaces/hard-exits the process mid-critical-section *after being +displaced* is exactly the fail-open case the STALE budget exists to make rare. No +code change closes it without the handle-based ops the design rejected (§H3). ### I. Cross-implementation From 534a0073dc5f799575e24b8218686c4e00c6eb8f Mon Sep 17 00:00:00 2001 From: Ben Toner Date: Wed, 17 Jun 2026 17:39:45 +1000 Subject: [PATCH 17/76] ben comments: docs/failure-modes.md --- docs/failure-modes.md | 12 +++++++++--- 1 file changed, 9 insertions(+), 3 deletions(-) diff --git a/docs/failure-modes.md b/docs/failure-modes.md index 5bb0f4f..61a4dce 100644 --- a/docs/failure-modes.md +++ b/docs/failure-modes.md @@ -733,6 +733,8 @@ asserting a Tier-1 bound on a Tier-2 quantity. Ordered by how much they need an explicit owner decision. +agree except where indicated + 1. **Define and document the load/timing envelope (§K) — highest value.** *Recommendation:* state in `docs/git-commit-lock.md` that correctness (exclusion, no silent loss, eventual recovery) is load-independent, while all @@ -760,14 +762,16 @@ Ordered by how much they need an explicit owner decision. possible but cross-platform-awkward and incomplete — treat as low-priority polish, not a requirement. -4. 
**ps1-on-POSIX FIFO/device residual (§D3) and ps1 `-File` exit backstop gap +Don't do the polish, just document. + +1. **ps1-on-POSIX FIFO/device residual (§D3) and ps1 `-File` exit backstop gap (§H3) — accept as documented.** Both are real but confined to an unsupported config (ps1-on-POSIX) or a forgetful-caller edge that the stale window recovers. *Recommendation:* no code change; confirm they stay documented. Reconsider only if PowerShell-on-POSIX ever becomes supported (it isn't, `README.md:91-95`). -5. **Untested-but-robust-by-code lanes (resource exhaustion F1/F3/F4, log-write +2. **Untested-but-robust-by-code lanes (resource exhaustion F1/F3/F4, log-write failure F2/J1).** These degrade safely (wait/97, or silent best-effort log loss) but have **no fault-injection tests** — they are reasoned-correct, not verified. *Recommendation:* accept without adding ENOSPC/EMFILE injection @@ -776,7 +780,9 @@ Ordered by how much they need an explicit owner decision. → clean 97** (cheap to write deterministically; F4), since that's the most likely real-world misconfiguration of the set. -6. **Mixed-version tree (§I2) and case-insensitive FS (§D5) — out of scope, +i'd add test coverage for the various scenarios. It just makes the project easier to maintain and for future users to use if the these sorts of edge cases are actually tested rather than reasoned correct but untested. + +1. **Mixed-version tree (§I2) and case-insensitive FS (§D5) — out of scope, confirm.** The first degrades to detection (98), never silent, and is covered by the "upgrade both together" note. The second is a non-issue. *Recommendation:* leave both out of scope; optionally one sentence each in the design doc. From 959cca90e839af37b0b96f1cd2edd9413678f5fe Mon Sep 17 00:00:00 2001 From: Ben Toner Date: Wed, 17 Jun 2026 17:39:45 +1000 Subject: [PATCH 18/76] Revert "ben comments: docs/failure-modes.md" This reverts commit 534a0073dc5f799575e24b8218686c4e00c6eb8f. 
--- docs/failure-modes.md | 12 +++--------- 1 file changed, 3 insertions(+), 9 deletions(-) diff --git a/docs/failure-modes.md b/docs/failure-modes.md index 61a4dce..5bb0f4f 100644 --- a/docs/failure-modes.md +++ b/docs/failure-modes.md @@ -733,8 +733,6 @@ asserting a Tier-1 bound on a Tier-2 quantity. Ordered by how much they need an explicit owner decision. -agree except where indicated - 1. **Define and document the load/timing envelope (§K) — highest value.** *Recommendation:* state in `docs/git-commit-lock.md` that correctness (exclusion, no silent loss, eventual recovery) is load-independent, while all @@ -762,16 +760,14 @@ agree except where indicated possible but cross-platform-awkward and incomplete — treat as low-priority polish, not a requirement. -Don't do the polish, just document. - -1. **ps1-on-POSIX FIFO/device residual (§D3) and ps1 `-File` exit backstop gap +4. **ps1-on-POSIX FIFO/device residual (§D3) and ps1 `-File` exit backstop gap (§H3) — accept as documented.** Both are real but confined to an unsupported config (ps1-on-POSIX) or a forgetful-caller edge that the stale window recovers. *Recommendation:* no code change; confirm they stay documented. Reconsider only if PowerShell-on-POSIX ever becomes supported (it isn't, `README.md:91-95`). -2. **Untested-but-robust-by-code lanes (resource exhaustion F1/F3/F4, log-write +5. **Untested-but-robust-by-code lanes (resource exhaustion F1/F3/F4, log-write failure F2/J1).** These degrade safely (wait/97, or silent best-effort log loss) but have **no fault-injection tests** — they are reasoned-correct, not verified. *Recommendation:* accept without adding ENOSPC/EMFILE injection @@ -780,9 +776,7 @@ Don't do the polish, just document. → clean 97** (cheap to write deterministically; F4), since that's the most likely real-world misconfiguration of the set. -i'd add test coverage for the various scenarios. 
It just makes the project easier to maintain and for future users to use if the these sorts of edge cases are actually tested rather than reasoned correct but untested. - -1. **Mixed-version tree (§I2) and case-insensitive FS (§D5) — out of scope, +6. **Mixed-version tree (§I2) and case-insensitive FS (§D5) — out of scope, confirm.** The first degrades to detection (98), never silent, and is covered by the "upgrade both together" note. The second is a non-issue. *Recommendation:* leave both out of scope; optionally one sentence each in the design doc. From a5df9d9ae2d2f72199b27f320bc2a09f10d36288 Mon Sep 17 00:00:00 2001 From: Ben Toner Date: Wed, 17 Jun 2026 17:45:36 +1000 Subject: [PATCH 19/76] c converged: docs/failure-modes.md MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Applied Ben's review comments (frozen in 534a0073) to the failure-modes scope-decisions doc: - §4 status: the recommendations are reviewed and accepted, except two overrides. - Network FS (§4.3 / §E1): document-only — surface the boundary in README, do NOT build the FS-type startup probe ("don't do the polish, just document"). - Untested-but-robust lanes (§4.5 / F1-F4, J1): OVERRIDE the prior "accept untested" -> add test coverage. Rationale (Ben): actually-tested edge cases make the project easier to maintain and give future users confidence vs reasoned-correct-but-untested. Propagated to the per-mode F1/F2/F3/F4/J1 entries and their summary-table rows. Disposition check passed (fresh verifier): every comment dispositioned, propagation consistent, no leaked comment text, §4 numbering coherent. 
comment-commit: 534a0073dc5f799575e24b8218686c4e00c6eb8f Co-Authored-By: Claude Opus 4.8 (1M context) --- docs/failure-modes.md | 79 +++++++++++++++++++++++++++---------------- 1 file changed, 49 insertions(+), 30 deletions(-) diff --git a/docs/failure-modes.md b/docs/failure-modes.md index 5bb0f4f..0332055 100644 --- a/docs/failure-modes.md +++ b/docs/failure-modes.md @@ -127,10 +127,10 @@ robust-by-code-but-unverified · S static/grep check · (plat) platform-gated. | E1 | Network/shared FS (NFS/SMB/9p/Dropbox) | Outside design guarantees (stated) | 3 | ✗ | **Out of scope** (stated). See §E — decide whether to *enforce*. | | E2 | Multi-host clock skew / NTP jump | Implicitly single-clock; **not** addressed in docs | 3 (and a doc gap) | ✗ | **Out of scope** but UNDER-documented. See §E2. | | E3 | mtime probe unreadable (staleness clock broken) | Warns loudly once; treats as not-stale → safe, recovery disabled → 97 | 2 | ○ | **Accept** — fails safe + announced. See §E3. | -| F1 | Disk full (ENOSPC) during create/write | Create fails → wait; torn write ages out | 2/3 | ○ (reasoned, not tested) | **Accept**, document. See §F1. | -| F2 | ENOSPC during LOG write | Swallowed (`|| true`); silent log loss | 2 | ○ | **Accept;** logging is best-effort by design. | -| F3 | Inode / FD exhaustion | Create fails → wait → 97 | 2 | ○ | **Accept**, document. | -| F4 | Read-only / unwritable lock dir or parent | `mkdir -p` best-effort; create fails → wait → 97 | 2 | ○ | **Accept**, document. See §F4. | +| F1 | Disk full (ENOSPC) during create/write | Create fails → wait; torn write ages out | 2/3 | ○ → test planned | **Add test** (§4.5) + document. See §F1. | +| F2 | ENOSPC during LOG write | Swallowed (`|| true`); silent log loss | 2 | ○ → test planned | **Add test** (§4.5); logging best-effort, lock unaffected. | +| F3 | Inode / FD exhaustion | Create fails → wait → 97 | 2 | ○ → test planned | **Add test** (§4.5, FD via `ulimit`), document. 
| +| F4 | Read-only / unwritable lock dir or parent | `mkdir -p` best-effort; create fails → wait → 97 | 2 | ○ → test planned | **Add test** (§4.5, highest-value). See §F4. | | G1 | Lock path = a directory / `$HOME` typo | Never stolen/deleted; loud warn; → 97 | 1 | ✓ U:818-840 | **In scope.** Keep. | | G2 | Garbage numeric config | Falls back to default + stderr note | 1 | ✓ U:695-703, I:554-608 | **In scope.** Keep. | | G3 | `run` outside a git repo, no `AGENT_LOCK_PATH` | Refuses (96) | 1 | ✓ U:705-712 | **In scope.** Keep. | @@ -141,7 +141,7 @@ robust-by-code-but-unverified · S static/grep check · (plat) platform-gated. | H4 | Non-unwinding exit while held (SIGKILL / bash `exec` / `[Environment]::Exit()`) | Skips release → a displaced holder is unwarned (no 98); plain `exit` is safe | 2 | ~ (I:308-334 indirect) | **Document** the no-silent-loss boundary. See §H4. | | I1 | bash⇄pwsh wire/format compatibility | Shared format; token grammar tightened to match | 1 | ✓ I:* throughout | **In scope.** Keep. | | I2 | Mixed-VERSION tree (old unserialized steal) | Prevention degrades to detection (98); `.dead.*` litter | 3 | ✗ | **Out of scope:** "upgrade both together." Residual 4. | -| J1 | Logging subsystem failure | All log writes `|| true`; 1 MB self-truncate | 2 | ○ | **Accept;** logging never blocks the lock. | +| J1 | Logging subsystem failure | All log writes `|| true`; 1 MB self-truncate | 2 | ○ → test planned | **Add test** (§4.5, via F2); logging never blocks the lock. | | K1 | Extreme load / CPU oversubscription / slow FS | Correctness holds; wall-clock bounds stretch | 2 | ~ (CI stress) | **Define the envelope.** See §K — the key analytical section. | | K2 | Internal time budgets (poll, MAX_WAIT, read ladder) | Fixed schedules; tunable | 2 | ✓/~ | **In scope** as Tier-2 envelope. See §K. | @@ -469,28 +469,35 @@ ages into the steal lane. A torn write *shorter than `tok.`* (e.g. 
`to`) is the accepted residual at `:299-304`: non-empty, non-prefixed → never stolen, loud, fixed by one manual `rm`. *Tier 2 (degrades to wait/97) / Tier 3 (the torn-write manual-fix residual).* Reasoned from code, **not tested** (no ENOSPC fault -injection). **Recommend: accept and document.** ENOSPC is a host-health failure; -the tool degrades safely (no corruption, no false hold) and the one sharp edge -(sub-`tok.` torn write needing manual `rm`) is already documented. Not worth -fault-injection tests. +injection). **Recommend: document + add a fault-injection test (per §4.5).** ENOSPC +is a host-health failure; the tool degrades safely (no corruption, no false hold) +and the one sharp edge (sub-`tok.` torn write needing manual `rm`) is already +documented. Per Ben's §4.5 decision, add an ENOSPC test where it can be injected +deterministically and portably (e.g. a small dedicated tmpfs/quota); if portable +injection proves impractical, say so in the plan rather than shipping a flaky test. **F2 — ENOSPC during a LOG write.** All log writes end in `|| true` (`git-commit-lock.sh:561`); a failed log write is silently lost. *Tier 2.* -**Recommend: accept** — logging is best-effort by explicit design (it must never -block or fail the lock). The only downside is reduced post-mortem signal under -disk pressure, which is acceptable. +**Recommend: accept + add a test (per §4.5)** — logging is best-effort by explicit +design (it must never block or fail the lock); the only downside is reduced +post-mortem signal under disk pressure. Add a test that an unwritable/failing log +path leaves the lock fully working (the write is swallowed) — this also covers J1. **F3 — Inode / FD exhaustion.** Same shape as F1: a create that can't get an inode fails → wait → eventually 97. The tool holds at most a couple of FDs -briefly. *Tier 2.* Untested. **Recommend: accept, document as host-health.** +briefly. *Tier 2.* Untested. 
**Recommend: document + add a test (per §4.5)** as +host-health — an FD-exhaustion test via `ulimit -n` is the deterministic, portable +one; add inode exhaustion only if it can be injected cleanly. **F4 — Read-only / unwritable lock dir or parent.** `lock_acquire` does a best-effort `mkdir -p "$(dirname …)"` (`git-commit-lock.sh:1278`); if the dir is unwritable the create fails every poll and the waiter times out at 97. No corruption, no false hold. A *release* unlink blocked by an unwritable parent routes to the LEFTOVER lane (`:1699-1711`). *Tier 2.* Untested directly. -**Recommend: accept, document.** A correct, if blunt, outcome (97); arguably an -*earlier, clearer* error would be nicer — optional polish, low priority. +**Recommend: add a test (per §4.5 — the highest-value one).** An unwritable lock +dir → clean 97 is cheap and deterministic to write. A correct, if blunt, outcome +(97); an *earlier, clearer* error would be nicer but is optional polish, low +priority. **F5 — Memory exhaustion.** The scripts allocate trivially (a few shell vars; the leaked-token list is "almost always empty"). Not a meaningful failure surface. @@ -620,11 +627,11 @@ than rotating (`git-commit-lock.sh:554-562`). A broken log never blocks or fails the lock. Under a redirected git dir, log *content* (the owner line) is attacker-influenceable — one-line text spoofing, no execution; the tool itself writes only its token, owner line, and protocol events, never secrets -(`docs/git-commit-lock.md:543-551`). *Tier 2.* **Recommend: accept** — logging -is best-effort by design, which is the right call for a lock that must keep -working when the disk is full or the log path is bad. The only follow-on: don't -build automation that *trusts* log text from an untrusted repo (already -documented). +(`docs/git-commit-lock.md:543-551`). 
*Tier 2.* **Recommend: accept + covered by the +F2 log-failure test (per §4.5)** — logging is best-effort by design, which is the +right call for a lock that must keep working when the disk is full or the log path +is bad. The follow-on (unchanged): don't build automation that *trusts* log text +from an untrusted repo (already documented). ### K. Behavior under extreme load / scheduling pressure, and internal time budgets @@ -733,6 +740,12 @@ asserting a Tier-1 bound on a Tier-2 quantity. Ordered by how much they need an explicit owner decision. +**Status (Ben, 2026-06-17): reviewed and accepted — with two changes marked below.** +Item 3 (network FS) is **document-only**: do not build the FS-type probe. Item 5 is +**overridden** — the untested-but-robust lanes *will* get test coverage (actually-tested +edge cases make the tool more maintainable and give future users confidence), rather than +"accept untested". Every other recommendation is accepted as written. + 1. **Define and document the load/timing envelope (§K) — highest value.** *Recommendation:* state in `docs/git-commit-lock.md` that correctness (exclusion, no silent loss, eventual recovery) is load-independent, while all @@ -754,11 +767,11 @@ Ordered by how much they need an explicit owner decision. 3. **Network/shared FS is out of scope but fails *silently* if entered (§E1).** The boundary is correctly stated in the design doc but only there. - *Recommendation:* surface it in `README.md` (where operators look), since the - failure on a bad FS is silent loss of exclusion. Do **not** attempt to - *support* network FS. An optional best-effort FS-type startup probe is - possible but cross-platform-awkward and incomplete — treat as low-priority - polish, not a requirement. + *Decision (Ben — document-only):* surface the boundary in `README.md` (where + operators look), since the failure on a bad FS is silent loss of exclusion. 
Do + **not** attempt to *support* network FS, and **do not build** the optional + FS-type startup probe — just document. (It would be cross-platform-awkward and + incomplete anyway; Ben: "don't do the polish, just document.") 4. **ps1-on-POSIX FIFO/device residual (§D3) and ps1 `-File` exit backstop gap (§H3) — accept as documented.** Both are real but confined to an unsupported @@ -770,11 +783,17 @@ Ordered by how much they need an explicit owner decision. 5. **Untested-but-robust-by-code lanes (resource exhaustion F1/F3/F4, log-write failure F2/J1).** These degrade safely (wait/97, or silent best-effort log loss) but have **no fault-injection tests** — they are reasoned-correct, not - verified. *Recommendation:* accept without adding ENOSPC/EMFILE injection - tests (low ROI; the degradation is structurally safe). If the owner wants one - belt-and-braces test, the highest-value single one is an **unwritable lock dir - → clean 97** (cheap to write deterministically; F4), since that's the most - likely real-world misconfiguration of the set. + verified. *Decision (Ben — overrides the prior "accept untested"):* **add test + coverage** for these lanes. Rationale: actually-tested edge cases make the + project easier to maintain and give future users confidence, versus + "reasoned-correct but untested." Add deterministic fault-injection tests where + feasible — **unwritable lock dir → clean 97** (F4, cheapest/highest-value and + the most likely real-world misconfig); an **unwritable log path → the lock + still works, the log write is swallowed** (F2/J1); and the **ENOSPC / inode / + FD-exhaustion** lanes (F1/F3) where they can be injected deterministically and + portably (e.g. a small dedicated tmpfs or quota for ENOSPC, `ulimit -n` for + FDs). Flag in the plan any lane that proves genuinely impractical to fault-inject + portably, rather than forcing a flaky test. 6. 
**Mixed-version tree (§I2) and case-insensitive FS (§D5) — out of scope, confirm.** The first degrades to detection (98), never silent, and is covered From 9048400ae6ef92f904e01e62fda46fd7a18aba4b Mon Sep 17 00:00:00 2001 From: Ben Toner Date: Wed, 17 Jun 2026 17:48:10 +1000 Subject: [PATCH 20/76] Plan proposal: guarantees spec + failure-modes follow-ups (await Ben review) --- ...-ci-stress-guarantees-and-coverage-plan.md | 124 ++++++++++++++++++ 1 file changed, 124 insertions(+) create mode 100644 .plans/2026-06-17-ci-stress-guarantees-and-coverage-plan.md diff --git a/.plans/2026-06-17-ci-stress-guarantees-and-coverage-plan.md b/.plans/2026-06-17-ci-stress-guarantees-and-coverage-plan.md new file mode 100644 index 0000000..7b067ce --- /dev/null +++ b/.plans/2026-06-17-ci-stress-guarantees-and-coverage-plan.md @@ -0,0 +1,124 @@ +# Plan proposal: guarantees spec + close the failure-modes follow-ups + +Status: **PROPOSAL — awaiting Ben's review.** No implementation until approved. +This is the action list + proposed workflow Ben asked for after the `/c` pass on +`docs/failure-modes.md` (his comments converged at commit a5df9d9; recorded 534a0073). + +## Where this comes from +`docs/failure-modes.md` is the **analysis / decision-support** doc (current behavior, +3-tier classification, recommendations). Ben has now decided on its §4 (agree, with two +overrides). The follow-ups below turn those decisions into work, and add the new doc Ben +asked for: a **normative spec** ("what we guarantee / what's out of scope") — distinct from +the analysis doc. + +## Action list (requirements / things to do) + +### Bucket 1 — NEW normative guarantees spec (Ben's explicit ask) +- **A1.** Create a normative spec doc — *what the tool guarantees* and *what is out of + scope* — derived from `failure-modes.md`'s tiers but written as a contract, not analysis. 
+ - Guarantees: the Tier-1 **safety** properties (no silent lost update given cooperative + unwind; strict mutual exclusion within the staleness window; no corruption) and the + Tier-1 **recovery** properties (lock-shaped orphans reclaimed), each with their stated + conditions/envelope. + - Out of scope: network/shared FS, multi-host/clock-skew, mixed-version trees, + ps1-on-POSIX, the non-unwinding-exit boundary (§H4) — the documented boundaries. + - Defines the **operating envelope** precisely (the load/timing envelope from §4.1) — the + reference Bucket 4 scopes tests against. + - *Open decision D-a:* location/name — `docs/guarantees.md` (new), or a normative section + inside `docs/git-commit-lock.md`? (Recommend a dedicated `docs/guarantees.md` — a crisp + contract is easier to point users/CI at than a section.) + +### Bucket 2 — Test coverage for the untested-but-robust lanes (§4.5, Ben's override) +Decision (Ben): tested edge cases > reasoned-correct-but-untested. Add deterministic, +**portable**, fault-injection tests; flag any lane that can't be injected portably rather +than shipping a flake. **All test execution via CI** (local runs are banned — they lag +Ben's box). +- **B-F4.** Unwritable lock dir/parent → clean 97 (cheapest, highest-value; `chmod`). +- **B-F2/J1.** Unwritable / failing log path → lock still works, the log write is swallowed. +- **B-F1.** ENOSPC during claim/lock create+write (small dedicated tmpfs or quota). +- **B-F3.** FD exhaustion via `ulimit -n` (portable); inode exhaustion only if cleanly + injectable. +- **B-E3 (candidate).** mtime probe unreadable → staleness-detection-disabled, fail-safe + (no steal), 97 + the once-per-process warning. (Also a ○ untested lane; fits the same + decision — include unless Ben says skip.) 
+- *Open decision D-b:* scope — just the §4.5 set (F1-F4, J1) + E3, or also fold in the two + **deferred F2-audit gaps**: #7 wrong-type object appearing *at the lock path mid-steal* + (A2/G2 — `CLAIM-ABORT (wrong-type)`/`(rename-refused)`), and #8 the Windows-only + blocked-unlink legs? (Recommend: do F4/F2/J1/F3 now; treat F1-ENOSPC, E3, and #7/#8 as a + second tier to confirm.) +- Platform reality: several lanes are POSIX-only (tmpfs, `ulimit`, chmod semantics) — guard + by platform like the existing suite does; Windows-specific lanes (no-delete-share) already + have their own gated tests. + +### Bucket 3 — Documentation gaps (all "document" decisions: §4.1-4.3, §4.6, §I2) +- **C-envelope (§4.1).** Document the load/timing envelope in `docs/git-commit-lock.md`: + "correctness is load-independent; wall-clock bounds (recovery latency, MAX_WAIT, the read + ladder) are best-effort and scale with scheduling." +- **C-clock (§4.2).** One sentence: the tool assumes a single time source (single-host, or a + shared FS with one server clock); a local clock jump is correctness-safe. +- **C-netfs (§4.3).** Surface the network/shared-FS boundary in `README.md` (document-only, + **no** FS-type probe). +- **C-mixedver (§I2).** Add the "upgrade both implementations together" note to `README.md` + (currently design-doc-only). +- **C-misc (§4.6, optional).** One-line each for mixed-version + case-insensitive FS in the + design doc. + +### Bucket 4 — Scope the wall-clock test bounds (§4.1 — the Test 21/22a resolution) +- **S1.** Relax / scope the wall-clock assertions that flake only under extreme artificial + load — **Test 21** (≤20s recovery), **Test 22a** (claim-warning timing), **Test 29** + (≥2-CLAIM poll count) — to the envelope Bucket 1 defines, so the protocol's correctness + assertions in those tests stay strict while the latency/poll-count bounds get headroom (or + are gated to a defined load level). 
*Depends on Bucket 1's envelope.* +- *Open decision D-c:* relax the numbers in place, or split the suite into a + "correctness" tier (always strict) and a "latency/envelope" tier the extreme-stress runs + don't hard-fail on? (Recommend the latter — it makes the envelope explicit and stops + future stress runs re-raising these as "flakes".) + +### Bucket 5 — Branch hygiene (standing, NOT part of this workflow unless wanted) +- The mergeable commits (the 4 test fixes 58c3741/06c6d8e/51a1753/19a28fd + the docs) vs the + **stress-only, do-not-merge** commits (980856b concurrency tweak, b430d73 load wrapper). + When this lands on `main`, cherry-pick the mergeable set and leave the stress scaffolding. + *Open decision D-d:* do this work on `ci-stress` and cherry-pick later, or branch a clean + `failure-modes` off `main` now? (Recommend: keep working on `ci-stress`; cherry-pick at the + end — the stress wrapper is useful for CI-verifying the new tests under load.) + +## Proposed workflow (our usual approach: spec → plan → implement → review) + +Each phase ends with **Claude + Codex review rounds to convergence** and a **Ben gate**. +Test execution is **CI-only** throughout. + +**Phase 1 — Spec.** Write the Bucket-1 guarantees/scope spec + the precise operating +envelope. Review (Claude + Codex) against the code and `failure-modes.md`. → Ben approves the +spec before any implementation. (This is where the new doc Ben asked for gets created.) + +**Phase 2 — Plan.** A concrete implementation plan for Buckets 2-4: per-test injection method +(tmpfs / `ulimit` / chmod) + platform guard + CI wiring; the exact doc edits; the test-bound +scoping approach (per D-c). Include a logging/observability note (what each new test asserts +in the logs). Record in `.plans/`, review (Claude + Codex). → Ben approves the plan. + +**Phase 3 — Implementation.** Build the fault-injection tests (Bucket 2), apply the doc edits +(Bucket 3), scope the wall-clock bounds (Bucket 4). 
Commit incrementally under the +commit-lock. **Verify via CI** (dispatch `tests.yml` on `ci-stress`) — never locally. + +**Phase 4 — Review.** Review the diff (Claude + Codex); run the full suite via CI **under the +stress load wrapper** to confirm (a) the new tests pass and are non-flaky, and (b) the scoped +bounds stop Test 21/22a/29 flaking at extreme load while keeping correctness strict. Iterate +to clean. → Ben's final review. + +**Execution mechanics (open decision D-e):** run the phases by hand (subagent review rounds as +this session has been doing), or drive Phases 3-4 with a Claude Code **Workflow** (multi-agent +fan-out — one agent per test lane, adversarial verify, etc.)? (Recommend: hand-run Phase 1-2; +consider a Workflow for Phase 3-4 if the test count grows. Your call.) + +## Decisions I need from Ben (summary) +- **D-a:** new `docs/guarantees.md` vs a section in the design doc. (rec: new doc) +- **D-b:** test scope — §4.5 set + E3 now, defer #7/#8? (rec: F4/F2/J1/F3 first; F1/E3/#7/#8 second tier) +- **D-c:** scope test bounds by relaxing numbers vs a correctness/envelope test split. (rec: split) +- **D-d:** keep on `ci-stress` + cherry-pick later vs clean branch now. (rec: ci-stress) +- **D-e:** hand-run vs Workflow for Phase 3-4. (rec: hand-run 1-2, decide later for 3-4) + +## Out of scope for this plan +- Anything the design already rejected (heartbeat, two-rename CAS, `File.Replace`, supporting + network FS) — see `failure-modes.md` §4 "Things explicitly NOT to do". +- No product *behavior* changes are implied by any of the above — these are tests + docs + + test-bound scoping. (If a new test surfaces a real product bug, that's a separate loop.) 
From 261744963fa68dae4ce090f2ec025a414b369550 Mon Sep 17 00:00:00 2001 From: Ben Toner Date: Wed, 17 Jun 2026 18:05:29 +1000 Subject: [PATCH 21/76] Plan: lock decisions D-a..e; add Bucket 6 / Phase 1b (load-testing strategy recommendation, Ben f) --- ...-ci-stress-guarantees-and-coverage-plan.md | 96 ++++++++++++------- 1 file changed, 64 insertions(+), 32 deletions(-) diff --git a/.plans/2026-06-17-ci-stress-guarantees-and-coverage-plan.md b/.plans/2026-06-17-ci-stress-guarantees-and-coverage-plan.md index 7b067ce..0bf4445 100644 --- a/.plans/2026-06-17-ci-stress-guarantees-and-coverage-plan.md +++ b/.plans/2026-06-17-ci-stress-guarantees-and-coverage-plan.md @@ -82,40 +82,72 @@ Ben's box). `failure-modes` off `main` now? (Recommend: keep working on `ci-stress`; cherry-pick at the end — the stress wrapper is useful for CI-verifying the new tests under load.) -## Proposed workflow (our usual approach: spec → plan → implement → review) +### Bucket 6 — Principled load-&-matrix testing STRATEGY (Ben "f", 2026-06-17) — RECOMMENDATION DOC, not code +The current load injection (`tests/with-load.sh`: N CPU spin-loops + N disk write/fsync/delete +loops) was thrown together from a few lines of discussion. Ben wants a **considered, +first-principles rethink** — explicitly **not anchored on the existing approach** — whose +**deliverable is a recommendation doc for Ben, NOT an implementation.** Scope: +- **Is the load injection right?** From first principles: which KINDS of load actually stress + *this* tool's timing-critical windows (claim→rename, read-back, discovery, mtime/staleness, + fsync durability, scheduler preemption at critical points)? Are CPU-spin + disk-fsync the + right proxies, or are better mechanisms warranted (cgroup CPU throttling, `taskset`/`nice`, + `ionice`, `stress-ng` stressors, FUSE/FS-latency injection, memory pressure)? Faithfulness, + reproducibility, and calibration (load relative to runner core count). 
+- **Expand the CI matrix** on free public GitHub runners: run the suite across + {OS} × {load level} × {load kind} × {config} in parallel. How many cells is *considered* vs + *blowing it up* — diminishing returns, signal-per-cell, GitHub concurrency limits, a small + per-PR tier vs a larger nightly tier. +- **Get more from EXISTING tests, routinely:** parametrize the fan-out/timing tests across + waiter counts and knob values (STALE / CLAIM_STALE / POLL / MAX_WAIT) so each run exercises + more surface — without adding flakiness. Which tests benefit most. +- **Considered, not maximalist:** principles for choosing the matrix + a routine cadence. +Output: `docs/load-testing-strategy.md` (recommendation). Runs EARLY (Phase 1b) because it +shapes Buckets 2 & 4 and the Phase-2 plan. + +## Workflow (settled: spec → plan → implement → review) Each phase ends with **Claude + Codex review rounds to convergence** and a **Ben gate**. -Test execution is **CI-only** throughout. - -**Phase 1 — Spec.** Write the Bucket-1 guarantees/scope spec + the precise operating -envelope. Review (Claude + Codex) against the code and `failure-modes.md`. → Ben approves the -spec before any implementation. (This is where the new doc Ben asked for gets created.) - -**Phase 2 — Plan.** A concrete implementation plan for Buckets 2-4: per-test injection method -(tmpfs / `ulimit` / chmod) + platform guard + CI wiring; the exact doc edits; the test-bound -scoping approach (per D-c). Include a logging/observability note (what each new test asserts -in the logs). Record in `.plans/`, review (Claude + Codex). → Ben approves the plan. - -**Phase 3 — Implementation.** Build the fault-injection tests (Bucket 2), apply the doc edits -(Bucket 3), scope the wall-clock bounds (Bucket 4). Commit incrementally under the -commit-lock. **Verify via CI** (dispatch `tests.yml` on `ci-stress`) — never locally. 
- -**Phase 4 — Review.** Review the diff (Claude + Codex); run the full suite via CI **under the -stress load wrapper** to confirm (a) the new tests pass and are non-flaky, and (b) the scoped -bounds stop Test 21/22a/29 flaking at extreme load while keeping correctness strict. Iterate -to clean. → Ben's final review. - -**Execution mechanics (open decision D-e):** run the phases by hand (subagent review rounds as -this session has been doing), or drive Phases 3-4 with a Claude Code **Workflow** (multi-agent -fan-out — one agent per test lane, adversarial verify, etc.)? (Recommend: hand-run Phase 1-2; -consider a Workflow for Phase 3-4 if the test count grows. Your call.) - -## Decisions I need from Ben (summary) -- **D-a:** new `docs/guarantees.md` vs a section in the design doc. (rec: new doc) -- **D-b:** test scope — §4.5 set + E3 now, defer #7/#8? (rec: F4/F2/J1/F3 first; F1/E3/#7/#8 second tier) -- **D-c:** scope test bounds by relaxing numbers vs a correctness/envelope test split. (rec: split) -- **D-d:** keep on `ci-stress` + cherry-pick later vs clean branch now. (rec: ci-stress) -- **D-e:** hand-run vs Workflow for Phase 3-4. (rec: hand-run 1-2, decide later for 3-4) +Test execution is **CI-only** throughout (local runs lag Ben's box). + +**Phase 1a — Guarantees spec.** Write `docs/guarantees.md` (D-a) — what we guarantee / what's +out of scope, as a normative contract + the precise operating envelope. Review (Claude + +Codex) against the code + `failure-modes.md`. → Ben gate. + +**Phase 1b — Load-&-matrix testing STRATEGY recommendation (Bucket 6 / Ben "f").** Run a +considered, first-principles process (parallel research agents on distinct facets: the tool's +timing-window→load-type mapping + critique of the current wrapper; CI-matrix design on free +runners; existing-test parametrization), synthesize into `docs/load-testing-strategy.md`, +review (Claude + Codex). 
**Recommendation only — NO implementation.** → Ben reviews; his chosen +recommendations feed Phase 2. Runs early because it shapes Buckets 2 & 4. (1a and 1b are +independent and can run in parallel.) + +**Phase 2 — Plan.** Concrete implementation plan for Buckets 2-4, incorporating Ben's chosen +load/matrix recommendations: per-test injection method (tmpfs / `ulimit` / chmod) + platform +guard + CI wiring; the matrix/parametrization to adopt; exact doc edits; the +correctness/envelope test split (D-c); a logging/observability note. Record in `.plans/`, +review. → Ben gate. + +**Phase 3 — Implementation.** Build the fault-injection tests (Bucket 2, tiered per D-b), apply +the doc edits (Bucket 3), scope the wall-clock bounds + split the tiers (Bucket 4 / D-c), wire +the agreed CI matrix (Bucket 6). Commit incrementally under the commit-lock. **Verify via CI** +(dispatch `tests.yml` on `ci-stress`) — never locally. + +**Phase 4 — Review.** Review the diff (Claude + Codex); run the full suite via CI **across the +agreed matrix** to confirm new tests pass + are non-flaky, the scoped bounds hold, and the +matrix surfaces no new flakes. Iterate to clean. → Ben's final review. Then (D-d) cherry-pick +the mergeable commits to `main`. + +## Decisions (settled 2026-06-17) +- **D-a → new `docs/guarantees.md`** (dedicated normative doc). +- **D-b → accept rec:** F4 / F2-J1 / F3 first tier; F1-ENOSPC, E3, and the deferred F2-audit + gaps (#7 wrong-type-mid-steal, #8 Windows blocked-unlink) as a second tier. +- **D-c → split the suite** into a strict-correctness tier (always enforced) and a + latency/envelope tier (not hard-failed by extreme-stress runs). +- **D-d → keep on `ci-stress`**, cherry-pick the mergeable commits to `main` at the end. +- **D-e → my choice:** hand-run Phases 1-2; decide Phase 3-4 (hand vs Workflow) once the + test/matrix count is known. 
+- **"f" → Bucket 6**, above: a considered, first-principles load-&-matrix testing + **recommendation doc** (not implementation), run early as Phase 1b. ## Out of scope for this plan - Anything the design already rejected (heartbeat, two-rename CAS, `File.Replace`, supporting From 0397aaa1e971974cac6402873fbd7475673f1043 Mon Sep 17 00:00:00 2001 From: Ben Toner Date: Wed, 17 Jun 2026 18:12:36 +1000 Subject: [PATCH 22/76] docs: load-&-matrix testing strategy recommendation (Ben f, Phase 1b draft) --- docs/load-testing-strategy.md | 311 ++++++++++++++++++++++++++++++++++ 1 file changed, 311 insertions(+) create mode 100644 docs/load-testing-strategy.md diff --git a/docs/load-testing-strategy.md b/docs/load-testing-strategy.md new file mode 100644 index 0000000..c7ebd73 --- /dev/null +++ b/docs/load-testing-strategy.md @@ -0,0 +1,311 @@ +# Load & matrix testing strategy — recommendation + +**Status: RECOMMENDATION for Ben's decision — not an implementation.** Produced by a +considered, first-principles process (three parallel research agents — load fidelity, CI +matrix, test parametrization — synthesized and cross-checked against the code), deliberately +**not anchored** on the current `tests/with-load.sh` approach (which was thrown together from a +few lines of discussion). It answers: are we injecting load the right way / of the right +kinds; how to use the free public GitHub runners for a load×config matrix; and how to get more +from the existing tests routinely — while staying **considered, not maximalist**. + +Grounded in `docs/failure-modes.md` (esp. §K and the correctness-vs-liveness split) and the +product/test code. Where it cites a fact about GitHub Actions limits, treat the number as +"current as of writing, confirm against GitHub docs before relying on it." + +--- + +## 0. Headline recommendations (skim) + +1. 
**Reframe load's job.** Correctness here is *load-independent* (O_EXCL + atomic rename + + per-attempt tokens never consult the clock for a correctness decision). So load can't break + exclusion or cause a silent lost update. Load has exactly two jobs: **(J1)** perturb + scheduling so the protocol's multi-syscall sequences get preempted at adversarial points + (race-surfacing), and **(J2)** broaden configs to exercise different code paths. Load + *magnitude* past ~2× CPU oversubscription mostly manufactures *harness wall-clock flakes*, + not bugs. +2. **The biggest race-coverage lever is NOT external load — it's deterministic steering.** The + genuinely dangerous windows are reachable *deterministically* only by the in-process + function-interposition the suite already uses. Invest there first; external load is a + secondary, probabilistic complement for the few windows it can actually move. +3. **Three-tier CI:** a **Required** per-PR gate with **no artificial load** (so a red gate + always means a real correctness bug); a **Nightly** non-blocking tier that adds calibrated + load × kind and the parametrization sweeps, with wall-clock assertions relaxed to warnings; + and an on-demand **Deep sweep** (the current stress design) for the 50-clean hunt. +4. **Fix the injection: calibrate, target, record.** Express load as an *oversubscription + ratio* relative to core count (not an absolute hog count); prefer calibrated mechanisms + (`stress-ng`, Linux cgroup `cpu.max`/`io.max`) over free-running spinners; write a per-run + load-manifest artifact so a flake is reproducible. +5. **Embrace platform asymmetry** instead of a uniform injection layer: steering everywhere + (portable); calibrated latency on the Linux leg only; plain CPU oversubscription as the + macOS/Windows fallback — and record per-leg which regime actually ran. +6. 
**Get more from existing tests** via a *bounded* parametrization of a named handful (waiter + count, fail-open ratio, poll cadence) — with strict correctness assertions kept + config-independent and wall-clock assertions moved to the envelope tier. + +--- + +## 1. What load testing is FOR here (the reframe that drives everything) + +This is **not** a throughput-bound system whose correctness degrades under load. Per +`failure-modes.md` §1/§K, safety/exclusion rest on structural primitives (atomic +create/rename, per-attempt-token discovery) that never reference the clock for a *correctness* +decision. No amount of CPU/IO pressure makes `rename(2)` non-atomic or lets two O_EXCL creates +both win on a local FS. + +So load's honest purpose is narrow: **make the protocol's multi-syscall sequences (which are +not individually atomic) get preempted at adversarial points, so the inter-process +interleavings the code claims to handle are actually exercised** — plus widen the few +genuinely timing-derived decisions (mtime staleness, the FILETIME-zero floor, empty-read +retries). The right metric for a load regime is *"does it raise the probability that process A +is suspended between syscall N and N+1 while process B advances?"* — **not** *"does it consume +the box?"* + +**Direct consequence (the most important single point):** beyond ~2× CPU oversubscription, +more load does not find new correctness bugs — it only stretches wall-clock latency and starts +blowing the suite's *Tier-2* wall-clock assertions (Test 21's ≤20s recovery, Test 22a's +warning timing, Test 29's poll-count), which `failure-modes.md` §K already identifies as +Tier-1-bound-on-a-Tier-2-quantity. The fix for those is to **scope the bound**, not pile on +load. This is why the strategy below puts load in non-blocking tiers and keeps the gate clean. + +--- + +## 2. 
The biggest lever is deterministic steering, not load + +The protocol's scary windows — and whether *external load* can even reach them: + +| Window | Code | Reachable by external load? | +|---|---|---| +| create → read-back verify | `git-commit-lock.sh:1336-1357` | Only probabilistically (1 command-sub wide); deterministically via steering | +| **claim recheck → touch → re-verify → rename** (residual 1/2 — THE delicate path) | `:1092-1168` | Probabilistically via CPU preemption; deterministically only via steering | +| rename-over → read-back (steal install) | `:1168-1179` | Same — steering for determinism | +| **mtime staleness / fail-open boundary (B5)** | `:1408-1410`, `:928` | **Yes** — CPU/IO load stretches cadence and can push a contended holder past STALE → exercises the 98-detect lane. The most realistic "load surfaces a real lane" case. | +| two-poll wrong-type confirmation (ghosts) | `:1518-1567` | **Yes, but mostly the bad way** — oversubscription *starves* the poll headroom → manufactures the Test 22a-style flake rather than finding a bug | +| FILETIME-zero floor (Windows) | `:925`, `:1408` | **No** — a *create-churn* artifact, not load-driven | +| empty-read retry ladder (AV/create→write) | `:668-684` | Realistic trigger is Windows AV/filter-drivers, not synthetic load | + +**Takeaway:** the windows where a *wrong interleaving could actually corrupt state* +(create→readback, claim→rename, rename→readback, release boundary) are reached *deterministically* +only by the in-process function-interposition steering the suite already does (`clone_fn`, +`tests/git-commit-lock.test.sh:127-136`). External load merely raises the background +probability of hitting an interleaving nobody scripted. **So the primary race-coverage +investment is MORE STEERED SCENARIOS** (portable, deterministic, attributable) — e.g. steered +cases that park the claimant between recheck and rename, and between touch and rename, firing a +clearer + rival. 
External load is a *secondary, probabilistic* complement, valuable mainly for +the staleness/fail-open boundary (B5) it can genuinely move. + +A corollary for triage: because external load *cannot* break correctness, a load run that +produces a *correctness* failure is surfacing either (a) a real logic bug in a steering-only +window (high value) or (b) a *test-harness* setup race (`sync_waiting_fresh`/`backdate_ghost` +losing its race under load) — a harness fix, not a code fix. Prefer deterministic mechanisms so +an observed failure is *attributable*. + +--- + +## 3. Fix the load injection: calibrate, target, record + +**Critique of the current `tests/with-load.sh`** (N bare CPU spinners + N `dd … conv=fsync` +create/write/delete loops): it is a *reasonable background-jitter generator* and adequate for +"run the whole suite under generic pressure," but from first principles it is: +- **Uncalibrated / non-reproducible:** `LOAD=N` spinners produce wildly different real + preemption pressure on a 2-core vs 4-core runner, so "we tested at load N" doesn't mean a + fixed thing — violating the reproducible-experiments requirement. +- **Untargeted:** a box-wide hog perturbs *everyone uniformly* (including the rival you wanted + to advance), so it adds jitter but doesn't *bias* the interleaving toward the adversarial + order. The high-value windows need a *scalpel* (slow one syscall in one process), which it + can't do. +- **Blind to two windows:** it can't widen the create→write gap (the lock create is one + redirect, no fsync to delay) and can't *produce* the Windows delete-pending ghost (it churns + unrelated files); its main effect on those is the *poll-starvation false-flake* direction. +- **Self-defeating at high N:** on a 2-core runner it pushes wall-clock far enough to blow the + harness's own timeouts (the workflow already had to raise every step timeout 2–3×) — load + manufacturing churn, not findings. 
+ +**Recommendations:** +- **Express load as an oversubscription ratio `R = stressors / nproc`** (e.g. R ∈ {0, 1, 2}), + not an absolute hog count, so a level is runner-independent. +- **Prefer calibrated mechanisms:** `stress-ng --cpu $((R*nproc)) --cpu-load … --metrics` + (defined, measurable) over bare spinners; on **Linux**, prefer **cgroup throttling** + (`systemd-run --user --scope -p CPUQuota=…` / `io.max`) which gives *deterministic, + reproducible* latency — the right tool for **envelope validation** (a 10% CPU quota means the + same everywhere; "8 hogs" does not). +- **Record a per-run `load-manifest`** artifact next to the suite logs: `{kind, R, nproc, + achieved-slowdown, tool versions, runner os/arch, git sha}`, uploaded on *success too* (you + need the negatives to interpret the positives). Optionally probe achieved slowdown with a + fixed micro-benchmark before/during load. +- **Cap routine load at ~2× oversubscription;** higher R only on the deep-sweep flake-hunt leg + (whose *correctness* assertions stay strict but *wall-clock* assertions are relaxed). + +--- + +## 4. Embrace platform asymmetry (don't build a uniform injection layer) + +The platforms diverge too much for a "uniform" load layer (cgroups & FUSE are Linux-only; +macOS SIP blocks `DYLD_INSERT_LIBRARIES` on system binaries; Windows has neither). Don't fight +it — structure around it and **record which regime ran per leg**: + +- **Deterministic steering** — *everywhere* (portable bash; pwsh equivalent). The real + race-coverage tool. +- **Calibrated latency** (cgroup `cpu.max`/`io.max`; optionally `strace -e inject` to slow one + syscall in one process; a FUSE fsync-delay shim only if window W7 is prioritized) — **Linux + leg only**. +- **CPU oversubscription** (`stress-ng` or the bash-spinner fallback) — the **macOS/Windows** + fallback, uncalibrated; document the asymmetry. 
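A minimal sketch of how the oversubscription ratio and the per-leg regime choice above might be wired together, assuming integer `R` values as in the `R ∈ {0, 1, 2}` example. The function names (`detect_cores`, `stressor_count`, `load_regime`, `write_manifest`) and the one-line JSON layout are illustrative assumptions, not part of `tests/with-load.sh`, and the manifest covers only a subset of the fields listed in §3.

```shell
#!/usr/bin/env bash
# Sketch only: every function name here is hypothetical.

# Report the online core count, with fallbacks for macOS and Windows shells.
detect_cores() {
  if command -v nproc >/dev/null 2>&1; then
    nproc
  elif command -v sysctl >/dev/null 2>&1 && sysctl -n hw.ncpu >/dev/null 2>&1; then
    sysctl -n hw.ncpu
  else
    echo "${NUMBER_OF_PROCESSORS:-1}"
  fi
}

# stressors = R * cores, so a given R means the same pressure on any runner.
stressor_count() {
  local ratio=$1 cores=$2
  echo $(( ratio * cores ))
}

# Map a `uname -s` value to the regime that leg can actually run:
# calibrated latency on Linux, plain CPU oversubscription elsewhere.
load_regime() {
  case "$1" in
    Linux*) echo "calibrated-latency" ;;
    *)      echo "cpu-oversubscription" ;;
  esac
}

# Emit a one-line manifest recording what this leg was asked to run.
write_manifest() {
  local kind=$1 ratio=$2 cores=$3 os=$4
  printf '{"kind":"%s","R":%s,"nproc":%s,"stressors":%s,"regime":"%s","os":"%s"}\n' \
    "$kind" "$ratio" "$cores" \
    "$(stressor_count "$ratio" "$cores")" "$(load_regime "$os")" "$os"
}
```

A leg could then call `write_manifest cpu 2 "$(detect_cores)" "$(uname -s)"` and upload the result next to the suite logs. The same absolute hog count maps to different ratios on different runners (8 hogs is R = 4 on 2 cores but R = 2 on 4 cores), which is why the ratio is recorded rather than the count.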
+
+Low-yield, **avoid:** memory/swap pressure (trivial allocation surface; risks OOM-killing the
+harness), raw disk-bandwidth saturation (doesn't touch metadata-op latency), de-prioritizing
+the background hogs. `ulimit`/inode/FD exhaustion belong to the *fault-injection tests* (the
+§4.5 work), not the timing-load regime.
+
+---
+
+## 5. The three-tier CI structure (the matrix)
+
+The organizing recommendation. It maps directly onto the already-decided correctness/envelope
+test split (D-c).
+
+### Tier R — Required / per-PR (blocking) — KEEP the existing 4 cells, STRIP the load
+| Cell | OS | Engines | Buys |
+|---|---|---|---|
+| R1 | ubuntu | bash + pwsh7 (all suites) | Linux correctness + interop baseline |
+| R2 | macos | bash + pwsh7 (all suites) | BSD `stat`/`mv` lanes (D1/E3) — *only* place these run |
+| R3 | windows (unit leg) | bash (MINGW) | delete-pending ghosts, FILETIME floor |
+| R4 | windows (interop+integration leg) | bash + pwsh7 + **PowerShell 5.1** | the 5.1 non-atomic-fallback path (D1) + real NTFS commit swarm |
+
+This is exactly today's matrix **minus the stress env**. Running it at **`none` load** means it
+only ever asserts Tier-1 correctness — it *cannot* flake on a Tier-2 wall-clock bound, so **a
+red required check always means a real bug.** Target < ~8 min. (Also: flip the concurrency group
+back to `${{ github.workflow }}-${{ github.ref }}` + `cancel-in-progress: true` — the current
+per-run-unique group is a *deep-sweep* setting, which is exactly why the stress branch is marked
+"do NOT merge to main.")
+
+### Tier N — Nightly / scheduled (non-blocking, triaged)
+~6 cells adding load **kind** (cpu / disk / both) at **one** oversubscribed level (R≈2), plus
+the §6 parametrization sweeps. Run with **`GCL_ENVELOPE_TIER=relax`** so the three known
+load-sensitive assertions (Test 21 ≤20s, Test 22a warning, Test 29 poll-count) **downgrade to
+warnings** while correctness assertions stay hard. Example cells: ubuntu×{disk, both, cpu},
+macos×disk, windows×{disk on the interop+5.1 leg — highest-value, both on the unit leg}.
+Auto-file a triaged issue on failure tagged `correctness` (investigate) vs `envelope-flake`
+(expected). macOS gets one harsh cell only (it's the scarce/slow runner); ubuntu absorbs the
+extra kinds (cheapest).
+
+### Tier D — On-demand deep sweep (`workflow_dispatch`, never gates)
+The current stress-branch design *is* this tier — keep its `stress_kind`/`stress_load` inputs
+and per-run-unique concurrency (many parallel dispatches), add `repeat` (run a cell K times)
+and `width` inputs. This is the "50-clean under both/8-hog" hunt: informational, time-boxed by
+choice, never a contract.
+
+**Why this is the linchpin:** keeping artificial load *off the required gate* is what makes the
+gate trustworthy; putting all load in non-blocking tiers with the envelope assertions relaxed is
+what stops load from manufacturing flakes that erode trust. The split needs a small product/test
+change: a `GCL_ENVELOPE_TIER=relax` env that downgrades the wall-clock assertions — nightly/deep
+set it, required never does.
+
+---
+
+## 6. Get more from existing tests: bounded parametrization
+
+Today there are only two coarse knobs: `GCL_TEST_FULL` (global fan-out) and per-case
+hard-coded `AGENT_LOCK_*` values (never swept). Add **one** mechanism — a per-axis sweep over a
+**named handful** of tests (sum the axes, do **not** cross-product):
+
+- **Axis A — waiter/stealer count (highest value):** T2b (frozen at 4), T20, interop T16. Sweep
+  N ∈ {4, 12, 24}. Widens the thundering-herd/claim-serialization and displacement windows that
+  re-running N=4 never will.
+- **Axis B — fail-open ratio (hold ÷ STALE):** a parametrized T4b/T1 variant running hold ≪
+  STALE / hold ≈ STALE / hold > STALE, asserting the *correct verdict per regime* (clean → 0
+  steals; over → exactly one steal + a 98).
+- **Axis C — poll cadence:** {fast 0.05, **default 2s**}. The shipped 2s default is currently
+  never exercised under contention.
+- **Axis D — CLAIM_STALE depth (lower value):** {2, 60} on T21.
+
+**Do not sweep:** round count (keep as the nightly *soak* dial, not a coverage axis), MAX_WAIT
+(timeout-only), the deterministic steered protocol tests (T23–T36 — re-running reruns the same
+steered path), or the integration suite's worker count beyond FULL/REDUCED (it's strict in both
+modes by design and wall-clock-bound by serialized commits).
+
+**Flakiness discipline (critical):** keep correctness assertions **config-independent** — when
+sweeping N, hold STALE ≫ hold so "zero-98 / one-steal" stays a pure correctness statement, and
+**scale MAX_WAIT with N** (more waiters = more serialized turns) so a large-N run doesn't time
+out and *look* like a product failure. Move wall-clock/poll-count assertions to the envelope
+tier. Keep the existing `sync_waiting_fresh`/`backdate_ghost` scaffolding — at higher N it
+matters more.
+
+**Cadence:** per-PR runs the floor point of each axis (today's behavior, deterministic);
+nightly runs the sweeps under a `GCL_TEST_SWEEP=1` gate. The sweep (per-suite fan-out/knobs) is
+*orthogonal* to the OS/leg matrix — compose additively (per-PR = matrix × floor; nightly =
+matrix × sweep), never multiply everything on every PR.
+
+---
+
+## 7. GitHub Actions realities (the real constraints — confirm against current docs)
+
+- **Minutes are free on public repos, but concurrency is the real ceiling.** Free/public
+  accounts cap concurrent jobs on the order of ~20 (with a much smaller macOS sub-limit). A
+  matrix past that **queues** (serialises into waves), it doesn't fail. Design any single
+  triggered workflow to ≤ ~15–20 jobs to run in one wave; the deep sweep intentionally exceeds
+  this and accepts waves.
+- **Runner scarcity ≠ billing:** even free, **macOS runners are scarce/slow (~10× cost-weight),
+  windows ~2×, ubuntu 1×.** Be stingy with macOS cells, liberal with ubuntu.
+- **`strategy.matrix`:** `fail-fast: false` (keep — an OS-specific failure is the signal);
+  `max-parallel` on nightly/deep so a big sweep doesn't starve the required gate of runners;
+  256-job hard cap per workflow (irrelevant at our scale).
+- **Triggers:** required on `pull_request` + `push: main`; nightly on `schedule` (cron,
+  off-peak minute) + `workflow_dispatch`; deep on `workflow_dispatch` only — heavy load never
+  sits in a PR's critical path. Keep `paths-ignore` (`**.md`, `.plans/**`) on required.
+  (Note: `schedule` triggers are auto-disabled after ~60 days of repo inactivity.)
+- **Artifacts:** keep the existing `upload-artifact` (with `include-hidden-files` for the
+  `.git/`-buried lock logs); name uniquely per (os, leg, kind, level) so parallel cells don't
+  collide.
+
+---
+
+## 8. Considered, not maximalist — the decision rule
+
+> **A cell enters the routine matrix (R or N) only if it can surface a bug class no other
+> routine cell can. Otherwise it's a deep-sweep cell, or it doesn't exist.**
+
+- Cap the routine matrix: **R ≤ 4, N ≤ ~8.** New routine cells must *displace* one, forcing the
+  "does this find something the others can't?" question.
+- **Earn the slot:** a config/cell graduates deep → nightly only after the deep sweep actually
+  caught a distinct failure there (mirrors the project's own "tested edge cases earn confidence"
+  philosophy). Demote a cell that's been green for ~60 days and whose window is a subset of
+  another green cell's.
+- Prefer *one* oversubscribed level over a level sweep; prefer *attributable* single-kind cells
+  over `both`-only when you want to localise a flake.
+- **Trustworthiness invariant:** required = always-meaningful-red; nightly = triaged-amber-
+  tolerant; deep = noise-by-design. Don't retry-mask the required tier (a retry that hides a
+  1-in-20 real race is exactly the silent-loss class this tool exists to prevent).
+
+---
+
+## 9. Open decisions for Ben (what to pick before Phase 2 plans the build)
+
+1. **Nightly aggressiveness:** ~6 cells, cron daily vs weekly? (rec: ~6 cells, daily off-peak;
+   start smaller and grow by the earn-the-slot rule.)
+2. **Linux load mechanism:** adopt calibrated cgroup `cpu.max`/`io.max` throttling on the Linux
+   leg (reproducible, the right envelope-validation tool) vs keep the simple wrapper but
+   calibrate it by oversubscription ratio? (rec: cgroup on Linux for the envelope leg; keep a
+   ratio-calibrated `stress-ng`/spinner as the cross-platform race-jitter lane.)
+3. **`stress-ng` dependency:** add an install step (apt/brew) vs keep a pure bash spinner
+   (zero-dep, uncalibrated)? (rec: `stress-ng` where available + spinner fallback on Windows.)
+4. **Parametrization scope now:** Axis A (waiter count) only, or A+B+C? (rec: A first — highest
+   value, lowest flake risk — then B, then C.)
+5. **The envelope-tier switch** (`GCL_ENVELOPE_TIER=relax`): confirm this is how we implement the
+   D-c correctness/envelope split (a small test-harness change downgrading the 3 wall-clock
+   assertions to warnings under load). (rec: yes — it's the cleanest implementation of D-c.)
+6. **Nightly triage channel:** auto-file/track issues on nightly failure, tagged correctness vs
+   envelope? (rec: yes — otherwise scheduled-run reds are invisible.)
+
+These choices feed **Phase 2** (the implementation plan). This doc is a recommendation only —
+no code, no workflow changes, until you've decided.
+
+---
+
+## Appendix — provenance
+Synthesized from three parallel first-principles research passes (load fidelity & injection
+mechanisms; CI matrix on free public runners; existing-test parametrization), each grounded in
+`git-commit-lock.sh`/`.ps1`, the three suites, `tests/with-load.sh`, `.github/workflows/tests.yml`,
+and `docs/failure-modes.md`, and cross-checked against the code (one agent's claim that
+`tests/with-load.sh` was absent was verified false — it exists and is tracked). Pending: a
+foreign-model (Codex) review pass over the GitHub-Actions limit claims and the load-mechanism
+portability claims before this is treated as settled.

From aeba95c435e36f8729e2a160b672c2b698c50799 Mon Sep 17 00:00:00 2001
From: Ben Toner
Date: Wed, 17 Jun 2026 18:25:49 +1000
Subject: [PATCH 23/76] docs(load-testing): apply Codex factual review (cgroup probe-required; max-parallel/paths-ignore caveats; billing vs scarcity; FUSE/SIP hedges; reconcile non-Linux disk cells)

---
 docs/load-testing-strategy.md | 97 ++++++++++++++++++++++++-----------
 1 file changed, 66 insertions(+), 31 deletions(-)

diff --git a/docs/load-testing-strategy.md b/docs/load-testing-strategy.md
index c7ebd73..e26d68c 100644
--- a/docs/load-testing-strategy.md
+++ b/docs/load-testing-strategy.md
@@ -27,8 +27,9 @@ product/test code. Where it cites a fact about GitHub Actions limits, treat the
    genuinely dangerous windows are reachable *deterministically* only by the in-process
    function-interposition the suite already uses. Invest there first; external load is a
    secondary, probabilistic complement for the few windows it can actually move.
-3. **Three-tier CI:** a **Required** per-PR gate with **no artificial load** (so a red gate
-   always means a real correctness bug); a **Nightly** non-blocking tier that adds calibrated
+3. **Three-tier CI:** a **Required** per-PR gate with **no artificial load** (so a red gate is
+   never a stress-manufactured wall-clock flake — it's actionable); a **Nightly** non-blocking
+   tier that adds calibrated
    load × kind and the parametrization sweeps, with wall-clock assertions relaxed to warnings;
    and an on-demand **Deep sweep** (the current stress design) for the 50-clean hunt.
 4. **Fix the injection: calibrate, target, record.** Express load as an *oversubscription
@@ -122,12 +123,19 @@ create/write/delete loops): it is a *reasonable background-jitter generator* and
 **Recommendations:**
 
 - **Express load as an oversubscription ratio `R = stressors / nproc`** (e.g. R ∈ {0, 1, 2}),
-  not an absolute hog count, so a level is runner-independent.
+  not an absolute hog count, so a level is runner-independent. Note `R` is **per kind**: the
+  current wrapper's `GCL_STRESS_LOAD=N` spawns N hogs per selected kind, so `both` doubles total
+  hogs — define and cap `R_total`, and record cpu- and disk-stressor counts separately.
 - **Prefer calibrated mechanisms:** `stress-ng --cpu $((R*nproc)) --cpu-load … --metrics`
-  (defined, measurable) over bare spinners; on **Linux**, prefer **cgroup throttling**
-  (`systemd-run --user --scope -p CPUQuota=…` / `io.max`) which gives *deterministic,
-  reproducible* latency — the right tool for **envelope validation** (a 10% CPU quota means the
-  same everywhere; "8 hogs" does not).
+  (defined, measurable) over bare spinners. On **Linux**, calibrated **CPU** throttling is the
+  cleanest *envelope-validation* tool — `sudo systemd-run --scope -p CPUQuota=10%` gives a
+  runner-independent quota (a 10% quota means the same everywhere; "8 hogs" does not). **Treat
+  this as a probe-required Linux-only option, not a turnkey fact:** it needs cgroup v2 +
+  controller delegation + a usable systemd manager on the GitHub `ubuntu-24.04` runner, so gate
+  it behind a CI capability probe with the `stress-ng`/ratio path as the fallback. **IO** cgroup
+  throttling is *experimental* here — it is not a simple `systemd-run -p io.max`; systemd
+  exposes it as `IOReadBandwidthMax=`/`IOWriteBandwidthMax=` with device/path caveats — so don't
+  rely on it until proven on the runner.
 - **Record a per-run `load-manifest`** artifact next to the suite logs: `{kind, R, nproc,
   achieved-slowdown, tool versions, runner os/arch, git sha}`, uploaded on *success too* (you
   need the negatives to interpret the positives). Optionally probe achieved slowdown with a
@@ -139,17 +147,23 @@ create/write/delete loops): it is a *reasonable background-jitter generator* and
 
 ## 4. Embrace platform asymmetry (don't build a uniform injection layer)
 
-The platforms diverge too much for a "uniform" load layer (cgroups & FUSE are Linux-only;
-macOS SIP blocks `DYLD_INSERT_LIBRARIES` on system binaries; Windows has neither). Don't fight
-it — structure around it and **record which regime ran per leg**:
+The platforms diverge too much for a "uniform" *calibrated/targeted* load layer (cgroup
+throttling and FUSE fault-injection filesystems are Linux-only for this CI plan; `strace`
+inject is Linux-only; `DYLD_INSERT_LIBRARIES` injection is unreliable on macOS for
+SIP-protected Apple/system binaries like `mv`/`git` — possible only for non-protected helper
+binaries). Don't fight it — structure around it and **record which regime ran per leg**:
 
 - **Deterministic steering** — *everywhere* (portable bash; pwsh equivalent). The real
   race-coverage tool.
-- **Calibrated latency** (cgroup `cpu.max`/`io.max`; optionally `strace -e inject` to slow one
-  syscall in one process; a FUSE fsync-delay shim only if window W7 is prioritized) — **Linux
-  leg only**.
-- **CPU oversubscription** (`stress-ng` or the bash-spinner fallback) — the **macOS/Windows**
-  fallback, uncalibrated; document the asymmetry.
+- **Calibrated / targeted latency** (cgroup CPU quota; optionally `strace -e inject` to slow one
+  syscall in one process; a FUSE fsync-delay shim — charybdefs-style — only if window W7 is
+  prioritized) — **Linux leg only** (probe-gated, per §3).
+- **Uncalibrated oversubscription — the macOS/Windows fallback.** Both **CPU** (`stress-ng` or
+  the bash-spinner fallback) **and the simple disk-churn hog** (the current
+  `dd`/create-write-fsync-delete wrapper) run cross-platform; they are *low-fidelity and
+  uncalibrated* but real metadata-op pressure, which is why the Tier-N macOS/Windows `disk`
+  cells (§5) use them. Document the asymmetry: calibrated latency only on Linux; everywhere else
+  it's blunt oversubscription.
 
 Low-yield, **avoid:** memory/swap pressure (trivial allocation surface; risks OOM-killing the
 harness), raw disk-bandwidth saturation (doesn't touch metadata-op latency), de-prioritizing
@@ -173,7 +187,9 @@ test split (D-c).
 
 This is exactly today's matrix **minus the stress env**. Running it at **`none` load** means it
 only ever asserts Tier-1 correctness — it *cannot* flake on a Tier-2 wall-clock bound, so **a
-red required check always means a real bug.** Target < ~8 min. (Also: flip the concurrency group
+red required check is never stress-manufactured envelope noise.** It's always actionable — a
+real bug, or at worst runner-image/action-download/infra drift (which is also worth knowing) —
+never a "load was too high" false alarm. Target < ~8 min. (Also: flip the concurrency group
 back to `${{ github.workflow }}-${{ github.ref }}` + `cancel-in-progress: true` — the current
 per-run-unique group is a *deep-sweep* setting, which is exactly why the stress branch is marked
 "do NOT merge to main.")
@@ -239,20 +255,34 @@ matrix × sweep), never multiply everything on every PR.
 
 ## 7. GitHub Actions realities (the real constraints — confirm against current docs)
 
-- **Minutes are free on public repos, but concurrency is the real ceiling.** Free/public
-  accounts cap concurrent jobs on the order of ~20 (with a much smaller macOS sub-limit). A
-  matrix past that **queues** (serialises into waves), it doesn't fail. Design any single
-  triggered workflow to ≤ ~15–20 jobs to run in one wave; the deep sweep intentionally exceeds
-  this and accepts waves.
-- **Runner scarcity ≠ billing:** even free, **macOS runners are scarce/slow (~10× cost-weight),
-  windows ~2×, ubuntu 1×.** Be stingy with macOS cells, liberal with ubuntu.
-- **`strategy.matrix`:** `fail-fast: false` (keep — an OS-specific failure is the signal);
-  `max-parallel` on nightly/deep so a big sweep doesn't starve the required gate of runners;
-  256-job hard cap per workflow (irrelevant at our scale).
+- **Minutes are free on public repos; concurrency is the real ceiling.** Free-plan accounts cap
+  concurrent jobs at **20 total, with a 5-job macOS sub-limit** (confirm against GitHub's
+  current limits page). A matrix past that **queues** (serialises into waves), it doesn't fail.
+  Design any single triggered workflow to ≤ ~15–20 jobs to run in one wave; the deep sweep
+  intentionally exceeds this and accepts waves.
+- **Cost-weight is separate from queue scarcity (don't conflate).** On a public repo standard
+  runners are *free* — the per-minute rates don't consume credits or set queue priority. They do
+  signal relative runner *cost/scarcity*: roughly Linux 1×, **Windows ~1.7×** ($0.010 vs
+  $0.006/min), **macOS ~10×** ($0.062/min). The real constraint on macOS is the **5-job
+  sub-limit** above, plus it being the slowest pool. → keep macOS cells **sparse**, ubuntu
+  liberal.
+- **`strategy.matrix`:** `fail-fast: false` (keep — an OS-specific failure is the signal).
+  **`max-parallel` only limits parallelism *within a single matrix run*** — it does **not**
+  reserve capacity across separate workflow runs or the deep sweep's many `workflow_dispatch`
+  invocations. To stop a sweep starving the required gate, **bound the deep/nightly tiers with a
+  workflow-level `concurrency` group (and cap the dispatcher width)**, not `max-parallel` alone.
+  256-job hard cap per workflow run (irrelevant at our scale).
 - **Triggers:** required on `pull_request` + `push: main`; nightly on `schedule` (cron,
   off-peak minute) + `workflow_dispatch`; deep on `workflow_dispatch` only — heavy load never
-  sits in a PR's critical path. Keep `paths-ignore` (`**.md`, `.plans/**`) on required.
-  (Note: `schedule` triggers are auto-disabled after ~60 days of repo inactivity.)
+  sits in a PR's critical path. (Note: `schedule` triggers are auto-disabled after ~60 days of
+  repo inactivity.)
+- **`paths-ignore` gotcha on a *required* check.** A workflow skipped by path filtering leaves
+  its checks **Pending**, which *blocks merge* if those checks are required. So **don't** put
+  `paths-ignore` on the workflow whose jobs are the required checks and expect doc-only PRs to
+  merge. Instead either (a) keep the required workflow always-running with a tiny always-green
+  job and path-filter only the expensive test jobs, or (b) make a separate cheap job the
+  required check. (Doc-only-skip is still worth doing — just not on the required-check workflow
+  naively.)
 - **Artifacts:** keep the existing `upload-artifact` (with `include-hidden-files` for the
   `.git/`-buried lock logs); name uniquely per (os, leg, kind, level) so parallel cells don't
   collide.
@@ -306,6 +336,11 @@ Synthesized from three parallel first-principles research passes (load fidelity
 mechanisms; CI matrix on free public runners; existing-test parametrization), each grounded in
 `git-commit-lock.sh`/`.ps1`, the three suites, `tests/with-load.sh`, `.github/workflows/tests.yml`,
 and `docs/failure-modes.md`, and cross-checked against the code (one agent's claim that
-`tests/with-load.sh` was absent was verified false — it exists and is tracked). Pending: a
-foreign-model (Codex) review pass over the GitHub-Actions limit claims and the load-mechanism
-portability claims before this is treated as settled.
+`tests/with-load.sh` was absent was verified false — it exists and is tracked). A foreign-model
+(Codex, web-grounded) review has been applied: it confirmed the §2 window→load reachability
+table against the code and the core GitHub-Actions facts (20-total / 5-macOS free-plan
+concurrency, 256-job matrix cap, 60-day schedule auto-disable, `cancel-in-progress`, `stress-ng`
+availability), and its corrections are folded in — the cgroup mechanism is now marked
+**probe-required** (CPU quota only; IO throttling experimental), the `max-parallel` and
+`paths-ignore`-on-required caveats added, billing-weight separated from queue-scarcity, and the
+FUSE/SIP claims hedged.

From 8ba634177214304fb4dcba6c27befdd68a6b7af0 Mon Sep 17 00:00:00 2001
From: Ben Toner
Date: Wed, 17 Jun 2026 18:54:30 +1000
Subject: =?UTF-8?q?Plan:=20=C2=A79=20accepted;=20add=20Bucke?=
 =?UTF-8?q?t=207=20(steering=20coverage=20/=20Phase=201c)=20+=20Bucket=208?=
 =?UTF-8?q?=20(harness=20ergonomics=20research)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

---
 ...-ci-stress-guarantees-and-coverage-plan.md | 49 ++++++++++++++++++-
 1 file changed, 48 insertions(+), 1 deletion(-)

diff --git a/.plans/2026-06-17-ci-stress-guarantees-and-coverage-plan.md b/.plans/2026-06-17-ci-stress-guarantees-and-coverage-plan.md
index 0bf4445..757f601 100644
--- a/.plans/2026-06-17-ci-stress-guarantees-and-coverage-plan.md
+++ b/.plans/2026-06-17-ci-stress-guarantees-and-coverage-plan.md
@@ -102,7 +102,46 @@ first-principles rethink** — explicitly **not anchored on the existing approac
   more surface — without adding flakiness. Which tests benefit most.
 - **Considered, not maximalist:** principles for choosing the matrix + a routine cadence.
 Output: `docs/load-testing-strategy.md` (recommendation). Runs EARLY (Phase 1b) because it
-shapes Buckets 2 & 4 and the Phase-2 plan.
+shapes Buckets 2 & 4 and the Phase-2 plan. **§9 open decisions: all accepted by Ben (2026-06-17)
+with the doc's recommendations** — daily ~6-cell nightly (start smaller, grow by earn-the-slot);
+Linux cgroup CPU quota (probe-gated) for the envelope leg + ratio-calibrated stress-ng/spinner
+as the cross-platform race-jitter lane; stress-ng with a Windows spinner fallback;
+parametrization Axis A (waiter count) first; `GCL_ENVELOPE_TIER=relax` as the D-c
+correctness/envelope-split implementation; nightly issue auto-triage (correctness vs envelope).
+
+### Bucket 7 — Complete deterministic-steering coverage (Ben raised 2026-06-17)
+The load-strategy doc establishes deterministic STEERING (in-process function-interposition) —
+not external load — as the primary lever for the protocol's race-critical windows, and "more
+steered scenarios" as the #1 coverage investment. We have **not** scoped what *complete*
+steering coverage requires.
+- **Audit (Phase 1c):** enumerate every window/branch/residual across acquire / steal / hold /
+  release and map each to its deterministic-steering test or a GAP. Inputs: `failure-modes.md`,
+  the load-strategy §2 reachability table, the earlier F2 audit. Known gaps already: residual-
+  1/2/3 (claimant parked between recheck / touch and rename), and the F2-audit #7/#8 (wrong-type
+  appearing at the lock path mid-steal — A2/G2; Windows blocked-unlink legs). Add a **mechanical
+  branch-coverage pass (kcov for bash, on the Linux CI leg)** to find never-executed lines
+  objectively, as an input to the manual window audit.
+- **Output:** a coverage gap-list doc that scopes the steering-test work.
+- **Fill (Phase 3):** write the missing steered tests, bundled with Bucket 2.
+
+### Bucket 8 — Test-harness ergonomics (research done 2026-06-17; small, zero-dep)
+A subagent researched "big bash files vs alternatives." Verdict: **keep the plain-bash, zero-dep,
+custom-harness, steering-friendly design** — do NOT adopt bats-core (its forced `set -e` fights
+the suite's deliberate `set -uo` + exit-code assertions; its Windows/MINGW path quirks add risk
+on this project's most fragile axis) or shunit2 (lateral move, weaker Windows story). But the
+*monolith* (not the harness) costs a single-test selector + machine-readable reporting.
+Recommended incremental, **zero-dependency** additions, priority order:
+  1. **TAP output** from `ok`/`bad` + a `1..N` plan line (~15 lines) — machine-readable CI
+     reporting AND closes the silent-undercount gap (an early `exit`/crash currently drops every
+     later assertion from the count, total still prints "passed").
+  2. **A single-test selector** (`GCL_TEST_ONLY=`) — the biggest day-to-day pain (today
+     you run all 36 unit tests to iterate on one, on the slowest leg).
+  3. **Extract the duplicated helpers** into `tests/_harness.sh` (ok/bad/backdate/clone_fn/
+     wait — copy-pasted verbatim across all three files).
+  4. (Optional) split the two large files by concern; leave the integration suite whole (its
+     cross-test repo-state audit is an intentional dependency).
+Fold into the Phase-2 plan / Phase-3 build; items 1–2 are an afternoon and pay off every
+iteration (esp. given the local-test ban → faster CI triage from machine-readable output).
 
 ## Workflow (settled: spec → plan → implement → review)
 
@@ -121,6 +160,14 @@ review (Claude + Codex). **Recommendation only — NO implementation.** → Ben
 recommendations feed Phase 2. Runs early because it shapes Buckets 2 & 4.
 (1a and 1b are independent and can run in parallel.)
 
+**Phase 1c — Complete-steering-coverage audit (Bucket 7 / Ben raised 2026-06-17).**
+Systematically map every protocol window/branch/residual to its deterministic-steering test or
+a GAP, plus a mechanical kcov branch-coverage pass to find never-executed lines objectively;
+output a coverage gap-list doc. Analytical (read-only), parallel to 1a/1b; its gap-list is a
+major input to the Phase-2 test plan (steering is the #1 race-coverage lever per the
+load-strategy doc). **Audit now; gap-filling is Phase 3.** → Ben gate. (Also folds in the
+Bucket-8 harness-ergonomics items, which the new tests will want.)
+
 **Phase 2 — Plan.** Concrete implementation plan for Buckets 2-4, incorporating Ben's chosen
 load/matrix recommendations: per-test injection method (tmpfs / `ulimit` / chmod) + platform
 guard + CI wiring; the matrix/parametrization to adopt; exact doc edits; the

From e1f31a5e1fa54ed508105597c86f66a9332163ef Mon Sep 17 00:00:00 2001
From: Ben Toner
Date: Wed, 17 Jun 2026 19:57:36 +1000
Subject: [PATCH 25/76] Phase 1a + 1c: add guarantees contract + steering-coverage gap-list
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Phase 1a (Bucket 1 / D-a) — docs/guarantees.md: the normative contract
derived from failure-modes.md. Operating envelope (E1-E7), guarantees
(safety G-S1..5, recovery G-R1..4, interop G-I1), failure semantics,
best-effort tier (BE-1..5), out-of-scope (OOS-1..7 incl. the non-unwinding-
exit no-silent-loss boundary), operating rules, and a verification map.

Phase 1c (Bucket 7) — docs/steering-coverage.md: the deterministic-steering
coverage audit + prioritized gap list, synthesized from two manual window
audits and an objective kcov pass (83.1% line coverage, 451/543; ~30 lines
platform-gated, ~62 Linux-reachable). kcov corrected three manual over-
credits (step-3.3 CLAIM-ABORT, the foreign claim-recheck branch, the EXIT-
trap no-hold twin). Gaps ranked: Tier A portable steering (A1 rename-refused
wrong-type-mid-steal = headline; A2 step-3.3 abort lane; A4 the exec/H4
boundary), Tier B fault-injection (failure-modes 4.5), Tier C platform-only,
Tier D document-not-test.

Both reviewed to convergence: a fresh-context Claude reviewer plus two Codex
rounds. The foreign Codex check plus a 4-line empirical test corrected the
exec-bypass characterization across all three docs: "run -- bash -c 'exec'"
does NOT skip release (the child shell is replaced, the wrapper releases
normally); only an exec in the lock-holding shell itself (a sourced
lock_acquire+exec, or "run -- exec") bypasses. Propagated to guarantees.md
OOS-5, steering-coverage.md A4, and the failure-modes.md H4 precision fix in
this commit.

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 docs/failure-modes.md     |  13 +-
 docs/guarantees.md        | 408 ++++++++++++++++++++++++++++++++++++++
 docs/steering-coverage.md | 283 ++++++++++++++++++++++++++
 3 files changed, 700 insertions(+), 4 deletions(-)
 create mode 100644 docs/guarantees.md
 create mode 100644 docs/steering-coverage.md

diff --git a/docs/failure-modes.md b/docs/failure-modes.md
index 0332055..a187c15 100644
--- a/docs/failure-modes.md
+++ b/docs/failure-modes.md
@@ -570,10 +570,15 @@ false success — relies on the wrapper *reaching its release path*. The bypass
 is any termination or replacement of the holding process that skips that unwind;
 crucially it is **not** triggered by a normal `exit`. The instances:
 - **External SIGKILL** — untrappable; no handler runs in either port.
-- **bash `exec` in the wrapped command** — `run` executes `"$@"` *in the wrapper
-  shell itself* (`git-commit-lock.sh:1733`), so an `exec` replaces that shell's
-  process image and *neither* the trailing `lock_release` *nor* the `EXIT` trap
-  (`git-commit-lock.sh:1002-1013`, armed at `:1308`) runs.
+- **bash `exec` that replaces the lock-holding shell** — `run` executes `"$@"`
+  *in the wrapper shell itself* (`git-commit-lock.sh:1733`), so the bypass needs the
+  exec to run in *that* shell: the wrapped command *is* an exec (`run -- exec …`),
+  or a **sourced** caller does `lock_acquire; exec …` in its own shell. Then the
+  exec replaces that shell's process image and *neither* the trailing `lock_release`
+  *nor* the `EXIT` trap (`git-commit-lock.sh:1002-1013`, armed at `:1308`) runs. An
+  exec **nested in a child** — the ordinary `run -- bash -c 'exec …'` — does **not**
+  bypass (the child is replaced; the wrapper waits and releases normally). *Verified
+  empirically 2026-06-17.*
 - **PowerShell `[Environment]::Exit(n)`** — a CLR hard-exit that bypasses
   `Lock-Release`, the `finally`, *and* the `PowerShell.Exiting` backstop
   (`git-commit-lock.ps1:221-245`).
diff --git a/docs/guarantees.md b/docs/guarantees.md
new file mode 100644
index 0000000..dca88b1
--- /dev/null
+++ b/docs/guarantees.md
@@ -0,0 +1,408 @@
+# git-commit-lock: guarantees and scope (the normative contract)
+
+**Status: normative.** This document states *what the tool guarantees*, *under
+what conditions* (the operating envelope), and *what is explicitly out of
+scope*. It is the contract a user or a CI gate can point at: a behavior listed
+under [Guarantees](#2-guarantees) is a property the code must uphold and the
+tests defend; a behavior under [Out of scope](#5-out-of-scope-not-guaranteed) is
+one the tool deliberately does not promise.
+
+**How this relates to the other two docs.** This is the *contract*;
+[`failure-modes.md`](failure-modes.md) is the *analysis* behind it (per-mode
+current behavior, tier classification, and the scope decisions that produced
+this contract); [`git-commit-lock.md`](git-commit-lock.md) is the *design
+reference* (why the protocol is shaped this way and how it works). Where they
+appear to disagree, the **code and tests are authoritative**, then this contract,
+then the analysis, then the design narrative. Each guarantee below cites its
+witnessing test(s) and the failure-modes section that justifies it; the
+[Verification map](#7-verification-map) collects those pointers.
+
+This contract makes **no new claims** about behavior — it is a re-statement of
+the decisions recorded in `failure-modes.md` §4 as commitments. It does not
+re-derive the protocol (see the design doc) or re-argue the tiers (see the
+analysis).
+
+---
+
+## 1. The operating envelope
+
+Every guarantee in §2 holds **within this envelope**. Outside it, the tool
+degrades as described in §4 (best-effort) or §5 (out of scope) — in most cases
+*detectably and without corruption*, but the strict guarantees are not promised.
+The envelope is not a disclaimer bolted on; it is the precise set of assumptions
+the filesystem-lease design rests on.
+
+**E1 — Single host, single time source.** All contenders share one working tree,
+hence one machine, hence one clock. Staleness is `age = now − mtime` arithmetic
+(`git-commit-lock.sh:928,1409`); it assumes the mtime and the comparing process's
+`now` come from the *same* clock. Single-host use satisfies this. A *local* clock
+jump remains correctness-safe (it degrades to the detected-98 lane, never a
+silent double-commit; see G-S1 and `failure-modes.md` §E2). Multi-host use over a
+shared FS does not satisfy it and is out of scope (§5, OOS-2).
+
+**E2 — Local filesystem with atomic create/rename and sane mtimes.** The protocol
+is built from three filesystem operations — atomic create-or-fail (`O_EXCL` /
+`FileMode.CreateNew`), atomic rename-over, and unlink — each atomic on local
+POSIX filesystems and NTFS (ext4, APFS, NTFS, and kin). (The one exception is the
+Windows PowerShell 5.1 steal, which lacks the atomic 3-arg move and uses a
+claim-guarded unlink-then-move — a fairness loss, never a clobber; see BE-5.)
+Network and sync-backed storage (NFS, SMB/CIFS, 9p, Dropbox/OneDrive) weaken
+exactly these operations and are out of scope (§5, OOS-1;
+`git-commit-lock.md:122-126`).
+
+**E3 — Cooperative wrapper unwind.** The theft-detection guarantee (G-S1) fires
+when the lock-holding shell *reaches its release path* — on normal return, on a
+handled INT/TERM, or on a plain `exit` (all of which unwind). It is **not**
+triggered by a termination or replacement that skips the unwind: an external
+SIGKILL, an `exec` that replaces the lock-holding shell itself, or PowerShell
+`[Environment]::Exit()`. (An `exec` nested in a child — the ordinary
+`run -- bash -c 'exec …'` — does *not* skip release.) See §5, OOS-5 for the
+precise boundary.
+
+**E4 — Commits fast relative to the staleness window (for *strict* exclusion).**
+The lease is fail-open: a hold older than `AGENT_LOCK_STALE_SECS` (default 300s)
+can be stolen mid-work. *Strict* mutual exclusion (G-S3) is therefore guaranteed
+only for holds that complete within the staleness window. A hold that overruns it
+is still *safe* — a displaced holder is detected (G-S1) — but two processes can
+briefly both believe they hold the lock. Keep commits well inside the window, or
+raise `AGENT_LOCK_STALE_SECS` for a deliberately slow hold (the golden rule,
+`git-commit-lock.md:433-458`).
+
+**E5 — Matching protocol version on all parties.** Prevention of the
+crash-recovery-under-contention race (G-S3's no-displacement property) holds only
+when every contender runs the claim protocol. A mixed-version tree degrades
+prevention to detection and is out of scope (§5, OOS-3).
+
+**E6 — Supported platforms.** `git-commit-lock.sh` (bash) is supported on Linux,
+macOS, and Windows under Git-for-Windows' MINGW bash. `git-commit-lock.ps1`
+(PowerShell) is supported on **Windows only**. Running the `.ps1` port on POSIX is
+a CI-only cross-implementation protocol check, not a supported configuration (§5,
+OOS-4; `README.md:91-95`).
+
+**E7 — Cooperating, non-hostile agents.** The lock is advisory: it serializes
+*cooperating* agents. It detects interference where it can (token checks; exit 98)
+but cannot prevent a process running as the same user from deleting or
+overwriting the lock file. The threat model is honest agents racing each other,
+not an actively hostile local process (§5, OOS-6;
+`git-commit-lock.md:520-528`).
+
+---
+
+## 2. Guarantees
+
+Each guarantee holds **within the envelope (§1)**. The defaults named are knobs
+(`AGENT_LOCK_*`); the guarantee is in terms of the configured value, not a fixed
+number of seconds.
+
+### 2A. Safety (unconditional within the envelope)
+
+These are correctness properties. If one can break inside the envelope, that is a
+bug.
+
+- **G-S1 — No silent lost update.** A holder whose lease is taken from it never
+  reports a serialized critical section that wasn't. On release, a **definitive**
+  theft (the lock file is gone, or carries a foreign token) returns **98** with a
+  loud WARNING rather than success (`git-commit-lock.sh:1607-1688`;
+  `git-commit-lock.ps1:1717-1837`); a state the release cannot disambiguate (the
+  file is present but reads **empty** after the retry ladder — possibly a successor
+  mid-create after a boundary steal) returns the distinct **unverifiable** code
+  (`lock_release` 2; `run` maps it to 1 when the command itself succeeded, else
+  keeps the command's code) — still **never** a silent success. *Condition:* the
+  wrapper unwinds cooperatively (E3). *Witness:* unit Test 4b (98 + WARNING), Test
+  16 (unverifiable lane), interop Test 8 (98 both directions) (`U:387-417`,
+  `I:460-492`). *Basis:* `failure-modes.md` §1, §B5.
+
+- **G-S2 — No corruption and no false hold.** An acquirer that cannot prove its
+  own token is at the lock path (after the read-back retry ladder) treats the lock
+  as **not** acquired and logs loudly; it never "repairs" a failed read-back by
+  rewriting the path (`git-commit-lock.sh:1352-1361`). Every path that cannot
+  establish a fact fails toward "wait", never toward "steal" or "hold". This
+  extends to resource-exhaustion lanes: a create that fails (ENOSPC, FD/inode
+  exhaustion, an unwritable lock dir) **never produces a false hold or corruption**
+  — it falls through to wait/97 (an empty orphan ages into the recovery lane). The
+  guarantee is *no false hold*, not a uniformly clean 97: a torn write shorter than
+  `tok.` is a non-lock-shaped residual, never stolen, that needs manual removal
+  (`failure-modes.md` §F1 — an accepted residual). *Witness:* the read-back-failure lanes —
+  create-path Test 32, steal-path Test 32b (`U:1760-1855`); resource lanes —
+  coverage planned (Bucket 2 / `failure-modes.md` §4.5). *Basis:* §1, §A1, §F.
+
+- **G-S3 — Strict mutual exclusion within the staleness window, with no
+  displacement during crash recovery.** Within `AGENT_LOCK_STALE_SECS` no steal
+  occurs at all, so at most one process holds the lock. When a holder dies and a
+  herd of waiters recovers the one stale lock, the **claim protocol** admits
+  exactly one stealer and the recovering waiter keeps the lock it recovered — a
+  straggler whose stale judgement predates the recovery cannot displace it
+  (`git-commit-lock.sh:1070-1218`). At most one process is ever the *legitimate*
+  holder. (On the supported Windows PowerShell 5.1 unlink-then-move lane the
+  recovering waiter can *lose* the recovered path to a rival's create in the
+  transient absent window — a fairness loss, never a clobber; see BE-5.)
+  *Condition:* holds complete within the window (E4); a stable clock (E1) — a local
+  clock jump preserves *no silent loss* (G-S1) but can break *strict exclusion* by
+  making a live lock look stale (a premature, but detected, steal); and matching
+  version (E5). *Witness:* unit Tests 1/2b/20, interop Tests 1/6/16/16b, integration suite
+  (`U:166-195,212-346,1095-1128`; `I:227-261,341-386,884-1088`). *Basis:*
+  §A1/§A2/§A3.
+ +- **G-S4 — Never destroys a non-lock-shaped object.** A directory, symlink, FIFO, + device, socket, or a regular file whose line 1 is neither empty nor `tok.`- + prefixed is **never** stolen or deleted, at either the lock path or the claim + path (`git-commit-lock.sh:1322-1327,1411-1444,1458-1487,1518-1570`). The + never-steal *safety* is unconditional; the *warning* is best-effort — it normally + fires once and names the object, but an **actively-rewritten** user file may never + age into the content guard and then times out at 97 *without* the warning + (`git-commit-lock.sh:308`). Deletion is + never recursive; the tool only ever removes its own named lock-protocol files. + *Two accepted residuals* bound this and are documented, not bugs: a stale + *empty* user file, and a stale file whose line 1 happens to start `tok.`, are + stolen (`git-commit-lock.sh:298-311`). *Witness:* unit Tests 17/17d/18/22 + (`U:818-892,894-1032,1034-1076,1156-1262`). *Basis:* §D3/§D4/§G1. *Scoped + exception:* ps1-on-POSIX has no .NET type probe for FIFO/device/socket (§5, + OOS-4). + +- **G-S5 — Truthful exit codes.** The three reserved high codes from `run` are + exact: **96** = usage error (command **not** run), **97** = acquisition timed + out (command **not** run), **98** = lock stolen mid-hold (command **ran but was + not serialized** — redo it) (`git-commit-lock.sh:392-415`). A `run` exit of the + command's own code (including 0) means the command was serialized — *subject to + the one carve-out in OOS-5* (a non-unwinding exit returning 0 while displaced). + *Two stated assumptions* keep the high-code contract exact: the wrapped command + must not itself exit 96/97/98 (such an exit is indistinguishable from a tool + verdict, `git-commit-lock.sh:392`), and an **unverifiable** release maps a + *successful* command to **1** (G-S1), so 0 is never reported over an unverifiable + hold. 
*Witness:* Test 7 (96), Test 8 (97), Test 4b (98), Test 5 (propagation), + Test 16 (unverifiable→1), interop `run` verdict tests. *Basis:* §1, §H4. + +### 2B. Recovery (within the FS/clock/tooling envelope) + +These hold given a readable clock (E1) and lock-shaped state; latency is +best-effort (§4). + +- **G-R1 — Lock-shaped orphans are reclaimed.** A crashed holder's stale lock, an + orphaned or empty claim, and an empty crash-orphan (a crash between create and + content write) all eventually become stealable and are recovered, bounded by + `STALE` (+ `CLAIM_STALE` if a claimant also crashed) plus poll cadence + (`git-commit-lock.sh:1408-1446,1228-1267`). This does **not** extend to *foreign* + objects (G-S4) — those wait for an operator. *Witness:* unit Tests 2/3/21 + (`U:197-210,348-361,1130-1154`). *Basis:* §B1/§C1/§C2/§C3. + +- **G-R2 — One stuck agent cannot wedge the fleet.** Because the lock is a lease + and the claim is itself leased, a hung-but-alive holder or claimant is recovered + within its window; the fleet does not deadlock behind it. *Witness:* the stale- + steal and crashed-claimant lanes above. *Basis:* §1, `git-commit-lock.md:60-82` + (the explicit reason for a lease over a kernel lock). + +- **G-R3 — No busy-spin; bounded wait.** A waiter on a genuinely squatted or + delete-blocked lock gives up at `MAX_WAIT` and never busy-spins past it; the + failed-steal lane logs in a damped, bounded way (`I:746-817`). *Witness:* interop + Test 14b. *Basis:* §K(4). + +- **G-R4 — No process leaves an *unowned* lock behind.** Per-attempt tokens make + the ownership-discovery read conclusive, so no process inside an + acquire/hold/release arc can install a lock nobody owns and walk away: it either + discovers it holds, or the lock is recovered by staleness, and in no case is a + steal-installed lock mistaken for owned by the wrong process + (`git-commit-lock.sh:138-157` + the leaked-token memory). 
The one bounded + residual — an untrappably-killed claimant's claim installed as an unowned lock — + stalls waiters ≤ one stale window with **no false success** (accepted; §B3). + *Witness:* unit Tests 31/35/36 (`U:1549-1758,2013-2164`). *Basis:* §C4. + +### 2C. Interoperation + +- **G-I1 — bash and PowerShell take the same lock.** One on-disk wire format + (`tok.`-prefixed line 1, owner line 2), one read-retry ladder + (8 attempts, 20/40/80/160/320/320/320 ms — byte-identical between ports), one + set of release verdicts, one config grammar. A `.sh` holder and a `.ps1` holder + in one tree serialize against each other and steal each other's genuinely stale + locks. *Condition:* Windows for the supported ps1 config (E6). *Witness:* the + interop suite throughout (`I:*`). *Basis:* §I1. + +--- + +## 3. Failure semantics (the shape of every degradation) + +When the tool cannot uphold a property it fails in one of these bounded, +documented ways — **never** silently: + +- **Detect, don't pretend** — a displaced holder returns 98 + WARNING (G-S1). +- **Wait, don't guess** — an unprovable state routes to poll/wait → 97, never to + a steal or a hold (G-S2). +- **Refuse, don't destroy** — a non-lock-shaped object is left in place (and + normally warned about — the warning is best-effort, see G-S4); waiters reach 97. +- **Announce, don't hide** — a broken staleness clock (unreadable mtime) warns + loudly once and disables stealing (fails safe; §4, BE-3). + +**Within the operating envelope**, the only place a *correctness* degradation can +be silent — a non-unwinding exit returning 0 while displaced — is carved out +explicitly in OOS-5. Two silences fall *outside* that scope and are disclosed +separately: a degradation **outside** the envelope (a network/sync FS silently +losing exclusion, OOS-1), and a **non-correctness** loss (a swallowed log write, +BE-4). Logging is best-effort by design; correctness is not. + +--- + +## 4.
Best-effort (within the envelope, not a hard guarantee) + +These hold under normal conditions and degrade *gracefully and detectably* under +pathological scheduling or host-health failures. **Correctness (§2) is preserved +throughout; only liveness/latency degrades.** This tier is the reference Bucket 4 +scopes the suite's wall-clock test assertions against (the strict/envelope test +split, `failure-modes.md` §4.1 / D-c). + +- **BE-1 — Wall-clock latency bounds are in poll-count, not seconds.** Recovery + latency (≈ `STALE` + poll cadence), the `MAX_WAIT` timeout, and the ~1.26s + read-retry ladder all *stretch* under CPU oversubscription or a slow FS while + still completing. The guarantee is "bounded by the configured knobs in + poll-count," not "exactly N seconds." Tests asserting a specific wall-clock or + poll-count number (Test 21's ≤20s, Test 22a's warning timing, Test 29's ≥2-CLAIM + count) assert an *envelope* bound, not a correctness bound, and may be relaxed or + gated to a defined load level (`GCL_ENVELOPE_TIER=relax`) without any product + change. *Basis:* `failure-modes.md` §K, §4.1. + +- **BE-2 — Diagnostic warnings are best-effort.** The wrong-type config warning + and the claim-path warning rely on poll headroom that an oversubscribed runner + can starve; the guarantee is that the *condition is handled safely*, not that a + specific warning fires within a specific time. *Basis:* §K(2), §D3. + +- **BE-3 — Recovery presumes a readable clock; an unreadable mtime fails safe.** + If the lock's mtime cannot be read at all, both ports retry three times, then + warn loudly once per process and treat the lock as **not** stale (the mtime floor + fails closed to "fresh"): no premature steal, no corruption — but recovery of a + genuinely crashed holder is *disabled* and waiters block to `MAX_WAIT` (97). + Safety is preserved; recovery is lost and announced. *Coverage planned* (Bucket + 2 / §4.5). *Basis:* §E3. 
+ +- **BE-4 — Logging is best-effort and never blocks the lock.** Every log write + ends `|| true`; a failed or unwritable log write is swallowed and the lock works + unaffected (the log self-truncates past ~1 MB). *Coverage planned* (Bucket 2 / + §4.5, the F2/J1 test). *Basis:* §F2/§J1. + +- **BE-5 — The PowerShell 5.1 steal is claim-guarded, not atomic.** Windows + PowerShell 5.1 lacks the 3-arg `File.Move` overload, so its steal is + unlink-then-move with a transient absent window. Under the claim this is a + *fairness loss* (a rival's create can win the recovered path; the claimant backs + off cleanly), **never a clobber**. *Basis:* §D1, `git-commit-lock.md:471-476`. + +--- + +## 5. Out of scope (not guaranteed) + +The tool deliberately does not promise the following. Where it can, it still fails +*safely and detectably*; the point of listing them is that the strict guarantees +of §2 are **not** claimed here. + +- **OOS-1 — Network / shared / sync-backed filesystems.** NFS, SMB/CIFS, 9p, + Dropbox/OneDrive. These weaken the atomic create/rename the protocol rests on, so + exclusion may silently not hold. Documented boundary only — surfaced in the + README; **no** FS-type probe is built (decision: `failure-modes.md` §4 item 3). + *Basis:* §E1. + +- **OOS-2 — Multi-host use / clock skew across hosts.** Rides on OOS-1 (only arises + on a shared FS). A *local* clock jump on the single host is **in scope and + correctness-safe** (degrades to the detected-98 lane). *Basis:* §E2. + +- **OOS-3 — Mixed-version trees.** If contenders run different protocol versions, + the no-displacement prevention (G-S3) degrades to detection (98), and old-style + stealers can leave `.dead.*` litter. Never silent, but the prevention property is + not guaranteed. Deployment rule: **upgrade both implementations together** + (`git-commit-lock.md:251-256`; to be surfaced in the README too — Bucket 3). + *Basis:* §I2. 
+ +- **OOS-4 — PowerShell port on POSIX.** Supported on Windows only; on POSIX it runs + solely as a cross-implementation protocol check. Its one residual there + (FIFO/device/socket stat as empty and take the empty-orphan lane, capping damage + at the one misconfigured inode) is accepted and documented. *Basis:* §D3. + +- **OOS-5 — A non-unwinding exit returning 0 while displaced (the no-silent-loss + boundary).** G-S1's detection requires the *lock-holding shell* to reach release + (E3). If a *displaced* holder is terminated or replaced **without unwinding** — + external SIGKILL, an `exec` that replaces the **lock-holding shell itself**, or + PowerShell `[Environment]::Exit()` — *and* the resulting process exits **0**, the + caller can see success with no 98. The `exec` case is **narrower than it looks** + (verified empirically): `lock_run` runs the wrapped command vector in the wrapper + shell (`git-commit-lock.sh:1733`), so the bypass needs the exec to run in *that* + shell — a **sourced** caller doing `lock_acquire; exec …` in its own shell, or + the contrived `run -- exec …` where the wrapped command *is* an exec. An exec + **nested in a child** — the normal `run -- bash -c 'exec …'` — does **not** + bypass: the child is replaced, the wrapper waits and releases normally. A **plain + `exit` is safe** (it unwinds). What keeps the whole class narrow: an external + SIGKILL yields a non-zero wait status (POSIX `128+9`), so a caller checking exit + codes does not see success; the hole needs a process that *deliberately* replaces + or hard-exits the lock-holding shell **and** returns 0 **while displaced**. The + *next* holder still recovers via staleness; only the abruptly-exiting one is + unwarned. No code change closes this without the handle-based ops the design + rejected. *Witness (boundary exercised indirectly):* interop Test 5 (`I:308-334`, + ps1 `[Environment]::Exit()`); the bash `exec` lane is a coverage gap + (`steering-coverage.md` A4). *Basis:* §H4. 
+ +- **OOS-6 — Adversarial / hostile local processes.** The lock is advisory. Against + a process actively trying to break it (deleting/overwriting the lock file, or a + hostile repo redirecting the git dir), the tool *detects* interference where it + can but does not prevent it; damage from a redirected git dir is bounded to the + tool's own named files with non-recursive deletion. *Basis:* + `git-commit-lock.md:520-551`. + +- **OOS-7 — Non-issues, explicitly.** A case-insensitive FS path collision (the + lock and claim paths never collide under case folding; two case-differing + configured paths resolving to one file is *correct* shared-lock behavior) and + memory exhaustion (the scripts allocate trivially). No action. *Basis:* §D5/§F5. + +### Things deliberately NOT built (and why) + +The design considered and rejected each of these; they are not roadmap items +(`failure-modes.md` §4 "Things explicitly NOT to do"): + +- A **background heartbeat** to refresh the lease — would make the tool more than a + single synchronous script; the fail-open-but-detectable lease is the deliberate + alternative. +- A **two-rename compare-and-swap** to prevent the B3 residual — reintroduces crash + litter and a sweep, for a failure that is already bounded and false-success-free. +- **`File.Replace`** in the ps1 port — throws on a read-only destination and has + partial-failure states (pinned out by interop Test 16d). +- **Supporting network/shared filesystems** — correctness rests on local-FS atomic + create/rename; this is a boundary to document, not to engineer around. + +--- + +## 6. Staying inside the envelope (operating rules) + +- **Hold the lock only to commit.** Decide what to stage, build any patch, and + resolve failures *outside* the lock; a normal stage+commit holds it for seconds + (the golden rule, `git-commit-lock.md:433-458`). This keeps holds inside the + staleness window (E4) so G-S3 applies. 
+- **For a deliberately slow hold, raise `AGENT_LOCK_STALE_SECS`** for that + invocation rather than risking a fail-open steal. +- **Keep the lock on a local filesystem** (the default `/commit.lock` + almost always is) so E2 holds. +- **Upgrade both implementations together** (E5) so G-S3's prevention holds. +- **Never `git stash` in a shared checkout** — it rewrites the working tree and + clobbers other agents' edits (orthogonal to the lock, but part of operating in a + shared tree). + +--- + +## 7. Verification map + +Each guarantee → its witnessing test(s) and the failure-modes section. `U` = +`tests/git-commit-lock.test.sh`, `I` = `tests/git-commit-lock.interop.test.sh`, +`integ` = `tests/git-commit-lock.integration.test.sh`. "Coverage planned" marks a +guarantee that is currently reasoned-correct-but-untested and slated for a +fault-injection test under Bucket 2 (`failure-modes.md` §4.5, Ben's override to +add coverage); the *guarantee* is made now, the *test* lands in Phase 3. + +| Guarantee | Witness | failure-modes § | +|---|---|---| +| G-S1 no silent lost update | U Test 4b + Test 16 (unverifiable lane); I Test 8 (both dirs) | §1, §B5 | +| G-S2 no corruption / no false hold | U Tests 32/32b (read-back failure); **resource lanes: coverage planned** (F1/F3/F4) | §1, §A1, §F | +| G-S3 strict exclusion in window + no displacement | U Tests 1/2b/20; I Tests 1/6/16/16b; integ | §A1/§A2/§A3 | +| G-S4 never destroys non-lock-shaped | U Tests 17/17d/18/22 | §D3/§D4/§G1 | +| G-S5 truthful exit codes | U Tests 7/8/4b/5/16; I run-verdict tests | §1, §H4 | +| G-R1 lock-shaped orphans reclaimed | U Tests 2/3/21 | §B1/§C1/§C2/§C3 | +| G-R2 one stuck agent can't wedge | stale-steal + crashed-claimant lanes | §1 | +| G-R3 no busy-spin; bounded wait | I Test 14b | §K(4) | +| G-R4 no unowned lock left behind | U Tests 31/35/36 | §C4 | +| G-I1 bash⇄pwsh same lock | I suite throughout | §I1 | +| BE-3 unreadable mtime fails safe | **coverage planned** (E3) | §E3 | +| BE-4 
logging best-effort | **coverage planned** (F2/J1) | §F2/§J1 | + +The "coverage planned" rows are exactly the lanes Phase 1c (the steering-coverage +audit) and Bucket 2 (the new fault-injection tests) exist to close. diff --git a/docs/steering-coverage.md b/docs/steering-coverage.md new file mode 100644 index 0000000..dd98461 --- /dev/null +++ b/docs/steering-coverage.md @@ -0,0 +1,283 @@ +# Deterministic-steering coverage: audit and gap list + +**Status: analysis / work-scoping.** This document maps the protocol's +race-critical windows and branches to their deterministic-steering tests (or +gaps), and scopes the test work that closes the gaps. It is the output of Phase +1c of the [guarantees-and-coverage plan](../.plans/2026-06-17-ci-stress-guarantees-and-coverage-plan.md) +(Bucket 7). Gap-*filling* is Phase 3 (bundled with the Bucket 2 fault-injection +tests); this doc decides *what* to fill and *how*. + +**Why steering, not load.** As [`load-testing-strategy.md`](load-testing-strategy.md) +establishes, the protocol's correctness rests on structural properties (O_EXCL +create + atomic rename + per-attempt tokens), so the primary coverage lever is +**in-process function interposition** — the test suite's `clone_fn` mechanism +shadows internal `_lock_*` functions (and `mv`/`rm`/`touch`) to force an exact +interleaving deterministically. External load only *probabilistically* widens +the same windows. This audit therefore measures *steering* coverage, with an +objective `kcov` line-coverage pass as a cross-check. + +--- + +## 1. Method and headline numbers + +Three independent inputs, reconciled below: + +1. **Manual window audit — acquire + steal paths.** Every branch/residual mapped + to its steering test or a gap. +2. **Manual window audit — hold + release + discovery + staleness/mtime paths.** +3. 
**`kcov` objective line coverage** (the mechanical cross-check) — built from + source (kcov v43; no apt package / prebuilt binary exists) and run on the unit + suite at FULL fan-out under WSL Ubuntu-24.04. Artifacts (gitignored): + `.agent-testing/kcov/` (`cobertura.xml`, merged unit+integration, line-by-line + HTML). Repro commands in [§5](#5-kcov-reproduction). + +**kcov result: 83.1% line coverage — 451 / 543 instrumented lines; 92 never +executed.** (kcov does not do real branch coverage on bash — its branch numbers +are trivially 1.0 and must be ignored.) The integration suite added **zero** lines +over the unit suite, so the unit suite is the comprehensive measurement. + +Of the 92 uncovered lines: + +- **~30 are platform-gated and *correctly* unreachable on Linux** — ~23 in the + Windows no-delete-share handle lanes (an open handle blocking `unlink`/`rename`, + which never happens on POSIX), plus 3 in the macOS/BSD `mv` fallback. These are + covered on the **Windows** CI leg (interop Tests 13/31d/33c) and would need a + **macOS/BSD** leg for the `mv` fallback. They are **not** Linux gaps. The + practical Linux line-coverage ceiling is therefore ~94% ((543−30)/543), not + 100%. +- **~62 are Linux-reachable** — the real targets, prioritized in [§3](#3-the-gap-list-prioritized). 
+ +**The cross-check earned its place.** kcov objectively corrected **three +over-credits** in the manual audit — branches the manual reasoning inferred were +covered, but which `kcov` shows were never executed: + +| Branch | Manual audit said | kcov (objective) | Reconciled | +|---|---|---|---| +| step-3.3 pre-rename CLAIM-ABORT block (`:1151-1160`) | covered via the step-2 / `deletion-gone` matrix positions | **hits=0** | **GAP** — the step-2 twin is steered, the near-identical step-3.3 twin is not | +| `foreign` claim-recheck branch (`:1103-1106`) | covered via Test 33b + the matrix | **hits=0** | **GAP** — only the `gone` recheck leg is steered | +| EXIT-trap no-hold arc-end (`:1009,1017-1018`) | transitively covered | **hits=0** | **GAP** — only the *signal* (TERM) no-hold twin is steered, not the EXIT-while-waiting one | + +This is the value of a mechanical pass over correlated manual reasoning: trust the +instance, verify the output against the tool. Where this doc and a manual claim +disagree, **kcov's `hits=0` wins**. + +(Line numbers below are anchors against the current `ci-stress` tree and may drift +a few lines; the manual audits re-located everything and found the +failure-modes.md anchors had moved ~9 lines.) + +--- + +## 2. What is already well covered (for confidence) + +The audit confirms the protocol's *delicate* paths are strongly steered, so the +gaps are at the edges, not the core: + +- **The two read-back "twins"** are each independently steered with opposite + claim-token gates: the create-path "I twin" (`acquire verification FAILED`, + `:1354-1361`) by **Test 32**, and the steal-path "F2 twin" (`steal rename + completed but read-back`, `:1171-1179`) by **Test 32b**. 
+- **The discovery rule** — the ownership-discovery read on every non-rename exit — + by **Test 25**'s 7-position matrix (`step2-fresh`, `recheck-gone`, `touch-gone`, + `lock-gone`, `contested`, `deletion-gone`, `source-gone`), each steering a rival + install to an exact protocol point. +- **The two discovery routes** (direct `_lock_discover` vs the per-poll + leaked-token-memory check) each independently steered (Test 25 vs Test 31b), + with Test 31a deliberately accepting *either* route on the genuine scheduling + race between them. +- **The claim re-verify / touch / lease-reset lane** (Tests 23/24/26/27), the + leaked-claim family (Tests 31/35/36), the never-steal guards for dir/symlink/FIFO + at both lock and claim paths (Tests 17/22), and the trap-time claim cleanup + (Test 33). + +--- + +## 3. The gap list, prioritized + +Each gap: location, what it is, how to steer it, and a priority. "Portable +interposition" = a `clone_fn`/shadow test that runs on every OS (the cheapest, +most valuable kind). "Fault injection" = needs a real resource/IO failure. "Platform" += only reachable / only meaningful on a specific OS leg. + +### Tier A — Portable deterministic steering (do these first; no fault injection) + +These are new `clone_fn`/shadow tests in the unit suite, runnable on every leg. + +- **A1 — `CLAIM-ABORT (rename-refused)`: wrong-type object at the lock path + mid-steal** (`:1195-1202`). *Headline gap.* The only acquire/steal **verdict** + branch with no steering test, and it has its own log string. (This is the + F2-audit #7 lane; the strategy doc's §2 reachability table missed it.) *Steer:* + `clone_fn _lock_verify_stale` (or shadow `mv`) to `mkdir` a directory onto the + lock path immediately before the rename; assert `rename-refused` + claim deleted + + discovery + no false hold. **Highest value.** + +- **A2 — step-3.3 pre-rename CLAIM-ABORT block** (`:1151-1160`; kcov-corrected + over-credit). 
The `gone`/`wrongtype`/`fresh` reason map + claim-delete + + discovery + `return 1`, near-identical to the step-2 block but separately + reachable. *Steer:* a `_lock_verify_stale` shadow with a call-counter that flips + to not-stale on the **second** call (step-3.3), the first call (step-2) passing. + **High value** (a whole unexercised abort lane). + +- **A3 — `foreign` claim-recheck branch** (`:1103-1106`; kcov-corrected + over-credit). A clearer removed our claim and a rival re-claimed → leave it, + discovery read, back off. *Steer:* shadow the claim read at recheck to return a + foreign token. **Medium-high.** + +- **A4 — `exec`-bypass of release / the §H4 no-silent-loss boundary** (`lock_run` + runs the wrapped command vector in the wrapper shell, `:1733`). No test exercises + the bash bypass; the ps1 `[Environment]::Exit()` twin *is* (interop Test 5). + **Empirically verified (2026-06-17):** the bypass needs the exec to run in the + **lock-holding shell itself** — `run -- exec true` (the wrapped command *is* an + exec), or a sourced `lock_acquire; exec true` — **not** `run -- bash -c 'exec + true'`, which execs a *child* and lets the wrapper release normally (so that + recipe would silently pass without testing anything). *Steer, two parts:* (a) + benign — `run -- exec true` (or sourced `lock_acquire; exec …`) and assert no + `RELEASED` line / lock left held; (b) the silent-loss — backdate the lease + park + a contender so the holder is *displaced*, then exec a 0-exit and assert the caller + sees 0 with **no** 98 (pinning [`guarantees.md`](guarantees.md) OOS-5). **High + value** — the one interleaving that can silently lose an update. *Note:* this + corrected the original audit recipe, which used the non-bypassing `bash -c 'exec'` + form — a foreign-model (Codex) review + a 4-line empirical check caught it; the + manual audit and a same-model reviewer both had it wrong. 
+ +- **A5 — forward clock-jump → premature steal of a live lock** (§E2; age = now − + mtime, `:928,1409`). Code-safe (degrades to the detected-98 lane) but untested. + *Steer:* `clone_fn _lock_now` to return now+offset on the poll while the real + holder's mtime stays current, forcing age ≥ STALE on a live lock; assert the + victim's release hits 98 (a clock-driven analogue of Test 4b). **Medium.** + +- **A6 — mtime-unreadable fail-safe** (§E3; `:639-645` warn, `:912-926` consume). + Only a *negative* assertion exists (the warning must NOT fire under normal + contention, Test 1). *Steer:* `clone_fn` the mtime helper (`_lock_path_mtime` / + the `stat` shadow) to return empty on a present file; assert the warn-once fires, + no steal occurs, and a waiter reaches 97. **Medium** (it is the clean reason + recovery is Tier-1-*within-envelope*, so worth pinning). + +- **A7 — malformed/unreadable content classification tails** (the `_lock_verify_stale` + tail `:940-949`; the in-acquire steal content guard `:1429-1443`; the + `_lock_claim_stale_check` content tail `:1240-1249`). The `tok.`-prefixed and + empty-orphan lanes are covered; the **non-empty-blank-line-1** (`#18`), + **unreadable-content steal-skip** (`#17`), and **vanished-mid-check** sibling + branches are not. *Steer:* fabricate a line-1-whitespace file and a + read-fault shadow; backdate; assert no-steal + the right warning. **Low-medium, + cheap** (several branches per small test). + +- **A8 — socket & device-node wrong-type arms** (`:1474-1475` claim path, + `:1561-1562` lock path; kcov-new). The dir/symlink/FIFO arms are tested; the + socket (`-S`) and device (`-b/-c`) arms are not. *Steer:* bind a unix socket / + reference a device node (`/dev/null`) at the path; assert refusal. **Low, cheap** + (sibling arms of a tested guard; both creatable on Linux). + +- **A9 — log rotation past 1 MB** (`:558-559`; kcov-new). *Steer:* pre-write a + >1 MB log, trigger a log call, assert truncate-restart. 
**Low, trivial** (no + fault injection). + +- **A10 — EXIT-trap no-hold arc-end** (`:1009,1017-1018`; kcov-corrected + over-credit). EXIT while *waiting* without a hold or in-flight claim. *Steer:* a + sourced `lock_acquire` that exits while still blocked; assert the no-hold + cleanup/restore path runs. **Low.** + +- **A11 — `mv -T` fallback forced on** (`:969,976-977`). Naturally hit only on + BSD/macOS, but **made Linux-steerable** by forcing `_LOCK_MVT=0` (or shadowing + the probe's `mv -T` to fail) in a sourced steering shell, then running a steal — + and a steal-into-a-directory to hit the `[ -d ]` guard (dovetails with A1). + **Low-medium** (closes a real engine lane on the common leg instead of waiting + for a BSD runner). + +### Tier B — Fault injection (real resource/IO failures; mostly POSIX-only) + +These are the [`failure-modes.md`](failure-modes.md) §4.5 lanes (Ben's override to +add coverage) plus the read-fault siblings. They need a real failure, not +interposition; guard by platform and **flag any that can't be injected portably +rather than shipping a flake** (per the §4.5 decision). + +- **B1 — Unwritable lock dir/parent → clean 97** (F4). `chmod` the dir. + POSIX; the cheapest and highest-value fault-injection test. **High.** +- **B2 — Unwritable/failing log path → lock still works, log swallowed** (F2/J1). + Use a bad log path, or `chmod` it as in B1. POSIX. **Medium-high.** +- **B3 — ENOSPC during claim/lock create+write** (F1; the create write-fail branch + `#5` and the read-fault lanes `:848,871-873`). Small dedicated tmpfs/quota. + Linux-friendliest; flag if not portable. **Medium.** +- **B4 — FD exhaustion via `ulimit -n`** (F3). Portable POSIX; inode exhaustion + only if cleanly injectable. **Medium.** + +### Tier C — Platform-only (verify off-Linux; not a Linux gap) + +- **C1 — Windows no-delete-share handle lanes** (~23 lines: `:881-890,993, + 1639-1647,1700-1712`). Already covered by interop Tests 13/31d/33c on the Windows + CI leg.
*Action:* confirm the Windows leg's coverage exercises them (it does by + construction); no Linux work. A kcov equivalent on Windows is + impractical; rely on the explicit interop tests. +- **C2 — macOS/BSD `mv` fallback real path** (`:969,976-977`). A11 makes this + Linux-steerable by forcing the probe off; a *genuine* BSD `mv` exercise needs a + macOS leg. *Action:* prefer A11 (portable) and treat a macOS leg as optional + per the load-strategy matrix. + +### Tier D — Bounded residuals: document, don't test + +Low-value, bounded, detected, or self-healing; the manual audits rate these +not worth a dedicated test. *Action:* ensure each is named in the code header / +`guarantees.md` as an accepted residual; fold into a broader test opportunistically +if cheap, but do not build bespoke tests. + +- **D1 — residual-1** (verify→rename: our rename clobbers a freshly-created rival + lock → victim detects 98). Detection is covered structurally; the specific + interleaving is bounded + detected. +- **D2 — residual-3** (claimant suspended between touch and rename installs an + aged-mtime lock). Bounded shortfall, self-healing; the *positive* lease-reset is + covered (Test 26). +- **D3 — leaked-resolve rare arc-end legs** (`:755-758,1260-1262`) and the + release boundary-re-read in isolation (`R2`). Reachable only with a non-empty + leaked set; transitively exercised. + +--- + +## 4. Scoping summary for Phase 2 + +- **Tier A (11 tests, portable interposition)** is the bulk of the value and the + bulk of the work — all runnable on every CI leg, no fault-injection fragility. + A1, A2, A4 are the high-value three (a real verdict branch, a whole unexercised + abort lane, and the single silent-loss boundary). Bundle these into the unit + suite alongside the Bucket-2 work. +- **Tier B (4 tests, fault injection)** is the failure-modes §4.5 set; platform-gate + them and flag any non-portable lane in the Phase-2 plan rather than shipping a + flake.
+- **Tier C** is verification on the Windows leg (already covered) + an optional + macOS leg; **Tier D** is documentation, not tests. +- **Expected effect:** closing Tier A + the Linux-injectable parts of Tier B should + take Linux line coverage from 83.1% toward the ~94% platform ceiling; the + remaining ~6% is the Windows/BSD platform-gated lanes covered on their own legs. +- **Harness ergonomics (Bucket 8)** pay off here: a `GCL_TEST_ONLY=<regex>` + selector and TAP output make iterating on ~15 new steered tests far cheaper — + schedule them before/with the test build. + +--- + +## 5. kcov reproduction + +For re-running the objective coverage measurement (per the reproducible-experiments +principle). All from Git Bash; `MSYS_NO_PATHCONV=1` stops Git Bash mangling a +leading `/tmp` arg into a Windows path before WSL sees it. + +```bash +# Build kcov v43 (no apt package; upstream ships no prebuilt binary): +wsl.exe -d Ubuntu-24.04 -e bash -c 'sudo apt-get install -y cmake libdw-dev libelf-dev \ + binutils-dev libcurl4-openssl-dev zlib1g-dev libiberty-dev' +wsl.exe -d Ubuntu-24.04 -e bash -c ' + cd /tmp && curl -fsSL https://github.com/SimonKagstrom/kcov/archive/refs/tags/v43.tar.gz \ + | tar xz && mkdir kcov-build && cd kcov-build && cmake ../kcov-43 && make -j"$(nproc)"' + +# Run the unit suite under kcov (FULL fan-out) and list never-executed lines: +MSYS_NO_PATHCONV=1 wsl.exe -d Ubuntu-24.04 -e bash -c ' + cd /mnt/c/agent_data/commit-lock/worktrees/ci-stress && + GCL_TEST_FULL=1 /tmp/kcov-build/src/kcov --include-path=git-commit-lock.sh \ + /tmp/gcl-cov tests/git-commit-lock.test.sh' +MSYS_NO_PATHCONV=1 wsl.exe -d Ubuntu-24.04 -e bash -c ' + F=/tmp/gcl-cov/git-commit-lock.test.sh.*/cobertura.xml; + grep -oE "<line number=\"[0-9]+\" hits=\"[0-9]+\"" $F | + sed -E "s/.*number=\"([0-9]+)\" hits=\"([0-9]+)\".*/\1 \2/" | + awk "\$2==0 {print \$1}" | sort -n' +``` + +When the kcov pass becomes a permanent CI leg (Phase 3 / Bucket 7), it runs on the +Linux runner against the unit suite at FULL, and the
platform-gated ~30 lines (§1) +are expected-uncovered there by design. From b504f873308645a10b7b31f7cf4394f68730407f Mon Sep 17 00:00:00 2001 From: Ben Toner Date: Wed, 17 Jun 2026 20:21:46 +1000 Subject: [PATCH 26/76] Phase 2 plan: implementation plan for Buckets 2/3/4/6/8 MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Concrete build plan derived from the Phase 1 outputs (guarantees.md, steering-coverage.md) + the accepted failure-modes §4 and load-strategy §9 decisions. Sections: - 2A Tier-A steering tests (A1-A11; per-test mechanism / assertion / platform / priority — the audit gave each steering technique). - 2B Tier-B fault-injection (F4 + F2/J1 first cut; F1 gated-or-doc, F3 doc-only) — each injection empirically prototyped; refines the original D-b (F3 was not deterministically injectable; ulimit -f is a SIGXFSZ trap). - 3 doc edits (exact text: envelope + single-clock in the design doc; network-FS boundary + upgrade-both in the README). - 4 GCL_ENVELOPE_TIER=relax mechanism + the 3 downgrade sites (D-c). - 6 three-tier CI (Required / Nightly / Deep), event-conditional concurrency (keeps the deep-sweep group off the required gate), kcov coverage job, nightly auto-triage, paths-ignore-on-required fix, and a refined do-not-merge disposition (with-load.sh graduates calibrated). - 8 TAP + 1..N + the silent-undercount sentinel fix, GCL_TEST_ONLY selector (integration excluded by design), tests/_harness.sh extraction. Bucket 2B/6/8 designs feasibility-validated by parallel agents (prototypes in .agent-testing/, gitignored). Awaiting Ben's Phase 2 gate. 
Co-Authored-By: Claude Opus 4.8 (1M context) --- .../2026-06-17-ci-stress-phase2-build-plan.md | 396 ++++++++++++++++++ 1 file changed, 396 insertions(+) create mode 100644 .plans/2026-06-17-ci-stress-phase2-build-plan.md diff --git a/.plans/2026-06-17-ci-stress-phase2-build-plan.md b/.plans/2026-06-17-ci-stress-phase2-build-plan.md new file mode 100644 index 0000000..a7c2edf --- /dev/null +++ b/.plans/2026-06-17-ci-stress-phase2-build-plan.md @@ -0,0 +1,396 @@ +# Phase 2 plan: implement the guarantees-and-coverage build (Buckets 2/3/4/6/8) + +Status: **PROPOSAL — Phase 2 of the [guarantees-and-coverage +plan](2026-06-17-ci-stress-guarantees-and-coverage-plan.md).** Awaiting Ben's +gate. No implementation (Phase 3) until approved. + +## What this plans +The concrete build that follows from the (committed, queued) Phase 1 outputs: +- `docs/guarantees.md` — the normative contract (Phase 1a). +- `docs/steering-coverage.md` — the prioritized steering-coverage gap list (Phase 1c). +- `docs/failure-modes.md` §4 — the accepted scope decisions (incl. Ben's §4.5 + override to add fault-injection coverage). +- `docs/load-testing-strategy.md` §9 — accepted load/matrix recommendations. + +It turns those into: new tests (Bucket 2 — the Tier-A steering + Tier-B +fault-injection gaps), documentation edits (Bucket 3), the correctness/envelope +test split (Bucket 4 / D-c, via `GCL_ENVELOPE_TIER=relax`), the CI matrix wiring +(Bucket 6), and harness ergonomics (Bucket 8). **Verification is CI-first** (the +new tests run across the matrix); local runs are allowed but the box lags under +heavy fan-out. + +Each section gives per-item designs concrete enough for Phase 3 to implement +directly. Three sections (Bucket 2 Tier-B, Bucket 6, Bucket 8) are being +feasibility-validated by parallel design agents and are integrated below. + +--- + +## Bucket 2A — Tier-A steering tests (portable, deterministic; the bulk of the value) + +From `steering-coverage.md` §3 Tier A. 
All are new `clone_fn`/shadow tests in +`tests/git-commit-lock.test.sh` (unit suite), runnable on every CI leg — no +fault-injection fragility. The audit already established each steering technique; +line anchors are current-tree and may drift (re-locate at build). + +| ID | Gap (location) | Steering mechanism | Asserts | Platform | Priority | +|---|---|---|---|---|---| +| **A1** | `CLAIM-ABORT (rename-refused)` — wrong-type object at the lock path mid-steal (`:1195-1202`) | `clone_fn _lock_verify_stale` (or shadow `mv`) to `mkdir` a directory onto `$AGENT_LOCK_PATH` immediately before the rename | `CLAIM-ABORT (rename-refused)` + "non-file at the lock path" log; claim deleted; discovery read; **no false hold**; ghost handled | all | **HIGH** — the only acquire/steal *verdict* branch with no test; its own log string | +| **A2** | step-3.3 pre-rename CLAIM-ABORT block (`:1151-1160`; kcov hits=0) | `_lock_verify_stale` shadow with a **call-counter**: pass on call 1 (step-2), flip to `not stale` (gone/wrongtype/fresh) on call 2 (step-3.3) | the step-3.3 abort reason-map fires; claim-delete + discovery + `return 1`; no false hold | all | **HIGH** — a whole unexercised abort lane | +| **A3** | `foreign` claim-recheck branch (`:1103-1106`; kcov hits=0) | shadow the claim read at recheck to return a *foreign* token (a clearer removed our claim, a rival re-claimed) | leave the foreign claim; discovery read; back off; no 98-on-mere-claim | all | MED-HIGH | +| **A4** | `exec`-bypass / §H4 no-silent-loss boundary (`lock_run` runs `"$@"` in the wrapper shell, `:1733`) | **(corrected, verified empirically)** the exec must run in the lock-holding shell: `run -- exec true` or sourced `lock_acquire; exec true` — **NOT** `run -- bash -c 'exec true'` (that execs a child, releases normally) | (a) benign: no `RELEASED` line / lock left held; (b) displaced (backdated lease + parked contender) + exec 0 → caller sees 0 with **no** 98 — pins `guarantees.md` OOS-5 | all (bash) | **HIGH** — 
the one silent-loss boundary | +| **A5** | forward clock-jump → premature steal of a live lock (§E2; `:928,1409`) | `clone_fn _lock_now` to return now+offset on the poll while the live holder's mtime stays current | the live lock is judged stale and stolen; the victim's release hits **98** (clock-driven analogue of Test 4b) | all | MED | +| **A6** | mtime-unreadable fail-safe (§E3; `:639-645`, consumed `:912-926`) | `clone_fn` the mtime helper (`_lock_path_mtime` / its `stat` shadow) to return empty on a *present* file | warn-once "Staleness detection is BROKEN"; **no steal**; waiter → 97; (closes BE-3's "coverage planned") | all (bash; + ps1 parity if feasible) | MED | +| **A7** | malformed/unreadable content classification tails (`_lock_verify_stale` `:940-949`; in-acquire steal guard `:1429-1443`; claim-stale-check `:1240-1249`) | fabricate a line-1-whitespace file (non-empty blank line 1 = `#18`); shadow a read-fault (`#17`) | no steal; the right `not a lock/claim file` / `unreadable` warning; covers several sibling branches per test | all | LOW-MED (cheap, multi-branch) | +| **A8** | socket & device-node wrong-type arms (`:1474-1475` claim, `:1561-1562` lock; kcov-new) | bind a unix socket / reference a device node (`/dev/null`) at the path | refusal (never stolen/deleted); the `-S`/`-b`/`-c` arms execute | POSIX | LOW (cheap; sibling of tested guard) | +| **A9** | log rotation past 1 MB (`:558-559`; kcov-new) | pre-write a >1 MB `$AGENT_LOCK_LOG`, trigger a log call | truncate-restart (log shrinks; lock unaffected) | all | LOW (trivial, no injection) | +| **A10** | EXIT-trap no-hold arc-end (`:1009,1017-1018`; kcov hits=0) | a sourced `lock_acquire` that `exit`s while still *waiting* (no hold, no in-flight claim) | the no-hold cleanup/restore path runs (vs the TERM twin already tested) | all | LOW | +| **A11** | `mv -T` fallback forced on (`:969,976-977`) | pre-set `_LOCK_MVT=0` (or shadow the probe's `mv -T` to fail) in a sourced steering shell, then run a 
steal + a steal-into-a-directory | the BSD/macOS unlink+bare-`mv` lane + the `[ -d ]` last-instant guard execute on Linux/MINGW | all (forces the lane) | LOW-MED (closes an engine lane on the common leg) | + +**Sequencing:** A1/A2/A4 first (high value, real verdict/abort/silent-loss lanes); +A3/A5/A6 next; A7-A11 as a cheap batch. Each is a self-contained unit test using +the existing fabricate + backdate + `clone_fn` idioms. + +--- + +## Bucket 2B — Tier-B fault-injection tests (empirically feasibility-validated) + +Each injection was prototyped against the real `git-commit-lock.sh` (Git Bash + WSL). +The §4.5 discipline applies: **ship only lanes that inject portably/deterministically; +flag the rest rather than ship a flake.** This **refines the original D-b** (which had +F3 in the first cut) based on the feasibility results. + +| Lane | Injection | Asserts | Guard | Status | +|---|---|---|---|---| +| **F4 — unwritable lock dir → 97** | `chmod 0555` the lock dir; create fails O_EXCL every poll. Cap `MAX_WAIT=1-2`, `POLL=0.1`. | `rc==97`; command never ran (no marker); no lock created; log `WAITING` then `TIMEOUT after Ns` | **POSIX-only** (guard is **load-bearing**: `chmod 0555` is a *no-op for writes* on Git Bash/NTFS → would falsely pass rc=0; skip-with-note like Test 17's symlink branch) | **First cut.** Deterministic (5/5 rc=97 on WSL). The §F4 highest-value lane (most likely real misconfig). | +| **F2/J1 — failing log → lock works, write swallowed** | Point `AGENT_LOCK_LOG` at `<regular-file>/x.log` (a path beneath a regular file) so every append fails **ENOTDIR** (portable; no chmod/perms). | `rc==0`; command ran (marker); lock cleaned up (gone); log **not written** (`[ ! -s "$LOG" ]` / uncreated). Covers F2 **and** J1 in one test. | **Portable — no guard.** | **First cut.** Deterministic, both platforms. **Caveat:** bash's redirection-open failure leaks to stderr (the `||true` is on the write, not the open) — do **not** assert clean stderr, and do **not** `grep RELEASED "$LOG"` (nothing is written).
| +| **F1 — ENOSPC on create/write** | Real full FS only: `sudo mount -t tmpfs -o size=400k` + `dd` fill, point the lock there. | `rc==97`; command never ran; an **empty-orphan lock left behind** (create 0-byte, write failed — matches §F1) | **Linux-only AND needs root/sudo** | **Second cut — gated, or document-only.** Behavior validated end-to-end on WSL. **`ulimit -f 0` is a trap** — it raises SIGXFSZ (rc=153) killing the *wrapper*, not the create. **No portable injection.** | +| **F3 — FD / inode exhaustion** | (intended `ulimit -n` / small-inode FS) | (intended `rc==97`, create-fail→wait) | Linux-only; inode→root | **Document-only.** **Cannot inject deterministically:** the create uses **~1 FD**, so any `ulimit -n` low enough to fail *it* first starves bash's own startup (machine-/load-dependent harness corruption, not the lib's 97 lane). Inode exhaustion needs root. §F3 is already reasoned-correct (same shape as F1). | + +**D-b tier split (refined by feasibility):** +- **First cut (implement now):** F4 (POSIX-guarded) + F2/J1 (portable). Both deterministic, + single-shot (no fan-out), ~3-4 s total. These close the resource-lane coverage on every + leg with zero flake risk. +- **Second cut:** F1 — **recommend** a Linux-only test gated behind both `uname`==Linux + **and** a `sudo -n true` capability probe that **skips-with-note** when sudo is + unavailable (never fails the suite), with `sudo umount` in cleanup (GitHub `ubuntu-*` + runners have passwordless sudo). *Alternative:* document-only, since the behavior is + validated. *(Decision point for Ben — see Open decisions.)* +- **Document-only:** F3 (and F1 if Ben prefers zero root in the suite). Note the validated + behavior in `failure-modes.md` §F1/§F3 (the empty-orphan→97 path) rather than shipping a + flaky/non-portable test. 
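The F1 gating described in the second-cut bullet (Linux-only, plus a `sudo -n true` capability probe that skips with a note) could be sketched roughly as below. This is an illustration only, not the suite's code: the helper name `f1_capability` and its two override arguments are hypothetical, introduced so the decision can be exercised without real sudo.

```bash
# Hypothetical helper (not in the suite): decide whether the F1 (ENOSPC) test may run.
# Prints "run" or a skip reason. Always returns 0, so a missing capability
# can only ever skip-with-note and never fail the suite.
f1_capability() {
  local os="${1:-$(uname -s)}"           # overridable for illustration
  local sudo_probe="${2:-sudo -n true}"  # non-interactive sudo probe
  if [ "$os" != Linux ]; then
    echo "skip: F1 needs a Linux tmpfs (os=$os)"
  elif ! $sudo_probe >/dev/null 2>&1; then
    echo "skip: F1 needs passwordless sudo"
  else
    echo "run"
  fi
  return 0
}
```

A caller would branch on the printed verdict (for example `[ "$(f1_capability)" = run ]`) before mounting the tmpfs, and register the `sudo umount` cleanup only on the `run` path.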
+ +**Implementation notes (match existing idioms):** use the `LOCK`/`LOG`/`AGENT_LOCK_*` env +vocabulary and the `rc=$?; [ "$rc" = 97 ] && ok … || bad …` + `grep -q "TIMEOUT after"` +pattern; mirror Test 17's `2> "$WORK/tNN.err"` capture and skip-with-note. **F4 cleanup is +load-bearing:** a `chmod 0555` dir blocks `rm -rf` of its *contents* — keep that lock dir +**empty** (nothing is created in it) so the suite's `cleanup()` `rm -rf "$WORK"` succeeds. +**F2 assertion polarity** is inverted: assert the log was **not** written; the lock-success +signal is `rc==0` + the command's marker + lock-file-gone, not a log line. + +--- + +## Bucket 3 — Documentation edits (exact text) + +Small, concrete edits surfacing the boundaries the analysis decided to document. + +### C-envelope (§4.1) → `docs/git-commit-lock.md` +Add, near the staleness/clock discussion (after the "One caveat on the mtime +clock" block, ~`:283-293`), a short **operating-envelope** statement: +> **Correctness is load-independent; latency is not.** Exclusion, no-silent-loss, +> and eventual recovery rest on atomic create/rename + per-attempt tokens and hold +> under any load. The wall-clock bounds — recovery latency (≈ STALE + poll +> cadence), the `MAX_WAIT` timeout, and the ~1.3 s read-retry ladder — are +> best-effort and scale with scheduling: under CPU oversubscription or a slow FS +> they stretch, but the protocol still recovers and never loses an update. + +### C-clock (§4.2) → `docs/git-commit-lock.md` +One sentence in the same caveat block: +> The tool assumes a **single time source** — single-host use (the common case, +> all contenders share one checkout hence one clock), or a shared FS with one +> server clock. A local clock jump is correctness-safe: a forward jump can make a +> live lock look stale and be prematurely stolen, but that degrades to the +> detected exit-98 lane, never a silent double-commit. 
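As an aside for reviewers (not part of the proposed doc text), the forward-jump case in C-clock follows directly from the age arithmetic, age = now − mtime. The sketch below is a toy model under that assumption; `looks_stale` and the numbers are hypothetical and do not mirror the library's internals.

```bash
# Toy model of the staleness judgement: age = now - mtime, stale iff age >= STALE.
looks_stale() {  # $1 = now (epoch s), $2 = lock mtime (epoch s), $3 = STALE threshold (s)
  [ $(( $1 - $2 )) -ge "$3" ]
}

now=1000000; mtime=$((now - 5)); STALE=60   # holder refreshed its lock 5 s ago

# With a true clock the live lock is judged live (age 5 < 60).
looks_stale "$now" "$mtime" "$STALE" && echo "stale" || echo "live"

# After a 120 s forward jump on the observer, the same lock looks stale (age 125 >= 60).
looks_stale "$((now + 120))" "$mtime" "$STALE" && echo "stale" || echo "live"
```

The second verdict is the premature steal that the text says degrades to the detected exit-98 lane.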
+ +### C-netfs (§4.3) → `README.md` +The boundary is in the design doc (`git-commit-lock.md:122-126`) but not the +README, where operators look. Add to "How it works" (after the atomic-create +sentence, ~`README.md:57`): +> The protocol's correctness rests on these operations being atomic, which holds +> on local filesystems (ext4, APFS, NTFS, and kin) but **not** on network or +> sync-backed storage — NFS, SMB shares, Dropbox/OneDrive-synced directories — +> where exclusion may silently fail. Keep the repo (and so its `.git/`) on a local +> disk. (The default lock lives in `.git`, which almost always is.) + +### C-mixedver (§I2) → `README.md` +The "upgrade both together" rule is design-doc-only (`git-commit-lock.md:251-256`). +Add to the two-implementations section (~`README.md:82-95`): +> **Upgrade both implementations together.** Older releases stole with an +> unserialized move-aside instead of the claim protocol, so the +> no-displacement-during-recovery guarantee holds only when every party in a tree +> runs a current version; a mixed-version tree degrades that prevention to +> detection (exit 98) and can leave `.dead.*` files current versions don't clean. + +### C-misc (§4.6, optional) → `docs/git-commit-lock.md` +One line each (low priority): case-insensitive FS is a non-issue (the lock/claim +paths never collide under case folding); the mixed-version `.dead.*` litter note +cross-referenced. + +--- + +## Bucket 4 — Correctness/envelope test split (D-c; `GCL_ENVELOPE_TIER=relax`) + +D-c is implemented as a **tagged assertion downgrade**, not a physical file split +(a file split would duplicate Test 21/29's heavy `clone_fn` setup and break the +single-suite kcov measurement). 
Add an `ok`/`bad`-adjacent helper pair (in +`tests/_harness.sh` once Bucket 8 item 3 lands; inline in the unit suite until +then — same signature, so the later move is mechanical): + +```bash +ENVELOPE_TIER="${GCL_ENVELOPE_TIER:-strict}" # default strict; nightly/deep set relax +ENV_WARN=0 +ok_envelope() { echo "PASS[env]: $*"; PASS=$((PASS+1)); } +bad_envelope() { # the FAIL branch of a wall-clock/poll-count (Tier-2) assertion only + if [ "$ENVELOPE_TIER" = relax ]; then echo "WARN[env-relaxed]: $*"; ENV_WARN=$((ENV_WARN+1)) + else echo "FAIL: $*"; FAIL=$((FAIL+1)); fi +} +``` + +- **`ok`/`bad` = the strict-correctness tier** (always hard, both tiers); + **`ok_envelope`/`bad_envelope` = the latency/envelope tier** (hard in `strict`, + warn-only in `relax`). Exit code is driven by real `FAIL` only — `ENV_WARN` never + reds a run; the summary prints the `ENV_WARN` count so it's visible. +- **The three (and only three) downgraded call sites** — swap `ok`/`bad` → + `*_envelope` on the *wall-clock* assertion only; every neighbouring correctness + assertion (rc=97, no-steal, dir-untouched, STOLE-BY-CLAIM, …) **keeps `ok`/`bad`**: + - **Test 21** `:1144` — recovery latency `≤20s`. + - **Test 22a** `:1167` (warning fired — relies on two-poll-confirm headroom), + `:1170` (fired exactly once), and `:1168` (warning names the type — contingent + on the same starved warning). The never-steal / never-delete assertions stay strict. + - **Test 29** `:1531` — `≥2` CLAIM lines (poll-count). +- **Required CI sets `strict` (or leaves it unset)** — at zero artificial load the + three pass comfortably, so the gate behavior is unchanged; **nightly/deep set + `relax`** so an oversubscribed runner can't turn an envelope miss into a red. +- Anchors are current-tree; re-locate the three sites at build (each is the single + `-le 20` / warning-count / `-ge 2` line). 
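To make the call-site swap concrete, here is a minimal self-contained sketch of one correctness assertion sitting next to one envelope assertion. The helper bodies follow the pair above; `check_recovery` and its inputs are hypothetical stand-ins for a real test block such as Test 21, not the actual code.

```bash
PASS=0; FAIL=0; ENV_WARN=0
ENVELOPE_TIER="${GCL_ENVELOPE_TIER:-strict}"
ok()  { echo "PASS: $*"; PASS=$((PASS+1)); }
bad() { echo "FAIL: $*"; FAIL=$((FAIL+1)); }
ok_envelope()  { echo "PASS[env]: $*"; PASS=$((PASS+1)); }
bad_envelope() {
  if [ "$ENVELOPE_TIER" = relax ]; then echo "WARN[env-relaxed]: $*"; ENV_WARN=$((ENV_WARN+1))
  else echo "FAIL: $*"; FAIL=$((FAIL+1)); fi
}

# Hypothetical test body: $1 = observed rc, $2 = observed recovery seconds.
check_recovery() {
  # Correctness assertion: stays hard in both tiers.
  [ "$1" = 0 ] && ok "recovered (rc=0)" || bad "recovery failed (rc=$1)"
  # Wall-clock assertion: the only line that moves to the *_envelope pair.
  [ "$2" -le 20 ] && ok_envelope "recovery latency ${2}s <= 20s" \
                  || bad_envelope "recovery latency ${2}s > 20s"
}
```

Under `relax`, a 35 s recovery increments `ENV_WARN` and leaves `FAIL` at 0; under `strict` the same observation is a real `FAIL`.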
+ +--- + +## Bucket 6 — CI matrix wiring (the accepted load-strategy §9 decisions) + +**Two-workflow structure:** keep `tests.yml` for **Tier R (required)** + **Tier D +(deep dispatch)**; add a new `nightly.yml` for **Tier N (nightly)** + the kcov job + +triage. Rationale: the nightly tier is non-blocking and must never be a required +check, so a separate workflow keeps its `concurrency`, `issues: write` permission, +and schedule independent of the gate. + +**Tier R — Required / per-PR (blocking), `tests.yml`.** The current 4 cells +unchanged (ubuntu all / macos all / windows unit / windows interop+integration), +**no load**, `GCL_ENVELOPE_TIER=strict` (default — the 3 wall-clock assertions pass +comfortably at zero load), `GCL_TEST_FULL=1`. Diff from today: **revert** the +per-run-unique concurrency group (`980856b`) → `group: ${{ github.workflow }}-${{ +github.ref }}` + `cancel-in-progress`; **drop** the `GCL_STRESS_*` env + `with-load.sh` +wrap + raised timeouts from the required job (`b430d73`'s workflow half); restore the +original step/job timeouts. Target < ~8 min. A red here is therefore never a +stress-manufactured flake. + +**Tier N — Nightly (non-blocking, triaged), new `nightly.yml`.** `schedule` (daily, +off-peak) + `workflow_dispatch`; one oversubscribed level **R≈2**; +`GCL_ENVELOPE_TIER=relax` + `GCL_TEST_SWEEP=1`; `concurrency: nightly` + cancel +(one run at a time). **6 explicit cells** (`matrix.include`): N1 ubuntu/cpu, N2 +ubuntu/disk, N3 ubuntu/both, N4 macos/disk (the single harsh macOS cell — scarce/slow/ +5-job sub-limit), N5 windows interop+integration/disk (highest-value: delete-pending +ghosts + 5.1 unlink-then-move under churn), N6 windows unit/both. 6 cells + kcov + +triage ≈ 8 jobs → one wave under the ~20/5 ceiling. Nightly steps keep the raised +timeouts (correct here). 
+ +**Tier D — Deep sweep (on-demand, never gates), `tests.yml`.** `workflow_dispatch` +only, inputs `stress_kind`/`stress_load`/**`repeat`**/`envelope_tier` (default relax). +**The key mechanism that lets Deep + Required coexist in one file** — an +event-conditional concurrency group so the per-run-unique group never leaks onto the +gate: +```yaml +concurrency: + group: >- + ${{ github.event_name == 'workflow_dispatch' + && format('{0}-deep-{1}', github.workflow, github.run_id) + || format('{0}-{1}', github.workflow, github.ref) }} + cancel-in-progress: ${{ github.event_name != 'workflow_dispatch' }} +``` + +**Axis-A waiter-count sweep {4,12,24}** under `GCL_TEST_SWEEP=1` (nightly/deep only; +unset per-PR → today's floor `N=4`, deterministic). A `T_AXIS_A` list read at suite +top; each of **Test 2b / Test 20 / interop Test 16** loops `N` over it, naming `N` in +every message. Anti-flake discipline baked into the loop: keep correctness assertions +config-independent (hold `STALE ≫ hold` so "zero-98 / one-steal" holds at every N — +these stay `ok`/`bad` strict, *not* `_envelope`), and **scale `MAX_WAIT` with N** so a +large-N run doesn't time out and look like a product failure. Mechanism generalizes to +Axis B/C later (deferred per §9.4). + +**kcov coverage job** (nightly.yml, Linux-only): build kcov v43 from source (no +apt/prebuilt), run the **unit suite at FULL, strict, no-load** (`--include-path=git- +commit-lock.sh`), upload HTML + cobertura (30-day retention), and gate on a +**conservative line-coverage floor of 0.80** (below the current 83.1%, above noise; +the Linux ceiling is ~94% because ~30 lines are platform-gated). **Ratchet the floor up +toward ~0.90 as Bucket-2 lands the Tier-A tests** — the floor tracks achieved coverage, +it doesn't lead it. 
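The coverage-floor gate could be sketched as below. It assumes the root `<coverage>` element of kcov's `cobertura.xml` carries a `line-rate` attribute (worth confirming against a real report before relying on it); the function name `coverage_gate` is hypothetical.

```bash
# Hypothetical gate: fail when the Cobertura line-rate is below a floor.
# Assumes the root element looks like <coverage line-rate="0.831" ...>.
coverage_gate() {  # $1 = path to cobertura.xml, $2 = floor (e.g. 0.80)
  local rate
  rate=$(grep -oE '<coverage[^>]* line-rate="[0-9.]+"' "$1" | head -n1 |
         sed -E 's/.*line-rate="([0-9.]+)".*/\1/')
  # A report with no readable rate is an error, never a pass.
  [ -n "$rate" ] || { echo "coverage: no line-rate found in $1" >&2; return 2; }
  # awk does the floating-point comparison that plain [ ] cannot.
  awk -v r="$rate" -v f="$2" 'BEGIN { exit !(r + 0 >= f + 0) }' \
    && echo "coverage $rate >= floor $2" \
    || { echo "coverage $rate below floor $2" >&2; return 1; }
}
```

Ratcheting the floor is then a one-argument change at the call site (for example `coverage_gate "$report" 0.80` now, a higher value once the Tier-A tests land).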
+ +**Nightly issue auto-triage** (nightly.yml, `if: always()`, `issues: write`): parse the +preserved logs — `^FAIL:` and/or job `failure` → **correctness** (file/append a +labelled issue, investigate); no FAIL but `WARN[env-relaxed]` and job `success` → +**envelope-flake** (tracked, no action); timeout/checkout failure → **infra**. +Idempotent (search-then-append, one issue per (date, class); no all-green spam). +**Empty-round guard (learned-once):** every cell's artifact missing / workflow errored +before any suite ran is an **infra** failure — do NOT read "0 FAIL across 0 logs" as +green. Upload nightly logs on success too (need the negatives to read the positives). + +**Load calibration** (`with-load.sh` graduates from scaffolding): express load as +oversubscription ratio `R = stressors/nproc` (cap `R_total`), prefer `stress-ng` +(Windows spinner fallback) and a **probe-gated** Linux cgroup CPU-quota path for the +calibrated envelope leg (IO throttling experimental — don't rely on it); emit a per-run +**load-manifest** artifact (`{kind, R, nproc, achieved-slowdown, tool versions, os/arch, +sha}`) uploaded on success too. + +**What lands on `main` vs stays scaffolding (refines Bucket 5 / D-d):** +- **Graduate to `main`:** the calibrated `with-load.sh` (strip the do-not-merge banner; + add ratio calibration + load-manifest); `ok_envelope`/`bad_envelope` + the 3 + reassigned assertions; `GCL_TEST_SWEEP` + Axis-A loop (default-off → per-PR identical + to today); the new `nightly.yml`; the `tests.yml` event-conditional-concurrency edit + + dispatch inputs. So `b430d73` is **not** wholly do-not-merge — its `with-load.sh` + payload graduates; only its *required-job wiring* is dropped. +- **Revert / drop:** `980856b` (flat per-run-unique group); `b430d73`'s load-wrap + + raised-timeouts **on the required job** (they move to nightly.yml). 
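The triage classes described in this section, including the empty-round guard, can be sketched as one classifier per cell. This is a rough illustration: `classify_cell` is a hypothetical name, and the conclusion strings (`success`, `failure`, anything else) are assumptions about what the workflow would pass in.

```bash
# Hypothetical per-cell classifier for the nightly triage.
# $1 = preserved suite log (may be missing), $2 = job conclusion string.
classify_cell() {
  local log="$1" conclusion="$2"
  # Empty-round guard: a missing or empty log is infra, never "0 FAIL = green".
  if [ ! -s "$log" ]; then echo infra; return 0; fi
  if grep -q '^FAIL:' "$log" || [ "$conclusion" = failure ]; then
    echo correctness                      # investigate
  elif grep -q '^WARN\[env-relaxed\]' "$log" && [ "$conclusion" = success ]; then
    echo envelope-flake                   # tracked, no action
  elif [ "$conclusion" = success ]; then
    echo green
  else
    echo infra                            # timeout, cancellation, checkout failure
  fi
}
```

The issue-filing step would then search-then-append keyed on (date, class), skipping `green` so an all-green night files nothing.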
+ +**§7 GitHub-Actions gotchas the diff MUST honor:** +- **`paths-ignore` on a *required* check blocks doc-only PRs** (skipped workflow → checks + Pending → merge blocked). The current `tests.yml` has both `paths-ignore` and the + required jobs. **Fix (required, not optional):** keep the workflow always-running and + path-filter only the expensive `test`/`lint` *steps*, with a tiny always-green job + satisfying the required check on doc-only PRs (recommended), or make a separate cheap + job the required check. +- **`max-parallel` is intra-matrix only** — bound Deep/Nightly with workflow-level + `concurrency` groups (done), never `max-parallel`. +- **`schedule` auto-disables after ~60 days of repo inactivity** — note in `nightly.yml`; + rely on `workflow_dispatch` to re-trigger. A successor should know an empty nightly + history may mean "disabled," not "passing." +- **Artifact names** unique per `(os, leg, kind)`; keep `include-hidden-files: true` + (the lock logs live under the scratch `.git/`). `fail-fast: false` stays (per-OS + signal + triage needs every cell's verdict). 256-job cap irrelevant at this scale. + +--- + +## Bucket 8 — Harness ergonomics (zero-dep; prototype-validated) + +Tests are straight-line `echo "== Test N: … =="` blocks (no registry): **43** in the +unit suite (the "~36" figure was stale), 25 interop, 2+1 integration. Sequencing is +**TAP → selector → extract** (each its own commit). + +**Item 1 — TAP + `1..N` plan line + the undercount fix (do FIRST, ~20 lines/suite).** +The bug: under `set -uo pipefail` (no `-e`), an early `exit`/crash terminates the +suite before the final `echo RESULT` + `[ "$FAIL" = 0 ]`, dropping later assertions +from the count — and a stray `exit 0` after a recorded FAIL exits **0 with no RESULT +line** (a *silent green*). 
Fix, three parts (all prototype-validated): +- Make `ok`/`bad` TAP-aware, gated by `GCL_TAP=1` (dev runs byte-unchanged): bump a + running `TAPN` and emit `ok N - desc` / `not ok N - desc`; keep the `return 0` that + the `A && ok || bad` idiom needs. +- Emit a **trailing `1..$TAPN`** plan line before the verdict — a consumer fails on a + short count. +- A **"reached-the-end" sentinel**: `DONE=0` set to `1` as the last action before the + verdict; a `finish` EXIT trap (wrapping the existing per-suite `cleanup`) that, if it + fires with `DONE!=1`, prints `Bail out!` and **`exit 1`**. (Key validated detail: a + bare trap *return* is ignored — the script keeps its pre-trap code — so the guard + needs an explicit `exit 1`; this is what converts the silent early-`exit 0`-after-FAIL + into a red.) No hand-maintained expected-count constant — the sentinel catches *any* + premature termination with zero upkeep. Apply to all three suites. + +**Item 2 — `GCL_TEST_ONLY=<regex>` selector (SECOND; 43 mechanical header rewrites).** +Wrap each block: `echo "== Test N: … =="` → `if section "Test N: …"; then … fi`, where +`section` echoes the header and returns success iff `GCL_TEST_ONLY` is unset or its +regex matches the label. **Care point:** a few blocks do trailing cleanup *after* the +last assertion before the next header — those lines must move *inside* the `fi`. +**Integration is EXCLUDED by design:** its Tests 1-3 share one repo + `ALL_IDS` +accumulator (Test 3 audits 1+2's output), so it is one indivisible scenario — it +must *note-and-ignore* `GCL_TEST_ONLY` (loud stderr note), never per-block select. +Unit first; interop the same treatment (lower priority). Anchoring tip for docs: +`'Test 2'` also matches `Test 2b/20/25` — use `'Test 2:'` / `'Test 2b'`. + +**Item 3 — extract `tests/_harness.sh` (LAST; pure dedup, largest diff).** Source one +shared file from each suite.
Tier 1 (all three): the `PASS/FAIL/TAPN/DONE` inits + +`GCL_TAP`/`GCL_TEST_ONLY` reads, `ok`/`bad`, `section`, the `finish`/sentinel helper, +and the shared shellcheck disables. Tier 2 (unit+interop only — integration uses none): +`epoch_to_stamp`, `backdate`, `backdate_ghost`, `sync_waiting_fresh`, `fabricate_lock`, +`wait_for_grep`, `clone_fn` + its `export -f` line. Tier 3: keep **both** poll helpers +under their existing names/semantics (`wait_for_file` `$2`=seconds, interop's `wait_for` +`$2`=50ms-iterations) — do *not* unify signatures this pass (would touch every call site +on the most fragile timing axis). **Do NOT extract `cleanup`** — it closes over each +suite's `$WORK` and interop's body genuinely differs; the shared `finish` just calls the +suite-local `cleanup`. Do it last so the final TAP/selector code is extracted once. +Verify byte-identical behavior by diffing a FULL run's sorted `PASS:`/`FAIL:` set +before/after (CI or local). + +Prototypes (gitignored, `.agent-testing/bucket8-proto/`) validate TAP emission, the +trailing plan, selector matching, TAP+selector composition, and the sentinel closing +the exact silent-green bug. + +--- + +## Phasing for Phase 3 (the build) + +Order chosen so cheap, enabling work lands first and each step is CI-verifiable: + +1. **Bucket 8 items 1-2 first** (TAP + `GCL_TEST_ONLY`) — they make iterating on + ~15 new tests far cheaper and give machine-readable CI output to read the new + tests' results back from. (Per the harness design's safe-increment order.) +2. **Bucket 3 doc edits** — independent, low-risk, can land anytime; do early so + the docs match the contract. +3. **Bucket 4 envelope switch** (`GCL_ENVELOPE_TIER`) — needed before the nightly + CI tier and before scoping Test 21/22a/29. +4. **Bucket 2A steering tests** (A1/A2/A4 first, then the rest) — the coverage core. +5. **Bucket 2B fault-injection tests** (the feasible D-b first cut; flag/defer any + non-portable lane). +6. 
**Bucket 8 item 3** (`_harness.sh` extraction) — after the new tests exist, so + the shared helpers are settled. +7. **Bucket 6 CI matrix** — wire the three tiers + kcov leg + parametrization last, + once the tests and the envelope switch exist for it to orchestrate. + +Each step commits incrementally under the commit-lock; verification dispatches +`tests.yml` on `ci-stress`. **Build vs Workflow:** decide hand-run vs a Claude Code +Workflow once the final test count is known (plan D-e) — likely a Workflow for the +~15 steering tests (fan-out write + per-test CI verify). + +## Logging / observability design (per engineering practices) +- **New tests** assert on the product's existing protocol log strings (the coverage + proxy the audit used) — every new steering test greps a specific log line, so a + silent behavior change is caught. +- **TAP output** (Bucket 8) makes each assertion's pass/fail individually visible in + CI logs, and the `1..N` plan line makes a truncated run fail loudly (closing the + silent-undercount gap). +- **The load-manifest artifact** (Bucket 6) records `{kind, R, nproc, + achieved-slowdown, tool versions, runner os/arch, git sha}` per nightly/deep run, + uploaded on success too, so any flake is reproducible (the reproducible-experiments + requirement). +- **kcov coverage artifact** (Bucket 6) uploaded per Linux run; the gap list in + `steering-coverage.md` is the baseline to diff against. +- **Nightly auto-triage** tags a failing scheduled run `correctness` (investigate) + vs `envelope` (expected under load), so scheduled reds are visible, not silent. + +## Open decisions for Ben +- **D-b tiering (confirm):** build all of Tier A (A1-A11) + the Tier-B first cut + (F4, F2/J1) now? 
The original D-b's "second tier" items are all accounted for — + E3 → **A6** (steering, not fault-injection), F2-audit #7 (rename-refused) → **A1**, + #8 (Windows blocked-unlink) → **Tier C** (platform-only, verified on the Windows + leg); only **F1/F3** are genuinely not portably injectable. (Recommend: yes — Tier A + is all portable; defer only F1/F3.) +- **F1 (ENOSPC) — gated test vs document-only:** F1's behavior is validated but its + injection needs Linux root (`mount`). Ship as a Linux-only test gated behind a + `sudo -n` capability probe (skip-with-note elsewhere, `sudo umount` in cleanup), or + document-only? (Recommend: the **gated test** — GitHub `ubuntu-*` runners have + passwordless sudo so it actually runs there and skips cleanly everywhere else; falls + back to document-only if you'd rather keep zero root in the suite.) **F3 is + document-only either way** (no deterministic injection exists — the create needs ~1 FD). +- **Build mechanism (D-e):** hand-run Phase 3, or a Claude Code Workflow for the test + fan-out? (Recommend: decide once the count is final — ~13 steering + 2-3 fault tests; + lean Workflow for the steering batch, hand-run the CI/doc edits.) +- Anything else needing a call is surfaced inline in the integrated sections. + +## Changelog +(empty — Phase 2 planning; implementation changelog starts in Phase 3.) From 26c9c29fbf11e4149d40bb8e9c97fe843c7a9871 Mon Sep 17 00:00:00 2001 From: Ben Toner Date: Wed, 17 Jun 2026 20:36:13 +1000 Subject: [PATCH 27/76] Phase 2 plan: fold review round (Claude + Codex); both verdict sound-to-gate MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit CI (Codex): split into THREE workflows — tests.yml (required) + a stable tests-passed aggregator as the ONLY required context; nightly.yml; deep-sweep.yml with distinct job names. 
This fixes the workflow_dispatch-publishes-check-contexts gating risk AND the paths-ignore-on-required gotcha, and drops the event-conditional concurrency expression. Made ok_envelope/bad_envelope TAP-aware (Bucket 8 item 1 lands first, so TAPN/GCL_TAP exist). Added a GCL_TEST_ONLY zero-match guard. Tests (Claude): A6 must shadow the INNER _lock_stat_mtime (:606), NOT _lock_path_mtime (:639-643), which is the function that emits the warn-once the test asserts — verified against the code. Test 22a downgrade refined to only the warning-fired-at-all assertion (keep the warn-once dedup n<=1 and names-type strict). A reviewer's Test-22a line numbers were a mislocation — the plan mapping (:1167/:1168/:1170) is verified correct against the tree. F3 reclassified document-only supersedes steering-coverage B4's "portable POSIX" rating (ulimit -n can't fail the ~1-FD create without starving bash startup) — steering-coverage.md B4 corrected to match. Co-Authored-By: Claude Opus 4.8 (1M context) --- .../2026-06-17-ci-stress-phase2-build-plan.md | 96 ++++++++++++------- docs/steering-coverage.md | 15 ++- 2 files changed, 72 insertions(+), 39 deletions(-) diff --git a/.plans/2026-06-17-ci-stress-phase2-build-plan.md b/.plans/2026-06-17-ci-stress-phase2-build-plan.md index a7c2edf..cb0e1a1 100644 --- a/.plans/2026-06-17-ci-stress-phase2-build-plan.md +++ b/.plans/2026-06-17-ci-stress-phase2-build-plan.md @@ -39,7 +39,7 @@ line anchors are current-tree and may drift (re-locate at build). 
| **A3** | `foreign` claim-recheck branch (`:1103-1106`; kcov hits=0) | shadow the claim read at recheck to return a *foreign* token (a clearer removed our claim, a rival re-claimed) | leave the foreign claim; discovery read; back off; no 98-on-mere-claim | all | MED-HIGH | | **A4** | `exec`-bypass / §H4 no-silent-loss boundary (`lock_run` runs `"$@"` in the wrapper shell, `:1733`) | **(corrected, verified empirically)** the exec must run in the lock-holding shell: `run -- exec true` or sourced `lock_acquire; exec true` — **NOT** `run -- bash -c 'exec true'` (that execs a child, releases normally) | (a) benign: no `RELEASED` line / lock left held; (b) displaced (backdated lease + parked contender) + exec 0 → caller sees 0 with **no** 98 — pins `guarantees.md` OOS-5 | all (bash) | **HIGH** — the one silent-loss boundary | | **A5** | forward clock-jump → premature steal of a live lock (§E2; `:928,1409`) | `clone_fn _lock_now` to return now+offset on the poll while the live holder's mtime stays current | the live lock is judged stale and stolen; the victim's release hits **98** (clock-driven analogue of Test 4b) | all | MED | -| **A6** | mtime-unreadable fail-safe (§E3; `:639-645`, consumed `:912-926`) | `clone_fn` the mtime helper (`_lock_path_mtime` / its `stat` shadow) to return empty on a *present* file | warn-once "Staleness detection is BROKEN"; **no steal**; waiter → 97; (closes BE-3's "coverage planned") | all (bash; + ps1 parity if feasible) | MED | +| **A6** | mtime-unreadable fail-safe (§E3; `:639-645`, consumed `:912-926`) | `clone_fn _lock_stat_mtime` (the **inner** stat probe at `:606`) to return empty on a *present* file — **NOT** `_lock_path_mtime`, which is the function that *emits* the warn-once (`:639-643`); shadowing it would defeat the assertion | warn-once "Staleness detection is BROKEN"; **no steal**; waiter → 97; (closes BE-3's "coverage planned") | all (bash; + ps1 parity if feasible) | MED | | **A7** | malformed/unreadable content 
classification tails (`_lock_verify_stale` `:940-949`; in-acquire steal guard `:1429-1443`; claim-stale-check `:1240-1249`) | fabricate a line-1-whitespace file (non-empty blank line 1 = `#18`); shadow a read-fault (`#17`) | no steal; the right `not a lock/claim file` / `unreadable` warning; covers several sibling branches per test | all | LOW-MED (cheap, multi-branch) | | **A8** | socket & device-node wrong-type arms (`:1474-1475` claim, `:1561-1562` lock; kcov-new) | bind a unix socket / reference a device node (`/dev/null`) at the path | refusal (never stolen/deleted); the `-S`/`-b`/`-c` arms execute | POSIX | LOW (cheap; sibling of tested guard) | | **A9** | log rotation past 1 MB (`:558-559`; kcov-new) | pre-write a >1 MB `$AGENT_LOCK_LOG`, trigger a log call | truncate-restart (log shrinks; lock unaffected) | all | LOW (trivial, no injection) | @@ -77,7 +77,10 @@ F3 in the first cut) based on the feasibility results. validated. *(Decision point for Ben — see Open decisions.)* - **Document-only:** F3 (and F1 if Ben prefers zero root in the suite). Note the validated behavior in `failure-modes.md` §F1/§F3 (the empty-orphan→97 path) rather than shipping a - flaky/non-portable test. + flaky/non-portable test. **This supersedes `steering-coverage.md` §3 B4's "portable POSIX" + rating and the failure-modes §4.5/Q5 "`ulimit -n` for FDs" suggestion** — the empirical + check shows the create needs ~1 FD, so no `ulimit -n` fails it without first starving + bash's own startup (harness corruption). `steering-coverage.md` B4 is corrected to match. 
**Implementation notes (match existing idioms):** use the `LOCK`/`LOG`/`AGENT_LOCK_*` env vocabulary and the `rc=$?; [ "$rc" = 97 ] && ok … || bad …` + `grep -q "TIMEOUT after"` @@ -148,11 +151,20 @@ then — same signature, so the later move is mechanical): ```bash ENVELOPE_TIER="${GCL_ENVELOPE_TIER:-strict}" # default strict; nightly/deep set relax ENV_WARN=0 -ok_envelope() { echo "PASS[env]: $*"; PASS=$((PASS+1)); } -bad_envelope() { # the FAIL branch of a wall-clock/poll-count (Tier-2) assertion only - if [ "$ENVELOPE_TIER" = relax ]; then echo "WARN[env-relaxed]: $*"; ENV_WARN=$((ENV_WARN+1)) - else echo "FAIL: $*"; FAIL=$((FAIL+1)); fi -} +# TAP-aware (Bucket 8 item 1 lands FIRST, so TAPN/GCL_TAP already exist — review catch). +# An envelope PASS is a normal `ok`; an envelope FAIL is a hard `bad` in strict, but in +# relax it is a TAP-passing line with a `# env-relaxed` directive — it counts toward the +# 1..N plan and bumps ENV_WARN (for triage), and NEVER reds the run. +ok_envelope() { PASS=$((PASS+1)); TAPN=$((TAPN+1)); echo "PASS[env]: $*" + [ "${GCL_TAP:-0}" = 1 ] && echo "ok $TAPN - $*"; return 0; } +bad_envelope() { + if [ "$ENVELOPE_TIER" = relax ]; then + ENV_WARN=$((ENV_WARN+1)); TAPN=$((TAPN+1)); echo "WARN[env-relaxed]: $*" + [ "${GCL_TAP:-0}" = 1 ] && echo "ok $TAPN - $* # env-relaxed" + else + FAIL=$((FAIL+1)); TAPN=$((TAPN+1)); echo "FAIL: $*" + [ "${GCL_TAP:-0}" = 1 ] && echo "not ok $TAPN - $*" + fi; return 0; } ``` - **`ok`/`bad` = the strict-correctness tier** (always hard, both tiers); @@ -163,9 +175,16 @@ bad_envelope() { # the FAIL branch of a wall-clock/poll-count (Tier-2) asserti `*_envelope` on the *wall-clock* assertion only; every neighbouring correctness assertion (rc=97, no-steal, dir-untouched, STOLE-BY-CLAIM, …) **keeps `ok`/`bad`**: - **Test 21** `:1144` — recovery latency `≤20s`. 
- - **Test 22a** `:1167` (warning fired — relies on two-poll-confirm headroom), - `:1170` (fired exactly once), and `:1168` (warning names the type — contingent - on the same starved warning). The never-steal / never-delete assertions stay strict. + - **Test 22a** — downgrade ONLY the *warning-fired-at-all* assertion (`:1167`, + `grep -q "is not a claim file"`, i.e. count `≥1`), which depends on two-poll-confirm + headroom under load. Keep the warn-once **correctness** strict: **split** the current + `n==1` check (`:1170`) into `n≥1` (→ `bad_envelope`, timing) **+** `n≤1` (→ `bad`, + strict — the dedup property: never warns twice), and **guard** "names the type" + (`:1168`) on a warning having fired (assert strictly only when `n≥1`). So a real + warn-once regression (n≥2, or wrong type) stays a hard FAIL even under `relax`. + (Mapping `:1167`/`:1168`/`:1170` verified against the current tree — a reviewer's + alternate line numbers were a mislocation; re-confirm at build.) The never-steal / + never-delete assertions (`:1171`/`:1172`) stay strict. - **Test 29** `:1531` — `≥2` CLAIM lines (poll-count). - **Required CI sets `strict` (or leaves it unset)** — at zero artificial load the three pass comfortably, so the gate behavior is unchanged; **nightly/deep set @@ -177,11 +196,24 @@ bad_envelope() { # the FAIL branch of a wall-clock/poll-count (Tier-2) asserti ## Bucket 6 — CI matrix wiring (the accepted load-strategy §9 decisions) -**Two-workflow structure:** keep `tests.yml` for **Tier R (required)** + **Tier D -(deep dispatch)**; add a new `nightly.yml` for **Tier N (nightly)** + the kcov job + -triage. Rationale: the nightly tier is non-blocking and must never be a required -check, so a separate workflow keeps its `concurrency`, `issues: write` permission, -and schedule independent of the gate. 
+**Three-workflow structure** (revised after review — a `workflow_dispatch` run +publishes check contexts on the head SHA, so keeping Deep in `tests.yml` under shared +job names risks a failed Deep run gating a PR; separate files + a stable required +aggregator remove that risk *and* the event-conditional concurrency): +- **`tests.yml`** — Tier R (required): the 4-cell `test` matrix + `lint` + a single + stable **`tests-passed` aggregator** (`needs: [test, lint]`, `if: always()`, succeeds + iff every needed job *succeeded or was skipped*). **Branch protection requires ONLY + `tests-passed`**, not the per-cell matrix contexts. Concurrency: `group: ${{ + github.workflow }}-${{ github.ref }}` + `cancel-in-progress`. +- **`nightly.yml`** — Tier N + the kcov job + triage (`issues: write`, `schedule`, its + own `concurrency: nightly`). +- **`deep-sweep.yml`** — Tier D (`workflow_dispatch` only), with **distinct job names** + (`deep-*`) so it never publishes the `tests-passed` context, and per-run-unique + concurrency. +This also fixes the **`paths-ignore`-on-required gotcha** cleanly: path-filter the +expensive `test`/`lint` jobs (they *skip* on doc-only PRs) while `tests-passed` always +runs and reports green (its needs were skipped, not failed) — so a doc-only PR satisfies +the one required context without the expensive jobs running. **Tier R — Required / per-PR (blocking), `tests.yml`.** The current 4 cells unchanged (ubuntu all / macos all / windows unit / windows interop+integration), @@ -203,19 +235,13 @@ ghosts + 5.1 unlink-then-move under churn), N6 windows unit/both. 6 cells + kcov triage ≈ 8 jobs → one wave under the ~20/5 ceiling. Nightly steps keep the raised timeouts (correct here). -**Tier D — Deep sweep (on-demand, never gates), `tests.yml`.** `workflow_dispatch` -only, inputs `stress_kind`/`stress_load`/**`repeat`**/`envelope_tier` (default relax). 
-**The key mechanism that lets Deep + Required coexist in one file** — an -event-conditional concurrency group so the per-run-unique group never leaks onto the -gate: -```yaml -concurrency: - group: >- - ${{ github.event_name == 'workflow_dispatch' - && format('{0}-deep-{1}', github.workflow, github.run_id) - || format('{0}-{1}', github.workflow, github.ref) }} - cancel-in-progress: ${{ github.event_name != 'workflow_dispatch' }} -``` +**Tier D — Deep sweep (`deep-sweep.yml`, `workflow_dispatch` only, never gates).** +Inputs `stress_kind`/`stress_load`/**`repeat`**/`envelope_tier` (default relax). Its +jobs use **distinct names** (`deep-*`) so a failed dispatch never publishes the +`tests-passed` required context (the review catch), with per-run-unique concurrency +(`group: deep-${{ github.run_id }}`, `cancel-in-progress: false`) so many parallel +dispatches each run and accept queue waves. Living in its own file removes any need for +an event-conditional concurrency expression. **Axis-A waiter-count sweep {4,12,24}** under `GCL_TEST_SWEEP=1` (nightly/deep only; unset per-PR → today's floor `N=4`, deterministic). A `T_AXIS_A` list read at suite @@ -262,11 +288,10 @@ sha}`) uploaded on success too. **§7 GitHub-Actions gotchas the diff MUST honor:** - **`paths-ignore` on a *required* check blocks doc-only PRs** (skipped workflow → checks - Pending → merge blocked). The current `tests.yml` has both `paths-ignore` and the - required jobs. **Fix (required, not optional):** keep the workflow always-running and - path-filter only the expensive `test`/`lint` *steps*, with a tiny always-green job - satisfying the required check on doc-only PRs (recommended), or make a separate cheap - job the required check. + Pending → merge blocked). **Fixed** by the `tests-passed` aggregator above: it is the + sole required context and always runs (green when the path-filtered `test`/`lint` jobs + skip), so doc-only PRs merge. 
Branch protection must require **`tests-passed`**, NOT the + per-cell matrix contexts (else skipped cells sit Pending). - **`max-parallel` is intra-matrix only** — bound Deep/Nightly with workflow-level `concurrency` groups (done), never `max-parallel`. - **`schedule` auto-disables after ~60 days of repo inactivity** — note in `nightly.yml`; @@ -311,7 +336,10 @@ last assertion before the next header — those lines must move *inside* the `fi accumulator (Test 3 audits 1+2's output), so it is one indivisible scenario — it must *note-and-ignore* `GCL_TEST_ONLY` (loud stderr note), never per-block select. Unit first; interop the same treatment (lower priority). Anchoring tip for docs: -`'Test 2'` also matches `Test 2b/20/25` — use `'Test 2:'` / `'Test 2b'`. +`'Test 2'` also matches `Test 2b/20/25` — use `'Test 2:'` / `'Test 2b'`. **Zero-match +guard (review catch):** `section` bumps a `SECTIONS_RUN` counter when it runs a block; +at the end, if `GCL_TEST_ONLY` is set and `SECTIONS_RUN==0`, fail loudly — a typo'd regex +must not report a vacuous `PASS=0 FAIL=0` green (same spirit as the undercount sentinel). **Item 3 — extract `tests/_harness.sh` (LAST; pure dedup, largest diff).** Source one shared file from each suite. Tier 1 (all three): the `PASS/FAIL/TAPN/DONE` inits + diff --git a/docs/steering-coverage.md b/docs/steering-coverage.md index dd98461..8abaa03 100644 --- a/docs/steering-coverage.md +++ b/docs/steering-coverage.md @@ -193,12 +193,17 @@ rather than shipping a flake** (per the §4.5 decision). - **B1 — Unwritable lock dir/parent → clean 97** (F4). `chmod` the dir. POSIX; the cheapest and highest-value fault-injection test. **High.** - **B2 — Unwritable/failing log path → lock still works, log swallowed** (F2/J1). - Bad/again-`chmod`'d log path. POSIX. **Medium-high.** + *Phase-2 feasibility:* use the **ENOTDIR trick** (`AGENT_LOCK_LOG` under a regular + file) — **portable, no chmod/guard**. 
**First cut.** - **B3 — ENOSPC during claim/lock create+write** (F1; the create write-fail branch - `#5` and the read-fault lanes `:848,871-873`). Small dedicated tmpfs/quota. - Linux-friendliest; flag if not portable. **Medium.** -- **B4 — FD exhaustion via `ulimit -n`** (F3). Portable POSIX; inode exhaustion - only if cleanly injectable. **Medium.** + `#5` and the read-fault lanes `:848,871-873`). *Phase-2 feasibility:* real injection + needs `mount` (Linux **root**); `ulimit -f` is a SIGXFSZ trap (wrong lane). **Second + cut — Linux + `sudo -n` probe-gated, or document-only.** +- **B4 — FD exhaustion via `ulimit -n`** (F3). **Corrected (Phase-2 feasibility, + supersedes the earlier "portable POSIX" rating):** NOT portably/deterministically + injectable — the create needs only ~1 FD, so any `ulimit -n` low enough to fail it + first starves bash's own startup (machine-dependent harness corruption); inode + exhaustion needs root. **Document-only.** ### Tier C — Platform-only (verify off-Linux; not a Linux gap) From 3789be9a8a79daa88c841ac26edf775eaaccadf8 Mon Sep 17 00:00:00 2001 From: Ben Toner Date: Thu, 18 Jun 2026 00:58:06 +1000 Subject: [PATCH 28/76] Bucket 8 item 1: TAP output + 1..N plan line + undercount sentinel MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit All three suites: ok/bad now emit TAP13 (`ok`/`not ok N - desc`) gated by GCL_TAP=1 (dev runs byte-unchanged); a trailing `1..N` plan line lets a consumer fail on a short count; and a DONE sentinel + a finish() EXIT-trap wrapper turn any early exit/crash into a loud `Bail out!` + exit 1 — closing the silent-undercount gap (a stray `exit 0` after a recorded FAIL no longer reports green). A bare trap `return` is ignored by bash, so the guard uses an explicit `exit 1`. Validated: unit suite REDUCED + GCL_TAP=1 -> 220/220, the `1..220` plan line matches the assertion count, exit 0, sentinel does not false-fire. 
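The trap detail in this message can be checked in isolation. What follows is a minimal sketch, assuming only that `bash` is on PATH; the helper name `trap_rc` and the handler name `f` are illustrative and do not exist in the suites.

```shell
#!/usr/bin/env bash
# Sketch only: trap_rc is a hypothetical helper, not part of the test suites.
# It runs a tiny child script whose EXIT-trap handler body is "$1" and whose
# pre-trap exit code is 0, then prints the child's final exit status.
trap_rc() {
  bash -c "f() { $1; }; trap f EXIT; exit 0"
  echo "$?"
}

trap_rc 'return 1'   # handler only returns non-zero: the pre-trap code survives
trap_rc 'exit 1'     # handler calls exit: the child's status is overridden
```

The first call shows why a `finish` that merely returned non-zero would leave a stray `exit 0` green, and the second shows the explicit `exit 1` turning it red.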
interop + integration syntax-checked here; full runs verify via CI. Phase 3, step 1 of the Phase 2 build plan (Bucket 8 item 1). Co-Authored-By: Claude Opus 4.8 (1M context) --- tests/git-commit-lock.integration.test.sh | 29 +++++++++++++++++++---- tests/git-commit-lock.interop.test.sh | 29 +++++++++++++++++++---- tests/git-commit-lock.test.sh | 29 +++++++++++++++++++---- 3 files changed, 75 insertions(+), 12 deletions(-) diff --git a/tests/git-commit-lock.integration.test.sh b/tests/git-commit-lock.integration.test.sh index a142bba..579a5da 100644 --- a/tests/git-commit-lock.integration.test.sh +++ b/tests/git-commit-lock.integration.test.sh @@ -59,11 +59,30 @@ cleanup() { rm -rf "$WORK" 2>/dev/null || true fi } -trap cleanup EXIT +# Sentinel: the suite reaching its end sets DONE=1. If the EXIT trap fires with +# DONE!=1, the suite died early (a stray exit/crash) and the assertion count is +# unreliable — fail loudly even if the pre-trap code was 0. A bare trap `return` +# is IGNORED (the script keeps its pre-trap code), so the guard must `exit 1`. +finish() { + cleanup + if [ "${DONE:-0}" != 1 ]; then + echo "Bail out! suite terminated early before the plan line; ran ${TAPN:-0} assertion(s), count unreliable" >&2 + exit 1 + fi +} +trap finish EXIT -PASS=0; FAIL=0 -ok() { echo "PASS: $*"; PASS=$((PASS+1)); } -bad() { echo "FAIL: $*"; FAIL=$((FAIL+1)); } +PASS=0; FAIL=0; TAPN=0; DONE=0 +GCL_TAP="${GCL_TAP:-0}" # CI sets GCL_TAP=1 for machine-readable TAP13 output +# ok/bad are TAP-aware (gated by GCL_TAP so plain dev runs are byte-unchanged) and +# bump the running assertion number TAPN. The trailing `1..$TAPN` plan line (emitted +# just before the verdict) lets a TAP consumer fail on a short count; together with the +# DONE sentinel above this closes the silent-undercount gap. `return 0` preserves the +# "ok/bad cannot fail" property the ` && ok ... || bad ...` idiom relies on. 
+ok() { PASS=$((PASS+1)); TAPN=$((TAPN+1)); echo "PASS: $*" + [ "$GCL_TAP" = 1 ] && echo "ok $TAPN - $*"; return 0; } +bad() { FAIL=$((FAIL+1)); TAPN=$((TAPN+1)); echo "FAIL: $*" + [ "$GCL_TAP" = 1 ] && echo "not ok $TAPN - $*"; return 0; } # --- sizing ------------------------------------------------------------------ # Commits serialise (that's the whole point), so wall time ≈ workers x commit @@ -301,5 +320,7 @@ done || bad "$n_next leftover claim file(s) beside the lock" echo +DONE=1 echo "==== INTEGRATION RESULT: $PASS passed, $FAIL failed (fan-out: $GCL_MODE) ====" +[ "$GCL_TAP" = 1 ] && echo "1..$TAPN" [ "$FAIL" = 0 ] diff --git a/tests/git-commit-lock.interop.test.sh b/tests/git-commit-lock.interop.test.sh index 8d2a566..a638005 100644 --- a/tests/git-commit-lock.interop.test.sh +++ b/tests/git-commit-lock.interop.test.sh @@ -67,9 +67,17 @@ WORK="$(pwsh -NoProfile -Command '[IO.Path]::Combine([IO.Path]::GetTempPath(), " WORK="${WORK//\\//}" mkdir -p "$WORK" -PASS=0; FAIL=0 -ok() { echo "PASS: $*"; PASS=$((PASS+1)); } -bad() { echo "FAIL: $*"; FAIL=$((FAIL+1)); } +PASS=0; FAIL=0; TAPN=0; DONE=0 +GCL_TAP="${GCL_TAP:-0}" # CI sets GCL_TAP=1 for machine-readable TAP13 output +# ok/bad are TAP-aware (gated by GCL_TAP so plain dev runs are byte-unchanged) and +# bump the running assertion number TAPN. The trailing `1..$TAPN` plan line (emitted +# just before the verdict) lets a TAP consumer fail on a short count; together with the +# DONE sentinel below this closes the silent-undercount gap. `return 0` preserves the +# "ok/bad cannot fail" property the ` && ok ... || bad ...` idiom relies on. 
+ok() { PASS=$((PASS+1)); TAPN=$((TAPN+1)); echo "PASS: $*" + [ "$GCL_TAP" = 1 ] && echo "ok $TAPN - $*"; return 0; } +bad() { FAIL=$((FAIL+1)); TAPN=$((TAPN+1)); echo "FAIL: $*" + [ "$GCL_TAP" = 1 ] && echo "not ok $TAPN - $*"; return 0; } # Failure post-mortems need the logs: keep $WORK when anything failed, and # honour GCL_TEST_PRESERVE_DIR (the CI preserve-logs knob) by copying @@ -86,7 +94,18 @@ cleanup() { fi rm -rf "$WORK" 2>/dev/null || true } -trap cleanup EXIT +# Sentinel: the suite reaching its end sets DONE=1. If the EXIT trap fires with +# DONE!=1, the suite died early (a stray exit/crash) and the assertion count is +# unreliable — fail loudly even if the pre-trap code was 0. A bare trap `return` +# is IGNORED (the script keeps its pre-trap code), so the guard must `exit 1`. +finish() { + cleanup + if [ "${DONE:-0}" != 1 ]; then + echo "Bail out! suite terminated early before the plan line; ran ${TAPN:-0} assertion(s), count unreliable" >&2 + exit 1 + fi +} +trap finish EXIT # Poll for a marker file: ready-markers replace fixed head-start sleeps so a # slow pwsh cold-start (1-3s+ under load) can't fake an ordering failure. @@ -1380,5 +1399,7 @@ else fi echo +DONE=1 echo "==== INTEROP RESULT: $PASS passed, $FAIL failed (fan-out: $GCL_MODE) ====" +[ "$GCL_TAP" = 1 ] && echo "1..$TAPN" [ "$FAIL" = 0 ] diff --git a/tests/git-commit-lock.test.sh b/tests/git-commit-lock.test.sh index b5ca5ee..5491768 100755 --- a/tests/git-commit-lock.test.sh +++ b/tests/git-commit-lock.test.sh @@ -51,11 +51,30 @@ cleanup() { rm -rf "$WORK" 2>/dev/null || true fi } -trap cleanup EXIT +# Sentinel: the suite reaching its end sets DONE=1. If the EXIT trap fires with +# DONE!=1, the suite died early (a stray exit/crash) and the assertion count is +# unreliable — fail loudly even if the pre-trap code was 0. A bare trap `return` +# is IGNORED (the script keeps its pre-trap code), so the guard must `exit 1`. +finish() { + cleanup + if [ "${DONE:-0}" != 1 ]; then + echo "Bail out! 
suite terminated early before the plan line; ran ${TAPN:-0} assertion(s), count unreliable" >&2 + exit 1 + fi +} +trap finish EXIT -PASS=0; FAIL=0 -ok() { echo "PASS: $*"; PASS=$((PASS+1)); } -bad() { echo "FAIL: $*"; FAIL=$((FAIL+1)); } +PASS=0; FAIL=0; TAPN=0; DONE=0 +GCL_TAP="${GCL_TAP:-0}" # CI sets GCL_TAP=1 for machine-readable TAP13 output +# ok/bad are TAP-aware (gated by GCL_TAP so plain dev runs are byte-unchanged) and +# bump the running assertion number TAPN. The trailing `1..$TAPN` plan line (emitted +# just before the verdict) lets a TAP consumer fail on a short count; together with the +# DONE sentinel above this closes the silent-undercount gap. `return 0` preserves the +# "ok/bad cannot fail" property the ` && ok ... || bad ...` idiom relies on. +ok() { PASS=$((PASS+1)); TAPN=$((TAPN+1)); echo "PASS: $*" + [ "$GCL_TAP" = 1 ] && echo "ok $TAPN - $*"; return 0; } +bad() { FAIL=$((FAIL+1)); TAPN=$((TAPN+1)); echo "FAIL: $*" + [ "$GCL_TAP" = 1 ] && echo "not ok $TAPN - $*"; return 0; } # Backdate a path's mtime by $2 seconds — the lock's staleness clock is the # lock FILE's own mtime (stamped by the creating write), so this is how a @@ -2175,6 +2194,8 @@ rm -f "$LOCK" "$LOCK.next" # Test 32, the steal-path lane (F2 — rename-over won, read-back wrong) by # Test 32b. 
+DONE=1 echo echo "==== RESULT: $PASS passed, $FAIL failed (fan-out: $GCL_MODE) ====" +[ "$GCL_TAP" = 1 ] && echo "1..$TAPN" [ "$FAIL" = 0 ] From dbecc0201c1ec1cb31478d717089b0d228cae500 Mon Sep 17 00:00:00 2001 From: Ben Toner Date: Thu, 18 Jun 2026 01:00:07 +1000 Subject: [PATCH 29/76] Bucket 3: documentation edits (envelope, single-clock, network-FS, upgrade-both) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit docs/git-commit-lock.md: add the operating-envelope statement (correctness is load-independent; the wall-clock bounds are best-effort and scale with scheduling) and the single-time-source note (a local clock jump is correctness-safe, degrading to the detected exit-98 lane), both beside the mtime-clock caveat; cross-ref guarantees.md. README.md: surface the network/sync-FS boundary in "How it works" (exclusion may silently fail off a local FS), and add the "upgrade both implementations together" deployment note (a mixed-version tree degrades prevention to detection). C-misc (the optional case-insensitive-FS / mixed-version one-liners) skipped as low-value — the design doc already covers mixed-version. Plan changelog updated: Phase 3 step 1 done; the Bucket 8 item-2 selector is deferred to bundle with item 3 (revised phasing recorded). Co-Authored-By: Claude Opus 4.8 (1M context) --- .plans/2026-06-17-ci-stress-phase2-build-plan.md | 14 ++++++++++++-- README.md | 12 +++++++++++- docs/git-commit-lock.md | 16 ++++++++++++++++ 3 files changed, 39 insertions(+), 3 deletions(-) diff --git a/.plans/2026-06-17-ci-stress-phase2-build-plan.md b/.plans/2026-06-17-ci-stress-phase2-build-plan.md index cb0e1a1..69b3bb6 100644 --- a/.plans/2026-06-17-ci-stress-phase2-build-plan.md +++ b/.plans/2026-06-17-ci-stress-phase2-build-plan.md @@ -420,5 +420,15 @@ Workflow once the final test count is known (plan D-e) — likely a Workflow for lean Workflow for the steering batch, hand-run the CI/doc edits.) 
- Anything else needing a call is surfaced inline in the integrated sections. -## Changelog -(empty — Phase 2 planning; implementation changelog starts in Phase 3.) +## Changelog (Phase 3 implementation) +- **Step 1 (commit `3789be9`) — Bucket 8 item 1 done.** TAP + `1..N` + the + `DONE`/`finish` undercount sentinel in all three suites. Unit validated locally + (220/220 REDUCED + matching plan line, exit 0, sentinel does not false-fire); + interop/integration syntax-checked, full runs via CI. +- **Deviation — defer Bucket 8 item 2 (the `GCL_TEST_ONLY` selector).** Wrapping 43 + blocks in `if section …; then … fi` is a large, boundary-sensitive change whose only + benefit is per-test iteration speed; for this batch the steering tests are validated + by a full-suite run, so it doesn't justify front-loading its risk. Bundled with item 3 + (`_harness.sh` extraction — also a large harness change) into one validated + harness-restructure step near the end. **Revised phasing: 8.1 → 3 → 4 → 2A → 2B → + (8.2 + 8.3 together) → 6.** diff --git a/README.md b/README.md index 5bebc3a..9c7d595 100644 --- a/README.md +++ b/README.md @@ -57,7 +57,11 @@ atomic create-or-fail open (`O_CREAT|O_EXCL` / `FileMode.CreateNew`) — atomic on local POSIX filesystems and NTFS alike, with no dependency on `flock` — whose content is the holder's unique token. Every worktree has its own git dir, so independent worktrees get independent locks, while all agents sharing -one checkout contend on the same lock. The lock is deliberately a stealable +one checkout contend on the same lock. The protocol's correctness rests on these +operations being atomic, which holds on local filesystems (ext4, APFS, NTFS, and +kin) but **not** on network or sync-backed storage — NFS, SMB shares, +Dropbox/OneDrive-synced directories — where exclusion may silently fail. Keep the +repo (and so its `.git/`) on a local disk. 
The lock is deliberately a stealable **lease**, not a kernel lock: in unattended agent fleets a hung-but-alive holder is at least as common as a crashed one, and a lock that can't be taken from a stuck holder halts the whole run — while a rare collision costs little @@ -94,6 +98,12 @@ against each other on all three OSes — not as platform support, but because two independent implementations hammering one lock is cheap adversarial verification of the protocol. +**Upgrade both implementations together.** Older releases stole with an +unserialized move-aside instead of the claim protocol, so the +no-displacement-during-recovery guarantee holds only when every party in a tree +runs a current version; a mixed-version tree degrades that prevention to +detection (exit 98) and can leave `.dead.*` files current versions don't clean. + ## Suggested agent instructions Agents only benefit from the lock if their instructions tell them to use it. diff --git a/docs/git-commit-lock.md b/docs/git-commit-lock.md index 828cfc4..f47fbb8 100644 --- a/docs/git-commit-lock.md +++ b/docs/git-commit-lock.md @@ -292,6 +292,22 @@ settles in milliseconds. The same floor governs the claim file's ageout: a sub-floor claim mtime reads as "just created", never "ancient — clear". +**The operating envelope — correctness is load-independent; latency is not.** +Exclusion, no-silent-loss, and eventual recovery rest on atomic create/rename +plus per-attempt tokens, and hold under any load. The wall-clock bounds — +recovery latency (≈ `STALE_SECS` + poll cadence), the `MAX_WAIT` timeout, and the +~1.3 s read-retry ladder — are best-effort and scale with scheduling: under CPU +oversubscription or a slow filesystem they stretch, but the protocol still +recovers and never loses an update. (For the precise guarantee/scope split, see +[`guarantees.md`](guarantees.md).) 
+ +**One time source.** The tool assumes a single clock — single-host use (the +common case: all contenders share one checkout, hence one machine and one clock), +or a shared filesystem with one server clock. A local clock jump is +correctness-safe: a forward jump can make a live lock look stale and be +prematurely stolen, but that degrades to the detected exit-98 lane (the robbed +holder's release fails loudly), never a silent double-commit. + ## The PowerShell port (`git-commit-lock.ps1`) Some agents (Codex on Windows, for example) run their commands in From 750be3cdb9d534da098d99036c515b350ab1f634 Mon Sep 17 00:00:00 2001 From: Ben Toner Date: Thu, 18 Jun 2026 01:06:53 +1000 Subject: [PATCH 30/76] Bucket 4: correctness/envelope test split (D-c; GCL_ENVELOPE_TIER) Add TAP-aware ok_envelope/bad_envelope to the unit suite: default 'strict' is identical to ok/bad; GCL_ENVELOPE_TIER=relax downgrades an envelope FAIL to a WARN that never reds the run (ENV_WARN counted + reported in the summary). Reassign the three load-sensitive wall-clock/poll-count assertions to the envelope tier, keeping every neighbouring correctness assertion strict: - Test 21: recovery <=20s - Test 22a: "warning fired at all" (n>=1) -> envelope; the warn-once dedup (n<=1) and the type-naming stay STRICT (names-type guarded on a warning having fired) - Test 29: >=2 CLAIM lines Validated: strict (default) -> 220/220, 0 envelope warnings, 1..220 consistent, the 3 sites report PASS[env]. Downgrade logic checked deterministically (relax->WARN, strict->FAIL). No product (git-commit-lock.sh) change. Phase 3, step 3 (Bucket 4). Required CI will set strict; nightly/deep set relax. 
Co-Authored-By: Claude Opus 4.8 (1M context) --- tests/git-commit-lock.test.sh | 40 +++++++++++++++++++++++++++++------ 1 file changed, 33 insertions(+), 7 deletions(-) diff --git a/tests/git-commit-lock.test.sh b/tests/git-commit-lock.test.sh index 5491768..eb8b662 100755 --- a/tests/git-commit-lock.test.sh +++ b/tests/git-commit-lock.test.sh @@ -76,6 +76,26 @@ ok() { PASS=$((PASS+1)); TAPN=$((TAPN+1)); echo "PASS: $*" bad() { FAIL=$((FAIL+1)); TAPN=$((TAPN+1)); echo "FAIL: $*" [ "$GCL_TAP" = 1 ] && echo "not ok $TAPN - $*"; return 0; } +# Envelope-tier assertions (Bucket 4 / decision D-c). A wall-clock or poll-count +# bound is a Tier-2 (best-effort latency) property, NOT a correctness one (see +# guarantees.md BE-1). In the default 'strict' tier these behave exactly like +# ok/bad. Under GCL_ENVELOPE_TIER=relax (nightly/deep stress runs) an envelope FAIL +# becomes a WARN that does NOT increment FAIL — so an oversubscribed runner can't +# turn a latency miss into a red — while every CORRECTNESS assertion keeps ok/bad +# and stays hard in both tiers. TAP-aware so envelope assertions still count toward 1..N. +ENVELOPE_TIER="${GCL_ENVELOPE_TIER:-strict}" +ENV_WARN=0 +ok_envelope() { PASS=$((PASS+1)); TAPN=$((TAPN+1)); echo "PASS[env]: $*" + [ "$GCL_TAP" = 1 ] && echo "ok $TAPN - $*"; return 0; } +bad_envelope() { + if [ "$ENVELOPE_TIER" = relax ]; then + ENV_WARN=$((ENV_WARN+1)); TAPN=$((TAPN+1)); echo "WARN[env-relaxed]: $*" + [ "$GCL_TAP" = 1 ] && echo "ok $TAPN - $* # env-relaxed" + else + FAIL=$((FAIL+1)); TAPN=$((TAPN+1)); echo "FAIL: $*" + [ "$GCL_TAP" = 1 ] && echo "not ok $TAPN - $*" + fi; return 0; } + # Backdate a path's mtime by $2 seconds — the lock's staleness clock is the # lock FILE's own mtime (stamped by the creating write), so this is how a # test fakes a stale lock. 
Portable: BSD touch has no `-d @epoch`, so convert @@ -1160,7 +1180,7 @@ t21_t1=$(date +%s) [ "$rc" = 0 ] && ok "waiter recovered through a crashed claimant's claim (rc 0)" || bad "rc=$rc behind a crashed claim" grep -q "CLAIM-STALE-CLEARED" "$LOG" && ok "aged claim cleared (CLAIM-STALE-CLEARED logged, with age)" || bad "no CLAIM-STALE-CLEARED entry" grep -q "STOLE-BY-CLAIM" "$LOG" && ok "steal completed after the clear" || bad "no STOLE-BY-CLAIM after clearing the crashed claim" -[ $((t21_t1 - t21_t0)) -le 20 ] && ok "recovery latency bounded ($((t21_t1 - t21_t0))s)" || bad "recovery took $((t21_t1 - t21_t0))s (>20s)" +[ $((t21_t1 - t21_t0)) -le 20 ] && ok_envelope "recovery latency bounded ($((t21_t1 - t21_t0))s)" || bad_envelope "recovery took $((t21_t1 - t21_t0))s (>20s)" [ -e "$LOCK.next" ] && bad "claim leftover after recovery" || ok "claim path clean after recovery" # (b) an EMPTY claim file (claimant died between create and write): same lane. LOCK="$WORK/ccempty.lock"; LOG="$WORK/ccempty.log"; : > "$LOG" @@ -1183,10 +1203,16 @@ AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" AGENT_LOCK_STALE_SECS=1 \ bash "$LIB" run -- bash -c 'true' 2> "$WORK/t22a.err"; rc=$? [ "$rc" = 97 ] && ok "dir at claim path: steals blocked, waiter timed out (97)" || bad "dir at claim path: rc=$rc (want 97)" [ -f "$LOCK.next/sub/f" ] && ok "directory at claim path untouched" || bad "directory at claim path was damaged!" 
-grep -q "is not a claim file" "$WORK/t22a.err" && ok "loud claim-path config warning on stderr" || bad "no claim-path config warning" -grep -q "it is a directory" "$WORK/t22a.err" && ok "claim warning names the detected type (directory)" || bad "claim warning does not name the type" n="$(grep -c "is not a claim file" "$WORK/t22a.err")" -[ "$n" = 1 ] && ok "claim-path warning fired exactly once (got $n)" || bad "claim-path warning fired $n times (want 1)" +# "warning fired at all" is timing-dependent (the two-poll confirmation needs poll +# headroom before MAX_WAIT, which an oversubscribed runner can starve) -> envelope. +# The warn-once dedup (never >1) and the type-naming are CORRECTNESS -> strict (the +# latter only asserted when a warning actually fired). +[ "$n" -ge 1 ] && ok_envelope "claim-path config warning fired (got $n)" || bad_envelope "no claim-path config warning (n=$n)" +[ "$n" -le 1 ] && ok "claim-path warning not duplicated (n=$n)" || bad "claim-path warning fired $n times (warn-once broken)" +if [ "$n" -ge 1 ]; then + grep -q "it is a directory" "$WORK/t22a.err" && ok "claim warning names the detected type (directory)" || bad "claim warning does not name the type" +fi grep -q "STOLE-BY-CLAIM" "$LOG" && bad "stole despite a squatted claim path" || ok "no steal through a squatted claim path" [ -f "$LOCK" ] && ok "stale lock left in place (cannot be stolen safely)" || bad "lock vanished behind a squatted claim path" # (b) a free LOCK path is UNaffected by claim-path junk: normal acquire works. @@ -1547,8 +1573,8 @@ AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" AGENT_LOCK_STALE_SECS=1 \ ' _ "$LIB" 2>/dev/null; rc=$? 
[ "$rc" = 97 ] && ok "blocked-steal waiter honoured MAX_WAIT (97)" || bad "blocked-steal rc=$rc (want 97)" nclaim="$(grep -c "] CLAIM " "$LOG")" -[ "$nclaim" -ge 2 ] && ok "claim re-created on later attempts (x$nclaim) — deleted immediately, no ageout penalty" \ - || bad "only $nclaim CLAIM line(s) — the failed steal's claim was left to age out (60s-class penalty)" +[ "$nclaim" -ge 2 ] && ok_envelope "claim re-created on later attempts (x$nclaim) — deleted immediately, no ageout penalty" \ + || bad_envelope "only $nclaim CLAIM line(s) — the failed steal's claim was left to age out (60s-class penalty)" grep -q "steal FAILED" "$LOG" && ok "blocked rename logged (damped steal FAILED)" || bad "no steal FAILED log line" [ -e "$LOCK.next" ] && bad "claim leftover after the blocked steal attempts" || ok "no claim leftover at exit" [ -f "$LOCK" ] && ok "squatted lock left in place" || bad "lock vanished in the blocked lane" @@ -2196,6 +2222,6 @@ rm -f "$LOCK" "$LOCK.next" DONE=1 echo -echo "==== RESULT: $PASS passed, $FAIL failed (fan-out: $GCL_MODE) ====" +echo "==== RESULT: $PASS passed, $FAIL failed, $ENV_WARN envelope warning(s) (fan-out: $GCL_MODE) ====" [ "$GCL_TAP" = 1 ] && echo "1..$TAPN" [ "$FAIL" = 0 ] From cbc1eca65bb5db3ac5d6a38cb20833530afb8220 Mon Sep 17 00:00:00 2001 From: Ben Toner Date: Thu, 18 Jun 2026 01:27:50 +1000 Subject: [PATCH 31/76] Bucket 2A wave 1: steering tests 37-40 (rename-refused, step-3.3, foreign-recheck, exec/H4) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Four new deterministic-steering unit tests closing high-value Tier-A gaps from steering-coverage.md (each drafted + self-validated by a sub-agent against a faithful harness, then re-validated together by me + the full suite): - Test 37 (A1): CLAIM-ABORT (rename-refused) — a directory appears at the lock path mid-steal; the only acquire/steal VERDICT branch previously untested. 
- Test 38 (A2): the step-3.3 pre-rename re-verify abort lane (kcov hits=0); a call-counter shadow proves the steal got past step-2 to the 3.3 position. - Test 39 (A3): the foreign claim-recheck branch (kcov hits=0) — rival's claim left intact, discovery read, no false 98 (mutation-checked: 6 FAILs against a broken branch). - Test 40 (A4): the exec-bypass / OOS-5 no-silent-loss boundary — exec in the lock-holding shell skips release (lock left, no RELEASED); exec in a child (run -- bash -c 'exec') does NOT; plus the displaced-holder silent-loss case. Full unit suite: 259 passed, 0 failed, 1..259 consistent (REDUCED). No product change. Tests 41-47 (A5-A11) land in waves 2-3. Co-Authored-By: Claude Opus 4.8 (1M context) --- tests/git-commit-lock.test.sh | 289 ++++++++++++++++++++++++++++++++++ 1 file changed, 289 insertions(+) diff --git a/tests/git-commit-lock.test.sh b/tests/git-commit-lock.test.sh index eb8b662..c44f8ae 100755 --- a/tests/git-commit-lock.test.sh +++ b/tests/git-commit-lock.test.sh @@ -2208,6 +2208,295 @@ grep -q "resolved tok=tok.leak.t36.2" "$LOG" && ok "conclusive resolution logged || bad "no resolution log line for the conclusive drop" rm -f "$LOCK" "$LOCK.next" +echo "== Test 37: rename-refused — a directory appearing at the lock path mid-steal aborts the steal, no false hold ==" +# The only acquire/steal VERDICT branch with no test: a NON-regular object (a +# directory) appears AT the lock path between the claimant's final re-verify +# (step 3.3, sees a stale FILE) and its rename-over, so the rename is refused +# with the lock path occupied by a non-file. The claimant must classify this +# as rename-refused (non-file at the lock path), delete its claim, take NO +# hold, and re-poll to MAX_WAIT. 
Steered deterministically by shadowing mv: +# the claim->lock rename (the `.next` move) is intercepted to swap the stale +# lock FILE for a DIRECTORY at the lock path, then the real `mv -T` runs and +# fails NATURALLY (mv refuses to overwrite a directory with a non-directory) — +# exactly the wrong-type rename lane. The verifies don't call mv, so the lock +# reads as a stale file through step 3.3; only the rename sees the directory. +# Mutation check: an implementation that mis-classifies the refused rename +# (e.g. treats it as blocked, or proceeds to STOLE-BY-CLAIM) fails the +# no-false-hold / rename-refused assertions below. +LOCK="$WORK/renref.lock"; LOG="$WORK/renref.log"; : > "$LOG" +fabricate_lock "$LOCK" "tok.ghost.t37" "pid=9 host=ghost"; backdate "$LOCK" 9999 +AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" AGENT_LOCK_STALE_SECS=1 \ + AGENT_LOCK_CLAIM_STALE_SECS=600 AGENT_LOCK_POLL_SECS=0.2 AGENT_LOCK_MAX_WAIT=3 \ + bash -c ' + source "$1" || exit 70 + # Shadow mv: on the claim->lock rename (the only mv touching ".next"), + # replace the stale lock file with a directory, then run the real mv -T, + # which refuses to overwrite a directory with a non-directory. The mv -T + # capability probe inside _lock_rename_over operates on its own temp paths + # (never ".next"), so it is unaffected. + mv() { + case "$*" in + *".next"*) + command rm -f -- "$AGENT_LOCK_PATH" 2>/dev/null + command mkdir -- "$AGENT_LOCK_PATH" 2>/dev/null + ;; + esac + command mv "$@" + } + lock_acquire + exit $? + ' _ "$LIB" 2>/dev/null; rc=$? 
+[ "$rc" = 97 ] && ok "rename-refused waiter honoured MAX_WAIT (97), never falsely held" \ + || bad "rename-refused rc=$rc (want 97 — a false hold would exit 0)" +grep -q "CLAIM-ABORT (rename-refused)" "$LOG" \ + && ok "CLAIM-ABORT (rename-refused) logged — the wrong-type rename branch was hit" \ + || bad "no CLAIM-ABORT (rename-refused) — branch not exercised" +grep -q "non-file at the lock path" "$LOG" \ + && ok "rename refusal classified as non-file at the lock path" \ + || bad "missing 'non-file at the lock path' classification wording" +grep -q "STOLE-BY-CLAIM" "$LOG" \ + && bad "spurious STOLE-BY-CLAIM — the steal was claimed despite the refused rename" \ + || ok "no STOLE-BY-CLAIM (no false steal of the directory-occupied path)" +grep -q "DISCOVERY-HOLD" "$LOG" \ + && bad "spurious discovery-HOLD — the victim wrongly believed it acquired" \ + || ok "no spurious discovery-HOLD — ownership-discovery read found no hold" +grep -q "acquire verification FAILED" "$LOG" \ + && bad "read-back path entered — the rename was treated as having succeeded" \ + || ok "rename treated as refused, not as a completed-then-unverified steal" +[ -e "$LOCK.next" ] \ + && bad "claim leftover (\$LOCK.next) after the rename-refused abort" \ + || ok "claim file cleaned up — no leftover \$LOCK.next" +[ -d "$LOCK" ] \ + && ok "directory left in place at the lock path (never overwritten)" \ + || bad "lock path is no longer the squatting directory" +rm -rf "$LOCK" "$LOCK.next" + +echo "== Test 38: step-3.3 pre-rename re-verify abort — claim cleaned, discovery, no false hold ==" +# The step-2 re-verify (sh:1075) and the step-3.3 re-verify immediately before +# the rename (sh:1149) are near-identical abort lanes; Test 23/27 exercise the +# step-2 lane only, leaving 3.3 untested. 
Steered with a CALL-COUNTER on +# _lock_verify_stale: call 1 (step-2) passes through to the REAL verdict +# (stale — the ghost is backdated 9999s), so the steal proceeds PAST step-2; +# call 2 (step-3.3) freshens the lock first, so the real verify reports "fresh" +# and the abort fires SPECIFICALLY at step-3.3. The proof is the log suffix +# "(lock re-verify before rename: fresh)" — step-2's suffix is "after claim", +# so the string can only be the 3.3 lane. STALE_SECS=30 keeps the freshened +# ghost fresh long enough that the post-abort re-poll does NOT re-steal before +# the test removes the lock — so the waiter then acquires via the CREATE race +# (no second STOLE-BY-CLAIM), the same shape as Test 23. +LOCK="$WORK/pr33.lock"; LOG="$WORK/pr33.log"; : > "$LOG" +fabricate_lock "$LOCK" "tok.ghost.t38" "pid=9 host=slow"; backdate "$LOCK" 9999 +AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" AGENT_LOCK_STALE_SECS=30 \ + AGENT_LOCK_CLAIM_STALE_SECS=60 AGENT_LOCK_POLL_SECS=0.1 AGENT_LOCK_MAX_WAIT=30 \ + bash -c ' + source "$1" || exit 70 + clone_fn _lock_verify_stale _vs_orig + N=0 + _lock_verify_stale() { + N=$((N+1)) + # call 1 = step-2: pass through to the real verdict (stale). call 2 = + # step-3.3: freshen the ghost lock so the real verify now sees "fresh", + # tripping the pre-rename abort at the 3.3 position. + if [ "$N" = 2 ]; then command touch -- "$AGENT_LOCK_PATH"; fi + _vs_orig "$@" + } + lock_acquire || exit 72 + lock_release || exit 74 + exit 0 + ' _ "$LIB" 2>/dev/null & +w38=$! +# Proof the 3.3 lane ran AND the steal got PAST step-2: the "before rename" +# suffix is unique to the step-3.3 position (step-2 logs "after claim"). 
+wait_for_grep "lock re-verify before rename: fresh" "$LOG" 20 \ + && ok "step-3.3 pre-rename re-verify aborted (fresh) — got past step-2 to the 3.3 lane" \ + || bad "no step-3.3 'before rename' abort — the 3.3 lane did not run" +grep -q "CLAIM-ABORT (fresh) tok=.* (lock re-verify before rename: fresh)" "$LOG" \ + && ok "CLAIM-ABORT (fresh) logged at the 3.3 position (reason map: fresh)" \ + || bad "no CLAIM-ABORT (fresh) with the 'before rename' suffix" +grep -q "lock re-verify after claim" "$LOG" \ + && bad "the abort fired at step-2 (after claim) — the call-counter let call 1 trip, not the 3.3 lane" \ + || ok "no step-2 (after claim) abort — call 1 passed; only the 3.3 lane aborted" +grep -q "STOLE-BY-CLAIM" "$LOG" \ + && bad "a rename installed the claim — the 3.3 fresh abort did not prevent the steal" \ + || ok "no STOLE-BY-CLAIM — no rename onto the lock from the aborted attempt" +grep -q "DISCOVERY-HOLD" "$LOG" \ + && bad "spurious DISCOVERY-HOLD — the victim wrongly held after the 3.3 abort" \ + || ok "no false hold — the discovery read ran and the victim did not wrongly hold" +[ -e "$LOCK.next" ] && bad "claim leftover immediately after the 3.3 fresh abort" \ + || ok "claim deleted on the 3.3 fresh abort" +rm -f "$LOCK" # the slow holder releases normally +wait "$w38"; rc=$? +[ "$rc" = 0 ] && ok "waiter re-polled past the 3.3 abort, then acquired/released (rc 0)" \ + || bad "waiter rc=$rc after the slow holder released (want 0)" +[ -e "$LOCK.next" ] && bad "claim leftover after the waiter finished" || ok "no claim leftover at exit" +rm -f "$LOCK" "$LOCK.next" + + +echo "== Test 39: foreign claim at recheck — left intact, discovery, no false 98 ==" +# After winning its claim and passing step-2 re-verify, the claimant rechecks +# its OWN claim file before installing. 
The `gone` recheck leg is covered (Test +# 25 recheck-gone / Test 32); the `foreign` leg is NOT: a waiter judged our +# claim abandoned, cleared it, and a RIVAL re-claimed in its place, so the +# recheck reads back a FOREIGN token at the claim path. The claimant must then +# LEAVE the rival's claim alone, run the ownership-discovery read (the lock is +# still the ghost, not ours -> no hold), and back off to re-poll — never a 98 +# (a mere claim recheck carries NO stolen-lease semantics) and never a deletion +# of the rival's claim. +# +# Steering (Test 24/25 idiom): clone _lock_claim_state and, on the FIRST recheck +# only (fire-once via a flag FILE so a subshell can't lose the state), overwrite +# .next with a fresh-mtime foreign "tok.rival.*" token before delegating +# to the original — exactly what a waiter-cleared + rival-reclaimed claim path +# looks like. The original then classifies it `foreign`. CLAIM_STALE is large +# and MAX_WAIT small so the freshly-planted rival claim is never aged out: it +# survives, the create on the next poll loses to it, and the waiter times out +# 97. Mutation check: an implementation that 98'd on a foreign recheck, or that +# deleted/overwrote the rival's claim, or that false-HELD, fails the asserts. +LOCK="$WORK/foreign-recheck.lock"; LOG="$WORK/foreign-recheck.log"; : > "$LOG" +fabricate_lock "$LOCK" "tok.ghost.t39" "pid=9 host=ghost"; backdate "$LOCK" 9999 +SF="$LOCK.steered"; RIVAL="tok.rival.t39.deadbeef"; rm -f "$SF" +AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" AGENT_LOCK_STALE_SECS=1 \ + AGENT_LOCK_CLAIM_STALE_SECS=600 AGENT_LOCK_POLL_SECS=0.1 AGENT_LOCK_MAX_WAIT=3 \ + SF="$SF" RIVAL="$RIVAL" \ + bash -c ' + source "$1" || exit 70 + clone_fn _lock_claim_state _cs_orig + _lock_claim_state() { + # Fire ONCE, at the post-win recheck of OUR claim: a waiter cleared ours + # and a rival re-claimed. Plant the rival token (fresh mtime => not stale) + # then classify via the real function. + if [ ! 
-e "$SF" ] && [ "$1" = "$_LOCK_CLAIM_TOKEN" ] \ + && [ "$_LOCK_CLAIM_PATH" -ef "$AGENT_LOCK_PATH.next" ] 2>/dev/null; then + : > "$SF" + printf "%s\n%s\n" "$RIVAL" "pid=4242 host=rival" > "$_LOCK_CLAIM_PATH" + fi + _cs_orig "$@" + } + lock_acquire + exit $? + ' _ "$LIB" 2>/dev/null; rc=$? + +# The foreign-recheck branch ran (its log line is the proof the leg executed). +grep -q "claim recheck: foreign token '$RIVAL' at the claim" "$LOG" \ + && ok "foreign-recheck branch ran (rival token left at the claim, discovery read)" \ + || bad "no foreign-recheck log line — branch not executed" +# A mere claim recheck must NEVER report a stolen-lease 98. +[ "$rc" = 98 ] && bad "false 98 on a foreign CLAIM recheck (no lease was ever held)" \ + || ok "no false 98 on the foreign claim recheck (rc=$rc)" +# No hold was ever taken: discovery saw the ghost, not our token. +grep -q "DISCOVERY-HOLD" "$LOG" && bad "false discovery-HOLD on the foreign recheck" \ + || ok "no false hold (ownership-discovery read found the ghost, not ours)" +grep -q "STOLE-BY-CLAIM" "$LOG" && bad "claimant stole despite a foreign claim at recheck" \ + || ok "no STOLE-BY-CLAIM — claimant backed off the foreign claim" +# The rival's claim file SURVIVES, unmodified (left intact, never deleted). +[ -e "$LOCK.next" ] && ok "rival's foreign claim file still present (not deleted)" \ + || bad "rival's foreign claim was deleted — must be left alone" +rl1=""; IFS= read -r rl1 < "$LOCK.next" 2>/dev/null || true +[ "$rl1" = "$RIVAL" ] && ok "rival's claim token intact (untouched: $rl1)" \ + || bad "rival's claim token modified (line1=$rl1, want $RIVAL)" +grep -q "CLAIM-STALE-CLEARED" "$LOG" && bad "claimant aged-out/cleared the rival's fresh claim" \ + || ok "rival's fresh claim never cleared as stale" +# Clean outcome: the lock was never acquired; the waiter timed out (97). 
+[ "$rc" = 97 ] && ok "waiter re-polled past the foreign claim and timed out cleanly (97)" \ + || bad "rc=$rc (want 97 — clean re-poll/timeout behind the surviving rival claim)" +# The ghost lock is untouched (never stolen). +gl1=""; IFS= read -r gl1 < "$LOCK" 2>/dev/null || true +[ "$gl1" = "tok.ghost.t39" ] && ok "ghost lock untouched by the foreign-recheck backoff" \ + || bad "ghost lock modified (line1=$gl1)" +rm -f "$LOCK" "$LOCK.next" "$SF" + +echo "== Test 40: exec-bypass boundary — exec in the lock-holding shell skips release (OOS-5); exec in a child does not ==" +# `lock_run` runs the wrapped command vector with `"$@"` IN THE WRAPPER SHELL +# (git-commit-lock.sh), so a command that is itself an `exec` REPLACES the +# lock-holding wrapper process: the trailing `lock_release` AND the EXIT trap +# are both skipped, and the lock is left held with no RELEASED logged. This is +# the one interleaving that can SILENTLY lose an update (guarantees.md OOS-5) — +# this test pins the exact boundary so a future change to the release/trap +# wiring can't quietly widen or close it without a red. + +# (a1) BYPASS: `run -- exec true` — the wrapped command IS an exec, so it +# replaces the wrapper. Release + EXIT trap are skipped: lock LEFT, no RELEASED +# (ACQUIRED proves the hold was taken, so "no RELEASED" means the trap really +# was bypassed, not that nothing ran). +LOCK="$WORK/t40.bypass.lock"; LOG="$WORK/t40.bypass.log"; : > "$LOG" +AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" bash "$LIB" run -- exec true; rc=$? 
+[ "$rc" = 0 ] && ok "run -- exec true exits 0 (the exec'd command's code)" \ + || bad "run -- exec true rc=$rc (want 0)" +grep -q ACQUIRED "$LOG" && ok "run -- exec true did take the lock (ACQUIRED logged)" \ + || bad "run -- exec true: no ACQUIRED — the hold never happened, test is vacuous" +[ -e "$LOCK" ] && ok "run -- exec true LEFT the lock file (release bypassed by exec)" \ + || bad "run -- exec true: lock released — exec did NOT bypass (boundary changed)" +grep -q RELEASED "$LOG" && bad "run -- exec true logged RELEASED — the EXIT trap was NOT skipped (boundary changed)" \ + || ok "run -- exec true logged NO RELEASED (EXIT trap skipped — OOS-5 boundary)" +rm -f "$LOCK" + +# (a2) CONTROL — NO bypass: `run -- bash -c 'exec true'` — the exec replaces the +# CHILD, not the wrapper, so the wrapper releases normally: lock GONE, RELEASED +# logged. The opposite outcome to (a1) is the whole point; assert both so the +# test documents the exact boundary. +LOCK="$WORK/t40.child.lock"; LOG="$WORK/t40.child.log"; : > "$LOG" +AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" bash "$LIB" run -- bash -c 'exec true'; rc=$? +[ "$rc" = 0 ] && ok "run -- bash -c 'exec true' exits 0" \ + || bad "run -- bash -c 'exec true' rc=$rc (want 0)" +[ -e "$LOCK" ] && bad "run -- bash -c 'exec true' LEFT the lock — exec in a child must NOT bypass" \ + || ok "run -- bash -c 'exec true' released the lock (exec in a child does not bypass)" +grep -q RELEASED "$LOG" && ok "run -- bash -c 'exec true' logged RELEASED (the control: release ran)" \ + || bad "run -- bash -c 'exec true' logged NO RELEASED — the control case did not release" +rm -f "$LOCK" + +# (a3) REALISTIC sourced bypass: `lock_acquire; exec true` in a sourcing shell +# (a subshell so it can't take the suite down) — the holder execs away before +# release, leaving the lock held. This is the shape a real caller hits if it +# execs while holding instead of calling lock_release. 
+LOCK="$WORK/t40.sourced.lock"; LOG="$WORK/t40.sourced.log"; : > "$LOG" +( AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" bash -c ' + source "$1" || exit 70 + lock_acquire || exit 72 + exec true + ' _ "$LIB" ); rc=$? +[ "$rc" = 0 ] && ok "sourced lock_acquire; exec true exits 0" \ + || bad "sourced lock_acquire; exec true rc=$rc (want 0)" +[ -e "$LOCK" ] && ok "sourced lock_acquire; exec true LEFT the lock held (release skipped)" \ + || bad "sourced lock_acquire; exec true released the lock — exec did not bypass" +grep -q RELEASED "$LOG" && bad "sourced exec-while-holding logged RELEASED — the trap was not skipped" \ + || ok "sourced exec-while-holding logged NO RELEASED (release + trap skipped)" +rm -f "$LOCK" + +# (b) SILENT-LOSS boundary: a DISPLACED holder that execs a 0-exit is UNWARNED. +# Build a holder H that (sourced) acquires, backdates its OWN lock ancient so a +# contender steals it (H is now displaced — a rival token sits at the path), +# then execs a 0-exit. Because the exec skips BOTH release and the EXIT trap, +# the displacement-detection in lock_release NEVER runs: H exits 0 with no +# WARNING and no 98. This is exactly the documented silent boundary (OOS-5): a +# non-unwinding exit while displaced cannot report that the hold was not +# exclusive. (backdate/epoch_to_stamp are export -f'd by the preamble, so the +# steering shell inherits them.) +LOCK="$WORK/t40.silent.lock"; LOG="$WORK/t40.silent.log"; : > "$LOG" +AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" AGENT_LOCK_STALE_SECS=1 \ + AGENT_LOCK_POLL_SECS=0.1 AGENT_LOCK_MAX_WAIT=10 bash -c ' + source "$1" || exit 70 + lock_acquire || exit 72 # H holds the lock + backdate "$2" 9999 # H'"'"'s own lock now ancient -> instantly stealable + # A contender steals it (separate process) — H is displaced once a rival + # token lands at the path. 
+ AGENT_LOCK_PATH="$2" AGENT_LOCK_LOG="$3" AGENT_LOCK_STALE_SECS=1 \ + AGENT_LOCK_POLL_SECS=0.1 AGENT_LOCK_MAX_WAIT=10 \ + bash "$1" run -- true + exec true # H execs 0 — neither release nor trap runs + ' _ "$LIB" "$LOCK" "$LOG"; rc=$? +[ "$rc" = 0 ] && ok "displaced holder's exec-0 exits 0 (no unwinding ran)" \ + || bad "displaced holder's exec-0 rc=$rc (want 0)" +grep -q "STOLE-BY-CLAIM" "$LOG" \ + && ok "the contender genuinely displaced H (STOLE-BY-CLAIM logged) — H WAS displaced" \ + || bad "no STOLE-BY-CLAIM — H was not actually displaced, the (b) premise is gone" +grep -q "lock LOST" "$LOG" \ + && bad "H logged a 'lock LOST' displacement WARNING — the exec did NOT skip release/trap" \ + || ok "displaced holder's exec-0 emitted NO 'lock LOST' WARNING (silent boundary — OOS-5)" +grep -q "WARNING" "$LOG" \ + && bad "an unexpected WARNING was logged by the displaced exec-0 holder" \ + || ok "displaced holder's exec-0 emitted NO WARNING at all (unwarned silent loss)" +rm -f "$LOCK" + # NOTES (deliberately untested here): # * lock_release's LEFTOVER lane (the unlink blocked persistently) needs a # foreign no-delete-share handle on the lock file — Windows-only, and the From dee154342dbd01efa00e5b43440cfd35c8db7649 Mon Sep 17 00:00:00 2001 From: Ben Toner Date: Thu, 18 Jun 2026 01:42:09 +1000 Subject: [PATCH 32/76] Bucket 2A waves 2-3: steering tests 41-47 (Tier-A coverage complete) Seven more deterministic-steering unit tests (each drafted + self-validated by a sub-agent against a faithful harness, then re-validated together by me + the full suite): - Test 41 (A5): forward clock jump steals a live lock -> detected 98, never silent (E2). - Test 42 (A6): mtime unreadable -> staleness disabled, fail-safe no-steal, warn-once, 97 (E3). Shadows the INNER _lock_stat_mtime (NOT _lock_path_mtime, which emits the warning). - Test 43 (A7): malformed/unreadable lock content at the poll guard -- #18 blank line 1 ("not lock-shaped"), #17 unreadable steal-read ("steal skipped ... 
unreadable") -- never stolen. - Test 44 (A8): socket & device-node wrong-type arms -> refused, 97 (socket arm POSIX/CI-gated; device-node arm runs everywhere via /dev/null, proven non-destructive). - Test 45 (A9): log self-truncates past ~1 MB (rotation), with a sub-threshold negative control. - Test 46 (A10): EXIT while waiting (no hold) -> the no-hold trap arc, no spurious release. kcov-confirmed it flips :1009/:1017/:1018 from hits=0; corrected from a wrong initial recipe (a post-97 exit has the EXIT trap already restored, so it can't reach this arc). - Test 47 (A11): the no-mv-T rename-over fallback (BSD/macOS lane) forced via _LOCK_MVT=0 -> steal still installs; the [ -d ] guard refuses a directory. Lane proven via an mv trace (bare mv vs mv -T). Full unit suite: 311 passed, 0 failed, 1..311 consistent (REDUCED). No product change. Bucket 2A (the 11 Tier-A steering gaps from steering-coverage.md) is complete. Co-Authored-By: Claude Opus 4.8 (1M context) --- tests/git-commit-lock.test.sh | 553 ++++++++++++++++++++++++++++++++++ 1 file changed, 553 insertions(+) diff --git a/tests/git-commit-lock.test.sh b/tests/git-commit-lock.test.sh index c44f8ae..aca7323 100755 --- a/tests/git-commit-lock.test.sh +++ b/tests/git-commit-lock.test.sh @@ -2497,6 +2497,559 @@ grep -q "WARNING" "$LOG" \ || ok "displaced holder's exec-0 emitted NO WARNING at all (unwarned silent loss)" rm -f "$LOCK" +echo "== Test 41: forward clock jump steals a live lock — detected as 98, never silent (E2) ==" +# Staleness is age = now - mtime (git-commit-lock.sh ~:928, ~:1409), where `now` +# is _lock_now. A process whose clock has LEAPED FORWARD computes an inflated age +# for everyone's lock, so it can judge a LIVE, fresh lock ancient and steal it. +# This is correctness-safe but liveness-degraded: it degrades into the already- +# handled robbed-holder lane (Test 4b) — the displaced holder DETECTS the theft +# at release and exits 98 with a loud WARNING; it never silently double-commits. 
+# +# Steering (no real sleep/backdate): holder H acquires and HOLDS a fresh lock on +# a NORMAL clock. Waiter W has _lock_now shadowed to return the real now PLUS a +# large offset (+9999s), so H's just-created lock looks ~9999s old to W and W +# steals it. STALE=100 means the lock is genuinely fresh under a normal clock +# (without the jump W would block, never steal — the jump is what's causal); +# CLAIM_STALE=99999 keeps W's own just-created claim (also judged ~9999s old by +# W's jumped clock) well under the claim-stale window, so W's recheck does not +# self-abort (contested) and the steal proceeds to rename. +LOCK="$WORK/fwdjump.lock"; LOG="$WORK/fwdjump.log"; : > "$LOG"; OUT="$WORK/fwdjump-out"; : > "$OUT" +READY="$WORK/t41.ready"; TDONE="$WORK/t41.thief-done" +# Holder H (sourced, NORMAL clock): create+hold a fresh lock, signal READY, hold +# until told the waiter is done, then release and exit with the release rc. +AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" AGENT_LOCK_STALE_SECS=100 \ + AGENT_LOCK_CLAIM_STALE_SECS=99999 AGENT_LOCK_POLL_SECS=0.1 AGENT_LOCK_MAX_WAIT=120 \ + bash -c ' + source "$1" || exit 70 + lock_acquire || exit 72 + echo h-work >> "$2" + touch "$3" + until [ -e "$4" ]; do sleep 0.05; done + lock_release + exit $? + ' _ "$LIB" "$OUT" "$READY" "$TDONE" & +hpid=$! +wait_for_file "$READY" || bad "T41 holder never signalled ready (lock not held)" +# Waiter W (sourced, clock JUMPED +9999s): _lock_now returns real now + offset, so +# every age it computes is inflated and H's fresh lock reads as ancient. W acquires +# (by stealing) then releases; run in the FOREGROUND so its rc is captured. +AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" AGENT_LOCK_STALE_SECS=100 \ + AGENT_LOCK_CLAIM_STALE_SECS=99999 AGENT_LOCK_POLL_SECS=0.1 AGENT_LOCK_MAX_WAIT=30 \ + bash -c ' + source "$1" || exit 70 + clone_fn _lock_now _now_orig + _lock_now() { echo $(( $(_now_orig) + 9999 )); } + lock_acquire || exit 72 + echo w-work >> "$2" + lock_release + exit $? 
+ ' _ "$LIB" "$OUT" +wpid_rc=$? +touch "$TDONE" +wait "$hpid"; h_rc=$? +# W judged H's live, fresh lock ancient under the jumped clock and stole it. +grep -q "STOLE-BY-CLAIM" "$LOG" \ + && ok "forward-jumped waiter stole a LIVE fresh lock (STOLE-BY-CLAIM)" \ + || bad "no STOLE-BY-CLAIM — jumped waiter did not steal the live lock" +[ "$wpid_rc" = 0 ] && ok "thief (its own fresh hold) released cleanly (rc 0)" \ + || bad "thief rc=$wpid_rc (its own fresh hold should release 0)" +grep -q w-work "$OUT" && ok "thief did its work" || bad "thief work missing" +# The proof: the premature steal was DETECTED, not silent — H exits exactly 98. +[ "$h_rc" = 98 ] && ok "robbed holder detected the premature steal — exits exactly 98" \ + || bad "robbed holder rc=$h_rc (forward-jump steal must degrade to 98, never silent)" +grep -q "WARNING: lock LOST" "$LOG" \ + && ok "robbed holder logged a loud theft WARNING (no silent double-commit)" \ + || bad "no theft WARNING logged for the forward-jump steal" +rm -f "$LOCK" "$LOCK.next" + +echo "== Test 42: mtime unreadable — staleness disabled, fail-safe (no steal), warn-once, 97 (E3) ==" +# §E3: if the lock file's mtime cannot be read AT ALL (every probe fails on a +# PRESENT file), staleness detection is BROKEN. The mtime floor fails closed to +# "fresh": _lock_verify_stale returns state=fresh, so a crashed/stale holder is +# NEVER stolen — recovery is disabled and waiters block to MAX_WAIT (97). The +# tool must say so LOUDLY, exactly once per process. Test 1 only asserts the +# NEGATIVE (the warning must NOT fire under healthy contention); this drives the +# positive lane. +# +# Steering: shadow _lock_stat_mtime — the INNER single-probe (sh:606, runs +# stat/date and prints the epoch) — to return EMPTY for the LOCK path while it +# is PRESENT. We must NOT shadow _lock_path_mtime (sh:629): that is the 3x-retry +# wrapper that EMITS the warn-once, so shadowing it would remove the very +# warning we assert. 
With the inner probe empty on a present file, +# _lock_path_mtime retries 3x, sees the file present-but-unreadable, fires the +# warn-once and sets _LOCK_MTIME="" -> _lock_verify_stale -> fresh -> no steal. +# The shadow returns empty ONLY for the lock path: _lock_stat_mtime is also used +# for the CLAIM file's mtime (sh:1120/1230), which must keep working, and other +# paths fall through to the real probe. +T42_LOCK="$WORK/t42.lock"; T42_LOG="$WORK/t42.log"; T42_ERR="$WORK/t42.err" +: > "$T42_LOG"; : > "$T42_ERR" +# A STALE ghost that WOULD normally be stolen (backdated 9999s, well past STALE): +# the whole point is that it is NOT stolen because its age can't be established. +fabricate_lock "$T42_LOCK" "tok.ghost.t42.99999" "pid=99999 host=ghost" +backdate "$T42_LOCK" 9999 +T42_INNER=' + source "$1" || exit 70 + clone_fn _lock_stat_mtime _sm_orig + # Return EMPTY for the present lock path; defer to the real probe otherwise + # (the claim-file mtime at sh:1120/1230 must stay readable). + _lock_stat_mtime() { + if [ "$1" = "$AGENT_LOCK_PATH" ]; then printf ""; return 0; fi + _sm_orig "$@" + } + lock_acquire; exit $? +' +# Tight timing: small MAX_WAIT so the blocked waiter reaches 97 in ~2-3s. +AGENT_LOCK_PATH="$T42_LOCK" AGENT_LOCK_LOG="$T42_LOG" AGENT_LOCK_STALE_SECS=2 \ + AGENT_LOCK_POLL_SECS=0.1 AGENT_LOCK_MAX_WAIT=2 \ + bash -c "$T42_INNER" _ "$LIB" 2>"$T42_ERR"; t42_rc=$? + +# (1) The fail-safe lane ran: the warn-once line appears. It is logged via +# _lock_log (lock log) AND echoed to stderr; assert either surface. +if grep -q "Staleness detection is BROKEN" "$T42_LOG" "$T42_ERR" 2>/dev/null \ + || grep -q "cannot read the lock file's mtime" "$T42_ERR" 2>/dev/null; then + ok "mtime-unreadable: 'Staleness detection is BROKEN' fail-safe warning fired" +else + bad "mtime-unreadable: no broken-staleness warning (fail-safe lane did not run); err=$(cat "$T42_ERR")" +fi +# (2) NO steal: the stale ghost is NOT stolen and is left in place. 
+if grep -q "STOLE-BY-CLAIM" "$T42_LOG" 2>/dev/null || grep -q "STOLE" "$T42_LOG" 2>/dev/null; then + bad "mtime-unreadable: ghost was STOLEN — staleness should have been disabled" +else + ok "mtime-unreadable: no steal (recovery disabled, ghost not stolen)" +fi +g42="$(head -n 1 -- "$T42_LOCK" 2>/dev/null | tr -d '\r')" +[ "$g42" = "tok.ghost.t42.99999" ] \ + && ok "mtime-unreadable: stale ghost lock left in place (token unchanged)" \ + || bad "mtime-unreadable: ghost lock disturbed (line1=$g42, want tok.ghost.t42.99999)" +# (3) The waiter blocks to MAX_WAIT and exits 97 (recovery disabled). +[ "$t42_rc" = 97 ] \ + && ok "mtime-unreadable: waiter blocked to MAX_WAIT and exited 97" \ + || bad "mtime-unreadable: waiter rc=$t42_rc (want 97 — was the stale ghost stolen?)" +# (4) Warn-once: the broken-staleness warning fires AT MOST once on stderr per process. +t42_warns="$(grep -c "Staleness detection is BROKEN" "$T42_ERR" 2>/dev/null)" || t42_warns="${t42_warns:-0}" +[ "$t42_warns" -le 1 ] \ + && ok "mtime-unreadable: broken-staleness warning fired at most once on stderr ($t42_warns)" \ + || bad "mtime-unreadable: warning repeated ($t42_warns times — warn-once broken)" +rm -f "$T42_LOCK" "$T42_LOCK.next" + +echo "== Test 43: malformed/unreadable lock content at the poll guard — never stolen, warned/skipped ==" +# Two sibling branches of the in-acquire steal CONTENT GUARD (git-commit-lock.sh +# ~:1419-1444), both gated on an already-stale candidate, neither of which the +# torn/empty/tok.-prefixed cases (Tests 17/18) reach: +# (a) #18 — line 1 is NON-EMPTY but BLANK (whitespace/CR only): the trim at +# :1421 reduces it to empty, but the file is NOT empty (`-s` true) and the +# read SUCCEEDED, so it lands in the final `else` -> _lock_warn_nonlock +# "its content is not lock-shaped" (the `is not a lock file` config +# warning). NO steal; waiters reach 97.
+# (b) #17 — the content read FAILS on a present, non-empty regular file (the +# `[ "$rdrc" -ne 0 ]` lane at :1432): logs "steal skipped: stale lock +# content unreadable"; NO steal; waiters reach 97. We can't make a real +# file unreadable on every platform (a chmod-000 file still reads for its +# owner on Windows/Cygwin), so we STEER it: source the lib in-process and +# shadow the `read` builtin to fail ONLY for the inline steal-guard read, +# identified by its direct caller `lock_acquire` (FUNCNAME[1]) — the +# _lock_read_tok / _lock_verify_stale reads delegate to `builtin read`, so +# only the :1420 site is perturbed. + +# (a) #18 — whitespace-only line 1: non-empty, blank, read OK -> never stolen, warned. +LOCK="$WORK/t43blank.lock"; LOG="$WORK/t43blank.log"; : > "$LOG" +printf ' \n' > "$LOCK"; backdate "$LOCK" 9999 # one space + LF: non-empty, blank line 1 +AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" AGENT_LOCK_STALE_SECS=1 \ + AGENT_LOCK_POLL_SECS=0.1 AGENT_LOCK_MAX_WAIT=2 \ + bash "$LIB" run -- bash -c 'true' 2> "$WORK/t43a.err"; rc=$? +[ "$rc" = 97 ] && ok "#18 blank line 1: waiter timed out (97) instead of stealing" \ + || bad "#18 blank line 1: rc=$rc (want 97)" +grep -q "is not a lock file" "$WORK/t43a.err" \ + && ok "#18 config warning fired (line 1 not lock-shaped)" || bad "#18 no config warning for blank line 1" +grep -q "non-lock object at lock path (its content is not lock-shaped)" "$LOG" \ + && ok "#18 log records the non-lock-shaped classification (branch ran)" \ + || bad "#18 missing the non-lock-shaped log line (branch did not run)" +grep -q "STOLE" "$LOG" && bad "#18 blank-content file was STOLEN" || ok "#18 no steal of the blank-content file" +[ -f "$LOCK" ] && ok "#18 blank-content file left in place" || bad "#18 blank-content file was removed" +rm -f "$LOCK" + +# (b) #17 — steal-guard content read FAILS on a present, non-empty file. 
+# Steering shell: source the lib, shadow the `read` builtin to fail ONLY when +# invoked directly by lock_acquire (the inline steal read at sh:1420). The ghost +# is tok.-prefixed and ancient, so absent the shadow it WOULD be stolen — the +# 97 outcome plus the "steal skipped ... unreadable" line prove the failed-read +# lane (not some other refusal) is what blocked the steal. +LOCK="$WORK/t43unread.lock"; LOG="$WORK/t43unread.log"; : > "$LOG" +fabricate_lock "$LOCK" "tok.ghost.t43" "pid=9 host=ghost"; backdate "$LOCK" 9999 +AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" AGENT_LOCK_STALE_SECS=1 \ + AGENT_LOCK_POLL_SECS=0.1 AGENT_LOCK_MAX_WAIT=2 \ + bash -c ' + source "$1" || exit 70 + # Shadow the read builtin; reach the real one via `builtin read`. Fail only + # the steal-guard read (its direct caller is lock_acquire) so the + # _lock_read_tok / _lock_verify_stale reads stay intact. + read() { + if [ "${FUNCNAME[1]:-}" = lock_acquire ]; then return 1; fi + builtin read "$@" + } + lock_acquire || exit 97 + lock_release || exit 74 + exit 0 + ' _ "$LIB" 2> "$WORK/t43b.err"; rc=$? +[ "$rc" = 97 ] && ok "#17 unreadable steal content: waiter timed out (97) instead of stealing" \ + || bad "#17 unreadable steal content: rc=$rc (want 97)" +grep -q "steal skipped: stale lock content unreadable" "$LOG" \ + && ok "#17 log records the skipped steal (unreadable branch ran)" \ + || bad "#17 missing the 'steal skipped ... 
unreadable' log line (branch did not run)" +grep -q "STOLE" "$LOG" && bad "#17 ghost was STOLEN despite the unreadable content read" \ + || ok "#17 no steal while the steal-guard read fails" +[ -f "$LOCK" ] && ok "#17 stale ghost left in place" || bad "#17 stale ghost was removed" +rm -f "$LOCK" + +echo "== Test 44: socket & device-node at the lock path — never stolen/deleted, refused (97) ==" +# The never-steal wrong-type guard (git-commit-lock.sh ~:1557-1567) classifies +# NON-regular objects at the lock path so they are NEVER stolen and NEVER +# deleted: a real config error (a typo'd AGENT_LOCK_PATH, a stray special file) +# must wedge waiters to 97 with a loud one-time config warning, not get +# clobbered. Test 17 covers the directory / symlink / FIFO arms of that +# classifier; this test covers the two remaining arms — the SOCKET (-S) and the +# DEVICE NODE (-b/-c) — both of which name their detected type in the warning. +# For each: rc 97, the object survives unchanged (same type), the warning fires +# naming the type, and nothing is ever stolen. + +# (a) a UNIX-DOMAIN SOCKET at the lock path. Fabricated with a backgrounded +# python3 AF_UNIX bind (the socket inode persists while the process holds it); +# skipped where a real socket can't be made AND classified -S by the running +# shell — notably default Git-Bash on Windows, whose bundled python is a native +# build with no socket.AF_UNIX (probed: bind raises AttributeError, so no inode +# appears). CI's POSIX legs exercise this arm. The listener is reaped by its +# EXACT pid at the end (never by name). +LOCK="$WORK/sock.lock"; LOG="$WORK/sock.log"; : > "$LOG" +SOCKERR="$WORK/sock.py.err"; sock_pid=""; sock_ok=0 +if command -v python3 >/dev/null 2>&1; then + rm -f "$LOCK" + python3 -c 'import socket,sys,time +s=socket.socket(socket.AF_UNIX) +s.bind(sys.argv[1]) +sys.stderr.write("bound\n"); sys.stderr.flush() +time.sleep(30)' "$LOCK" 2> "$SOCKERR" & + sock_pid=$! 
+ # Gate on the socket actually existing AND classifying -S (not just the pid + # being alive): on a no-AF_UNIX build the process exits immediately with no + # inode, so we must positively confirm the object before relying on it. + for _ in $(seq 1 100); do + [ -S "$LOCK" ] && { sock_ok=1; break; } + kill -0 "$sock_pid" 2>/dev/null || break + sleep 0.05 + done +fi +if [ "$sock_ok" = 1 ]; then + AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" AGENT_LOCK_STALE_SECS=1 \ + AGENT_LOCK_POLL_SECS=0.1 AGENT_LOCK_MAX_WAIT=3 \ + bash "$LIB" run -- bash -c 'true' 2> "$WORK/t44a.err"; rc=$? + [ "$rc" = 97 ] && ok "socket at lock path: waiter timed out (97), command never ran" \ + || bad "socket at lock path: rc=$rc (want 97)" + [ -S "$LOCK" ] && ok "socket untouched (never stolen/deleted, still a socket)" \ + || bad "socket at lock path was removed/replaced!" + grep -q "is not a lock file" "$WORK/t44a.err" && ok "loud config warning on stderr (socket)" \ + || bad "no config warning for socket at lock path" + grep -q "it is a socket" "$WORK/t44a.err" && ok "warning names the detected type (socket)" \ + || bad "warning does not name the socket type" + n="$(grep -c "is not a lock file" "$WORK/t44a.err")" + [ "$n" = 1 ] && ok "socket config warning fired exactly once per process (got $n)" \ + || bad "socket config warning fired $n times (want 1)" + grep -q STOLE "$LOG" && bad "socket was STOLEN" || ok "no steal attempted on a socket" +else + echo "note: cannot create a unix-domain socket here (no socket.AF_UNIX / not classified -S) — socket guard not exercised (CI POSIX legs cover it)" +fi +# Reap the listener by ITS exact pid only (bounded wait, then hard-kill of the +# same pid as a last resort) — never by name. Harmless if it already exited. 
+if [ -n "$sock_pid" ]; then + kill "$sock_pid" 2>/dev/null + for _ in $(seq 1 40); do kill -0 "$sock_pid" 2>/dev/null || break; sleep 0.05; done + kill -0 "$sock_pid" 2>/dev/null && kill -9 "$sock_pid" 2>/dev/null + wait "$sock_pid" 2>/dev/null +fi +rm -f "$LOCK" + +# (b) a DEVICE NODE at the lock path. mknod needs root, but /dev/null is a +# character device that always exists, so we point AGENT_LOCK_PATH straight at +# it: the -c arm of the classifier must refuse it. This is SAFE precisely +# because the guard refuses — it is never opened-for-write, stolen, or deleted — +# which the post-run assertion below proves (/dev/null is still a char device). +# Skipped only if /dev/null somehow isn't a char device on this platform. +if [ -c /dev/null ]; then + LOG="$WORK/dev.log"; : > "$LOG" + AGENT_LOCK_PATH="/dev/null" AGENT_LOCK_LOG="$LOG" AGENT_LOCK_STALE_SECS=1 \ + AGENT_LOCK_POLL_SECS=0.1 AGENT_LOCK_MAX_WAIT=3 \ + bash "$LIB" run -- bash -c 'true' 2> "$WORK/t44b.err"; rc=$? + [ "$rc" = 97 ] && ok "device node (/dev/null) at lock path: waiter timed out (97), command never ran" \ + || bad "device node at lock path: rc=$rc (want 97)" + [ -c /dev/null ] && ok "/dev/null untouched (never stolen/deleted, still a char device)" \ + || bad "/dev/null was damaged — the guard must NEVER touch a device node!" 
+ grep -q "is not a lock file" "$WORK/t44b.err" && ok "loud config warning on stderr (device node)" \ + || bad "no config warning for device node at lock path" + grep -q "it is a device node" "$WORK/t44b.err" && ok "warning names the detected type (device node)" \ + || bad "warning does not name the device-node type" + n="$(grep -c "is not a lock file" "$WORK/t44b.err")" + [ "$n" = 1 ] && ok "device-node config warning fired exactly once per process (got $n)" \ + || bad "device-node config warning fired $n times (want 1)" + grep -q STOLE "$LOG" && bad "device node was STOLEN" || ok "no steal attempted on a device node" +else + echo "note: /dev/null is not a char device here — device-node guard not exercised (CI POSIX legs cover it)" +fi + + +echo "== Test 45: log self-truncates past ~1 MB (rotation, not unbounded growth) ==" +# _lock_log starts the log over (not rotate) once it grows past ~1MB: the size +# check at the top of _lock_log truncates the file to empty before the write, +# so a normal log-producing op on an oversized log leaves a small, well-formed +# log carrying only the fresh protocol lines. Pre-fill > 1MB, run one clean +# acquire+release, assert the log SHRANK and the lock still worked. +LOCK="$WORK/t45.lock"; LOG="$WORK/t45.log" +# Pre-fill comfortably above the 1048576-byte (1MB) threshold (~1.2MB of 'x'). +head -c 1200000 /dev/zero | tr '\0' 'x' > "$LOG" +before=$(wc -c < "$LOG") +[ "$before" -gt 1048576 ] && ok "pre-fill exceeds the 1MB threshold (${before} bytes)" \ + || bad "pre-fill not over threshold (${before} bytes)" +AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" bash "$LIB" run -- bash -c 'true'; rc=$? +[ "$rc" = 0 ] && ok "lock op succeeded over an oversized log (rc=0)" \ + || bad "lock op rc=$rc over oversized log (want 0)" +after=$(wc -c < "$LOG") +# Truncation fired iff the log is now far below the threshold (it holds only a +# handful of fresh lines). 
Use 1MB as the boundary: any non-truncation leaves +# it at/above the 1.2MB pre-fill. +[ "$after" -lt 1048576 ] && ok "log shrank below threshold after the op (${before} -> ${after} bytes — truncation fired)" \ + || bad "log did NOT shrink (${before} -> ${after} bytes — truncation never fired)" +# Well-formed: the new log carries the fresh protocol lines, not the old giant +# 'x' content, and records the truncation. +grep -q 'log exceeded 1MB; truncated' "$LOG" && ok "log records the self-truncation notice" \ + || bad "no truncation notice in the restarted log" +grep -q 'ACQUIRED' "$LOG" && grep -q 'RELEASED' "$LOG" \ + && ok "restarted log carries fresh ACQUIRED + RELEASED protocol lines" \ + || bad "restarted log missing fresh protocol lines (ACQUIRED/RELEASED)" +grep -q 'xxxx' "$LOG" && bad "old oversized 'x' content survived into the restarted log" \ + || ok "old oversized content is gone (clean restart, not appended)" +[ -e "$LOCK" ] && bad "lock left held after run" || ok "lock released after the over-threshold run" +rm -f "$LOCK" "$LOG" + +echo "== Test 46: EXIT while waiting (no hold) — no-hold trap arc, no spurious release ==" +# A10 (steering-coverage.md): _lock_on_exit's no-hold arc-end (:1009,1017-1018). +# A sourced waiter, blocked in the wait loop against a LIVE held lock, exits 0 +# while still parked — the EXIT trap is STILL '_lock_on_exit' (the timeout's +# trap-restore has NOT run, because we never time out), so EXIT fires the +# handler on the NO-HOLD path: claim-trap cleanup (no token => no-op), +# leaked-resolve, restore traps. NO release semantics may run (we never held). +# +# Why interposition and not "lock_acquire times out 97 then exit": the 97 +# timeout path itself runs _lock_restore_traps BEFORE returning, so by the time +# the caller exits the EXIT trap is already gone and _lock_on_exit never fires +# (verified: post-97 `trap -p EXIT` is empty).
To exercise the EXIT-while- +# WAITING arc the process must leave the loop via `exit` with the trap still +# armed — so W shadows `sleep` (called once per poll inside the wait loop) to +# park on a marker, then `exit 0` from inside that first poll-sleep. At that +# point _LOCK_HELD=0 and no claim is in flight (the live lock is never stale, so +# no steal/claim was attempted), which is exactly the no-hold arc. +T46_INNER=' + source "$1" || exit 70 + F46=0 + sleep() { + if [ "$F46" = 0 ]; then + F46=1 + command touch "$T46R" # signal: parked in the wait loop + until [ -e "$T46G" ]; do command sleep 0.05; done + # Record the live EXIT trap so the assertions can prove _lock_on_exit + # (not a bare/restored trap) is what fires on the exit below. + trap -p EXIT > "$T46T" + exit 0 # EXIT while waiting, no hold held + fi + command sleep "$@" + } + lock_acquire + echo "REACHED-UNEXPECTED rc=$?" >&2 # the shadowed sleep must exit first +' +LOCK="$WORK/exitwait.lock"; LOG="$WORK/exitwait.log"; : > "$LOG" +HLOG="$WORK/exitwait.h.log"; : > "$HLOG" +T46R="$WORK/t46.ready"; T46G="$WORK/t46.go"; T46T="$WORK/t46.trap" +rm -f "$T46R" "$T46G" "$T46T" "$LOCK" "$LOCK.next" +# H: holder — sourced, takes a FRESH live lock and parks until released. STALE is +# huge so the lock is never judged stealable; W therefore stays a pure waiter. +HR="$WORK/t46.hready"; HG="$WORK/t46.hgo"; rm -f "$HR" "$HG" +HR="$HR" HG="$HG" \ +AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$HLOG" AGENT_LOCK_STALE_SECS=600 \ + AGENT_LOCK_CLAIM_STALE_SECS=600 AGENT_LOCK_POLL_SECS=0.1 AGENT_LOCK_MAX_WAIT=60 \ + bash -c ' + source "$1" || exit 70 + lock_acquire || exit 72 + touch "$HR" + until [ -e "$HG" ]; do sleep 0.05; done + lock_release + ' _ "$LIB" 2>/dev/null & +h46=$! +wait_for_file "$HR" 30 || bad "T46 holder never acquired the lock" +htok=""; IFS= read -r htok < "$LOCK" || true # the live holder's token +# W: the waiter that will exit while parked in the wait loop (no hold). 
+T46R="$T46R" T46G="$T46G" T46T="$T46T" \ +AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" AGENT_LOCK_STALE_SECS=600 \ + AGENT_LOCK_CLAIM_STALE_SECS=600 AGENT_LOCK_POLL_SECS=0.1 AGENT_LOCK_MAX_WAIT=60 \ + bash -c "$T46_INNER" _ "$LIB" 2>/dev/null & +w46=$! +# Gate on W proving it reached the wait-loop poll (its WAITING line is logged, +# and its shadowed sleep touched the ready marker) before releasing it to exit. +wait_for_grep "WAITING for lock" "$LOG" 30 || bad "T46 waiter never logged WAITING" +wait_for_file "$T46R" 30 || bad "T46 waiter never reached its wait-loop poll" +touch "$T46G" +wait "$w46"; rc=$? +# Core assertion: W exited cleanly via the EXIT no-hold arc, with NO release +# semantics — it never held the lock, so a RELEASED or a 98/'lock LOST' would +# mean the handler wrongly ran the holding branch. +[ "$rc" = 0 ] && ok "waiter exited 0 via the EXIT-while-waiting no-hold arc" \ + || bad "T46 waiter rc=$rc (want 0; EXIT trap mishandled the no-hold arc?)" +grep -q RELEASED "$LOG" && bad "spurious RELEASED on the no-hold EXIT arc (release ran without a hold)" \ + || ok "no RELEASED on the no-hold EXIT arc (no release semantics)" +grep -q "lock LOST" "$LOG" && bad "98-classification ran on the no-hold EXIT arc" \ + || ok "no 98 classification on the no-hold EXIT arc" +# The trap that fired was our handler, not a bare/restored one — this is the +# discriminator that the EXIT-WHILE-WAITING arc ran (vs a post-97 exit, where +# the trap is already empty). Mirrors Test 12d's trap-restoration idiom. +grep -q "_lock_on_exit" "$T46T" && ok "EXIT trap still armed as _lock_on_exit at exit (no-hold arc, not post-97)" \ + || bad "EXIT trap was not _lock_on_exit at exit (got: $(cat "$T46T" 2>/dev/null))" +# The waiter left no claim behind (it never claimed — the live lock is not stale). 
+[ -e "$LOCK.next" ] && bad "waiter left a claim file behind on the no-hold EXIT arc" \ + || ok "no leftover claim from the no-hold EXIT waiter" +# H's lock is untouched — still the holder's original token, still held. +l1=""; IFS= read -r l1 < "$LOCK" 2>/dev/null || true +[ -n "$htok" ] && [ "$l1" = "$htok" ] && ok "holder's lock untouched by the dying waiter (token intact)" \ + || bad "holder's lock changed by the dying waiter (was=$htok now=$l1)" +# Release H and confirm it shut down cleanly (no fallout from W's exit). +touch "$HG"; wait "$h46" 2>/dev/null +grep -q "lock LOST" "$HLOG" && bad "holder saw a stolen lease (98) — the waiter's exit disturbed the hold" \ + || ok "holder released its still-held lock cleanly (no 98)" +rm -f "$LOCK" "$LOCK.next" "$T46R" "$T46G" "$T46T" "$HR" "$HG" + +echo "== Test 47: no-mv-T rename-over fallback (BSD/macOS lane) forced via _LOCK_MVT=0 — steal still installs ==" +# _lock_rename_over (git-commit-lock.sh ~:961-979) probes once for GNU `mv -T` +# and caches the verdict in _LOCK_MVT (""=unprobed, 1=supported, 0=not). On +# Linux/MINGW the probe ALWAYS picks `mv -T`, so the no-`-T` fallback lane +# (~:976-977: a last-instant `[ -d "$dst" ]` guard + a bare `mv`) is NEVER +# executed in CI except on a real BSD/macOS runner. Pre-seeding _LOCK_MVT=0 in +# the sourced steal shell BEFORE any acquire makes the `[ -z "$_LOCK_MVT" ]` +# probe short-circuit (the var is already non-empty), forcing the fallback on +# the common leg. Two scenarios: +# (a) a normal steal of a stale ghost under _LOCK_MVT=0 installs the lock via +# the unlink-free bare-`mv` fallback (STOLE-BY-CLAIM, the steal acquires); +# (b) a DIRECTORY squatting the lock path under _LOCK_MVT=0 is refused by the +# fallback's `[ -d ]` last-instant guard (no clobber) — the fallback-path +# analogue of Test 37's `mv -T` natural refusal. 
+# Determinism proof that the fallback truly ran (not GNU `mv -T`): scenario (a) +# shadows `mv` to record, per invocation touching ".next", whether `-T` was +# passed; under _LOCK_MVT=0 the steal's claim->lock rename MUST be a bare `mv` +# (no `-T`). A control run WITHOUT the override is asserted to still steal, so a +# pass cannot come from the override having silently broken the steal entirely. + +# ---- (a) forced-fallback steal of a stale ghost: STOLE-BY-CLAIM via bare mv ---- +LOCK="$WORK/mvt0.lock"; LOG="$WORK/mvt0.log"; : > "$LOG" +MVTRACE="$WORK/mvt0.mvtrace"; : > "$MVTRACE" +fabricate_lock "$LOCK" "tok.ghost.t47" "pid=9 host=ghost"; backdate "$LOCK" 9999 +# Sourced steal shell: pre-seed _LOCK_MVT=0, shadow `mv` to log the flags it was +# called with on the ".next" (claim->lock) rename, then call the real `mv`. +AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" AGENT_LOCK_STALE_SECS=2 \ + AGENT_LOCK_CLAIM_STALE_SECS=600 AGENT_LOCK_POLL_SECS=0.1 AGENT_LOCK_MAX_WAIT=10 \ + bash -c ' + source "$1" || exit 70 + _LOCK_MVT=0 # force the no-mv-T fallback lane + export MVTRACE_PATH="$2" # pass the trace path into mv() via env + mv() { + case "$*" in + *".next"*) printf "%s\n" "$*" >> "$MVTRACE_PATH" ;; # record claim->lock rename flags + esac + command mv "$@" + } + lock_acquire || exit 72 + lock_release || exit 74 + exit 0 + ' _ "$LIB" "$MVTRACE" 2>/dev/null; rc=$? 
+[ "$rc" = 0 ] && ok "T47(a): forced-fallback steal acquired+released rc 0 (_LOCK_MVT=0)" \ + || bad "T47(a): forced-fallback steal rc=$rc (want 0)" +grep -q "STOLE-BY-CLAIM" "$LOG" \ + && ok "T47(a): stale ghost stolen via the no-mv-T fallback (STOLE-BY-CLAIM logged)" \ + || bad "T47(a): no STOLE-BY-CLAIM under _LOCK_MVT=0 — fallback did not install the lock" +grep -q "ACQUIRED" "$LOG" && grep -q "RELEASED" "$LOG" \ + && ok "T47(a): fallback steal produced a clean ACQUIRED/RELEASED pair" \ + || bad "T47(a): missing ACQUIRED/RELEASED after the fallback steal" +# The mv trace proves the fallback lane (bare mv, no -T) actually carried the +# claim->lock rename — the whole point of forcing _LOCK_MVT=0. +[ -s "$MVTRACE" ] \ + && ok "T47(a): claim->lock rename went through the shadowed mv (trace non-empty)" \ + || bad "T47(a): no .next rename recorded — the steal did not rename-over as expected" +if grep -q -- '-T' "$MVTRACE"; then + bad "T47(a): claim->lock rename used 'mv -T' — the GNU fast path ran, fallback NOT forced" +else + ok "T47(a): claim->lock rename used a BARE mv (no -T) — the BSD/macOS fallback lane was taken" +fi +{ [ -e "$LOCK" ] || [ -e "$LOCK.next" ]; } \ + && bad "T47(a): leftover lock/claim after the fallback steal+release" \ + || ok "T47(a): clean final state (no lock, no claim) after fallback steal+release" + +# ---- (a-control) same steal WITHOUT the override still succeeds ---- +# Guards against a false pass where _LOCK_MVT=0 silently broke the steal: the +# unmodified library must steal the identical ghost too (here via mv -T). 
+LOCKC="$WORK/mvt0c.lock"; LOGC="$WORK/mvt0c.log"; : > "$LOGC" +fabricate_lock "$LOCKC" "tok.ghost.t47c" "pid=9 host=ghost"; backdate "$LOCKC" 9999 +AGENT_LOCK_PATH="$LOCKC" AGENT_LOCK_LOG="$LOGC" AGENT_LOCK_STALE_SECS=2 \ + AGENT_LOCK_CLAIM_STALE_SECS=600 AGENT_LOCK_POLL_SECS=0.1 AGENT_LOCK_MAX_WAIT=10 \ + bash -c 'source "$1" || exit 70; lock_acquire || exit 72; lock_release || exit 74; exit 0' \ + _ "$LIB" 2>/dev/null; rcc=$? +[ "$rcc" = 0 ] && grep -q "STOLE-BY-CLAIM" "$LOGC" \ + && ok "T47(a-control): unmodified steal of the same ghost also succeeds (override didn't trivially break it)" \ + || bad "T47(a-control): control steal rc=$rcc / no STOLE-BY-CLAIM (the (a) pass may be vacuous)" + +# ---- (b) directory at the lock path under _LOCK_MVT=0: [ -d ] guard refuses ---- +# The fallback's last-instant `[ -d "$dst" ]` guard (sh:976) must refuse to +# rename a file over a directory — Test 37's no-clobber outcome, reached via the +# fallback rather than `mv -T`'s natural directory refusal. Test 37 shadows `mv` +# so the directory appears just before the real `mv -T` refuses it; that timing +# does NOT exercise the fallback's `[ -d ]` because the swap lands AFTER the +# library has already passed line 976. To hit the fallback guard itself we wrap +# `_lock_rename_over`: the wrapper installs the directory and pins _LOCK_MVT=0, +# THEN calls the unmodified original — whose own `[ -d "$dst" ]` check (line 976) +# now sees the directory and returns 1, with NO library `mv`/`mv -T` ever run. +# The verifies (step 3.3) ran before the wrapper, so they saw a stale FILE; the +# directory exists only from the wrapper's first line onward. This is the +# fallback-lane analogue of Test 37's wrong-type refusal. 
+LOCKB="$WORK/mvt0dir.lock"; LOGB="$WORK/mvt0dir.log"; : > "$LOGB" +fabricate_lock "$LOCKB" "tok.ghost.t47b" "pid=9 host=ghost"; backdate "$LOCKB" 9999 +AGENT_LOCK_PATH="$LOCKB" AGENT_LOCK_LOG="$LOGB" AGENT_LOCK_STALE_SECS=1 \ + AGENT_LOCK_CLAIM_STALE_SECS=600 AGENT_LOCK_POLL_SECS=0.2 AGENT_LOCK_MAX_WAIT=3 \ + bash -c ' + source "$1" || exit 70 + clone_fn _lock_rename_over _ro_orig + _lock_rename_over() { + # Land a DIRECTORY at the lock path, then force the fallback lane and run + # the REAL rename-over: its own `[ -d ]` guard (sh:976) must refuse (rc 1). + command rm -f -- "$AGENT_LOCK_PATH" 2>/dev/null + command mkdir -- "$AGENT_LOCK_PATH" 2>/dev/null + _LOCK_MVT=0 + _ro_orig + } + lock_acquire + exit $? + ' _ "$LIB" 2>/dev/null; rcb=$? +[ "$rcb" = 97 ] && ok "T47(b): fallback [ -d ] guard refused; waiter honoured MAX_WAIT (97), no false hold" \ + || bad "T47(b): rc=$rcb (want 97 — a clobber/false hold would differ)" +grep -q "CLAIM-ABORT (rename-refused)" "$LOGB" \ + && ok "T47(b): CLAIM-ABORT (rename-refused) logged — fallback guard hit the wrong-type lane" \ + || bad "T47(b): no CLAIM-ABORT (rename-refused) — fallback guard branch not exercised" +grep -q "non-file at the lock path" "$LOGB" \ + && ok "T47(b): refusal classified as non-file at the lock path" \ + || bad "T47(b): missing 'non-file at the lock path' classification" +grep -q "STOLE-BY-CLAIM" "$LOGB" \ + && bad "T47(b): spurious STOLE-BY-CLAIM — the directory-occupied path was falsely stolen" \ + || ok "T47(b): no STOLE-BY-CLAIM (the [ -d ] guard prevented a false steal)" +[ -d "$LOCKB" ] \ + && ok "T47(b): directory left in place at the lock path (never clobbered by the fallback mv)" \ + || bad "T47(b): lock path no longer the squatting directory — the guard failed to protect it" +[ -e "$LOCKB.next" ] \ + && bad "T47(b): claim leftover (\$LOCK.next) after the fallback rename-refused abort" \ + || ok "T47(b): claim file cleaned up — no leftover \$LOCK.next" +rm -rf "$LOCK" "$LOCK.next" "$LOCKC" 
"$LOCKC.next" "$LOCKB" "$LOCKB.next" + + # NOTES (deliberately untested here): # * lock_release's LEFTOVER lane (the unlink blocked persistently) needs a # foreign no-delete-share handle on the lock file — Windows-only, and the From 3f7bd2372b2de2e3e02544504400dc17d12dfcfb Mon Sep 17 00:00:00 2001 From: Ben Toner Date: Thu, 18 Jun 2026 01:52:28 +1000 Subject: [PATCH 33/76] Bucket 2B: fault-injection tests 48-50 (F4, F2/J1, F1) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Three new fault-injection unit tests (failure-modes §4.5), validated on Windows (Git Bash) + Linux (WSL): - Test 48 (F4): unwritable lock dir (chmod 0555) -> clean 97, command never runs, no false hold. POSIX-only (chmod 0555 is a no-op on NTFS -- skip-with-note on Windows; the POSIX CI legs run it). WSL: 5/5. - Test 49 (F2/J1): a failing log path (AGENT_LOCK_LOG under a regular file -> ENOTDIR) -> the lock still acquires+releases, the log write is swallowed. Portable (no guard), runs everywhere. 4/4 both platforms. - Test 50 (F1): ENOSPC on create/write (a tiny full tmpfs) -> wait then 97, no false hold. Linux + passwordless-sudo only (ulimit -f is a SIGXFSZ trap, not usable) -- skip-with-note otherwise; the Linux CI leg runs it. WSL: 2/2. F3 (FD/inode exhaustion) is document-only -- not deterministically injectable (the create needs ~1 FD), per steering-coverage B4. Full unit suite Windows REDUCED: 315 passed, 0 failed, 1..315. No product change. 
Co-Authored-By: Claude Opus 4.8 (1M context) --- tests/git-commit-lock.test.sh | 79 +++++++++++++++++++++++++++++++++++ 1 file changed, 79 insertions(+) diff --git a/tests/git-commit-lock.test.sh b/tests/git-commit-lock.test.sh index aca7323..e33a98b 100755 --- a/tests/git-commit-lock.test.sh +++ b/tests/git-commit-lock.test.sh @@ -3050,6 +3050,85 @@ grep -q "STOLE-BY-CLAIM" "$LOGB" \ rm -rf "$LOCK" "$LOCK.next" "$LOCKC" "$LOCKC.next" "$LOCKB" "$LOCKB.next" +echo "== Test 48: unwritable lock dir -> clean 97, command never runs, no false hold (F4) ==" +# F4 (failure-modes.md §4.5): a read-only / unwritable lock-dir parent makes the +# O_EXCL create fail every poll, so the waiter times out at 97 — no corruption, no +# false hold, and the wrapped command never runs. POSIX-only: chmod 0555 is a no-op +# for writes on Git-Bash/NTFS (the create would wrongly succeed), so skip-with-note +# on Windows; the Linux/macOS CI legs exercise it. +case "$(uname -s)" in + MINGW*|MSYS*|CYGWIN*) + echo "note: Test 48 skipped on Windows — chmod 0555 does not deny writes on NTFS; the POSIX CI legs cover it" ;; + *) + T48DIR="$WORK/t48.nowrite"; T48LOG="$WORK/t48.log"; mkdir -p "$T48DIR"; : > "$T48LOG" + T48MARK="$WORK/t48.ran"; rm -f "$T48MARK" + chmod 0555 "$T48DIR" + AGENT_LOCK_PATH="$T48DIR/commit.lock" AGENT_LOCK_LOG="$T48LOG" \ + AGENT_LOCK_STALE_SECS=1 AGENT_LOCK_POLL_SECS=0.1 AGENT_LOCK_MAX_WAIT=2 \ + bash "$LIB" run -- bash -c "touch '$T48MARK'" 2> "$WORK/t48.err"; rc=$? + [ "$rc" = 97 ] && ok "F4 unwritable lock dir: waiter timed out (97)" \ + || bad "F4 unwritable lock dir: rc=$rc (want 97)" + [ ! -e "$T48MARK" ] && ok "F4: the wrapped command never ran" \ + || bad "F4: the wrapped command ran despite no lock" + [ ! 
-e "$T48DIR/commit.lock" ] && ok "F4: no lock file created in the unwritable dir" \ + || bad "F4: a lock file appeared in an unwritable dir" + grep -q "WAITING for lock" "$T48LOG" && ok "F4: logged WAITING (the create kept failing)" \ + || bad "F4: no WAITING log" + grep -q "TIMEOUT after" "$T48LOG" && ok "F4: logged the TIMEOUT" || bad "F4: no TIMEOUT log" + chmod 0755 "$T48DIR" 2>/dev/null; rm -rf "$T48DIR" # restore so cleanup() can rm -rf $WORK + ;; +esac + +echo "== Test 49: failing log path -> lock still works, the log write is swallowed (F2/J1) ==" +# F2/J1 (failure-modes.md §4.5): logging is best-effort (every write ends || true). +# Point AGENT_LOCK_LOG under a REGULAR FILE so every append/open fails ENOTDIR — the +# lock must still acquire+release cleanly (rc 0) with the log write swallowed. +# Portable (no chmod/perms). NOTE: bash's redirection-OPEN failure leaks to stderr +# (the ||true is on the write, not the open), so do NOT assert clean stderr; and do +# NOT grep the log (nothing is ever written to it). +T49P="$WORK/t49.notadir"; : > "$T49P" # a regular FILE; using it as a dir -> ENOTDIR +T49LOG="$T49P/x.log" # every open/append under it fails ENOTDIR +T49MARK="$WORK/t49.ran"; rm -f "$T49MARK" +AGENT_LOCK_PATH="$WORK/t49.lock" AGENT_LOCK_LOG="$T49LOG" \ + bash "$LIB" run -- bash -c "touch '$T49MARK'" 2>/dev/null; rc=$? +[ "$rc" = 0 ] && ok "F2/J1 failing log: lock acquired+released, command ran (rc 0)" \ + || bad "F2/J1 failing log: rc=$rc (want 0 — a bad log must not fail the lock)" +[ -e "$T49MARK" ] && ok "F2/J1: the wrapped command ran" \ + || bad "F2/J1: the wrapped command did not run" +[ ! -e "$WORK/t49.lock" ] && ok "F2/J1: lock released/cleaned up despite the failing log" \ + || bad "F2/J1: lock left behind" +[ ! 
-e "$T49LOG" ] && ok "F2/J1: the log write was swallowed (no log file under the non-dir)" \ + || bad "F2/J1: a log file was created under a non-dir" +rm -f "$T49P" "$WORK/t49.lock" + +echo "== Test 50: ENOSPC on lock create/write -> wait then 97, no false hold (F1) ==" +# F1 (failure-modes.md §4.5): a full filesystem makes the create's write fail +# (ENOSPC); the created-but-write-failed file is an empty orphan and the waiter +# times out at 97 — no corruption, no false hold. Real ENOSPC needs a full FS, which +# needs root (a small tmpfs); `ulimit -f` is NOT usable (it raises SIGXFSZ and kills +# the wrapper, the wrong lane). So: Linux + passwordless sudo only; skip-with-note +# otherwise. The Linux CI leg (ubuntu runners have passwordless sudo) exercises it. +if [ "$(uname -s)" = Linux ] && sudo -n true 2>/dev/null; then + T50MNT="$WORK/t50.full"; T50LOG="$WORK/t50.log"; mkdir -p "$T50MNT"; : > "$T50LOG" + T50MARK="$WORK/t50.ran"; rm -f "$T50MARK" + if sudo mount -t tmpfs -o size=64k tmpfs "$T50MNT" 2>/dev/null; then + dd if=/dev/zero of="$T50MNT/fill" bs=1k count=256 2>/dev/null || true # fill to ENOSPC + AGENT_LOCK_PATH="$T50MNT/commit.lock" AGENT_LOCK_LOG="$T50LOG" \ + AGENT_LOCK_STALE_SECS=1 AGENT_LOCK_POLL_SECS=0.1 AGENT_LOCK_MAX_WAIT=2 \ + bash "$LIB" run -- bash -c "touch '$T50MARK'" 2> "$WORK/t50.err"; rc=$? + [ "$rc" = 97 ] && ok "F1 ENOSPC: waiter timed out (97)" \ + || bad "F1 ENOSPC: rc=$rc (want 97)" + [ ! 
-e "$T50MARK" ] && ok "F1: the wrapped command never ran under ENOSPC" \ + || bad "F1: the wrapped command ran despite ENOSPC" + sudo umount "$T50MNT" 2>/dev/null + else + echo "note: Test 50 skipped — could not mount a tmpfs (sudo mount failed); covered where mountable" + fi + rmdir "$T50MNT" 2>/dev/null || true +else + echo "note: Test 50 skipped — ENOSPC injection needs Linux + passwordless sudo (a small tmpfs); the Linux CI leg covers it" +fi + # NOTES (deliberately untested here): # * lock_release's LEFTOVER lane (the unlink blocked persistently) needs a # foreign no-delete-share handle on the lock file — Windows-only, and the From ba443c7b1823feda324a52a3e9616fe5e1c99a93 Mon Sep 17 00:00:00 2001 From: Ben Toner Date: Thu, 18 Jun 2026 01:59:39 +1000 Subject: [PATCH 34/76] Fix Test 37 (rename-refused) portability on macOS/BSD mv Test 37's mv shadow created the squatting directory INSIDE the mv call, which lands AFTER the no-mv-T fallback's [ -d ] guard -- and BSD "mv file dir" moves the file INTO the dir rather than erroring, so the macOS CI leg saw the steal succeed (STOLE-BY-CLAIM) instead of rename-refused (4 FAILs on macos-15; the product is fine). Fix: make the directory appear BEFORE rename-over runs, by wrapping _lock_rename_over (mkdir then delegate to the original) -- refused PORTABLY: GNU "mv -T" refuses to overwrite a directory, and the no-mv-T fallback [ -d ] guard (BSD/macOS) refuses it too. Validated locally on both engine paths (native mv -T and a forced _LOCK_MVT=0 fallback: 8/8 each); Test 47 (A11) already exercises the same wrap+guard and passed on macOS. Full Windows suite: 315 passed, 0 failed. Caught by the cross-platform CI dispatch (run 27701297220). 
Co-Authored-By: Claude Opus 4.8 (1M context) --- tests/git-commit-lock.test.sh | 26 +++++++++++++------------- 1 file changed, 13 insertions(+), 13 deletions(-) diff --git a/tests/git-commit-lock.test.sh b/tests/git-commit-lock.test.sh index e33a98b..56cc7c2 100755 --- a/tests/git-commit-lock.test.sh +++ b/tests/git-commit-lock.test.sh @@ -2229,19 +2229,19 @@ AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" AGENT_LOCK_STALE_SECS=1 \ AGENT_LOCK_CLAIM_STALE_SECS=600 AGENT_LOCK_POLL_SECS=0.2 AGENT_LOCK_MAX_WAIT=3 \ bash -c ' source "$1" || exit 70 - # Shadow mv: on the claim->lock rename (the only mv touching ".next"), - # replace the stale lock file with a directory, then run the real mv -T, - # which refuses to overwrite a directory with a non-directory. The mv -T - # capability probe inside _lock_rename_over operates on its own temp paths - # (never ".next"), so it is unaffected. - mv() { - case "$*" in - *".next"*) - command rm -f -- "$AGENT_LOCK_PATH" 2>/dev/null - command mkdir -- "$AGENT_LOCK_PATH" 2>/dev/null - ;; - esac - command mv "$@" + # Make a DIRECTORY appear at the lock path BEFORE the real rename-over runs, + # by wrapping _lock_rename_over (NOT by shadowing mv). It is refused PORTABLY: + # GNU "mv -T" refuses to overwrite a directory with a non-directory, AND the + # no-mv-T fallback [ -d ] guard (BSD/macOS) refuses it too. A mv shadow that + # mkdirs the dir INSIDE the mv call works only on GNU: it lands AFTER the + # fallback [ -d ] check, and BSD "mv file dir" MOVES the file INTO the dir + # rather than erroring (this failed the macOS CI leg). NB: no apostrophes + # here -- this comment lives inside the bash -c single-quoted steering shell. + clone_fn _lock_rename_over _ro_orig + _lock_rename_over() { + command rm -f -- "$AGENT_LOCK_PATH" 2>/dev/null + command mkdir -- "$AGENT_LOCK_PATH" 2>/dev/null + _ro_orig } lock_acquire exit $? 
From f47185756dfc5dcb4479dc55b3eb849b95a6249b Mon Sep 17 00:00:00 2001 From: Ben Toner Date: Thu, 18 Jun 2026 02:18:39 +1000 Subject: [PATCH 35/76] =?UTF-8?q?Plan:=20REOPEN=20D-d=20(merge-to-main=20s?= =?UTF-8?q?trategy)=20=E2=80=94=20cherry-pick=20vs=20tidy-rebase=20vs=20sq?= =?UTF-8?q?uash?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Ben reopened the merge-to-main mechanism: cherry-picking may not beat tidying up and preserving history. Recorded the alternatives + git facts in Bucket 5 of the guarantees-and-coverage plan, flipped D-d from settled to open, and cross-referenced from the phase2 build plan's "what lands on main" section. Key facts captured: main has not diverged (merge-base == main HEAD), so a cleaned branch can ff-merge; b430d73 is a mixed commit (with-load.sh graduates, CI wiring drops); and Bucket 6 already rewrites the CI workflows, so the final ci-stress tree is main-worthy and the decision is about history, not the tree. Recommendation: (B) tidy-rebase + ff-merge. Still Ben's call; merge is last. Co-Authored-By: Claude Opus 4.8 (1M context) --- ...-ci-stress-guarantees-and-coverage-plan.md | 52 +++++++++++++++---- .../2026-06-17-ci-stress-phase2-build-plan.md | 6 ++- 2 files changed, 47 insertions(+), 11 deletions(-) diff --git a/.plans/2026-06-17-ci-stress-guarantees-and-coverage-plan.md b/.plans/2026-06-17-ci-stress-guarantees-and-coverage-plan.md index 757f601..23ba646 100644 --- a/.plans/2026-06-17-ci-stress-guarantees-and-coverage-plan.md +++ b/.plans/2026-06-17-ci-stress-guarantees-and-coverage-plan.md @@ -74,13 +74,41 @@ Ben's box). don't hard-fail on? (Recommend the latter — it makes the envelope explicit and stops future stress runs re-raising these as "flakes".) 
-### Bucket 5 — Branch hygiene (standing, NOT part of this workflow unless wanted) -- The mergeable commits (the 4 test fixes 58c3741/06c6d8e/51a1753/19a28fd + the docs) vs the - **stress-only, do-not-merge** commits (980856b concurrency tweak, b430d73 load wrapper). - When this lands on `main`, cherry-pick the mergeable set and leave the stress scaffolding. - *Open decision D-d:* do this work on `ci-stress` and cherry-pick later, or branch a clean - `failure-modes` off `main` now? (Recommend: keep working on `ci-stress`; cherry-pick at the - end — the stress wrapper is useful for CI-verifying the new tests under load.) +### Bucket 5 — Merge-to-`main` strategy (**D-d REOPENED 2026-06-18**) +Ben reopened this: cherry-picking may not be the best path — "tidying up and preserving +history" is a live alternative. **Git facts (verified 2026-06-18):** +- **`main` has not diverged** — `merge-base(main, ci-stress) == main HEAD (fa43f30)`. So + ci-stress is strictly **34 commits ahead**, and a cleaned-up branch can **fast-forward** + onto `main` (no merge commit). +- The 34 commits are a mix: genuine product/test/doc work; **pure stress-only scaffolding** + (`980856b` concurrency tweak; `b430d73`'s `tests.yml` load-wiring + raised timeouts — *but + `b430d73` also adds `tests/with-load.sh`, which graduates*, so it is a **mixed** commit); + intermediate **plan / AGENTS.md churn**; and the **`/c` commit+revert pairs** + (`534a007` → `959cca9` → `a5df9d9`). +- **Bucket 6 itself rewrites the CI workflows** (3 new files) and reverts the stress wiring. + So after Bucket 6 lands, **ci-stress's final *tree* is already main-worthy** — the + stress-only commits are a *history* concern, not a tree concern. **The decision is therefore + mostly about what history `main` should carry, not about keeping bad code out of the tree.** + +Options: +- **(A) Cherry-pick a curated subset** onto `main` (the prior plan). 
Surgical, but ~20 + interdependent picks (later commits edit the same test file repeatedly → conflict-prone), + new SHAs disconnected from the branch, and `b430d73` must be split by hand. Drops the + review/decision narrative. +- **(B) Tidy-rebase `ci-stress`, then `--ff-only` merge** ("tidy up + preserve history"). + Interactively rewrite the branch: squash the `/c` commit+revert pairs and the intermediate + plan/changelog churn into their content commits, excise the pure scaffolding (or rely on + Bucket 6 having already removed the wiring from the tree), curate messages; then `git -C +
merge ci-stress --ff-only` lands a clean linear history in one operation. Keeps a + curated narrative; **rewrites history** — gotcha: `rebase.updateRefs=true` moves any branch + pointing into the range, so back up with a **raw SHA/tag, never a branch**. +- **(C) Squash-merge** to one (or a few) curated commit(s). Cleanest `main` log, trivially + excludes scaffolding (final tree only), but discards all granular history. + +*Recommendation:* **(B)** — enabled cleanly by `main` not having diverged; gives a +curated-but-real history (which (C) discards and (A) reconstructs laboriously) and matches +"tidy up and preserve." **Still Ben's call** (it's about `main`'s permanent history); settle it +before the merge step. **Not a blocker for the rest of Phase 3 — the merge is last.** ### Bucket 6 — Principled load-&-matrix testing STRATEGY (Ben "f", 2026-06-17) — RECOMMENDATION DOC, not code The current load injection (`tests/with-load.sh`: N CPU spin-loops + N disk write/fsync/delete @@ -181,8 +209,9 @@ the agreed CI matrix (Bucket 6). Commit incrementally under the commit-lock. **V **Phase 4 — Review.** Review the diff (Claude + Codex); run the full suite via CI **across the agreed matrix** to confirm new tests pass + are non-flaky, the scoped bounds hold, and the -matrix surfaces no new flakes. Iterate to clean. → Ben's final review. Then (D-d) cherry-pick -the mergeable commits to `main`. +matrix surfaces no new flakes. Iterate to clean. → Ben's final review. Then land on `main` +per **D-d** (merge strategy reopened 2026-06-18 — cherry-pick vs tidy-rebase+ff-merge vs +squash; see Bucket 5). ## Decisions (settled 2026-06-17) - **D-a → new `docs/guarantees.md`** (dedicated normative doc). @@ -190,7 +219,10 @@ the mergeable commits to `main`. gaps (#7 wrong-type-mid-steal, #8 Windows blocked-unlink) as a second tier. - **D-c → split the suite** into a strict-correctness tier (always enforced) and a latency/envelope tier (not hard-failed by extreme-stress runs). 
-- **D-d → keep on `ci-stress`**, cherry-pick the mergeable commits to `main` at the end. +- **D-d → REOPENED 2026-06-18** (was: keep on `ci-stress`, cherry-pick mergeable commits at + the end). Work continues on `ci-stress`; the *merge-to-`main` mechanism* is now an **open + decision** — cherry-pick (A) vs tidy-rebase + ff-merge (B, recommended) vs squash (C). See + **Bucket 5** for the analysis. Settle before the merge step (it's the last step). - **D-e → my choice:** hand-run Phases 1-2; decide Phase 3-4 (hand vs Workflow) once the test/matrix count is known. - **"f" → Bucket 6**, above: a considered, first-principles load-&-matrix testing diff --git a/.plans/2026-06-17-ci-stress-phase2-build-plan.md b/.plans/2026-06-17-ci-stress-phase2-build-plan.md index 69b3bb6..a547f8b 100644 --- a/.plans/2026-06-17-ci-stress-phase2-build-plan.md +++ b/.plans/2026-06-17-ci-stress-phase2-build-plan.md @@ -276,7 +276,11 @@ calibrated envelope leg (IO throttling experimental — don't rely on it); emit **load-manifest** artifact (`{kind, R, nproc, achieved-slowdown, tool versions, os/arch, sha}`) uploaded on success too. -**What lands on `main` vs stays scaffolding (refines Bucket 5 / D-d):** +**What lands on `main` vs stays scaffolding (refines Bucket 5 / D-d):** *(This lists the +mergeable **content**; it is mechanism-agnostic. The merge **mechanism** — cherry-pick vs +tidy-rebase+ff-merge vs squash — was **reopened 2026-06-18**; see the guarantees-and-coverage +plan's Bucket 5. 
Note that after this Bucket 6 lands, ci-stress's tree already excludes the +stress wiring, so "what graduates" is mostly a history-curation question, not a tree one.)* - **Graduate to `main`:** the calibrated `with-load.sh` (strip the do-not-merge banner; add ratio calibration + load-manifest); `ok_envelope`/`bad_envelope` + the 3 reassigned assertions; `GCL_TEST_SWEEP` + Axis-A loop (default-off → per-PR identical From 353a2dd1c741964f026dac6c993a85a49d43ea8f Mon Sep 17 00:00:00 2001 From: Ben Toner Date: Thu, 18 Jun 2026 02:21:11 +1000 Subject: [PATCH 36/76] =?UTF-8?q?Plan:=20RESOLVE=20D-d=20=E2=80=94=20mild?= =?UTF-8?q?=20tidy-up,=20merge=20to=20main=20via=20GitHub=20PR=20(extent?= =?UTF-8?q?=20=3D=20Ben's=20call)?= MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Ben resolved the merge-to-main strategy: option (B), a *mild* tidy-up of the ci-stress history, then merge via a GitHub pull request (not a local ff-merge). The extent of the tidy-up is Ben's call — propose the specific commits to drop/squash and get sign-off before any history rewrite. Updated Bucket 5, the D-d decision line, and Phase 4. Merge remains the last step. Co-Authored-By: Claude Opus 4.8 (1M context) --- ...-ci-stress-guarantees-and-coverage-plan.md | 26 +++++++++++++------ 1 file changed, 18 insertions(+), 8 deletions(-) diff --git a/.plans/2026-06-17-ci-stress-guarantees-and-coverage-plan.md b/.plans/2026-06-17-ci-stress-guarantees-and-coverage-plan.md index 23ba646..523118a 100644 --- a/.plans/2026-06-17-ci-stress-guarantees-and-coverage-plan.md +++ b/.plans/2026-06-17-ci-stress-guarantees-and-coverage-plan.md @@ -107,8 +107,17 @@ Options: *Recommendation:* **(B)** — enabled cleanly by `main` not having diverged; gives a curated-but-real history (which (C) discards and (A) reconstructs laboriously) and matches -"tidy up and preserve." **Still Ben's call** (it's about `main`'s permanent history); settle it -before the merge step. 
**Not a blocker for the rest of Phase 3 — the merge is last.** +"tidy up and preserve." + +**RESOLVED (Ben, 2026-06-18): (B) — a *mild* tidy-up, then merge via a GitHub pull request** +(ci-stress → main), **not** a local ff-merge. Refinements: +- **Extent of tidy-up is Ben's call.** Keep it mild. Before any history rewrite, propose the + specific tidy (candidates: drop the pure scaffolding commits `980856b` + `b430d73`'s + required-job wiring; squash the obvious `/c` commit+revert noise `534a007`→`959cca9`→ + `a5df9d9`; leave the rest) and get Ben's sign-off on the extent — do not decide it autonomously. +- **Merge via a GitHub PR**, so the PR's CI is the gate and the merge is reviewable. `main` + has not diverged, so the PR stays clean. +- Still the **last** step of Phase 3/4; not a blocker for the harness/CI work. ### Bucket 6 — Principled load-&-matrix testing STRATEGY (Ben "f", 2026-06-17) — RECOMMENDATION DOC, not code The current load injection (`tests/with-load.sh`: N CPU spin-loops + N disk write/fsync/delete @@ -210,8 +219,8 @@ the agreed CI matrix (Bucket 6). Commit incrementally under the commit-lock. **V **Phase 4 — Review.** Review the diff (Claude + Codex); run the full suite via CI **across the agreed matrix** to confirm new tests pass + are non-flaky, the scoped bounds hold, and the matrix surfaces no new flakes. Iterate to clean. → Ben's final review. Then land on `main` -per **D-d** (merge strategy reopened 2026-06-18 — cherry-pick vs tidy-rebase+ff-merge vs -squash; see Bucket 5). +per **D-d** (resolved 2026-06-18: a mild tidy-up — extent is Ben's call — then merge via a +GitHub PR; see Bucket 5). ## Decisions (settled 2026-06-17) - **D-a → new `docs/guarantees.md`** (dedicated normative doc). @@ -219,10 +228,11 @@ squash; see Bucket 5). gaps (#7 wrong-type-mid-steal, #8 Windows blocked-unlink) as a second tier. 
- **D-c → split the suite** into a strict-correctness tier (always enforced) and a latency/envelope tier (not hard-failed by extreme-stress runs). -- **D-d → REOPENED 2026-06-18** (was: keep on `ci-stress`, cherry-pick mergeable commits at - the end). Work continues on `ci-stress`; the *merge-to-`main` mechanism* is now an **open - decision** — cherry-pick (A) vs tidy-rebase + ff-merge (B, recommended) vs squash (C). See - **Bucket 5** for the analysis. Settle before the merge step (it's the last step). +- **D-d → RESOLVED 2026-06-18: (B) mild tidy-up, then merge via a GitHub PR** (ci-stress → + main), not a local ff-merge. **Extent of tidy-up is Ben's call** — propose the specific + commits to drop/squash and get his sign-off before rewriting history. (Was briefly reopened + 2026-06-18 across cherry-pick (A) / tidy-rebase (B) / squash (C); see **Bucket 5**.) Still the + last step. - **D-e → my choice:** hand-run Phases 1-2; decide Phase 3-4 (hand vs Workflow) once the test/matrix count is known. - **"f" → Bucket 6**, above: a considered, first-principles load-&-matrix testing From 4ee5899869420d7ae72c77f812a695fe2154f498 Mon Sep 17 00:00:00 2001 From: Ben Toner Date: Thu, 18 Jun 2026 03:09:30 +1000 Subject: [PATCH 37/76] Bucket 8 item 2: GCL_TEST_ONLY single-test selector MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Add a regex single-test selector to the suites (Phase 3, harness ergonomics): - Unit + interop: each top-level `== Test N: … ==` header becomes `if section "Test N: …"; then … fi`. `section` echoes the header (visible output unchanged) and runs the block iff GCL_TEST_ONLY is unset/empty or its regex matches the label, bumping SECTIONS_RUN on a match. Each `fi` sits immediately before the next block's `if section`, so trailing per-test cleanup stays inside the block. 
- Zero-match guard: if GCL_TEST_ONLY is set but matched no block (SECTIONS_RUN==0), bail loudly with exit 1 — a typo'd regex can't produce a vacuous PASS=0/FAIL=0 green (same spirit as the undercount sentinel). - Integration suite note-and-ignores GCL_TEST_ONLY: it is one indivisible scenario (Tests 1-3 share a repo + the ALL_IDS audit), so it prints a loud stderr note and runs the whole suite. Default runs are byte-identical (selector logic is gated on GCL_TEST_ONLY). Validated: unit 315/0, interop 141/0, integration 12/0 (reduced, exit 0); sorted PASS/FAIL set identical before/after (volatile token/path fields aside); selector precision proven (regex match, trailing-colon anchoring so 'Test 2:' excludes Test 20/2b); zero-match guard exits 1. shellcheck -S style clean. Co-Authored-By: Claude Opus 4.8 (1M context) --- tests/git-commit-lock.integration.test.sh | 10 ++ tests/git-commit-lock.interop.test.sh | 105 +++++++++--- tests/git-commit-lock.test.sh | 195 +++++++++++++++------- 3 files changed, 226 insertions(+), 84 deletions(-) diff --git a/tests/git-commit-lock.integration.test.sh b/tests/git-commit-lock.integration.test.sh index 579a5da..e7837f4 100644 --- a/tests/git-commit-lock.integration.test.sh +++ b/tests/git-commit-lock.integration.test.sh @@ -114,6 +114,16 @@ echo "fan-out mode: $GCL_MODE (bash swarm ${BROUNDS}x${BN}, mixed swarm ${MSH}+$ # bounded max wait so a wedge fails the suite instead of hanging it. LK_ENV=(AGENT_LOCK_STALE_SECS=300 AGENT_LOCK_POLL_SECS=0.2 AGENT_LOCK_MAX_WAIT=240) +# Note-and-ignore the per-test selector the unit/interop suites honour: this +# suite is ONE indivisible scenario (Tests 1-3 share a single repo + the ALL_IDS +# accumulator, and Test 3 audits Tests 1+2's output), so a per-block selector +# can't apply. If GCL_TEST_ONLY is set, say so loudly on stderr and run the +# whole scenario as normal. 
+GCL_TEST_ONLY="${GCL_TEST_ONLY:-}" +if [ -n "$GCL_TEST_ONLY" ]; then + echo "NOTE: integration suite ignores GCL_TEST_ONLY=\"$GCL_TEST_ONLY\" — Tests 1-3 are one indivisible scenario (shared repo + ALL_IDS audit); running the whole suite." >&2 +fi + # --- scratch repo ------------------------------------------------------------ REPO="$WORK/repo"; OUTD="$WORK/out"; NOHOOKS="$WORK/nohooks" mkdir -p "$REPO" "$OUTD" "$NOHOOKS" diff --git a/tests/git-commit-lock.interop.test.sh b/tests/git-commit-lock.interop.test.sh index a638005..8bda7c7 100644 --- a/tests/git-commit-lock.interop.test.sh +++ b/tests/git-commit-lock.interop.test.sh @@ -67,8 +67,13 @@ WORK="$(pwsh -NoProfile -Command '[IO.Path]::Combine([IO.Path]::GetTempPath(), " WORK="${WORK//\\//}" mkdir -p "$WORK" -PASS=0; FAIL=0; TAPN=0; DONE=0 +PASS=0; FAIL=0; TAPN=0; DONE=0; SECTIONS_RUN=0 GCL_TAP="${GCL_TAP:-0}" # CI sets GCL_TAP=1 for machine-readable TAP13 output +# Single-test selector: GCL_TEST_ONLY= runs only the test blocks whose +# `== Test N: ==` label matches the regex (BASH regex, =~). Unset/empty +# runs every block (default). A typo'd regex that matches nothing bails out +# loudly at the verdict (the zero-match guard) rather than passing vacuously. +GCL_TEST_ONLY="${GCL_TEST_ONLY:-}" # ok/bad are TAP-aware (gated by GCL_TAP so plain dev runs are byte-unchanged) and # bump the running assertion number TAPN. The trailing `1..$TAPN` plan line (emitted # just before the verdict) lets a TAP consumer fail on a short count; together with the @@ -79,6 +84,19 @@ ok() { PASS=$((PASS+1)); TAPN=$((TAPN+1)); echo "PASS: $*" bad() { FAIL=$((FAIL+1)); TAPN=$((TAPN+1)); echo "FAIL: $*" [ "$GCL_TAP" = 1 ] && echo "not ok $TAPN - $*"; return 0; } +# Per-test gate: echoes the block header (so a normal run is byte-unchanged) and +# returns success iff GCL_TEST_ONLY is unset/empty OR its regex matches the label. +# Each top-level `== Test N: ==` block is wrapped `if section "..."; then ... fi`. 
+# Bumps SECTIONS_RUN on a match so the verdict's zero-match guard can catch a +# selector regex that matched nothing. +section() { + echo "== $1 ==" + if [ -z "${GCL_TEST_ONLY:-}" ] || [[ "$1" =~ $GCL_TEST_ONLY ]]; then + SECTIONS_RUN=$((SECTIONS_RUN + 1)); return 0 + fi + return 1 +} + # Failure post-mortems need the logs: keep $WORK when anything failed, and # honour GCL_TEST_PRESERVE_DIR (the CI preserve-logs knob) by copying # the work dir there unconditionally when it is set. @@ -243,7 +261,7 @@ ps_worker() { # $1=lock $2=log $3=holder $4=violations $5=id pwsh -NoProfile -File "$PS1WIN" run "$body" } -echo "== Test 1: mixed pwsh+bash workers, mutual exclusion across implementations ($GCL_MODE width) ==" +if section "Test 1: mixed pwsh+bash workers, mutual exclusion across implementations ($GCL_MODE width)"; then NSH=$T1_NSH; NPS=$T1_NPS; TOT=$((NSH+NPS)) LOCK="$WORK/excl.lock" HOLDER="$WORK/holder"; : > "$HOLDER"; VIOL="$WORK/violations"; : > "$VIOL" @@ -278,8 +296,9 @@ else [ "$st" != 0 ] && { echo " STALE/STEAL log lines:"; grep -E "STALE|STOLE" "$WORK/excl-all.log" | sed 's/^/ /'; } bad "cross-impl exclusion/balance: violations=$nv steals=$st acquired=$a (floor $((TOT/2))) released=$rl leftover=$([ -e "$LOCK" ] && echo yes || echo no)" fi +fi -echo "== Test 2: a bash holder blocks a pwsh waiter (no concurrent hold, no wrongful steal) ==" +if section "Test 2: a bash holder blocks a pwsh waiter (no concurrent hold, no wrongful steal)"; then LOCK="$WORK/b2.lock"; LOG="$WORK/b2.log"; : > "$LOG"; ORDER="$WORK/b2.order"; : > "$ORDER" READY="$WORK/b2.ready"; rm -f "$READY" AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" AGENT_LOCK_STALE_SECS=300 AGENT_LOCK_POLL_SECS=0.1 AGENT_LOCK_MAX_WAIT=60 \ @@ -295,8 +314,9 @@ wait "$holder" got="$(tr '\n' ',' < "$ORDER")" [ "$got" = "sh-start,sh-end,ps-ran," ] && ok "bash-holds / pwsh-waits ordering correct" || bad "ordering wrong: $got" grep -q STOLE "$LOG" && bad "pwsh wrongly STOLE a live bash lock" || ok "pwsh did not 
steal the live bash lock" +fi -echo "== Test 3: a pwsh holder blocks a bash waiter ==" +if section "Test 3: a pwsh holder blocks a bash waiter"; then LOCK="$WORK/b3.lock"; LOG="$WORK/b3.log"; : > "$LOG"; ORDER="$WORK/b3.order"; : > "$ORDER" READY="$WORK/b3.ready"; rm -f "$READY" AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" AGENT_LOCK_STALE_SECS=300 AGENT_LOCK_POLL_SECS=0.1 AGENT_LOCK_MAX_WAIT=60 \ @@ -309,8 +329,9 @@ wait "$holder" got="$(tr '\n' ',' < "$ORDER")" [ "$got" = "ps-start,ps-end,sh-ran," ] && ok "pwsh-holds / bash-waits ordering correct" || bad "ordering wrong: $got" grep -q STOLE "$LOG" && bad "bash wrongly STOLE a live pwsh lock" || ok "bash did not steal the live pwsh lock" +fi -echo "== Test 4: pwsh steals a STALE lock fabricated as bash's (old file mtime) ==" +if section "Test 4: pwsh steals a STALE lock fabricated as bash's (old file mtime)"; then # AGENT_LOCK_MAX_WAIT caps the run so a steal regression fails in ~20s, not 420s. LOCK="$WORK/b4.lock"; LOG="$WORK/b4.log"; : > "$LOG"; MARK="$WORK/b4.mark"; printf '%s' before > "$MARK" fabricate_lock "$LOCK" "tok.sh.ghost.1" "pid=99999 host=ghost" @@ -323,8 +344,9 @@ grep -q STOLE "$LOG" && ok "log records the cross-impl steal" || bad "no STOLE e grep -q "holder=pid=99999 host=ghost" "$LOG" \ && ok "STALE log line carries the holder parsed from line 2 (cross-impl wire format)" \ || bad "holder from line 2 missing in pwsh's STALE log line" +fi -echo "== Test 5: bash steals a STALE lock GENUINELY created by pwsh (holder killed mid-hold) ==" +if section "Test 5: bash steals a STALE lock GENUINELY created by pwsh (holder killed mid-hold)"; then # The stale lock really is pwsh's: a pwsh process dot-sources the lock, acquires (writing # its tok.ps.* token to line 1 and flushing+closing the file), signals ready, then # SELF-EXITS via [Environment]::Exit(0) — the port's documented hard-exit that bypasses @@ -356,8 +378,9 @@ else kill -9 "$hpid" 2>/dev/null; wait "$hpid" 2>/dev/null bad "T5 pwsh holder never 
acquired/signalled ready" fi +fi -echo "== Test 6: deterministic lost-update counter, mixed bash+pwsh increments ($GCL_MODE width) ==" +if section "Test 6: deterministic lost-update counter, mixed bash+pwsh increments ($GCL_MODE width)"; then # The deterministic complement to Test 1's exclusion probe (which has a blind # window and tolerates launch flakiness): every worker MUST launch (strict rc # checks) and the final counter MUST equal the total increments — any lost @@ -403,8 +426,9 @@ cat "$WORK"/cnt-*.log > "$WORK/cnt-all.log" 2>/dev/null || : > "$WORK/cnt-all.lo a="$(grep -c ACQUIRED "$WORK/cnt-all.log")"; rl="$(grep -c RELEASED "$WORK/cnt-all.log")" [ "$a" = "$CTOT" ] && [ "$rl" = "$CTOT" ] && ok "lock logs balanced ($a acquired / $rl released)" || bad "lock logs unbalanced: acquired=$a released=$rl want=$CTOT" [ -e "$LOCK" ] && bad "leftover counter lock" || ok "no leftover lock" +fi -echo "== Test 7: pwsh run propagates the command's exit code (two contending runs in parallel) ==" +if section "Test 7: pwsh run propagates the command's exit code (two contending runs in parallel)"; then LOCK="$WORK/rc.lock"; LOG="$WORK/rc.log"; : > "$LOG" AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" AGENT_LOCK_MAX_WAIT=60 \ pwsh -NoProfile -File "$PS1WIN" run "exit 0" & p0=$! @@ -415,8 +439,9 @@ wait "$p7"; rc7=$? [ "$rc0" = 0 ] && ok "pwsh exit 0 propagated" || bad "pwsh exit 0 not propagated (rc=$rc0)" [ "$rc7" = 7 ] && ok "pwsh exit 7 propagated" || bad "pwsh exit code not propagated ($rc7)" [ -e "$LOCK" ] && bad "lock left held after pwsh run" || ok "lock released after pwsh run (success and failure)" +fi -echo "== Test 7b: ps1 run verdicts for PowerShell-NATIVE failure (a failing cmdlet must not exit 0) ==" +if section "Test 7b: ps1 run verdicts for PowerShell-NATIVE failure (a failing cmdlet must not exit 0)"; then # A cmdlet's non-terminating error never sets LASTEXITCODE, so a runner # consulting only LASTEXITCODE would return 0 for a failed command. 
The # runner must consult the staged script's FINAL '$?' when no nonzero native @@ -454,8 +479,9 @@ AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" AGENT_LOCK_MAX_WAIT=20 \ [ "$rc" = 0 ] && ok "mid-command cmdlet failure + succeeding final statement -> 0 (the documented final-statement limitation)" \ || bad "limitation pin: rc=$rc (want 0 — has the final-statement contract changed?)" [ -e "$LOCK" ] && bad "lock left held after the failing-cmdlet verdict runs" || ok "no leftover lock after the failing-cmdlet verdict runs" +fi -echo "== Test 7c: ps1 CLI help/usage convention — explicit help -> stdout + exit 0; usage errors -> stderr + 96 ==" +if section "Test 7c: ps1 CLI help/usage convention — explicit help -> stdout + exit 0; usage errors -> stderr + 96"; then # (bash's side of the same convention is pinned in the unit suite, Test 7.) for h in --help -h; do pwsh -NoProfile -File "$PS1WIN" "$h" > "$WORK/t7c.out" 2> "$WORK/t7c.err"; rc=$? @@ -475,8 +501,9 @@ pwsh -NoProfile -File "$PS1WIN" > "$WORK/t7c-noargs.out" 2> "$WORK/t7c-noargs.er || bad "ps1 no-args rc=$rc (want 96) stderr-usage=$(grep -c '^usage:' "$WORK/t7c-noargs.err")" pwsh -NoProfile -File "$PS1WIN" frobnicate >/dev/null 2>&1; rc=$? [ "$rc" = 96 ] && ok "ps1 unknown subcommand -> 96" || bad "ps1 unknown subcommand rc=$rc (want 96)" +fi -echo "== Test 8: a ROBBED holder exits 98 — pwsh victim/bash thief, then bash victim/pwsh thief ==" +if section "Test 8: a ROBBED holder exits 98 — pwsh victim/bash thief, then bash victim/pwsh thief"; then # Fail-open ceiling, cross-impl: the victim holds past its 1s stale window # UNTIL THE THIEF IS DONE (marker, not a fixed sleep — a fixed hold once let a # slow-starting thief arrive after the victim had already released), the other @@ -509,15 +536,17 @@ touch "$TDONE" wait "$vic"; vic_rc=$? 
[ "$vic_rc" = 98 ] && ok "robbed bash holder exited 98" || bad "robbed bash holder exited $vic_rc (want 98)" [ "$thief_rc" = 0 ] && ok "pwsh thief exited 0" || bad "pwsh thief exited $thief_rc" +fi -echo "== Test 9: a slow but UNCONTENDED pwsh holder keeps its lock (slowness != failure) ==" +if section "Test 9: a slow but UNCONTENDED pwsh holder keeps its lock (slowness != failure)"; then LOCK="$WORK/slow.lock"; LOG="$WORK/slow.log"; : > "$LOG" AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" AGENT_LOCK_STALE_SECS=1 AGENT_LOCK_POLL_SECS=0.1 AGENT_LOCK_MAX_WAIT=30 \ pwsh -NoProfile -File "$PS1WIN" run "Start-Sleep 2"; rc=$? [ "$rc" = 0 ] && ok "uncontended slow pwsh holder exited 0" || bad "uncontended slow pwsh holder exited $rc" grep -q "WARNING" "$LOG" && bad "spurious theft WARNING with no contender" || ok "no spurious WARNING when uncontended" +fi -echo "== Test 10: default lock location is /commit.lock for BOTH impls (regression: item 1) ==" +if section "Test 10: default lock location is /commit.lock for BOTH impls (regression: item 1)"; then # The BLOCKER this guards against: the .ps1 silently fell back to a CWD lock at # default config, so the two impls never contended. Run BOTH impls from a # SUBDIRECTORY of a scratch repo with AGENT_LOCK_PATH/LOG unset; each command @@ -539,8 +568,9 @@ nps="$(grep -c "ACQUIRED.*tok=tok\.ps\." "$DLOG" 2>/dev/null)" && ok "shared log shows 1 bash + 1 pwsh acquisition" \ || bad "default-log evidence wrong: ACQUIRED=$na (want 2), pwsh tokens=$nps (want 1) in $DLOG" [ -e "$GITDIR2/commit.lock" ] && bad "leftover default lock" || ok "no leftover default lock" +fi -echo "== Test 11: release-time classification agrees across impls — truncated => unverifiable (1); deleted => theft (98) ==" +if section "Test 11: release-time classification agrees across impls — truncated => unverifiable (1); deleted => theft (98)"; then # (i) TRUNCATED at release: the file still exists but reads EMPTY after the # retry ladder. 
NOT provable theft (it is the probe-F create->write window of # a successor after a boundary steal, or external truncation), so BOTH impls @@ -569,8 +599,9 @@ AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" AGENT_LOCK_MAX_WAIT=20 \ pwsh -NoProfile -File "$PS1WIN" run "Remove-Item -LiteralPath '$LOCK' -Force" 2>/dev/null; rc_ps=$? [ "$rc_sh" = 98 ] && ok "bash: lock GONE at release -> exit 98 (theft)" || bad "bash gone-at-release rc=$rc_sh (want 98)" [ "$rc_ps" = 98 ] && ok "pwsh: lock GONE at release -> exit 98 (theft)" || bad "pwsh gone-at-release rc=$rc_ps (want 98)" +fi -echo "== Test 12: fractional STALE/MAX_WAIT rejected identically by both impls (note + default) ==" +if section "Test 12: fractional STALE/MAX_WAIT rejected identically by both impls (note + default)"; then # These two knobs are integers in both impls; a fractional value silently # rounded by one side but rejected by the other would give the two impls # DIFFERENT steal thresholds for the same env. Both must note + use defaults. @@ -625,10 +656,11 @@ n_ps="$(grep -c 'ignoring invalid' "$WORK/poll-ps.err")" [ "$rc_sh" = 0 ] && [ "$n_sh" = 0 ] && [ "$rc_ps" = 0 ] && [ "$n_ps" = 0 ] \ && ok "POLL_SECS='' (empty): silent default in BOTH impls (no note)" \ || bad "POLL_SECS='' parity: sh rc=$rc_sh notes=$n_sh; pwsh rc=$rc_ps notes=$n_ps (want rc 0 + 0 notes each)" +fi if [ "$GCL_WINDOWS" = 1 ]; then -echo "== Test 13: blocked release (no-delete-share handle) — deterministic LEFTOVER, run keeps the command's code, then recovery ==" +if section "Test 13: blocked release (no-delete-share handle) — deterministic LEFTOVER, run keeps the command's code, then recovery"; then # Probe D1 made this lane deterministically testable (TODO #30): a pwsh # FileShare.Read handle on the lock file blocks the release unlink (and any # steal rename) until it closes. 
(a) sourced bash: lock_release returns 1 and @@ -732,8 +764,9 @@ AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" AGENT_LOCK_STALE_SECS=2 AGENT_LOCK [ "$rc" = 0 ] && ok "leftover reclaimed once the handle closed + stale window elapsed (TODO #30 lane)" \ || bad "leftover recovery rc=$rc (want 0)" grep -q STOLE "$LOG" && ok "recovery steal logged" || bad "no STOLE entry during leftover recovery" +fi -echo "== Test 14: blocked steal — a no-delete-share handle on a STALE lock defers the steal until it closes ==" +if section "Test 14: blocked steal — a no-delete-share handle on a STALE lock defers the steal until it closes"; then # Same handle class against a stale lock: the stealer's rename keeps failing # while the handle is open (probe D1), so it re-polls — and acquires promptly # once the handle closes. Run with the ps1 stealer: this exercises its @@ -761,8 +794,9 @@ else touch "$BGO"; wait "$blk14" 2>/dev/null bad "T14 blocker never signalled its handle open" fi +fi -echo "== Test 14b: blocked steal NEVER bypasses MAX_WAIT — squatted stale lock => 97 with bounded logging (regression: busy-spin) ==" +if section "Test 14b: blocked steal NEVER bypasses MAX_WAIT — squatted stale lock => 97 with bounded logging (regression: busy-spin)"; then # Discriminator: when the steal rename keeps # failing with the lock file still present (a no-delete-share handle squatting # it), a failed-steal lane that `continue`s past the timeout check AND the @@ -834,13 +868,14 @@ else bad "T14b squatter never signalled its handle open" fi rm -f "$LOCK" +fi else echo "== Tests 13/14/14b SKIPPED (POSIX): open handles never block unlink/rename here ==" echo "note: the LEFTOVER and blocked-steal lanes are Windows-only by construction (.NET's Unix FileShare gates no namespace operation); the Windows CI leg covers them" fi -echo "== Test 15: ps1-side never-steal guards — dir, dangling symlink, non-lock content (parity with the bash guards) ==" +if section "Test 15: ps1-side never-steal guards — dir, 
dangling symlink, non-lock content (parity with the bash guards)"; then # The ps1 guards use different APIs than bash (PSIsContainer, reparse # attributes, the catch-all CreateNew exception), so bash coverage proves # nothing about them. The wrong-type warning needs the SAME concrete type on @@ -899,8 +934,9 @@ grep -q "is not a lock file" "$WORK/psuser.err" && ok "ps1: config warning names || bad "ps1: no config warning for non-lock content" grep -q STOLE "$LOG" && bad "ps1 STOLE the user file" || ok "ps1: no steal of the user file" rm -f "$LOCK" +fi -echo "== Test 16: crash recovery under CONTENTION, mixed impls — claim-serialized: zero displacement, zero 98s ==" +if section "Test 16: crash recovery under CONTENTION, mixed impls — claim-serialized: zero displacement, zero 98s"; then # Cross-impl variant of the unit suite's Test 2b (which carries the full # rationale): 2 bash + 2 pwsh waiters race ONE crashed lock. Under the claim # protocol the straggler-robs-recovery-winner race is PREVENTED (the claim @@ -1032,8 +1068,9 @@ if [ "$t16_valid" = 1 ]; then else bad "T16: no clean run under a conclusive backdate in $T16_TRIES attempts (see above)" fi +fi -echo "== Test 16b: bash claimant vs ps1 claimant racing ONE ghost — one claim winner, cross-impl wire parity ==" +if section "Test 16b: bash claimant vs ps1 claimant racing ONE ghost — one claim winner, cross-impl wire parity"; then # The 1+1 distilled version of Test 16: one bash and one pwsh waiter race the # same ancient ghost. 
Exactly one wins the O_EXCL claim and steals # (STOLE-BY-CLAIM x1); the loser either loses the claim create (a young @@ -1105,8 +1142,9 @@ if [ "$t16b_valid" = 1 ]; then else bad "T16b: no clean run under a conclusive backdate in $T16B_TRIES attempts (see above)" fi +fi -echo "== Test 16c: cross-impl claim staleness — each side clears the OTHER side's aged claim; young foreign claims are respected ==" +if section "Test 16c: cross-impl claim staleness — each side clears the OTHER side's aged claim; young foreign claims are respected"; then # (a) bash clears an aged ps1-tokened claim, then completes the steal. LOCK="$WORK/cstale.lock"; LOG="$WORK/cstale.log"; : > "$LOG" fabricate_lock "$LOCK" "tok.ghost.cstale" "pid=9 host=ghost"; backdate "$LOCK" 9999 @@ -1156,8 +1194,9 @@ AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" AGENT_LOCK_STALE_SECS=1 \ && ok "ps1 respected a young bash claim (97, claim intact, no clear/steal)" \ || bad "ps1 young-bash-claim handling: rc=$rc intact=$([ -f "$LOCK.next" ] && echo yes || echo no)" rm -f "$LOCK" "$LOCK.next" +fi -echo "== Test 16d: static checks — no File.Replace in the ps1 port ==" +if section "Test 16d: static checks — no File.Replace in the ps1 port"; then # File.Replace is deliberately never used: it throws on a # read-only destination and has partial-failure states when called without a # backup file. The 5.1 lane must stay unlink + fail-if-exists Move. @@ -1166,8 +1205,9 @@ if grep -qE 'File\]?::Replace' "$ROOT/git-commit-lock.ps1"; then else ok "git-commit-lock.ps1 contains no File.Replace call" fi +fi -echo "== Test 16e: ps1 arc-end pass keeps INCONCLUSIVE entries; trap-time discovery-HOLD releases per normal release semantics ==" +if section "Test 16e: ps1 arc-end pass keeps INCONCLUSIVE entries; trap-time discovery-HOLD releases per normal release semantics"; then # Driven directly via a dot-sourcing pwsh driver — the ps1 side's # unit-equivalent steering mechanism (the lib skips its CLI when # dot-sourced). 
Part 1: the arc-end resolution pass's entry-drop is gated @@ -1261,8 +1301,9 @@ PSEOF else echo "note: the blocked trap-time release leg is Windows-only by construction (POSIX open handles never block unlink); the happy-path leg above pins the honest-log contract" fi +fi -echo "== Test 16f: ps1 claim-gone-at-touch — the SetLastWriteTimeUtc FileNotFound gone signal fires; no resurrection ==" +if section "Test 16f: ps1 claim-gone-at-touch — the SetLastWriteTimeUtc FileNotFound gone signal fires; no resurrection"; then # The unit suite's discovery-position matrix (T25) covers bash's # touch-gone lane; this is the ps1 counterpart: the claim passes the # step-3.1 recheck, vanishes before the step-3.2 touch (steered via the @@ -1321,9 +1362,10 @@ PSEOF else echo "== Test 16f SKIPPED: claim-gone-at-touch steering uses Windows pwsh (POSIX legs cover the protocol via the bash matrix; the ps1 gone-catch is probed Q1) ==" fi +fi if command -v powershell >/dev/null 2>&1; then -echo "== Test 17: Windows PowerShell 5.1 smoke lane — the ps1 must run, not just parse, on the in-box engine ==" +if section "Test 17: Windows PowerShell 5.1 smoke lane — the ps1 must run, not just parse, on the in-box engine"; then # Everything above runs the port under pwsh (7+). 5.1 ships in every Windows # 10/11 box and stays supported, so its claim is tested, not asserted: the # run lane's exit-code contract (0 / exit 7 / the failing-cmdlet -> 1) and @@ -1393,12 +1435,23 @@ AGENT_LOCK_PATH="$LOCK51" AGENT_LOCK_LOG="$LOG51" AGENT_LOCK_STALE_SECS=2 \ grep -q "CLAIM .*tok=tok\.ps\." 
"$LOG51" && ok "5.1: claim create logged with its per-attempt token" || bad "5.1: no CLAIM line with a tok.ps.* token" [ -e "$LOCK51" ] && bad "5.1: leftover lock after the steal ladder" || ok "5.1: no leftover lock" [ -e "$LOCK51.next" ] && bad "5.1: leftover claim after the steal ladder" || ok "5.1: no leftover claim" +fi else echo "== Test 17 SKIPPED: Windows PowerShell 5.1 (powershell) not on PATH — POSIX leg; the Windows CI leg covers it ==" echo "note: the 5.1 unlink+Move steal-ladder leg is part of this lane and is covered by the Windows CI leg" fi echo +# Zero-match guard: a set-but-non-matching GCL_TEST_ONLY ran no test block, so +# the (vacuously green) verdict below would lie. Bail loudly instead — a typo'd +# selector regex must FAIL, not pass with zero assertions. +if [ -n "${GCL_TEST_ONLY:-}" ] && [ "$SECTIONS_RUN" = 0 ]; then + echo "Bail out! GCL_TEST_ONLY=\"$GCL_TEST_ONLY\" matched no test" >&2 + exit 1 +fi +# When a selector is active, report how many blocks it matched (the default run +# stays byte-unchanged because this is gated on GCL_TEST_ONLY being non-empty). 
+[ -n "${GCL_TEST_ONLY:-}" ] && echo "selector GCL_TEST_ONLY=\"$GCL_TEST_ONLY\" ran $SECTIONS_RUN test block(s)" DONE=1 echo "==== INTEROP RESULT: $PASS passed, $FAIL failed (fan-out: $GCL_MODE) ====" [ "$GCL_TAP" = 1 ] && echo "1..$TAPN" diff --git a/tests/git-commit-lock.test.sh b/tests/git-commit-lock.test.sh index 56cc7c2..7fc5f2b 100755 --- a/tests/git-commit-lock.test.sh +++ b/tests/git-commit-lock.test.sh @@ -64,8 +64,21 @@ finish() { } trap finish EXIT -PASS=0; FAIL=0; TAPN=0; DONE=0 +PASS=0; FAIL=0; TAPN=0; DONE=0; SECTIONS_RUN=0 GCL_TAP="${GCL_TAP:-0}" # CI sets GCL_TAP=1 for machine-readable TAP13 output +GCL_TEST_ONLY="${GCL_TEST_ONLY:-}" # if set, run ONLY test blocks whose label REGEX-matches (single-test selector) +# section() replaces each per-test header `echo "== Test N: … =="`: it echoes the +# header verbatim (visible output unchanged) and returns success — gating the +# `if section …; then … fi` block — iff GCL_TEST_ONLY is unset/empty OR its regex +# matches the label. A run-counter (SECTIONS_RUN) backs the zero-match guard below, +# so a typo'd selector regex can't masquerade as a vacuous PASS=0/FAIL=0 green. +section() { + echo "== $1 ==" + if [ -z "${GCL_TEST_ONLY:-}" ] || [[ "$1" =~ $GCL_TEST_ONLY ]]; then + SECTIONS_RUN=$((SECTIONS_RUN + 1)); return 0 + fi + return 1 +} # ok/bad are TAP-aware (gated by GCL_TAP so plain dev runs are byte-unchanged) and # bump the running assertion number TAPN. The trailing `1..$TAPN` plan line (emitted # just before the verdict) lets a TAP consumer fail on a short count; together with the @@ -202,7 +215,7 @@ wait_for_grep() { # Critical section that loses updates without a mutex: read, gap, write+1. 
INCR='n="$(cat "$1")"; sleep 0.03; echo $((n+1)) > "$1"' -echo "== Test 1: concurrent workers, mutual exclusion (repeated rounds, $GCL_MODE width) ==" +if section "Test 1: concurrent workers, mutual exclusion (repeated rounds, $GCL_MODE width)"; then # A single pass is too weak to trust a rare exclusion race (the release-steal # bug found 2026-05-30 lost ~1 update per 25 only intermittently). Repeat # several rounds; ANY lost update across ALL rounds fails the test. @@ -232,8 +245,9 @@ done grep -q "Staleness detection is BROKEN" "$T1ERR" \ && bad "spurious mtime-probe WARNING under contention (see $T1ERR)" \ || ok "no spurious mtime-probe warnings under contention" +fi -echo "== Test 2: stale lock (old file mtime) is stolen; holder comes from line 2 ==" +if section "Test 2: stale lock (old file mtime) is stolen; holder comes from line 2"; then LOCK="$WORK/steal.lock"; LOG="$WORK/steal.log"; : > "$LOG"; MARKER="$WORK/steal-marker" fabricate_lock "$LOCK" "tok.fake.99999.1" "pid=99999 host=ghost" backdate "$LOCK" 9999 # make the FILE mtime ancient -> stale @@ -247,8 +261,9 @@ grep -q STOLE "$LOG" && ok "log records a steal" || bad "no STOLE entry" grep -q "holder=pid=99999 host=ghost" "$LOG" \ && ok "STALE log line carries the holder parsed from line 2" \ || bad "holder from line 2 missing in the STALE log line" +fi -echo "== Test 2b: crash recovery under CONTENTION — claim-serialized: zero displacement, zero 98s ($GCL_MODE: $T2B_ROUNDS rounds) ==" +if section "Test 2b: crash recovery under CONTENTION — claim-serialized: zero displacement, zero 98s ($GCL_MODE: $T2B_ROUNDS rounds)"; then # The claim SERIALIZES stealers, so the straggler-robs-recovery-winner race # is PREVENTED, not detected-and-repaired. 
Scenario: one crashed lock, N # waiters judging stale in the same poll window (the launch/backdate sync @@ -383,8 +398,9 @@ done || bad "'STOLE stale lock' line appeared x$t2b_old_shape — an unserialized steal lane is present" [ "$t2b_disp" = 0 ] && ok "zero STEAL-DISPLACED lines (prevention, not detect-and-repair)" \ || bad "STEAL-DISPLACED fired x$t2b_disp — displacement-repair machinery present?" +fi -echo "== Test 3: REGRESSION — EMPTY lock file (crash between create and write) is still stolen ==" +if section "Test 3: REGRESSION — EMPTY lock file (crash between create and write) is still stolen"; then # The file-protocol descendant of the 2026-05-30 orphan bug: an acquirer that # died after the open but before (or mid-) content write leaves an empty file. # Staleness MUST come from the file mtime and the content guard MUST class an @@ -398,8 +414,9 @@ AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" AGENT_LOCK_STALE_SECS=2 \ bash "$LIB" run -- bash -c 'echo after > "$1"' _ "$MARKER"; rc=$? [ "$rc" = 0 ] && ok "empty-file orphan stolen (no hang)" || bad "orphan NOT stolen (rc=$rc) — regression!" 
[ "$(cat "$MARKER")" = after ] && ok "command ran after stealing orphan" || bad "command did not run" +fi -echo "== Test 4: a LIVE lock is NOT stolen (waiter logs WAITING, blocks, then proceeds) ==" +if section "Test 4: a LIVE lock is NOT stolen (waiter logs WAITING, blocks, then proceeds)"; then LOCK="$WORK/live.lock"; LOG="$WORK/live.log"; : > "$LOG"; ORDER="$WORK/order"; echo none > "$ORDER" READY="$WORK/t4.ready"; GO4="$WORK/t4.go" # Holder keeps the lock until the test has SEEN the waiter contend (the @@ -422,8 +439,9 @@ wait "$waiter"; wait "$holder" [ "$(tr '\n' ',' < "$ORDER")" = "none,holder-start,holder-end,waiter-ran," ] \ && ok "ordering correct" || bad "ordering wrong: $(tr '\n' ',' < "$ORDER")" grep -q STOLE "$LOG" && bad "waiter wrongly STOLE a live lock" || ok "no wrongful steal of live lock" +fi -echo "== Test 4b: a ROBBED slow holder detects the theft and FAILS with 98 on release ==" +if section "Test 4b: a ROBBED slow holder detects the theft and FAILS with 98 on release"; then # The fail-open ceiling: a hold longer than the stale window CAN be stolen by a # contender. The robbed holder must DETECT this at release (the lock file is # gone, or carries the thief's token) and exit EXACTLY 98 (the reserved @@ -454,8 +472,9 @@ wait "$vpid"; victim_rc=$? grep -q "WARNING: lock LOST" "$LOG" && ok "robbed holder logged a loud theft WARNING" || bad "no theft WARNING logged" [ "$thief_rc" = 0 ] && ok "thief (its own fresh hold) released cleanly (rc 0)" || bad "thief rc=$thief_rc (should be 0)" grep -q thief-work "$OUT" && ok "thief did its work" || bad "thief work missing" +fi -echo "== Test 4c: a slow but UNCONTENDED holder keeps its lock (slowness != failure) ==" +if section "Test 4c: a slow but UNCONTENDED holder keeps its lock (slowness != failure)"; then # Documents the boundary: exceeding the stale window is only dangerous when a # contender actually steals. With no waiter, the file is never moved, the token # still matches, and release succeeds. 
(If this failed, the lock would punish @@ -466,16 +485,18 @@ AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" AGENT_LOCK_STALE_SECS=1 AGENT_LOCK [ "$solo_rc" = 0 ] && ok "uncontended slow holder released cleanly (rc 0)" || bad "uncontended slow holder rc=$solo_rc (should be 0)" grep -q "WARNING: lock LOST" "$LOG" && bad "spurious theft WARNING with no contender" || ok "no spurious WARNING when uncontended" grep -q solo-done "$OUT" && ok "uncontended slow holder did its work" || bad "work missing" +fi -echo "== Test 5: run propagates the command's exit code, releases either way ==" +if section "Test 5: run propagates the command's exit code, releases either way"; then LOCK="$WORK/rc.lock"; LOG="$WORK/rc.log"; : > "$LOG" AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" bash "$LIB" run -- bash -c 'exit 0'; rc=$? [ "$rc" = 0 ] && ok "exit 0 propagated" || bad "exit 0 not propagated (rc=$rc)" AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" bash "$LIB" run -- bash -c 'exit 7'; rc=$? [ "$rc" = 7 ] && ok "exit 7 propagated" || bad "exit code not propagated (rc=$rc)" [ -e "$LOCK" ] && bad "lock left held after run" || ok "lock released after run (success and failure)" +fi -echo "== Test 6: default lock FILE and log live in the git dir ==" +if section "Test 6: default lock FILE and log live in the git dir"; then SCRATCH="$WORK/scratch"; mkdir -p "$SCRATCH" git -C "$SCRATCH" init -q; git -C "$SCRATCH" config user.email t@t; git -C "$SCRATCH" config user.name t GITDIR="$(git -C "$SCRATCH" rev-parse --absolute-git-dir)" @@ -494,8 +515,9 @@ touch "$GO6" wait "$h6" [ -e "$GITDIR/commit.lock" ] && bad "default lock file left behind after release" || ok "default lock file removed on release" [ -f "$GITDIR/git-commit-lock.log" ] && ok "lock log created in git dir ($GITDIR)" || bad "no log in git dir" +fi -echo "== Test 7: CLI usage errors exit 96 (stderr); explicit --help/-h exits 0 (stdout) ==" +if section "Test 7: CLI usage errors exit 96 (stderr); explicit --help/-h exits 0 (stdout)"; 
then bash "$LIB" >/dev/null 2>&1; [ "$?" = 96 ] && ok "no args -> 96" || bad "no args rc=$? (want 96)" bash "$LIB" frobnicate > "$WORK/t7.err.out" 2> "$WORK/t7.err.err" [ "$?" = 96 ] && ok "unknown subcommand -> 96" || bad "unknown subcommand rc=$? (want 96)" @@ -514,8 +536,9 @@ for h in --help -h; do && ok "$h -> usage on stdout, exit 0, stderr empty" \ || bad "$h rc=$rc (want 0) stdout-usage=$(grep -c '^usage:' "$WORK/t7.help.out") stderr=$(head -c 60 "$WORK/t7.help.err")" done +fi -echo "== Test 8: acquire timeout exits 97 and the command NEVER runs ==" +if section "Test 8: acquire timeout exits 97 and the command NEVER runs"; then LOCK="$WORK/tmo.lock"; LOG="$WORK/tmo.log"; : > "$LOG"; READY="$WORK/t8.ready"; DONE8="$WORK/t8.done" # Holder keeps the lock until the test says so (marker, not a fixed sleep — # under heavy load a slow-starting waiter once arrived AFTER a 4s holder had @@ -561,8 +584,9 @@ grep -q "raise AGENT_LOCK_MAX_WAIT" "$WORK/t8.warn3.err" \ || ok "explicit MAX_WAIT silences the knob-relation warning (left-default gate kept)" wait "$h8"; rc=$? 
[ "$rc" = 0 ] && ok "holder unaffected by the timed-out waiter" || bad "holder rc=$rc (want 0)" +fi -echo "== Test 9: sub-floor (pre-2000) file mtime is NOT treated as stale ==" +if section "Test 9: sub-floor (pre-2000) file mtime is NOT treated as stale"; then # The FILETIME-zero guard: a freshly created file can transiently report a 1601 # mtime to an observer on Windows (probes C/C1b); # anything before 2000-01-01 must be classed unsettled — the waiter WAITS (and @@ -578,8 +602,9 @@ AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" AGENT_LOCK_STALE_SECS=1 \ grep -q STOLE "$LOG" && bad "sub-floor lock was wrongly STOLEN" || ok "no steal of sub-floor lock" [ -f "$LOCK" ] && ok "sub-floor lock file untouched" || bad "sub-floor lock file was removed" rm -f "$LOCK" +fi -echo "== Test 10: every worktree gets its OWN lock (git-dir scoping) ==" +if section "Test 10: every worktree gets its OWN lock (git-dir scoping)"; then WTREPO="$WORK/wtrepo"; mkdir -p "$WTREPO" git -C "$WTREPO" init -q; git -C "$WTREPO" config user.email t@t; git -C "$WTREPO" config user.name t git -C "$WTREPO" commit -q --allow-empty -m init @@ -612,8 +637,9 @@ wait "$h10" [ -e "$WTGD/commit.lock" ] && bad "worktree lock left behind" || ok "worktree lock released" [ -f "$WTGD/git-commit-lock.log" ] && ok "worktree log lives in its worktree git dir" || bad "no log at $WTGD" [ -e "$MAINGD/commit.lock" ] && bad "main-repo lock left behind" || ok "main-repo lock released" +fi -echo "== Test 11: TERM mid-hold — lock released, wrapper dies with 128+15 ==" +if section "Test 11: TERM mid-hold — lock released, wrapper dies with 128+15"; then # Two discriminators: (a) the EXIT/TERM trap must actually # release the lock when the `run` wrapper is killed; (b) the wrapper must NOT # swallow the signal (a swallowing wrapper releases, keeps going, and exits 0 @@ -637,8 +663,9 @@ wait "$w11"; rc=$? 
|| bad "TERM'd run wrapper rc=$rc (want 143)" [ -e "$LOCK" ] && bad "lock left held after TERM" || ok "lock released on TERM" grep -q RELEASED "$LOG" && ok "release logged on TERM path" || bad "no RELEASED entry on TERM path" +fi -echo "== Test 12: sourced API — acquire/release, traps, strict-mode hygiene ==" +if section "Test 12: sourced API — acquire/release, traps, strict-mode hygiene"; then # 12a: sourcing must not impose errexit/nounset/pipefail; acquire/release work # across separate commands; reentrant acquire is refused (rc 1, lock kept); # release is idempotent. Distinct failure codes pinpoint the broken step. @@ -730,8 +757,9 @@ done wait "$p12"; rc=$? [ "$rc" = 143 ] && ok "post-release shell dies on TERM (143) — signal disposition restored" \ || bad "post-release shell rc=$rc on TERM (want 143; signal-immune shell?)" +fi -echo "== Test 13: garbage AGENT_LOCK_* numerics fall back to defaults with a note ==" +if section "Test 13: garbage AGENT_LOCK_* numerics fall back to defaults with a note"; then LOCK="$WORK/num.lock"; LOG="$WORK/num.log"; : > "$LOG" AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" \ AGENT_LOCK_STALE_SECS=banana AGENT_LOCK_POLL_SECS=-1 AGENT_LOCK_MAX_WAIT=0 \ @@ -740,8 +768,9 @@ AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" \ [ "$rc" = 0 ] && ok "run succeeds despite garbage numeric config" || bad "rc=$rc with garbage numerics" n="$(grep -c "ignoring invalid" "$WORK/t13.err")" [ "$n" = 4 ] && ok "all 4 garbage values noted on stderr, incl. CLAIM_STALE_SECS (got $n)" || bad "expected 4 'ignoring invalid' notes, got $n" +fi -echo "== Test 14: run outside any git repo hard-fails 96 unless AGENT_LOCK_PATH is set ==" +if section "Test 14: run outside any git repo hard-fails 96 unless AGENT_LOCK_PATH is set"; then NR="$WORK/norepo"; mkdir -p "$NR" ( cd "$NR" && env GIT_CEILING_DIRECTORIES="$WORK" bash "$LIB" run -- bash -c 'true' ) 2> "$WORK/t14.err"; rc=$? 
[ "$rc" = 96 ] && ok "run outside a repo refused with 96" || bad "run outside a repo rc=$rc (want 96)" @@ -749,8 +778,9 @@ grep -q "AGENT_LOCK_PATH" "$WORK/t14.err" && ok "refusal message mentions AGENT_ ( cd "$NR" && env GIT_CEILING_DIRECTORIES="$WORK" AGENT_LOCK_PATH="$NR/x.lock" AGENT_LOCK_LOG="$NR/x.log" \ bash "$LIB" run -- bash -c 'true' ) 2>/dev/null; rc=$? [ "$rc" = 0 ] && ok "explicit AGENT_LOCK_PATH works outside a repo" || bad "explicit AGENT_LOCK_PATH outside repo rc=$rc" +fi -echo "== Test 14b: SOURCING outside a repo warns on stderr and creates NO files ==" +if section "Test 14b: SOURCING outside a repo warns on stderr and creates NO files"; then # Sourcing keeps the CWD fallback (it must never explode), but the warning # goes to STDERR — warning via the lock log instead would, as a side # effect, CREATE ./git-commit-lock.log in whatever random directory the @@ -770,8 +800,9 @@ leftovers="$(ls -A "$NRS" 2>/dev/null)" # (There is deliberately no Test 15: the steal installs by rename-over and # never creates a move-aside (.dead.*) file, so there is no sweep to test. # An implementation must never create one; Test 2b's sampler enforces that.) +fi -echo "== Test 16: EMPTY lock file at release — unverifiable lane (2 / run:1), NOT a theft verdict ==" +if section "Test 16: EMPTY lock file at release — unverifiable lane (2 / run:1), NOT a theft verdict"; then # Truncation stands in for the probe-F window: a file that reads empty after # the retry ladder is a successor mid-create after a boundary steal, or # external truncation — it canNOT be our own failed write (acquire's @@ -799,8 +830,9 @@ AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" \ bash "$LIB" run -- bash -c ': > "$AGENT_LOCK_PATH"; exit 7' 2>/dev/null; rc=$? 
[ "$rc" = 7 ] && ok "run keeps a failing command's own code (7) over the unverifiable 1" || bad "run empty-file+exit-7 rc=$rc (want 7)" rm -f "$LOCK" +fi -echo "== Test 16b: lock file GONE at release — definitive theft, exactly 98 ==" +if section "Test 16b: lock file GONE at release — definitive theft, exactly 98"; then # Acquire's read-back proved our # token was AT the path, so a missing file at release can only mean someone # renamed/removed it (a steal, or external interference) — report 98, loudly. @@ -819,8 +851,9 @@ AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" \ bash "$LIB" run -- bash -c 'rm -f "$AGENT_LOCK_PATH"' 2>/dev/null; rc=$? [ "$rc" = 98 ] && ok "run reports 98 (overrides a successful command) when the lock file is gone" \ || bad "run gone-at-release rc=$rc (want 98)" +fi -echo "== Test 16c: release rides out a TRANSIENT empty read (escalating retry ladder — ps1 parity) ==" +if section "Test 16c: release rides out a TRANSIENT empty read (escalating retry ladder — ps1 parity)"; then # A sub-second window in which the lock file reads EMPTY (stand-in for an AV # scanner's blocking handle, or a probe-F create->write gap that resolves) # must NOT produce the unverifiable verdict: the read-retry ladder (shared @@ -853,8 +886,9 @@ AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" bash -c ' grep -q "EMPTY/unreadable at release" "$WORK/t16c.err" \ && bad "spurious unverifiable warning despite the token reappearing" \ || ok "no unverifiable warning for the ridden-out transient" +fi -echo "== Test 17: NON-FILE at the lock path — never stolen, loud one-time config warning, waiters reach 97 ==" +if section "Test 17: NON-FILE at the lock path — never stolen, loud one-time config warning, waiters reach 97"; then # (a) a directory (a config typo like AGENT_LOCK_PATH=\$HOME, or a directory # lock left by an older release). 
The per-poll type guard fires regardless of # age — but only after the SAME concrete type is seen on two consecutive @@ -929,8 +963,9 @@ else rm -f "$LOCK" 2>/dev/null echo "note: mkfifo unavailable/unusable here — FIFO guard not exercised (CI POSIX legs cover it)" fi +fi -echo "== Test 17d: REGRESSION — create/delete churn at the lock path must NOT fire the non-lock warning ==" +if section "Test 17d: REGRESSION — create/delete churn at the lock path must NOT fire the non-lock warning"; then # The per-poll guard's existence (-e/-L) and classification (-f && ! -L) # checks are SEPARATE stats. A rival's release/steal unlink landing between # them — or a Windows delete-pending ghost (the unlink queues behind a rival @@ -1069,8 +1104,9 @@ if [ -n "$churn_pid" ]; then else echo "note: $churn_skip — churn-vs-guard regression not exercised here (CI legs cover it)" fi +fi -echo "== Test 18: stale NON-LOCK CONTENT at the lock path is never stolen; torn tokens split on the tok. prefix ==" +if section "Test 18: stale NON-LOCK CONTENT at the lock path is never stolen; torn tokens split on the tok. prefix"; then # The content guard (age-gated): steal only an empty file or a line 1 starting # "tok.". A real user file at a typo'd AGENT_LOCK_PATH must survive, forever. # (a) a user file @@ -1113,8 +1149,9 @@ AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" AGENT_LOCK_STALE_SECS=2 \ && ok "tok.-prefixed torn token IS stolen by staleness (crash-orphan lane)" \ || bad "tok.-prefixed torn token not stolen (rc=$rc marker=$(cat "$MARKER"))" grep -q STOLE "$LOG" && ok "steal of the torn token logged" || bad "no STOLE entry for torn token" +fi -echo "== Test 19: wire format — token on line 1 (tok.-prefixed), owner on line 2 ==" +if section "Test 19: wire format — token on line 1 (tok.-prefixed), owner on line 2"; then # Pins the on-disk format the ps1 port must match, and that token parsing # takes LINE 1 only (an owner line present must not pollute the token). 
LOCK="$WORK/wire.lock"; LOG="$WORK/wire.log"; : > "$LOG" @@ -1130,8 +1167,9 @@ AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" bash -c ' ' _ "$LIB" "$LOCK"; rc=$? [ "$rc" = 0 ] && ok "lock file carries token (line 1, tok.-prefixed) + owner (line 2); release parses line 1 with owner present" \ || bad "wire-format check failed at step code $rc" +fi -echo "== Test 20: claim contention — N concurrent stealers, ONE claim winner ($GCL_MODE: $T20_N workers) ==" +if section "Test 20: claim contention — N concurrent stealers, ONE claim winner ($GCL_MODE: $T20_N workers)"; then # N stealers race one ancient ghost: exactly one wins the O_EXCL claim and # steals (one STOLE-BY-CLAIM); the rest lose the claim create and acquire # normally in sequence after the winner releases. No displacement (zero @@ -1165,8 +1203,9 @@ nlost="$(grep -c "lock LOST" "$WORK/contend.all.log")" [ "$nlost" = 0 ] && ok "zero LOST warnings under claim contention" || bad "$nlost LOST warnings under claim contention" [ -e "$LOCK" ] && bad "leftover lock after contention" || ok "no leftover lock" [ -e "$LOCK.next" ] && bad "leftover claim after contention" || ok "no leftover claim" +fi -echo "== Test 21: crashed-claimant and empty-claim orphans age out; steals resume ==" +if section "Test 21: crashed-claimant and empty-claim orphans age out; steals resume"; then # (a) an aged foreign claim (crashed claimant): cleared by CLAIM-STALE-CLEARED, # then the steal completes; recovery latency bounded. LOCK="$WORK/cc.lock"; LOG="$WORK/cc.log"; : > "$LOG" @@ -1191,8 +1230,9 @@ AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$LOG" AGENT_LOCK_STALE_SECS=1 \ bash "$LIB" run -- bash -c 'true' 2>/dev/null; rc=$? 
[ "$rc" = 0 ] && ok "empty claim orphan aged out and recovery completed (rc 0)" || bad "rc=$rc behind an empty claim orphan" grep -q "CLAIM-STALE-CLEARED" "$LOG" && ok "empty claim cleared via the same staleness lane" || bad "empty claim was not cleared" +fi -echo "== Test 22: NON-CLAIM objects at the claim path — never deleted, per-path warn state ==" +if section "Test 22: NON-CLAIM objects at the claim path — never deleted, per-path warn state"; then # (a) a directory at ${LOCK}.next blocks steals (waiter reaches 97), is never # deleted, and warns once naming the claim path. LOCK="$WORK/cwt.lock"; LOG="$WORK/cwt.log"; : > "$LOG" @@ -1305,8 +1345,9 @@ AGENT_LOCK_PATH="$PPD2/c1.lock" AGENT_LOCK_LOG="$PPD2/ppg2.log" AGENT_LOCK_STALE grep -q "is not a claim file" "$PPD2/ba.err" && grep -q "is not a lock file" "$PPD2/ba.err" \ && ok "claim-path warning did not suppress the lock-path warning (reverse order)" \ || bad "lock-path warning suppressed after a claim-path warning (shared warn-once state?)" +fi -echo "== Test 23: live-slow holder — re-verify under the claim sees a fresh lock, CLAIM-ABORT (fresh), no steal ==" +if section "Test 23: live-slow holder — re-verify under the claim sees a fresh lock, CLAIM-ABORT (fresh), no steal"; then # Steered deterministically: the lock's mtime is renewed (as a live-slow # holder's re-create/renewal would) at the exact step-2 re-verify position, # via a sourced shell that wraps the library's verify internal. The claimant @@ -1337,8 +1378,9 @@ wait "$w23"; rc=$? 
[ "$rc" = 0 ] && ok "waiter then acquired and released normally (rc 0)" || bad "waiter rc=$rc after the slow holder released" grep -q "STOLE-BY-CLAIM" "$LOG" && bad "live lock was STOLEN despite the fresh re-verify" || ok "no steal of the live-slow holder's lock" [ -e "$LOCK.next" ] && bad "claim leftover after the fresh abort" || ok "claim deleted on the fresh abort" +fi -echo "== Test 24: OVERAGED own claim — CLAIM-ABORT (contested), no rename ==" +if section "Test 24: OVERAGED own claim — CLAIM-ABORT (contested), no rename"; then # A suspended claimant's recheck must refuse to proceed on its own overaged # claim (a clearer may be acting on it). Steered: every recheck sees the # claim backdated past CLAIM_STALE. Mutation check: an implementation that @@ -1364,8 +1406,9 @@ l1=""; IFS= read -r l1 < "$LOCK" || true [ "$l1" = "tok.ghost.t24" ] && ok "ghost lock untouched by the contested aborts" || bad "ghost lock was modified (line1=$l1)" [ -e "$LOCK.next" ] && bad "claim leftover after contested aborts" || ok "claim deleted on each contested abort" rm -f "$LOCK" +fi -echo "== Test 25: discovery-position matrix — own-claim-installed discovered on EVERY exit ==" +if section "Test 25: discovery-position matrix — own-claim-installed discovered on EVERY exit"; then # A rival's rename can install OUR claim as the lock while we sit at any # post-claim position. 
Each position steers that rename to the exact spot # (wrapping a library internal or shadowing mv/rm/touch in a sourced shell) @@ -1468,8 +1511,9 @@ for pos in step2-fresh recheck-gone touch-gone lock-gone contested deletion-gone bad "position $pos: rc=$rc discovery=$(grep -c DISCOVERY-HOLD "$LOG") expect-line=$(grep -cF "$expect" "$LOG") lock-left=$([ -e "$LOCK" ] && echo yes || echo no) claim-left=$([ -e "$LOCK.next" ] && echo yes || echo no)" fi done +fi -echo "== Test 26: delayed claim still installs a FRESH lease (the pre-rename touch) ==" +if section "Test 26: delayed claim still installs a FRESH lease (the pre-rename touch)"; then # A claim aged close to CLAIM_STALE (steered: backdated 40s of 60 at the # recheck) must still install a lock whose mtime is ~now — the step-3.2 # touch resets the clock; rename preserves it (probe R2). A no-touch @@ -1500,8 +1544,9 @@ case "$rc" in *) bad "delayed-claim lease harness rc=$rc" ;; esac grep -q "STOLE-BY-CLAIM" "$LOG" && ok "the delayed claim still completed its steal" || bad "no STOLE-BY-CLAIM in the lease test" +fi -echo "== Test 27: lock GONE at re-verify — CLAIM-ABORT (gone), NO rename onto the absent path ==" +if section "Test 27: lock GONE at re-verify — CLAIM-ABORT (gone), NO rename onto the absent path"; then # A live-slow holder releasing under a claimant must route to the normal # create race, never a rename onto the absent path. 
Mutation check: a # renaming implementation would install the CLAIM token; the correct one @@ -1532,8 +1577,9 @@ else bad "claim token vs acquired token: claim='$ctok' acquired='$atok' (equal or missing => renamed onto the absent path?)" fi grep -q "DISCOVERY-HOLD" "$LOG" && bad "spurious discovery-HOLD in the gone lane" || ok "no spurious discovery-HOLD" +fi -echo "== Test 28: SUB-FLOOR claim mtime is never cleared — treated as just-created ==" +if section "Test 28: SUB-FLOOR claim mtime is never cleared — treated as just-created"; then LOCK="$WORK/cfloor.lock" LOG="$WORK/cfloor.log" : >"$LOG" @@ -1549,8 +1595,9 @@ grep -q "CLAIM-STALE-CLEARED" "$LOG" && bad "sub-floor claim was CLEARED — mti || ok "sub-floor claim never cleared (floor applies to the claim)" [ -f "$LOCK.next" ] && ok "sub-floor claim file untouched" || bad "sub-floor claim file was removed" rm -f "$LOCK" "$LOCK.next" +fi -echo "== Test 29: BLOCKED steal rename — claim deleted IMMEDIATELY, no CLAIM_STALE penalty ==" +if section "Test 29: BLOCKED steal rename — claim deleted IMMEDIATELY, no CLAIM_STALE penalty"; then # The rename is forced to fail-with-the-lock-still-present (a shadowed mv — # the no-delete-share squat, deterministically). 
The claimant must delete its # own claim at once and re-poll: with CLAIM_STALE=600, a leftover claim would @@ -1579,8 +1626,9 @@ grep -q "steal FAILED" "$LOG" && ok "blocked rename logged (damped steal FAILED) [ -e "$LOCK.next" ] && bad "claim leftover after the blocked steal attempts" || ok "no claim leftover at exit" [ -f "$LOCK" ] && ok "squatted lock left in place" || bad "lock vanished in the blocked lane" rm -f "$LOCK" +fi -echo "== Test 30: static checks — the claim touch is NON-creating with an explicit existence check ==" +if section "Test 30: static checks — the claim touch is NON-creating with an explicit existence check"; then grep -q 'touch -c -- "\$_LOCK_CLAIM_PATH"' "$LIB" \ && ok "claim touch uses 'touch -c --' (non-creating)" \ || bad "no 'touch -c -- \$_LOCK_CLAIM_PATH' in the implementation" @@ -1590,8 +1638,9 @@ grep -A3 'touch -c -- "\$_LOCK_CLAIM_PATH"' "$LIB" | grep -q -- '-e "\$_LOCK_CLA bad_touch="$(grep 'touch ' "$LIB" | grep '_LOCK_CLAIM_PATH' | grep -v -- '-c')" [ -z "$bad_touch" ] && ok "no creating touch of the claim path anywhere" \ || bad "creating touch of the claim path found: $bad_touch" +fi -echo "== Test 31: LEAKED-claim discovery — the leaked-token memory closes the unverified-claim lanes ==" +if section "Test 31: LEAKED-claim discovery — the leaked-token memory closes the unverified-claim lanes"; then # (a) main leg: a recheck-unreadable exit leaks the claim token; a rival # (the external mv below) then installs that claim as the lock; the leaver # adopts it (HOLD) and release returns 0. 
Adoption may go through EITHER of @@ -1801,8 +1850,9 @@ case "$(uname -s 2>/dev/null)" in echo "note: the blocked-unlink feeder leg is Windows-only by construction (POSIX open handles never block unlink); the read-shadow legs above cover the memory machinery" ;; esac +fi -echo "== Test 32: per-attempt tokens — an abandoned own-token lock never aliases discovery or release ==" +if section "Test 32: per-attempt tokens — an abandoned own-token lock never aliases discovery or release"; then # Walk: the first CREATE's read-back is forced blank (and the abandoned lock # backdated stale). A later CLAIM attempt is steered into a recheck-gone # discovery against that abandoned lock: a reused-per-acquire-token @@ -1845,8 +1895,9 @@ grep -q "DISCOVERY-HOLD" "$LOG" && bad "FALSE discovery-HOLD on the abandoned ow || ok "no false discovery-HOLD — the abandoned token did not alias the claim attempt" grep -q "STOLE-BY-CLAIM" "$LOG" && ok "the abandoned lock was then reclaimed by a normal steal" \ || bad "no STOLE-BY-CLAIM of the abandoned lock" +fi -echo "== Test 32b: steal-path read-back FAILED — rename-over WON but the lock did not read back our token (F2) ==" +if section "Test 32b: steal-path read-back FAILED — rename-over WON but the lock did not read back our token (F2)"; then # The steal-path twin of Test 32. 
Here the stealer WINS the claim race AND wins # the rename-over (STOLE-BY-CLAIM is logged, the ghost is destroyed), but the # mandatory post-rename read-back verification (git-commit-lock.sh:1171) comes @@ -1898,8 +1949,9 @@ else fi [ -e "$LOCK" ] && bad "lock leftover after the steal-readback walk" || ok "lock released cleanly" [ -e "$LOCK.next" ] && bad "claim leftover after the steal-readback walk" || ok "no claim leftover" +fi -echo "== Test 33: TERM mid-claim — the trap deletes the claim (token-checked), no 98, no ageout penalty ==" +if section "Test 33: TERM mid-claim — the trap deletes the claim (token-checked), no 98, no ageout penalty"; then # (a) main: claimant paused inside its claim window (at the touch), TERM'd. # The trap must delete OUR claim, run the discovery read (miss: the ghost is # foreign), restore traps, re-raise (143) — and must NOT touch the lock. @@ -2030,8 +2082,9 @@ case "$(uname -s 2>/dev/null)" in echo "note: TERM-blocked-unlink leg is Windows-only by construction (POSIX open handles never block unlink)" ;; esac +fi -echo "== Test 34: TERM on a STEAL-acquired hold releases exactly like a create-acquired one ==" +if section "Test 34: TERM on a STEAL-acquired hold releases exactly like a create-acquired one"; then # All acquisition paths go through the shared claim-the-hold helper, so a # steal-acquired holder must run the same HELD/trap machinery: release on # TERM, re-raise, 143 (T11's contract, on a steal-acquired hold). @@ -2054,8 +2107,9 @@ wait "$w34"; rc=$? 
[ "$rc" = 143 ] && ok "TERM'd steal-acquired holder exited 143 (signal re-raised)" || bad "steal-acquired TERM rc=$rc (want 143)" [ -e "$LOCK" ] && bad "lock left held after TERM on a steal-acquired hold" || ok "steal-acquired lock released on TERM" grep -q "RELEASED" "$LOG" && ok "release logged on the steal-acquired TERM path" || bad "no RELEASED entry for the steal-acquired hold" +fi -echo "== Test 35: release-time leaked-claim cleanup — displaced hold cleans its own installed leak, 98 ==" +if section "Test 35: release-time leaked-claim cleanup — displaced hold cleans its own installed leak, 98"; then # (a) B leaks token L (recheck-unreadable; the ghost vanishes at the same # moment), acquires fresh N normally; a rival installs L over the lock, # displacing B's held N. B's release must return 98 AND unlink L (the lock @@ -2148,8 +2202,9 @@ esac grep -q "RELEASE-CLEANED-LEAKED-CLAIM" "$LOG" && bad "boundary variant wrongly logged a leaked-claim cleanup" \ || ok "no cleanup line when the re-read backed off" rm -f "$LOCK" "$LOCK.next" "$WORK/t35b.succ" +fi -echo "== Test 36: arc-end resolution pass — an INCONCLUSIVE lock read keeps the entry pending; conclusive ones drop it ==" +if section "Test 36: arc-end resolution pass — an INCONCLUSIVE lock read keeps the entry pending; conclusive ones drop it"; then # The pass's entry-drop is gated on one lock-path read. That read resolves # the entry ONLY when it is conclusive: a DIFFERENT readable token, or the # path definitively absent. 
A lock PRESENT but unreadable/empty proves @@ -2207,8 +2262,9 @@ grep -q "DISCOVERY-HOLD (leaked-token memory)" "$LOG" && ok "the surviving entry grep -q "resolved tok=tok.leak.t36.2" "$LOG" && ok "conclusive resolution logged for the dropped entry" \ || bad "no resolution log line for the conclusive drop" rm -f "$LOCK" "$LOCK.next" +fi -echo "== Test 37: rename-refused — a directory appearing at the lock path mid-steal aborts the steal, no false hold ==" +if section "Test 37: rename-refused — a directory appearing at the lock path mid-steal aborts the steal, no false hold"; then # The only acquire/steal VERDICT branch with no test: a NON-regular object (a # directory) appears AT the lock path between the claimant's final re-verify # (step 3.3, sees a stale FILE) and its rename-over, so the rename is refused @@ -2270,8 +2326,9 @@ grep -q "acquire verification FAILED" "$LOG" \ && ok "directory left in place at the lock path (never overwritten)" \ || bad "lock path is no longer the squatting directory" rm -rf "$LOCK" "$LOCK.next" +fi -echo "== Test 38: step-3.3 pre-rename re-verify abort — claim cleaned, discovery, no false hold ==" +if section "Test 38: step-3.3 pre-rename re-verify abort — claim cleaned, discovery, no false hold"; then # The step-2 re-verify (sh:1075) and the step-3.3 re-verify immediately before # the rename (sh:1149) are near-identical abort lanes; Test 23/27 exercise the # step-2 lane only, leaving 3.3 untested. Steered with a CALL-COUNTER on @@ -2330,9 +2387,10 @@ wait "$w38"; rc=$? 
|| bad "waiter rc=$rc after the slow holder released (want 0)" [ -e "$LOCK.next" ] && bad "claim leftover after the waiter finished" || ok "no claim leftover at exit" rm -f "$LOCK" "$LOCK.next" +fi -echo "== Test 39: foreign claim at recheck — left intact, discovery, no false 98 ==" +if section "Test 39: foreign claim at recheck — left intact, discovery, no false 98"; then # After winning its claim and passing step-2 re-verify, the claimant rechecks # its OWN claim file before installing. The `gone` recheck leg is covered (Test # 25 recheck-gone / Test 32); the `foreign` leg is NOT: a waiter judged our @@ -2404,8 +2462,9 @@ gl1=""; IFS= read -r gl1 < "$LOCK" 2>/dev/null || true [ "$gl1" = "tok.ghost.t39" ] && ok "ghost lock untouched by the foreign-recheck backoff" \ || bad "ghost lock modified (line1=$gl1)" rm -f "$LOCK" "$LOCK.next" "$SF" +fi -echo "== Test 40: exec-bypass boundary — exec in the lock-holding shell skips release (OOS-5); exec in a child does not ==" +if section "Test 40: exec-bypass boundary — exec in the lock-holding shell skips release (OOS-5); exec in a child does not"; then # `lock_run` runs the wrapped command vector with `"$@"` IN THE WRAPPER SHELL # (git-commit-lock.sh), so a command that is itself an `exec` REPLACES the # lock-holding wrapper process: the trailing `lock_release` AND the EXIT trap @@ -2496,8 +2555,9 @@ grep -q "WARNING" "$LOG" \ && bad "an unexpected WARNING was logged by the displaced exec-0 holder" \ || ok "displaced holder's exec-0 emitted NO WARNING at all (unwarned silent loss)" rm -f "$LOCK" +fi -echo "== Test 41: forward clock jump steals a live lock — detected as 98, never silent (E2) ==" +if section "Test 41: forward clock jump steals a live lock — detected as 98, never silent (E2)"; then # Staleness is age = now - mtime (git-commit-lock.sh ~:928, ~:1409), where `now` # is _lock_now. 
A process whose clock has LEAPED FORWARD computes an inflated age # for everyone's lock, so it can judge a LIVE, fresh lock ancient and steal it. @@ -2561,8 +2621,9 @@ grep -q "WARNING: lock LOST" "$LOG" \ && ok "robbed holder logged a loud theft WARNING (no silent double-commit)" \ || bad "no theft WARNING logged for the forward-jump steal" rm -f "$LOCK" "$LOCK.next" +fi -echo "== Test 42: mtime unreadable — staleness disabled, fail-safe (no steal), warn-once, 97 (E3) ==" +if section "Test 42: mtime unreadable — staleness disabled, fail-safe (no steal), warn-once, 97 (E3)"; then # §E3: if the lock file's mtime cannot be read AT ALL (every probe fails on a # PRESENT file), staleness detection is BROKEN. The mtime floor fails closed to # "fresh": _lock_verify_stale returns state=fresh, so a crashed/stale holder is @@ -2631,8 +2692,9 @@ t42_warns="$(grep -c "Staleness detection is BROKEN" "$T42_ERR" 2>/dev/null || e && ok "mtime-unreadable: broken-staleness warning fired at most once on stderr ($t42_warns)" \ || bad "mtime-unreadable: warning repeated ($t42_warns times — warn-once broken)" rm -f "$T42_LOCK" "$T42_LOCK.next" +fi -echo "== Test 43: malformed/unreadable lock content at the poll guard — never stolen, warned/skipped ==" +if section "Test 43: malformed/unreadable lock content at the poll guard — never stolen, warned/skipped"; then # Two sibling branches of the in-acquire steal CONTENT GUARD (git-commit-lock.sh # ~:1419-1444), both gated on an already-stale candidate, neither of which the # torn/empty/tok.-prefixed cases (Tests 17/18) reach: @@ -2700,8 +2762,9 @@ grep -q "STOLE" "$LOG" && bad "#17 ghost was STOLEN despite the unreadable conte || ok "#17 no steal while the steal-guard read fails" [ -f "$LOCK" ] && ok "#17 stale ghost left in place" || bad "#17 stale ghost was removed" rm -f "$LOCK" +fi -echo "== Test 44: socket & device-node at the lock path — never stolen/deleted, refused (97) ==" +if section "Test 44: socket & device-node at the lock path 
— never stolen/deleted, refused (97)"; then # The never-steal wrong-type guard (git-commit-lock.sh ~:1557-1567) classifies # NON-regular objects at the lock path so they are NEVER stolen and NEVER # deleted: a real config error (a typo'd AGENT_LOCK_PATH, a stray special file) @@ -2793,9 +2856,10 @@ if [ -c /dev/null ]; then else echo "note: /dev/null is not a char device here — device-node guard not exercised (CI POSIX legs cover it)" fi +fi -echo "== Test 45: log self-truncates past ~1 MB (rotation, not unbounded growth) ==" +if section "Test 45: log self-truncates past ~1 MB (rotation, not unbounded growth)"; then # _lock_log starts the log over (not rotate) once it grows past ~1MB: the size # check at the top of _lock_log truncates the file to empty before the write, # so a normal log-producing op on an oversized log leaves a small, well-formed @@ -2827,8 +2891,9 @@ grep -q 'xxxx' "$LOG" && bad "old oversized 'x' content survived into the restar || ok "old oversized content is gone (clean restart, not appended)" [ -e "$LOCK" ] && bad "lock left held after run" || ok "lock released after the over-threshold run" rm -f "$LOCK" "$LOG" +fi -echo "== Test 46: EXIT while waiting (no hold) — no-hold trap arc, no spurious release ==" +if section "Test 46: EXIT while waiting (no hold) — no-hold trap arc, no spurious release"; then # A10 (steering-coverage.md): _lock_on_exit's no-hold arc-end (:1009,1017-1018). 
# A sourced waiter, blocked in the wait loop against a LIVE held lock, exits 0 # while still parked — the EXIT trap is STILL '_lock_on_exit' (the timeout's @@ -2921,8 +2986,9 @@ touch "$HG"; wait "$h46" 2>/dev/null grep -q "lock LOST" "$HLOG" && bad "holder saw a stolen lease (98) — the waiter's exit disturbed the hold" \ || ok "holder released its still-held lock cleanly (no 98)" rm -f "$LOCK" "$LOCK.next" "$T46R" "$T46G" "$T46T" "$HR" "$HG" +fi -echo "== Test 47: no-mv-T rename-over fallback (BSD/macOS lane) forced via _LOCK_MVT=0 — steal still installs ==" +if section "Test 47: no-mv-T rename-over fallback (BSD/macOS lane) forced via _LOCK_MVT=0 — steal still installs"; then # _lock_rename_over (git-commit-lock.sh ~:961-979) probes once for GNU `mv -T` # and caches the verdict in _LOCK_MVT (""=unprobed, 1=supported, 0=not). On # Linux/MINGW the probe ALWAYS picks `mv -T`, so the no-`-T` fallback lane @@ -3048,9 +3114,10 @@ grep -q "STOLE-BY-CLAIM" "$LOGB" \ && bad "T47(b): claim leftover (\$LOCK.next) after the fallback rename-refused abort" \ || ok "T47(b): claim file cleaned up — no leftover \$LOCK.next" rm -rf "$LOCK" "$LOCK.next" "$LOCKC" "$LOCKC.next" "$LOCKB" "$LOCKB.next" +fi -echo "== Test 48: unwritable lock dir -> clean 97, command never runs, no false hold (F4) ==" +if section "Test 48: unwritable lock dir -> clean 97, command never runs, no false hold (F4)"; then # F4 (failure-modes.md §4.5): a read-only / unwritable lock-dir parent makes the # O_EXCL create fail every poll, so the waiter times out at 97 — no corruption, no # false hold, and the wrapped command never runs. 
POSIX-only: chmod 0555 is a no-op @@ -3078,8 +3145,9 @@ case "$(uname -s)" in chmod 0755 "$T48DIR" 2>/dev/null; rm -rf "$T48DIR" # restore so cleanup() can rm -rf $WORK ;; esac +fi -echo "== Test 49: failing log path -> lock still works, the log write is swallowed (F2/J1) ==" +if section "Test 49: failing log path -> lock still works, the log write is swallowed (F2/J1)"; then # F2/J1 (failure-modes.md §4.5): logging is best-effort (every write ends || true). # Point AGENT_LOCK_LOG under a REGULAR FILE so every append/open fails ENOTDIR — the # lock must still acquire+release cleanly (rc 0) with the log write swallowed. @@ -3100,8 +3168,9 @@ AGENT_LOCK_PATH="$WORK/t49.lock" AGENT_LOCK_LOG="$T49LOG" \ [ ! -e "$T49LOG" ] && ok "F2/J1: the log write was swallowed (no log file under the non-dir)" \ || bad "F2/J1: a log file was created under a non-dir" rm -f "$T49P" "$WORK/t49.lock" +fi -echo "== Test 50: ENOSPC on lock create/write -> wait then 97, no false hold (F1) ==" +if section "Test 50: ENOSPC on lock create/write -> wait then 97, no false hold (F1)"; then # F1 (failure-modes.md §4.5): a full filesystem makes the create's write fail # (ENOSPC); the created-but-write-failed file is an empty orphan and the waiter # times out at 97 — no corruption, no false hold. Real ENOSPC needs a full FS, which @@ -3128,6 +3197,7 @@ if [ "$(uname -s)" = Linux ] && sudo -n true 2>/dev/null; then else echo "note: Test 50 skipped — ENOSPC injection needs Linux + passwordless sudo (a small tmpfs); the Linux CI leg covers it" fi +fi # NOTES (deliberately untested here): # * lock_release's LEFTOVER lane (the unlink blocked persistently) needs a @@ -3141,6 +3211,15 @@ fi # Test 32, the steal-path lane (F2 — rename-over won, read-back wrong) by # Test 32b. +# Zero-match guard: a set-but-non-matching GCL_TEST_ONLY ran NO test block. Without +# this, the suite would fall through to a vacuous PASS=0 FAIL=0 "green" — a typo'd +# selector regex would silently look like success. 
Fail loudly instead. (The finish
+# EXIT trap also fires here since DONE is still 0; this exit is non-zero regardless.)
+if [ -n "${GCL_TEST_ONLY:-}" ] && [ "$SECTIONS_RUN" = 0 ]; then
+  echo "Bail out! GCL_TEST_ONLY=\"$GCL_TEST_ONLY\" matched no test" >&2
+  exit 1
+fi
+

 DONE=1
 echo
 echo "==== RESULT: $PASS passed, $FAIL failed, $ENV_WARN envelope warning(s) (fan-out: $GCL_MODE) ===="

From b8e29513406265f0905a0e6770586313079735aa Mon Sep 17 00:00:00 2001
From: Ben Toner
Date: Thu, 18 Jun 2026 03:42:29 +1000
Subject: [PATCH 38/76] Bucket 8 item 3: extract tests/_harness.sh (shared TAP/selector/helpers)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Pure deduplication, zero behavior change — the last harness-restructure step.
A new tests/_harness.sh (177 lines), sourced by all three suites, holds the
genuinely-shared code:

- Tier 1 (all three): the PASS/FAIL/TAPN/DONE/SECTIONS_RUN inits; the
  GCL_TAP / GCL_TEST_ONLY reads; ok()/bad(); section(); the finish/sentinel
  EXIT-trap helper (which calls the suite-local cleanup); the shared
  shellcheck disables; and a unified selector_report() (zero-match guard +
  the "ran N block(s)" line) so unit and interop behave identically.
- Tier 2 (unit + interop, each verified byte-identical before extracting):
  epoch_to_stamp, backdate, backdate_ghost, sync_waiting_fresh,
  fabricate_lock, wait_for_grep.

Left per-suite (deliberately): cleanup (closes over each $WORK; interop
differs); clone_fn + its export -f (unit-only);
ok_envelope/bad_envelope/ENV_WARN (unit-only); the two poll helpers
wait_for_file (unit, secs) and wait_for (interop, 50ms iters) — different
names/semantics, NOT unified; and each suite's verdict line + GCL_TEST_FULL
mode handling.

Sourcing is CWD-independent (resolved from BASH_SOURCE). A
`# shellcheck source=tests/_harness.sh` directive at each source site resolves
SC1091, and tests/_harness.sh is added to the CI shellcheck file list so the
shared code is linted.
Validated (reduced, exit 0): unit 315/0, interop 141/0, integration 12/0; sorted
PASS/FAIL identical before/after (volatile token/path/bounded-count fields
aside); selector + zero-match guard + integration note-and-ignore all intact;
shellcheck -S style clean across all files incl. _harness.sh. Net -42 lines.

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 .github/workflows/tests.yml               |   1 +
 tests/_harness.sh                         | 177 ++++++++++++++++++++++
 tests/git-commit-lock.integration.test.sh |  37 ++---
 tests/git-commit-lock.interop.test.sh     | 160 ++++---------------
 tests/git-commit-lock.test.sh             | 152 ++++---------------
 5 files changed, 243 insertions(+), 284 deletions(-)
 create mode 100644 tests/_harness.sh

diff --git a/.github/workflows/tests.yml b/.github/workflows/tests.yml
index 52961e6..268c257 100644
--- a/.github/workflows/tests.yml
+++ b/.github/workflows/tests.yml
@@ -145,6 +145,7 @@ jobs:
           /tmp/shellcheck-v0.11.0/shellcheck --version
           /tmp/shellcheck-v0.11.0/shellcheck -S style \
             git-commit-lock.sh \
+            tests/_harness.sh \
             tests/git-commit-lock.test.sh \
             tests/git-commit-lock.interop.test.sh \
             tests/git-commit-lock.integration.test.sh \
diff --git a/tests/_harness.sh b/tests/_harness.sh
new file mode 100644
index 0000000..d5d8215
--- /dev/null
+++ b/tests/_harness.sh
@@ -0,0 +1,177 @@
+# shellcheck shell=bash
+# tests/_harness.sh — shared test harness for the git-commit-lock suites.
+#
+# Sourced by all three suites (git-commit-lock.test.sh, .interop.test.sh,
+# .integration.test.sh) to share the bits they all copy-pasted: the PASS/FAIL/
+# TAP counters, the GCL_TAP / GCL_TEST_ONLY reads, ok()/bad(), section(), the
+# end-of-suite DONE sentinel (finish), and the per-test selector verdict helper.
+# Pure deduplication — ZERO behaviour change vs the inline copies it replaces.
+#
+# Contract for sourcing suites:
+#   * Source this EARLY (before any use of the inits/helpers below), CWD-
+#     independently — resolve it from the sourcing script's own location:
+#       _HARNESS_DIR="$(CDPATH= cd -- "$(dirname -- "${BASH_SOURCE[0]}")" && pwd)"
+#       # shellcheck source=tests/_harness.sh
+#       . "$_HARNESS_DIR/_harness.sh"
+#   * Each suite still defines its OWN cleanup() (it closes over the suite's
+#     $WORK and the bodies genuinely differ); finish() below calls whatever
+#     cleanup() is in scope when the EXIT trap fires.
+#   * Each suite installs the trap itself: `trap finish EXIT`.
+#   * The suite reaching its end sets DONE=1 before its verdict line.
+#
+# The whole project runs its suites under `set -uo pipefail` (NOT set -e); these
+# helpers are written for that (they assert on values, never on implicit exit
+# propagation), and the disables below cover the idioms that pervade the suites.
+#
+# shellcheck disable=SC2015 # The pervasive ` && ok ... || bad ...`
+# idiom is deliberate throughout: ok/bad are echo+counter helpers that cannot
+# fail, so the classic A && B || C pitfall (C running after B fails) is moot.
+# shellcheck disable=SC2310,SC2312 # info-level, deliberate: helper functions
+# and command substitutions run inside conditions all over a test suite; the
+# suites run WITHOUT errexit (set -uo only) and assert on values, not on
+# implicit exit propagation.
+
+PASS=0; FAIL=0; TAPN=0; DONE=0; SECTIONS_RUN=0
+GCL_TAP="${GCL_TAP:-0}"             # CI sets GCL_TAP=1 for machine-readable TAP13 output
+GCL_TEST_ONLY="${GCL_TEST_ONLY:-}"  # if set, run ONLY test blocks whose label REGEX-matches (single-test selector)
+
+# ok/bad are TAP-aware (gated by GCL_TAP so plain dev runs are byte-unchanged) and
+# bump the running assertion number TAPN. The trailing `1..$TAPN` plan line (emitted
+# by each suite just before its verdict) lets a TAP consumer fail on a short count;
+# together with the DONE sentinel below this closes the silent-undercount gap.
+# `return 0` preserves the "ok/bad cannot fail" property the
+# ` && ok ... || bad ...` idiom relies on.
+ok()  { PASS=$((PASS+1)); TAPN=$((TAPN+1)); echo "PASS: $*"
+  [ "$GCL_TAP" = 1 ] && echo "ok $TAPN - $*"; return 0; }
+bad() { FAIL=$((FAIL+1)); TAPN=$((TAPN+1)); echo "FAIL: $*"
+  [ "$GCL_TAP" = 1 ] && echo "not ok $TAPN - $*"; return 0; }
+
+# Per-test gate: echoes the block header (so a normal run is byte-unchanged) and
+# returns success iff GCL_TEST_ONLY is unset/empty OR its regex matches the label.
+# Each top-level `== Test N: ==` block is wrapped `if section "..."; then ... fi`.
+# Bumps SECTIONS_RUN on a match so the verdict's zero-match guard (selector_report)
+# can catch a selector regex that matched nothing.
+section() {
+  echo "== $1 =="
+  if [ -z "${GCL_TEST_ONLY:-}" ] || [[ "$1" =~ $GCL_TEST_ONLY ]]; then
+    SECTIONS_RUN=$((SECTIONS_RUN + 1)); return 0
+  fi
+  return 1
+}
+
+# Sentinel: the suite reaching its end sets DONE=1. If the EXIT trap fires with
+# DONE!=1, the suite died early (a stray exit/crash) and the assertion count is
+# unreliable — fail loudly even if the pre-trap code was 0. A bare trap `return`
+# is IGNORED (the script keeps its pre-trap code), so the guard must `exit 1`.
+# Calls the suite-local cleanup() (each suite defines its own, closing over its
+# own $WORK); whatever cleanup is in scope when the trap fires is used.
+finish() {
+  cleanup
+  if [ "${DONE:-0}" != 1 ]; then
+    echo "Bail out! suite terminated early before the plan line; ran ${TAPN:-0} assertion(s), count unreliable" >&2
+    exit 1
+  fi
+}
+
+# Selector verdict helper, called by the section-using suites just before their
+# verdict line. Two parts, both gated on GCL_TEST_ONLY being non-empty so a
+# default run stays byte-identical:
+#   1. Zero-match guard: a set-but-non-matching GCL_TEST_ONLY ran NO test block,
+#      so the (vacuously green) verdict would lie — a typo'd selector regex must
+#      FAIL, not pass with zero assertions. Bail loudly.
(The finish EXIT trap
+#      also fires here since DONE is still 0; this exit is non-zero regardless.)
+#   2. Report how many blocks the selector matched.
+# Integration does NOT call this — it is one indivisible scenario that does not
+# use section(), so it note-and-ignores GCL_TEST_ONLY at its top instead.
+selector_report() {
+  if [ -n "${GCL_TEST_ONLY:-}" ] && [ "$SECTIONS_RUN" = 0 ]; then
+    echo "Bail out! GCL_TEST_ONLY=\"$GCL_TEST_ONLY\" matched no test" >&2
+    exit 1
+  fi
+  [ -n "${GCL_TEST_ONLY:-}" ] && echo "selector GCL_TEST_ONLY=\"$GCL_TEST_ONLY\" ran $SECTIONS_RUN test block(s)"
+  return 0
+}
+
+# --- Shared timing/lock helpers (unit + interop; integration uses none) -------
+# Backdate a path's mtime by $2 seconds — how a test fakes a stale lock (the
+# lock's staleness clock is the lock FILE's own mtime, stamped by the creating
+# write). Portable: BSD/macOS touch has no `-d @epoch`, so convert the target
+# epoch to a `touch -t` stamp via GNU `date -d @` with BSD `date -r` as
+# fallback.
+epoch_to_stamp() {
+  date -d "@$1" +%Y%m%d%H%M.%S 2>/dev/null || date -r "$1" +%Y%m%d%H%M.%S 2>/dev/null
+}
+backdate() { touch -t "$(epoch_to_stamp "$(( $(date +%s) - $2 ))")" "$1"; }
+
+# Token-guarded backdate for the contended-recovery rounds (unit T2b /
+# interop T16/T16b). Why: under load a fast waiter can complete its ENTIRE steal
+# (claim -> rename-over -> ACQUIRED) before the harness's `touch` executes, so a
+# blind backdate lands on the WINNER'S freshly installed lock, making it
+# instantly stale for every rival — a legitimate re-steal then fails the round's
+# "zero 98s / exactly one STOLE-BY-CLAIM" assertions although the protocol
+# behaved exactly as designed (observed 2026-06-12 on a loaded box). Verdicts:
+#   * pre-read not the ghost: a waiter stole the ghost BEFORE the touch (it
+#     aged stale naturally during a stalled sync); no touch is performed and
+#     the round premise is gone — invalid, the caller retries the round.
+#   * post-read the ghost: conclusive — nothing ever rewrites the ghost
+#     token at the path, so the touch verifiably hit the ghost; any steal
+#     after the post-read steals an ALREADY-ancient ghost, exactly the
+#     scenario the round wants. Valid.
+#   * post-read anything else: a steal raced the touch->re-read window —
+#     COMMON under load (waiters poll every 0.05s; the post-read costs
+#     subprocess spawns), so it must not blindly invalidate. The lock's
+#     MTIME arbitrates which file the touch hit: a winner's installed lock
+#     is FRESH (the rename carries the claim file's just-created mtime), so
+#     fresh => the touch hit the GHOST and a legitimate steal followed —
+#     valid; ancient => the touch landed on the WINNER'S live lock and
+#     corrupted the round — invalid, retry. Vanished => cannot arbitrate —
+#     invalid, retry.
+backdate_ghost() { # $1=lock $2=ghost token $3=age-secs -> 0 iff the round premise is intact
+  local pre post now mt
+  pre="$(head -n 1 -- "$1" 2>/dev/null | tr -d '\r')"
+  [ "$pre" = "$2" ] || return 1
+  backdate "$1" "$3" 2>/dev/null || return 1
+  post="$(head -n 1 -- "$1" 2>/dev/null | tr -d '\r')"
+  [ "$post" = "$2" ] && return 0
+  [ -e "$1" ] || return 1
+  now="$(date +%s)"
+  mt="$(stat -c %Y -- "$1" 2>/dev/null || stat -f %m -- "$1" 2>/dev/null)" || return 1
+  [ $(( now - mt )) -lt $(( $3 / 2 )) ]
+}
+
+# Wait for every waiter's WAITING line while keeping the ghost lock FRESH
+# (touch -c to now, no-create so a released path is never resurrected): a
+# fresh ghost cannot be judged stale, so no waiter can steal it before the
+# guarded backdate — without this, a sync stalled past STALE (slow worker
+# cold starts on a loaded box) lets the ghost age stale naturally and a
+# waiter steals it mid-sync. Freshening is race-safe: if a steal slipped in
+# anyway, touching the winner's (already fresh) live lock to "now" is a
+# harmless no-op, and backdate_ghost's pre-read catches the broken premise.
+sync_waiting_fresh() { # $1=lock $2=timeout-secs $3..=waiter logs -> 0 iff all logged WAITING + local lock="$1" deadline f ok=1 + deadline=$(( $(date +%s) + $2 )); shift 2 + for f in "$@"; do + until grep -q "WAITING for lock" "$f" 2>/dev/null; do + touch -c "$lock" 2>/dev/null + if [ "$(date +%s)" -ge "$deadline" ]; then ok=0; break; fi + sleep 0.2 + done + done + [ "$ok" = 1 ] +} + +# Fabricate a lock file the way a real (foreign) holder would have written it: +# token line + owner line. The token MUST be "tok."-prefixed (wire format) or +# the steal's content guard will — correctly — refuse to steal it. +fabricate_lock() { # $1=path $2=token $3=owner + printf '%s\n%s\n' "$2" "$3" > "$1" +} + +# Wait (up to $3 seconds, default 15) for a pattern to appear in a file. Used to +# gate on the WAITING log line: proof the waiter actually contended, without a +# fixed-length hold. +wait_for_grep() { + local pat="$1" f="$2" tries=$(( ${3:-15} * 20 )) + while ! grep -q "$pat" "$f" 2>/dev/null && [ "$tries" -gt 0 ]; do sleep 0.05; tries=$((tries-1)); done + grep -q "$pat" "$f" 2>/dev/null +} diff --git a/tests/git-commit-lock.integration.test.sh b/tests/git-commit-lock.integration.test.sh index e7837f4..49badf8 100644 --- a/tests/git-commit-lock.integration.test.sh +++ b/tests/git-commit-lock.integration.test.sh @@ -36,6 +36,13 @@ # they expand inside a worker's `bash -c` invocation, not here. set -uo pipefail +# Shared harness: PASS/FAIL/TAP counters, GCL_TAP/GCL_TEST_ONLY reads, ok/bad, +# section, the finish EXIT-trap sentinel (calls our cleanup below). Resolved from +# THIS script's own dir so it sources regardless of CWD. +_HARNESS_DIR="$(CDPATH='' cd -- "$(dirname -- "${BASH_SOURCE[0]}")" && pwd)" +# shellcheck source=tests/_harness.sh +. "$_HARNESS_DIR/_harness.sh" + DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" ROOT="$(cd "$DIR/.." 
&& pwd)" # the implementations live at the repo root LIB="$ROOT/git-commit-lock.sh" @@ -59,31 +66,10 @@ cleanup() { rm -rf "$WORK" 2>/dev/null || true fi } -# Sentinel: the suite reaching its end sets DONE=1. If the EXIT trap fires with -# DONE!=1, the suite died early (a stray exit/crash) and the assertion count is -# unreliable — fail loudly even if the pre-trap code was 0. A bare trap `return` -# is IGNORED (the script keeps its pre-trap code), so the guard must `exit 1`. -finish() { - cleanup - if [ "${DONE:-0}" != 1 ]; then - echo "Bail out! suite terminated early before the plan line; ran ${TAPN:-0} assertion(s), count unreliable" >&2 - exit 1 - fi -} +# The finish EXIT-trap sentinel (defined in _harness.sh) calls the cleanup() +# above and fails loudly if the suite died before setting DONE=1. trap finish EXIT -PASS=0; FAIL=0; TAPN=0; DONE=0 -GCL_TAP="${GCL_TAP:-0}" # CI sets GCL_TAP=1 for machine-readable TAP13 output -# ok/bad are TAP-aware (gated by GCL_TAP so plain dev runs are byte-unchanged) and -# bump the running assertion number TAPN. The trailing `1..$TAPN` plan line (emitted -# just before the verdict) lets a TAP consumer fail on a short count; together with the -# DONE sentinel above this closes the silent-undercount gap. `return 0` preserves the -# "ok/bad cannot fail" property the ` && ok ... || bad ...` idiom relies on. 
-ok() { PASS=$((PASS+1)); TAPN=$((TAPN+1)); echo "PASS: $*" - [ "$GCL_TAP" = 1 ] && echo "ok $TAPN - $*"; return 0; } -bad() { FAIL=$((FAIL+1)); TAPN=$((TAPN+1)); echo "FAIL: $*" - [ "$GCL_TAP" = 1 ] && echo "not ok $TAPN - $*"; return 0; } - # --- sizing ------------------------------------------------------------------ # Commits serialise (that's the whole point), so wall time ≈ workers x commit # cost, and on this Windows/Cygwin box a spawn+add+commit is ~0.5-1s, a pwsh @@ -117,9 +103,8 @@ LK_ENV=(AGENT_LOCK_STALE_SECS=300 AGENT_LOCK_POLL_SECS=0.2 AGENT_LOCK_MAX_WAIT=2 # Note-and-ignore the per-test selector the unit/interop suites honour: this # suite is ONE indivisible scenario (Tests 1-3 share a single repo + the ALL_IDS # accumulator, and Test 3 audits Tests 1+2's output), so a per-block selector -# can't apply. If GCL_TEST_ONLY is set, say so loudly on stderr and run the -# whole scenario as normal. -GCL_TEST_ONLY="${GCL_TEST_ONLY:-}" +# can't apply. If GCL_TEST_ONLY is set (read by _harness.sh), say so loudly on +# stderr and run the whole scenario as normal. if [ -n "$GCL_TEST_ONLY" ]; then echo "NOTE: integration suite ignores GCL_TEST_ONLY=\"$GCL_TEST_ONLY\" — Tests 1-3 are one indivisible scenario (shared repo + ALL_IDS audit); running the whole suite." >&2 fi diff --git a/tests/git-commit-lock.interop.test.sh b/tests/git-commit-lock.interop.test.sh index 8bda7c7..4bad30f 100644 --- a/tests/git-commit-lock.interop.test.sh +++ b/tests/git-commit-lock.interop.test.sh @@ -40,6 +40,16 @@ # they expand inside a worker's `bash -c` or pwsh invocation, not here. set -uo pipefail +# Shared harness: PASS/FAIL/TAP counters, GCL_TAP/GCL_TEST_ONLY reads, ok/bad, +# section, the finish EXIT-trap sentinel (calls our cleanup below), and the +# shared timing/lock helpers (epoch_to_stamp, backdate, backdate_ghost, +# sync_waiting_fresh, fabricate_lock, wait_for_grep). 
Resolved from THIS +# script's own dir so it sources regardless of CWD; sourced EARLY (before any +# use of the inits/helpers below). +_HARNESS_DIR="$(CDPATH='' cd -- "$(dirname -- "${BASH_SOURCE[0]}")" && pwd)" +# shellcheck source=tests/_harness.sh +. "$_HARNESS_DIR/_harness.sh" + DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" ROOT="$(cd "$DIR/.." && pwd)" # the implementations live at the repo root SH="$ROOT/git-commit-lock.sh" @@ -67,35 +77,11 @@ WORK="$(pwsh -NoProfile -Command '[IO.Path]::Combine([IO.Path]::GetTempPath(), " WORK="${WORK//\\//}" mkdir -p "$WORK" -PASS=0; FAIL=0; TAPN=0; DONE=0; SECTIONS_RUN=0 -GCL_TAP="${GCL_TAP:-0}" # CI sets GCL_TAP=1 for machine-readable TAP13 output -# Single-test selector: GCL_TEST_ONLY= runs only the test blocks whose -# `== Test N: ==` label matches the regex (BASH regex, =~). Unset/empty -# runs every block (default). A typo'd regex that matches nothing bails out -# loudly at the verdict (the zero-match guard) rather than passing vacuously. -GCL_TEST_ONLY="${GCL_TEST_ONLY:-}" -# ok/bad are TAP-aware (gated by GCL_TAP so plain dev runs are byte-unchanged) and -# bump the running assertion number TAPN. The trailing `1..$TAPN` plan line (emitted -# just before the verdict) lets a TAP consumer fail on a short count; together with the -# DONE sentinel below this closes the silent-undercount gap. `return 0` preserves the -# "ok/bad cannot fail" property the ` && ok ... || bad ...` idiom relies on. -ok() { PASS=$((PASS+1)); TAPN=$((TAPN+1)); echo "PASS: $*" - [ "$GCL_TAP" = 1 ] && echo "ok $TAPN - $*"; return 0; } -bad() { FAIL=$((FAIL+1)); TAPN=$((TAPN+1)); echo "FAIL: $*" - [ "$GCL_TAP" = 1 ] && echo "not ok $TAPN - $*"; return 0; } - -# Per-test gate: echoes the block header (so a normal run is byte-unchanged) and -# returns success iff GCL_TEST_ONLY is unset/empty OR its regex matches the label. -# Each top-level `== Test N: ==` block is wrapped `if section "..."; then ... fi`. 
-# Bumps SECTIONS_RUN on a match so the verdict's zero-match guard can catch a -# selector regex that matched nothing. -section() { - echo "== $1 ==" - if [ -z "${GCL_TEST_ONLY:-}" ] || [[ "$1" =~ $GCL_TEST_ONLY ]]; then - SECTIONS_RUN=$((SECTIONS_RUN + 1)); return 0 - fi - return 1 -} +# The PASS/FAIL/TAP/SECTIONS_RUN inits, the GCL_TAP/GCL_TEST_ONLY reads, ok/bad, +# and section() all come from _harness.sh (sourced above). GCL_TEST_ONLY is the +# single-test selector: a regex that runs only the `== Test N: ==` +# blocks whose label matches (BASH =~); unset/empty runs every block; a typo'd +# regex that matches nothing bails out loudly at the verdict (selector_report). # Failure post-mortems need the logs: keep $WORK when anything failed, and # honour GCL_TEST_PRESERVE_DIR (the CI preserve-logs knob) by copying @@ -112,17 +98,8 @@ cleanup() { fi rm -rf "$WORK" 2>/dev/null || true } -# Sentinel: the suite reaching its end sets DONE=1. If the EXIT trap fires with -# DONE!=1, the suite died early (a stray exit/crash) and the assertion count is -# unreliable — fail loudly even if the pre-trap code was 0. A bare trap `return` -# is IGNORED (the script keeps its pre-trap code), so the guard must `exit 1`. -finish() { - cleanup - if [ "${DONE:-0}" != 1 ]; then - echo "Bail out! suite terminated early before the plan line; ran ${TAPN:-0} assertion(s), count unreliable" >&2 - exit 1 - fi -} +# The finish EXIT-trap sentinel (defined in _harness.sh) calls the cleanup() +# above and fails loudly if the suite died before setting DONE=1. trap finish EXIT # Poll for a marker file: ready-markers replace fixed head-start sleeps so a @@ -132,88 +109,11 @@ wait_for() { # $1=file $2=max iterations of 50ms (default 200 = 10s) return 1 } -# Wait (up to $3 seconds, default 15) for a pattern to appear in a file — -# used to gate on the WAITING log line (proof a waiter actually contended) -# without a fixed-length hold. Same helper as the unit suite.
-wait_for_grep() { - local pat="$1" f="$2" tries=$(( ${3:-15} * 20 )) - while ! grep -q "$pat" "$f" 2>/dev/null && [ "$tries" -gt 0 ]; do sleep 0.05; tries=$((tries-1)); done - grep -q "$pat" "$f" 2>/dev/null -} - -# Backdate a path's mtime by $2 seconds — how a test fakes a stale lock (the -# staleness clock is the lock FILE's own mtime, stamped by the creating -# write). Portable: BSD/macOS touch has no `-d @epoch`, so convert the target -# epoch to a `touch -t` stamp via GNU `date -d @` with BSD `date -r` as -# fallback (same helper as the unit suite). -epoch_to_stamp() { - date -d "@$1" +%Y%m%d%H%M.%S 2>/dev/null || date -r "$1" +%Y%m%d%H%M.%S 2>/dev/null -} -backdate() { touch -t "$(epoch_to_stamp "$(( $(date +%s) - $2 ))")" "$1"; } - -# Token-guarded backdate for the contended-recovery tests (T16/T16b; same -# guard as the unit suite's T2b — full rationale there). Why: under load a -# fast waiter can complete its ENTIRE steal (claim -> rename-over -> -# ACQUIRED) before the harness's `touch` executes, so a blind backdate lands -# on the WINNER'S freshly installed lock, making it instantly stale for -# every rival — a legitimate re-steal then fails the test's "zero 98s / -# exactly one STOLE-BY-CLAIM" assertions although the protocol behaved -# exactly as designed (observed 2026-06-12 on a loaded box: a fast pwsh -# waiter judged the FRESH ghost at age==STALE, stole and ACQUIRED before the -# touch, which then aged its live lock to 10000s and a rival re-stole it). -# Verdicts: -# * pre-read not the ghost: stolen BEFORE the touch (no touch performed) — -# invalid, the caller retries the run. -# * post-read the ghost: conclusive — the touch hit the ghost. Valid. -# * post-read anything else: a steal raced the touch->re-read window — -# COMMON under load (waiters poll every 0.05s; the post-read costs -# subprocess spawns), so it must not blindly invalidate. 
The lock's -# MTIME arbitrates which file the touch hit: a winner's installed lock -# is FRESH (the rename carries the claim file's just-created mtime), so -# fresh => the touch hit the GHOST and a legitimate steal followed — -# valid; ancient => the touch landed on the WINNER'S live lock and -# corrupted the run — invalid, retry. Vanished => cannot arbitrate — -# invalid, retry. -backdate_ghost() { # $1=lock $2=ghost token $3=age-secs -> 0 iff the run premise is intact - local pre post now mt - pre="$(head -n 1 -- "$1" 2>/dev/null | tr -d '\r')" - [ "$pre" = "$2" ] || return 1 - backdate "$1" "$3" 2>/dev/null || return 1 - post="$(head -n 1 -- "$1" 2>/dev/null | tr -d '\r')" - [ "$post" = "$2" ] && return 0 - [ -e "$1" ] || return 1 - now="$(date +%s)" - mt="$(stat -c %Y -- "$1" 2>/dev/null || stat -f %m -- "$1" 2>/dev/null)" || return 1 - [ $(( now - mt )) -lt $(( $3 / 2 )) ] -} - -# Wait for every waiter's WAITING line while keeping the ghost lock FRESH -# (touch -c to now, no-create so a released path is never resurrected): a -# fresh ghost cannot be judged stale, so no waiter can steal it before the -# guarded backdate — without this, a sync stalled past STALE (slow pwsh cold -# starts on a loaded box) lets the ghost age stale naturally and a waiter -# steals it mid-sync. Freshening is race-safe: if a steal slipped in anyway, -# touching the winner's (already fresh) live lock to "now" is a harmless -# no-op, and backdate_ghost's pre-read catches the broken premise. -sync_waiting_fresh() { # $1=lock $2=timeout-secs $3..=waiter logs -> 0 iff all logged WAITING - local lock="$1" deadline f ok=1 - deadline=$(( $(date +%s) + $2 )); shift 2 - for f in "$@"; do - until grep -q "WAITING for lock" "$f" 2>/dev/null; do - touch -c "$lock" 2>/dev/null - if [ "$(date +%s)" -ge "$deadline" ]; then ok=0; break; fi - sleep 0.2 - done - done - [ "$ok" = 1 ] -} - -# Fabricate a lock file the way a real (foreign) holder would have written it: -# token line + owner line. 
The token MUST be "tok."-prefixed (wire format) or -# the steal's content guard will — correctly — refuse to steal it. -fabricate_lock() { # $1=path $2=token $3=owner - printf '%s\n%s\n' "$2" "$3" > "$1" -} +# wait_for_grep, epoch_to_stamp, backdate, backdate_ghost, sync_waiting_fresh, +# and fabricate_lock now live in _harness.sh (sourced above) — shared +# byte-for-byte with the unit suite. (wait_for above is interop-only: its arg-2 +# is a count of 50ms iterations, distinct from the unit suite's wait_for_file +# whole-seconds semantics, so the two poll helpers stay separate.) # A pwsh process that holds the lock FILE open with FileShare.Read — the # no-delete-share handle class that blocks unlink AND rename alike (probe @@ -1442,16 +1342,12 @@ else fi echo -# Zero-match guard: a set-but-non-matching GCL_TEST_ONLY ran no test block, so -# the (vacuously green) verdict below would lie. Bail loudly instead — a typo'd -# selector regex must FAIL, not pass with zero assertions. -if [ -n "${GCL_TEST_ONLY:-}" ] && [ "$SECTIONS_RUN" = 0 ]; then - echo "Bail out! GCL_TEST_ONLY=\"$GCL_TEST_ONLY\" matched no test" >&2 - exit 1 -fi -# When a selector is active, report how many blocks it matched (the default run -# stays byte-unchanged because this is gated on GCL_TEST_ONLY being non-empty). -[ -n "${GCL_TEST_ONLY:-}" ] && echo "selector GCL_TEST_ONLY=\"$GCL_TEST_ONLY\" ran $SECTIONS_RUN test block(s)" +# Zero-match guard + selector-report line (shared helper in _harness.sh): a +# set-but-non-matching GCL_TEST_ONLY ran no test block, so the (vacuously green) +# verdict below would lie — bail loudly; a typo'd selector regex must FAIL, not +# pass with zero assertions. When the selector matched, report how many blocks +# ran. Both gated on GCL_TEST_ONLY non-empty so the default run stays unchanged. 
+selector_report DONE=1 echo "==== INTEROP RESULT: $PASS passed, $FAIL failed (fan-out: $GCL_MODE) ====" [ "$GCL_TAP" = 1 ] && echo "1..$TAPN" diff --git a/tests/git-commit-lock.test.sh b/tests/git-commit-lock.test.sh index 7fc5f2b..8b1aa08 100755 --- a/tests/git-commit-lock.test.sh +++ b/tests/git-commit-lock.test.sh @@ -25,6 +25,16 @@ # inside the worker's `bash -c`, not here. set -uo pipefail +# Shared harness: PASS/FAIL/TAP counters, GCL_TAP/GCL_TEST_ONLY reads, ok/bad, +# section, the finish EXIT-trap sentinel (calls our cleanup below), and the +# shared timing/lock helpers (epoch_to_stamp, backdate, backdate_ghost, +# sync_waiting_fresh, fabricate_lock, wait_for_grep). Resolved from THIS +# script's own dir so it sources regardless of CWD; sourced EARLY (before any +# use of the inits/helpers below). +_HARNESS_DIR="$(CDPATH='' cd -- "$(dirname -- "${BASH_SOURCE[0]}")" && pwd)" +# shellcheck source=tests/_harness.sh +. "$_HARNESS_DIR/_harness.sh" + DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" ROOT="$(cd "$DIR/.." && pwd)" # the implementations live at the repo root LIB="$ROOT/git-commit-lock.sh" @@ -51,44 +61,10 @@ cleanup() { rm -rf "$WORK" 2>/dev/null || true fi } -# Sentinel: the suite reaching its end sets DONE=1. If the EXIT trap fires with -# DONE!=1, the suite died early (a stray exit/crash) and the assertion count is -# unreliable — fail loudly even if the pre-trap code was 0. A bare trap `return` -# is IGNORED (the script keeps its pre-trap code), so the guard must `exit 1`. -finish() { - cleanup - if [ "${DONE:-0}" != 1 ]; then - echo "Bail out! suite terminated early before the plan line; ran ${TAPN:-0} assertion(s), count unreliable" >&2 - exit 1 - fi -} +# The finish EXIT-trap sentinel (defined in _harness.sh) calls the cleanup() +# above and fails loudly if the suite died before setting DONE=1. 
trap finish EXIT -PASS=0; FAIL=0; TAPN=0; DONE=0; SECTIONS_RUN=0 -GCL_TAP="${GCL_TAP:-0}" # CI sets GCL_TAP=1 for machine-readable TAP13 output -GCL_TEST_ONLY="${GCL_TEST_ONLY:-}" # if set, run ONLY test blocks whose label REGEX-matches (single-test selector) -# section() replaces each per-test header `echo "== Test N: … =="`: it echoes the -# header verbatim (visible output unchanged) and returns success — gating the -# `if section …; then … fi` block — iff GCL_TEST_ONLY is unset/empty OR its regex -# matches the label. A run-counter (SECTIONS_RUN) backs the zero-match guard below, -# so a typo'd selector regex can't masquerade as a vacuous PASS=0/FAIL=0 green. -section() { - echo "== $1 ==" - if [ -z "${GCL_TEST_ONLY:-}" ] || [[ "$1" =~ $GCL_TEST_ONLY ]]; then - SECTIONS_RUN=$((SECTIONS_RUN + 1)); return 0 - fi - return 1 -} -# ok/bad are TAP-aware (gated by GCL_TAP so plain dev runs are byte-unchanged) and -# bump the running assertion number TAPN. The trailing `1..$TAPN` plan line (emitted -# just before the verdict) lets a TAP consumer fail on a short count; together with the -# DONE sentinel above this closes the silent-undercount gap. `return 0` preserves the -# "ok/bad cannot fail" property the ` && ok ... || bad ...` idiom relies on. -ok() { PASS=$((PASS+1)); TAPN=$((TAPN+1)); echo "PASS: $*" - [ "$GCL_TAP" = 1 ] && echo "ok $TAPN - $*"; return 0; } -bad() { FAIL=$((FAIL+1)); TAPN=$((TAPN+1)); echo "FAIL: $*" - [ "$GCL_TAP" = 1 ] && echo "not ok $TAPN - $*"; return 0; } - # Envelope-tier assertions (Bucket 4 / decision D-c). A wall-clock or poll-count # bound is a Tier-2 (best-effort latency) property, NOT a correctness one (see # guarantees.md BE-1). 
In the default 'strict' tier these behave exactly like @@ -109,72 +85,8 @@ bad_envelope() { [ "$GCL_TAP" = 1 ] && echo "not ok $TAPN - $*" fi; return 0; } -# Backdate a path's mtime by $2 seconds — the lock's staleness clock is the -# lock FILE's own mtime (stamped by the creating write), so this is how a -# test fakes a stale lock. Portable: BSD touch has no `-d @epoch`, so convert -# the target epoch to a `touch -t` stamp via GNU `date -d @` with BSD -# `date -r` as fallback. -epoch_to_stamp() { - date -d "@$1" +%Y%m%d%H%M.%S 2>/dev/null || date -r "$1" +%Y%m%d%H%M.%S 2>/dev/null -} -backdate() { touch -t "$(epoch_to_stamp "$(( $(date +%s) - $2 ))")" "$1"; } - -# Token-guarded backdate for the contended-recovery rounds (T2b). Why: under -# load a fast waiter can complete its ENTIRE steal (claim -> rename-over -> -# ACQUIRED) before the harness's `touch` executes, so a blind backdate lands -# on the WINNER'S freshly installed lock, making it instantly stale for every -# rival — a legitimate re-steal then fails the round's "zero 98s / exactly -# one STOLE-BY-CLAIM" assertions although the protocol behaved exactly as -# designed (observed 2026-06-12 on a loaded box). Verdicts: -# * pre-read not the ghost: a waiter stole the ghost BEFORE the touch (it -# aged stale naturally during a stalled sync); no touch is performed and -# the round premise is gone — invalid, the caller retries the round. -# * post-read the ghost: conclusive — nothing ever rewrites the ghost -# token at the path, so the touch verifiably hit the ghost; any steal -# after the post-read steals an ALREADY-ancient ghost, exactly the -# scenario the round wants. Valid. -# * post-read anything else: a steal raced the touch->re-read window — -# COMMON under load (waiters poll every 0.05s; the post-read costs -# subprocess spawns), so it must not blindly invalidate. 
The lock's -# MTIME arbitrates which file the touch hit: a winner's installed lock -# is FRESH (the rename carries the claim file's just-created mtime), so -# fresh => the touch hit the GHOST and a legitimate steal followed — -# valid; ancient => the touch landed on the WINNER'S live lock and -# corrupted the round — invalid, retry. Vanished => cannot arbitrate — -# invalid, retry. -backdate_ghost() { # $1=lock $2=ghost token $3=age-secs -> 0 iff the round premise is intact - local pre post now mt - pre="$(head -n 1 -- "$1" 2>/dev/null | tr -d '\r')" - [ "$pre" = "$2" ] || return 1 - backdate "$1" "$3" 2>/dev/null || return 1 - post="$(head -n 1 -- "$1" 2>/dev/null | tr -d '\r')" - [ "$post" = "$2" ] && return 0 - [ -e "$1" ] || return 1 - now="$(date +%s)" - mt="$(stat -c %Y -- "$1" 2>/dev/null || stat -f %m -- "$1" 2>/dev/null)" || return 1 - [ $(( now - mt )) -lt $(( $3 / 2 )) ] -} - -# Wait for every waiter's WAITING line while keeping the ghost lock FRESH -# (touch -c to now, no-create so a released path is never resurrected): a -# fresh ghost cannot be judged stale, so no waiter can steal it before the -# guarded backdate — without this, a sync stalled past STALE (slow worker -# cold starts on a loaded box) lets the ghost age stale naturally and a -# waiter steals it mid-sync. Freshening is race-safe: if a steal slipped in -# anyway, touching the winner's (already fresh) live lock to "now" is a -# harmless no-op, and backdate_ghost's pre-read catches the broken premise. 
-sync_waiting_fresh() { # $1=lock $2=timeout-secs $3..=waiter logs -> 0 iff all logged WAITING - local lock="$1" deadline f ok=1 - deadline=$(( $(date +%s) + $2 )); shift 2 - for f in "$@"; do - until grep -q "WAITING for lock" "$f" 2>/dev/null; do - touch -c "$lock" 2>/dev/null - if [ "$(date +%s)" -ge "$deadline" ]; then ok=0; break; fi - sleep 0.2 - done - done - [ "$ok" = 1 ] -} +# epoch_to_stamp, backdate, backdate_ghost, and sync_waiting_fresh now live in +# _harness.sh (sourced above) — shared byte-for-byte with the interop suite. # Clone a shell function under a new name — the steering tests' interposition # mechanism: a sourced test shell wraps a library internal (or a command like @@ -187,31 +99,19 @@ clone_fn() { # $1=existing function $2=new name } export -f clone_fn epoch_to_stamp backdate -# Fabricate a lock file the way a real (foreign) holder would have written it: -# token line + owner line. The token MUST be "tok."-prefixed (wire format) or -# the steal's content guard will — correctly — refuse to steal it. -fabricate_lock() { # $1=path $2=token $3=owner - printf '%s\n%s\n' "$2" "$3" > "$1" -} +# fabricate_lock and wait_for_grep now live in _harness.sh (sourced above) — +# shared byte-for-byte with the interop suite. # Wait (up to $2 seconds, default 15) for a marker file to appear. Holders # touch a ready-marker as their first act INSIDE the lock; tests gate on that -# instead of sleep-margin head starts, which flaked under load. +# instead of sleep-margin head starts, which flaked under load. Unit-only: the +# interop suite has its own poll helper (wait_for, 50ms-iteration semantics). wait_for_file() { local f="$1" tries=$(( ${2:-15} * 20 )) while [ ! -e "$f" ] && [ "$tries" -gt 0 ]; do sleep 0.05; tries=$((tries-1)); done [ -e "$f" ] } -# Wait (up to $3 seconds, default 15) for a pattern to appear in a file. -# Used to gate on the WAITING log line: proof the waiter actually contended, -# without a fixed-length hold. 
-wait_for_grep() { - local pat="$1" f="$2" tries=$(( ${3:-15} * 20 )) - while ! grep -q "$pat" "$f" 2>/dev/null && [ "$tries" -gt 0 ]; do sleep 0.05; tries=$((tries-1)); done - grep -q "$pat" "$f" 2>/dev/null -} - # Critical section that loses updates without a mutex: read, gap, write+1. INCR='n="$(cat "$1")"; sleep 0.03; echo $((n+1)) > "$1"' @@ -3211,14 +3111,14 @@ fi # Test 32, the steal-path lane (F2 — rename-over won, read-back wrong) by # Test 32b. -# Zero-match guard: a set-but-non-matching GCL_TEST_ONLY ran NO test block. Without -# this, the suite would fall through to a vacuous PASS=0 FAIL=0 "green" — a typo'd -# selector regex would silently look like success. Fail loudly instead. (The finish -# EXIT trap also fires here since DONE is still 0; this exit is non-zero regardless.) -if [ -n "${GCL_TEST_ONLY:-}" ] && [ "$SECTIONS_RUN" = 0 ]; then - echo "Bail out! GCL_TEST_ONLY=\"$GCL_TEST_ONLY\" matched no test" >&2 - exit 1 -fi +# Zero-match guard + selector-report line (shared helper in _harness.sh): a +# set-but-non-matching GCL_TEST_ONLY ran NO test block, which without the guard +# would fall through to a vacuous PASS=0 FAIL=0 "green" — a typo'd selector regex +# would silently look like success; bail loudly instead. (The finish EXIT trap +# also fires there since DONE is still 0; that exit is non-zero regardless.) When +# the selector matched, it reports how many blocks ran. Both are gated on +# GCL_TEST_ONLY being non-empty, so a default run stays byte-identical. +selector_report DONE=1 echo From d2ac607e050a34e5fd9d639a49c90281aa65da28 Mon Sep 17 00:00:00 2001 From: Ben Toner Date: Thu, 18 Jun 2026 03:43:05 +1000 Subject: [PATCH 39/76] Plan changelog: Bucket 8 items 2+3 done (selector + _harness.sh extraction) Record completion of 8.2 (GCL_TEST_ONLY selector, 4ee5899) and 8.3 (tests/ _harness.sh extraction, b8e2951). (8.2 + 8.3 complete; next is Bucket 6.) Cross-platform CI verification pending. 
Co-Authored-By: Claude Opus 4.8 (1M context) --- .../2026-06-17-ci-stress-phase2-build-plan.md | 23 +++++++++++++++++++ 1 file changed, 23 insertions(+) diff --git a/.plans/2026-06-17-ci-stress-phase2-build-plan.md b/.plans/2026-06-17-ci-stress-phase2-build-plan.md index a547f8b..75700f1 100644 --- a/.plans/2026-06-17-ci-stress-phase2-build-plan.md +++ b/.plans/2026-06-17-ci-stress-phase2-build-plan.md @@ -436,3 +436,26 @@ Workflow once the final test count is known (plan D-e) — likely a Workflow for (`_harness.sh` extraction — also a large harness change) into one validated harness-restructure step near the end. **Revised phasing: 8.1 → 3 → 4 → 2A → 2B → (8.2 + 8.3 together) → 6.** +- **Step (commit `4ee5899`) — Bucket 8 item 2 done** (`GCL_TEST_ONLY` selector). Each + top-level `== Test N: … ==` header in unit + interop became `if section "Test N: …"; + then … fi` (each `fi` before the next `if section`, so trailing cleanup stays inside); + `section` runs a block iff `GCL_TEST_ONLY` is unset/empty or its regex matches, bumping + `SECTIONS_RUN`. Zero-match guard bails loudly (exit 1) on a set-but-non-matching regex + (no vacuous green). Integration note-and-ignores (one indivisible scenario). Built by 3 + parallel sub-agents (one per suite), each self-validating byte-identical + selector + precision + the guard; orchestrator re-verified independently. Validated reduced: unit + 315/0, interop 141/0, integration 12/0; selector precision proven (regex, trailing-colon + anchoring); `shellcheck -S style` clean. +- **Step (commit `b8e2951`) — Bucket 8 item 3 done** (`tests/_harness.sh` extraction, 177 + lines, net −42). Tier 1 (all three): inits + `GCL_TAP`/`GCL_TEST_ONLY` reads + `ok`/`bad` + + `section` + the `finish` sentinel + shared shellcheck disables + a unified + `selector_report` (so unit/interop match). 
Tier 2 (unit+interop, byte-identical-verified + first): `epoch_to_stamp`, `backdate`, `backdate_ghost`, `sync_waiting_fresh`, + `fabricate_lock`, `wait_for_grep`. Left per-suite: `cleanup` (closes over `$WORK`), + `clone_fn`+`export -f` (unit-only), `ok_envelope`/`bad_envelope` (unit-only), both poll + helpers (`wait_for_file` secs vs `wait_for` 50ms-iters — Tier 3, not unified), verdict + lines. CWD-independent sourcing (`BASH_SOURCE`) + `# shellcheck source=` directive; + `tests/_harness.sh` added to the CI lint list. Byte-identical (315/141/12), `shellcheck` + clean, selector/guard/integration-note all intact; orchestrator re-verified independently. +- **(8.2 + 8.3 COMPLETE.) Next: Bucket 6 (CI matrix wiring).** Cross-platform CI verification + of these two commits pending (dispatch `tests.yml` on `ci-stress`). From 6f20a5b53d7af9a16b3a667415f467284f28e14a Mon Sep 17 00:00:00 2001 From: Ben Toner Date: Thu, 18 Jun 2026 04:22:08 +1000 Subject: [PATCH 40/76] Bucket 6a: de-stress tests.yml + record no-branch-protection decision MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Ben (2026-06-18): no branch protection — single-dev project. So the planned `tests-passed` aggregator + per-job doc-only path-filter are dropped (both only served a *required* check). tests.yml keeps the simple workflow-level paths-ignore and reports per-cell matrix contexts directly. Recorded the decision at the top of Bucket 6 in the build plan. tests.yml de-stress = reverse-apply of the two stress-only commits' tests.yml hunks (precise, nothing else touched): - 980856b: per-run-unique concurrency group -> group: {workflow}-{ref} + cancel-in-progress: true. 
- b430d73: drop the stress workflow_dispatch inputs, the GCL_STRESS_* env, and the `tests/with-load.sh` wrapper on each suite (suites run un-wrapped); restore original step timeouts (unit 15win/10posix, interop 10, integration 7) and job_timeouts (ubuntu/macos 35, win-unit 20, win-interop-integration 22). The later tests/_harness.sh lint-list entry (b8e2951) is preserved. actionlint clean (-shellcheck=); no with-load/GCL_STRESS/aggregator residue. Co-Authored-By: Claude Opus 4.8 (1M context) --- .github/workflows/tests.yml | 42 ++++++------------- .../2026-06-17-ci-stress-phase2-build-plan.md | 16 +++++++ 2 files changed, 29 insertions(+), 29 deletions(-) diff --git a/.github/workflows/tests.yml b/.github/workflows/tests.yml index 268c257..2156133 100644 --- a/.github/workflows/tests.yml +++ b/.github/workflows/tests.yml @@ -17,21 +17,10 @@ on: schedule: - cron: '17 3 * * 1' # weekly Monday run: catches runner-image/tool drift workflow_dispatch: - inputs: - stress_kind: - description: 'STRESS BRANCH: artificial load during suites — none|cpu|disk|both' - default: both - stress_load: - description: 'STRESS BRANCH: hogs per kind (blank = runner core count)' - default: '' concurrency: - # STRESS-BRANCH ONLY — do NOT merge to main. The per-run-unique group + no - # cancellation lets many workflow_dispatch runs execute in parallel on this one - # branch (flakiness stress test). On main the group is - # `${{ github.workflow }}-${{ github.ref }}` with cancel-in-progress: true. - group: ${{ github.workflow }}-${{ github.ref }}-${{ github.run_id }} - cancel-in-progress: false + group: ${{ github.workflow }}-${{ github.ref }} + cancel-in-progress: true permissions: contents: read @@ -48,22 +37,17 @@ jobs: # process-spawn overhead, not the PowerShell engines). Suites must NOT run # concurrently inside one runner: they're timing-sensitive on 2-core # runners. POSIX legs are fast enough to stay single-job. 
- include: # STRESS BRANCH: job_timeouts raised to clear the summed step budgets under artificial load - - { os: ubuntu-24.04, leg: all, job_timeout: 80 } - - { os: macos-15, leg: all, job_timeout: 80 } - - { os: windows-2025, leg: unit, job_timeout: 40 } - - { os: windows-2025, leg: interop-integration, job_timeout: 50 } + include: + - { os: ubuntu-24.04, leg: all, job_timeout: 35 } + - { os: macos-15, leg: all, job_timeout: 35 } + - { os: windows-2025, leg: unit, job_timeout: 20 } + - { os: windows-2025, leg: interop-integration, job_timeout: 22 } timeout-minutes: ${{ matrix.job_timeout }} # backstop only: sum of the leg's step budgets + upload headroom defaults: run: shell: bash # on windows-2025 this is Git Bash (MINGW) — what the interop suite requires env: GCL_TEST_FULL: 1 # full fan-out — CI runners are dedicated; the reduced default protects live dev boxes (TODO 58) - # STRESS-BRANCH ONLY (do not merge): artificial CPU/disk load wrapped around each - # suite (tests/with-load.sh) to widen timing windows and surface latency/race - # flakes. From the workflow_dispatch inputs; empty on push/schedule => 'none'. 
- GCL_STRESS_KIND: ${{ inputs.stress_kind || 'none' }} - GCL_STRESS_LOAD: ${{ inputs.stress_load }} steps: - uses: actions/checkout@9f698171ed81b15d1823a05fc7211befd50c8ae0 # v6.0.3, SHA-pinned with: @@ -88,30 +72,30 @@ jobs: - name: Unit suite if: ${{ matrix.leg == 'all' || matrix.leg == 'unit' }} - timeout-minutes: ${{ matrix.os == 'windows-2025' && 30 || 25 }} # STRESS BRANCH: raised (15->30 / 10->25) so artificial load slowness doesn't trip the step timeout and masquerade as a flake + timeout-minutes: ${{ matrix.os == 'windows-2025' && 15 || 10 }} # a step timeout FAILS the step (not the job), so the upload step reliably runs; sized from run 27325978197 + one internal MAX_WAIT hang env: GCL_TEST_PRESERVE_DIR: ${{ github.workspace }}/test-output/failed-work/unit run: | mkdir -p test-output - bash tests/with-load.sh bash tests/git-commit-lock.test.sh 2>&1 | tee test-output/unit-suite.log + bash tests/git-commit-lock.test.sh 2>&1 | tee test-output/unit-suite.log - name: Interop suite (bash + pwsh) if: ${{ !cancelled() && (matrix.leg == 'all' || matrix.leg == 'interop-integration') }} # run even if an earlier suite failed — every signal is useful - timeout-minutes: 25 # STRESS BRANCH: raised 10->25 for artificial load + timeout-minutes: 10 env: GCL_TEST_PRESERVE_DIR: ${{ github.workspace }}/test-output/failed-work/interop run: | mkdir -p test-output - bash tests/with-load.sh bash tests/git-commit-lock.interop.test.sh 2>&1 | tee test-output/interop-suite.log + bash tests/git-commit-lock.interop.test.sh 2>&1 | tee test-output/interop-suite.log - name: Integration suite (real concurrent commits) if: ${{ !cancelled() && (matrix.leg == 'all' || matrix.leg == 'interop-integration') }} - timeout-minutes: 20 # STRESS BRANCH: raised 7->20 for artificial load (internal AGENT_LOCK_MAX_WAIT cap is 240s) + timeout-minutes: 7 # its internal AGENT_LOCK_MAX_WAIT cap is 240s env: GCL_TEST_PRESERVE_DIR: ${{ github.workspace }}/test-output/failed-work/integration run: | mkdir -p 
test-output - bash tests/with-load.sh bash tests/git-commit-lock.integration.test.sh 2>&1 | tee test-output/integration-suite.log + bash tests/git-commit-lock.integration.test.sh 2>&1 | tee test-output/integration-suite.log - name: Upload failure diagnostics if: ${{ failure() || cancelled() }} # failure() covers step timeouts (they fail the step); cancelled() is best-effort cover for manual cancels / the job-level backstop diff --git a/.plans/2026-06-17-ci-stress-phase2-build-plan.md b/.plans/2026-06-17-ci-stress-phase2-build-plan.md index 75700f1..8da0f2a 100644 --- a/.plans/2026-06-17-ci-stress-phase2-build-plan.md +++ b/.plans/2026-06-17-ci-stress-phase2-build-plan.md @@ -196,6 +196,22 @@ bad_envelope() { ## Bucket 6 — CI matrix wiring (the accepted load-strategy §9 decisions) +> **DECISION (Ben, 2026-06-18): NO branch protection — single-dev project.** We will not +> enforce required status checks. Consequences for this bucket: +> 1. **The `tests-passed` aggregator and the per-job doc-only path-filter (the `changes` +> job) are DROPPED.** Both existed only to make a *required* check behave well (one +> green context to require; doc-only PRs not blocked by it). With nothing required, +> `tests.yml` keeps the simple **workflow-level `paths-ignore`** and reports the per-cell +> matrix contexts directly. So **Bucket 6a = the de-stress revert only** (revert +> `980856b` + `b430d73`'s `tests.yml` half; restore original concurrency/timeouts; drop +> the stress `workflow_dispatch` inputs; suites run un-wrapped). +> 2. The 3-workflow file split (`tests.yml` / `nightly.yml` / `deep-sweep.yml`) is **kept**, +> but now purely for separation of concerns (per-PR no-load gate vs scheduled load vs +> on-demand deep) — not to stop `workflow_dispatch` publishing gating contexts (moot +> without protection). The "distinct `deep-*` job names" detail is likewise now cosmetic. 
+> The paragraphs below that describe the aggregator / path-filter / required-context gotchas +> are **SUPERSEDED** by this note; keep them only as the rationale for why they're unneeded. + **Three-workflow structure** (revised after review — a `workflow_dispatch` run publishes check contexts on the head SHA, so keeping Deep in `tests.yml` under shared job names risks a failed Deep run gating a PR; separate files + a stable required From 43cb64810f541e0b4adc77c7f27b885a41e30aa7 Mon Sep 17 00:00:00 2001 From: Ben Toner Date: Thu, 18 Jun 2026 04:22:32 +1000 Subject: [PATCH 41/76] Bucket 6b: graduate tests/with-load.sh (calibrated ratio + load-manifest) Promote with-load.sh from stress-branch scaffolding to a main-worthy, calibrated load wrapper (used by the nightly/deep-sweep tiers, not the required tests.yml): - Load expressed as oversubscription ratio R = stressors/nproc (GCL_STRESS_RATIO), with a total-ratio cap (GCL_STRESS_RATIO_MAX, default 2); GCL_STRESS_LOAD kept as a back-compat raw-count override. GCL_STRESS_KIND=none|cpu|disk|both; none/unset is a clean pass-through (zero load, propagates the wrapped command's exit code). - Prefers stress-ng when present, portable shell spinner fallback (Windows + here); disk churn via dd conv=fsync. Probe-gated Linux cgroup-v2 CPU-quota path (recorded, Linux-only; not actuated elsewhere). IO throttling intentionally not relied on. - Emits a per-run load-manifest JSON (kind, R, nproc, stressor counts, achieved slowdown, tool versions, os/arch, git sha) under test-output/ for reproducibility. - Robust teardown: every spawned stressor PID tracked and killed by exact PID on a trap (never by name); verified no leak on success and on a failing wrapped command. - Do-not-merge banner stripped. Validated locally: shellcheck -S style + bash -n clean; pass-through (none) -> exit propagated; cpu R=1 -> R*nproc spinners, 2.29x slowdown, clean reap, manifest written. 
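
A worked illustration of the ratio and cap arithmetic described above. This is a sketch only: `calc_counts` is a hypothetical helper name (not a function in tests/with-load.sh), it assumes integer R, and it ignores the GCL_STRESS_LOAD raw-count override.

```shell
#!/usr/bin/env bash
# Hypothetical helper (NOT part of tests/with-load.sh): mirrors the per-kind
# ratio and total-ratio cap arithmetic from this commit, integer-only.
# Args: kind nproc R R_max   ->   prints "cpu_count disk_count capped"
calc_counts() {
  local kind="$1" nproc="$2" ratio="$3" ratio_max="$4"
  local n_kinds=0 per_kind cpu=0 disk=0 capped=no total_cap

  # How many kinds will spawn stressors.
  case "$kind" in
    cpu|disk) n_kinds=1 ;;
    both)     n_kinds=2 ;;
  esac

  # Per-kind request: R * nproc.
  per_kind=$(( ratio * nproc ))

  if [ "$n_kinds" -gt 0 ] && [ "$per_kind" -gt 0 ]; then
    # Total stressors across all kinds must not exceed R_max * nproc,
    # but every active kind is always allowed at least one stressor.
    total_cap=$(( ratio_max * nproc ))
    if [ "$total_cap" -lt "$n_kinds" ]; then total_cap="$n_kinds"; fi
    if [ $(( per_kind * n_kinds )) -gt "$total_cap" ]; then
      per_kind=$(( total_cap / n_kinds ))
      if [ "$per_kind" -lt 1 ]; then per_kind=1; fi
      capped=yes
    fi
    case "$kind" in
      cpu)  cpu="$per_kind" ;;
      disk) disk="$per_kind" ;;
      both) cpu="$per_kind"; disk="$per_kind" ;;
    esac
  fi
  echo "$cpu $disk $capped"
}

calc_counts cpu 2 1 2    # prints "2 0 no": R=1 on 2 cores, under the cap of 4
calc_counts both 2 2 2   # prints "2 2 yes": asks for 4+4=8, cap is 2*2=4
calc_counts none 8 1 2   # prints "0 0 no": pass-through, nothing spawned
```

With the defaults (R=1, R_max=2), `both` on any core count requests exactly the cap, so it only scales down when R is raised without also raising R_max.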
Co-Authored-By: Claude Opus 4.8 (1M context) --- tests/with-load.sh | 279 +++++++++++++++++++++++++++++++++++++++++---- 1 file changed, 257 insertions(+), 22 deletions(-) diff --git a/tests/with-load.sh b/tests/with-load.sh index e19ae5f..077511f 100644 --- a/tests/with-load.sh +++ b/tests/with-load.sh @@ -1,40 +1,183 @@ #!/usr/bin/env bash -# STRESS-BRANCH ONLY — do NOT merge to main. +# with-load.sh — run a command under a calibrated, reproducible background load. # -# Run "$@" while artificial CPU and/or disk load saturates the runner, to widen the -# timing windows that latency/race flakes depend on (e.g. Test 17d's churn "absent -# window" — driven by both CPU descheduling of the churner AND slow file create/delete -# IO). Hogs are reaped by their EXACT PIDs afterward (never by name), so this is safe on -# a shared machine; on an ephemeral CI runner it is doubly safe. +# Usage: bash tests/with-load.sh <command> [args...] +# Example: bash tests/with-load.sh bash tests/git-commit-lock.test.sh # -# GCL_STRESS_KIND = none | cpu | disk | both (default: both) -# GCL_STRESS_LOAD = N hogs of EACH selected kind (default: detected core count) +# Wraps "$@", applies artificial background load for the command's lifetime, then +# tears the load down (by EXACT spawned PIDs — never by name, so it is safe on a +# shared dev box and doubly safe on an ephemeral CI runner) and exits with the +# wrapped command's exit code. # -# CPU hog = a bare bash spin loop (one core each). -# Disk hog = a tight create / write+fsync / delete loop of a small file on the same -# volume as the test's scratch dir (TMPDIR) — metadata + write-back pressure -# that contends with the lock-file create/delete the suite itself does. +# WHY load exists here (see docs/load-testing-strategy.md §1): this protocol's +# *correctness* is load-independent (O_EXCL + atomic rename + per-attempt tokens +# never consult the clock for a correctness decision), so load cannot break +# exclusion.
Load's only jobs are (J1) perturb scheduling so the protocol's +# multi-syscall sequences get preempted at adversarial points, and (J2) stretch +# the few genuinely timing-derived decisions. Magnitude past ~2x CPU +# oversubscription mostly manufactures harness wall-clock flakes, not bugs — which +# is why load is expressed as an oversubscription RATIO and the total ratio is +# CAPPED. +# +# ── Calibrated interface (the contract nightly/deep-sweep CI calls against) ────── +# +# GCL_STRESS_KIND none | cpu | disk | both (default: none) +# none/unset => CLEAN PASS-THROUGH: zero added load, the +# command's exit code is propagated verbatim. +# +# GCL_STRESS_RATIO Oversubscription ratio R = stressors / nproc, PER KIND. +# (default: 1) Stressors-per-kind = round(R * nproc), +# floored at 1 when a kind is selected. Runner-independent: +# "R=2" means the same pressure on a 2-core and a 32-core box, +# whereas a raw hog count does not. +# +# GCL_STRESS_RATIO_MAX Cap on the TOTAL oversubscription ratio across all kinds +# (default: 2). `both` runs cpu + disk, so its total ratio is +# 2*R; this cap scales each kind's stressor count down +# proportionally so the runner is never wedged. Set the +# deep-sweep flake-hunt higher deliberately. +# +# GCL_STRESS_LOAD BACK-COMPAT raw-count override. If set to a positive +# integer it REPLACES the ratio computation: exactly N +# stressors per selected kind (still capped by RATIO_MAX +# unless GCL_STRESS_RATIO_MAX is also raised). Empty/unset => +# use the ratio. Kept so the existing deep-sweep +# `stress_load=N` dispatch input keeps working. +# +# GCL_STRESS_CGROUP 1 => on Linux with a writable cgroup v2 cpu controller, +# PROBE the calibrated cgroup CPU-quota path (envelope leg). +# The probe is recorded in the manifest. cgroup IO throttling +# is experimental and intentionally NOT attempted here. +# (default: 0) Absent/unwritable => fall back to spinners. 
+# +# GCL_LOAD_MANIFEST Path for the per-run load-manifest JSON +# (default: test-output/load-manifest.$$.json, created +# under a known dir so CI can upload it). One file per run, +# capturing {kind, R, nproc, stressor counts, achieved +# slowdown, tool versions, os/arch, git sha} so any flake is +# reproducible. Written on success too. +# +# CPU stressor: `stress-ng --cpu` when available (calibrated, measurable), else a +# portable bash spin loop (one busy core each). +# Disk stressor: a tight create / write+fsync / delete loop over a small file on the +# same volume as the test scratch dir — metadata + write-back pressure +# that contends with the lock-file create/delete the suite itself does. +# (Always the portable shell hog; cross-platform, low-fidelity but real +# metadata-op pressure — see strategy §4.) set -uo pipefail -kind="${GCL_STRESS_KIND:-both}" -cores="$(nproc 2>/dev/null || sysctl -n hw.ncpu 2>/dev/null || echo 4)" -load="${GCL_STRESS_LOAD:-$cores}" -case "$load" in ''|*[!0-9]*) load="$cores" ;; esac # guard non-numeric / empty +# ── Inputs ─────────────────────────────────────────────────────────────────── +kind="${GCL_STRESS_KIND:-none}" +nproc_count="$(nproc 2>/dev/null || sysctl -n hw.ncpu 2>/dev/null || echo 4)" +case "$nproc_count" in ''|*[!0-9]*) nproc_count=4 ;; esac +[ "$nproc_count" -lt 1 ] && nproc_count=1 + +ratio="${GCL_STRESS_RATIO:-1}" +case "$ratio" in ''|*[!0-9]*) ratio=1 ;; esac # integer ratios only (R in {0,1,2,…}) + +ratio_max="${GCL_STRESS_RATIO_MAX:-2}" +case "$ratio_max" in ''|*[!0-9]*) ratio_max=2 ;; esac + +raw_load="${GCL_STRESS_LOAD:-}" +case "$raw_load" in *[!0-9]*) raw_load="" ;; esac # non-numeric => ignore, use ratio + +manifest="${GCL_LOAD_MANIFEST:-test-output/load-manifest.$$.json}" + +# ── Stressor-count calibration ───────────────────────────────────────────────── +# Per-kind count: raw-count override wins, else round(R * nproc) floored at 1.
+if [ -n "$raw_load" ]; then + per_kind="$raw_load" +else + per_kind=$(( ratio * nproc_count )) + [ "$ratio" -gt 0 ] && [ "$per_kind" -lt 1 ] && per_kind=1 +fi + +# How many kinds spawn stressors. +n_kinds=0 +case "$kind" in + cpu|disk) n_kinds=1 ;; + both) n_kinds=2 ;; +esac + +# R_total cap: total stressors must not exceed ratio_max * nproc. `both` would +# otherwise be 2*per_kind; scale each kind down proportionally if it would breach. +cpu_count=0 +disk_count=0 +capped="no" +if [ "$n_kinds" -gt 0 ] && [ "$per_kind" -gt 0 ]; then + total_cap=$(( ratio_max * nproc_count )) + [ "$total_cap" -lt "$n_kinds" ] && total_cap="$n_kinds" # always allow >=1 per active kind + requested_total=$(( per_kind * n_kinds )) + if [ "$requested_total" -gt "$total_cap" ]; then + per_kind=$(( total_cap / n_kinds )) + [ "$per_kind" -lt 1 ] && per_kind=1 + capped="yes" + fi + case "$kind" in + cpu) cpu_count="$per_kind" ;; + disk) disk_count="$per_kind" ;; + both) cpu_count="$per_kind"; disk_count="$per_kind" ;; + esac +fi +# ── Tool discovery ───────────────────────────────────────────────────────────── +stress_ng_bin="$(command -v stress-ng 2>/dev/null || true)" +stress_ng_ver="none" +[ -n "$stress_ng_bin" ] && stress_ng_ver="$("$stress_ng_bin" --version 2>/dev/null | head -1 | tr -d '\r')" +bash_ver="$(bash --version 2>/dev/null | head -1 | tr -d '\r')" +os_uname="$(uname -srm 2>/dev/null | tr -d '\r' || echo unknown)" +git_sha="$(git rev-parse --short HEAD 2>/dev/null || echo unknown)" + +# CPU mechanism actually used. +cpu_mech="none" +[ "$cpu_count" -gt 0 ] && { [ -n "$stress_ng_bin" ] && cpu_mech="stress-ng" || cpu_mech="spinner"; } + +# ── cgroup v2 CPU-quota probe (Linux envelope leg only; probe-gated) ─────────── +# We only PROBE writability + record it; we do not create scopes here (that needs a +# usable systemd manager — see strategy §3). IO throttling is experimental: skipped. 
+cgroup_probe="not-requested" +if [ "${GCL_STRESS_CGROUP:-0}" = 1 ]; then + cgroup_probe="unavailable" + if [ "$(uname -s 2>/dev/null)" = "Linux" ] && [ -r /sys/fs/cgroup/cgroup.controllers ]; then + if grep -qw cpu /sys/fs/cgroup/cgroup.controllers 2>/dev/null; then + # cpu controller present at the v2 root; is a cpu.max writable in our subtree? + if [ -w /sys/fs/cgroup/cgroup.subtree_control ] 2>/dev/null; then + cgroup_probe="writable" # the calibrated quota path is reachable on this leg + else + cgroup_probe="present-not-delegated" + fi + else + cgroup_probe="no-cpu-controller" + fi + else + cgroup_probe="no-cgroup-v2" + fi +fi + +# ── Stressor scratch dir (same volume as the test scratch) ───────────────────── hogdir="${TMPDIR:-/tmp}/gcl-stress.$$" mkdir -p "$hogdir" 2>/dev/null || hogdir="." +# ── Spawn / teardown (track EXACT PIDs; kill only those) ─────────────────────── hogs=() + spawn_cpu() { local i - for ((i = 0; i < load; i++)); do - bash -c 'while :; do :; done' & + if [ "$cpu_mech" = "stress-ng" ]; then + # One stress-ng manager spawning $cpu_count workers; reap the manager's PID. + "$stress_ng_bin" --cpu "$cpu_count" --cpu-load 100 >/dev/null 2>&1 & hogs+=("$!") - done + else + for ((i = 0; i < cpu_count; i++)); do + bash -c 'while :; do :; done' & + hogs+=("$!") + done + fi } + spawn_disk() { local i - for ((i = 0; i < load; i++)); do + for ((i = 0; i < disk_count; i++)); do bash -c ' d="$1"; j=0 while :; do @@ -46,24 +189,116 @@ spawn_disk() { hogs+=("$!") done } + cleanup() { local p for p in "${hogs[@]:-}"; do [ -n "$p" ] && kill "$p" 2>/dev/null done + # stress-ng forks workers under its manager; kill the worker group too (only the + # manager PIDs we spawned are used as the group leader — never a name match). 
+ if [ "$cpu_mech" = "stress-ng" ]; then + for p in "${hogs[@]:-}"; do + [ -n "$p" ] && kill -- "-$p" 2>/dev/null # negative PID = the manager's process group + done + fi rm -rf "$hogdir" 2>/dev/null } trap cleanup EXIT INT TERM +# ── Achieved-slowdown micro-benchmark (cheap fixed busy-loop, baseline vs loaded) ─ +# A small fixed integer loop timed once unloaded (baseline) and once mid-load gives a +# coarse, reproducible "how much did this load slow a CPU-bound task" figure for the +# manifest. Pure bash, no deps. Only run when load is actually applied — on the +# none/pass-through path it would be pure overhead. +micro_bench() { + local start end k=0 + start="$(date +%s%N 2>/dev/null || echo 0)" + while [ "$k" -lt 50000 ]; do k=$((k + 1)); done + end="$(date +%s%N 2>/dev/null || echo 0)" + echo $(( (end - start) / 1000000 )) # ms +} + +# Will any stressors spawn? (kind selected AND a positive per-kind count.) +will_load="no" +case "$kind" in + cpu) [ "$cpu_count" -gt 0 ] && will_load="yes" ;; + disk) [ "$disk_count" -gt 0 ] && will_load="yes" ;; + both) { [ "$cpu_count" -gt 0 ] || [ "$disk_count" -gt 0 ]; } && will_load="yes" ;; +esac + +base_ms=0 +loaded_ms=0 +slowdown="1.00" +[ "$will_load" = yes ] && base_ms="$(micro_bench)" + +# ── Apply load ───────────────────────────────────────────────────────────────── case "$kind" in cpu) spawn_cpu ;; disk) spawn_disk ;; both) spawn_cpu; spawn_disk ;; none) : ;; - *) echo "with-load: unknown GCL_STRESS_KIND='$kind' — running with NO load" >&2 ;; + *) echo "with-load: unknown GCL_STRESS_KIND='$kind' — running with NO load" >&2; kind="none" ;; esac -echo "stress: kind=$kind load=$load cores=$cores hogs=${#hogs[@]} :: $*" +if [ "${#hogs[@]}" -gt 0 ] && [ "$base_ms" -gt 0 ]; then + loaded_ms="$(micro_bench)" + # slowdown = loaded/base to 2 dp, integer-only arithmetic. Pad the centi-value to + # >=3 digits so the integer part is always whatever precedes the last 2 digits + # (handles slowdown <1.00 from timing noise, e.g. 
80 -> "0.80"). + centi="$(( loaded_ms * 100 / base_ms ))" + while [ "${#centi}" -lt 3 ]; do centi="0$centi"; done + slowdown="${centi%??}.${centi: -2}" +fi + +# ── Write the load-manifest (best-effort; never fails the run) ────────────────── +write_manifest() { + local dir + dir="$(dirname "$manifest")" + mkdir -p "$dir" 2>/dev/null || return 0 + # Hand-rolled JSON (no jq/python dependency on the runner). Escape the JSON-special + # chars in string values: backslash, double-quote, and the control chars that the + # wrapped command line can legitimately contain (newline/tab/CR) — a raw newline in + # a value is invalid JSON. awk keeps this robust where sed's newline handling is not. + esc() { + printf '%s' "$1" | awk ' + BEGIN { ORS = "" } + { + if (NR > 1) printf "\\n" # join input lines with an escaped newline + gsub(/\\/, "\\\\"); gsub(/"/, "\\\""); gsub(/\t/, "\\t"); gsub(/\r/, "\\r") + print + }' + } + { + printf '{\n' + printf ' "kind": "%s",\n' "$(esc "$kind")" + printf ' "ratio_R": %s,\n' "$ratio" + printf ' "ratio_max": %s,\n' "$ratio_max" + printf ' "raw_load_override": "%s",\n' "$(esc "${raw_load:-}")" + printf ' "nproc": %s,\n' "$nproc_count" + printf ' "cpu_stressors": %s,\n' "$cpu_count" + printf ' "disk_stressors": %s,\n' "$disk_count" + printf ' "total_stressors": %s,\n' "${#hogs[@]}" + printf ' "ratio_total_capped": "%s",\n' "$capped" + printf ' "cpu_mechanism": "%s",\n' "$(esc "$cpu_mech")" + printf ' "cgroup_cpu_probe": "%s",\n' "$(esc "$cgroup_probe")" + printf ' "baseline_ms": %s,\n' "$base_ms" + printf ' "loaded_ms": %s,\n' "$loaded_ms" + printf ' "achieved_slowdown": %s,\n' "$slowdown" + printf ' "stress_ng_version": "%s",\n' "$(esc "$stress_ng_ver")" + printf ' "bash_version": "%s",\n' "$(esc "$bash_ver")" + printf ' "os_arch": "%s",\n' "$(esc "$os_uname")" + printf ' "git_sha": "%s",\n' "$(esc "$git_sha")" + printf ' "command": "%s"\n' "$(esc "$*")" + printf '}\n' + } > "$manifest" 2>/dev/null || true +} +write_manifest "$@" + +echo "stress: 
kind=$kind R=$ratio nproc=$nproc_count cpu=$cpu_count disk=$disk_count" \ + "mech=$cpu_mech capped=$capped slowdown=${slowdown}x manifest=$manifest :: $*" + +# ── Run the wrapped command, tear down, propagate its exit code ───────────────── "$@" rc=$? From 36b0033eec931ef0e9552abecc9295905892054d Mon Sep 17 00:00:00 2001 From: Ben Toner Date: Thu, 18 Jun 2026 04:24:05 +1000 Subject: [PATCH 42/76] chore: gitignore test-output/ (runtime CI/test artifact dir) CI jobs `mkdir -p test-output` for suite logs, and the graduated with-load.sh writes its load-manifest there. It's never committed; ignore it so it can't be swept into a commit locally. Co-Authored-By: Claude Opus 4.8 (1M context) --- .gitignore | 3 +++ 1 file changed, 3 insertions(+) diff --git a/.gitignore b/.gitignore index 9bdb6bd..abf679e 100644 --- a/.gitignore +++ b/.gitignore @@ -11,3 +11,6 @@ Thumbs.db /.agent/review-queue.lock.* /.agent/last-opened /.agent/.tmp.* + +# Test/CI artifact output (manifests, suite logs); created at runtime, never committed. +test-output/ From 6a33cbe37116a852cb9e7bb12f47808ca3300ec4 Mon Sep 17 00:00:00 2001 From: Ben Toner Date: Thu, 18 Jun 2026 05:58:01 +1000 Subject: [PATCH 43/76] Bucket 6e: Axis-A waiter-count sweep (GCL_TEST_SWEEP), nightly/deep-only Parametrize the fan-out/contention tests over a waiter-count axis so the nightly/deep CI tiers exercise more surface, while per-PR (default) runs stay byte-identical. - T_AXIS_A list + GCL_TEST_SWEEP read in tests/_harness.sh: unset/0 -> "4" (today's floor, deterministic); =1 -> "4 12 24". - Test 2b, Test 20 (unit) and interop Test 16 loop their waiter count N over T_AXIS_A, naming N in every assertion. Test 20 keeps its mode floor and appends 12,24 when sweeping. - Anti-flake discipline: correctness assertions stay strict (ok/bad) and config-independent; MAX_WAIT and STALE scale with N (a real N=24 over-steal was caught and fixed by scaling STALE>=N when sweeping, keeping exactly-one- steal strict at every N). 
Codex-reviewed default-byte-identicality hardenings adopted (Test 20 default MAX_WAIT, fixture-token N-segment, recov-log glob). Validated: default unit 315/0 + interop 141/0 (byte-identical); GCL_TEST_SWEEP=1 unit 337/0 + interop 163/0, all N pass; selector still works; shellcheck clean. Co-Authored-By: Claude Opus 4.8 (1M context) --- tests/_harness.sh | 15 +++ tests/git-commit-lock.interop.test.sh | 126 ++++++++++++--------- tests/git-commit-lock.test.sh | 155 +++++++++++++++++++------- 3 files changed, 206 insertions(+), 90 deletions(-) diff --git a/tests/_harness.sh b/tests/_harness.sh index d5d8215..88b344c 100644 --- a/tests/_harness.sh +++ b/tests/_harness.sh @@ -35,6 +35,21 @@ PASS=0; FAIL=0; TAPN=0; DONE=0; SECTIONS_RUN=0 GCL_TAP="${GCL_TAP:-0}" # CI sets GCL_TAP=1 for machine-readable TAP13 output GCL_TEST_ONLY="${GCL_TEST_ONLY:-}" # if set, run ONLY test blocks whose label REGEX-matches (single-test selector) +# Axis-A waiter-count sweep (Bucket 6). GCL_TEST_SWEEP=1 (nightly/deep CI) widens +# the fan-out/contention tests over several waiter counts to wring more coverage +# from the existing tests; unset/0 (per-PR default + plain dev) keeps the floor so +# default runs are byte-identical to today. T_AXIS_A is the shared waiter-count +# list the contention tests (unit Test 2b, interop Test 16) iterate N over; each +# names N in every assertion message so a sweep failure says which N broke. The +# floor is 4 — the count those two tests hardcode today, so the single-element +# default reproduces today's behaviour exactly. (Test 20's floor is mode-driven +# `$T20_N` (5 REDUCED / 10 FULL), not 4, so it composes its own list from $T20_N + +# the sweep's higher counts rather than from T_AXIS_A — see that test.) +GCL_TEST_SWEEP="${GCL_TEST_SWEEP:-0}" +# shellcheck disable=SC2034 # T_AXIS_A is consumed by the sourcing suites (unit +# Test 2b, interop Test 16), not within this harness file. 
+if [ "$GCL_TEST_SWEEP" = 1 ]; then T_AXIS_A="4 12 24"; else T_AXIS_A="4"; fi + # ok/bad are TAP-aware (gated by GCL_TAP so plain dev runs are byte-unchanged) and # bump the running assertion number TAPN. The trailing `1..$TAPN` plan line (emitted # by each suite just before its verdict) lets a TAP consumer fail on a short count; diff --git a/tests/git-commit-lock.interop.test.sh b/tests/git-commit-lock.interop.test.sh index 4bad30f..0244d1a 100644 --- a/tests/git-commit-lock.interop.test.sh +++ b/tests/git-commit-lock.interop.test.sh @@ -838,18 +838,18 @@ fi if section "Test 16: crash recovery under CONTENTION, mixed impls — claim-serialized: zero displacement, zero 98s"; then # Cross-impl variant of the unit suite's Test 2b (which carries the full -# rationale): 2 bash + 2 pwsh waiters race ONE crashed lock. Under the claim -# protocol the straggler-robs-recovery-winner race is PREVENTED (the claim -# serializes stealers across the wire format), not detected-and-repaired, so -# the assertions are strict: every waiter exits 0 (zero spurious 98s — an -# unserialized implementation displaces the recovery winner near-certainly), -# exactly ONE STOLE-BY-CLAIM, NO move-aside file ever exists (an -# implementation that staged the steal through an intermediate .dead.* file -# would re-open the displacement race; a background sampler proves no such -# file ever appears — and the unserialized "STOLE stale lock" line shape and -# any STEAL-DISPLACED repair line must never appear), and the final state -# is clean (no lock, no claim). Sync: waiters launch against a FRESH -# fabricated lock and only once all four have logged WAITING is it +# rationale): N waiters split half bash / half pwsh race ONE crashed lock. 
+# Under the claim protocol the straggler-robs-recovery-winner race is +# PREVENTED (the claim serializes stealers across the wire format), not +# detected-and-repaired, so the assertions are strict: every waiter exits 0 +# (zero spurious 98s — an unserialized implementation displaces the recovery +# winner near-certainly), exactly ONE STOLE-BY-CLAIM, NO move-aside file ever +# exists (an implementation that staged the steal through an intermediate +# .dead.* file would re-open the displacement race; a background sampler proves +# no such file ever appears — and the unserialized "STOLE stale lock" line +# shape and any STEAL-DISPLACED repair line must never appear), and the final +# state is clean (no lock, no claim). Sync: waiters launch against a FRESH +# fabricated lock and only once all have logged WAITING is it # backdated, so all judge stale within one poll window despite pwsh's slow # cold start; the sync keeps the ghost fresh while it waits # (sync_waiting_fresh) so a stalled sync can't let the ghost age stale on @@ -861,13 +861,34 @@ if section "Test 16: crash recovery under CONTENTION, mixed impls — claim-seri # the run's premise is broken (the touch may have aged the WINNER'S live # lock), so the run is discarded and retried (bounded) instead of failing # assertions the protocol never violated. +# +# Waiter count is swept over $T_AXIS_A (Bucket 6): one iteration at N=4 by +# default (2 bash + 2 pwsh — byte-identical to today) and at N=4,12,24 under +# GCL_TEST_SWEEP=1. N is split into a bash half (N/2) and a pwsh half (the +# remainder); at N=4 that is 2+2 exactly. The correctness invariants stay strict +# at EVERY N — but that needs STALE >> the winner's EFFECTIVE hold, which grows +# with N under load (the winner is one of N concurrent processes), so STALE is +# floored to N when sweeping (t16_stale); at the default floor it is the same 8 +# as today. MAX_WAIT scales too (30*N => 120 at N=4) so a wide, pwsh-cold-start- +# heavy sweep has time to drain. 
The per-N tag on the non-count-naming +# assertions is suppressed in the default run so the messages stay byte-identical. LOCK="$WORK/recov.lock" T16_TRIES=3 T16_GRAVESEEN="$WORK/recov.graveseen"; T16_SAMPSTOP="$WORK/recov.sampstop" +for T16_N in $T_AXIS_A; do +t16_nsh=$(( T16_N / 2 )); t16_nps=$(( T16_N - t16_nsh )) # bash half + pwsh half (2+2 at N=4) +t16_maxwait=$(( 30 * T16_N )) +# STALE budget: today's 8 in the default (non-sweep) run for byte-identical +# behaviour; when sweeping, floor it to N so a wide fan-out's load-stretched +# winner hold can never make its own live lock look stale (a legitimate but +# unwanted second steal), keeping "exactly one steal" strict at every N. +if [ "$GCL_TEST_SWEEP" = 1 ] && [ "$T16_N" -gt 8 ]; then t16_stale="$T16_N"; else t16_stale=8; fi +if [ "$GCL_TEST_SWEEP" = 1 ]; then t16_ntag=" at N=$T16_N"; else t16_ntag=""; fi t16_valid=0; t16_sync=1; t16_fail=0; n98=0 for t16_try in $(seq 1 "$T16_TRIES"); do - T16_GHOST="tok.ghost.recov.$t16_try" - rm -f "$WORK"/recov.ran.* "$T16_GRAVESEEN" "$T16_SAMPSTOP" "$LOCK" "$LOCK.next" 2>/dev/null + T16_GHOST="tok.ghost.recov.$T16_N.$t16_try" + rm -f "$WORK"/recov.ran.* "$WORK"/recov-sh*.log "$WORK"/recov-ps*.log \ + "$T16_GRAVESEEN" "$T16_SAMPSTOP" "$LOCK" "$LOCK.next" 2>/dev/null fabricate_lock "$LOCK" "$T16_GHOST" "pid=999 host=ghost" # fresh mtime: not yet stale ( while [ ! -e "$T16_SAMPSTOP" ]; do @@ -878,41 +899,45 @@ for t16_try in $(seq 1 "$T16_TRIES"); do done ) & t16_sampler=$! 
- pids=() - for i in 1 2; do + pids=(); t16_logs=() + for i in $(seq 1 "$t16_nsh"); do : > "$WORK/recov-sh$i.log" # per-waiter logs: concurrent appends to one log drop lines - AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$WORK/recov-sh$i.log" AGENT_LOCK_STALE_SECS=8 \ - AGENT_LOCK_CLAIM_STALE_SECS=60 AGENT_LOCK_POLL_SECS=0.05 AGENT_LOCK_MAX_WAIT=120 \ + t16_logs+=("$WORK/recov-sh$i.log") + AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$WORK/recov-sh$i.log" AGENT_LOCK_STALE_SECS="$t16_stale" \ + AGENT_LOCK_CLAIM_STALE_SECS=60 AGENT_LOCK_POLL_SECS=0.05 AGENT_LOCK_MAX_WAIT="$t16_maxwait" \ bash "$SH" run -- bash -c 'touch "$1"; sleep 0.1' _ "$WORK/recov.ran.sh$i" 2>/dev/null & pids+=($!) done - for i in 1 2; do + for i in $(seq 1 "$t16_nps"); do : > "$WORK/recov-ps$i.log" - AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$WORK/recov-ps$i.log" AGENT_LOCK_STALE_SECS=8 \ - AGENT_LOCK_CLAIM_STALE_SECS=60 AGENT_LOCK_POLL_SECS=0.05 AGENT_LOCK_MAX_WAIT=120 \ + t16_logs+=("$WORK/recov-ps$i.log") + AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$WORK/recov-ps$i.log" AGENT_LOCK_STALE_SECS="$t16_stale" \ + AGENT_LOCK_CLAIM_STALE_SECS=60 AGENT_LOCK_POLL_SECS=0.05 AGENT_LOCK_MAX_WAIT="$t16_maxwait" \ pwsh -NoProfile -File "$PS1WIN" run "[IO.File]::WriteAllText('$WORK/recov.ran.ps$i', 'x'); Start-Sleep -Milliseconds 100" 2>/dev/null & pids+=($!) done t16_sync=1 - if ! sync_waiting_fresh "$LOCK" 90 "$WORK/recov-sh1.log" "$WORK/recov-sh2.log" \ - "$WORK/recov-ps1.log" "$WORK/recov-ps2.log"; then + if ! sync_waiting_fresh "$LOCK" 90 "${t16_logs[@]}"; then t16_sync=0 - for f in "$WORK/recov-sh1.log" "$WORK/recov-sh2.log" "$WORK/recov-ps1.log" "$WORK/recov-ps2.log"; do - grep -q "WAITING for lock" "$f" 2>/dev/null || echo " T16 waiter never contended (no WAITING in ${f##*/})" + for f in "${t16_logs[@]}"; do + grep -q "WAITING for lock" "$f" 2>/dev/null || echo " T16 N=$T16_N waiter never contended (no WAITING in ${f##*/})" done fi - backdate_ghost "$LOCK" "$T16_GHOST" 9999; t16_bd=$? 
# all four now judge the ghost stale together + backdate_ghost "$LOCK" "$T16_GHOST" 9999; t16_bd=$? # all waiters now judge the ghost stale together t16_fail=0; n98=0 for p in "${pids[@]}"; do wait "$p"; rc=$? case "$rc" in 0) ;; - 98) n98=$((n98+1)); echo " T16 waiter rc=98 — displacement under the claim protocol" ;; - *) t16_fail=1; echo " T16 waiter rc=$rc (want 0)" ;; + 98) n98=$((n98+1)); echo " T16 N=$T16_N waiter rc=98 — displacement under the claim protocol" ;; + *) t16_fail=1; echo " T16 N=$T16_N waiter rc=$rc (want 0)" ;; esac done touch "$T16_SAMPSTOP"; wait "$t16_sampler" 2>/dev/null - cat "$WORK"/recov-*.log > "$WORK/recov-all.log" 2>/dev/null || : > "$WORK/recov-all.log" + # Aggregate from the explicit per-waiter log list, NOT a recov-*.log glob: the + # glob would also match recov-all.log itself, which now persists across sweep N + # iterations, so a glob could self-cat a stale aggregate into the count. + cat "${t16_logs[@]}" > "$WORK/recov-all.log" 2>/dev/null || : > "$WORK/recov-all.log" if [ "$t16_bd" != 0 ]; then # The backdate was NOT conclusively clean (see backdate_ghost; under # load the whole steal+release cycle often completes before the @@ -929,7 +954,7 @@ for t16_try in $(seq 1 "$T16_TRIES"); do [ "$(grep -c "lock LOST" "$WORK/recov-all.log")" = 0 ] || t16_dirty=1 { [ -e "$LOCK" ] || [ -e "$LOCK.next" ]; } && t16_dirty=1 if [ "$t16_dirty" = 1 ]; then - echo " T16 try $t16_try: non-conclusive backdate AND dirty outcome — attempt discarded, retrying" + echo " T16 N=$T16_N try $t16_try: non-conclusive backdate AND dirty outcome — attempt discarded, retrying" rm -f "$LOCK" "$LOCK.next" 2>/dev/null continue fi @@ -944,30 +969,31 @@ if [ "$t16_valid" = 1 ]; then nold="$(grep -c "STOLE stale lock" "$WORK/recov-all.log")" ndisp="$(grep -c "STEAL-DISPLACED" "$WORK/recov-all.log")" [ "$t16_fail" = 0 ] && [ "$t16_sync" = 1 ] \ - && ok "2 bash + 2 pwsh waiters on one crashed lock: every waiter exited 0" \ - || bad "mixed crash-recovery exits wrong 
(see above)" - [ "$n98" = 0 ] && ok "zero spurious 98s — the claim serialized recovery across implementations" \ - || bad "$n98 waiter(s) exited 98 — displacement happened under the claim protocol" - [ "$nran" = 4 ] && ok "all 4 waiter commands ran" || bad "only $nran/4 waiter commands ran" - [ "$nstole" = 1 ] && ok "exactly ONE STOLE-BY-CLAIM (the claim serialized the cross-impl recovery)" \ - || bad "STOLE-BY-CLAIM x$nstole (want exactly 1)" + && ok "$t16_nsh bash + $t16_nps pwsh waiters on one crashed lock: every waiter exited 0" \ + || bad "mixed crash-recovery exits wrong$t16_ntag (see above)" + [ "$n98" = 0 ] && ok "zero spurious 98s$t16_ntag — the claim serialized recovery across implementations" \ + || bad "$n98 waiter(s) exited 98$t16_ntag — displacement happened under the claim protocol" + [ "$nran" = "$T16_N" ] && ok "all $T16_N waiter commands ran" || bad "only $nran/$T16_N waiter commands ran" + [ "$nstole" = 1 ] && ok "exactly ONE STOLE-BY-CLAIM$t16_ntag (the claim serialized the cross-impl recovery)" \ + || bad "STOLE-BY-CLAIM x$nstole$t16_ntag (want exactly 1)" grep -q "STOLE-BY-CLAIM.*ghost=pid=999 host=ghost" "$WORK/recov-all.log" \ - && ok "the steal line attributes the crashed ghost cross-impl (wire-format line 2 parsed)" \ - || bad "STOLE-BY-CLAIM does not carry the ghost's line-2 attribution" + && ok "the steal line attributes the crashed ghost cross-impl (wire-format line 2 parsed)$t16_ntag" \ + || bad "STOLE-BY-CLAIM does not carry the ghost's line-2 attribution$t16_ntag" grep -q "CLAIM .*tok=tok\." "$WORK/recov-all.log" \ - && ok "claim create logged with its per-attempt token (CLAIM ... 
tok=)" \ - || bad "no CLAIM line with a token in the recovery logs" - [ "$nold" = 0 ] && ok "unserialized-steal line shape ('STOLE stale lock') never logged" \ - || bad "'STOLE stale lock' shape appeared x$nold — an unserialized steal lane is present" - [ "$ndisp" = 0 ] && ok "zero STEAL-DISPLACED lines (prevention, not detect-and-repair)" \ - || bad "STEAL-DISPLACED fired x$ndisp — displacement-repair machinery present?" - [ -e "$T16_GRAVESEEN" ] && bad "a move-aside file (.dead.*) existed during recovery — the steal is staged through an intermediate file!" \ - || ok "no move-aside file (.dead.*) ever existed during recovery (sampler)" - [ -e "$LOCK" ] && bad "leftover crash-recovery lock" || ok "no leftover lock" - [ -e "$LOCK.next" ] && bad "leftover claim after recovery" || ok "no leftover claim" + && ok "claim create logged with its per-attempt token (CLAIM ... tok=)$t16_ntag" \ + || bad "no CLAIM line with a token in the recovery logs$t16_ntag" + [ "$nold" = 0 ] && ok "unserialized-steal line shape ('STOLE stale lock') never logged$t16_ntag" \ + || bad "'STOLE stale lock' shape appeared x$nold$t16_ntag — an unserialized steal lane is present" + [ "$ndisp" = 0 ] && ok "zero STEAL-DISPLACED lines (prevention, not detect-and-repair)$t16_ntag" \ + || bad "STEAL-DISPLACED fired x$ndisp$t16_ntag — displacement-repair machinery present?" + [ -e "$T16_GRAVESEEN" ] && bad "a move-aside file (.dead.*) existed during recovery$t16_ntag — the steal is staged through an intermediate file!" 
\ + || ok "no move-aside file (.dead.*) ever existed during recovery (sampler)$t16_ntag" + [ -e "$LOCK" ] && bad "leftover crash-recovery lock$t16_ntag" || ok "no leftover lock$t16_ntag" + [ -e "$LOCK.next" ] && bad "leftover claim after recovery$t16_ntag" || ok "no leftover claim$t16_ntag" else - bad "T16: no clean run under a conclusive backdate in $T16_TRIES attempts (see above)" + bad "T16: no clean run under a conclusive backdate in $T16_TRIES attempts$t16_ntag (see above)" fi +done fi if section "Test 16b: bash claimant vs ps1 claimant racing ONE ghost — one claim winner, cross-impl wire parity"; then diff --git a/tests/git-commit-lock.test.sh b/tests/git-commit-lock.test.sh index 8b1aa08..3bffabd 100755 --- a/tests/git-commit-lock.test.sh +++ b/tests/git-commit-lock.test.sh @@ -184,16 +184,49 @@ if section "Test 2b: crash recovery under CONTENTION — claim-serialized: zero # WINNER'S live lock), the attempt is kept only if its outcome is clean and # otherwise discarded and retried (bounded), instead of failing assertions # the protocol never violated. -T2B_N=4 +# +# Waiter count is swept over $T_AXIS_A (Bucket 6): one iteration at N=4 by +# default (byte-identical to today) and at N=4,12,24 under GCL_TEST_SWEEP=1. +# Every sweep iteration's assertions carry an " at N=" tag so a sweep +# failure says which N broke; that tag is SUPPRESSED in the default (non-sweep) +# run (t2b_ntag empty) so the messages are byte-identical to today — the first +# assertion already names the count via "$T2B_N waiters". 
The correctness +# invariants asserted here (zero 98, exactly one steal, no move-aside, clean +# final state) stay ok/bad strict (not envelope) at all N — but that requires +# STALE >> the winner's EFFECTIVE hold, which grows with N under load (the +# winner is one of N concurrent processes; oversubscription stretches the wall +# time between its create and release), so STALE is floored to N when sweeping +# (t2b_stale) — at the default floor it is the same 8 as today. The per-waiter +# wall-clock budget scales too: MAX_WAIT = 30*N (=> 120 at N=4, today's value) +# so a wide sweep, where the losing waiters acquire in sequence after the winner +# releases, has time to drain instead of timing out and looking like a product +# failure. T2B_TRIES=3 # per-round attempts; see the backdate_ghost note +for T2B_N in $T_AXIS_A; do +# MAX_WAIT and STALE: today's exact values (120 / 8) in the default (non-sweep) +# run so the env passed to the library is byte-identical; only the sweep's wider +# N raise them. MAX_WAIT scales 30*N (=> 120 at N=4 anyway). STALE floors to N so +# a wide fan-out's load-stretched winner hold (the winner is one of N concurrent +# processes) can never make its own live lock look stale and trigger a +# legitimate-but-unwanted second steal. +if [ "$GCL_TEST_SWEEP" = 1 ]; then + t2b_maxwait=$(( 30 * T2B_N )) + [ "$T2B_N" -gt 8 ] && t2b_stale="$T2B_N" || t2b_stale=8 + t2b_ntag=" at N=$T2B_N" +else + t2b_maxwait=120; t2b_stale=8; t2b_ntag="" +fi t2b_fail=0; t2b_stole=0; t2b_old_shape=0; t2b_disp=0; t2b_98=0; t2b_retried=0 for r in $(seq 1 "$T2B_ROUNDS"); do t2b_valid=0 for try in $(seq 1 "$T2B_TRIES"); do - GHOST="tok.ghost.t2b.$r.$try" + # Ghost token carries an N segment only when sweeping (distinct per N); the + # default keeps today's exact "tok.ghost.t2b.$r.$try" so the lock CONTENT + # the library sees is byte-identical. 
+ if [ "$GCL_TEST_SWEEP" = 1 ]; then GHOST="tok.ghost.t2b.$T2B_N.$r.$try"; else GHOST="tok.ghost.t2b.$r.$try"; fi LOCK="$WORK/recov.$r.lock"; RAN="$WORK/recov.$r.ran"; : > "$RAN" GRAVESEEN="$WORK/recov.$r.graveseen"; SAMPSTOP="$WORK/recov.$r.sampstop" - rm -f "$GRAVESEEN" "$SAMPSTOP" "$LOCK" "$LOCK.next" + rm -f "$GRAVESEEN" "$SAMPSTOP" "$LOCK" "$LOCK.next" "$WORK/recov.$r".*.log fabricate_lock "$LOCK" "$GHOST" "pid=999 host=ghost" # fresh mtime: not yet stale # Move-aside sampler: ANY .dead.* sighting at ANY moment during the round # means the implementation stages the steal through an intermediate file @@ -207,21 +240,21 @@ for r in $(seq 1 "$T2B_ROUNDS"); do done ) & sampler=$! - pids=() + pids=(); waiter_logs=() for i in $(seq 1 "$T2B_N"); do : > "$WORK/recov.$r.$i.log" # per-waiter logs: concurrent appends to one log drop lines - AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$WORK/recov.$r.$i.log" AGENT_LOCK_STALE_SECS=8 \ - AGENT_LOCK_CLAIM_STALE_SECS=60 AGENT_LOCK_POLL_SECS=0.05 AGENT_LOCK_MAX_WAIT=120 \ + waiter_logs+=("$WORK/recov.$r.$i.log") + AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$WORK/recov.$r.$i.log" AGENT_LOCK_STALE_SECS="$t2b_stale" \ + AGENT_LOCK_CLAIM_STALE_SECS=60 AGENT_LOCK_POLL_SECS=0.05 AGENT_LOCK_MAX_WAIT="$t2b_maxwait" \ bash "$LIB" run -- bash -c 'echo ran >> "$1"; sleep 0.1' _ "$RAN" 2>/dev/null & pids+=($!) done t2b_sync=1 - if ! sync_waiting_fresh "$LOCK" 60 "$WORK/recov.$r.1.log" "$WORK/recov.$r.2.log" \ - "$WORK/recov.$r.3.log" "$WORK/recov.$r.4.log"; then + if ! sync_waiting_fresh "$LOCK" 60 "${waiter_logs[@]}"; then t2b_sync=0 for i in $(seq 1 "$T2B_N"); do grep -q "WAITING for lock" "$WORK/recov.$r.$i.log" 2>/dev/null \ - || echo " round $r: waiter $i never logged WAITING" + || echo " N=$T2B_N round $r: waiter $i never logged WAITING" done fi backdate_ghost "$LOCK" "$GHOST" 9999; bd=$? # all waiters now judge the ghost stale together @@ -230,8 +263,8 @@ for r in $(seq 1 "$T2B_ROUNDS"); do wait "${pids[$((i-1))]}"; rc=$? 
case "$rc" in 0) ;; - 98) round_98=$((round_98+1)); echo " round $r: waiter $i rc=98 — displacement under the claim protocol" ;; - *) round_badrc=$((round_badrc+1)); echo " round $r: waiter $i rc=$rc (want 0)" ;; + 98) round_98=$((round_98+1)); echo " N=$T2B_N round $r: waiter $i rc=98 — displacement under the claim protocol" ;; + *) round_badrc=$((round_badrc+1)); echo " N=$T2B_N round $r: waiter $i rc=$rc (want 0)" ;; esac done touch "$SAMPSTOP"; wait "$sampler" 2>/dev/null @@ -254,7 +287,7 @@ for r in $(seq 1 "$T2B_ROUNDS"); do { [ -e "$LOCK" ] || [ -e "$LOCK.next" ]; } && round_dirty=1 if [ "$round_dirty" = 1 ]; then t2b_retried=$((t2b_retried+1)) - echo " round $r try $try: non-conclusive backdate AND dirty outcome — attempt discarded, retrying" + echo " N=$T2B_N round $r try $try: non-conclusive backdate AND dirty outcome — attempt discarded, retrying" rm -f "$LOCK" "$LOCK.next" "$RAN" "$GRAVESEEN" "$SAMPSTOP" continue fi @@ -266,38 +299,39 @@ for r in $(seq 1 "$T2B_ROUNDS"); do nran="$(grep -c ran "$RAN")" [ "$nran" = "$T2B_N" ] || { t2b_fail=1 - echo " round $r: only $nran/$T2B_N commands ran" + echo " N=$T2B_N round $r: only $nran/$T2B_N commands ran" } [ -e "$LOCK" ] && { t2b_fail=1 - echo " round $r: leftover lock" + echo " N=$T2B_N round $r: leftover lock" } [ -e "$LOCK.next" ] && { t2b_fail=1 - echo " round $r: leftover claim" + echo " N=$T2B_N round $r: leftover claim" } [ -e "$GRAVESEEN" ] && { t2b_fail=1 - echo " round $r: a move-aside file (.dead.*) existed during recovery — the steal is staged through an intermediate file!" + echo " N=$T2B_N round $r: a move-aside file (.dead.*) existed during recovery — the steal is staged through an intermediate file!" 
} t2b_stole=$((t2b_stole + $(grep -c "STOLE-BY-CLAIM" "$WORK/recov.$r.all.log"))) t2b_old_shape=$((t2b_old_shape + $(grep -c "STOLE stale lock" "$WORK/recov.$r.all.log"))) t2b_disp=$((t2b_disp + $(grep -c "STEAL-DISPLACED" "$WORK/recov.$r.all.log"))) break done - [ "$t2b_valid" = 1 ] || { t2b_fail=1; echo " round $r: no clean round under a conclusive backdate in $T2B_TRIES attempts"; } + [ "$t2b_valid" = 1 ] || { t2b_fail=1; echo " N=$T2B_N round $r: no clean round under a conclusive backdate in $T2B_TRIES attempts"; } done -[ "$t2b_retried" = 0 ] || echo " note: $t2b_retried discarded attempt(s) — harness backdate race, not a protocol verdict" +[ "$t2b_retried" = 0 ] || echo " note: $t2b_retried discarded attempt(s) at N=$T2B_N — harness backdate race, not a protocol verdict" [ "$t2b_fail" = 0 ] && ok "$T2B_ROUNDS rounds x $T2B_N waiters on one crashed lock: all ran, clean final state, no move-aside file ever existed" \ - || bad "crash-recovery contention failure (see above)" -[ "$t2b_98" = 0 ] && ok "zero spurious 98s — the claim serialized recovery (unserialized: near-certain displacement)" \ - || bad "$t2b_98 waiter(s) exited 98 — displacement happened under the claim protocol" -[ "$t2b_stole" = "$T2B_ROUNDS" ] && ok "exactly one STOLE-BY-CLAIM per recovery (x$t2b_stole/$T2B_ROUNDS rounds)" \ - || bad "STOLE-BY-CLAIM count $t2b_stole != $T2B_ROUNDS rounds (want exactly one steal per recovery)" -[ "$t2b_old_shape" = 0 ] && ok "unserialized-steal line shape ('STOLE stale lock') never logged" \ - || bad "'STOLE stale lock' line appeared x$t2b_old_shape — an unserialized steal lane is present" -[ "$t2b_disp" = 0 ] && ok "zero STEAL-DISPLACED lines (prevention, not detect-and-repair)" \ - || bad "STEAL-DISPLACED fired x$t2b_disp — displacement-repair machinery present?" 
+ || bad "crash-recovery contention failure$t2b_ntag (see above)" +[ "$t2b_98" = 0 ] && ok "zero spurious 98s$t2b_ntag — the claim serialized recovery (unserialized: near-certain displacement)" \ + || bad "$t2b_98 waiter(s) exited 98$t2b_ntag — displacement happened under the claim protocol" +[ "$t2b_stole" = "$T2B_ROUNDS" ] && ok "exactly one STOLE-BY-CLAIM per recovery$t2b_ntag (x$t2b_stole/$T2B_ROUNDS rounds)" \ + || bad "STOLE-BY-CLAIM count $t2b_stole != $T2B_ROUNDS rounds$t2b_ntag (want exactly one steal per recovery)" +[ "$t2b_old_shape" = 0 ] && ok "unserialized-steal line shape ('STOLE stale lock') never logged$t2b_ntag" \ + || bad "'STOLE stale lock' line appeared x$t2b_old_shape$t2b_ntag — an unserialized steal lane is present" +[ "$t2b_disp" = 0 ] && ok "zero STEAL-DISPLACED lines$t2b_ntag (prevention, not detect-and-repair)" \ + || bad "STEAL-DISPLACED fired x$t2b_disp$t2b_ntag — displacement-repair machinery present?" +done fi if section "Test 3: REGRESSION — EMPTY lock file (crash between create and write) is still stolen"; then @@ -1073,36 +1107,77 @@ if section "Test 20: claim contention — N concurrent stealers, ONE claim winne # N stealers race one ancient ghost: exactly one wins the O_EXCL claim and # steals (one STOLE-BY-CLAIM); the rest lose the claim create and acquire # normally in sequence after the winner releases. No displacement (zero -# LOST/98), no leftovers. STALE=5 keeps a loaded box from re-stealing the -# winner's brief hold. +# LOST/98), no leftovers. STALE keeps a loaded box from re-stealing the +# winner's brief hold — that bound only holds while STALE >> the winner's +# effective hold, which (counter-intuitively) grows with N: the WINNER is one +# of N concurrently-spawned bash processes, so under oversubscription the wall +# time between its create and its release stretches with the contention. 
So +# STALE must scale with N too (see t20_stale below), keeping "exactly one +# steal" a strict, config-independent correctness invariant at every N. +# +# Waiter count is swept (Bucket 6). Unlike Test 2b/16, this test's floor is NOT +# 4 — it is the MODE-driven $T20_N (5 REDUCED / 10 FULL), the count CI already +# stresses. So instead of iterating the shared T_AXIS_A ("4 ...") it builds its +# own list: just $T20_N by default (byte-identical), and $T20_N plus the sweep's +# higher counts (12, 24) under GCL_TEST_SWEEP=1 — preserving today's per-PR AND +# full-mode coverage while still widening the sweep. MAX_WAIT scales 30*N (the +# workers run `true`, so this is ample headroom, never the floor's behaviour). LOCK="$WORK/contend.lock" -fabricate_lock "$LOCK" "tok.ghost.t20" "pid=888 host=ghost" +T20_FLOOR="$T20_N" +if [ "$GCL_TEST_SWEEP" = 1 ]; then + T20_AXIS="$T20_FLOOR" + for _n in 12 24; do [ "$_n" = "$T20_FLOOR" ] || T20_AXIS="$T20_AXIS $_n"; done +else + T20_AXIS="$T20_FLOOR" +fi +for T20_N in $T20_AXIS; do +# N-tag for assertion messages: empty in the default run (byte-identical), set +# only when sweeping so each N's pass/fail line is attributable. +if [ "$GCL_TEST_SWEEP" = 1 ]; then t20_ntag=" at N=$T20_N"; else t20_ntag=""; fi +# MAX_WAIT and STALE: keep today's exact values (120 / 5) in the default +# (non-sweep) run so the env passed to the library is byte-identical; only the +# sweep's wider N raise them. MAX_WAIT scales 30*N (workers run `true`, ample +# headroom). STALE floors to N so a wide fan-out's load-stretched winner hold +# can NEVER make a live lock look stale -> the "exactly one steal" invariant +# stays true at N=24 just as at the floor. The fixture ghost token likewise +# carries an N segment only when sweeping (distinct tokens per N), so the +# default lock CONTENT the library sees is unchanged too. 
+if [ "$GCL_TEST_SWEEP" = 1 ]; then + t20_maxwait=$(( 30 * T20_N )) + [ "$T20_N" -gt 5 ] && t20_stale="$T20_N" || t20_stale=5 + t20_ghost="tok.ghost.t20.$T20_N" +else + t20_maxwait=120; t20_stale=5; t20_ghost="tok.ghost.t20" +fi +rm -f "$WORK/contend".*.log "$LOCK" "$LOCK.next" +fabricate_lock "$LOCK" "$t20_ghost" "pid=888 host=ghost" backdate "$LOCK" 9999 pids=(); t20_fail=0 for i in $(seq 1 "$T20_N"); do : > "$WORK/contend.$i.log" - AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$WORK/contend.$i.log" AGENT_LOCK_STALE_SECS=5 \ - AGENT_LOCK_CLAIM_STALE_SECS=60 AGENT_LOCK_POLL_SECS=0.05 AGENT_LOCK_MAX_WAIT=120 \ + AGENT_LOCK_PATH="$LOCK" AGENT_LOCK_LOG="$WORK/contend.$i.log" AGENT_LOCK_STALE_SECS="$t20_stale" \ + AGENT_LOCK_CLAIM_STALE_SECS=60 AGENT_LOCK_POLL_SECS=0.05 AGENT_LOCK_MAX_WAIT="$t20_maxwait" \ bash "$LIB" run -- bash -c 'true' 2>/dev/null & pids+=($!) done for i in $(seq 1 "$T20_N"); do wait "${pids[$((i-1))]}"; rc=$? - [ "$rc" = 0 ] || { t20_fail=1; echo " worker $i rc=$rc (want 0)"; } + [ "$rc" = 0 ] || { t20_fail=1; echo " N=$T20_N worker $i rc=$rc (want 0)"; } done cat "$WORK/contend."*.log > "$WORK/contend.all.log" nst="$(grep -c "STOLE-BY-CLAIM" "$WORK/contend.all.log")" nacq="$(grep -c "ACQUIRED" "$WORK/contend.all.log")" nrel="$(grep -c "RELEASED" "$WORK/contend.all.log")" nlost="$(grep -c "lock LOST" "$WORK/contend.all.log")" -[ "$t20_fail" = 0 ] && ok "$T20_N concurrent stealers all completed with rc 0" || bad "claim-contention worker failures (see above)" -[ "$nst" = 1 ] && ok "exactly ONE claim winner stole the ghost (STOLE-BY-CLAIM x$nst)" \ - || bad "STOLE-BY-CLAIM x$nst (want exactly 1 — the claim must serialize stealers)" +[ "$t20_fail" = 0 ] && ok "$T20_N concurrent stealers all completed with rc 0" || bad "claim-contention worker failures$t20_ntag (see above)" +[ "$nst" = 1 ] && ok "exactly ONE claim winner stole the ghost$t20_ntag (STOLE-BY-CLAIM x$nst)" \ + || bad "STOLE-BY-CLAIM x$nst$t20_ntag (want exactly 1 — the claim must serialize 
stealers)" [ "$nacq" = "$T20_N" ] && [ "$nrel" = "$T20_N" ] && ok "balanced ACQUIRED/RELEASED ($nacq/$nrel of $T20_N)" \ - || bad "ACQUIRED=$nacq RELEASED=$nrel (want $T20_N each)" -[ "$nlost" = 0 ] && ok "zero LOST warnings under claim contention" || bad "$nlost LOST warnings under claim contention" -[ -e "$LOCK" ] && bad "leftover lock after contention" || ok "no leftover lock" -[ -e "$LOCK.next" ] && bad "leftover claim after contention" || ok "no leftover claim" + || bad "ACQUIRED=$nacq RELEASED=$nrel$t20_ntag (want $T20_N each)" +[ "$nlost" = 0 ] && ok "zero LOST warnings under claim contention$t20_ntag" || bad "$nlost LOST warnings under claim contention$t20_ntag" +[ -e "$LOCK" ] && bad "leftover lock after contention$t20_ntag" || ok "no leftover lock$t20_ntag" +[ -e "$LOCK.next" ] && bad "leftover claim after contention$t20_ntag" || ok "no leftover claim$t20_ntag" +done fi if section "Test 21: crashed-claimant and empty-claim orphans age out; steals resume"; then From 792ab90e29b2990e89d67dff26c2256c3316ed6a Mon Sep 17 00:00:00 2001 From: Ben Toner Date: Thu, 18 Jun 2026 05:58:02 +1000 Subject: [PATCH 44/76] Bucket 6c: nightly.yml (load matrix + kcov + idempotent issue triage) New scheduled/dispatch workflow for the load tier (non-blocking, triaged): - 6 cells (matrix.include, fail-fast: false): ubuntu cpu/disk/both, macos disk, windows interop-integration/disk, windows unit/both. R=2 oversubscription via with-load.sh; GCL_ENVELOPE_TIER=relax + GCL_TEST_SWEEP=1 + GCL_TEST_FULL=1. Each cell uploads logs + load-manifest on success too. concurrency: nightly. - kcov coverage job (Linux): build kcov v43 from source, run the unit suite FULL strict no-load, gate on a 0.80 line-coverage floor (tracks the achieved ~0.83; ratchet up as Tier-A coverage lands), upload HTML + cobertura (30d). 
- Issue auto-triage (.github/scripts/nightly-triage.sh, issues: write, if: always()): per-cell ground-truth (cell-conclusion.txt, not the misleading matrix-aggregate result); classes correctness / envelope / infra; idempotent one-issue-per-(date,class); empty-round guard (missing artifact != green). Added the triage script to the shellcheck lint list. actionlint clean; nightly-triage.sh shellcheck -S style + bash -n clean; kcov floor parse verified against the committed 451/543=0.83 fixture. Schedule auto-disables after ~60d inactivity; workflow_dispatch revives it. Co-Authored-By: Claude Opus 4.8 (1M context) --- .github/scripts/nightly-triage.sh | 220 +++++++++++++++++++++++ .github/workflows/nightly.yml | 284 ++++++++++++++++++++++++++++++ .github/workflows/tests.yml | 1 + 3 files changed, 505 insertions(+) create mode 100644 .github/scripts/nightly-triage.sh create mode 100644 .github/workflows/nightly.yml diff --git a/.github/scripts/nightly-triage.sh b/.github/scripts/nightly-triage.sh new file mode 100644 index 0000000..485764d --- /dev/null +++ b/.github/scripts/nightly-triage.sh @@ -0,0 +1,220 @@ +#!/usr/bin/env bash +# nightly-triage.sh — classify a nightly stress run's results and file/append a +# single labelled GitHub issue per (date, class), idempotently. +# +# Invoked by the `triage` job in .github/workflows/nightly.yml AFTER it has +# downloaded every matrix cell's `test-output/` artifact (each into a directory +# named `nightly-logs-/`) and written the per-cell job conclusions to a +# JSON file. It reads only files on disk + `gh`; it makes no test decisions of its +# own beyond parsing the preserved logs. +# +# CLASSIFICATION (per the Bucket 6 spec): +# correctness — any `^FAIL:` line in a suite log, OR a cell job concluded +# `failure`. File/append a `nightly-correctness` issue. The one +# class that demands investigation. +# envelope — no FAIL anywhere, but at least one `WARN[env-relaxed]` line in a +# log of a cell that *succeeded*.
Tracked (`nightly-envelope`); the +# three wall-clock envelope assertions stretched under load — by +# design under GCL_ENVELOPE_TIER=relax — so NO investigation action. +# infra — a cell's artifact is missing, the cell job neither succeeded nor +# cleanly failed-on-an-assertion (timeout / cancelled / checkout +# failure / errored before any suite ran), OR — the EMPTY-ROUND +# GUARD — *no* cell produced any log at all. Filed `nightly-infra`. +# Crucially, "0 FAIL across 0 logs" is NEVER read as green: with no +# evidence we classify infra, not success. +# +# Idempotency: one open issue per (run-date, class). We search open issues by a +# stable title prefix + label; if one exists we append a comment, else we create. +# Re-running triage for the same date therefore appends rather than spamming. +# +# All-green (every cell success, no FAIL, no env warn, every artifact present) ⇒ +# NO issue of any kind is filed. +# +# Inputs (environment): +# ARTIFACTS_DIR dir holding the downloaded per-cell artifact directories +# (default: ./artifacts). Each cell dir is `nightly-logs-/`. +# CONCLUSIONS path to a JSON object { "": "", ... } of +# each matrix cell job's `result` (success|failure|cancelled| +# skipped). Read from `/cell-conclusion.txt`, which each +# stress cell writes (always()) into its own artifact — so the +# conclusion is ground truth PER CELL, never a matrix aggregate. +# EXPECTED_CELLS space-separated list of cell ids that were supposed to run +# (default: the six N1..N6 ids). Lets the empty-round / missing- +# artifact guard know what to expect. +# RUN_DATE UTC date stamp for the issue title (default: today, UTC). +# GITHUB_REPOSITORY / GH_TOKEN(GITHUB_TOKEN) the usual `gh` env. +# DRY_RUN=1 print the `gh` actions instead of running them (for local tests). 
+set -uo pipefail + +ARTIFACTS_DIR="${ARTIFACTS_DIR:-./artifacts}" +EXPECTED_CELLS="${EXPECTED_CELLS:-N1 N2 N3 N4 N5 N6}" +RUN_DATE="${RUN_DATE:-$(date -u +%Y-%m-%d)}" +DRY_RUN="${DRY_RUN:-0}" + +log() { printf '%s\n' "$*" >&2; } + +# A cell's log directory and its suite logs (may be absent ⇒ infra). +cell_logdir() { printf '%s/nightly-logs-%s' "$ARTIFACTS_DIR" "$1"; } + +# ── Read a cell's OWN recorded conclusion from its artifact (ground truth: each +# stress cell writes job.status to cell-conclusion.txt under always()). Absent +# file ⇒ `unknown` (handled like a missing artifact). ────────────────────────── +cell_conclusion() { + local cell="$1" f val="" + f="$(cell_logdir "$cell")/cell-conclusion.txt" + if [ -f "$f" ]; then + val="$(tr -d '[:space:]' < "$f" 2>/dev/null)" + fi + printf '%s' "${val:-unknown}" +} + +# ── Classify each expected cell. Accumulate evidence lines per class. ─────────── +correctness_evidence="" +envelope_evidence="" +infra_evidence="" + +any_log_seen=0 # for the empty-round guard + +for cell in $EXPECTED_CELLS; do + dir="$(cell_logdir "$cell")" + concl="$(cell_conclusion "$cell")" + + # Gather this cell's suite logs (unit/interop/integration *.log under the artifact). + logs=() + if [ -d "$dir" ]; then + while IFS= read -r f; do logs+=("$f"); done \ + < <(find "$dir" -type f -name '*.log' 2>/dev/null) + fi + + if [ "${#logs[@]}" -eq 0 ]; then + # No artifact / no logs for an expected cell. Distinguish: a clean job that + # somehow uploaded nothing is still suspect ⇒ infra (we cannot prove it green). + infra_evidence+="- ${cell}: no logs found (artifact missing or empty; job conclusion='${concl}')"$'\n' + log "[$cell] INFRA: no logs (conclusion=$concl)" + continue + fi + any_log_seen=1 + + # Scan the logs. + cell_fail=0 + cell_envwarn=0 + fail_lines="" + for f in "${logs[@]}"; do + if grep -qE '^FAIL:' "$f" 2>/dev/null; then + cell_fail=1 + # Keep up to 5 FAIL lines per log as evidence. 
+ fail_lines+="$(grep -nE '^FAIL:' "$f" 2>/dev/null | head -5 | sed "s#^# ${f##*/}: #")"$'\n' + fi + if grep -qE 'WARN\[env-relaxed\]' "$f" 2>/dev/null; then + cell_envwarn=1 + fi + done + + if [ "$cell_fail" -eq 1 ] || [ "$concl" = "failure" ]; then + correctness_evidence+="- ${cell}: job='${concl}'" + [ "$cell_fail" -eq 1 ] && correctness_evidence+=", FAIL lines present:"$'\n'"${fail_lines}" || correctness_evidence+=" (job failed; no ^FAIL: in logs — see job log)"$'\n' + log "[$cell] CORRECTNESS (cell_fail=$cell_fail conclusion=$concl)" + elif [ "$concl" != "success" ]; then + # Logs exist but the job did not cleanly succeed and there is no assertion FAIL: + # timeout / cancelled / errored late ⇒ infra, not green. + infra_evidence+="- ${cell}: logs present but job conclusion='${concl}' (timeout/cancel/late error)"$'\n' + log "[$cell] INFRA (conclusion=$concl, no FAIL)" + elif [ "$cell_envwarn" -eq 1 ]; then + envelope_evidence+="- ${cell}: succeeded with WARN[env-relaxed] (envelope assertion(s) stretched under load — expected)"$'\n' + log "[$cell] ENVELOPE (success + env-relaxed warn)" + else + log "[$cell] OK (success, no FAIL, no env warn)" + fi +done + +# ── EMPTY-ROUND GUARD: if not a single expected cell produced any log, the run +# errored before any suite ran (checkout failure, total infra collapse). That is +# INFRA, never green — do not let "0 FAIL across 0 logs" pass as success. ────── +if [ "$any_log_seen" -eq 0 ]; then + empty_msg="EMPTY ROUND: none of the expected cells (${EXPECTED_CELLS}) produced any suite log. The workflow errored before any suite ran (checkout failure / total infra collapse) — this is NOT a passing nightly." + infra_evidence="${empty_msg}"$'\n'"${infra_evidence}" + log "EMPTY-ROUND GUARD fired: no logs from any cell." +fi + +# ── File/append issues, idempotently, one per (date, class). ──────────────────── +# Title prefix is stable per class+date so search-then-append is reliable. 
+file_issue() { # $1=class-label $2=title $3=body + local label="$1" title="$2" body="$3" existing="" + + if [ "$DRY_RUN" = 1 ]; then + log "DRY_RUN: would search open issues label=$label title~='$title'" + log "DRY_RUN: title='$title'" + log "DRY_RUN: body:"; printf '%s\n' "$body" >&2 + return 0 + fi + + # Search OPEN issues with this label whose title exactly matches (idempotency key). + # `gh issue list --search` uses GitHub search; we additionally filter the JSON by + # exact title to avoid a substring collision. + existing="$(gh issue list --state open --label "$label" \ + --search "$title in:title" --json number,title \ + --jq ".[] | select(.title == \"$title\") | .number" 2>/dev/null | head -1)" + + if [ -n "$existing" ]; then + log "Appending to existing issue #$existing ($label)" + if gh issue comment "$existing" --body "$body" >/dev/null; then + log "Appended comment to #$existing" + else + log "WARN: failed to append to #$existing" + fi + else + log "Creating new issue ($label): $title" + if gh issue create --title "$title" --label "$label" --body "$body" >/dev/null; then + log "Created issue ($label)" + else + log "WARN: failed to create issue ($label)" + fi + fi +} + +run_url="${GITHUB_SERVER_URL:-https://github.com}/${GITHUB_REPOSITORY:-}/actions/runs/${GITHUB_RUN_ID:-}" +filed=0 + +if [ -n "$correctness_evidence" ]; then + body="Nightly stress run on **${RUN_DATE}** has CORRECTNESS failures (a \`FAIL:\` assertion and/or a cell job concluded \`failure\`). **Investigate.** + +$correctness_evidence +Run: ${run_url} + +(Auto-filed by nightly-triage.sh; idempotent per (date, class) — re-runs append.)" + file_issue "nightly-correctness" "Nightly correctness failure — ${RUN_DATE}" "$body" + filed=1 +fi + +if [ -n "$infra_evidence" ]; then + body="Nightly stress run on **${RUN_DATE}** had INFRA issues (missing artifact / timeout / cancel / errored before suites ran). 
Not a product failure, but the run did not produce trustworthy results — re-dispatch or investigate the runner. + +$infra_evidence +Run: ${run_url} + +(Auto-filed by nightly-triage.sh; idempotent per (date, class).)" + file_issue "nightly-infra" "Nightly infra issue — ${RUN_DATE}" "$body" + filed=1 +fi + +# Envelope is filed ONLY when there is no correctness failure (a correctness issue +# subsumes it — under a red run the env warns are noise). Tracked, no action. +if [ -z "$correctness_evidence" ] && [ -n "$envelope_evidence" ]; then + body="Nightly stress run on **${RUN_DATE}**: no correctness failures, but envelope (wall-clock) assertions were relaxed under load (\`WARN[env-relaxed]\`). This is EXPECTED under GCL_ENVELOPE_TIER=relax — tracked, **no investigation needed** unless it becomes persistent at low load. + +$envelope_evidence +Run: ${run_url} + +(Auto-filed by nightly-triage.sh; idempotent per (date, class).)" + file_issue "nightly-envelope" "Nightly envelope warning — ${RUN_DATE}" "$body" + filed=1 +fi + +if [ "$filed" -eq 0 ]; then + log "ALL GREEN: every expected cell succeeded, no FAIL, no env warn, all artifacts present. No issue filed." +fi + +# Triage itself succeeds whenever it ran to completion — it must not red the +# workflow for finding failures (those are surfaced as issues). It only fails if it +# could not run at all (handled by `set -uo pipefail` on a genuine scripting error). +exit 0 diff --git a/.github/workflows/nightly.yml b/.github/workflows/nightly.yml new file mode 100644 index 0000000..6c72d6a --- /dev/null +++ b/.github/workflows/nightly.yml @@ -0,0 +1,284 @@ +name: nightly + +# Scheduled stress run: the test suites under calibrated background load (the +# `tests/with-load.sh` wrapper) at one oversubscription level R≈2, plus a kcov +# line-coverage gate and auto-triage of the results into labelled issues. 
+# +# This is NON-BLOCKING: there is no branch protection on this single-dev project +# (decision 2026-06-18), so nightly never gates a PR. Its job is to catch +# load-sensitive flakes and coverage regressions that the per-PR `tests.yml` +# (no-load, strict) cannot. +# +# NOTE for a future maintainer: GitHub auto-DISABLES a `schedule` trigger after +# ~60 days of repo inactivity. If the nightly history is empty, that may mean the +# schedule was disabled (not that every run passed) — re-enable / revive it with a +# manual `workflow_dispatch` run from the Actions tab. Rely on `workflow_dispatch` +# as the always-available manual trigger. + +on: + schedule: + - cron: '23 8 * * *' # 08:23 UTC daily — off-peak (low GitHub-hosted-runner contention) + workflow_dispatch: + +# One nightly at a time; a newer run supersedes an in-flight one. +concurrency: + group: nightly + cancel-in-progress: true + +permissions: + contents: read + +env: + # The suites run at full fan-out, with the envelope (wall-clock) assertions + # RELAXED so an oversubscribed runner cannot turn a latency stretch into a red + # (only correctness assertions can fail the suite under load), and with the + # Axis-A waiter-count sweep {4,12,24} enabled. + GCL_TEST_FULL: 1 + GCL_ENVELOPE_TIER: relax + GCL_TEST_SWEEP: 1 + # One oversubscription level R≈2 (stressors ≈ 2 * nproc per kind, total capped at + # GCL_STRESS_RATIO_MAX * nproc by with-load.sh). + GCL_STRESS_RATIO: 2 + +jobs: + # ── The 6 stress cells. Each runs the relevant suite(s) wrapped in with-load.sh + # under one GCL_STRESS_KIND. `leg` selects which suites run (mirrors tests.yml): + # ubuntu/macos run the full set; windows splits unit vs interop-integration. 
── + stress: + name: ${{ matrix.id }} ${{ matrix.os }} (${{ matrix.kind }}${{ matrix.leg != 'all' && format(', {0}', matrix.leg) || '' }}) + runs-on: ${{ matrix.os }} + strategy: + fail-fast: false # every cell's verdict is signal — and triage needs them all + matrix: + include: + - { id: N1, os: ubuntu-24.04, leg: all, kind: cpu, job_timeout: 70 } + - { id: N2, os: ubuntu-24.04, leg: all, kind: disk, job_timeout: 70 } + - { id: N3, os: ubuntu-24.04, leg: all, kind: both, job_timeout: 70 } + - { id: N4, os: macos-15, leg: all, kind: disk, job_timeout: 70 } + - { id: N5, os: windows-2025, leg: interop-integration, kind: disk, job_timeout: 55 } + - { id: N6, os: windows-2025, leg: unit, kind: both, job_timeout: 60 } + timeout-minutes: ${{ matrix.job_timeout }} # generous: load slows everything; backstop only + defaults: + run: + shell: bash # on windows-2025 this is Git Bash (MINGW) — what the interop suite requires + env: + GCL_STRESS_KIND: ${{ matrix.kind }} + steps: + - uses: actions/checkout@9f698171ed81b15d1823a05fc7211befd50c8ae0 # v6.0.3, SHA-pinned + with: + persist-credentials: false + + - name: Toolchain versions (for reconstructing failures) + run: | + uname -a + bash --version | head -1 + git --version + command -v stress-ng >/dev/null && stress-ng --version | head -1 || echo "stress-ng: NOT FOUND (with-load.sh uses the portable bash spinner)" + if command -v pwsh >/dev/null; then + pwsh -NoProfile -Command '"pwsh " + $PSVersionTable.PSVersion.ToString()' + else + echo "pwsh: NOT FOUND (interop suite will skip; integration runs bash-only)" + fi + if command -v powershell >/dev/null; then + powershell -NoProfile -Command '"powershell " + $PSVersionTable.PSVersion.ToString()' + else + echo "powershell (Windows PowerShell 5.1): NOT FOUND (interop Test 17 skips; expected on POSIX legs)" + fi + stat --version 2>/dev/null | head -1 || echo "stat: BSD variant" + + - name: Unit suite (under load) + if: ${{ matrix.leg == 'all' || matrix.leg == 'unit' }} + 
timeout-minutes: ${{ matrix.os == 'windows-2025' && 40 || 25 }} # raised: load + the N=24 sweep stretch wall-clock; a step timeout FAILS the step so the upload still runs + env: + GCL_TEST_PRESERVE_DIR: ${{ github.workspace }}/test-output/failed-work/unit + run: | + mkdir -p test-output + bash tests/with-load.sh bash tests/git-commit-lock.test.sh 2>&1 | tee test-output/unit-suite.log + + - name: Interop suite (under load; bash + pwsh) + if: ${{ !cancelled() && (matrix.leg == 'all' || matrix.leg == 'interop-integration') }} # run even if an earlier suite failed — every signal is useful + timeout-minutes: 30 + env: + GCL_TEST_PRESERVE_DIR: ${{ github.workspace }}/test-output/failed-work/interop + run: | + mkdir -p test-output + bash tests/with-load.sh bash tests/git-commit-lock.interop.test.sh 2>&1 | tee test-output/interop-suite.log + + - name: Integration suite (under load; real concurrent commits) + if: ${{ !cancelled() && (matrix.leg == 'all' || matrix.leg == 'interop-integration') }} + timeout-minutes: 20 # its internal AGENT_LOCK_MAX_WAIT cap is 240s; load + sweep stretch it + env: + GCL_TEST_PRESERVE_DIR: ${{ github.workspace }}/test-output/failed-work/integration + run: | + mkdir -p test-output + bash tests/with-load.sh bash tests/git-commit-lock.integration.test.sh 2>&1 | tee test-output/integration-suite.log + + - name: Record this cell's conclusion (ground truth for triage) + if: ${{ always() }} # capture the cell's own status — even on timeout/cancel — into its artifact + run: | + mkdir -p test-output + # job.status here reflects THIS cell's run so far: success | failure | cancelled. + # A step timeout fails the step, which makes the job status `failure` by the time + # this always() step runs — so a no-FAIL timeout is recorded as `failure`, and the + # triage script (seeing logs present but conclusion!=success and no ^FAIL:) classes + # it infra. The per-cell status file is the authoritative signal triage reads. 
+ printf '%s' "${{ job.status }}" > test-output/cell-conclusion.txt + echo "cell ${{ matrix.id }} conclusion: $(cat test-output/cell-conclusion.txt)" + + - name: Upload cell logs + load-manifest (on success too — we read the positives by the negatives) + if: ${{ always() }} # upload whether the cell passed, failed, or timed out — triage needs every cell's evidence + uses: actions/upload-artifact@043fb46d1a93c77aae656e7c1c64a875d1fc6a0a # v7.0.1, SHA-pinned + with: + name: nightly-logs-${{ matrix.id }} # unique per cell; the triage job downloads these by name + path: test-output/ + include-hidden-files: true # lock logs live under the scratch repo's .git/ (hidden); suite-generated, no secrets + if-no-files-found: warn + retention-days: 14 + + # ── kcov line-coverage gate. Linux-only, no load, strict, unit suite at FULL. + # Build kcov v43 from source (no apt package / prebuilt). Gate at 0.80. ────── + kcov: + name: kcov coverage (Linux, no load, strict) + runs-on: ubuntu-24.04 + timeout-minutes: 30 + env: + COVERAGE_FLOOR: '0.80' # tracks achieved (~83%) — RATCHET UP toward ~0.90 as Tier-A tests land; do not let it lead coverage + steps: + - uses: actions/checkout@9f698171ed81b15d1823a05fc7211befd50c8ae0 # v6.0.3, SHA-pinned + with: + persist-credentials: false + + - name: Install kcov build dependencies + run: | + sudo apt-get update + sudo apt-get install -y --no-install-recommends \ + cmake g++ make pkg-config \ + libdw-dev libelf-dev binutils-dev libcurl4-openssl-dev zlib1g-dev libiberty-dev + + - name: Build kcov v43 from source + run: | + set -euo pipefail + cd /tmp + curl -fsSL https://github.com/SimonKagstrom/kcov/archive/refs/tags/v43.tar.gz | tar xz + mkdir kcov-build && cd kcov-build + cmake ../kcov-43 + make -j"$(nproc)" + ./src/kcov --version + + - name: Run unit suite under kcov (FULL, strict, no load) + env: + GCL_TEST_FULL: 1 + # GCL_ENVELOPE_TIER unset => strict (we want a true, clean coverage run; no load applied) + GCL_TEST_PRESERVE_DIR: ${{ 
github.workspace }}/test-output/failed-work/kcov-unit + run: | + mkdir -p test-output coverage + /tmp/kcov-build/src/kcov --include-path="$(pwd)/git-commit-lock.sh" \ + coverage/kcov-out tests/git-commit-lock.test.sh 2>&1 | tee test-output/kcov-unit-suite.log + + - name: Enforce coverage floor (parse cobertura line-rate) + run: | + set -euo pipefail + # kcov writes a per-binary report under coverage/kcov-out/./ and a + # merged top-level coverage/kcov-out/cobertura.xml. For a single-binary run they + # are equivalent; pick the one with the highest lines-valid (most complete) so + # this is robust either way. + cob="" + best_valid=-1 + while IFS= read -r f; do + v="$(grep -oE 'lines-valid="[0-9]+"' "$f" 2>/dev/null | head -1 | grep -oE '[0-9]+' || true)" + v="${v:-0}" + if [ "$v" -gt "$best_valid" ]; then best_valid="$v"; cob="$f"; fi + done < <(find coverage/kcov-out -name cobertura.xml 2>/dev/null) + if [ -z "$cob" ] || [ ! -f "$cob" ]; then + echo "::error::no cobertura.xml found under coverage/kcov-out — kcov produced no report" + find coverage/kcov-out -maxdepth 3 -type f 2>/dev/null | sed 's/^/ /' + exit 1 + fi + echo "Parsing coverage from: $cob (lines-valid=$best_valid)" + # Prefer the precise lines-covered/lines-valid ratio (exact); fall back to the + # rounded line-rate attribute. Both live on the top-level tag.
+ covered="$(grep -oE 'lines-covered="[0-9]+"' "$cob" | head -1 | grep -oE '[0-9]+' || true)" + valid="$(grep -oE 'lines-valid="[0-9]+"' "$cob" | head -1 | grep -oE '[0-9]+' || true)" + rate="$(grep -oE 'line-rate="[0-9.]+"' "$cob" | head -1 | grep -oE '[0-9.]+' || true)" + if [ -n "$covered" ] && [ -n "$valid" ] && [ "$valid" -gt 0 ]; then + # ratio to 4 dp, computed with awk floating point (no bc/python dependency) + rate="$(awk -v c="$covered" -v v="$valid" 'BEGIN { printf "%.4f", c / v }')" + echo "Line coverage: $covered / $valid = $rate" + else + echo "Line coverage (from line-rate attribute): $rate (lines-covered/valid unavailable)" + fi + floor="$COVERAGE_FLOOR" + # Compare rate >= floor with awk (float-safe). + if awk -v r="$rate" -v f="$floor" 'BEGIN { exit !(r + 0 >= f + 0) }'; then + echo "PASS: line coverage $rate >= floor $floor" + echo "NOTE: the floor ($floor) tracks the achieved coverage (~0.83); ratchet it up toward ~0.90 as Bucket-2 Tier-A tests land. The Linux ceiling is ~0.94 (~30 lines are platform-gated)." + else + echo "::error::line coverage $rate is BELOW the floor $floor — coverage regressed" + echo "The floor tracks achieved coverage (~0.83) and should only ratchet UP as tests land. A drop means a test stopped exercising lines it used to. Investigate before lowering the floor." + exit 1 + fi + + - name: Upload coverage report (HTML + cobertura) + if: ${{ !cancelled() }} + uses: actions/upload-artifact@043fb46d1a93c77aae656e7c1c64a875d1fc6a0a # v7.0.1, SHA-pinned + with: + name: kcov-coverage + path: | + coverage/kcov-out/ + test-output/kcov-unit-suite.log + include-hidden-files: true + if-no-files-found: warn + retention-days: 30 + + # ── Auto-triage. Downloads every cell's artifact, classifies (correctness / + # envelope / infra), and files/appends ONE labelled issue per (date, class). + # Runs always() so a failed/cancelled cell is still triaged; the empty-round + # guard prevents "0 FAIL across 0 logs" being read as green.
───────────────── + triage: + name: Triage nightly results + needs: [stress, kcov] + if: ${{ always() }} + runs-on: ubuntu-24.04 + timeout-minutes: 10 + permissions: + issues: write + contents: read + steps: + - uses: actions/checkout@9f698171ed81b15d1823a05fc7211befd50c8ae0 # v6.0.3, SHA-pinned + with: + persist-credentials: false + + - name: Download all cell artifacts + uses: actions/download-artifact@018cc2cf5baa6db3ef3c5f8a56943fffe632ef53 # v6.0.0, SHA-pinned + with: + path: artifacts + # pattern restricts to the per-cell logs (not kcov-coverage); merge-multiple off + # so each lands in its own `nightly-logs-/` dir, as the triage script expects. + pattern: nightly-logs-* + continue-on-error: true # a totally-missing artifact set must reach the empty-round guard, not error the job + + - name: Ensure triage labels exist (idempotent) + env: + GH_TOKEN: ${{ github.token }} + run: | + set -uo pipefail + gh label create nightly-correctness -c '#d73a4a' -d 'Nightly stress: a correctness assertion failed — investigate' --force || true + gh label create nightly-envelope -c '#fbca04' -d 'Nightly stress: wall-clock envelope relaxed under load — expected, tracked' --force || true + gh label create nightly-infra -c '#0e8a16' -d 'Nightly stress: infra issue (missing artifact / timeout / errored) — not a product failure' --force || true + + - name: Classify results and file/append issues + env: + GH_TOKEN: ${{ github.token }} + ARTIFACTS_DIR: artifacts + EXPECTED_CELLS: 'N1 N2 N3 N4 N5 N6' + GITHUB_SERVER_URL: ${{ github.server_url }} + GITHUB_REPOSITORY: ${{ github.repository }} + GITHUB_RUN_ID: ${{ github.run_id }} + run: | + set -uo pipefail + # Each cell's status is ground truth from its OWN artifact + # (nightly-logs-/cell-conclusion.txt, written by the stress job under + # always()), so the script never relies on the misleading matrix-aggregate + # `needs.stress.result`. The empty-round guard fires if NO cell artifact exists. 
+ echo "Artifacts present:"; ls -la artifacts 2>/dev/null || echo " (none)" + bash .github/scripts/nightly-triage.sh diff --git a/.github/workflows/tests.yml b/.github/workflows/tests.yml index 2156133..8ebffcc 100644 --- a/.github/workflows/tests.yml +++ b/.github/workflows/tests.yml @@ -133,6 +133,7 @@ jobs: tests/git-commit-lock.test.sh \ tests/git-commit-lock.interop.test.sh \ tests/git-commit-lock.integration.test.sh \ + .github/scripts/nightly-triage.sh \ install.sh - name: PSScriptAnalyzer (gate at warning severity) From 9cce97d3e5d12d02330167a03213706e77882a9e Mon Sep 17 00:00:00 2001 From: Ben Toner Date: Thu, 18 Jun 2026 05:58:04 +1000 Subject: [PATCH 45/76] Bucket 6d: deep-sweep.yml (on-demand deep flake hunt) New workflow_dispatch-only workflow (never gates): inputs stress_kind / stress_load / repeat / envelope_tier (default relax). Per-run-unique concurrency (group: deep-, cancel-in-progress: false) so many parallel dispatches coexist and queue. Matrix mirrors tests.yml's 4 cells with distinct deep-* job names, each wrapping the suites in with-load.sh at FULL + SWEEP. The `repeat` input loops the suite N times, sanitized to a positive int, failing fast (via PIPESTATUS, since set -e is off) on the first bad iteration with the index named. Artifacts uploaded on success too. actionlint clean; YAML valid; repeat loop + concurrency + inputs->with-load.sh reasoned through. Co-Authored-By: Claude Opus 4.8 (1M context) --- .github/workflows/deep-sweep.yml | 187 +++++++++++++++++++++++++++++++ 1 file changed, 187 insertions(+) create mode 100644 .github/workflows/deep-sweep.yml diff --git a/.github/workflows/deep-sweep.yml b/.github/workflows/deep-sweep.yml new file mode 100644 index 0000000..7ac74e9 --- /dev/null +++ b/.github/workflows/deep-sweep.yml @@ -0,0 +1,187 @@ +# deep-sweep — Tier D of the load-testing strategy (docs/load-testing-strategy.md §9). +# +# ON-DEMAND ONLY. 
This workflow is `workflow_dispatch`-only: it NEVER runs on push +# or pull_request, and it NEVER gates anything (it is not a required check — this is +# a single-dev project with no branch protection; see the Phase-2 build plan's +# Bucket 6 decision box). It exists purely as a deep flake-hunting tool — the +# "50-clean hunt" instrument from the load-testing strategy: dispatch it (often many +# times in parallel), pick a stress kind/magnitude, and repeat the full suite N +# times per job to surface intermittent, scheduling-sensitive flakes that a single +# zero-load per-PR run would never reproduce. +# +# Deep + loaded runs are SLOW (heavy CPU/disk oversubscription stretches every +# wall-clock-derived step), so timeouts here are deliberately generous and the +# envelope tier defaults to `relax` (an oversubscribed runner must not turn a +# latency miss into a red — only a real correctness FAIL should). +# +# The job names are intentionally distinct (`deep-*`). With no branch protection +# there is no required `tests-passed` context to avoid publishing, so this is now +# only cosmetic / for clarity — but kept so a deep run is never confused with the +# per-PR `tests` matrix in the checks UI. + +name: deep-sweep + +on: + workflow_dispatch: + inputs: + stress_kind: + description: 'Background load kind to apply via tests/with-load.sh' + type: choice + options: [none, cpu, disk, both] + default: both + stress_load: + description: 'Raw per-kind hog count override (GCL_STRESS_LOAD). Blank = use the ratio.' + type: string + default: '' + repeat: + description: 'How many times to repeat the suite run within each job (intermittent-flake hunt).' + type: string + default: '1' + envelope_tier: + description: 'GCL_ENVELOPE_TIER — relax (default) warns on latency misses; strict fails them.' 
+ type: string + default: relax + +# Per-run-unique group so MANY parallel dispatches each get their own group and run +# concurrently (a fresh dispatch never cancels or is cancelled by an in-flight one); +# cancel-in-progress:false means a re-dispatch into the same run_id (impossible — +# run_id is unique per run) would still queue rather than cancel. In practice every +# dispatch is its own run, so the deep sweeps fan out freely and accept queue waves. +concurrency: + group: deep-${{ github.run_id }} + cancel-in-progress: false + +permissions: + contents: read + +jobs: + deep: + name: deep-${{ matrix.os }}${{ matrix.leg != 'all' && format(' ({0})', matrix.leg) || '' }} + runs-on: ${{ matrix.os }} + strategy: + fail-fast: false # every cell's verdict is a useful deep signal; let the rest finish + matrix: + # Mirrors the tests.yml 4-cell set (ubuntu all / macos all / windows unit / + # windows interop+integration). Windows stays split because the bash-only + # unit suite is the wall-clock bottleneck there and the suites must not run + # concurrently inside one timing-sensitive 2-core runner. Generous deep + # timeouts: deep + loaded + repeated is far slower than the per-PR gate. 
+ include: + - { os: ubuntu-24.04, leg: all, job_timeout: 180 } + - { os: macos-15, leg: all, job_timeout: 180 } + - { os: windows-2025, leg: unit, job_timeout: 120 } + - { os: windows-2025, leg: interop-integration, job_timeout: 120 } + timeout-minutes: ${{ matrix.job_timeout }} # backstop only: repeat * (loaded suite budgets) + upload headroom + defaults: + run: + shell: bash # on windows-2025 this is Git Bash (MINGW) — what the interop suite requires + env: + GCL_TEST_FULL: 1 # full fan-out — CI runners are dedicated + GCL_TEST_SWEEP: 1 # deep runs exercise the Axis-A waiter-count sweep too + GCL_ENVELOPE_TIER: ${{ inputs.envelope_tier }} + GCL_STRESS_KIND: ${{ inputs.stress_kind }} + GCL_STRESS_LOAD: ${{ inputs.stress_load }} # blank => with-load.sh falls back to the ratio + steps: + - uses: actions/checkout@9f698171ed81b15d1823a05fc7211befd50c8ae0 # v6.0.3, SHA-pinned + with: + persist-credentials: false # no job uses the token after fetch + + - name: Toolchain versions (for reconstructing failures) + run: | + uname -a + bash --version | head -1 + git --version + if command -v pwsh >/dev/null; then + pwsh -NoProfile -Command '"pwsh " + $PSVersionTable.PSVersion.ToString()' + else + echo "pwsh: NOT FOUND (interop suite will skip; integration runs bash-only)" + fi + if command -v powershell >/dev/null; then + powershell -NoProfile -Command '"powershell " + $PSVersionTable.PSVersion.ToString()' + else + echo "powershell (Windows PowerShell 5.1): NOT FOUND (interop Test 17 skips; expected on POSIX legs)" + fi + stat --version 2>/dev/null | head -1 || echo "stat: BSD variant" + command -v stress-ng >/dev/null && stress-ng --version | head -1 || echo "stress-ng: NOT FOUND (with-load.sh uses the portable spinner)" + echo "dispatch inputs: kind=${GCL_STRESS_KIND} load='${GCL_STRESS_LOAD}' repeat=${{ inputs.repeat }} envelope=${GCL_ENVELOPE_TIER}" + + # Each suite is repeated `repeat` times under load. 
The loop fails fast: the + # first failing iteration `exit 1`s the step (so the step — and job — go red on + # the earliest flake), and every iteration names its index in the log so a + # failure is attributable to a specific repeat. GitHub runs `shell: bash` with + # `-eo pipefail`, so the step does `set +e` and checks with-load.sh's rc explicitly. + - name: Unit suite (deep, looped x repeat, under load) + if: ${{ matrix.leg == 'all' || matrix.leg == 'unit' }} + timeout-minutes: ${{ matrix.os == 'windows-2025' && 100 || 90 }} + env: + GCL_TEST_PRESERVE_DIR: ${{ github.workspace }}/test-output/failed-work/unit + run: | + set +e; mkdir -p test-output + n='${{ inputs.repeat }}' + case "$n" in ''|*[!0-9]*) n=1 ;; esac + [ "$n" -lt 1 ] && n=1 + echo "== unit: repeating $n time(s) under load ==" + for i in $(seq 1 "$n"); do + echo "== unit iteration $i/$n ==" + bash tests/with-load.sh bash tests/git-commit-lock.test.sh 2>&1 \ + | tee "test-output/unit-suite.iter$i.log" + rc=${PIPESTATUS[0]} + if [ "$rc" -ne 0 ]; then + echo "== unit iteration $i/$n FAILED (rc=$rc) — stopping deep sweep ==" + exit 1 + fi + done + + - name: Interop suite (deep, looped x repeat, under load) + if: ${{ !cancelled() && (matrix.leg == 'all' || matrix.leg == 'interop-integration') }} # run even if an earlier suite failed — every signal is useful + timeout-minutes: 90 + env: + GCL_TEST_PRESERVE_DIR: ${{ github.workspace }}/test-output/failed-work/interop + run: | + set +e; mkdir -p test-output + n='${{ inputs.repeat }}' + case "$n" in ''|*[!0-9]*) n=1 ;; esac + [ "$n" -lt 1 ] && n=1 + echo "== interop: repeating $n time(s) under load ==" + for i in $(seq 1 "$n"); do + echo "== interop iteration $i/$n ==" + bash tests/with-load.sh bash tests/git-commit-lock.interop.test.sh 2>&1 \ + | tee "test-output/interop-suite.iter$i.log" + rc=${PIPESTATUS[0]} + if [ "$rc" -ne 0 ]; then + echo "== interop iteration $i/$n FAILED (rc=$rc) — stopping deep sweep ==" + exit 1 + fi + done + + - name: Integration suite (deep, looped x repeat,
under load) + if: ${{ !cancelled() && (matrix.leg == 'all' || matrix.leg == 'interop-integration') }} + timeout-minutes: 60 # its internal AGENT_LOCK_MAX_WAIT cap is 240s; x repeat under load + env: + GCL_TEST_PRESERVE_DIR: ${{ github.workspace }}/test-output/failed-work/integration + run: | + set +e; mkdir -p test-output + n='${{ inputs.repeat }}' + case "$n" in ''|*[!0-9]*) n=1 ;; esac + [ "$n" -lt 1 ] && n=1 + echo "== integration: repeating $n time(s) under load ==" + for i in $(seq 1 "$n"); do + echo "== integration iteration $i/$n ==" + bash tests/with-load.sh bash tests/git-commit-lock.integration.test.sh 2>&1 \ + | tee "test-output/integration-suite.iter$i.log" + rc=${PIPESTATUS[0]} + if [ "$rc" -ne 0 ]; then + echo "== integration iteration $i/$n FAILED (rc=$rc) — stopping deep sweep ==" + exit 1 + fi + done + + - name: Upload deep-sweep artifacts (logs + load manifests, on success too) + if: ${{ always() }} # deep runs want the negatives to read the positives; upload even when green or cancelled + uses: actions/upload-artifact@043fb46d1a93c77aae656e7c1c64a875d1fc6a0a # v7.0.1, SHA-pinned + with: + name: deep-logs-${{ matrix.os }}-${{ matrix.leg }}-${{ inputs.stress_kind }} # unique per (os, leg, kind) + path: test-output/ + include-hidden-files: true # lock logs + the load-manifest live under the scratch .git/ and test-output/; suite-generated, no secrets + if-no-files-found: warn + retention-days: 14 From 309cf3912510c1cf2891297fecd49e588250e0aa Mon Sep 17 00:00:00 2001 From: Ben Toner Date: Thu, 18 Jun 2026 06:06:02 +1000 Subject: [PATCH 46/76] docs(failure-modes): mark F1/F2/F4/J1/E3 TESTED; F3 document-only MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit The fault-injection + steering tests now exist (Buckets 2A/2B), so flip the §2 coverage table + §3 prose markers from "test planned" to tested, with citations: - E3 (mtime unreadable -> fail-safe) -> Test 42 - F1 (ENOSPC on create/write) -> Test 50 (Linux+sudo tmpfs;
skip elsewhere) - F2 (failing log path) / J1 (logging failure) -> Test 49 (portable ENOTDIR) - F4 (unwritable lock dir -> clean 97) -> Test 48 (POSIX chmod 0555) - F3 (FD/inode exhaustion) -> document-only (no deterministic portable injection) - D3 row: cite Test 37 (rename-refused / wrong-type-at-path mid-steal) 4.5 item 5 gets a "Status (done)" block recording the above; Ben's override rationale preserved. Co-Authored-By: Claude Opus 4.8 (1M context) --- docs/failure-modes.md | 98 +++++++++++++++++++++++++++---------------- 1 file changed, 61 insertions(+), 37 deletions(-) diff --git a/docs/failure-modes.md b/docs/failure-modes.md index a187c15..3f54abe 100644 --- a/docs/failure-modes.md +++ b/docs/failure-modes.md @@ -121,16 +121,16 @@ robust-by-code-but-unverified · S static/grep check · (plat) platform-gated. | C4 | Leaked claim (unverifiable unlink) | Leaked-token memory keeps ownership discoverable | 1 | ✓ U:1549-1758, U:2013-2164 | **In scope.** Keep. | | D1 | Atomic rename-over (steal install) | `mv -T` / `File.Move(...,true)` / 5.1 unlink+move | 1 (local FS) | ✓ U:212-346, I:16d S:1141 | **In scope on local FS.** Boundary = D-axis. | | D2 | O_EXCL atomic create | `set -C` redirect / `FileMode.CreateNew` | 1 (local FS) | ✓ throughout | **In scope on local FS.** | -| D3 | Wrong-type at path (dir/symlink/FIFO/dev/socket) | Never stolen/deleted; loud warn; waiters → 97 | 1 (bash + ps1-on-Win) / 2 (ps1-on-POSIX) | ✓ U:818-892/1156-1262, ~(plat) | **In scope.** ps1-on-POSIX residual = accept. | +| D3 | Wrong-type at path (dir/symlink/FIFO/dev/socket) | Never stolen/deleted; loud warn; waiters → 97 | 1 (bash + ps1-on-Win) / 2 (ps1-on-POSIX) | ✓ U:818-892/1156-1262/Test 37 (rename-refused mid-steal), ~(plat) | **In scope.** ps1-on-POSIX residual = accept. | | D4 | Non-lock CONTENT at path (user file) | Never stolen (content guard); warn | 1 | ✓ U:1034-1076 | **In scope.** Two accepted residuals (§D4). 
| | D5 | Case-insensitive FS path collision | Not handled explicitly | 3 | ✗ | **Likely non-issue;** see §D5. Decide. | | E1 | Network/shared FS (NFS/SMB/9p/Dropbox) | Outside design guarantees (stated) | 3 | ✗ | **Out of scope** (stated). See §E — decide whether to *enforce*. | | E2 | Multi-host clock skew / NTP jump | Implicitly single-clock; **not** addressed in docs | 3 (and a doc gap) | ✗ | **Out of scope** but UNDER-documented. See §E2. | -| E3 | mtime probe unreadable (staleness clock broken) | Warns loudly once; treats as not-stale → safe, recovery disabled → 97 | 2 | ○ | **Accept** — fails safe + announced. See §E3. | -| F1 | Disk full (ENOSPC) during create/write | Create fails → wait; torn write ages out | 2/3 | ○ → test planned | **Add test** (§4.5) + document. See §F1. | -| F2 | ENOSPC during LOG write | Swallowed (`|| true`); silent log loss | 2 | ○ → test planned | **Add test** (§4.5); logging best-effort, lock unaffected. | -| F3 | Inode / FD exhaustion | Create fails → wait → 97 | 2 | ○ → test planned | **Add test** (§4.5, FD via `ulimit`), document. | -| F4 | Read-only / unwritable lock dir or parent | `mkdir -p` best-effort; create fails → wait → 97 | 2 | ○ → test planned | **Add test** (§4.5, highest-value). See §F4. | +| E3 | mtime probe unreadable (staleness clock broken) | Warns loudly once; treats as not-stale → safe, recovery disabled → 97 | 2 | ✓ U:Test 42 | **Accept** — fails safe + announced. See §E3. | +| F1 | Disk full (ENOSPC) during create/write | Create fails → wait; torn write ages out | 2/3 | ✓ U:Test 50 (Linux+sudo tmpfs; (plat) skip elsewhere) | **Tested** (§4.5) + document. See §F1. | +| F2 | ENOSPC during LOG write | Swallowed (`|| true`); silent log loss | 2 | ✓ U:Test 49 (portable failing-log path) | **Tested** (§4.5); logging best-effort, lock unaffected. | +| F3 | Inode / FD exhaustion | Create fails → wait → 97 | 2 | ○ (document-only) | **Document-only**: no deterministic portable injection. See §F3. 
| +| F4 | Read-only / unwritable lock dir or parent | `mkdir -p` best-effort; create fails → wait → 97 | 2 | ✓ U:Test 48 (POSIX `chmod 0555`; (plat) skip on Windows) | **Tested** (§4.5, highest-value). See §F4. | | G1 | Lock path = a directory / `$HOME` typo | Never stolen/deleted; loud warn; → 97 | 1 | ✓ U:818-840 | **In scope.** Keep. | | G2 | Garbage numeric config | Falls back to default + stderr note | 1 | ✓ U:695-703, I:554-608 | **In scope.** Keep. | | G3 | `run` outside a git repo, no `AGENT_LOCK_PATH` | Refuses (96) | 1 | ✓ U:705-712 | **In scope.** Keep. | @@ -141,7 +141,7 @@ robust-by-code-but-unverified · S static/grep check · (plat) platform-gated. | H4 | Non-unwinding exit while held (SIGKILL / bash `exec` / `[Environment]::Exit()`) | Skips release → a displaced holder is unwarned (no 98); plain `exit` is safe | 2 | ~ (I:308-334 indirect) | **Document** the no-silent-loss boundary. See §H4. | | I1 | bash⇄pwsh wire/format compatibility | Shared format; token grammar tightened to match | 1 | ✓ I:* throughout | **In scope.** Keep. | | I2 | Mixed-VERSION tree (old unserialized steal) | Prevention degrades to detection (98); `.dead.*` litter | 3 | ✗ | **Out of scope:** "upgrade both together." Residual 4. | -| J1 | Logging subsystem failure | All log writes `|| true`; 1 MB self-truncate | 2 | ○ → test planned | **Add test** (§4.5, via F2); logging never blocks the lock. | +| J1 | Logging subsystem failure | All log writes `|| true`; 1 MB self-truncate | 2 | ✓ U:Test 49 (via F2) | **Tested** (§4.5, via F2); logging never blocks the lock. | | K1 | Extreme load / CPU oversubscription / slow FS | Correctness holds; wall-clock bounds stretch | 2 | ~ (CI stress) | **Define the envelope.** See §K — the key analytical section. | | K2 | Internal time budgets (poll, MAX_WAIT, read ladder) | Fixed schedules; tunable | 2 | ✓/~ | **In scope** as Tier-2 envelope. See §K. 
| @@ -452,12 +452,14 @@ stale** — the floor guard `[ "$mt" -gt 946684800 ]` fails closed to "fresh" lock whose age it cannot establish, so no premature steal and no corruption — but **recovery of a genuinely crashed holder is disabled**, and waiters block to MAX_WAIT (97). *Tier 2 (safety held, recovery lost — and loudly announced).* -Untested (no stat-failure injection). **Recommend: accept and document** — it is a +Tested: unit Test 42 shadows the inner mtime probe to return empty on a present, +stale ghost and asserts the fail-safe lane — the "Staleness detection is BROKEN" +warn-once fires, the ghost is NOT stolen (left in place), and the waiter blocks to +MAX_WAIT → 97. **Recommend: accept and document** — it is a host/FS-health failure the tool already detects and announces, and it fails *safe* -(no false steal). Fault injection is low-ROI; the loud warning is the right -behavior. This is also the clean reason recovery is a *Tier-1-within-envelope* -property, not unconditional (see the tier split under §1): it presumes a readable -clock. +(no false steal); the loud warning is the right behavior. This is also the clean +reason recovery is a *Tier-1-within-envelope* property, not unconditional (see the +tier split under §1): it presumes a readable clock. ### F. Resource exhaustion @@ -468,36 +470,47 @@ comment at `:1341-1343`). A created-but-write-failed file is an empty orphan tha ages into the steal lane. A torn write *shorter than `tok.`* (e.g. `to`) is the accepted residual at `:299-304`: non-empty, non-prefixed → never stolen, loud, fixed by one manual `rm`. *Tier 2 (degrades to wait/97) / Tier 3 (the torn-write -manual-fix residual).* Reasoned from code, **not tested** (no ENOSPC fault -injection). **Recommend: document + add a fault-injection test (per §4.5).** ENOSPC -is a host-health failure; the tool degrades safely (no corruption, no false hold) -and the one sharp edge (sub-`tok.` torn write needing manual `rm`) is already -documented. 
Per Ben's §4.5 decision, add an ENOSPC test where it can be injected -deterministically and portably (e.g. a small dedicated tmpfs/quota); if portable -injection proves impractical, say so in the plan rather than shipping a flaky test. +manual-fix residual).* **Tested** (per §4.5): unit Test 50 mounts a small 64k +tmpfs, fills it to ENOSPC, and asserts the waiter times out at 97 with the wrapped +command never running — no corruption, no false hold. ENOSPC injection needs a full +FS (root via a tmpfs; `ulimit -f` raises SIGXFSZ — the wrong lane), so the test runs +on **Linux with passwordless sudo** (the Linux CI leg) and skips-with-note elsewhere. +ENOSPC is a host-health failure; the tool degrades safely (no corruption, no false +hold) and the one sharp edge (sub-`tok.` torn write needing manual `rm`) is already +documented. **F2 — ENOSPC during a LOG write.** All log writes end in `|| true` (`git-commit-lock.sh:561`); a failed log write is silently lost. *Tier 2.* -**Recommend: accept + add a test (per §4.5)** — logging is best-effort by explicit -design (it must never block or fail the lock); the only downside is reduced -post-mortem signal under disk pressure. Add a test that an unwritable/failing log -path leaves the lock fully working (the write is swallowed) — this also covers J1. +**Tested** (per §4.5): unit Test 49 points `AGENT_LOCK_LOG` at a path *under a +regular file*, so every open/append fails ENOTDIR, and asserts the lock still +acquires + releases cleanly (rc 0), the wrapped command runs, the lock is cleaned +up, and no log file appears — i.e. the failing log write is swallowed and the lock +is unaffected. This is a portable injection (no chmod/perms), and it **also covers +J1**. Logging is best-effort by explicit design (it must never block or fail the +lock); the only downside is reduced post-mortem signal under disk pressure. **F3 — Inode / FD exhaustion.** Same shape as F1: a create that can't get an inode fails → wait → eventually 97. 
The tool holds at most a couple of FDs -briefly. *Tier 2.* Untested. **Recommend: document + add a test (per §4.5)** as -host-health — an FD-exhaustion test via `ulimit -n` is the deterministic, portable -one; add inode exhaustion only if it can be injected cleanly. +briefly. *Tier 2.* **Document-only — no deterministic portable injection.** A +`ulimit -n` FD cap can't be driven deterministically here: the create needs only +~1 FD, so an FD-exhaustion test would have to pin the process at *exactly* the +limit across a poll loop without starving the harness itself — not portable or +stable. Inode exhaustion needs a full FS the way F1 does (and F1/Test 50 already +exercises the create-fails-→-wait-→-97 lane that F3 shares). So F3 is recorded as +a reasoned-but-untested boundary rather than given a flaky test; the safe-degrade +behavior is the same as F1, which is tested. **F4 — Read-only / unwritable lock dir or parent.** `lock_acquire` does a best-effort `mkdir -p "$(dirname …)"` (`git-commit-lock.sh:1278`); if the dir is unwritable the create fails every poll and the waiter times out at 97. No corruption, no false hold. A *release* unlink blocked by an unwritable parent -routes to the LEFTOVER lane (`:1699-1711`). *Tier 2.* Untested directly. -**Recommend: add a test (per §4.5 — the highest-value one).** An unwritable lock -dir → clean 97 is cheap and deterministic to write. A correct, if blunt, outcome -(97); an *earlier, clearer* error would be nicer but is optional polish, low -priority. +routes to the LEFTOVER lane (`:1699-1711`). *Tier 2.* **Tested** (per §4.5 — the +highest-value one): unit Test 48 `chmod 0555`s the lock-dir parent and asserts the +waiter times out at 97, the wrapped command never runs, no lock file is created, +and the WAITING/TIMEOUT lines are logged — no corruption, no false hold. POSIX-only +(`chmod 0555` is a no-op for writes on Git-Bash/NTFS, so it skips-with-note on +Windows; the Linux/macOS CI legs exercise it).
A correct, if blunt, outcome (97); an +*earlier, clearer* error would be nicer but is optional polish, low priority. **F5 — Memory exhaustion.** The scripts allocate trivially (a few shell vars; the leaked-token list is "almost always empty"). Not a meaningful failure surface. @@ -632,11 +645,12 @@ than rotating (`git-commit-lock.sh:554-562`). A broken log never blocks or fails the lock. Under a redirected git dir, log *content* (the owner line) is attacker-influenceable — one-line text spoofing, no execution; the tool itself writes only its token, owner line, and protocol events, never secrets -(`docs/git-commit-lock.md:543-551`). *Tier 2.* **Recommend: accept + covered by the -F2 log-failure test (per §4.5)** — logging is best-effort by design, which is the -right call for a lock that must keep working when the disk is full or the log path -is bad. The follow-on (unchanged): don't build automation that *trusts* log text -from an untrusted repo (already documented). +(`docs/git-commit-lock.md:543-551`). *Tier 2.* **Tested — covered by the F2 +log-failure test (per §4.5): unit Test 49** proves a failing log path leaves the +lock fully working. Logging is best-effort by design, which is the right call for a +lock that must keep working when the disk is full or the log path is bad. The +follow-on (unchanged): don't build automation that *trusts* log text from an +untrusted repo (already documented). ### K. Behavior under extreme load / scheduling pressure, and internal time budgets @@ -787,7 +801,7 @@ edge cases make the tool more maintainable and give future users confidence), ra 5. **Untested-but-robust-by-code lanes (resource exhaustion F1/F3/F4, log-write failure F2/J1).** These degrade safely (wait/97, or silent best-effort log - loss) but have **no fault-injection tests** — they are reasoned-correct, not + loss) but had **no fault-injection tests** — they were reasoned-correct, not verified. 
*Decision (Ben — overrides the prior "accept untested"):* **add test coverage** for these lanes. Rationale: actually-tested edge cases make the project easier to maintain and give future users confidence, versus @@ -800,6 +814,16 @@ edge cases make the tool more maintainable and give future users confidence), ra FDs). Flag in the plan any lane that proves genuinely impractical to fault-inject portably, rather than forcing a flaky test. + *Status (done):* coverage added — **F4** unit Test 48 (POSIX `chmod 0555`, + skip-with-note on Windows), **F2/J1** unit Test 49 (portable failing-log path via + ENOTDIR), **F1** unit Test 50 (Linux + passwordless-sudo 64k tmpfs filled to + ENOSPC; skip-with-note elsewhere). **F3** (inode/FD exhaustion) proved impractical + to fault-inject deterministically and portably — the create needs only ~1 FD, so a + `ulimit -n` cap can't be driven deterministically across a poll loop without + starving the harness, and inode exhaustion needs a full FS the way F1 does (F1/Test + 50 already exercises the shared create-fails-→-wait-→-97 lane). Per the "flag any + impractical lane" instruction above, F3 stays **document-only**, not a flaky test. + 6. **Mixed-version tree (§I2) and case-insensitive FS (§D5) — out of scope, confirm.** The first degrades to detection (98), never silent, and is covered by the "upgrade both together" note. The second is a non-issue. *Recommendation:* From d6d643f4e8627de181290e11155f9b88d3158f66 Mon Sep 17 00:00:00 2001 From: Ben Toner Date: Thu, 18 Jun 2026 06:24:27 +1000 Subject: [PATCH 47/76] Plan: Windows unit-suite CI sharding subplan (Phase 2, under review) Subplan for splitting the windows-2025 unit leg (the ~2x CI bottleneck) into two parallel shards via a GCL_TEST_SHARD=i/n round-robin gate in section(). 
Records the mechanism, the by-construction partition guarantee + a self-contained per-shard expected-count guard, alternatives rejected, CI wiring (windows-unit only; not kcov/nightly/interop), edge cases, phasing, and a logging design. Awaiting review convergence + Ben's go. Co-Authored-By: Claude Opus 4.8 (1M context) --- ...06-18-ci-stress-windows-unit-shard-plan.md | 176 ++++++++++++++++++ 1 file changed, 176 insertions(+) create mode 100644 .plans/2026-06-18-ci-stress-windows-unit-shard-plan.md diff --git a/.plans/2026-06-18-ci-stress-windows-unit-shard-plan.md b/.plans/2026-06-18-ci-stress-windows-unit-shard-plan.md new file mode 100644 index 0000000..1ada89d --- /dev/null +++ b/.plans/2026-06-18-ci-stress-windows-unit-shard-plan.md @@ -0,0 +1,176 @@ +# Subplan: split the Windows unit CI leg into parallel shards + +Status: **PROPOSAL (Phase 2) — under review.** A small follow-on to the Bucket-6 CI +work, building on the `section()`/selector machinery (commit `4ee5899`) and the shared +`tests/_harness.sh` (`b8e2951`). No implementation until the review converges and Ben +gives the go. + +## Review issues (record at top; do not renumber on resolution) +*(reviewers: add numbered findings here; resolutions noted inline)* + +--- + +## Motivation +The `windows-2025 unit` leg is the CI wall-clock bottleneck: a full reduced unit run is +~4m38s and the Windows leg is ~2× every other leg (interop ~100s, integration ~28s). A +measured run shows `sys` time > `user` time → the cost is **process-spawn overhead** on +the 2-core Windows runner (each test spawns `bash $LIB` many times), not compute. So +running the unit suite as **two parallel shards on two runners ~halves** that leg's +wall-clock and speeds up the per-PR dev-feedback loop. **CI-only** — local dev runs are +unaffected (sharding is opt-in via an env var, unset by default). + +## Decision context +- **No branch protection** (Ben, 2026-06-18; single-dev project). 
So adding a matrix cell + has **zero required-context fallout** — no aggregator, no gating concern. `tests.yml` + reports per-cell contexts directly. +- The enabling work is already done: every unit test is a `section "Test N: …"`-gated + block, proven individually selectable with no cross-test ordering dependencies (the + `GCL_TEST_ONLY` selector work). A shard is just "run the subset of sections assigned to + me," which slots into the same `section()` gate. + +## Mechanism: `GCL_TEST_SHARD=i/n`, round-robin, inside `section()` +A new opt-in env var `GCL_TEST_SHARD=<i>/<n>` (e.g. `1/2`) read in `tests/_harness.sh` +alongside the existing `GCL_TAP`/`GCL_TEST_ONLY`/`GCL_TEST_SWEEP` reads. Implementation +(~10 lines in `_harness.sh`): + +- **A monotonic section index** `SECTION_IDX`, bumped in `section()` on **every** call + (every test, in file order), *independent of* whether the test runs. This is the stable + shard-assignment key — it does not depend on `GCL_TEST_ONLY`/`GCL_TEST_SWEEP`. +- **Parse + validate** `GCL_TEST_SHARD` once at suite top: split `i/n`; require `n` a + positive integer and `1 ≤ i ≤ n`; on malformed, **bail loudly** (`exit 1`) rather than + silently running all/none (same spirit as the zero-match guard). +- **Shard gate** in `section()`: a test runs iff `(SECTION_IDX-1) % n == (i-1)` + (round-robin). Composed with the existing `GCL_TEST_ONLY` gate by **AND** (both must + pass to run); `SECTIONS_RUN` still bumps only when the test actually runs. + +```sh +# in _harness.sh, near the GCL_* reads: +GCL_TEST_SHARD="${GCL_TEST_SHARD:-}" +SHARD_I=0; SHARD_N=0; SECTION_IDX=0 +if [ -n "$GCL_TEST_SHARD" ]; then + case "$GCL_TEST_SHARD" in + */*) SHARD_I=${GCL_TEST_SHARD%/*}; SHARD_N=${GCL_TEST_SHARD#*/} ;; + *) echo "Bail out! GCL_TEST_SHARD must be i/n (got '$GCL_TEST_SHARD')" >&2; exit 1 ;; + esac + case "$SHARD_I$SHARD_N" in *[!0-9]*) echo "Bail out! 
GCL_TEST_SHARD i/n must be integers" >&2; exit 1 ;; esac + if [ "$SHARD_N" -lt 1 ] || [ "$SHARD_I" -lt 1 ] || [ "$SHARD_I" -gt "$SHARD_N" ]; then + echo "Bail out! GCL_TEST_SHARD=$GCL_TEST_SHARD out of range (need 1<=i<=n, n>=1)" >&2; exit 1 + fi +fi + +section() { + SECTION_IDX=$((SECTION_IDX + 1)) + echo "== $1 ==" + # GCL_TEST_ONLY gate (unchanged) + if [ -n "${GCL_TEST_ONLY:-}" ] && ! [[ "$1" =~ $GCL_TEST_ONLY ]]; then return 1; fi + # GCL_TEST_SHARD gate (round-robin partition) + if [ -n "$GCL_TEST_SHARD" ] && [ $(( (SECTION_IDX - 1) % SHARD_N )) -ne $(( SHARD_I - 1 )) ]; then + return 1 + fi + SECTIONS_RUN=$((SECTIONS_RUN + 1)); return 0 +} +``` + +## Why round-robin (alternatives rejected) +- **Round-robin by index (CHOSEN):** auto-balancing and **zero-maintenance** — new tests + distribute themselves; nothing to hand-edit. Measured imbalance ~10% at n=2 (well within + "roughly halve"). The heavy tests (Test 22 ~34s, 25, 1, 31, 33, 21, 2b, 17d) are + scattered through the file, so interleaving balances them naturally. +- **Contiguous halves:** ~17%+ imbalance (worse, because the heavy tests aren't evenly + placed) and still needs the same machinery. Rejected. +- **Two explicit `GCL_TEST_ONLY` regex lists in the matrix:** works today with no code, but + **fails the maintainability bar** — a new test that matches neither list silently runs in + *no* shard (a coverage hole). Rejected for the standing config. +- **Splitting the file:** duplicates shared `clone_fn`/fixtures, doubles shellcheck + entries. Rejected. + +## Coverage safety (the cardinal risk + the guarantee) +The risk: a shard scheme that drops a test reads as green → silent coverage hole. + +- **Primary guarantee — partition by construction.** Round-robin over a single stable + ordering (`SECTION_IDX` in file order) assigns every section index to **exactly one** + residue class. 
So for any `n`, the shards are a true partition: union == full suite, no + overlap, no drops — *by construction*, as long as every test goes through `section()` + (all 57 do). +- **Self-contained per-shard guard (belt-and-suspenders).** In the suite verdict (extend + `selector_report` in `_harness.sh`), when `GCL_TEST_SHARD` is set, compute the + **expected** run-count from the totals the shard already has — + `expected = number of k in 1..SECTION_IDX with (k-1)%n == (i-1)` — and assert + `SECTIONS_RUN == expected`; **bail loudly** otherwise. This catches a modulo bug or a + `section()` regression *within a single shard* (no cross-job artifacts needed). It does + not need an unsharded baseline (each shard sees all `SECTION_IDX` section calls). +- **Existing guards still apply per shard:** the `finish`/`DONE` sentinel (a shard that + dies early bails), the `1..$TAPN` plan line (partial-but-correct per shard), and the + zero-match-style guard (a shard that legitimately runs 0 sections — only possible when + `n` > section count — is a misconfiguration and bails). +- **Local union proof (build phase, one-time):** run all `n` shards for `n∈{2,3}` and + assert the concatenation of run-section labels equals the unsharded run's set, with no + duplicates. This validates the implementation before wiring CI. (Belt-and-suspenders on + top of the by-construction argument.) + +## Interaction with existing machinery +- **`GCL_TEST_ONLY` + `GCL_TEST_SHARD`:** AND semantics (run iff selected *and* in-shard). + Independent gates; `SECTION_IDX` counts all sections regardless, so a sharded selector + run is well-defined. +- **`GCL_TEST_FULL` / reduced:** sharding is orthogonal — it partitions *which* sections + run, not *how* each runs. The per-shard expected-count guard uses the shard's own + `SECTION_IDX` total, which is identical full vs reduced (same 57 sections), so the guard + is mode-independent. 
+- **`GCL_TEST_SWEEP` (Axis-A):** orthogonal — a sharded run still sweeps the Axis-A tests + *that land in its shard*. Fine for nightly (not sharded; see scope) and harmless if ever + combined. +- **Integration suite:** has no `section()`-wrapped blocks (one indivisible scenario) and + already note-and-ignores `GCL_TEST_ONLY`; it must **note-and-ignore `GCL_TEST_SHARD`** + the same way (loud stderr note, run the whole scenario). Add `GCL_TEST_SHARD` to that note. + +## CI wiring (`.github/workflows/tests.yml`) — Windows unit only +- Replace the single `{ os: windows-2025, leg: unit, job_timeout: 20 }` matrix cell with + **two** cells carrying `shard: 1` / `shard: 2` (same `job_timeout`, or slightly lower + since each runs ~half — keep generous to avoid flakiness; a half-run finishes well within + 20 min). +- The Unit-suite step sets `GCL_TEST_SHARD: ${{ matrix.shard && format('{0}/2', matrix.shard) || '' }}` (unset on cells without a `shard:` key, so ubuntu/macos `leg: all` and the windows interop-integration cell run the **full** unit suite unchanged). +- **Artifact name** must include the shard (`test-logs-${{ matrix.os }}-${{ matrix.leg }}${{ matrix.shard && format('-{0}', matrix.shard) || '' }}`) — v4+ rejects duplicate artifact names. +- The job-name template already includes `leg`; extend it to include the shard so the two + Windows-unit jobs are distinguishable in the checks list. +- **Scope:** Windows unit **only**. Do **not** shard: the fast legs (interop ~100s, + integration ~28s, all of ubuntu/macos — not bottlenecks), `nightly.yml` (background, not + dev-blocking; optional future), or the **kcov** job (coverage needs the whole suite in + one process — sharding would break it). +- **Runner budget:** today's matrix is ~5 jobs (3 OS legs split into 4 + lint); going to 5 + test jobs + lint is well under GitHub's concurrency ceiling — no queueing. 
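The per-shard expected-count guard described under "Coverage safety" can be sketched in isolation. This is a minimal illustration only, assuming bash and the round-robin rule `(k-1) % n == (i-1)`; the names `shard_expected` and `shard_guard` are hypothetical and are not the harness's actual `selector_report` code.

```sh
# Hypothetical helper (not in the harness): count the section indices k in
# 1..total that the round-robin gate assigns to shard i of n.
shard_expected() {
  local i=$1 n=$2 total=$3 k expected=0
  for ((k = 1; k <= total; k++)); do
    if (( (k - 1) % n == i - 1 )); then expected=$((expected + 1)); fi
  done
  echo "$expected"
}

# Hypothetical verdict check: prints the greppable summary line and returns
# non-zero when the run count is off or the shard was assigned no sections.
shard_guard() {
  local i=$1 n=$2 total=$3 ran=$4 expected
  expected=$(shard_expected "$i" "$n" "$total")
  echo "GCL_TEST_SHARD=$i/$n: ran $ran of $total sections (expected $expected)"
  if [ "$expected" -lt 1 ] || [ "$ran" -ne "$expected" ]; then
    echo "Bail out! shard $i/$n ran $ran, expected $expected" >&2
    return 1
  fi
}
```

With 57 sections and n=2, the odd indices fall to shard 1 and the even ones to shard 2, so the two shards are expected to run 29 and 28 sections respectively. In the real harness the total would be the final `SECTION_IDX` and the run count would be `SECTIONS_RUN`.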
+ +## Logging / observability (per engineering practices) +- Each sharded run logs a single greppable line at the verdict: + `GCL_TEST_SHARD=i/n: ran R of T sections (expected E)` — captured in the CI suite log + (`tee test-output/unit-suite.log`) and the uploaded artifact, so a future agent can + reconstruct which shard ran which tests. +- The partition guard's failure message is a loud `Bail out! shard i/n ran R, expected E` + → the step fails and the artifact (with the per-test `== Test N ==` headers, which + `section()` echoes for *every* test, run or skipped) shows exactly which tests landed + where. The per-shard CI job name (`… (unit, shard 1)`) makes a red attributable. + +## Phasing (implementation) +1. **`_harness.sh`:** add the `GCL_TEST_SHARD` parse/validate + `SECTION_IDX` + the + `section()` shard gate + the `selector_report` expected-count guard. Integration suite: + add `GCL_TEST_SHARD` to its note-and-ignore. +2. **Local union proof:** confirm (a) default (no shard) byte-identical — unit 315/0, + interop 141/0; (b) `GCL_TEST_SHARD=1/2` + `=2/2` run disjoint halves whose section sets + union to the full 57 and whose assertion counts sum to the unsharded 315; (c) the + expected-count guard fires on a deliberately-broken modulo; (d) malformed + `GCL_TEST_SHARD` bails; (e) `shellcheck -S style` clean. Also confirm `GCL_TEST_SHARD` + composes with `GCL_TEST_ONLY` (AND) and is orthogonal to `GCL_TEST_FULL`/`GCL_TEST_SWEEP`. +3. **`tests.yml`:** split the windows-unit cell into shard 1/2 (env + artifact name + job + name). `actionlint -shellcheck=` clean. +4. **CI verification:** dispatch `tests.yml`; confirm both Windows-unit shards are green, + each runs ~half (~halved wall-clock), artifact names are unique, and the full legs + (ubuntu/macos/windows-interop) are unchanged. +5. Commit incrementally under the lock; this ships with the ci-stress branch and lands on + `main` via the same merge PR. 
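The arithmetic behind the local union proof in step 2 can be checked without running the suite. The sketch below is an assumption-laden illustration: it simulates section indices rather than reading real run output, and the helper names `shard_members` and `partition_ok` are hypothetical. The actual proof would compare what real sharded runs report against an unsharded run.

```sh
# Hypothetical helper: print the 1-based section indices that shard i of n
# would run under the round-robin gate.
shard_members() {
  local i=$1 n=$2 total=$3 k
  for ((k = 1; k <= total; k++)); do
    if (( (k - 1) % n == i - 1 )); then echo "$k"; fi
  done
}

# Hypothetical check: concatenate every shard's members for a given n and
# compare against 1..total. Sorting without -u keeps duplicates visible, so
# an overlap or a dropped index both make the comparison fail.
partition_ok() {
  local n=$1 total=$2 i union full
  union=$(for ((i = 1; i <= n; i++)); do shard_members "$i" "$n" "$total"; done | sort -n)
  full=$(seq 1 "$total")
  [ "$union" = "$full" ]
}

for n in 2 3; do
  if partition_ok "$n" 57; then
    echo "n=$n: shards partition all 57 sections"
  else
    echo "n=$n: partition broken" >&2
  fi
done
```

This only demonstrates that the modulo rule yields a partition. It cannot detect a test that bypasses `section()`, which is why the plan pairs it with the per-shard count guard.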
+ +## Out of scope +- Sharding the interop/integration suites or the nightly/deep-sweep tiers (interop is not + the bottleneck; nightly is background). Notable only as a possible future `n>2` or + cross-OS extension. +- Cost-aware (greedy bin-packing) sharding — ~0% imbalance but needs a maintained per-test + cost table; round-robin's ~10% is sufficient and maintenance-free. +- Any product-code change. This is test-harness + CI only. From 849ed827e69dea16701d56f82be300c17432f299 Mon Sep 17 00:00:00 2001 From: Ben Toner Date: Thu, 18 Jun 2026 06:34:44 +1000 Subject: [PATCH 48/76] Plan: fold round-1 review of windows-shard subplan (3 reviewers) MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Round 1 = 2 fresh Claude reviewers + independent Codex. Two blocking defects in the original snippet + two Codex caught the Claude pair missed, all folded: - malformed input (empty component / leading-zero octal trap) now rejected by a single regex ^([1-9][0-9]*)/([1-9][0-9]*)$; - GCL_TEST_ONLY and GCL_TEST_SHARD made MUTUALLY EXCLUSIVE (simpler than AND + guard-fallback; no real use case combines them); - GCL_TEST_SHARD parsed LAZILY (first section() call) so the integration suite (no section() blocks) never parses/bails — it note-and-ignores; - union proof + per-shard logging use run-only PASS/FAIL signals, not the == Test N == headers (which section() prints before gating, so skipped too); - guard asserts expected>=1 (catches n>section-count) and its rationale reworded (catches a section()-coverage regression, not a correlated modulo bug). Confirm round (fresh reviewer + Codex) pending before declaring converged. 
Co-Authored-By: Claude Opus 4.8 (1M context) --- ...06-18-ci-stress-windows-unit-shard-plan.md | 311 ++++++++++-------- 1 file changed, 177 insertions(+), 134 deletions(-) diff --git a/.plans/2026-06-18-ci-stress-windows-unit-shard-plan.md b/.plans/2026-06-18-ci-stress-windows-unit-shard-plan.md index 1ada89d..dbbe1f2 100644 --- a/.plans/2026-06-18-ci-stress-windows-unit-shard-plan.md +++ b/.plans/2026-06-18-ci-stress-windows-unit-shard-plan.md @@ -1,69 +1,106 @@ # Subplan: split the Windows unit CI leg into parallel shards -Status: **PROPOSAL (Phase 2) — under review.** A small follow-on to the Bucket-6 CI -work, building on the `section()`/selector machinery (commit `4ee5899`) and the shared -`tests/_harness.sh` (`b8e2951`). No implementation until the review converges and Ben -gives the go. +Status: **PROPOSAL (Phase 2) — round-1 review folded; confirm round pending.** A small +follow-on to the Bucket-6 CI work, building on the `section()`/selector machinery (commit +`4ee5899`) and the shared `tests/_harness.sh` (`b8e2951`). No implementation until the review +converges and Ben gives the go. ## Review issues (record at top; do not renumber on resolution) -*(reviewers: add numbered findings here; resolutions noted inline)* + +**Round 1 (2026-06-18)** — 2 fresh Claude reviewers (correctness/coverage; CI/simplicity) + +independent Codex. Dispositions (all FIXED in the body below; a confirm round still follows): + +1. **[blocking — FIXED] Malformed `GCL_TEST_SHARD` not rejected → mid-suite crash.** The old + combined `case "$SHARD_I$SHARD_N"` digit check passed `1/`/`/2`/`/` (empty component), then + `[ "" -lt 1 ]`/`% ""` errored falsy under `set -uo pipefail` (no `set -e`) instead of + bailing. Codex also flagged **leading zeros** (`08/10`) as a bash-arithmetic **octal** trap. + **Fix:** validate with a single regex `^([1-9][0-9]*)/([1-9][0-9]*)$` (rejects empty, + non-digit, leading-zero, extra slashes in one shot), then the `i ≤ n` range check. +2. 
**[blocking — FIXED] Guard vs `GCL_TEST_ONLY` composition.** The plan advertised AND + semantics, but the exact-count guard ignored the selector → false bail; both Claude-A and + Codex flagged it. Codex offered the simpler resolution, adopted: **`GCL_TEST_ONLY` and + `GCL_TEST_SHARD` are now mutually exclusive** (bail if both set). There is no real use case + for combining them, and it removes the guard-fallback edge case entirely — the exact-count + guard then *always* applies in shard mode. +3. **[blocking (Codex, NEW) — FIXED] Eager parse bails the integration suite.** Parsing/bailing + `GCL_TEST_SHARD` at `_harness.sh` source-time runs for *all* suites, including integration + (which sources the harness before its note-and-ignore) — so malformed input would `exit 1` + integration instead of being ignored. **Fix: parse lazily** on the first `section()` call. + Integration never calls `section()`, so it neither parses nor bails; its note-and-ignore + just prints a notice if the var is set. +4. **[non-blocking (Codex, NEW) — FIXED] `== Test N ==` headers are NOT a run-set.** + `section()` echoes the header *before* gating, so skipped sections print one too. The union + proof / per-shard logging must use **run-only** signals (the `PASS:`/`FAIL:` lines, which a + skipped test never emits) — optionally a run-only `RAN:` marker for attribution. +5. **[FIXED] Guard must assert `expected ≥ 1`** — `n` > section-count (e.g. `58/58`) yields + `expected==0` which `0==0` would pass silently green. Also: the *existing* `selector_report` + zero-match guard is gated on `GCL_TEST_ONLY` non-empty, so it does NOT cover pure-shard mode + — the new guard's `expected ≥ 1` does. +6. **[FIXED] Unsharded runs stay byte-identical.** All shard logic gated on + `[ -n "$GCL_TEST_SHARD" ]`; the interop suite (shares `section()`/`selector_report`, never + sharded, on every leg) and unit-on-ubuntu/macos run exactly as today. +7. 
**[FIXED] Guard rationale reworded.** It catches a **`section()`-coverage regression** (a + test added *outside* the gate), NOT a "modulo bug" (a wrong `%` would be *correlated* between + `section()` and the guard). The union proof is a one-time implementation sanity check (n=2), + secondary to the by-construction guarantee. +8. **[FIXED] Job-count prose:** 4 test cells (+`lint`) = 5 jobs → 5 test cells (+`lint`) = 6 + jobs; well under the concurrency ceiling. + +Round-1 verdicts: Reviewer A *needs-changes (1,2)*; Codex *not-sound-yet (1,2,3)*; Reviewer B +*sound-to-implement*. All folded. **Confirm round (fresh reviewer) pending before declaring +converged.** --- ## Motivation The `windows-2025 unit` leg is the CI wall-clock bottleneck: a full reduced unit run is ~4m38s and the Windows leg is ~2× every other leg (interop ~100s, integration ~28s). A -measured run shows `sys` time > `user` time → the cost is **process-spawn overhead** on -the 2-core Windows runner (each test spawns `bash $LIB` many times), not compute. So -running the unit suite as **two parallel shards on two runners ~halves** that leg's -wall-clock and speeds up the per-PR dev-feedback loop. **CI-only** — local dev runs are -unaffected (sharding is opt-in via an env var, unset by default). +measured run shows `sys` time > `user` time → the cost is **process-spawn overhead** on the +2-core Windows runner (each test spawns `bash $LIB` many times), not compute. So running the +unit suite as **two parallel shards on two runners ~halves** that leg's wall-clock and speeds +the per-PR dev-feedback loop. **CI-only** — sharding is opt-in via an env var, unset by default, +so local dev runs are unaffected. ## Decision context -- **No branch protection** (Ben, 2026-06-18; single-dev project). So adding a matrix cell - has **zero required-context fallout** — no aggregator, no gating concern. `tests.yml` - reports per-cell contexts directly. 
-- The enabling work is already done: every unit test is a `section "Test N: …"`-gated - block, proven individually selectable with no cross-test ordering dependencies (the - `GCL_TEST_ONLY` selector work). A shard is just "run the subset of sections assigned to - me," which slots into the same `section()` gate. - -## Mechanism: `GCL_TEST_SHARD=i/n`, round-robin, inside `section()` -A new opt-in env var `GCL_TEST_SHARD=<i>/<n>` (e.g. `1/2`) read in `tests/_harness.sh` -alongside the existing `GCL_TAP`/`GCL_TEST_ONLY`/`GCL_TEST_SWEEP` reads. Implementation -(~10 lines in `_harness.sh`): - -- **A monotonic section index** `SECTION_IDX`, bumped in `section()` on **every** call - (every test, in file order), *independent of* whether the test runs. This is the stable - shard-assignment key — it does not depend on `GCL_TEST_ONLY`/`GCL_TEST_SWEEP`. -- **Parse + validate** `GCL_TEST_SHARD` once at suite top: split `i/n`; require `n` a - positive integer and `1 ≤ i ≤ n`; on malformed, **bail loudly** (`exit 1`) rather than - silently running all/none (same spirit as the zero-match guard). -- **Shard gate** in `section()`: a test runs iff `(SECTION_IDX-1) % n == (i-1)` - (round-robin). Composed with the existing `GCL_TEST_ONLY` gate by **AND** (both must - pass to run); `SECTIONS_RUN` still bumps only when the test actually runs. +- **No branch protection** (Ben, 2026-06-18; single-dev project). So adding a matrix cell has + **zero required-context fallout** — no aggregator, no gating concern; `tests.yml` reports + per-cell contexts directly. +- The enabling work is done: every unit test is a `section "Test N: …"`-gated block, proven + individually selectable with no cross-test ordering deps (the `GCL_TEST_ONLY` selector work). + A shard is just "run the subset of sections assigned to me," which slots into the same gate. + +## Mechanism: `GCL_TEST_SHARD=i/n`, round-robin, lazy-parsed in `section()` +A new opt-in env var `GCL_TEST_SHARD=<i>/<n>` (e.g. 
`1/2`) handled in `tests/_harness.sh`. +Key design choices (from review): **lazy parse** (so non-`section()` suites ignore it), +**mutually exclusive** with `GCL_TEST_ONLY`, **regex-validated** (rejects empty/non-digit/ +leading-zero). ~15 lines: ```sh -# in _harness.sh, near the GCL_* reads: +# declarations near the GCL_* reads (NO eager parse — keeps integration unaffected): GCL_TEST_SHARD="${GCL_TEST_SHARD:-}" -SHARD_I=0; SHARD_N=0; SECTION_IDX=0 -if [ -n "$GCL_TEST_SHARD" ]; then - case "$GCL_TEST_SHARD" in - */*) SHARD_I=${GCL_TEST_SHARD%/*}; SHARD_N=${GCL_TEST_SHARD#*/} ;; - *) echo "Bail out! GCL_TEST_SHARD must be i/n (got '$GCL_TEST_SHARD')" >&2; exit 1 ;; - esac - case "$SHARD_I$SHARD_N" in *[!0-9]*) echo "Bail out! GCL_TEST_SHARD i/n must be integers" >&2; exit 1 ;; esac - if [ "$SHARD_N" -lt 1 ] || [ "$SHARD_I" -lt 1 ] || [ "$SHARD_I" -gt "$SHARD_N" ]; then - echo "Bail out! GCL_TEST_SHARD=$GCL_TEST_SHARD out of range (need 1<=i<=n, n>=1)" >&2; exit 1 +SHARD_I=0; SHARD_N=0; SECTION_IDX=0; SHARD_PARSED=0 + +_shard_init() { # runs once, lazily, on the first section() call + SHARD_PARSED=1 + [ -z "$GCL_TEST_SHARD" ] && return 0 + if [ -n "${GCL_TEST_ONLY:-}" ]; then # mutually exclusive (review #2) + echo "Bail out! GCL_TEST_ONLY and GCL_TEST_SHARD are mutually exclusive" >&2; exit 1 + fi + if [[ "$GCL_TEST_SHARD" =~ ^([1-9][0-9]*)/([1-9][0-9]*)$ ]]; then # review #1 (no empty/zero/octal) + SHARD_I=${BASH_REMATCH[1]}; SHARD_N=${BASH_REMATCH[2]} + else + echo "Bail out! GCL_TEST_SHARD must be i/n positive integers (got '$GCL_TEST_SHARD')" >&2; exit 1 fi -fi + if [ "$SHARD_I" -gt "$SHARD_N" ]; then + echo "Bail out! 
GCL_TEST_SHARD=$GCL_TEST_SHARD out of range (need i<=n)" >&2; exit 1 + fi +} section() { - SECTION_IDX=$((SECTION_IDX + 1)) + [ "$SHARD_PARSED" = 1 ] || _shard_init # lazy: only suites that call section() parse + SECTION_IDX=$((SECTION_IDX + 1)) # file-order index, bumped for EVERY test echo "== $1 ==" - # GCL_TEST_ONLY gate (unchanged) if [ -n "${GCL_TEST_ONLY:-}" ] && ! [[ "$1" =~ $GCL_TEST_ONLY ]]; then return 1; fi - # GCL_TEST_SHARD gate (round-robin partition) if [ -n "$GCL_TEST_SHARD" ] && [ $(( (SECTION_IDX - 1) % SHARD_N )) -ne $(( SHARD_I - 1 )) ]; then return 1 fi @@ -71,106 +108,112 @@ section() { } ``` +(`SECTION_IDX` bumps unconditionally in file order — independent of `GCL_TEST_ONLY`/ +`GCL_TEST_SWEEP`/`GCL_TEST_FULL` — so it is the stable shard-assignment key.) + ## Why round-robin (alternatives rejected) -- **Round-robin by index (CHOSEN):** auto-balancing and **zero-maintenance** — new tests - distribute themselves; nothing to hand-edit. Measured imbalance ~10% at n=2 (well within - "roughly halve"). The heavy tests (Test 22 ~34s, 25, 1, 31, 33, 21, 2b, 17d) are - scattered through the file, so interleaving balances them naturally. -- **Contiguous halves:** ~17%+ imbalance (worse, because the heavy tests aren't evenly - placed) and still needs the same machinery. Rejected. -- **Two explicit `GCL_TEST_ONLY` regex lists in the matrix:** works today with no code, but - **fails the maintainability bar** — a new test that matches neither list silently runs in - *no* shard (a coverage hole). Rejected for the standing config. -- **Splitting the file:** duplicates shared `clone_fn`/fixtures, doubles shellcheck - entries. Rejected. +- **Round-robin by index (CHOSEN):** auto-balancing, **zero-maintenance** — new tests + distribute themselves. Measured imbalance ~10% at n=2 (well within "roughly halve"); the + heavy tests (Test 22 ~34s, 25, 1, 31, 33, 21, 2b, 17d) are scattered, so interleaving + balances them. 
+- **Contiguous halves:** ~17%+ imbalance (heavy tests unevenly placed), same machinery. Rejected. +- **Two explicit `GCL_TEST_ONLY` regex lists in the matrix:** a new test matching neither list + silently runs in no shard (coverage hole). Rejected. +- **Splitting the file:** duplicates shared `clone_fn`/fixtures, doubles shellcheck entries. Rejected. ## Coverage safety (the cardinal risk + the guarantee) -The risk: a shard scheme that drops a test reads as green → silent coverage hole. +The risk: a shard scheme that drops a test reads green → silent coverage hole. -- **Primary guarantee — partition by construction.** Round-robin over a single stable - ordering (`SECTION_IDX` in file order) assigns every section index to **exactly one** - residue class. So for any `n`, the shards are a true partition: union == full suite, no - overlap, no drops — *by construction*, as long as every test goes through `section()` - (all 57 do). +- **Primary guarantee — partition by construction.** Round-robin over the single stable + `SECTION_IDX` ordering assigns every section index to **exactly one** residue class. For any + `n`, the shards are a true partition (union == full, no overlap, no drops) — by construction, + as long as every test goes through `section()` (all 57 do). - **Self-contained per-shard guard (belt-and-suspenders).** In the suite verdict (extend - `selector_report` in `_harness.sh`), when `GCL_TEST_SHARD` is set, compute the - **expected** run-count from the totals the shard already has — - `expected = number of k in 1..SECTION_IDX with (k-1)%n == (i-1)` — and assert - `SECTIONS_RUN == expected`; **bail loudly** otherwise. This catches a modulo bug or a - `section()` regression *within a single shard* (no cross-job artifacts needed). It does - not need an unsharded baseline (each shard sees all `SECTION_IDX` section calls). 
-- **Existing guards still apply per shard:** the `finish`/`DONE` sentinel (a shard that - dies early bails), the `1..$TAPN` plan line (partial-but-correct per shard), and the - zero-match-style guard (a shard that legitimately runs 0 sections — only possible when - `n` > section count — is a misconfiguration and bails). -- **Local union proof (build phase, one-time):** run all `n` shards for `n∈{2,3}` and - assert the concatenation of run-section labels equals the unsharded run's set, with no - duplicates. This validates the implementation before wiring CI. (Belt-and-suspenders on - top of the by-construction argument.) + `selector_report`), when `GCL_TEST_SHARD` is set, compute the **expected** run-count the + shard already has all the info for — `expected = #{k in 1..SECTION_IDX : (k-1)%n == (i-1)}` + — and assert `SECTIONS_RUN == expected` **and `expected ≥ 1`**; **bail loudly** otherwise. + Because `GCL_TEST_ONLY` and `GCL_TEST_SHARD` are mutually exclusive, this exact-count assert + *always* applies in shard mode (no selector-composition fallback needed). **What it actually + catches:** a **`section()`-coverage regression** — a test added *outside* the `section()` + gate, so it stops bumping `SECTION_IDX` (NOT a "modulo bug": a wrong `%` would be *correlated* + between `section()` and this guard, which recomputes the same arithmetic). No cross-job + artifacts, no unsharded baseline (each shard sees all `SECTION_IDX` calls). The `expected ≥ 1` + clause also catches the `n` > section-count misconfiguration (e.g. `58/58`). +- **Existing guards still apply per shard:** the `finish`/`DONE` sentinel (a shard that dies + early bails) and the `1..$TAPN` plan line (partial-but-correct per shard). Note the *existing* + `selector_report` zero-match guard is gated on `GCL_TEST_ONLY` non-empty, so it does NOT fire + in pure-shard mode — the new `expected ≥ 1` clause is what covers an empty shard. 
+- **Local union proof (one-time implementation sanity check; secondary to the by-construction + guarantee).** Once during implementation, run `GCL_TEST_SHARD=1/2` and `=2/2` and assert their + **`PASS:`/`FAIL:` line sets** (run-only — a *skipped* test emits none; the `== Test N ==` + headers do NOT work here because `section()` prints them before gating) union to the full + unsharded set with no duplicates. Not a standing CI step. ## Interaction with existing machinery -- **`GCL_TEST_ONLY` + `GCL_TEST_SHARD`:** AND semantics (run iff selected *and* in-shard). - Independent gates; `SECTION_IDX` counts all sections regardless, so a sharded selector - run is well-defined. -- **`GCL_TEST_FULL` / reduced:** sharding is orthogonal — it partitions *which* sections - run, not *how* each runs. The per-shard expected-count guard uses the shard's own - `SECTION_IDX` total, which is identical full vs reduced (same 57 sections), so the guard - is mode-independent. -- **`GCL_TEST_SWEEP` (Axis-A):** orthogonal — a sharded run still sweeps the Axis-A tests - *that land in its shard*. Fine for nightly (not sharded; see scope) and harmless if ever - combined. -- **Integration suite:** has no `section()`-wrapped blocks (one indivisible scenario) and - already note-and-ignores `GCL_TEST_ONLY`; it must **note-and-ignore `GCL_TEST_SHARD`** - the same way (loud stderr note, run the whole scenario). Add `GCL_TEST_SHARD` to that note. +- **`GCL_TEST_ONLY` vs `GCL_TEST_SHARD`: mutually exclusive** (bail if both set). No real use + case combines them, and exclusivity removes the guard's hardest edge case. +- **`GCL_TEST_FULL` / reduced:** orthogonal — sharding partitions *which* sections run, not + *how*. The `SECTION_IDX` total (57) is identical full vs reduced, so the partition + guard are + mode-independent. +- **`GCL_TEST_SWEEP` (Axis-A):** orthogonal — a sharded run still sweeps the Axis-A tests in its + shard. (Not combined in CI; harmless if ever combined.) 
+- **Integration suite:** has no `section()`-wrapped blocks (one indivisible scenario). With + **lazy parse**, it never calls `section()` → never parses/bails `GCL_TEST_SHARD`. It should + **note-and-ignore** the var the same way it does `GCL_TEST_ONLY` (loud stderr note if set, + *without* parsing), using the harness-initialized `GCL_TEST_SHARD` (pre-set `""` so no + `set -u` trap). +- **Unsharded runs stay byte-identical.** All shard logic is gated on `[ -n "$GCL_TEST_SHARD" ]`, + so the interop suite (shares the helpers, never sharded — every leg) and unit-on-ubuntu/macos + (`leg: all`, full) run exactly as today. ## CI wiring (`.github/workflows/tests.yml`) — Windows unit only -- Replace the single `{ os: windows-2025, leg: unit, job_timeout: 20 }` matrix cell with - **two** cells carrying `shard: 1` / `shard: 2` (same `job_timeout`, or slightly lower - since each runs ~half — keep generous to avoid flakiness; a half-run finishes well within - 20 min). -- The Unit-suite step sets `GCL_TEST_SHARD: ${{ matrix.shard && format('{0}/2', matrix.shard) || '' }}` (unset on cells without a `shard:` key, so ubuntu/macos `leg: all` and the windows interop-integration cell run the **full** unit suite unchanged). -- **Artifact name** must include the shard (`test-logs-${{ matrix.os }}-${{ matrix.leg }}${{ matrix.shard && format('-{0}', matrix.shard) || '' }}`) — v4+ rejects duplicate artifact names. -- The job-name template already includes `leg`; extend it to include the shard so the two - Windows-unit jobs are distinguishable in the checks list. -- **Scope:** Windows unit **only**. Do **not** shard: the fast legs (interop ~100s, - integration ~28s, all of ubuntu/macos — not bottlenecks), `nightly.yml` (background, not - dev-blocking; optional future), or the **kcov** job (coverage needs the whole suite in - one process — sharding would break it). 
-- **Runner budget:** today's matrix is ~5 jobs (3 OS legs split into 4 + lint); going to 5 - test jobs + lint is well under GitHub's concurrency ceiling — no queueing. +- Replace the single `{ os: windows-2025, leg: unit, job_timeout: 20 }` cell with **two** cells + carrying `shard: 1` / `shard: 2` (same `job_timeout`; keep the existing step timeout — a + half-run finishes well within it; generous-over-tight matches the repo's "backstop only" + philosophy and avoids flakiness). +- The Unit-suite step sets `GCL_TEST_SHARD: ${{ matrix.shard && format('{0}/2', matrix.shard) || '' }}` — yields `1/2`/`2/2` on the shard cells and `''` (effectively unset, per the harness's `${GCL_TEST_SHARD:-}`) on every other cell, so ubuntu/macos `leg: all` and the windows interop-integration cell run the **full** unit suite unchanged. (`/2` is hardcoded; the harness is `n`-generic, so only this one CI string ties to 2 — easy to extend later. NB GHA treats `0` as falsy, so keep shard indices 1-based.) +- **Artifact name** gains the shard: `test-logs-${{ matrix.os }}-${{ matrix.leg }}${{ matrix.shard && format('-{0}', matrix.shard) || '' }}` → `…-unit-1`/`…-unit-2` (v4+ rejects duplicate names); other cells' names are byte-identical to today. +- The job-name template (already includes `leg`) gains the shard so the two unit jobs are distinguishable. +- **Scope:** Windows unit **only**. Do NOT shard the fast legs (interop, integration, all of + ubuntu/macos), `nightly.yml` (background, not dev-blocking; optional future), or the **kcov** + job (coverage needs the whole suite in one process — sharding would break it). +- **Runner budget:** 4 test cells + `lint` = 5 jobs today → 5 test cells + `lint` = 6 jobs; + well under GitHub's concurrency ceiling — no queueing. 
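A minimal sketch of the per-shard expected-count guard described under "Coverage safety", written as an illustration rather than the harness's actual code. Only `GCL_TEST_SHARD`, `SECTION_IDX`, `SECTIONS_RUN`, and the two message formats come from the plan; the function names `shard_expected_count` and `shard_guard` are hypothetical, and the sketch assumes the `i/n` spec has already been validated (`1 <= i`, `n >= 1`).

```shell
# Hypothetical helpers, not the real _harness.sh code.

# Print how many section indices in 1..total land in shard i of n,
# i.e. #{k in 1..total : (k-1) % n == (i-1)}.
shard_expected_count() {
  local i="$1" n="$2" total="$3"
  local k=1 count=0
  while [ "$k" -le "$total" ]; do
    if [ $(( (k - 1) % n )) -eq $(( i - 1 )) ]; then
      count=$((count + 1))
    fi
    k=$((k + 1))
  done
  echo "$count"
}

# Args: shard spec "i/n", total sections seen (SECTION_IDX), sections run
# (SECTIONS_RUN). Prints the verdict line; returns 1 with a bail-out message
# if the run count is not the expected one or the shard is empty.
shard_guard() {
  local spec="$1" total="$2" ran="$3"
  local i="${spec%/*}" n="${spec#*/}"
  local expected
  expected=$(shard_expected_count "$i" "$n" "$total")
  echo "GCL_TEST_SHARD=$spec: ran $ran of $total sections (expected $expected)"
  if [ "$expected" -lt 1 ] || [ "$ran" -ne "$expected" ]; then
    echo "Bail out! shard $spec ran $ran, expected $expected"
    return 1
  fi
}
```

With 57 sections and `n=2`, shard 1 takes the odd indices and shard 2 the even ones, so the two expected counts sum to the full suite. A spec such as `58/58` yields an expected count of 0 and trips the `expected >= 1` clause.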
## Logging / observability (per engineering practices) -- Each sharded run logs a single greppable line at the verdict: - `GCL_TEST_SHARD=i/n: ran R of T sections (expected E)` — captured in the CI suite log - (`tee test-output/unit-suite.log`) and the uploaded artifact, so a future agent can - reconstruct which shard ran which tests. -- The partition guard's failure message is a loud `Bail out! shard i/n ran R, expected E` - → the step fails and the artifact (with the per-test `== Test N ==` headers, which - `section()` echoes for *every* test, run or skipped) shows exactly which tests landed - where. The per-shard CI job name (`… (unit, shard 1)`) makes a red attributable. +- Each sharded run logs one greppable verdict line: `GCL_TEST_SHARD=i/n: ran R of T sections + (expected E)` — captured in the CI suite log (`tee … unit-suite.log`) and the uploaded + artifact, so a future agent can reconstruct which shard ran what. +- For per-test attribution, `section()` emits a **run-only** marker (e.g. `RAN: