feat(buzz-acp): elect a single leader instance to gate autonomous prompting by wpfleger96 · Pull Request #1062 · block/buzz

wpfleger96 · 2026-06-15T22:16:18Z

Summary

When multiple Buzz dev instances share an agent keypair, the relay fans every matching event out to all of them (NIP-01) and each instance prompts its agent — duplicate replies for a single @mention. This adds a per-agent-key leader lock so only the leader instance acts autonomously on the wire. Non-leaders still receive and render events (the queue stays ungated for UI) but suppress every path that would emit under the shared identity.

This PR carries the NIP-LE specification (docs/nips/NIP-LE.md), the read side (every instance derives leadership from the lock file), and the writer + auto-claim lifecycle (the harness self-elects on launch, fails over on leader death, releases on shutdown). The result is a working headless election: under one shared key, the first instance up leads and every other instance observes, with no UI. The desktop claim/steal UI is a separate stacked PR.

The invariant

A single agent identity may have N subscribed instances but exactly one prompter (the leader). Non-leaders suppress all three surfaces that would otherwise act under the shared key:

(a) the prompt/dispatch path — non-leaders promote nothing queued to a prompt.
(b) the pre-dispatch 👀 reaction — fires at queue-acceptance time, before dispatch, so the dispatch gate alone would let every sharing instance emit a redundant 👀.
(c) the autonomous heartbeat prompt path — the periodic self-prompt that, when enabled, has the agent buzz messages send / buzz workflows approve on its own; non-leaders must not fire it, or N instances each act for the same key.

Each surface maps 1:1 to a gate in the code. docs/nips/NIP-LE.md is the normative spec for this invariant and the lock contract.

How leadership is decided (read side)

Every instance reads ~/.buzz/leader-locks/<pubkey-hex>.lock and resolves:

absent lock → leader (solo dev is unaffected — no lock, no suppression)
present + instance_id matches this process's election id → leader
present + foreign instance_id → observer
no election id → leader (solo CLI with the env unset)
unreadable or malformed → fail safe to leader: any IO error (permission, mid-write truncation) or parse failure resolves to leader, so a corrupt lock never silences the only responder

Status is cached per agent key and refreshed in place by refresh(). Leadership is read-derived, never acquire-derived: is_leader independently reads the file, so an acquire that fails open on an IO error can never force a process to lead next to a live foreign leader.

How a leader is elected (writer + auto-claim)

The harness self-mints a process-unique election id ({pid}-{nanos}) when BUZZ_INSTANCE_ELECTION_ID is unset, and honors the env verbatim if set (future desktop per-spawn injection). The same id is both written into the lock and used by the read check, so a process's writes match its own reads. This needs zero desktop diff — Phase 2 is pure buzz-acp Rust, headlessly testable with two buzz-acp processes.

Startup — acquire(agent_pubkey): take an exclusive flock, then read-decide-write under the held lock so the read→decide→write TOCTOU window is closed. The lock is takeable iff free (absent / empty / malformed), already ours, the owner's pid is dead (kill(pid, None) → ESRCH), or the owner's claimed_at is stale (older than STALE_CLAIM = 10s = 2× the refresh). On success it writes {instance_id, pid, claimed_at}; a live, fresh foreign owner → observe.
Failover — no new timer. The existing 5s leader_refresh tick calls acquire before refresh, so a leader re-stamps its own claimed_at every tick (its claim is always fresher than the bound), and a survivor re-reads a dead-or-stale lock and takes over within 5s.
Shutdown — release empties the lock file (never unlinks → no detached-inode race) only if we still own it, so a co-located sibling takes over immediately and a successor's claim is never stomped.

Dead-pid vs recycled-pid. The ESRCH fast path catches the common crash-without-release case instantly. The claimed_at staleness arm is the backstop for the rarer case where the OS recycles the crashed leader's pid onto an unrelated live process before any survivor ticks: the recycled pid reads as alive, but with no live leader re-stamping the claim it ages past the bound and becomes takeable — closing what would otherwise be a permanent leaderless wedge. The only owner the staleness arm can evict is a live-pid process that has missed two consecutive refresh ticks, which requires a full-runtime stall (turns run on a spawned pool, so a normal heavy turn never blocks the refresh) — in which state the leader is non-functional and takeover is correct, self-healing to a transient duplicate at worst.

Portability

The writer is #[cfg(unix)] (flock is Unix); #[cfg(not(unix))] is an always-leader no-op acquire/release, mirroring the existing kill_process_group cfg pattern. Desktop targets macOS/Linux; Windows is not a leader-election target. The flock feature is added to the existing cfg(unix) nix dependency; claimed_at parsing reuses the already-present chrono workspace dep (Cargo.lock unchanged).

Election identity

The lock keys on a per-process election identity behind a single named constant (ELECTION_ID_ENV = BUZZ_INSTANCE_ELECTION_ID). It is deliberately not the Tauri bundle identifier (BUZZ_MANAGED_AGENT): the bundle id collides across same-class windows (DMG + dev share xyz.block.buzz.app(.dev); a worktree falls back to the shared dev id if swift generate-dev-icon fails), which would let two windows both match the lock and both lead. Reaper identity partitions by app-class; election identity must be unique per process.

Scope boundaries

In this PR: NIP-LE spec, read-side leadership, the flock-guarded writer (acquire/release), dead-pid + stale-claim failover, auto-claim-on-launch wired into the harness lifecycle.
Out of scope (stacked PR): the desktop/Tauri sidebar claim/steal UI and the explicit hard-steal gesture. Phase 2 leaves the steal mechanism reachable but wires no user trigger. queue.push() gating is left untouched so non-leader UI still renders.

What changed

docs/nips/NIP-LE.md (new) — the NIP-LE spec: the exactly-one-prompter invariant (surfaces a/b/c), the local-filesystem lock contract, claim semantics (auto-on-launch acquire-if-unowned plus explicit re-claim), the flock TOCTOU guard, and dead-pid + stale-claim failover.
crates/buzz-acp/src/leader.rs — the LeaderCheck trait and FileLeaderCheck: the read side (above), plus the writer half — acquire/release, lock_is_takeable (ours || dead-pid || stale), pid_is_alive (ESRCH-only-dead; EPERM → alive so a live leader is never stolen from), claim_is_stale, and from_env_or_mint (honor env if set, else mint {pid}-{nanos}).
crates/buzz-acp/src/pool.rs — PromptContext carries leader: Arc<dyn LeaderCheck>.
crates/buzz-acp/src/lib.rs — constructs from_env_or_mint; auto-acquire on startup; re-acquire before refresh in the 5s leader_refresh tick; release first in shutdown teardown; gates (a) dispatch_pending, (b) the queue-push 👀 reaction_add, (c) dispatch_heartbeat.

Tests

Twenty-one unit tests, all behavior-focused.

Reader (6): absent → leader, self-owned → leader, foreign-owned → observer, malformed → fail-safe leader, no-election-id → leader even with a foreign lock, and refresh flips status when the lock changes. Ids are opaque per-window strings (window-a/window-b) so the suite can't enshrine bundle-id == election-id.

Writer (11): unowned-acquire writes self, creates the lock dir when absent, two-writer race yields exactly one winner, dead-pid lock is taken over, live+fresh foreign lock blocks acquire, recycled-pid-with-stale-claim is taken over (the failover-stall reproduction — fails without the staleness arm), reacquire-own is idempotent, release frees the lock for another writer, release does not stomp a successor, no-election-id acquire is a no-op (always leader), and an 8-thread barrier-synced concurrent acquire yields exactly one winner (proving flock serialization, not just the happy path).

Gates (4): a leader/non-leader pair over the heartbeat gate (c) — a non-leader tick claims no agent and never marks in-flight, a companion leader tick claims an idle agent and fires; and a leader/non-leader pair over the dispatch gate (a) — a non-leader promotes nothing queued, a leader promotes one queued event — proving each negative test exercises its gate, not a dead path.

cargo test -p buzz-acp → 330 passed. cargo fmt --check / cargo clippy --all-targets clean.

Stack: this PR → claim/steal UI (stacked).

When multiple Buzz dev instances share an agent keypair, the relay fans each mention out to every connected buzz-acp (NIP-01) and all of them respond. Add a per-agent-key leader lock so only the leader instance promotes queued events to prompts; non-leaders keep their queue populated for UI rendering but stay silent. Reads ~/.buzz/leader-locks/<pubkey>.lock: absent or self-owned means leader, foreign-owned means observer, malformed fails safe to leader so a corrupt lock never silences the only responder. Solo dev has no lock file, so behavior is unchanged. The 👀 reaction fires at queue-push time before dispatch, so it is gated separately from dispatch_pending. The election identity is sourced from BUZZ_INSTANCE_ELECTION_ID behind a single named constant. It is deliberately NOT the Tauri bundle identifier (BUZZ_MANAGED_AGENT): that collides across same-class windows (DMG + dev, or worktrees whose icon-gen fell back to the shared dev id), which would let two windows both lead. Phase 2 writes a process-unique value; Phase 1 reads whatever that env carries and defaults to leader when it is unset. Phase 1 of 3: read side only. Lock acquire/steal/failover (Phase 2) and claim UI (Phase 3) follow. Co-authored-by: Will Pfleger <pfleger.will@gmail.com> Signed-off-by: Will Pfleger <pfleger.will@gmail.com>

run_prompt_task has a second caller besides dispatch_pending: dispatch_heartbeat. The heartbeat prompt acts autonomously on the wire (sends messages, approves workflows), so an ungated heartbeat let every non-leader instance sharing one agent key self-prompt and act — the same duplicate-actor bug the dispatch gate prevents, arriving through the heartbeat door. Gating before pool.try_claim closes it; solo dev is unaffected since an absent lock means leader. Co-authored-by: Will Pfleger <pfleger.will@gmail.com> Signed-off-by: Will Pfleger <pfleger.will@gmail.com>

Folds the NIP-LE spec into PR #1062 alongside the implementation so the spec and the code it describes ship together. Reconciles one wording gap against the shipped code: the draft's invariant named two suppression surfaces for non-leaders (dispatch path, 👀 reaction) but the implementation gates a third — the autonomous heartbeat prompt path (dispatch_heartbeat). Adds bullet (c) so all three suppression surfaces in the spec map to a gate in the code. Co-authored-by: Will Pfleger <pfleger.will@gmail.com> Signed-off-by: Will Pfleger <pfleger.will@gmail.com>

…ument fail-safe scope The primary suppression surface — dispatch_pending — had no test; only the heartbeat gate was covered, so a refactor of the shared leader gate could silently ship a non-leader that still dispatches. Add the missing leader/non-leader test pair. Read-side mutex locks now read through poison rather than panic the event loop, removing a latent footgun for when Phase 2 extends the critical sections. Module-doc and NIP-LE now state that any IO error (not just an absent lock) fails safe to leader, that this can produce a bounded ≤5s transient duplicate during a concurrent rewrite, and that pid/claimed_at are read-side-ignored. Co-authored-by: Will Pfleger <wpfleger@block.xyz> Signed-off-by: Will Pfleger <wpfleger@block.xyz>

…ailover The read side elected nobody: with no writer, every instance read an absent lock and defaulted to leader, so a shared agent key produced duplicate responders. Add the writer half so one instance actually wins. The harness self-mints a process-unique election id (from_env_or_mint), acquires the lock on startup, re-attempts the claim on the existing 5s refresh tick (failover when a leader dies without releasing), and releases on graceful shutdown. acquire holds an exclusive flock across the read-decide-write window to close the TOCTOU race; a dead-pid lock is reclaimable, a live foreign lock is not. release empties the file rather than unlinking it to avoid racing a fresh claim onto a detached inode. Failover tolerates pid recycling: a crashed leader's pid may be reused by an unrelated live process, which a bare pid-liveness probe would mistake for the original leader and never take over, wedging the agent leaderless. A claim is takeable when its pid is dead OR its claimed_at is older than a staleness bound (2x the 5s refresh); a live leader rewrites claimed_at every tick, so only an abandoned claim ages out, without evicting an active leader. Writer is Unix-only (flock); non-Unix gets an always-leader no-op, mirroring the kill_process_group cfg fallback. NIP-LE amended to final state: claim is auto-on-launch plus explicit re-claim. Co-authored-by: Will Pfleger <pfleger.will@gmail.com> Signed-off-by: Will Pfleger <pfleger.will@gmail.com>

…tive (#1076) Signed-off-by: Will Pfleger <pfleger.will@gmail.com> Co-authored-by: npub1mn7jgtj4w2pd0g0zeuhxsa6jy6p0rewxz4kujt98my82ahfmp72sxjexk7 <dcfd242e557282d7a1e2cf2e6877522682f1e5c6156dc92ca7d90eaedd3b0f95@sprout-oss.stage.blox.sqprod.co>

wpfleger96 force-pushed the duncan/leader-election-core branch from c6496b8 to 09422e6 Compare June 15, 2026 22:18

npub1mn7jgtj4w2pd0g0zeuhxsa6jy6p0rewxz4kujt98my82ahfmp72sxjexk7 and others added 2 commits June 15, 2026 18:29

wpfleger96 mentioned this pull request Jun 15, 2026

docs(nips): add NIP-LE leader election draft #1063

Closed

wpfleger96 marked this pull request as draft June 15, 2026 22:53

npub1mn7jgtj4w2pd0g0zeuhxsa6jy6p0rewxz4kujt98my82ahfmp72sxjexk7 and others added 2 commits June 16, 2026 15:31

wpfleger96 force-pushed the duncan/leader-election-core branch from f00829b to 45a6de5 Compare June 16, 2026 21:16

wpfleger96 changed the title ~~feat(buzz-acp): gate prompting on client-side leader check~~ feat(buzz-acp): elect a single leader instance to gate autonomous prompting Jun 16, 2026

wpfleger96 mentioned this pull request Jun 16, 2026

feat(buzz-acp): add cooperative leadership steal via stand-down primitive #1076

Merged

wpfleger96 mentioned this pull request Jun 17, 2026

feat(desktop): surface managed-agent leadership and cooperative steal #1078

Draft

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(buzz-acp): elect a single leader instance to gate autonomous prompting#1062

feat(buzz-acp): elect a single leader instance to gate autonomous prompting#1062
wpfleger96 wants to merge 6 commits into
mainfrom
duncan/leader-election-core

wpfleger96 commented Jun 15, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

wpfleger96 commented Jun 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

The invariant

How leadership is decided (read side)

How a leader is elected (writer + auto-claim)

Portability

Election identity

Scope boundaries

What changed

Tests

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

wpfleger96 commented Jun 15, 2026 •

edited

Loading