feat(buzz-acp): elect a single leader instance to gate autonomous prompting#1062
Draft
wpfleger96 wants to merge 6 commits into
Draft
feat(buzz-acp): elect a single leader instance to gate autonomous prompting#1062wpfleger96 wants to merge 6 commits into
wpfleger96 wants to merge 6 commits into
Conversation
When multiple Buzz dev instances share an agent keypair, the relay fans each mention out to every connected buzz-acp (NIP-01) and all of them respond. Add a per-agent-key leader lock so only the leader instance promotes queued events to prompts; non-leaders keep their queue populated for UI rendering but stay silent. Reads ~/.buzz/leader-locks/<pubkey>.lock: absent or self-owned means leader, foreign-owned means observer, malformed fails safe to leader so a corrupt lock never silences the only responder. Solo dev has no lock file, so behavior is unchanged. The 👀 reaction fires at queue-push time before dispatch, so it is gated separately from dispatch_pending. The election identity is sourced from BUZZ_INSTANCE_ELECTION_ID behind a single named constant. It is deliberately NOT the Tauri bundle identifier (BUZZ_MANAGED_AGENT): that collides across same-class windows (DMG + dev, or worktrees whose icon-gen fell back to the shared dev id), which would let two windows both lead. Phase 2 writes a process-unique value; Phase 1 reads whatever that env carries and defaults to leader when it is unset. Phase 1 of 3: read side only. Lock acquire/steal/failover (Phase 2) and claim UI (Phase 3) follow. Co-authored-by: Will Pfleger <pfleger.will@gmail.com> Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
c6496b8 to
09422e6
Compare
run_prompt_task has a second caller besides dispatch_pending: dispatch_heartbeat. The heartbeat prompt acts autonomously on the wire (sends messages, approves workflows), so an ungated heartbeat let every non-leader instance sharing one agent key self-prompt and act — the same duplicate-actor bug the dispatch gate prevents, arriving through the heartbeat door. Gating before pool.try_claim closes it; solo dev is unaffected since an absent lock means leader. Co-authored-by: Will Pfleger <pfleger.will@gmail.com> Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
Folds the NIP-LE spec into PR #1062 alongside the implementation so the spec and the code it describes ship together. Reconciles one wording gap against the shipped code: the draft's invariant named two suppression surfaces for non-leaders (dispatch path, 👀 reaction) but the implementation gates a third — the autonomous heartbeat prompt path (dispatch_heartbeat). Adds bullet (c) so all three suppression surfaces in the spec map to a gate in the code. Co-authored-by: Will Pfleger <pfleger.will@gmail.com> Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
…ument fail-safe scope The primary suppression surface — dispatch_pending — had no test; only the heartbeat gate was covered, so a refactor of the shared leader gate could silently ship a non-leader that still dispatches. Add the missing leader/non-leader test pair. Read-side mutex locks now read through poison rather than panic the event loop, removing a latent footgun for when Phase 2 extends the critical sections. Module-doc and NIP-LE now state that any IO error (not just an absent lock) fails safe to leader, that this can produce a bounded ≤5s transient duplicate during a concurrent rewrite, and that pid/claimed_at are read-side-ignored. Co-authored-by: Will Pfleger <wpfleger@block.xyz> Signed-off-by: Will Pfleger <wpfleger@block.xyz>
…ailover The read side elected nobody: with no writer, every instance read an absent lock and defaulted to leader, so a shared agent key produced duplicate responders. Add the writer half so one instance actually wins. The harness self-mints a process-unique election id (from_env_or_mint), acquires the lock on startup, re-attempts the claim on the existing 5s refresh tick (failover when a leader dies without releasing), and releases on graceful shutdown. acquire holds an exclusive flock across the read-decide-write window to close the TOCTOU race; a dead-pid lock is reclaimable, a live foreign lock is not. release empties the file rather than unlinking it to avoid racing a fresh claim onto a detached inode. Failover tolerates pid recycling: a crashed leader's pid may be reused by an unrelated live process, which a bare pid-liveness probe would mistake for the original leader and never take over, wedging the agent leaderless. A claim is takeable when its pid is dead OR its claimed_at is older than a staleness bound (2x the 5s refresh); a live leader rewrites claimed_at every tick, so only an abandoned claim ages out, without evicting an active leader. Writer is Unix-only (flock); non-Unix gets an always-leader no-op, mirroring the kill_process_group cfg fallback. NIP-LE amended to final state: claim is auto-on-launch plus explicit re-claim. Co-authored-by: Will Pfleger <pfleger.will@gmail.com> Signed-off-by: Will Pfleger <pfleger.will@gmail.com>
f00829b to
45a6de5
Compare
…tive (#1076) Signed-off-by: Will Pfleger <pfleger.will@gmail.com> Co-authored-by: npub1mn7jgtj4w2pd0g0zeuhxsa6jy6p0rewxz4kujt98my82ahfmp72sxjexk7 <dcfd242e557282d7a1e2cf2e6877522682f1e5c6156dc92ca7d90eaedd3b0f95@sprout-oss.stage.blox.sqprod.co>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
When multiple Buzz dev instances share an agent keypair, the relay fans every matching event out to all of them (NIP-01) and each instance prompts its agent — duplicate replies for a single
@mention. This adds a per-agent-key leader lock so only the leader instance acts autonomously on the wire. Non-leaders still receive and render events (the queue stays ungated for UI) but suppress every path that would emit under the shared identity.This PR carries the NIP-LE specification (
docs/nips/NIP-LE.md), the read side (every instance derives leadership from the lock file), and the writer + auto-claim lifecycle (the harness self-elects on launch, fails over on leader death, releases on shutdown). The result is a working headless election: under one shared key, the first instance up leads and every other instance observes, with no UI. The desktop claim/steal UI is a separate stacked PR.The invariant
A single agent identity may have N subscribed instances but exactly one prompter (the leader). Non-leaders suppress all three surfaces that would otherwise act under the shared key:
👀reaction — fires at queue-acceptance time, before dispatch, so the dispatch gate alone would let every sharing instance emit a redundant 👀.buzz messages send/buzz workflows approveon its own; non-leaders must not fire it, or N instances each act for the same key.Each surface maps 1:1 to a gate in the code.
docs/nips/NIP-LE.mdis the normative spec for this invariant and the lock contract.How leadership is decided (read side)
Every instance reads
~/.buzz/leader-locks/<pubkey-hex>.lockand resolves:instance_idmatches this process's election id → leaderinstance_id→ observerStatus is cached per agent key and refreshed in place by
refresh(). Leadership is read-derived, never acquire-derived:is_leaderindependently reads the file, so an acquire that fails open on an IO error can never force a process to lead next to a live foreign leader.How a leader is elected (writer + auto-claim)
The harness self-mints a process-unique election id (
{pid}-{nanos}) whenBUZZ_INSTANCE_ELECTION_IDis unset, and honors the env verbatim if set (future desktop per-spawn injection). The same id is both written into the lock and used by the read check, so a process's writes match its own reads. This needs zero desktop diff — Phase 2 is purebuzz-acpRust, headlessly testable with twobuzz-acpprocesses.acquire(agent_pubkey): take an exclusiveflock, then read-decide-write under the held lock so the read→decide→write TOCTOU window is closed. The lock is takeable iff free (absent / empty / malformed), already ours, the owner's pid is dead (kill(pid, None)→ESRCH), or the owner'sclaimed_atis stale (older thanSTALE_CLAIM= 10s = 2× the refresh). On success it writes{instance_id, pid, claimed_at}; a live, fresh foreign owner → observe.leader_refreshtick callsacquirebeforerefresh, so a leader re-stamps its ownclaimed_atevery tick (its claim is always fresher than the bound), and a survivor re-reads a dead-or-stale lock and takes over within 5s.releaseempties the lock file (never unlinks → no detached-inode race) only if we still own it, so a co-located sibling takes over immediately and a successor's claim is never stomped.Dead-pid vs recycled-pid. The
ESRCHfast path catches the common crash-without-release case instantly. Theclaimed_atstaleness arm is the backstop for the rarer case where the OS recycles the crashed leader's pid onto an unrelated live process before any survivor ticks: the recycled pid reads as alive, but with no live leader re-stamping the claim it ages past the bound and becomes takeable — closing what would otherwise be a permanent leaderless wedge. The only owner the staleness arm can evict is a live-pid process that has missed two consecutive refresh ticks, which requires a full-runtime stall (turns run on a spawned pool, so a normal heavy turn never blocks the refresh) — in which state the leader is non-functional and takeover is correct, self-healing to a transient duplicate at worst.Portability
The writer is
#[cfg(unix)](flockis Unix);#[cfg(not(unix))]is an always-leader no-op acquire/release, mirroring the existingkill_process_groupcfg pattern. Desktop targets macOS/Linux; Windows is not a leader-election target. Theflockfeature is added to the existingcfg(unix)nixdependency;claimed_atparsing reuses the already-presentchronoworkspace dep (Cargo.lock unchanged).Election identity
The lock keys on a per-process election identity behind a single named constant (
ELECTION_ID_ENV=BUZZ_INSTANCE_ELECTION_ID). It is deliberately not the Tauri bundle identifier (BUZZ_MANAGED_AGENT): the bundle id collides across same-class windows (DMG + dev sharexyz.block.buzz.app(.dev); a worktree falls back to the shared dev id ifswift generate-dev-iconfails), which would let two windows both match the lock and both lead. Reaper identity partitions by app-class; election identity must be unique per process.Scope boundaries
flock-guarded writer (acquire/release), dead-pid + stale-claim failover, auto-claim-on-launch wired into the harness lifecycle.queue.push()gating is left untouched so non-leader UI still renders.What changed
docs/nips/NIP-LE.md(new) — the NIP-LE spec: the exactly-one-prompter invariant (surfaces a/b/c), the local-filesystem lock contract, claim semantics (auto-on-launch acquire-if-unowned plus explicit re-claim), theflockTOCTOU guard, and dead-pid + stale-claim failover.crates/buzz-acp/src/leader.rs— theLeaderChecktrait andFileLeaderCheck: the read side (above), plus the writer half —acquire/release,lock_is_takeable(ours || dead-pid || stale),pid_is_alive(ESRCH-only-dead;EPERM→ alive so a live leader is never stolen from),claim_is_stale, andfrom_env_or_mint(honor env if set, else mint{pid}-{nanos}).crates/buzz-acp/src/pool.rs—PromptContextcarriesleader: Arc<dyn LeaderCheck>.crates/buzz-acp/src/lib.rs— constructsfrom_env_or_mint; auto-acquireon startup; re-acquirebeforerefreshin the 5sleader_refreshtick;releasefirst in shutdown teardown; gates (a)dispatch_pending, (b) the queue-push👀reaction_add, (c)dispatch_heartbeat.Tests
Twenty-one unit tests, all behavior-focused.
Reader (6): absent → leader, self-owned → leader, foreign-owned → observer, malformed → fail-safe leader, no-election-id → leader even with a foreign lock, and
refreshflips status when the lock changes. Ids are opaque per-window strings (window-a/window-b) so the suite can't enshrinebundle-id == election-id.Writer (11): unowned-acquire writes self, creates the lock dir when absent, two-writer race yields exactly one winner, dead-pid lock is taken over, live+fresh foreign lock blocks acquire, recycled-pid-with-stale-claim is taken over (the failover-stall reproduction — fails without the staleness arm), reacquire-own is idempotent, release frees the lock for another writer, release does not stomp a successor, no-election-id acquire is a no-op (always leader), and an 8-thread barrier-synced concurrent acquire yields exactly one winner (proving
flockserialization, not just the happy path).Gates (4): a leader/non-leader pair over the heartbeat gate (c) — a non-leader tick claims no agent and never marks in-flight, a companion leader tick claims an idle agent and fires; and a leader/non-leader pair over the dispatch gate (a) — a non-leader promotes nothing queued, a leader promotes one queued event — proving each negative test exercises its gate, not a dead path.
cargo test -p buzz-acp→ 330 passed.cargo fmt --check/cargo clippy --all-targetsclean.Stack: this PR → claim/steal UI (stacked).