docs(adr): ADR-0037 concurrent durable pause — multi-instance now, token tree later#1706
Merged
Proposes generalizing the engine's single program counter (SuspendedRun.nodeId, one position) to a token/scope-tree runtime model so that a run can pause in more than one place at once. That unblocks durable pause inside parallel branches (parallel approvals) and loop iterations (batch approvals), which ADR-0019 M1 explicitly forbids today.

The token/scope tree is the runtime dual of ADR-0031's structured regions (a region instance is a scope) and the established BPMN-engine model (Camunda/Flowable). Replay-based models (Temporal) are rejected because ADR-0018's open node registry breaks their determinism precondition. The authoring model and the DAG invariant are explicitly preserved (tokens are runtime-only); the single-token degenerate case keeps today's flows bit-for-bit unchanged.

Status: Proposed. This is a decision doc for review before the core-engine work. Phased roadmap: 2a internal refactor → 2b parallel pause → 2c loop pause → 2d cancellation/boundary-event foundation. Adds forward references from ADR-0019 and ADR-0031.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Self-review against the actual engine (flat shared variable Map; suspend = a thrown FlowSuspendSignal unwinding the stack; runRegion bans region pause; traverseNext already fans out unconditional edges via Promise.all) and against Camunda/Zeebe/Flowable/Step Functions found that the first draft adopted the token tree as a data structure while missing the execution model that makes it work:

- recursion + throw-to-suspend must become an explicit token scheduler (a thrown exception cannot pause branch A while branch B keeps running);
- the flat shared variable Map must become hierarchical scope variables (read up the tree), NOT the copy-on-write/merge scheme the first draft invented, which no major engine uses;
- sibling token resumes must serialize per run (Camunda's per-instance optimistic locking), so the concurrency is logical, not parallel execution.

Given that true cost, the revision splits the decision into two tracks and recommends Track A first:

- Track A (now, no engine-core change): multi-instance / aggregating nodes. Parallel approvals as one `approval` node aggregating N decisions, batch approvals as a `map` node. Camunda multi-instance / Step Functions Map shape.
- Track B (deferred, recorded): the general token/scope tree + scheduler + hierarchical scoping, started only when a flow needs arbitrary-position concurrent pause that multi-instance can't express.

Keeps the correct parts (reject Temporal replay per ADR-0018 open registry; authoring model + DAG invariant preserved; single-token = back-compat anchor).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
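For reviewers who want a concrete picture of "hierarchical scope variables (read up the tree)", here is a minimal sketch. The `Scope` class and its method names are hypothetical and are not taken from the engine. It only illustrates the lookup rule: a branch scope sees its ancestors' variables, while siblings do not see each other's locals.

```typescript
// Hypothetical illustration only: `Scope`, `setLocal` and `get` are assumed
// names, not the engine's API.
class Scope {
  private readonly vars = new Map<string, unknown>();

  constructor(readonly parent: Scope | null = null) {}

  // Define a variable in this scope; it shadows any ancestor value.
  setLocal(name: string, value: unknown): void {
    this.vars.set(name, value);
  }

  // Read up the tree: the nearest enclosing scope that defines the name wins.
  get(name: string): unknown {
    for (let s: Scope | null = this; s !== null; s = s.parent) {
      if (s.vars.has(name)) return s.vars.get(name);
    }
    return undefined;
  }
}

// One run-level scope with two sibling branch scopes.
const root = new Scope();
root.setLocal("orderId", "A-1");
const branchA = new Scope(root);
const branchB = new Scope(root);
branchA.setLocal("decision", "approved");

const seenByA = branchA.get("orderId");  // "A-1", inherited from the root scope
const seenByB = branchB.get("decision"); // undefined, sibling locals are isolated
```

Because nothing is copied between scopes, there is no merge step when a branch ends, which is the contrast with the copy-on-write/merge scheme mentioned above.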
Building A2 surfaced a real error in the ADR: it claimed Track A's `map` / multi-instance node needs "no change to the engine's execution model." Examined against the resume/bubble code, that is false for any map that serves batch approval (each item can pause):

- concurrent map needs durable N:1 aggregation + per-parent serialization + completion-ordering handling (part of Track B's hard concurrency, confined to one node);
- sequential map needs resume-INTO-the-node (next item) instead of the engine's resume-past-the-node default (the DAG has no back-edge to loop the node);
- only a synchronous, non-pausing map is engine-free, and that does not serve batch approval.

Splits Track A into A1 (aggregating approval, truly free, shipped #1708) and A2 (map node, a bounded, separately-justified engine task, design-first). A1 covers parallel approvals at zero engine cost; A2 is not a free rider on it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
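The "resume-INTO-the-node" requirement for a sequential map can be sketched as below. This is a hedged illustration, not the A2 design: `MapState`, `runSequentialMap`, and `autoApproveSmall` are hypothetical names, and the sketch assumes the node persists a cursor with the suspended run so that a resume re-enters the node at the paused item instead of moving past the node.

```typescript
// Hypothetical illustration only; none of these names come from the engine.
type ItemOutcome<R> = { kind: "done"; result: R } | { kind: "pause" };

interface MapState<R> {
  cursor: number; // index of the item the node is currently on
  results: R[];   // results collected so far
}

type MapStep<R> =
  | { kind: "suspended"; state: MapState<R> }
  | { kind: "completed"; results: R[] };

function runSequentialMap<T, R>(
  items: readonly T[],
  state: MapState<R>,
  runItem: (item: T) => ItemOutcome<R>,
  resumeValue?: R,
): MapStep<R> {
  const results = [...state.results];
  let cursor = state.cursor;
  // On resume, the supplied value answers the item that paused.
  if (resumeValue !== undefined) {
    results.push(resumeValue);
    cursor += 1;
  }
  while (cursor < items.length) {
    const outcome = runItem(items[cursor]);
    if (outcome.kind === "pause") {
      // Suspend inside the node: the cursor is what makes resume re-enter here.
      return { kind: "suspended", state: { cursor, results } };
    }
    results.push(outcome.result);
    cursor += 1;
  }
  return { kind: "completed", results };
}

// Assumed example policy: small amounts auto-approve, large ones wait for a human.
const autoApproveSmall = (amount: number): ItemOutcome<string> =>
  amount < 100 ? { kind: "done", result: "auto" } : { kind: "pause" };

const amounts = [5, 500, 7];
const first = runSequentialMap<number, string>(amounts, { cursor: 0, results: [] }, autoApproveSmall);
const second = first.kind === "suspended"
  ? runSequentialMap<number, string>(amounts, first.state, autoApproveSmall, "approved")
  : first;
```

In this run the first call suspends at index 1, and the resume continues with the third item inside the same node, which is the behavior a resume-past-the-node default cannot provide.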
Summary (revised after self-review)
ADR-0037 records the decision for concurrent durable pause: pausing in more than one place in a run at once (parallel approvals, batch approvals), which the engine forbids today (`SuspendedRun.nodeId` is a single position; suspend is a thrown stack-unwind; `runRegion` bans region pause).

A code + industry self-review (Camunda / Zeebe / Flowable / Step Functions vs. the actual engine) changed the recommendation. The tempting "adopt a token/scope tree" answer is the right long-term model but is a full engine-core rewrite, because the tree only works paired with three things the engine doesn't have:

- recursion + throw-to-suspend → an explicit token scheduler;
- flat shared variable `Map` → hierarchical scope variables (read up the tree), not the copy-on-write/merge scheme the first draft invented, which no major engine uses;
- sibling token resumes serialized per run, so the concurrency is logical, not parallel execution.

Decision: two tracks

- Track A (now, no engine-core change): multi-instance / aggregating nodes. Parallel approvals as one `approval` node aggregating N decisions (`unanimous` / `first_response` / `threshold`), batch approvals as a `map` node. Camunda multi-instance / Step Functions `Map` shape. The run keeps one program counter; the node owns fan-out + aggregation. Covers the demand that motivated the ADR.
- Track B (deferred, recorded): the general token/scope tree + scheduler + hierarchical scoping, started only when a flow needs arbitrary-position concurrent pause that multi-instance can't express.

Kept correct from the first draft: reject Temporal replay (ADR-0018 open node registry breaks determinism); authoring model + DAG invariant preserved (tokens runtime-only); single-token = back-compat anchor.
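To make the aggregation modes named above (unanimous, first_response, threshold) concrete, here is a minimal sketch of how one node could fold N decisions into a single outcome. The mode names come from this PR; the exact semantics, the `Policy` shape, and the `aggregate` function are assumptions made for illustration and may differ from the shipped `approval` node.

```typescript
// Hypothetical illustration only: the semantics below are assumed, not the
// shipped behavior of the `approval` node.
type Decision = "approved" | "rejected";
type Aggregate = Decision | "pending";

type Policy =
  | { kind: "unanimous" }
  | { kind: "first_response" }
  | { kind: "threshold"; approvalsNeeded: number };

function aggregate(
  policy: Policy,
  totalApprovers: number,
  decisions: readonly Decision[],
): Aggregate {
  const approvals = decisions.filter((d) => d === "approved").length;
  const rejections = decisions.length - approvals;
  switch (policy.kind) {
    case "first_response":
      // Whoever answers first decides for everyone.
      return decisions.length > 0 ? decisions[0] : "pending";
    case "unanimous":
      // Any rejection ends it; approval needs every approver.
      if (rejections > 0) return "rejected";
      return approvals === totalApprovers ? "approved" : "pending";
    case "threshold":
      if (approvals >= policy.approvalsNeeded) return "approved";
      // Reject as soon as the remaining approvers cannot reach the threshold.
      return totalApprovers - rejections < policy.approvalsNeeded
        ? "rejected"
        : "pending";
  }
}

// Two of three approvers have answered under a unanimous policy.
const stillWaiting = aggregate({ kind: "unanimous" }, 3, ["approved", "approved"]);
// → "pending"
```

While the result is "pending" the run stays suspended at the single `approval` node, so the run still has one program counter and the fan-out lives entirely inside the node.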
Status
Proposed — decision doc for review before any engine work. No code changes.
🤖 Generated with Claude Code