docs(adr): ADR-0037 concurrent durable pause — multi-instance now, token tree later #1706

Merged
os-zhuang merged 3 commits into main from adr-0037-token-scope-tree
Jun 11, 2026

@os-zhuang os-zhuang commented Jun 11, 2026


Summary (revised after self-review)

ADR-0037 records the decision for concurrent durable pause — pausing in more than one place in a run at once (parallel approvals, batch approvals), which the engine forbids today (SuspendedRun.nodeId is a single position; suspend is a thrown stack-unwind; runRegion bans region pause).

A code + industry self-review (Camunda / Zeebe / Flowable / Step Functions vs. the actual engine) changed the recommendation. The tempting "adopt a token/scope tree" answer is the right long-term model but is a full engine-core rewrite, because the tree only works paired with three things the engine doesn't have:

  1. recursion + throw-to-suspend → an explicit token scheduler (a thrown exception can't pause branch A while branch B keeps running);
  2. the flat shared variable Map → hierarchical scope variables (read up the tree), not the copy-on-write/merge scheme the first draft invented, which no major engine uses;
  3. sibling resumes → per-run serialization (Camunda's per-instance locking) — so the concurrency is logical, not parallel execution.
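
To make point 2 concrete, here is a minimal TypeScript sketch of hierarchical scope variables with "read up the tree" lookup. The `Scope` class and its `setLocal` / `get` methods are hypothetical names chosen for illustration; they are not the engine's API, and write semantics beyond a local set are deliberately left out.

```typescript
// Hypothetical sketch of hierarchical scope variables; names are not from the engine.
class Scope {
  private readonly vars = new Map<string, unknown>();

  // A root scope (the run) has no parent; a branch or region scope has one.
  constructor(private readonly parent?: Scope) {}

  // Declare or overwrite a variable that is local to this scope.
  setLocal(name: string, value: unknown): void {
    this.vars.set(name, value);
  }

  // Read up the tree: the nearest enclosing scope that defines the name wins.
  get(name: string): unknown {
    if (this.vars.has(name)) return this.vars.get(name);
    return this.parent?.get(name);
  }
}

// Two sibling branch scopes under one run scope.
const runScope = new Scope();
runScope.setLocal("requester", "alice");
const branchA = new Scope(runScope);
const branchB = new Scope(runScope);
branchA.setLocal("decision", "approved");

const inherited = branchA.get("requester"); // resolved from the run scope
const isolated = branchB.get("decision"); // undefined: siblings do not see each other
```

Under this model sibling branches share whatever the parent defines but never observe each other's locals, so no merge step is needed when a branch completes.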

Decision: two tracks

  • Track A (now, no engine-core change): multi-instance / aggregating nodes — parallel approvals as one approval node aggregating N decisions (unanimous/first_response/threshold), batch approvals as a map node. Camunda multi-instance / Step Functions Map shape. The run keeps one program counter; the node owns fan-out + aggregation. Covers the demand that motivated the ADR.
  • Track B (deferred, recorded): the general token/scope tree + scheduler + hierarchical scoping, started only when a flow needs arbitrary-position concurrent pause that multi-instance can't express.
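
Track A's aggregating approval can be sketched as a pure function over the decisions collected so far. The policy names come from the description above; their exact semantics are an assumption of this sketch (for example, that `unanimous` rejects on the first rejection and that `threshold` rejects once the threshold is unreachable), and `Policy`, `Outcome`, and `aggregate` are hypothetical names.

```typescript
type Decision = "approved" | "rejected";
type Outcome = Decision | "pending";
type Policy =
  | { kind: "unanimous" }
  | { kind: "first_response" }
  | { kind: "threshold"; approvals: number };

// Assumed semantics: the ADR names these policies; this sketch guesses their rules.
function aggregate(policy: Policy, total: number, decisions: Decision[]): Outcome {
  const approved = decisions.filter((d) => d === "approved").length;
  const rejected = decisions.length - approved;
  switch (policy.kind) {
    case "first_response": {
      // The first decision to arrive settles the node.
      const first = decisions[0];
      return first ?? "pending";
    }
    case "unanimous":
      // Any rejection settles the node; otherwise wait for every approver.
      if (rejected > 0) return "rejected";
      return approved === total ? "approved" : "pending";
    case "threshold":
      if (approved >= policy.approvals) return "approved";
      // Rejected once the remaining approvers can no longer reach the threshold.
      return total - rejected < policy.approvals ? "rejected" : "pending";
  }
}

// 2-of-3 threshold: one rejection and one approval still leaves the node waiting.
const stillWaiting = aggregate({ kind: "threshold", approvals: 2 }, 3, ["rejected", "approved"]);
```

Because the node stays suspended until `aggregate` returns something other than `"pending"`, the run keeps a single program counter: each incoming decision re-enters the same node rather than resuming a separate branch.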

Kept correct from the first draft: reject Temporal replay (ADR-0018 open node registry breaks determinism); authoring model + DAG invariant preserved (tokens runtime-only); single-token = back-compat anchor.

Status

Proposed — decision doc for review before any engine work. No code changes.

🤖 Generated with Claude Code

Proposes generalizing the engine's single-program-counter (SuspendedRun.nodeId,
one position) to a token/scope-tree runtime model so a run can pause in more
than one place at once — unblocking durable pause inside parallel branches
(parallel approvals) and loop iterations (batch approvals), which ADR-0019 M1
explicitly forbids today.

The token/scope tree is the runtime dual of ADR-0031's structured regions
(a region instance is a scope) and the established BPMN-engine model
(Camunda/Flowable); replay-based models (Temporal) are rejected because
ADR-0018's open node registry breaks their determinism precondition. Authoring
model and DAG invariant are explicitly preserved (tokens are runtime-only); the
single-token degenerate case keeps today's flows bit-for-bit unchanged.

Status: Proposed — decision doc for review before the core-engine work. Phased
roadmap (2a internal refactor → 2b parallel pause → 2c loop pause → 2d
cancellation/boundary-event foundation). Adds forward-references from ADR-0019
and ADR-0031.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@github-actions github-actions Bot added documentation Improvements or additions to documentation size/m labels Jun 11, 2026
Self-review against the actual engine (flat shared variable Map; suspend = a
thrown FlowSuspendSignal unwinding the stack; runRegion bans region pause;
traverseNext already fans out unconditional edges via Promise.all) and against
Camunda/Zeebe/Flowable/Step Functions found the first draft adopted the token
tree as a data structure while missing the execution model that makes it work:

- recursion + throw-to-suspend must become an explicit token scheduler (a
  thrown exception cannot pause branch A while branch B keeps running);
- the flat shared variable Map must become hierarchical scope variables (read
  up the tree) — NOT the copy-on-write/merge scheme the first draft invented,
  which no major engine uses;
- sibling token resumes must serialize per run (Camunda's per-instance
  optimistic locking) — so the concurrency is logical, not parallel execution.
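
The third point can be illustrated with a toy version check in TypeScript. This is only a sketch of optimistic locking in general, using a hypothetical in-memory `RunStore`; it is not the engine's persistence layer and not Camunda's implementation.

```typescript
// Hypothetical in-memory store; a real engine would use a versioned row or document.
interface RunRecord {
  version: number;
  pendingTokens: Set<string>;
}

class RunStore {
  private readonly runs = new Map<string, RunRecord>();

  create(runId: string, tokens: string[]): void {
    this.runs.set(runId, { version: 0, pendingTokens: new Set(tokens) });
  }

  // Snapshot what a resume handler would load before doing its work.
  load(runId: string): { version: number; pendingTokens: string[] } {
    const rec = this.mustGet(runId);
    return { version: rec.version, pendingTokens: Array.from(rec.pendingTokens) };
  }

  // The commit succeeds only if nobody else committed since `expectedVersion` was read.
  commitResume(runId: string, expectedVersion: number, tokenId: string): boolean {
    const rec = this.mustGet(runId);
    if (rec.version !== expectedVersion) return false; // stale: caller reloads and retries
    rec.pendingTokens.delete(tokenId);
    rec.version += 1;
    return true;
  }

  private mustGet(runId: string): RunRecord {
    const rec = this.runs.get(runId);
    if (!rec) throw new Error(`unknown run: ${runId}`);
    return rec;
  }
}

// Two sibling resumes load the same version; only the first commit wins.
const store = new RunStore();
store.create("run-1", ["a", "b"]);
const seenByA = store.load("run-1");
const seenByB = store.load("run-1");
const aCommitted = store.commitResume("run-1", seenByA.version, "a");
const bFirstTry = store.commitResume("run-1", seenByB.version, "b"); // stale version
const bRetry = store.commitResume("run-1", store.load("run-1").version, "b");
```

The losing resume reloads and retries, so the two siblings are applied one after the other even though they arrived at the same time.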

Given that true cost, the revision splits the decision into two tracks and
recommends Track A first:
- Track A (now, no engine-core change): multi-instance / aggregating nodes —
  parallel approvals as one `approval` node aggregating N decisions, batch
  approvals as a `map` node. Camunda multi-instance / Step Functions Map shape.
- Track B (deferred, recorded): the general token/scope tree + scheduler +
  hierarchical scoping, started only when a flow needs arbitrary-position
  concurrent pause that multi-instance can't express.

Keeps the correct parts (reject Temporal replay per ADR-0018 open registry;
authoring model + DAG invariant preserved; single-token = back-compat anchor).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@os-zhuang os-zhuang changed the title from "docs(adr): ADR-0037 token/scope-tree execution (proposed)" to "docs(adr): ADR-0037 concurrent durable pause — multi-instance now, token tree later" Jun 11, 2026
Building A2 surfaced a real error in the ADR: it claimed Track A's `map` /
multi-instance node needs "no change to the engine's execution model." Examined
against the resume/bubble code, that is false for any map that serves batch
approval (each item can pause):

- concurrent map needs durable N:1 aggregation + per-parent serialization +
  completion-ordering handling (part of Track B's hard concurrency, confined to
  one node);
- sequential map needs resume-INTO-the-node (next item) instead of the engine's
  resume-past-the-node default (the DAG has no back-edge to loop the node);
- only a synchronous, non-pausing map is engine-free, and that does not serve
  batch approval.
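
The sequential case can be sketched as a node that persists a cursor and is resumed into, rather than past. All names below (`MapState`, `StepResult`, `startMap`, `resumeMap`) are hypothetical; the sketch assumes the engine could hand a stored state back to the same node on resume, which is exactly the capability the commit message says is missing today.

```typescript
// Hypothetical persisted state for a sequential, pausing map node.
interface MapState<T, R> {
  items: T[];
  cursor: number; // index of the item currently awaiting a result
  results: R[];
}

type StepResult<T, R> =
  | { status: "suspended"; state: MapState<T, R>; waitingOn: T }
  | { status: "completed"; results: R[] };

// Either pause on the item under the cursor or finish when the items run out.
function advance<T, R>(state: MapState<T, R>): StepResult<T, R> {
  if (state.cursor >= state.items.length) {
    return { status: "completed", results: state.results };
  }
  return { status: "suspended", state, waitingOn: state.items[state.cursor] as T };
}

function startMap<T, R>(items: T[]): StepResult<T, R> {
  return advance<T, R>({ items, cursor: 0, results: [] });
}

// Resume INTO the node: record the result, move the cursor, pause on the next item.
function resumeMap<T, R>(state: MapState<T, R>, result: R): StepResult<T, R> {
  return advance<T, R>({
    items: state.items,
    cursor: state.cursor + 1,
    results: [...state.results, result],
  });
}

// A batch of two approvals, decided one at a time.
const first = startMap<string, boolean>(["inv-1", "inv-2"]);
const second = first.status === "suspended" ? resumeMap(first.state, true) : first;
const done = second.status === "suspended" ? resumeMap(second.state, false) : second;
```

The loop lives in the persisted cursor rather than in a graph back-edge, so the DAG invariant is untouched; the cost is that the resume path must re-enter the node instead of continuing to its successors.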

Splits Track A into A1 (aggregating approval — truly free, shipped #1708) and
A2 (map node — a bounded, separately-justified engine task, design-first). A1
covers parallel approvals at zero engine cost; A2 is not a free rider on it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@os-zhuang os-zhuang merged commit fb49b87 into main Jun 11, 2026
12 checks passed
@os-zhuang os-zhuang deleted the adr-0037-token-scope-tree branch June 11, 2026 04:58