Skip to content

fix: improve timed-resume throughput at high concurrency#475

Open
yaythomas wants to merge 2 commits into
mainfrom
fix/timer-and-operations
Open

fix: improve timed-resume throughput at high concurrency#475
yaythomas wants to merge 2 commits into
mainfrom
fix/timer-and-operations

Conversation

@yaythomas

@yaythomas yaythomas commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Description of changes:
This branch carries two independent concurrency changes to the core SDK, one commit each.

fix: concurrency drain timer resumes outside lock

See #473 for full description with repro code.

Map/parallel branches that suspend on a timer were resumed by the TimerScheduler while it held its lock across a blocking checkpoint, serializing every timed resume behind one network round trip and causing repeated function timeouts at high concurrency. This drains all due resumes under the lock, keeping the heap pop and the PENDING transition atomic, then resubmits outside the lock with one shared state refresh per wave.

AWS validation at 300 branches: 222s / 4 invocations / 3 timeouts before, 38s / 1 invocation / 0 timeouts after. All branches complete in both.

fix: lock operations access in ExecutionState

track_replay iterated ExecutionState.operations under _replay_status_lock while the background checkpoint thread updated the same map under _operations_lock, a dictionary changed size during iteration race. This makes operations private with a read-only snapshot property and reads it under _operations_lock in track_replay and get_execution_operation. Adds a regression test that fails on the old code and passes here.

Verification

.github/scripts/ci-checks.sh passes: 1273 core tests, types clean, lint clean, commit-lint clean on both commits.

Issue #, if available:
Closes #473

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

A map or parallel stays in the current invocation while at least one
branch is still running. When one iteration keeps it alive and the
other branches wait on short timers (a wait, a wait_for_condition poll,
or a step retry backoff), those branches resume in-process as their
timers come due.

Before this commit, a single timer thread runs those resumes one at a
time and holds its lock across a blocking checkpoint that refreshes
state. Every resume costs one network round trip under the lock, and
every branch trying to register its next wait queues behind it. When
many timers come due together the timer thread falls behind, the
invocation reaches its function timeout, and the backend reinvokes, so
a map that should finish in seconds runs for minutes across several
timed-out invocations. Holding the lock across the submit also allowed
a latent self-deadlock, where a branch that finished inline reacquired
the same lock on the timer thread through its done-callback.

This commit holds the lock only long enough to take all due timers off
the queue and mark them pending, then releases it and runs one shared
refresh for the whole wave before handing the branches back to the
worker pool. One round trip now serves the whole wave instead of one
per resume, and new waits no longer queue behind a network call. The
take and the mark stay atomic so a branch never looks parked while it
is about to resume, which would otherwise suspend the whole operation
by mistake. If the refresh fails, which happens only when the
checkpoint subsystem has already failed and is terminal, the timer
thread records that one error and re-raises it from execute() so the
platform retries from the last checkpoint.

Closes #473
- Make operations private (_operations); add a read-only snapshot
  property to preserve the public attribute
- Read operations under _operations_lock in track_replay and
  get_execution_operation, closing a dictionary-changed-size race
  against the concurrent checkpoint update path
- Add regression test for concurrent track_replay and update
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Improve map/parallel timed-resume throughput at high concurrency

2 participants