fix: improve timed-resume throughput at high concurrency#475
Open
yaythomas wants to merge 2 commits into
Open
Conversation
A map or parallel stays in the current invocation while at least one branch is still running. When one iteration keeps it alive and the other branches wait on short timers (a wait, a wait_for_condition poll, or a step retry backoff), those branches resume in-process as their timers come due. Before this commit, a single timer thread runs those resumes one at a time and holds its lock across a blocking checkpoint that refreshes state. Every resume costs one network round trip under the lock, and every branch trying to register its next wait queues behind it. When many timers come due together the timer thread falls behind, the invocation reaches its function timeout, and the backend reinvokes, so a map that should finish in seconds runs for minutes across several timed-out invocations. Holding the lock across the submit also allowed a latent self-deadlock, where a branch that finished inline reacquired the same lock on the timer thread through its done-callback. This commit holds the lock only long enough to take all due timers off the queue and mark them pending, then releases it and runs one shared refresh for the whole wave before handing the branches back to the worker pool. One round trip now serves the whole wave instead of one per resume, and new waits no longer queue behind a network call. The take and the mark stay atomic so a branch never looks parked while it is about to resume, which would otherwise suspend the whole operation by mistake. If the refresh fails, which happens only when the checkpoint subsystem has already failed and is terminal, the timer thread records that one error and re-raises it from execute() so the platform retries from the last checkpoint. Closes #473
- Make operations private (_operations); add a read-only snapshot property to preserve the public attribute - Read operations under _operations_lock in track_replay and get_execution_operation, closing a dictionary-changed-size race against the concurrent checkpoint update path - Add regression test for concurrent track_replay and update
ParidelPooya
approved these changes
Jun 16, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Description of changes:
This branch carries two independent concurrency changes to the core SDK, one commit each.
fix: concurrency drain timer resumes outside lock
See #473 for full description with repro code.
Map/parallel branches that suspend on a timer were resumed by the
TimerSchedulerwhile it held its lock across a blocking checkpoint, serializing every timed resume behind one network round trip and causing repeated function timeouts at high concurrency. This drains all due resumes under the lock, keeping the heap pop and the PENDING transition atomic, then resubmits outside the lock with one shared state refresh per wave.AWS validation at 300 branches: 222s / 4 invocations / 3 timeouts before, 38s / 1 invocation / 0 timeouts after. All branches complete in both.
fix: lock operations access in ExecutionState
track_replayiteratedExecutionState.operationsunder_replay_status_lockwhile the background checkpoint thread updated the same map under_operations_lock, adictionary changed size during iterationrace. This makesoperationsprivate with a read-only snapshot property and reads it under_operations_lockintrack_replayandget_execution_operation. Adds a regression test that fails on the old code and passes here.Verification
.github/scripts/ci-checks.shpasses: 1273 core tests, types clean, lint clean, commit-lint clean on both commits.Issue #, if available:
Closes #473
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.