fix: improve timed-resume throughput at high concurrency by yaythomas · Pull Request #475 · aws/aws-durable-execution-sdk-python

yaythomas · 2026-06-16T21:31:08Z

Description of changes:
This branch carries two independent concurrency changes to the core SDK, one commit each.

fix: concurrency drain timer resumes outside lock

See #473 for full description with repro code.

Map/parallel branches that suspend on a timer were resumed by the TimerScheduler while it held its lock across a blocking checkpoint, serializing every timed resume behind one network round trip and causing repeated function timeouts at high concurrency. This drains all due resumes under the lock, keeping the heap pop and the PENDING transition atomic, then resubmits outside the lock with one shared state refresh per wave.

AWS validation at 300 branches: 222s / 4 invocations / 3 timeouts before, 38s / 1 invocation / 0 timeouts after. All branches complete in both.

fix: lock operations access in ExecutionState

track_replay iterated ExecutionState.operations under _replay_status_lock while the background checkpoint thread updated the same map under _operations_lock, a dictionary changed size during iteration race. This makes operations private with a read-only snapshot property and reads it under _operations_lock in track_replay and get_execution_operation. Adds a regression test that fails on the old code and passes here.

Verification

.github/scripts/ci-checks.sh passes: 1273 core tests, types clean, lint clean, commit-lint clean on both commits.

Issue #, if available:
Closes #473

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

A map or parallel stays in the current invocation while at least one branch is still running. When one iteration keeps it alive and the other branches wait on short timers (a wait, a wait_for_condition poll, or a step retry backoff), those branches resume in-process as their timers come due. Before this commit, a single timer thread runs those resumes one at a time and holds its lock across a blocking checkpoint that refreshes state. Every resume costs one network round trip under the lock, and every branch trying to register its next wait queues behind it. When many timers come due together the timer thread falls behind, the invocation reaches its function timeout, and the backend reinvokes, so a map that should finish in seconds runs for minutes across several timed-out invocations. Holding the lock across the submit also allowed a latent self-deadlock, where a branch that finished inline reacquired the same lock on the timer thread through its done-callback. This commit holds the lock only long enough to take all due timers off the queue and mark them pending, then releases it and runs one shared refresh for the whole wave before handing the branches back to the worker pool. One round trip now serves the whole wave instead of one per resume, and new waits no longer queue behind a network call. The take and the mark stay atomic so a branch never looks parked while it is about to resume, which would otherwise suspend the whole operation by mistake. If the refresh fails, which happens only when the checkpoint subsystem has already failed and is terminal, the timer thread records that one error and re-raises it from execute() so the platform retries from the last checkpoint. Closes #473

- Make operations private (_operations); add a read-only snapshot property to preserve the public attribute - Read operations under _operations_lock in track_replay and get_execution_operation, closing a dictionary-changed-size race against the concurrent checkpoint update path - Add regression test for concurrent track_replay and update

yaythomas added 2 commits June 16, 2026 21:14

ParidelPooya approved these changes Jun 16, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: improve timed-resume throughput at high concurrency#475

fix: improve timed-resume throughput at high concurrency#475
yaythomas wants to merge 2 commits into
mainfrom
fix/timer-and-operations

yaythomas commented Jun 16, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

yaythomas commented Jun 16, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

fix: concurrency drain timer resumes outside lock

fix: lock operations access in ExecutionState

Verification

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

yaythomas commented Jun 16, 2026 •

edited

Loading