Skip to content

Graph network resilience: node retry + resumable failures#5

Merged
senamakel merged 4 commits into
mainfrom
feat/graph-resilience-resumable-failures
Jul 2, 2026
Merged

Graph network resilience: node retry + resumable failures#5
senamakel merged 4 commits into
mainfrom
feat/graph-resilience-resumable-failures

Conversation

@senamakel

@senamakel senamakel commented Jul 2, 2026

Copy link
Copy Markdown
Member

Summary

Makes the graph runtime and model calls survive transient network problems and be restartable / continuable on user feedback after a hard failure. The primitives mostly existed (RetryPolicy, FallbackPolicy, checkpointing, interrupt/resume) — this wires them into the graph's node-execution and failure paths.

Scope decisions (confirmed up front): full story, with a resumable-abort default on failure (not an automatic human interrupt).

What changed

1. Model backoff actually sleeps now (opt-in)

  • RetryPolicy gains a backoff_sleep flag + with_backoff_sleep(true) builder + async sleep_backoff(attempt) helper. The agent-loop retry loop and RetryMiddleware previously computed backoff and discarded it for test determinism — they now wait a real, growing delay when opted in. Default stays sleep-free so existing deterministic tests are unaffected.

2. Graph node-level retry

  • CompiledGraph::with_node_retry(RetryPolicy): a node whose handler fails with a transient (Model/Tool) error is re-run from its start up to the attempt cap, emitting GraphEvent::NodeRetryScheduled. Both the sequential and parallel runners drive handlers through one shared run_node_with_retry helper (which also applies the per-node timeout).

3. Resumable failures (restart / continue on feedback)

  • When a handler fails past the retry budget (or non-retryably) on a checkpointed thread, the executor folds completed-branch progress into committed state and persists a failure-boundary checkpoint (failed node scheduled as next_nodes, successful branches as completed_tasks, error + failed node in metadata). The Failed status carries that checkpoint id.
  • CompiledGraph::retry(thread) restarts from the failure checkpoint; editing state via update_state first lets a human continue on feedback. Non-checkpointed runs abort exactly as before.

Deliberately not done: an unconditional initial (input) checkpoint — it would break existing checkpoint-count assertions, and the failure checkpoint already makes a caught step-1 failure resumable. Only an uncaught process crash at step 1 stays unrecoverable (flagged as future work).

API / behavior changes

  • New public: RetryPolicy::{backoff_sleep, with_backoff_sleep, sleep_backoff}, CompiledGraph::{with_node_retry, retry}, GraphEvent::NodeRetryScheduled.
  • Behavior: on a checkpointed thread a node failure now leaves a resumable checkpoint and a Failed status carrying its id (previously the run aborted with no checkpoint). Non-checkpointed behavior is unchanged.

Commands run locally

  • cargo fmt --check — clean
  • cargo clippy --all-targets -- -D warnings — clean
  • cargo test983 passed, 0 failed
  • cargo run --example resilient_graph — demonstrates both mechanisms end to end

Docs / examples

  • New examples/resilient_graph.rs
  • Rewrote docs/modules/graph/fault-tolerance.md; added resilience sections to the graph module docs + wiki/Graph-Runtime.md; listed the example in README.md / wiki/Examples.md

https://claude.ai/code/session_01NRL2bim9Gz3UN3SRtpeXEX

Summary by CodeRabbit

  • New Features

    • Added a new resilience example showing retryable node failures and restart after an outage clears.
    • Graph runs can now retry failed nodes and resume from a saved failure checkpoint.
    • Retry settings can optionally wait between attempts with backoff sleep.
  • Documentation

    • Expanded fault-tolerance docs to explain retry, resume behavior, checkpoint recovery, and state edits before retry.
  • Bug Fixes

    • Preserved completed work during partial failures and improved event reporting for scheduled retries.

senamakel added 4 commits July 2, 2026 12:45
RetryPolicy gains a backoff_sleep flag (off by default) and a sleep_backoff
helper. The agent-loop retry loop and RetryMiddleware now sleep for the
computed backoff when opted in via with_backoff_sleep, so transient
provider/network failures are retried after a real, growing delay instead of
back-to-back. Default stays sleep-free to keep tests deterministic.

Claude-Session: https://claude.ai/code/session_01NRL2bim9Gz3UN3SRtpeXEX
Add opt-in CompiledGraph::with_node_retry(RetryPolicy): each node handler is
re-run from its start on a retryable (model/tool) error, up to the policy cap,
emitting GraphEvent::NodeRetryScheduled and sleeping the opt-in backoff. Both
the sequential and parallel runners drive handlers through the shared
run_node_with_retry helper (which also applies the per-node timeout).

When a handler fails after retries (or a non-retryable error hits), the runner
now hands back partial progress instead of discarding it: completed branches'
updates are folded into committed state and a resumable failure-boundary
checkpoint is persisted (failed node scheduled as next_nodes, successful
branches as completed_tasks, error + failed node in metadata). The Failed
status carries that checkpoint id, so a crashed/exhausted run can be resumed or
continued rather than lost. Non-checkpointed runs abort exactly as before.

Claude-Session: https://claude.ai/code/session_01NRL2bim9Gz3UN3SRtpeXEX
Add CompiledGraph::retry(thread) — shorthand for resume with an empty command —
to re-run a failed run's failure-boundary checkpoint, and document the
inspect/update_state/retry feedback loop for continuing on user input.

Tests: node retry recovers a transient failure (one NodeRetryScheduled per
blip); exhausted retries leave a resumable checkpoint that retry() completes; a
failure with no retry policy is still resumable (resumable-abort default);
edit-state-then-retry runs against edited state; parallel partial progress is
preserved (successful branch folded, only failed branch rescheduled). Plus a
paused-time test that backoff sleep fires only when opted in.

Claude-Session: https://claude.ai/code/session_01NRL2bim9Gz3UN3SRtpeXEX
Add examples/resilient_graph.rs demonstrating node-level retry absorbing a
transient fetch blip and a resumable failure checkpoint that retry() restarts
after an outage clears. Rewrite docs/modules/graph/fault-tolerance.md to
document the implemented behavior, add a resilience section to the graph module
docs and wiki, and list the example in README/Examples.

Claude-Session: https://claude.ai/code/session_01NRL2bim9Gz3UN3SRtpeXEX
@coderabbitai

coderabbitai Bot commented Jul 2, 2026

Copy link
Copy Markdown

Review Change Stack

📝 Walkthrough

Walkthrough

Adds opt-in node-level retry (CompiledGraph::with_node_retry) and resumable failure-boundary checkpointing to the graph executor, a new NodeRetryScheduled event, opt-in backoff sleeping in RetryPolicy used by agent loop and retry middleware, a new example, and rewritten fault-tolerance documentation.

Changes

Node retry and resumable failure checkpoints

Layer / File(s) Summary
StepFailure/StepRun types and event
src/graph/compiled/mod.rs, src/graph/compiled/types.rs, src/graph/stream/types.rs
Adds StepFailure, failure field on StepRun, node_retry field on CompiledGraph, and GraphEvent::NodeRetryScheduled variant with kind/step handling.
RetryPolicy backoff_sleep opt-in
src/harness/retry/types.rs, src/harness/retry/mod.rs, src/harness/retry/test.rs, src/harness/agent_loop/mod.rs, src/harness/middleware/library/mod.rs
Adds backoff_sleep flag, with_backoff_sleep builder, and sleep_backoff helper; agent loop and retry middleware now await sleep_backoff during retries.
Node retry execution wrapper
src/graph/compiled/mod.rs
Adds with_node_retry builder and run_node_with_retry wrapper that retries retryable errors with backoff and timeout per attempt.
Failure-boundary checkpoint persistence
src/graph/compiled/mod.rs
Updates fail_run to accept checkpoint id, adds persist_failure_checkpoint, and updates sequential/parallel superstep paths to capture failures and persist resumable checkpoints instead of aborting immediately.
Tests
src/graph/compiled/test.rs
Adds flaky_graph helper and tests covering retry recovery, exhausted-retry checkpoints, no-policy resumability, edited-state retry, and parallel partial progress.
Example and docs
examples/resilient_graph.rs, README.md, docs/modules/graph/fault-tolerance.md
Adds a resilient_graph example and rewrites fault-tolerance docs to describe retry, resumable failures, restart flow, and error taxonomy.

Estimated code review effort: 4 (Complex) | ~60 minutes

Sequence Diagram(s)

sequenceDiagram
  participant Caller
  participant CompiledGraph
  participant NodeHandler
  participant Checkpointer

  Caller->>CompiledGraph: run(thread)
  CompiledGraph->>NodeHandler: run_node_with_retry
  NodeHandler-->>CompiledGraph: retryable error
  CompiledGraph->>CompiledGraph: emit NodeRetryScheduled
  CompiledGraph->>NodeHandler: retry attempt
  NodeHandler-->>CompiledGraph: hard failure
  CompiledGraph->>Checkpointer: persist_failure_checkpoint(next_nodes)
  CompiledGraph->>Caller: Failed status with checkpoint id
  Caller->>CompiledGraph: retry(thread)
  CompiledGraph->>Checkpointer: load checkpoint
  CompiledGraph->>NodeHandler: re-run failed node
  NodeHandler-->>CompiledGraph: success
  CompiledGraph->>Caller: Completed status
Loading

Related Issues: None found.
Related PRs: None found.
Suggested labels: enhancement, graph, documentation
Suggested reviewers: None found.

Poem

A node fell down, then tried again,
With backoff naps and checkpoint pen,
Commit held steady, fetch bounced back,
Retry hopped in along the track,
🐇 Hooray — the graph resumes its track!

🚥 Pre-merge checks | ✅ 5
✅ Passed checks (5 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title clearly summarizes the main feature: graph resilience via node retry and resumable failures.
Docstring Coverage ✅ Passed No functions found in the changed files to evaluate docstring coverage. Skipping docstring coverage check.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch

Comment @coderabbitai help to get the list of available commands.

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 17cec870c7

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread src/graph/compiled/mod.rs
completed: Vec<NodeId>,
/// Nodes to re-run on resume: the failed node first, then the not-yet-folded
/// members of the step (recorded as the checkpoint's next nodes).
pending: Vec<NodeId>,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Preserve Send payloads in failure checkpoints

When a failed activation came from Command::send, storing only NodeIds in the failure checkpoint drops the activation's send_arg; resume_from later rebuilds pending work with Activation::node, so retrying a failed map/fan-out worker reruns it with ctx.send_arg == None instead of the original payload. This breaks resumable failures for Send-based graphs, including repeated activations of the same node that require distinct payloads.

Useful? React with 👍 / 👎.

Comment thread src/harness/retry/mod.rs
if !self.backoff_sleep {
return;
}
let backoff = self.backoff_for_attempt(attempt);

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Honor jitter when sleeping for backoff

With RetryPolicy::with_jitter(true).with_backoff_sleep(true), this calls backoff_for_attempt, which delegates to backoff_for_attempt_with(attempt, 0.0); because jitter multiplies by that value, the computed backoff is always zero and the retry loop never sleeps. In production configurations that enable jitter to avoid retry storms, retries will run back-to-back despite opting into real backoff sleep.

Useful? React with 👍 / 👎.

@senamakel senamakel merged commit 7c29ed7 into main Jul 2, 2026
2 of 3 checks passed

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
src/harness/agent_loop/mod.rs (1)

763-765: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Stale docstring contradicts the new opt-in sleep behavior.

The doc comment on invoke_model_resolving still states the backoff duration is "intentionally not slept on," but the code below it (816-819) now calls self.policy.retry.sleep_backoff(attempt).await, which does sleep when RetryPolicy::with_backoff_sleep(true) is set. This directly contradicts the correctly-updated module-level doc (lines 53-58).

📝 Proposed fix
-    /// Retries are governed by [`RunPolicy::retry`][crate::harness::runtime::RunPolicy]
-    /// and apply only to retryable errors (see [`is_retryable`]); each scheduled
-    /// retry emits [`AgentEvent::RetryScheduled`]. When retries are exhausted
-    /// (or the error is non-retryable) and a [`crate::harness::retry::FallbackPolicy`]
-    /// is configured, the next model in the chain is tried. The computed backoff
-    /// duration is intentionally not slept on (see the module docs).
+    /// Retries are governed by [`RunPolicy::retry`][crate::harness::runtime::RunPolicy]
+    /// and apply only to retryable errors (see [`is_retryable`]); each scheduled
+    /// retry emits [`AgentEvent::RetryScheduled`]. When retries are exhausted
+    /// (or the error is non-retryable) and a [`crate::harness::retry::FallbackPolicy`]
+    /// is configured, the next model in the chain is tried. The backoff is slept
+    /// only when the policy opts in via
+    /// [`RetryPolicy::with_backoff_sleep`] (see the module docs).

Also applies to: 816-819

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/harness/agent_loop/mod.rs` around lines 763 - 765, The doc comment on
invoke_model_resolving is stale and contradicts the new opt-in backoff sleep
behavior; update that method-level documentation to match the actual retry flow.
Specifically, revise the note about the computed backoff duration so it no
longer says it is intentionally not slept on, and instead reflects that
self.policy.retry.sleep_backoff(attempt).await may sleep when
RetryPolicy::with_backoff_sleep(true) is enabled. Keep the wording consistent
with the module-level docs and the retry/backoff logic in
invoke_model_resolving.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@src/graph/compiled/mod.rs`:
- Around line 987-1033: In the failure-handling path in compiled/mod.rs around
the `failure` branch, `persist_failure_checkpoint(...)` currently uses `?`,
which can replace the original node error and skip `fail_run` entirely if
checkpoint persistence fails. Update this path so `fail_run` is always invoked
and the original `error` from `StepFailure` is preserved as the terminal
failure, while any checkpoint-write error is handled separately (for example by
logging or attaching it without overriding the root cause). Make the fix in the
`failure` branch that calls `persist_failure_checkpoint`, `fail_run`, and
returns `Err(error)` so the run still records `Failed` status and emits the
failure event even when checkpoint storage is unavailable.

In `@src/harness/retry/mod.rs`:
- Around line 69-95: sleep_backoff currently drops jitter by calling
backoff_for_attempt(), which hardcodes a zero jitter value and makes
with_jitter(true) ineffective. Update RetryPolicy::sleep_backoff to take a
caller-supplied jitter value or RNG and use it when computing the delay, so the
backoff sleep preserves the randomized timing instead of always becoming
deterministic. Keep the behavior gated by backoff_sleep and continue returning
immediately when sleeping is disabled or the computed delay is zero.

---

Outside diff comments:
In `@src/harness/agent_loop/mod.rs`:
- Around line 763-765: The doc comment on invoke_model_resolving is stale and
contradicts the new opt-in backoff sleep behavior; update that method-level
documentation to match the actual retry flow. Specifically, revise the note
about the computed backoff duration so it no longer says it is intentionally not
slept on, and instead reflects that
self.policy.retry.sleep_backoff(attempt).await may sleep when
RetryPolicy::with_backoff_sleep(true) is enabled. Keep the wording consistent
with the module-level docs and the retry/backoff logic in
invoke_model_resolving.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: a58c9311-405a-4dac-bb8d-19c3bbd932c8

📥 Commits

Reviewing files that changed from the base of the PR and between a86a59f and 17cec87.

📒 Files selected for processing (12)
  • README.md
  • docs/modules/graph/fault-tolerance.md
  • examples/resilient_graph.rs
  • src/graph/compiled/mod.rs
  • src/graph/compiled/test.rs
  • src/graph/compiled/types.rs
  • src/graph/stream/types.rs
  • src/harness/agent_loop/mod.rs
  • src/harness/middleware/library/mod.rs
  • src/harness/retry/mod.rs
  • src/harness/retry/test.rs
  • src/harness/retry/types.rs

Comment thread src/graph/compiled/mod.rs
Comment on lines +987 to +1033
// Node-handler failure (survived any node-retry policy): the updates
// of the branches that completed before it are already folded into
// `state` above, so persist a resumable failure-boundary checkpoint
// scheduling the failed node (and the not-yet-run tail) for a later
// `resume`/`retry`, record a `Failed` status carrying the error and
// that checkpoint, and abort. Without a checkpointer/thread the
// checkpoint is a no-op and the run aborts exactly as before.
if let Some(fail) = failure {
let StepFailure {
failed_node,
completed,
pending,
error,
} = fail;
let checkpoint_id = self
.persist_failure_checkpoint(
&thread_id,
&run_id,
&state,
&pending,
&completed,
parent_checkpoint.clone(),
steps,
&failed_node,
&error,
&recursion_meta,
&child_runs_meta,
)
.await?;
self.fail_run(
&run_id,
&thread_id,
started_at,
steps,
&error,
checkpoint_id,
)
.await;
return Err(error);
}

// Interrupt: persist a checkpoint whose next nodes are the
// not-yet-completed members of this step (interrupted node first),
// then return control to the caller.
if let Some((index, emitted)) = interrupt {
if let Err(err) = self.require_interrupt_durability(&thread_id) {
self.fail_run(&run_id, &thread_id, started_at, steps, &err)
self.fail_run(&run_id, &thread_id, started_at, steps, &err, None)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Checkpoint-persist failure masks the original node error and skips fail_run.

If persist_failure_checkpoint(...) itself returns Err (e.g. the checkpointer's put fails — a transient storage/network error), the ? at line ~1015 immediately propagates that error, bypassing fail_run entirely. Two consequences:

  • The original node error (the actual root cause) is discarded and replaced by the checkpoint-write error.
  • No RunFailed event is emitted and no terminal Failed status is ever recorded via save_status, leaving any status-store observer stuck showing Running.

This is especially ironic given the point of this code path is resilience under failure — a checkpoint write hiccup at exactly the moment a node fails will silently drop the failure lifecycle bookkeeping.

🐛 Proposed fix: always call `fail_run` and preserve the original error
-                let checkpoint_id = self
-                    .persist_failure_checkpoint(
-                        &thread_id,
-                        &run_id,
-                        &state,
-                        &pending,
-                        &completed,
-                        parent_checkpoint.clone(),
-                        steps,
-                        &failed_node,
-                        &error,
-                        &recursion_meta,
-                        &child_runs_meta,
-                    )
-                    .await?;
+                let checkpoint_id = self
+                    .persist_failure_checkpoint(
+                        &thread_id,
+                        &run_id,
+                        &state,
+                        &pending,
+                        &completed,
+                        parent_checkpoint.clone(),
+                        steps,
+                        &failed_node,
+                        &error,
+                        &recursion_meta,
+                        &child_runs_meta,
+                    )
+                    .await
+                    .unwrap_or(None);
                 self.fail_run(
                     &run_id,
                     &thread_id,
                     started_at,
                     steps,
                     &error,
                     checkpoint_id,
                 )
                 .await;
                 return Err(error);
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// Node-handler failure (survived any node-retry policy): the updates
// of the branches that completed before it are already folded into
// `state` above, so persist a resumable failure-boundary checkpoint
// scheduling the failed node (and the not-yet-run tail) for a later
// `resume`/`retry`, record a `Failed` status carrying the error and
// that checkpoint, and abort. Without a checkpointer/thread the
// checkpoint is a no-op and the run aborts exactly as before.
if let Some(fail) = failure {
let StepFailure {
failed_node,
completed,
pending,
error,
} = fail;
let checkpoint_id = self
.persist_failure_checkpoint(
&thread_id,
&run_id,
&state,
&pending,
&completed,
parent_checkpoint.clone(),
steps,
&failed_node,
&error,
&recursion_meta,
&child_runs_meta,
)
.await?;
self.fail_run(
&run_id,
&thread_id,
started_at,
steps,
&error,
checkpoint_id,
)
.await;
return Err(error);
}
// Interrupt: persist a checkpoint whose next nodes are the
// not-yet-completed members of this step (interrupted node first),
// then return control to the caller.
if let Some((index, emitted)) = interrupt {
if let Err(err) = self.require_interrupt_durability(&thread_id) {
self.fail_run(&run_id, &thread_id, started_at, steps, &err)
self.fail_run(&run_id, &thread_id, started_at, steps, &err, None)
// Node-handler failure (survived any node-retry policy): the updates
// of the branches that completed before it are already folded into
// `state` above, so persist a resumable failure-boundary checkpoint
// scheduling the failed node (and the not-yet-run tail) for a later
// `resume`/`retry`, record a `Failed` status carrying the error and
// that checkpoint, and abort. Without a checkpointer/thread the
// checkpoint is a no-op and the run aborts exactly as before.
if let Some(fail) = failure {
let StepFailure {
failed_node,
completed,
pending,
error,
} = fail;
let checkpoint_id = self
.persist_failure_checkpoint(
&thread_id,
&run_id,
&state,
&pending,
&completed,
parent_checkpoint.clone(),
steps,
&failed_node,
&error,
&recursion_meta,
&child_runs_meta,
)
.await
.unwrap_or(None);
self.fail_run(
&run_id,
&thread_id,
started_at,
steps,
&error,
checkpoint_id,
)
.await;
return Err(error);
}
// Interrupt: persist a checkpoint whose next nodes are the
// not-yet-completed members of this step (interrupted node first),
// then return control to the caller.
if let Some((index, emitted)) = interrupt {
if let Err(err) = self.require_interrupt_durability(&thread_id) {
self.fail_run(&run_id, &thread_id, started_at, steps, &err, None)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/graph/compiled/mod.rs` around lines 987 - 1033, In the failure-handling
path in compiled/mod.rs around the `failure` branch,
`persist_failure_checkpoint(...)` currently uses `?`, which can replace the
original node error and skip `fail_run` entirely if checkpoint persistence
fails. Update this path so `fail_run` is always invoked and the original `error`
from `StepFailure` is preserved as the terminal failure, while any
checkpoint-write error is handled separately (for example by logging or
attaching it without overriding the root cause). Make the fix in the `failure`
branch that calls `persist_failure_checkpoint`, `fail_run`, and returns
`Err(error)` so the run still records `Failed` status and emits the failure
event even when checkpoint storage is unavailable.

Comment thread src/harness/retry/mod.rs
Comment on lines +69 to +95
/// Enables or disables actually sleeping for the computed backoff between
/// retries.
///
/// Off by default so tests stay deterministic and fast. Enable it in
/// production so a transient failure is retried after a real, growing delay
/// rather than back-to-back. See [`RetryPolicy::backoff_sleep`].
pub fn with_backoff_sleep(mut self, sleep: bool) -> Self {
self.backoff_sleep = sleep;
self
}

/// Sleeps for this policy's backoff before the given retry `attempt`, but
/// only when [`RetryPolicy::backoff_sleep`] is enabled.
///
/// A single, reusable helper so every retry loop that honors a
/// [`RetryPolicy`] gets identical, opt-in backoff behavior. A no-op (returns
/// immediately) when sleeping is disabled or the computed backoff is zero.
pub async fn sleep_backoff(&self, attempt: usize) {
if !self.backoff_sleep {
return;
}
let backoff = self.backoff_for_attempt(attempt);
if backoff > Duration::ZERO {
tokio::time::sleep(backoff).await;
}
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check whether a rand-like crate is already a dependency.
rg -n '^rand|"rand"|rand ' Cargo.toml
rg -n 'jitter' -g '*.rs' src

Repository: tinyhumansai/tinyagents

Length of output: 1824


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo '--- Cargo.toml deps ---'
sed -n '1,220p' Cargo.toml | nl -ba | sed -n '1,220p' | rg -n 'rand|tokio|fastrand|rng|uuid|small_rng|thread_rng|getrandom'

echo
echo '--- retry mod implementation ---'
sed -n '1,220p' src/harness/retry/mod.rs | nl -ba | sed -n '1,220p'

echo
echo '--- sleep_backoff usages ---'
rg -n 'sleep_backoff\(' src

Repository: tinyhumansai/tinyagents

Length of output: 221


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo '--- Cargo.toml deps ---'
rg -n '^(rand|fastrand|getrandom|tokio|uuid)\s*=|rand\s*=|fastrand\s*=|getrandom\s*=' Cargo.toml || true

echo
echo '--- src/harness/retry/mod.rs ---'
sed -n '1,220p' src/harness/retry/mod.rs | cat -n

echo
echo '--- src/harness/retry/types.rs ---'
sed -n '1,180p' src/harness/retry/types.rs | cat -n

echo
echo '--- sleep_backoff usages ---'
rg -n 'sleep_backoff\(' src

Repository: tinyhumansai/tinyagents

Length of output: 16137


sleep_backoff must preserve jitter. backoff_for_attempt() always uses 0.0, so with_backoff_sleep(true) + with_jitter(true) turns the delay into 0ms and retries run back-to-back. Pass a caller-supplied [0, 1) value or RNG into sleep_backoff instead of using the deterministic helper here.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@src/harness/retry/mod.rs` around lines 69 - 95, sleep_backoff currently drops
jitter by calling backoff_for_attempt(), which hardcodes a zero jitter value and
makes with_jitter(true) ineffective. Update RetryPolicy::sleep_backoff to take a
caller-supplied jitter value or RNG and use it when computing the delay, so the
backoff sleep preserves the randomized timing instead of always becoming
deterministic. Keep the behavior gated by backoff_sleep and continue returning
immediately when sleeping is disabled or the computed delay is zero.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant