Skip to content

feat(grpc): add exponential backoff for reconnection attempts#789

Open
Molter73 wants to merge 7 commits into
mainfrom
mauro/feat/reconnect-backoff
Open

feat(grpc): add exponential backoff for reconnection attempts#789
Molter73 wants to merge 7 commits into
mainfrom
mauro/feat/reconnect-backoff

Conversation

@Molter73

@Molter73 Molter73 commented Jun 10, 2026

Copy link
Copy Markdown
Collaborator

Description

If connection to a gRPC server fails, fact was into a loop trying to reconnect every second which is pretty noisy and not really needed. With this change we add an exponential backoff algorithm that will try to reconnect with some leeway.

In the interest of making it obvious fact is failing to connect on k8s, we probably want to crash the application if we reach the maximum backoff, but there is currently no easy way to do that, so I'll leave it to a follow up.

Checklist

  • Patch has a change log entry OR does not need one.
  • Investigated and inspected CI test results
  • Updated documentation accordingly

Automated testing

  • Added unit tests
  • Added integration tests
  • Added regression tests

If any of these don't apply, please comment below.

Testing Performed

Without this change, reconnection attempts every second
[DEBUG 2026-06-10T11:35:37Z] (fact/src/output/grpc.rs:130) Failed to connect to server: transport error

Caused by:
    0: tcp connect error
    1: tcp connect error
    2: Connection refused (os error 111)
[DEBUG 2026-06-10T11:35:37Z] (fact/src/host_scanner.rs:184) Added entry for /etc/sensitive-files/something: inode_key_t { inode: 202489561, dev: 39 }
[DEBUG 2026-06-10T11:35:37Z] (fact/src/host_scanner.rs:184) Added entry for /etc/sensitive-files/test: inode_key_t { inode: 196873539, dev: 39 }
[DEBUG 2026-06-10T11:35:37Z] (fact/src/host_scanner.rs:184) Added entry for /etc/sensitive-files: inode_key_t { inode: 126382982, dev: 39 }
[DEBUG 2026-06-10T11:35:37Z] (fact/src/host_scanner.rs:139) Host scan done
[INFO  2026-06-10T11:35:38Z] (fact/src/output/grpc.rs:126) Attempting to connect to gRPC server...
[WARN  2026-06-10T11:35:38Z] (fact/src/output/grpc.rs:116) Using unencrypted gRPC channel
[DEBUG 2026-06-10T11:35:38Z] (fact/src/output/grpc.rs:130) Failed to connect to server: transport error

Caused by:
    0: tcp connect error
    1: tcp connect error
    2: Connection refused (os error 111)
[INFO  2026-06-10T11:35:39Z] (fact/src/output/grpc.rs:126) Attempting to connect to gRPC server...
[WARN  2026-06-10T11:35:39Z] (fact/src/output/grpc.rs:116) Using unencrypted gRPC channel
[DEBUG 2026-06-10T11:35:39Z] (fact/src/output/grpc.rs:130) Failed to connect to server: transport error

Caused by:
    0: tcp connect error
    1: tcp connect error
    2: Connection refused (os error 111)
[INFO  2026-06-10T11:35:40Z] (fact/src/output/grpc.rs:126) Attempting to connect to gRPC server...
[WARN  2026-06-10T11:35:40Z] (fact/src/output/grpc.rs:116) Using unencrypted gRPC channel
[DEBUG 2026-06-10T11:35:40Z] (fact/src/output/grpc.rs:130) Failed to connect to server: transport error

Caused by:
    0: tcp connect error
    1: tcp connect error
    2: Connection refused (os error 111)
[INFO  2026-06-10T11:35:41Z] (fact/src/output/grpc.rs:126) Attempting to connect to gRPC server...
[WARN  2026-06-10T11:35:41Z] (fact/src/output/grpc.rs:116) Using unencrypted gRPC channel
[DEBUG 2026-06-10T11:35:41Z] (fact/src/output/grpc.rs:130) Failed to connect to server: transport error

Caused by:
    0: tcp connect error
    1: tcp connect error
    2: Connection refused (os error 111)
With this change, exponential attempts
[DEBUG 2026-06-10T11:41:03Z] (fact/src/output/grpc.rs:168) Failed to connect to server: transport error

Caused by:
    0: tcp connect error
    1: tcp connect error
    2: Connection refused (os error 111), retrying in 1s
[DEBUG 2026-06-10T11:41:03Z] (fact/src/host_scanner.rs:184) Added entry for /etc/sensitive-files/my-dir/nested: inode_key_t { inode: 194573388, dev: 39 }
[DEBUG 2026-06-10T11:41:03Z] (fact/src/host_scanner.rs:184) Added entry for /etc/sensitive-files/my-dir/nested/something: inode_key_t { inode: 194573419, dev: 39 }
[DEBUG 2026-06-10T11:41:03Z] (fact/src/host_scanner.rs:184) Added entry for /etc/sensitive-files/my-dir/something: inode_key_t { inode: 200442630, dev: 39 }
[DEBUG 2026-06-10T11:41:03Z] (fact/src/host_scanner.rs:184) Added entry for /etc/sensitive-files/something: inode_key_t { inode: 202489561, dev: 39 }
[DEBUG 2026-06-10T11:41:03Z] (fact/src/host_scanner.rs:184) Added entry for /etc/sensitive-files/test: inode_key_t { inode: 196873539, dev: 39 }
[DEBUG 2026-06-10T11:41:03Z] (fact/src/host_scanner.rs:184) Added entry for /etc/sensitive-files: inode_key_t { inode: 126382982, dev: 39 }
[DEBUG 2026-06-10T11:41:03Z] (fact/src/host_scanner.rs:139) Host scan done
[INFO  2026-06-10T11:41:04Z] (fact/src/output/grpc.rs:163) Attempting to connect to gRPC server...
[WARN  2026-06-10T11:41:04Z] (fact/src/output/grpc.rs:152) Using unencrypted gRPC channel
[DEBUG 2026-06-10T11:41:04Z] (fact/src/output/grpc.rs:168) Failed to connect to server: transport error

Caused by:
    0: tcp connect error
    1: tcp connect error
    2: Connection refused (os error 111), retrying in 2s
[INFO  2026-06-10T11:41:06Z] (fact/src/output/grpc.rs:163) Attempting to connect to gRPC server...
[WARN  2026-06-10T11:41:06Z] (fact/src/output/grpc.rs:152) Using unencrypted gRPC channel
[DEBUG 2026-06-10T11:41:06Z] (fact/src/output/grpc.rs:168) Failed to connect to server: transport error

Caused by:
    0: tcp connect error
    1: tcp connect error
    2: Connection refused (os error 111), retrying in 4s
[INFO  2026-06-10T11:41:10Z] (fact/src/output/grpc.rs:163) Attempting to connect to gRPC server...
[WARN  2026-06-10T11:41:10Z] (fact/src/output/grpc.rs:152) Using unencrypted gRPC channel
[DEBUG 2026-06-10T11:41:10Z] (fact/src/output/grpc.rs:168) Failed to connect to server: transport error

Caused by:
    0: tcp connect error
    1: tcp connect error
    2: Connection refused (os error 111), retrying in 8s

Summary by CodeRabbit

  • New Features

    • Configurable gRPC backoff (initial/max durations, multiplier, jitter) via YAML, CLI flags, and environment variables.
  • Improvements

    • gRPC reconnection uses exponential backoff with reset and logging instead of a fixed retry delay.
    • scan_interval and backoff durations accept richer numeric inputs and enforce validation (non-negative/positive, finite, multiplier > 1).
  • Tests

    • Expanded parsing, validation, merge/update, defaults, env-var, and error-case tests for scan_interval and gRPC backoff.
  • Documentation

    • Changelog updated noting the gRPC backoff change.

@Molter73 Molter73 requested a review from a team as a code owner June 10, 2026 11:42
@coderabbitai

coderabbitai Bot commented Jun 10, 2026

Copy link
Copy Markdown

Review Change Stack

No actionable comments were generated in the recent review. 🎉

ℹ️ Recent review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Enterprise

Run ID: 332a5f88-3fee-4cfd-82ba-199c2b5ca5a7

📥 Commits

Reviewing files that changed from the base of the PR and between 37a1ae1 and 1acb809.

📒 Files selected for processing (2)
  • fact/src/config/mod.rs
  • fact/src/config/tests.rs
🚧 Files skipped from review as they are similar to previous changes (1)
  • fact/src/config/tests.rs

📝 Walkthrough

Walkthrough

Centralized duration parsing added; BackoffConfig introduced and wired into GrpcConfig and CLI; scan_interval parsing moved to Duration; gRPC reconnect loop uses exponential backoff; tests, changelog, and rand dependency updated.

Changes

gRPC Exponential Backoff Configuration and Duration Parsing

Layer / File(s) Summary
Duration parsing infrastructure
fact/src/config/mod.rs
Introduces yaml_to_duration_secs and parse_duration_secs helpers that validate non-negative/finite (and strictly positive where required) durations and adjust derives.
scan_interval YAML parsing & GrpcConfig.update()
fact/src/config/mod.rs
scan_interval YAML parsing now uses the shared YAML duration helper and stores Duration; GrpcConfig::update() merges nested backoff settings.
BackoffConfig type and YAML parsing
fact/src/config/mod.rs
Adds BackoffConfig with optional initial/max, jitter, and multiplier, accessor methods, and TryFrom<&yaml::Hash> parsing that validates values and treats zero durations as unset.
CLI flags and FactCli wiring
fact/src/config/mod.rs
FactCli gains FACT_GRPC_BACKOFF_INITIAL, FACT_GRPC_BACKOFF_MAX, FACT_GRPC_BACKOFF_MULTIPLIER, and jitter flags; FACT_SCAN_INTERVAL now parses to Option<Duration>; CLI fields are wired into GrpcConfig.backoff.
Backoff implementation and gRPC retry loop
fact/src/output/grpc.rs
Adds private Backoff with exponential multiplier, optional jitter, capping at max, and reset(); gRPC run loop uses backoff.next() on failures and backoff.reset() on success instead of a fixed 1s delay.
Tests, changelog, and dependency updates
fact/src/config/tests.rs, fact/src/output/grpc.rs, CHANGELOG.md, Cargo.toml, fact/Cargo.toml
Extensive tests for parsing, errors, merging, defaults, env-var overrides, and Backoff unit tests; CHANGELOG entry added; rand added to workspace and crate deps.

Sequence Diagram(s)

sequenceDiagram
  participant GRPCRunLoop
  participant ChannelCreator
  participant Backoff
  GRPCRunLoop->>ChannelCreator: create_channel()
  ChannelCreator-->>GRPCRunLoop: Err
  GRPCRunLoop->>Backoff: next()
  Backoff-->>GRPCRunLoop: delay (Duration)
  GRPCRunLoop->>GRPCRunLoop: sleep(delay)
  GRPCRunLoop->>ChannelCreator: create_channel()
  ChannelCreator-->>GRPCRunLoop: Ok(Channel)
  GRPCRunLoop->>Backoff: reset()
Loading

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~20 minutes

Possibly related PRs

  • stackrox/fact#788: Also modifies the gRPC reconnect loop (re-fetches TLS connector/certificates on each retry), related to reconnect/retry code.

Poem

🐰 I hop through YAML, CLI, and time,

parsing seconds, making backoff rhyme,
multipliers grow, jitter keeps it light,
retries wait patient, then reconnects right,
a rabbit cheers for stable network might.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 36.11% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The PR title 'feat(grpc): add exponential backoff for reconnection attempts' accurately describes the main change: implementing exponential backoff for gRPC reconnection instead of the prior fixed 1-second retry loop.
Description check ✅ Passed The PR description covers the problem (noisy 1s reconnection loop), the solution (exponential backoff), includes testing evidence (before/after logs), and addresses the template checklist including changelog and unit tests; only documentation updates are not addressed.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch mauro/feat/reconnect-backoff

Comment @coderabbitai help to get the list of available commands and usage tips.

Molter73 added 4 commits June 10, 2026 13:44
Assisted-by: claude-opus-4-6@default <noreply@opencode.ai>
…d env vars

Add BackoffConfig to GrpcConfig with YAML (grpc.backoff.initial,
grpc.backoff.max), CLI (--backoff-initial, --backoff-max), and env var
(FACT_GRPC_BACKOFF_INITIAL, FACT_GRPC_BACKOFF_MAX) support.

Extract yaml_to_duration_secs and parse_duration_secs helpers, reusing
them for scan_interval parsing as well.

Assisted-by: claude-opus-4-6@default <noreply@opencode.ai>
@Molter73 Molter73 force-pushed the mauro/feat/reconnect-backoff branch from ddb3e74 to fc69ffe Compare June 10, 2026 11:45
@codecov-commenter

codecov-commenter commented Jun 10, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 92.81768% with 13 lines in your changes missing coverage. Please review.
✅ Project coverage is 31.92%. Comparing base (d4bf87c) to head (1acb809).

Files with missing lines Patch % Lines
fact/src/output/grpc.rs 85.88% 12 Missing ⚠️
fact/src/config/mod.rs 98.95% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #789      +/-   ##
==========================================
+ Coverage   27.96%   31.92%   +3.95%     
==========================================
  Files          21       21              
  Lines        2596     2763     +167     
  Branches     2596     2763     +167     
==========================================
+ Hits          726      882     +156     
- Misses       1867     1878      +11     
  Partials        3        3              

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@Molter73 Molter73 enabled auto-merge (squash) June 10, 2026 13:58

@vladbologa vladbologa left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some high level feedback, as I don't think I should be reviewing rust code. 😅

Why not use this existing crate?

And also, I checked the stackrox repo and there we use 10s initial, 5 minute max, and a 1.5 multiplier (and also jitter). So you could use these settings for consistency.

@Molter73

Copy link
Copy Markdown
Collaborator Author

Just some high level feedback, as I don't think I should be reviewing rust code. 😅

You are not the only one in the team thinking this, but I need to start getting people into it because I can't be the only one 🤣

Why not use this existing crate?

With all the supply chain attacks going on these days I've become quite paranoid and would prefer we keep dependencies to a minimum. I've never heard of that crate, it looks interesting, but for a straightforward backoff implementation that is only used once (which is what I was going for), I didn't even consider looking for another dependency.

And also, I checked the stackrox repo and there we use 10s initial, 5 minute max, and a 1.5 multiplier (and also jitter). So you could use these settings for consistency.

Sounds good, I'll modify the defaults and try to add jitter as well (probably make the multiplier and jitter configurable as well).

@Molter73

Copy link
Copy Markdown
Collaborator Author

And also, I checked the stackrox repo and there we use 10s initial, 5 minute max, and a 1.5 multiplier (and also jitter). So you could use these settings for consistency.

Sounds good, I'll modify the defaults and try to add jitter as well (probably make the multiplier and jitter configurable as well).

Actually, as I went to change the deafults I realized for a component that does runtime security, we probably want to keep the reconnection as fast as possible, otherwise we will start to drop events. I will add jitter and change the multiplier though.

Add configurable jitter to the exponential backoff using rand crate
for uniform distribution over [0, delay]. Jitter is enabled by default
and can be toggled via YAML (grpc.backoff.jitter), CLI (--backoff-jitter
/ --no-backoff-jitter), or env var (FACT_GRPC_NO_BACKOFF_JITTER).

Assisted-by: claude-opus-4-6@default <noreply@opencode.ai>
@Molter73 Molter73 disabled auto-merge June 11, 2026 11:35
Add multiplier field to BackoffConfig (default 1.5x). Configurable via
YAML (grpc.backoff.multiplier), CLI (--backoff-multiplier), and env var
(FACT_GRPC_BACKOFF_MULTIPLIER). Must be > 1.0.

Drops Eq derive from config types in favor of PartialEq to support
storing multiplier as f64 directly.

Assisted-by: claude-opus-4-6@default <noreply@opencode.ai>
@Molter73 Molter73 requested review from a team and rhacs-bot as code owners June 11, 2026 11:39

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
fact/src/config/mod.rs (1)

425-433: 💤 Low value

Consider extracting multiplier validation into a shared helper.

The multiplier validation logic (finite, > 1.0) is duplicated between parse_multiplier (line 42) and this YAML parsing. While functionally correct, extracting a shared validation function would reduce duplication and keep the validation rules in one place.

♻️ Optional refactor to reduce duplication
+fn validate_multiplier(mult: f64) -> anyhow::Result<f64> {
+    if !mult.is_finite() || mult <= 1.0 {
+        bail!("multiplier must be > 1.0, got {mult}");
+    }
+    Ok(mult)
+}
+
 fn parse_multiplier(s: &str) -> anyhow::Result<f64> {
     let mult = s.parse::<f64>()?;
-    if !mult.is_finite() || mult <= 1.0 {
-        bail!("multiplier must be > 1.0, got {mult}");
-    }
-    Ok(mult)
+    validate_multiplier(mult)
 }

Then simplify the YAML parsing:

                 "multiplier" => {
-                    let multiplier = match v.as_f64().or_else(|| v.as_i64().map(|v| v as f64)) {
-                        Some(m) if !m.is_finite() || m <= 1.0 => {
-                            bail!("invalid grpc.backoff.multiplier: {v:?}")
-                        }
-                        None => {
-                            bail!("invalid grpc.backoff.multiplier: {v:?}")
-                        }
-                        Some(m) => m,
-                    };
+                    let Some(m) = v.as_f64().or_else(|| v.as_i64().map(|v| v as f64)) else {
+                        bail!("invalid grpc.backoff.multiplier: {v:?}")
+                    };
+                    let multiplier = validate_multiplier(m)
+                        .with_context(|| format!("invalid grpc.backoff.multiplier: {v:?}"))?;
                     backoff.multiplier = Some(multiplier);
                 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@fact/src/config/mod.rs` around lines 425 - 433, The YAML parsing block
duplicates multiplier validation logic; extract the shared validation into a
helper (e.g., parse_multiplier or validate_multiplier) and call it from both
places (the existing parse_multiplier caller and the YAML parsing code that
currently binds `multiplier`). Implement a function that accepts a
serde_json/serde_yaml Value or f64 and checks is_finite() and > 1.0, returns
Result<f64, Error>, then replace the match in the YAML branch with a call to
that helper to eliminate duplication and keep rules centralized.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@fact/src/config/mod.rs`:
- Around line 607-617: The env var and visibility are inverted for the backoff
jitter flags: move the env attribute from the negative flag to the positive flag
and swap visibility so the positive flag is the public env-controlled option.
Concretely, on the backoff_jitter field (symbol backoff_jitter) add env =
"FACT_GRPC_BACKOFF_JITTER" and remove hide = true (make it exposed), and on
no_backoff_jitter remove the env attribute and add hide = true (hide the
negative flag); keep the overrides_with relationships (backoff_jitter
overrides_with = "no_backoff_jitter" and vice versa) so behavior stays
consistent but follows the established positive-flag env pattern.

---

Nitpick comments:
In `@fact/src/config/mod.rs`:
- Around line 425-433: The YAML parsing block duplicates multiplier validation
logic; extract the shared validation into a helper (e.g., parse_multiplier or
validate_multiplier) and call it from both places (the existing parse_multiplier
caller and the YAML parsing code that currently binds `multiplier`). Implement a
function that accepts a serde_json/serde_yaml Value or f64 and checks
is_finite() and > 1.0, returns Result<f64, Error>, then replace the match in the
YAML branch with a call to that helper to eliminate duplication and keep rules
centralized.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yml

Review profile: CHILL

Plan: Enterprise

Run ID: 5206ea5e-dd20-4dac-8c98-2e989a4b7c5b

📥 Commits

Reviewing files that changed from the base of the PR and between fc69ffe and 37a1ae1.

⛔ Files ignored due to path filters (1)
  • Cargo.lock is excluded by !**/*.lock
📒 Files selected for processing (5)
  • Cargo.toml
  • fact/Cargo.toml
  • fact/src/config/mod.rs
  • fact/src/config/tests.rs
  • fact/src/output/grpc.rs
✅ Files skipped from review due to trivial changes (1)
  • fact/Cargo.toml
🚧 Files skipped from review as they are similar to previous changes (2)
  • fact/src/output/grpc.rs
  • fact/src/config/tests.rs

Comment thread fact/src/config/mod.rs Outdated
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants