feat(consensus): mock_chain_validation replay build + memIAVL state-sync restore fixes#3663
feat(consensus): mock_chain_validation replay build + memIAVL state-sync restore fixes#3663bdchatham wants to merge 3 commits into
Conversation
PR SummaryHigh Risk Overview Consensus / upgrade: New memIAVL: State-sync import/rewrite paths are idempotent on re-offered snapshots (stale tmp cleanup, adopt existing Tests / CI: Halt-only tests moved behind Reviewed by Cursor Bugbot for commit b88231b. Bugbot is set up for automated code reviews on this repo. Configure here. |
|
The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).
|
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #3663 +/- ##
==========================================
- Coverage 59.15% 58.49% -0.66%
==========================================
Files 2262 2196 -66
Lines 187031 180333 -6698
==========================================
- Hits 110632 105485 -5147
+ Misses 66455 65474 -981
+ Partials 9944 9374 -570
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
A well-isolated, build-tag-gated ConsensusPolicy that leaves the default/production build byte-for-byte unchanged, plus two prod-reachable memIAVL state-sync restore fixes that look correct. No production blockers; a few shadow-build-scoped notes and a missing unit test are worth addressing.
Findings: 0 blocking | 5 non-blocking | 2 posted inline
Blockers
- None at the file/PR level.
Non-blocking
- Missing unit test for buildLastCommitInfo's newly-reachable best-effort path (commit/valset size mismatch under mock_chain_validation). The PR acknowledges this as a reviewer-sanctioned deferred follow-up; it's shadow-tag-only and doesn't touch the EVM oracle, but the bounds-checked loop is exactly the kind of off-by-one that a small table test would lock down.
- Cursor's second-opinion review file (cursor-review.md) was empty — that pass produced no output. Codex's review was present and its two findings are incorporated below.
- mock_block_validation relaxation surface is widened by this PR in two places. ErrUpgradeBeforeTrigger is explicitly and intentionally added (documented). However the syncer.go ErrAppHash routing also silently extends mock_block_validation to swallow state-sync snapshot appHash mismatches — an undocumented side effect (see inline comment).
- 2 suggestion(s)/nit(s) flagged inline on specific lines.
There was a problem hiding this comment.
I didn't find any bugs, but this PR introduces a new build-tag-selected ConsensusPolicy framework, swallows ErrLastCommitVerify under mock_chain_validation, gates buildLastCommitInfo's panic invariant on TolerateLastCommitMismatch(), and routes x/upgrade + state-sync + blocksync through the policy — even with default-build isolation, the consensus-safety surface and the two flagged semantic changes warrant a human review.
Extended reasoning...
Overview
The PR touches 16 files across consensus-critical and state-sync subsystems: a new build-tag-selected ConsensusPolicy framework in sei-tendermint/types (default / mock_block_validation / mock_chain_validation variants), routing call sites in x/upgrade BeginBlocker, blocksync reactor's last-commit verify, statesync syncer's appHash check, and execution.go's buildLastCommitInfo; plus prod-reachable memIAVL state-sync restore fixes (idempotent restore, defer-release of import flock). New sentinel ErrUpgradeBeforeTrigger added.
Security risks
The PR is consensus-relaxing by design under shadow tags. The default-build isolation is the load-bearing safety guarantee: ConsensusPolicy.HandleError returns every error unchanged, TolerateLastCommitMismatch() is false, and the swallow code is not linked into a default binary. Two production-reachable changes still need scrutiny: (1) buildLastCommitInfo no longer unconditionally panics on commit/valset size mismatch — the panic is now gated by a policy call (still always-panics in default, but the invariant moved from a hard guard to a policy contract); (2) the memIAVL restore fixes affect every node doing state-sync, not just shadow builds. No injection / auth / crypto surface changes.
Level of scrutiny
High. This is consensus-critical and security-sensitive code. Even with airtight build-tag isolation, the policy framework introduces a new contract (HandleError + TolerateLastCommitMismatch) that future call sites must use correctly, and the swallow-allowlist semantics under mock_chain_validation determine what a replay-shadow can mask. The author themselves flagged two semantic changes for explicit sign-off and acknowledged a deferred unit test for the newly-reachable best-effort path in buildLastCommitInfo.
Other factors
The PR has thorough test coverage for the policy matrix across all three build tags, a per-sentinel swallow-matrix test, wrapped-error and label-mapping tests, and the memIAVL restore changes are accompanied by file-existence/ENOTEMPTY handling and defer-based lock release. The description notes an independent multi-specialist review converged to unanimous resolution. Only cursor[bot] has commented; no human review yet. Given the scope and the consensus-relaxing semantics, a human consensus/security reviewer should give explicit sign-off.
165c9e3 to
641b257
Compare
|
Thanks — addressed the non-blocking items in the latest push (force-update to
The empty |
There was a problem hiding this comment.
A well-structured, thoroughly-tested change that makes the build-tag-gated mock_chain_validation replay-shadow policy first-class and lands two prod-reachable memIAVL state-sync restore fixes. Default/production behavior is provably unchanged (zero-sized, tag-selected ConsensusPolicy); the only substantive note is a defensive gap in the idempotent-restore path that adopts an existing snapshot-<h> path without confirming it is a directory.
Findings: 0 blocking | 6 non-blocking | 1 posted inline
Blockers
- None at the file/PR level.
Non-blocking
- Inconsistency between the two memIAVL fixes:
RewriteSnapshot(db.go) keepsos.Renameand only adopts the existing target onfs.ErrExist/ENOTEMPTY, so renaming onto a regular file (ENOTDIR) still falls through to the error/cleanup path — the safe behavior.Close(import.go) instead stat-checks first and unconditionally adopts any existing path. Consider aligning the two so a non-directory target is treated as an error in both, or at least document why the divergence is intentional. REVIEW_GUIDELINES.md(pulled from the base branch) is empty/blank, so no repo-specific review standards could be applied in this pass.- The Cursor second-opinion file (
cursor-review.md) is empty — that review pass produced no output. The OpenAI Codex pass did produce output (theimport.goIsDirconcern), which is incorporated here. - Codex noted it could not run tests because the Go 1.25.6 toolchain download was network-blocked in its environment; tests were not independently re-run here either. The PR claims
go build/go testpass for default and both mock tags, plus memIAVL package tests — worth confirming CI shows green for all three build configurations. - Minor: the deferred
buildLastCommitInfobest-effort test the PR description lists as a follow-up appears to already be included (execution_lastcommit_shadow_test.go); the description's 'deferred follow-up' note is now stale and could be updated. - 1 suggestion(s)/nit(s) flagged inline on specific lines.
641b257 to
7fe93ad
Compare
There was a problem hiding this comment.
A well-isolated change that makes the mock_chain_validation replay-shadow build first-class and lands two prod-reachable memIAVL state-sync restore fixes. Default-build behavior is preserved at every routed call site; the only notable finding is a shadow-build-only signature-attribution concern in buildLastCommitInfo that does not affect production or the audit oracle.
Findings: 0 blocking | 5 non-blocking | 1 posted inline
Blockers
- None at the file/PR level.
Non-blocking
- REVIEW_GUIDELINES.md was empty/missing, so no repo-specific review standards could be applied.
- The Cursor second-opinion review (cursor-review.md) was empty — that pass produced no output.
- The PR explicitly defers a unit test for buildLastCommitInfo's best-effort path beyond the size-mismatch case (e.g. the by-address attribution behavior). Reviewers classified this low-risk/cuttable; reasonable to track as follow-up.
- Default-build isolation, the load-bearing safety guarantee, rests on link-level exclusion of the swallow path and the build-tag test suites. Consider a CI check asserting the default binary does not link the swallowing ConsensusPolicy variants, to keep this guarantee from silently regressing.
- 1 suggestion(s)/nit(s) flagged inline on specific lines.
7fe93ad to
74035bd
Compare
There was a problem hiding this comment.
A carefully-scoped change that routes a fixed allowlist of consensus-validation failures through a build-tag-selected ConsensusPolicy (swallowed only under mock_chain_validation/mock_block_validation) plus two prod-reachable memIAVL state-sync restore fixes. The load-bearing guarantee — byte-for-byte unchanged production behavior — holds: every routed default-build call site retains its original halt. No blockers; only minor non-blocking notes.
Findings: 0 blocking | 4 non-blocking | 1 posted inline
Blockers
- None at the file/PR level.
Non-blocking
- cursor-review.md is empty (Cursor produced no output) and codex-review.md reports no material findings; REVIEW_GUIDELINES.md is empty, so no repo-specific standards were applied. Noting per the review process.
- Behavioral change to the mock_block_validation build: ErrUpgradeBeforeTrigger is now swallowed there too (consensus_policy_mock_block_validation.go). This is intentional per the PR description but widens that tag's relaxation set — worth explicit maintainer sign-off.
- Good defensive testing: the mock_chain_validation swallow-matrix test asserts ErrDataHash/ErrEvidenceHash/ErrPerEvidenceValidateBasic are never in the allowlist, and the allowlist is halt-by-default for future sentinels.
- 1 suggestion(s)/nit(s) flagged inline on specific lines.
74035bd to
cb5fe23
Compare
There was a problem hiding this comment.
A well-scoped, well-tested change that promotes the mock_chain_validation replay build to a reviewed capability and lands two prod-reachable memIAVL state-sync restore fixes; production default-build behavior is preserved (all routed call sites still halt) and the swallow allowlist is halt-by-default with a build-failing guard protecting peer-content integrity checks. No blocking issues found.
Findings: 0 blocking | 3 non-blocking | 0 posted inline
Blockers
- None at the file/PR level.
Non-blocking
- Both second-opinion passes produced no findings to merge: the Codex review states "No material findings" (and notes it could not run tests due to a Go 1.24 vs required 1.25.6 mismatch with no network access), and the Cursor review file is empty. REVIEW_GUIDELINES.md is also empty, so no repo-specific standards were applied.
- memIAVL Close()/RewriteSnapshot adopt an existing snapshot- directory as complete without validating its contents. This is sound given the import flock serialization + atomic-rename invariant (the dir only exists if a prior Close fully renamed it), but it relies entirely on that invariant holding; worth a brief note for future maintainers who might reach these paths outside the flock-protected restore flow.
- Under mock_chain_validation, ErrTooMuchEvidence is in the swallow allowlist while ErrEvidenceHash/ErrPerEvidenceValidateBasic correctly still halt. Swallowing the evidence-quantity bound is acceptable for a non-validating replay build whose oracle is the out-of-band digest comparator, but it is the one allowlist entry not strictly tied to migration/validator-set drift — confirm this is intentional.
cb5fe23 to
4952ead
Compare
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 4952ead. Configure here.
There was a problem hiding this comment.
A carefully scoped, well-tested change that makes the mock_chain_validation replay build first-class and lands two prod-reachable memIAVL state-sync restore fixes. Production behavior is byte-for-byte unchanged (the swallowing ConsensusPolicy compiles only under mock tags), and the memIAVL idempotency/lock-release fixes are correct. No blocking issues found.
Findings: 0 blocking | 3 non-blocking | 0 posted inline
Blockers
- None at the file/PR level.
Non-blocking
- Second-opinion passes produced no usable output: Cursor's review file (cursor-review.md) is empty, and Codex (codex-review.md) reported no findings but explicitly could not run the targeted Go tests because the sandbox couldn't download Go 1.25.6. This review therefore rests on the PR's own CI run and manual analysis; consider confirming the green CI run (default + both mock tags) before merge.
- Behavioral widening worth an explicit sign-off (already called out in the PR description): routing the state-sync appHash check in syncer.go through HandleError means the mock_block_validation tag now also swallows ErrAppHash mismatches during state-sync verifyApp, not just block validation. This is consistent with that tag's existing ErrAppHash relaxation and affects only a mock build, but it is a real expansion of that tag's scope.
- On the swallow path in statesync/syncer.go, a swallowed appHash mismatch continues silently with no log line (logging only happens on the halting branch). This is intentional per the design comment, but a single debug/info breadcrumb on the swallow path would aid post-mortem of a replay that diverged.
08eaf03 to
c9f9b42
Compare
There was a problem hiding this comment.
The mock_chain_validation policy refactor and memIAVL restore fixes are well-structured and keep production behavior unchanged, but routing the state-sync appHash check through the policy leaves the untagged TestSyncer_verifyApp "invalid hash" case failing under both mock tags, and a safety-critical package doc is now stale.
Findings: 1 blocking | 3 non-blocking | 2 posted inline
Blockers
- None at the file/PR level.
- 1 blocking issue(s) flagged inline on specific lines.
Non-blocking
- Cursor's second-opinion review file (cursor-review.md) was empty — that pass produced no output, so its perspective is absent from this synthesis.
- PR description claims
go testpasses for default, mock_chain_validation, and mock_block_validation, but the statesync package's TestSyncer_verifyApp would fail under both mock tags (see inline). Either the tagged statesync tests weren't actually run, or the claim should be scoped to compilation only — worth clarifying. - 1 suggestion(s)/nit(s) flagged inline on specific lines.
c9f9b42 to
1707513
Compare
There was a problem hiding this comment.
A well-engineered, thoroughly-tested change that makes the mock_chain_validation replay build a first-class capability and lands two prod-reachable memIAVL state-sync restore fixes; default-build (production) behavior is preserved byte-for-byte at every routed call site, and no correctness or security issues were found. Verdict is neutral only because the two second-opinion passes added no usable signal (Cursor empty; Codex could not run tests) and a couple of minor non-blocking suggestions remain.
Findings: 0 blocking | 5 non-blocking | 1 posted inline
Blockers
- None at the file/PR level.
Non-blocking
- Cursor's second-opinion review file (cursor-review.md) is empty — that pass produced no output, so only this review plus Codex's (which also found nothing) cover the change.
- Codex's review reported "No material findings" but explicitly could NOT run
go test: its sandbox has Go 1.24.13 while the module requires 1.25.6 and the toolchain download was blocked. So Codex's pass did not exercise the test suites; rely on CI for the default + both-tag runs. - Two deliberate semantic relaxations are called out in the PR body and warrant explicit maintainer sign-off (not code defects): (a)
ErrLastCommitVerifyis now swallowed under mock_chain_validation, makingbuildLastCommitInfobuild positionally-approximate LastCommitInfo; (b) mock_block_validation now also relaxesErrUpgradeBeforeTriggerand the state-syncErrAppHashrouting. - Minor: the
DefaultConsensusPolicy()constructor name is slightly misleading — under mock build tags it returns the swallowing policy, not a "default/production" one. Pre-existing naming, but worth a doc note since this PR adds more call sites. - 1 suggestion(s)/nit(s) flagged inline on specific lines.
Superseded: latest AI review found no blocking issues.
aad1310 to
47eae02
Compare
There was a problem hiding this comment.
The PR cleanly gates all consensus-relaxing behavior behind build tags — the default (production) build's HandleError provably returns every error unchanged, so prod halting semantics are byte-for-byte preserved, and tests are well-scoped per build. The one substantive finding is a narrow but prod-reachable idempotency gap in the memIAVL restore path (stale current-tmp not cleared), confirming Codex's P1.
Findings: 0 blocking | 5 non-blocking | 1 posted inline
Blockers
- None at the file/PR level.
Non-blocking
- Restore idempotency is still incomplete for the symlink swap step (Codex P1, valid):
updateCurrentSymlink(db.go:1247) createscurrent-tmpviaos.Symlinkand never clears a stale one. If a prior restore crashed after creatingcurrent-tmpbut before the atomic rename tocurrent, a re-offer now adopts the existingsnapshot-<h>dir and then fails inupdateCurrentSymlinkwith EEXIST — re-introducing a crashloop the PR aims to eliminate. Window is narrow, but the path is prod-reachable (not tag-gated). Suggestos.Remove(currentTmpPath(dir))before the symlink, or clearing it alongside the tmp-dir cleanup on entry. Affects both import.go Close() and db.go RewriteSnapshot adopt arms. - Cursor second-opinion review (cursor-review.md) produced no output; only Codex's single P1 was available to merge.
mock_block_validation's swallow set is widened to includeErrUpgradeBeforeTrigger, changing that existing tag's behavior. Intentional and documented, and it does not affect the default build, but worth an explicit reviewer sign-off since it alters an established build's semantics.- Confirmed no production behavior change: default ConsensusPolicy.HandleError returns err unchanged, and buildLastCommitInfo is behavior-identical when commit size matches the validator set (the normal path), so the rewrite only diverges under mock builds.
- 1 suggestion(s)/nit(s) flagged inline on specific lines.
47eae02 to
17a6e19
Compare
…ay builds
Adds a build-tag-gated ConsensusPolicy ("mock_chain_validation") so a
non-validating node can replay real chain history despite the consensus
divergence inherent to replaying an AppHash-breaking storage migration
(memIAVL->flatKV) against a validator set it cannot reproduce bit-for-bit.
The default build is unchanged: DefaultConsensusPolicy().HandleError preserves
every panic/halt, verified byte-identical.
Under the mock_chain_validation tag the policy swallows an explicit ALLOWLIST of
the sentinels that drift for migration / validator-set / commit reasons. Peer-
supplied block-content integrity still HALTS -- ErrDataHash (the tx merkle root
feeding the EVM execution this build compares), ErrEvidenceHash,
ErrPerEvidenceValidateBasic -- so a malformed/lying peer cannot poison the audit
input. Allowlist => a sentinel added later halts by default.
Previously-unguarded checks route through the policy (inert in the default build),
all via the single HandleError(err) verb:
- blocksync/reactor.go: VerifyCommitLight -> policy.HandleError
- state/execution.go: buildLastCommitInfo routes a commit/validator-set size
mismatch through the policy (default panics; mock_chain_validation builds
best-effort commit info, which only feeds staking rewards/downtime)
- statesync/syncer.go: state-sync snapshot appHash check -> policy (also extends
mock_block_validation, which already relaxes ErrAppHash)
- x/upgrade/abci.go: binary-updated-before-trigger panic -> policy
Tests: SwallowMatrix pins the allowlist + the never-swallow content-integrity
guard; TestBlockValidateBasic probes the active policy; a build-tagged test pins
buildLastCommitInfo's best-effort path (commit/valset size mismatch, no panic).
…every path Re-bootstrapping a node from an S3 state-sync snapshot exposed two issues: restore re-entry after an interrupted import crashlooped (now stat-first idempotent -- clears the stale tmp dir on entry, adopts an already-completed snapshot-<h>, and rejects a non-directory at that path), and Close released the import flock only on the success path, so an error leaked it and a same-process re-offer failed with ErrLocked (Close now releases via defer on every path). The import flock serializes restore against any rewrite, so the existing-dir arms are defensive idempotency, not race handling.
The mock_chain_validation image backs the migrate_evm replay node, which must execute real chain history faithfully; mock_balances stubs balances (and panics on pacific-1), corrupting execution. The standalone mock_balances image (mock-<tag>) remains for benchmark chains that submit synthetic transactions.
17a6e19 to
b88231b
Compare
There was a problem hiding this comment.
A carefully-scoped PR that makes the build-tag-gated mock_chain_validation replay policy first-class and lands two prod-reachable memIAVL state-sync restore fixes. Production (default-build) behavior is unchanged at the link level and the changes are well-tested across all three build configs; no blocking issues found.
Findings: 0 blocking | 4 non-blocking | 0 posted inline
Blockers
- None at the file/PR level.
Non-blocking
- Second-opinion passes produced no actionable output: the Codex review reported "No material findings" (and noted it could not run the Go tests due to a Go 1.24 vs required 1.25.6 toolchain mismatch in its sandbox), and the Cursor review file was empty. Neither surfaced findings to merge.
- buildLastCommitInfo's best-effort path under mock_chain_validation (swallowing ErrLastCommitVerify) produces positionally-approximate Signatures↔Validators pairing. This is documented and only affects the non-production replay build (LastCommitInfo feeds staking rewards/downtime, not EVM state), but it is a genuine semantic relaxation worth explicit reviewer sign-off as the PR itself flags.
- The mock_block_validation swallow set is widened (ErrUpgradeBeforeTrigger + the state-sync ErrAppHash routing). Only affects mock builds, but it changes that tag's long-standing behavior — confirm this widening is intended for all consumers of mock_block_validation images.
- In RewriteSnapshot/Close adoption arms, an already-present snapshot- directory is adopted without validating its contents; this is safe given os.Rename atomicity and import-flock serialization (a snapshot- dir only exists from a completed atomic rename), but relies on those invariants holding.

Summary
Makes the
mock_chain_validationreplay build a first-class, reviewed capability, plus two prod-reachable memIAVL state-sync restore fixes it depends on. The build replays real chain history (e.g. an AppHash-breakingmemIAVL→flatKVmigration) as a non-validating node, checked by an out-of-band logical-digest comparator instead of consensus.Three self-contained commits:
feat(consensus)(the policy),fix(memiavl)(the restore fixes),ci(ecr)(the image build).Production behavior is unchanged
ConsensusPolicyis a build-tag-selected type whose swallowing variant compiles only under the mock tags. The default build returns every error unchanged, so every consensus halt is byte-for-byte preserved — confirmed at link level and by the test suites.What the replay build relaxes
Under
mock_chain_validationthe policy swallows the consensus checks a replayed/migrated chain can't reproduce (app-state and validator-set/commit hashes). Peer-supplied block content still HALTS — the transaction merkle root and evidence integrity — so a malicious peer can't poison the audit input. A build-failing test guard enforces that boundary.Two changes that warrant your explicit sign-off:
ErrLastCommitVerifyis swallowed, sobuildLastCommitInfobuilds best-effort commit info (approximate per-index pairing).LastCommitInfofeeds staking rewards/downtime, never EVM state.mock_block_validationwidened to also swallowErrUpgradeBeforeTriggerand the state-sync appHash check.memIAVL restore (prod-reachable, not tag-gated)
Re-bootstrapping from an S3 snapshot exposed two restore bugs, both fixed: an interrupted/re-offered restore crashlooped (restore is now idempotent — stale-tmp cleanup, adopt-an-existing-snapshot, reject non-directories), and
Close()leaked the import lock on error (now released on every path).Review & test
Blinded multi-specialist review with an assigned dissenter, iterated to unanimous RESOLVED; all automated-review threads (Cursor, Codex, seidroid, ai-review) resolved.
go build+go testpass under all three configs —default,mock_chain_validation,mock_block_validation— verified by a full sweep.