Skip to content

[codex] Harden multiversion iterator validation#3656

Open
codchen wants to merge 1 commit into
mainfrom
codex/harden-multiversion-iterator-validation
Open

[codex] Harden multiversion iterator validation#3656
codchen wants to merge 1 commit into
mainfrom
codex/harden-multiversion-iterator-validation

Conversation

@codchen

@codchen codchen commented Jun 29, 2026

Copy link
Copy Markdown
Collaborator

Summary

  • Add a regression test for stale validation iterator keys after a lower-index transaction is re-executed with a smaller writeset.
  • Make validation iterators ignore removed multiversion entries without treating real deletes as removed keys.
  • Harden iterator validation so recovered panics become validation failures with diagnostic logging, and abort notification cannot leave an ownerless or blocked validation goroutine.

Root Cause

validationIterator.Value() assumed every captured key still had a multiversion value. When a lower-index transaction was re-executed and stopped writing one of those keys, the stale key could remain in the validation iterator's key list while GetLatestBeforeIndex returned nil, leading to a nil pointer dereference.

Validation

  • go test ./sei-cosmos/store/multiversion ./sei-cosmos/tasks ./giga/deps/tasks -count=1
  • git diff --check

@github-actions

github-actions Bot commented Jun 29, 2026

Copy link
Copy Markdown

The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).

BuildFormatLintBreakingUpdated (UTC)
✅ passed✅ passed✅ passed✅ passedJun 29, 2026, 9:29 AM

@codecov

codecov Bot commented Jun 29, 2026

Copy link
Copy Markdown

Codecov Report

❌ Patch coverage is 88.50575% with 10 lines in your changes missing coverage. Please review.
✅ Project coverage is 58.20%. Comparing base (b8f05ce) to head (04a43a7).

Files with missing lines Patch % Lines
sei-cosmos/store/multiversion/store.go 86.66% 4 Missing and 4 partials ⚠️
sei-cosmos/store/multiversion/memiterator.go 92.59% 1 Missing and 1 partial ⚠️
Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #3656      +/-   ##
==========================================
- Coverage   59.15%   58.20%   -0.96%     
==========================================
  Files        2262     2178      -84     
  Lines      187031   177468    -9563     
==========================================
- Hits       110644   103294    -7350     
+ Misses      66438    65011    -1427     
+ Partials     9949     9163     -786     
Flag Coverage Δ
sei-chain-pr 90.73% <88.50%> (?)
sei-db 70.41% <ø> (ø)
sei-db-state-db ?

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
sei-cosmos/store/multiversion/memiterator.go 86.48% <92.59%> (+2.81%) ⬆️
sei-cosmos/store/multiversion/store.go 89.96% <86.66%> (-2.01%) ⬇️

... and 84 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@codchen codchen marked this pull request as ready for review June 29, 2026 04:21
@cursor

cursor Bot commented Jun 29, 2026

Copy link
Copy Markdown

PR Summary

Medium Risk
Touches OCC transaction validation on the consensus store path; behavior changes (invalid vs panic) are intentional but affect when txs fail validation or abort.

Overview
Fixes multiversion iterator validation for OCC when lower-index transactions are re-executed with a smaller writeset. Stale keys could stay on the validation iterator’s key list while GetLatestBeforeIndex returned nil, which led to nil pointer dereferences in Value().

validationIterator now skips keys with no current value (writeset, read cache, or MV entry) via skipRemovedKeys on Valid/Next/Key, returns early when MV lookup is nil, and sends estimate aborts through a non-blocking WriteAbort so a full abort channel cannot block validation.

Iterator validation runs replay in a goroutine with panic recovery (logged, treated as invalid), splits logic into validateIteratorReplay, polls aborts during the merge loop instead of blocking on the abort channel, and closes iterators in defer.

Adds regression tests for stale keys after writeset rewrite (blocking parent iterator), panic recovery, and multiple estimates completing without hanging.

Reviewed by Cursor Bugbot for commit 04a43a7. Bugbot is set up for automated code reviews on this repo. Configure here.

@seidroid seidroid Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A conservative, well-tested fix that prevents a nil-pointer dereference in validation-iterator value resolution and contains iterator panics as validation failures. Logic is correct and safe (skipping vanished keys only ever pushes toward re-execution, never a false pass); only minor non-blocking simplification notes.

Findings: 0 blocking | 5 non-blocking | 2 posted inline

Blockers

  • None at the file/PR level.

Non-blocking

  • The per-iterator validation goroutine no longer serves its original purpose. It previously ran concurrently so the main goroutine could listen for aborts on a channel; now aborts are polled inline via iteratorValidationAborted and the main goroutine just blocks on <-validChannel. Since recover() works within the same goroutine, panic containment does not require the goroutine either. Consider running validateIteratorReplay inline with a deferred recover to avoid a goroutine spawn per iteration tracker on this hot validation path (validateIterator is called once per tracker in checkIteratorAtIndex).
  • Codex and Cursor second-opinion passes both produced no material findings (codex-review.md notes it could not run the package tests due to blocked network access for the Go 1.25.6 toolchain; cursor-review.md was empty).
  • Confirm the new tests run under CI's race detector — they spin up goroutines coordinated via channels and a 5s timeout; the blocking/panic mock stores share state behind mutexes which looks correct, but worth ensuring go test -race passes for this package in CI.
  • 2 suggestion(s)/nit(s) flagged inline on specific lines.

return false
}
// return whether we found the exact number of expected keys
return foundKeys >= len(expectedKeys)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nit] iteratorValidationAborted(abortChan) is polled five times per loop iteration (lines 329, 335, 343, 361) plus once after the loop. This is harmless since it's a non-blocking select, but the repetition is heavy — a single check at the top of each iteration plus the post-loop check would be equivalent given nothing inside the loop body produces an abort (aborts originate from validationIterator.Value() invoked indirectly via mergeIterator.Valid()/Key()). Minor cleanup only.

}

// try to get value from the writeset, otherwise try to get from multiversion store, otherwise try to get from parent iterator
func (vi *validationIterator) Value() []byte {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nit] Value() deliberately reads vi.Iterator.Key() (the raw underlying key) rather than calling the overridden Key() which would run skipRemovedKeys. This is correct in practice because the merge iterator always positions via Valid()/Key() before calling Value(), and when it isn't positioned, GetLatestBeforeIndex returning nil now yields nil safely. A one-line comment noting this invariant (Value assumes the iterator is already positioned on a present key) would help future readers, since the asymmetry between Value() and the other overrides is subtle.

@masih masih added the backport release/v6.6 Backport to release v6.6 label Jun 29, 2026
@masih masih requested a review from sei-will June 29, 2026 10:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants