Add gov proposal based migration trigger#3650
Conversation
PR SummaryHigh Risk Overview The node-local config field, TOML flag, and config fallback are removed; default Tests and Reviewed by Cursor Bugbot for commit c20c5a8. Bugbot is set up for automated code reviews on this repo. Configure here. |
|
The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).
|
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #3650 +/- ##
==========================================
- Coverage 59.16% 58.18% -0.99%
==========================================
Files 2262 2178 -84
Lines 187009 177290 -9719
==========================================
- Hits 110643 103155 -7488
+ Misses 66430 64991 -1439
+ Partials 9936 9144 -792
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
| // composite SC backend drives the in-flight memiavl->flatkv migration that | ||
| // BeginBlock paces via the migration gov param. Fail fast if the legacy | ||
| // root multistore is somehow in use. | ||
| rs, ok := app.CommitMultiStore().(*rootmulti.Store) |
There was a problem hiding this comment.
If we need to cast like this, would it make sense to change the type returned by app.CommitMultiStore()?
| genesisImportConfig genesistypes.GenesisImportConfig | ||
|
|
||
| stateStore seidb.StateStore | ||
| rs *rootmulti.Store |
There was a problem hiding this comment.
nit, would it make sense to use a more descriptive name like rootStore? Out of context, it's not obvious what rs stands for.
There was a problem hiding this comment.
good suggestion, will change
| if migrationBatchSize < 0 { | ||
| return nil, fmt.Errorf("migration batch size must not be negative, got %d", migrationBatchSize) |
There was a problem hiding this comment.
Could we make migration batch size an unsigned integer?
There was a problem hiding this comment.
Yeah, that's a good point
There was a problem hiding this comment.
Tried that, looks like it would require a big refactory, will actually do a fallback here to 0 if it's negative
There was a problem hiding this comment.
A well-structured, thoroughly-tested change that moves the SC migration rate from a node-local config to a consensus-deterministic governance param (migration.NumKeysToMigratePerBlock), correctly read once per block in BeginBlock and pushed down to the composite SC store. No blocking correctness/security issues found; a few non-blocking notes, chiefly that the new subspace value is not genesis-exportable.
Findings: 0 blocking | 6 non-blocking | 2 posted inline
Blockers
- None at the file/PR level.
Non-blocking
- Migration param is not genesis-exportable (confirms Codex P2).
x/paramsExportGenesis(sei-cosmos/x/params/keeper/genesis.go) only exportsFeesParams/CosmosGasParams, and the newmigrationsubspace has no owning module's InitGenesis/ExportGenesis. After aseid export/import while a migration is in flight,NumKeysToMigratePerBlockis dropped and BeginBlock re-seeds the0(paused) default, halting the drain until governance re-submits a proposal. Data isn't lost (the boundary cursor lives in flatkv) and it's recoverable, but worth either adding genesis plumbing for this subspace or documenting the operational caveat. - Operational behavior change worth calling out in release notes: because the param defaults to
0(paused) and is the sole source of the rate, any chain already running in a migrate write-mode will pause its drain at the upgrade height until a param-change proposal raises the rate. Intentional per the consensus-safety design, but operators must know to submit the proposal. - REVIEW_GUIDELINES.md and cursor-review.md were empty/absent, so no repo-specific guidelines or Cursor second opinion were available; only the Codex pass (one P2, addressed above) contributed.
- Integration test nit (integration_test/gov_module/gov_proposal_test.yaml): the verifier
NEW_PARAM == 12345compares an unquoted numeric against a value the pipeline produces as a string (jq -r .value | tr -d "\""), unlike the sibling assertionNEW_ABCI_PARAM == "true". If the eval engine is type-strict this could mis-compare; quoting"12345"would match the established pattern. - 2 suggestion(s)/nit(s) flagged inline on specific lines.
| // composite SC backend drives the in-flight memiavl->flatkv migration that | ||
| // BeginBlock paces via the migration gov param. Fail fast if the legacy | ||
| // root multistore is somehow in use. | ||
| rs, ok := app.CommitMultiStore().(*rootmulti.Store) |
There was a problem hiding this comment.
[suggestion] This unconditional type assertion now panics in New() if the commit multistore is not *rootmulti.Store. New() is invoked by every seid subcommand that builds the app (export, query tooling, etc.), so please confirm no supported configuration (e.g. a legacy IAVL/storev2-disabled path) can reach here — otherwise this turns a previously-degraded path into a hard crash. If the legacy store is truly unsupported this is fine; a one-line note on why it's unreachable would help.
| } | ||
| numKeys := migration.DefaultNumKeysToMigratePerBlock | ||
| if subspace, ok := app.ParamsKeeper.GetSubspace(migration.SubspaceName); ok { | ||
| // The migration subspace has no owning module to seed it in InitGenesis, |
There was a problem hiding this comment.
[suggestion] The lazy seed correctly handles a fresh/never-set param deterministically, but because this subspace has no module ExportGenesis (x/params only exports Fees/CosmosGas params), the value is lost across a seid export/import: a mid-migration rate resets to the 0 default here and the drain pauses until governance re-raises it. Consider adding genesis import/export plumbing for this subspace, or documenting the caveat.
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes using default effort and found 2 potential issues.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Want higher recall? High effort reviews run extra passes and find more bugs. A team admin can switch effort levels in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 91e1766. Configure here.
There was a problem hiding this comment.
The PR cleanly moves the SC migration rate from a node-local config to a consensus-safe governance param, with thorough plumbing and tests. However, an accidental rename of the receipt-store backend constant breaks an existing test (and decouples the test harness from the real config key), which blocks merge.
Findings: 1 blocking | 4 non-blocking | 2 posted inline
Blockers
- None at the file/PR level.
- 1 blocking issue(s) flagged inline on specific lines.
Non-blocking
- Cursor's second-opinion pass produced no output (cursor-review.md is empty); REVIEW_GUIDELINES.md is also empty, so no repo-specific standards could be applied.
- validateNumKeysToMigratePerBlock accepts any uint64. Because the value is cast to int before NextBatch (migration_manager.go:299), a governance value above math.MaxInt64 becomes negative and halts commit deterministically. Consider bounding the param to a sane operational maximum.
- app/abci_test.go's comment that math.MaxUint64 is 'forwarded verbatim with no conversion or clamping' is misleading — the app layer stores it verbatim, but migration_manager converts it to int downstream.
- 1 suggestion(s)/nit(s) flagged inline on specific lines.
|
|
||
| const ( | ||
| receiptStoreBackendKey = "receipt-store.rs-backend" | ||
| receiptStoreBackendKey = "receipt-store.rootStore-backend" |
There was a problem hiding this comment.
[blocker] This key must stay receipt-store.rs-backend. The parser in sei-db/config/receipt_config.go reads flagRSBackend = "receipt-store.rs-backend" (and the rs-backend mapstructure tag), so renaming this constant to receipt-store.rootStore-backend decouples the app-layer key from what ReadReceiptConfig actually reads. Concretely, TestParseReceiptConfigs_RejectsInvalidBackend (app/seidb_test.go) now sets "rocksdb" under a key the parser ignores, the config falls back to the default pebbledb, and assert.Error fails — breaking CI. This looks like collateral damage from a global rs→rootStore rename (the comment at app/test_helpers.go:486 was hit too). Revert to receipt-store.rs-backend.
| if advanceMigration { | ||
| // Get the next batch of keys to migrate. | ||
| valuesToMigrate, newBoundary, err := m.iterator.NextBatch(m.migrationBatchSize) | ||
| valuesToMigrate, newBoundary, err := m.iterator.NextBatch(int(m.migrationBatchSize)) |
There was a problem hiding this comment.
[suggestion] int(m.migrationBatchSize) truncates a uint64 gov param. On 64-bit, a value above math.MaxInt64 becomes negative, NextBatch rejects size <= 0, and ApplyChangeSets returns an error during commit — a deterministic chain halt from a governance typo. Since validateNumKeysToMigratePerBlock accepts any uint64, consider bounding the param to a sane maximum (well under math.MaxInt) at validation time.
This reverts commit 91e1766.
There was a problem hiding this comment.
Converts the SC migration per-block rate from a consensus-unsafe node-local config into a deterministic governance param read in BeginBlock; the design is sound, AppHash-safe (all nodes read the same value, lazy-default-set runs identically), and very thoroughly tested at unit, store, and integration levels. Only minor, non-blocking observations.
Findings: 0 blocking | 5 non-blocking | 1 posted inline
Blockers
- None at the file/PR level.
Non-blocking
- New() now hard-panics if the commit multistore is not *rootmulti.Store (app/app.go:560). This is intentional fail-fast, but it is a behavior change: any App-construction path that could use a non-storev2 store (offline/CLI commands, tooling) would now crash rather than degrade. Worth confirming every code path that builds App goes through the storev2 rootmulti store.
- The two later casts in app.go (~line 770 and ~2685) still re-derive the store via app.CommitMultiStore().(*rootmulti.Store) instead of reusing the new app.rootStore field; minor redundancy now that the type is guaranteed at construction.
- Cursor's second-opinion review file (cursor-review.md) was empty — that pass produced no output. Codex reported no material issues.
- Consider a brief note/log when the lazy default-set writes the param the first time (app/abci.go), purely for operator visibility on the upgrade block; not required for correctness.
- 1 suggestion(s)/nit(s) flagged inline on specific lines.
Superseded: latest AI review found no blocking issues.
There was a problem hiding this comment.
The plumbing of the per-block migration rate into a governance param is clean, well-documented, and well-tested, but the PR also flips the default WriteMode to migrate_evm and relies on the claim that "paused migrate_evm produces the same app hash as memiavl_only" — which the routing code contradicts: brand-new EVM keys are written to flatkv yet excluded from the AppHash while paused, creating a fork/integrity risk. REVIEW_GUIDELINES.md and cursor-review.md were both empty; only Codex provided a second opinion, and its two P1 findings are corroborated.
Findings: 5 blocking | 4 non-blocking | 3 posted inline
Blockers
- Default WriteMode + migration trigger are not consistently network-wide. The default flips to migrate_evm (sc_config.go) but sc-write-mode remains node-local (app/seidb.go:109 only overrides when the toml key is non-empty). A node that picks up the new default runs migrate_evm while a peer with an explicit
sc-write-mode = memiavl_onlystays memiavl_only; once the gov param is raised, only the migrate_evm nodes drain. Because the paused-migrate_evm AppHash is NOT equivalent to memiavl_only (see inline finding), this mixed fleet can fork even before the migration is triggered. The PR's framing of the gov param as the 'sole' migration trigger overstates this — the write mode is still a consensus-relevant local switch. (Corroborates Codex P1 #2.) - Missing test coverage for the central safety claim: there is no test asserting that a paused migrate_evm store and a memiavl_only store produce identical AppHashes when brand-new EVM keys are written each block. Existing migration tests all set a positive batch size and drive the migration; the steady-state/paused path with new-key writes (the exact rollout state introduced by the default change) is unverified.
- 3 blocking issue(s) flagged inline on specific lines.
Non-blocking
- REVIEW_GUIDELINES.md is empty/missing, so no repo-specific standards were applied. cursor-review.md is empty — the Cursor pass produced no output. Only the Codex pass produced findings (both corroborated here).
- app/app.go: the fail-fast panic message says
expected *storev2_rootmulti.Storebut the import alias was changed torootmultiin this PR — cosmetic mismatch only. - integration_test/gov_module/gov_proposal_test.yaml: the verifier
NEW_PARAM == 12345compares a string (fromjq -r .value | tr -d '"') against an integer literal, whereas the sibling test uses a quoted string (NEW_ABCI_PARAM == "true"). Confirm the eval framework coerces types here, otherwise the assertion may not behave as intended. There is also a strayq gov proposal ... statusquery inserted between the two node votes — harmless but reads oddly. - app/abci.go applyMigrationBatchSize lazily persists the default param via subspace.Set on the first block it sees it unset. This is deterministic across nodes, but it means BeginBlock writes to the params store at the first post-upgrade height; worth a brief note that this is intended and idempotent.
| // all_migrated_but_bank, migrate_bank, flatkv_only, test_only_dual_write, auto. | ||
| // defaults to memiavl_only. | ||
| // defaults to migrate_evm. While the NumKeysToMigratePerBlock gov param is 0 | ||
| // (the default), migrate_evm is paused and produces the same app hash as |
There was a problem hiding this comment.
[blocker] This claim is not borne out by the routing code. In migrate_evm, MigrationManager.shouldForwardWriteToNewDB routes any caller write whose key is absent from memiavl directly to flatkv (migration_manager.go:431-435), and EVM creates brand-new keys (storage slots, accounts, receipts) every block. While paused (batch size 0) the boundary never advances, so shouldAppendLatticeHash returns false (composite/store.go:870-882) and flatkv is excluded from CommitInfo/AppHash. Result: paused migrate_evm commits new EVM state into flatkv that is outside the AppHash, whereas memiavl_only puts that same key into memiavl which IS in the AppHash — so the two AppHashes differ. This breaks the equivalence the rollout depends on (fork/halt risk in a mixed fleet) and leaves committed state uncovered by the merkle root. Please either route brand-new keys to memiavl while paused, or do not enable migrate_evm by default until the param is raised.
| return StateCommitConfig{ | ||
| Enable: true, | ||
| WriteMode: types.MemiavlOnly, | ||
| WriteMode: types.MigrateEVM, |
There was a problem hiding this comment.
[blocker] Changing the default WriteMode from MemiavlOnly to MigrateEVM is a consensus-relevant default flip. Combined with the node-local sc-write-mode override (app/seidb.go:109) and the AppHash non-equivalence of paused migrate_evm (see comment above), nodes that adopt this default will diverge from peers still on memiavl_only. Please confirm the upgrade story: existing nodes with an explicit memiavl_only in app.toml keep that value, but any node without it now silently enters migrate_evm. This needs to be coordinated as a network-wide switch rather than a per-node default.
| // still route below, but no keys are pulled forward and no boundary | ||
| // metadata is rewritten, so the migration holds at its current cursor | ||
| // until the batch size is raised again. | ||
| advanceMigration := firstBatchInBlock && m.migrationBatchSize > 0 |
There was a problem hiding this comment.
[blocker] advanceMigration gates only the boundary advance and metadata writes; caller-write routing below (shouldForwardWriteToNewDB) still runs unconditionally, sending brand-new keys to flatkv even when paused. Since the boundary metadata is never written in the paused state, shouldAppendLatticeHash never latches, so those flatkv writes never enter the AppHash. This is the mechanism behind the AppHash divergence flagged on sc_config.go and matches Codex P1 #1. Consider also pausing the new-key forwarding (route everything to the old DB) while migrationBatchSize == 0, so paused migrate_evm is truly AppHash-identical to memiavl_only.
| Enable: true, | ||
| WriteMode: types.MemiavlOnly, | ||
| WriteMode: types.MigrateEVM, | ||
| MemIAVLConfig: memiavl.DefaultConfig(), | ||
| FlatKVConfig: *config.DefaultConfig(), | ||
| HistoricalProofMaxInFlight: DefaultSCHistoricalProofMaxInFlight, | ||
| HistoricalProofRateLimit: DefaultSCHistoricalProofRateLimit, |
There was a problem hiding this comment.
🔴 Flipping DefaultStateCommitConfig().WriteMode from MemiavlOnly to MigrateEVM breaks the in-repo sei-ibc-go and sei-wasmd simapps: both call seidbconfig.DefaultStateCommitConfig() with no WriteMode override (sei-ibc-go/testing/simapp/test_helpers.go:75, sei-wasmd/app/test_helpers.go:81) and mount icahost/icacontroller store keys that are not in keys.MemIAVLStoreKeys. composite.validateInitialStores only relaxes the canonical-name check for MemiavlOnly/FlatKVOnly, so MigrateEVM hits the strict rejection at LoadLatestVersion and every simapp.Setup test in those packages fails. Fix by either keeping the default at MemiavlOnly (set MigrateEVM only in app/seidb.go's seid-specific config), or extending the relaxation to MigrateEVM (consistent with the PR's "paused migrate_evm == memiavl_only" claim).
Extended reasoning...
What breaks
sei-db/config/sc_config.go:61 now returns WriteMode: types.MigrateEVM from DefaultStateCommitConfig(). Two downstream simapps in this Go module pick this up without override:
sei-ibc-go/testing/simapp/test_helpers.go:75—scConfig := seidbconfig.DefaultStateCommitConfig()then only overridesAsyncCommitBuffer/SnapshotMinTimeInterval/HistoricalProof*.sei-wasmd/app/test_helpers.go:81— same pattern.
Both simapps mount IBC ICA store keys (icahost, icacontroller) via sdk.NewKVStoreKeys (sei-ibc-go/testing/simapp/app.go:241, sei-wasmd/app/app.go:296). keys.MemIAVLStoreKeys (sei-db/common/keys/store_keys.go:39-60) lists only auth, authz, bank, staking, mint, distribution, slashing, gov, params, ibc, upgrade, feegrant, evidence, transfer, capability, oracle, evm, wasm, epoch, tokenfactory — neither IBC ICA key is present.
The exact failure path
-
simapp.Setup(t, ...)→NewSimApp(..., loadLatest=true, ...) -
app.LoadLatestVersion()(sei-ibc-go/testing/simapp/app.go:449-452) -
rootmulti.Store.LoadVersion→LoadVersionAndUpgrade(0, nil)→scStore.Initialize(initialStores)(sei-cosmos/storev2/rootmulti/store.go:553-560), whereinitialStorescontains everyStoreTypeIAVLkey includingicahostandicacontroller. -
composite.Initialize→validateInitialStores(MigrateEVM, [..., icahost, icacontroller, ...])(sei-db/state_db/sc/composite/store.go:198). -
The relaxation at line 207 is gated on
mode == types.MemiavlOnly || mode == types.FlatKVOnly;MigrateEVMfalls through to the strict rejection at line 220-224 and returns:composite.Initialize: store names not routable by router: [icacontroller icahost] (allowed set is keys.MemIAVLStoreKeys)
Step-by-step proof
Reproduce with a single go test invocation against the PR HEAD:
$ go test ./sei-ibc-go/modules/core/02-client/keeper/... -count=1
failed to load latest version: composite.Initialize: store names not routable by router: [icacontroller icahost] (allowed set is keys.MemIAVLStoreKeys)
FAIL github.com/sei-protocol/sei-chain/sei-ibc-go/modules/core/02-client/keeper
The same chain produces the same error for sei-ibc-go/modules/light-clients/07-tendermint/types, 27-interchain-accounts, and every other package whose tests go through simapp.Setup. sei-wasmd/app tests fail identically.
Why the existing relaxation does not catch this
The PR explicitly added TestInitializeAcceptsUnknownStoreNamesInMemiavlOnly and TestInitializeAcceptsUnknownStoreNamesInFlatKVOnly (sei-db/state_db/sc/composite/store_test.go) with the docstring "the regression test for the sei-ibc-go simapp failure: downstream test apps that mount more modules than seid (icahost / icacontroller) must be able to run in MemiavlOnly." Those tests guard exactly the relaxation paths in validateInitialStores — MemiavlOnly and FlatKVOnly. They never exercise MigrateEVM, which the PR simultaneously made the default. The author was aware downstream simapps depend on non-canonical store names but flipped the default to a mode where the relaxation does not apply.
Impact
This is not a "downstream concern." sei-chain has a single root go.mod (only sei-cosmos/ics23 and sei-cosmos/cosmovisor are nested modules); sei-ibc-go and sei-wasmd are picked up by the go list -f ... ./... invocation in Makefile:553 that feeds the make test-group-N CI shards (and .github/workflows/go-test.yml's race-detection job). simapp.Setup is called by >10 test files across sei-ibc-go modules (light-clients/07-tendermint, light-clients/09-localhost, 27-interchain-accounts, core/02-client, core/03-connection, core/04-channel, core/05-port, core/genesis, transfer/keeper, etc.) and the sei-wasmd/app test suite. Every one of them now fails at LoadLatestVersion on this PR.
Fix options
Pick one:
- Keep the default at
MemiavlOnlyand moveMigrateEVMto the seid-specific path.app/seidb.goalready starts fromconfig.DefaultStateCommitConfig()and applies flag overrides; the seid binary can setMigrateEVMthere explicitly. This preserves the simapp behavior the PR's own regression test documents. - Extend the
validateInitialStoresrelaxation toMigrateEVM. Justified by the PR's own claim that a pausedMigrateEVM(gov param at 0) is app-hash-equivalent toMemiavlOnly; the same relaxation should apply to Initialize. Add a third test to coverMigrateEVM. - Have the downstream simapps override
sc-write-mode. AddscConfig.WriteMode = sctypes.MemiavlOnlyafterDefaultStateCommitConfig()in both simapptest_helpers.gofiles. Least invasive but easy to drift on the next config field rename.
| // validateNumKeysToMigratePerBlock only type-checks the value; any uint64 is a | ||
| // valid (consensus-deterministic) rate, with 0 meaning "paused". | ||
| func validateNumKeysToMigratePerBlock(i interface{}) error { | ||
| if _, ok := i.(uint64); !ok { | ||
| return fmt.Errorf("invalid parameter type: %T", i) | ||
| } | ||
| return nil | ||
| } |
There was a problem hiding this comment.
🔴 Bug: validateNumKeysToMigratePerBlock only type-checks the value, so a ParameterChangeProposal can set the migration batch size to any uint64 (including math.MaxUint64). The applyMigrationBatchSize clamp caps it at math.MaxInt64, which then propagates to MemiavlMigrationIterator.NextBatch where make([]ValueToMigrate, 0, size) either panics (makeslice: cap out of range) or OOMs every validator deterministically — chain halt with no recovery path other than coordinated downgrade. Reject values above a sane maximum (e.g. 1_000_000) here; seidroid[bot] flagged this in inline-comment 3483034959 and only the int-cast overflow was addressed.
Extended reasoning...
Bug
validateNumKeysToMigratePerBlock at app/migration/params.go:40-45 accepts any uint64. Any value, including math.MaxUint64, passes a ParameterChangeProposal validation. The new clamp added in this PR at app/abci.go:86-88 caps the value at math.MaxInt64, which addresses the cursor[bot] int-cast overflow concern but creates a new deterministic chain-halt vector.
The chain from gov proposal to consensus halt
- A
ParameterChangeProposalsetsmigration.NumKeysToMigratePerBlockto a very largeuint64(typo, fat finger, or malicious proposal). Validation passes because it is type-checked only. - After the voting period, the param lands in chain state. In the next
BeginBlock,applyMigrationBatchSizereads it (app/abci.go:75-90), clamps tomath.MaxInt64, and callsapp.rootStore.SetMigrationBatchSize(int(math.MaxInt64)). CompositeCommitStore.SetMigrationBatchSizeonly clamps negatives to0— a positiveMaxInt64passes through unchanged intoatomic.Int64storage and is forwarded viacs.router.SetMigrationBatchSize.MigrationManager.SetMigrationBatchSizestores it asm.migrationBatchSize. On the firstApplyChangeSetsof the block,advanceMigrationis true (batch size > 0), so the manager callsm.iterator.NextBatch(m.migrationBatchSize).MemiavlMigrationIterator.NextBatchatsei-db/state_db/sc/migration/memiavl_migration_iterator.go:94executesbatch := make([]ValueToMigrate, 0, size).
ValueToMigrate is {ModuleName string; Key []byte; Value []byte} ≈ 16+24+24 = 64 bytes. make([]ValueToMigrate, 0, math.MaxInt64) asks the Go runtime to allocate MaxInt64 × 64B ≈ 590 EB. The runtime checks cap × elemSize against maxAlloc (well below MaxInt64 on 64-bit) and panics with runtime error: makeslice: cap out of range. The panic propagates out of ApplyChangeSets, fails Commit, and aborts block production.
Step-by-step proof
Concrete scenario with value 18446744073709551615 (MaxUint64):
- Block N (proposal voting): validator nodes accept the proposal.
validateNumKeysToMigratePerBlock(uint64(18446744073709551615))returnsnilbecause the type assertioni.(uint64)succeeds. - Block N+M (param applied): every validator runs
BeginBlock.subspace.GetIfExists(ctx, KeyNumKeysToMigratePerBlock, &numKeys)setsnumKeys = 18446744073709551615. The clampif numKeys > uint64(math.MaxInt64) { numKeys = uint64(math.MaxInt64) }reduces it to9223372036854775807.app.rootStore.SetMigrationBatchSize(9223372036854775807)is called. - Same block, first write:
CompositeCommitStore.SetMigrationBatchSize(9223372036854775807)stores it (the< 0clamp does not trigger). The router push reachesMigrationManager.migrationBatchSize = 9223372036854775807. - Same block, ApplyChangeSets:
advanceMigration = firstBatchInBlock && (9223372036854775807 > 0) = true.NextBatch(9223372036854775807)runs. The first thing it does after thesize <= 0andCompletechecks ismake([]ValueToMigrate, 0, 9223372036854775807). - Runtime:
runtime.makeslicecomputescap * elemSize = 9223372036854775807 * 64, which overflowsuintptrand triggers the cap-bounds check. Panic.
Because the gov param is consensus state, every validator reads the same value and panics identically on the same height. The chain halts. No recovery path exists short of a coordinated downgrade or a new gov proposal — and the chain cannot produce blocks to vote on either.
Even non-overflowing values are dangerous: 10^10 requests ~640 GB of contiguous slice memory, OOM-killing every validator the moment the param hits chain state.
Why existing code does not prevent it
The defenses in place address adjacent concerns:
validateNumKeysToMigratePerBlockdoes type-checking only (explicitly: "any uint64 is a valid (consensus-deterministic) rate").- The
app/abci.goclamp prevents theint(uint64)cast from yielding a negative number on 64-bit (the cursor[bot] concern), but leaves the magnitude unbounded. CompositeCommitStore.SetMigrationBatchSizenormalizes negatives to 0, but the comment explicitly says "the lower layers therefore trust the batch size to be non-negative and do no validation of their own".- The relaxed
migrationBatchSize <= 0rejection inNewMigrationManager(now accepts 0) does not add an upper bound.
Reviewer history
seidroid[bot] in inline-comment 3483034959 explicitly suggested "bounding the param to a sane maximum (well under math.MaxInt) at validation time". yzang2019 replied in inline-comment 3483128268: "Tried that, looks like it would require a big refactory, will actually do a fallback here to 0 if its negative." The negative-clamp landed; the upper-bound suggestion did not.
Pre-existing vs PR-introduced
PR-introduced. The pre-PR migration rate was a node-local app.toml field (sc-keys-to-migrate-per-block) bounded by validator operators reading their own config. This PR removes that field and makes the rate a chain-wide governance param, but does not add the upper bound that governance-controlled values warrant.
Fix
One-line guard in validateNumKeysToMigratePerBlock:
const MaxNumKeysToMigratePerBlock = 1_000_000 // or another sane upper bound
func validateNumKeysToMigratePerBlock(i interface{}) error {
v, ok := i.(uint64)
if !ok {
return fmt.Errorf("invalid parameter type: %T", i)
}
if v > MaxNumKeysToMigratePerBlock {
return fmt.Errorf("NumKeysToMigratePerBlock must be <= %d, got %d", MaxNumKeysToMigratePerBlock, v)
}
return nil
}This rejects the proposal at submission, before it can reach chain state.
Severity
normal. The trigger requires a successful gov proposal (quorum + majority yes), which is a real barrier — but governance error or a malicious proposal would deterministically halt the chain, and recovery would be impossible without a coordinated downgrade. The fix is trivial and was explicitly recommended by another reviewer.
🔬 also observed by 3483034959
| oldDBPairsByStore := make(map[string]map[string]*proto.KVPair) | ||
| newDBPairsByStore := make(map[string]map[string]*proto.KVPair) | ||
|
|
||
| // advanceMigration gates the once-per-block boundary advance. It is | ||
| // suppressed when the batch size is 0 (migration paused): caller writes | ||
| // still route below, but no keys are pulled forward and no boundary | ||
| // metadata is rewritten, so the migration holds at its current cursor | ||
| // until the batch size is raised again. | ||
| advanceMigration := firstBatchInBlock && m.migrationBatchSize > 0 | ||
|
|
||
| batchStats := migrationBatchStats{} | ||
| if firstBatchInBlock { | ||
| if advanceMigration { | ||
| // Get the next batch of keys to migrate. | ||
| valuesToMigrate, newBoundary, err := m.iterator.NextBatch(m.migrationBatchSize) | ||
| if err != nil { |
There was a problem hiding this comment.
🔴 Consensus risk in the new default migrate_evm mode: With DefaultStateCommitConfig.WriteMode = MigrateEVM (sei-db/config/sc_config.go:61) and DefaultNumKeysToMigratePerBlock = 0 (app/migration/params.go:31), every fresh chain — and every upgraded chain during the window before a gov proposal raises the param — runs in paused migrate_evm. In that state, EVM caller writes are routed to flatkv via the migration manager, but flatkv's lattice is excluded from both the AppHash and from state-sync snapshots, because the gates that admit it key off migration metadata that batch=0 never writes. The chain therefore accumulates unauthenticated EVM state; any node that joins via a state-sync snapshot taken in this window restores empty flatkv and diverges on the next block when reads fall through to flatkv on the executing chain but return not-found on the joiner. The PR's own documentation comment in sc_config.go ("migrate_evm is paused and produces the same app hash as memiavl_only") is testably false: memiavl_only would route new EVM keys to memiavl and contribute them to the AppHash; paused migrate_evm routes them to flatkv and does not. Suggested fix: when migrationBatchSize == 0 and boundary is NotStarted, route caller writes through to the old DB (mirror MemiavlOnly so presence==participation); or decouple the lattice/export gates from migration metadata so they follow actual flatkv content.
Extended reasoning...
What the bug is
In paused migrate_evm (the new default with NumKeysToMigratePerBlock=0), new EVM keys created by InitGenesis or by EVM transactions are written to flatkv, but flatkv's lattice is excluded from the AppHash and from state-sync snapshots. The chain runs with consensus-unauthenticated EVM state, and a node that joins via state-sync during the paused window forks on the next block.
How it manifests — step-by-step proof
Consider a fresh chain running the new binary, with the gov param at its default 0, in block N with EVM transaction tx that creates a brand-new contract account at address A:
-
applyMigrationBatchSizeruns inBeginBlock(app/abci.go:applyMigrationBatchSize). The subspace lazy-seeds the param toDefaultNumKeysToMigratePerBlock = 0(app/migration/params.go:31) and callsrootStore.SetMigrationBatchSize(0). This propagates toCompositeCommitStore.migrationBatchSize.Store(0). -
txwrites the new account. EVM keeper callsApplyChangeSetson the migration manager. -
advanceMigrationis suppressed (migration_manager.go:296):advanceMigration = firstBatchInBlock && m.migrationBatchSize > 0=true && false=false. The iterator never advances; neitherMigrationBoundaryKeynorMigrationVersionKeyis written to flatkv's MigrationStore. -
The caller-write routing loop runs unconditionally (migration_manager.go:323-337). For the new account key,
shouldForwardWriteToNewDB:m.boundary.IsMigrated(...)=false(boundary isMigrationBoundaryNotStarted, migration_boundary.go:108).m.oldDBReader(...)checks memiavl — the new key isn't there, sofoundInOld = false.- Returns
!foundInOld=true→ route to flatkv viam.newDBWriter.
-
Lattice gate stays closed. At commit,
shouldAppendLatticeHash(composite/store.go:851) seescurrentWriteMode == MigrateEVMand callsmigrationStarted(cs.flatKV)(line 870). InsidemigrationStarted(line 893), bothMigrationBoundaryKeyandMigrationVersionKeyare absent from flatkv (step 3 never wrote them), so it returnsfalse.shouldAppendLatticeHashreturnsfalse→ the syntheticevm_latticeStoreInfois not appended toWorkingCommitInfoorLastCommitInfo. The flatkv-resident new account contributes nothing to the AppHash. -
State-sync exports omit flatkv.
Exporter(composite/store.go:1162) seesexportNeedsMetadataGating(MigrateEVM) == true(line 1151), opens a read-only flatkv clone at the snapshot version, callsmigrationStarted(ro)which again returnsfalse(step 3), and setsincludeFlatKV = falseat line 1219. The state-sync snapshot stream omits the entire flatkv section. -
Joiner diverges. A node restoring from that snapshot at height N gets empty flatkv. At block N+1 some transaction reads address
A. On the executing chainMigrationManager.Read(line 232-249) misses memiavl and falls through to flatkv, returning the real account. On the joiner the same fallback returns not-found. Different EVM execution → different post-block state root → fork.
Why existing safeguards don't prevent this
- The pre-PR
cfg.KeysToMigratePerBlock > 0validation insc_config.gowas removed. - The pre-PR
migrationBatchSize > 0validations inbuildMigrateEVMRouter/buildMigrateAllButBankRouter/buildMigrateBankRouterwere removed (router_builder.go:158, 282, 411 in the diff). NewMigrationManagernow explicitly accepts0(migration_manager.go:101 comment; the newTestNewMigrationManager_AcceptsZeroBatchSizecodifies this).- The lattice and export gates were designed under the prior invariant that
batch > 0whenever migration mode is active. They evaluate migration metadata, not whether flatkv has accumulated real writes — so a paused migration with caller writes flowing through is exactly the case they don't recognize. - The lattice latching is sticky: once gov raises the param and the first migration batch runs,
MigrationVersionKey/MigrationBoundaryKeyget written and the gate flips totrueon all validators at the same block — so the live chain stays consistent. But every snapshot taken during the paused window is silently incomplete, and the joiner-fork window is real.
Existing tests do not catch this: every migrate_evm test in the diff sets SetMigrationBatchSize to a positive value immediately after construction (1, 2, 100, 1000). TestComposite_SetMigrationBatchSize_ClampsNegative only verifies the clamp; no test exercises migrate_evm with batch=0 AND brand-new EVM caller writes.
Impact scenarios
- Fresh chain: every new chain spun up with the new binary starts in
migrate_evm+ paused. Any EVM state written by InitGenesis or by transactions before a gov proposal raises the param lands in flatkv, is excluded from the AppHash, and is excluded from snapshots. Any state-sync joiner during that period forks. - Upgrade: a chain currently in
memiavl_onlythat flipssc-write-modetomigrate_evmin app.toml starts in pausedmigrate_evm. Until aParameterChangeProposalraises the param (which takes a full voting period — measured in days), all newly-created EVM accounts/contracts route to flatkv viashouldForwardWriteToNewDB(the keys do not exist in memiavl yet), are excluded from AppHash, and are excluded from any snapshot taken in that window.
The PR's package doc claim in sc_config.go that "migrate_evm is paused and produces the same app hash as memiavl_only" is the cleanest statement of the invariant the bug breaks: memiavl_only would put new EVM keys in memiavl (contributing to the AppHash); paused migrate_evm puts them in flatkv (not contributing).
Suggested fix
Either (a) make routing follow the same rule as the lattice/export gates — when migrationBatchSize == 0 AND boundary is NotStarted, route caller writes to the old DB (mirror MemiavlOnly) so presence == participation and flatkv stays empty; or (b) decouple the gates from migration metadata so they admit any flatkv content; or (c) reject startup in migrate_evm when the gov param is 0 and the chain has unmigrated EVM state, requiring a one-time genesis bootstrap that writes the boundary key so the gate opens immediately.
| // Package migration defines the module-agnostic governance parameters | ||
| // that control the state-commitment store's background data migration | ||
| // (currently memiavl->flatkv). | ||
| // | ||
| // These live outside any business module on purpose: the migration rate | ||
| // applies to whichever stores the SC router is migrating, so it is an | ||
| // app/storage-level concern rather than EVM-specific. The value is held in a | ||
| // dedicated x/params subspace and is editable via the standard | ||
| // ParameterChangeProposal gov flow. The app reads it once per block in | ||
| // BeginBlock and pushes it into the SC commit store. | ||
| package migration |
There was a problem hiding this comment.
🟡 Pre-existing nit (flagged by seidroid[bot] as a [suggestion]): The new migration x/params subspace registered at app/app.go:2948 has no owning AppModule, so no module's ExportGenesis ever persists NumKeysToMigratePerBlock. The params module's ExportGenesis only emits FeesParams and CosmosGasParams (it does not iterate registered subspaces), so running seid export mid-migration produces a genesis JSON with no migration.NumKeysToMigratePerBlock entry; bootstrapping a new chain from that genesis (testnet replication, fork migration, recovery) silently resets the rate to the default 0 (paused) on the first BeginBlock and the drain halts until governance re-files a proposal. Consider adding a minimal AppModule with InitGenesis/ExportGenesis for the migration subspace, or documenting the caveat in app/migration/params.go.
Extended reasoning...
What the bug is
The PR adds a generic migration x/params subspace (app/migration/params.go:1-45) and registers it at app/app.go:2948 via paramsKeeper.Subspace(migration.SubspaceName).WithKeyTable(migration.ParamKeyTable()). Every other subspace registered alongside it (auth, bank, staking, mint, distr, slashing, gov, ibc, oracle, wasm, evm, epoch, tokenfactory) belongs to a business module whose own AppModule.ExportGenesis round-trips the subspace's stored values. The new migration subspace has no owning AppModule in the ModuleManager — verified by reading the module list at app/app.go:877 — and the params module's own ExportGenesis (sei-cosmos/x/params/keeper/genesis.go) only writes FeesParams and CosmosGasParams to GenesisState:
func (k Keeper) ExportGenesis(ctx sdk.Context) *types.GenesisState {
feesParams := k.GetFeesParams(ctx)
cosmosGasParams := k.GetCosmosGasParams(ctx)
return types.NewGenesisState(feesParams, cosmosGasParams)
}It does not iterate over registered subspaces and emit their stored KV pairs. So nothing in the export pipeline ever serializes migration.NumKeysToMigratePerBlock.
How the failure manifests — step by step
- Governance raises
migration.NumKeysToMigratePerBlockto, say, 5000 via aParameterChangeProposal(exactly the path the newgov_proposal_test.yamltest exercises). The chain begins draining memiavl→flatkv at 5000 keys/block. - While the migration is still in flight (boundary not yet at
MigrationBoundaryComplete), an operator runsseid exportto capture the current chain state — for a testnet replication, a coordinated fork, or recovery from a corrupted node. - The exported genesis JSON contains every business module's params but no
migration.NumKeysToMigratePerBlockentry: the params module'sExportGenesisdropped it, and no other module owns the subspace. - A new chain is bootstrapped from this exported genesis. The migration subspace is registered in
initParamsKeeperat app/app.go:2948, but its key-value store is empty: nothing wroteKeyNumKeysToMigratePerBlockduringInitGenesis. - At block 1
BeginBlock,applyMigrationBatchSize(app/abci.go:75-91) runs the lazy-seed branch:This persistsif !subspace.Has(ctx, migration.KeyNumKeysToMigratePerBlock) { subspace.Set(ctx, migration.KeyNumKeysToMigratePerBlock, migration.DefaultNumKeysToMigratePerBlock) }
DefaultNumKeysToMigratePerBlock = 0(app/migration/params.go:28), which means paused. - The SC store receives
SetMigrationBatchSize(0), theMigrationManager'sadvanceMigrationgate (sei-db/state_db/sc/migration/migration_manager.go:293-299,firstBatchInBlock && m.migrationBatchSize > 0) suppresses every boundary advance, and the drain silently halts. Block production continues normally (consensus is unaffected; caller writes still route through the migration manager), so there is no panic, no log warning, no metric anomaly — the operator has no signal that anything is wrong unless they actively pollmigration.NumKeysToMigratePerBlock.
Why existing code doesn't prevent it
- The lazy-seed in
applyMigrationBatchSizeis deterministic (every node runsBeginBlockidentically) and idempotent, so it correctly handles a fresh chain. But it cannot distinguish a fresh chain from an exported-during-migration restore: in both cases, the subspace key is absent, so it writes the same default. The exported chain's mid-migration rate is the load-bearing context the lazy seed has no way to recover. - The params module's
ExportGenesisis shared across all chains using sei-cosmos and only serializesFeesParams/CosmosGasParams; modifying it to dump arbitrary subspaces would be a broader change than this PR. - The new
abci_test.gotests coverapplyMigrationBatchSizeand the gov-proposal path, but none exercises an export/import round-trip.
Impact
This is not consensus-fatal: the resumed chain produces blocks correctly, just with the migration paused at the genesis boundary. Recovery is a single governance proposal to re-raise the param. seidroid[bot] itself labeled it [suggestion], not [blocker], in inline-comment 3482817040. The trigger surface is narrow (export during an active migration AND restart from genesis, not a regular restart which preserves the KVStore), and the migration is a one-time operational event that already requires coordinated governance action. But it turns seid export/import into a silent loss of migration progress, which is exactly the kind of operational pitfall an export-based fork or recovery would hit unexpectedly.
How to fix it
Three reasonable options, in increasing invasiveness:
- Document the caveat prominently in app/migration/params.go's package doc and in any export-tooling runbook: "the
migrationsubspace is not exported byseid export; if the chain is mid-migration, re-issue the param-change proposal on the new chain." - Log a startup warning in
applyMigrationBatchSize(or in the SC store init) whensc-write-modeis a migration mode (migrate_evm,migrate_bank,migrate_all_but_bank) and the gov param is at the default 0, so operators have a chance to notice immediately rather than discovering the silent halt hours later via dashboard. - Add a minimal AppModule for the
migrationsubspace withInitGenesis/ExportGenesis(analogous to the other params-only modules), so the value round-trips throughseid export/import deterministically.
🔬 also observed by seidroid

Summary
The state-commitment (SC) store's in-flight
memiavl → flatkvmigration was previously paced by a node-local config (sc-keys-to-migrate-per-block). Because migration writes feed the AppHash, a per-node rate is consensus-relevant and a divergence risk: two validators draining at different rates fork the chain.This PR makes the per-block migration rate a module-agnostic governance parameter (
migration.NumKeysToMigratePerBlock) that every validator reads from chain state eachBeginBlockand applies identically. The gov param also serves as the migration trigger: it defaults to0(paused), and raising it via a param-change proposal starts the drain fleet-wide at a deterministic height.The old node-local config and its fallback are removed entirely — the gov param is now the sole source of the rate.
Key changes
New generic governance parameter
app/migration/params.go— new module-agnosticmigrationparams subspace definingNumKeysToMigratePerBlock(default0), itsKeyTable, and validation. Deliberately not EVM-specific so future module migrations can reuse it.app/app.go— registers themigrationsubspace; stores the*rootmulti.Storeon the app and fails fast if the (unsupported) legacy commit multistore is in use.app/abci.go—BeginBlockreadsNumKeysToMigratePerBlockfrom chain state and pushes it into the SC store before the block's first write (applyMigrationBatchSize).Plumbing: push the rate down to the migration router
sei-db/state_db/sc/types/types.go—Committerinterface gainsSetMigrationBatchSize(int) error.sei-cosmos/storev2/rootmulti/store.go— forwardsSetMigrationBatchSizeto the SC store; addsGetMigrationBatchSizefor observability/tests.sei-db/state_db/sc/composite/store.go— holds the gov-set batch size (atomic.Int64);buildRouterandSetMigrationBatchSizeuse it directly. Removed theeffectiveMigrationBatchSizeconfig fallback —0means paused, full stop.sei-db/state_db/sc/migration/*—Routerinterface gainsSetMigrationBatchSize; all router types implement it. Only the migration manager acts on it; every other router (module/passthrough/dual-write/thread-safe) treats it as a no-op.router_builder.gonow allows a0batch size (paused) instead of rejecting it.Removed the node-local config + fallback
sei-db/config/sc_config.go,sei-db/config/toml.go,docker/localnode/config/app.toml,app/seidb.go— dropped thesc-keys-to-migrate-per-blockfield, default, validation, TOML entry, and flag parsing.Tests
app/migration/params_test.go— unit tests for the new param (default, key-table registration, validation).app/abci_test.go— subspace registration,applyMigrationBatchSize(defaults/set/clamp), and cross-block "param set in block N takes effect in N+1" coverage.SetMigrationBatchSize(...)after construction instead of the removed config field. The random-migration framework re-applies the rate on every store (re)open, faithfully mirroring how production re-pushes the gov param after each restart.Integration test
integration_test/gov_module/— newParameterChangeProposaltest that raisesNumKeysToMigratePerBlockand asserts it takes effect.integration_test/contracts/verify_flatkv_evm_migrate.sh— rewritten to drive the migration via a governance param-change proposal (submit → deposit → quorum vote → poll forPASSED→ verify on-chain) instead of injecting the removed config.Docs
AGENTS.md— Code style now mandates running bothgofmtandgoimportson every touched.gofile.