Skip to content

LittDB: Keymap threading improvements#3645

Open
cody-littley wants to merge 9 commits into
mainfrom
cjl/litt-fast-flush
Open

LittDB: Keymap threading improvements#3645
cody-littley wants to merge 9 commits into
mainfrom
cjl/litt-fast-flush

Conversation

@cody-littley

Copy link
Copy Markdown
Contributor

Describe your changes and provide context

Several changes to the threading of the keymap, with potentially large impacts for some workloads.

  • Writing to the keymap is now asyncronous and off the main control loop thread
    • this makes it possible for a crash to leave data in a segment but not in the keymap, we now detect and fix this at startup time
    • flush should have lower latency, especially for workloads with many keys being flushed
    • flushing LittDB should now be roughly equivalent to flushing a WAL, since that's basically all a littDB segment is
  • keymap GC now happens off the main control loop thread

@cursor

cursor Bot commented Jun 25, 2026

Copy link
Copy Markdown

PR Summary

High Risk
Changes core LittDB persistence, GC ordering, and crash recovery in the storage engine; incorrect watermark or async keymap ordering could cause data loss or resurrecting deleted keys.

Overview
Keymap work is now asynchronous. A dedicated keymap manager applies puts after segment data is durable (scheduled from the flush loop) and deletes from GC in batched, prioritized work; clean shutdown drains it before the final seal write. Flush/seal no longer block on synchronous keymap writes, so crash can leave segment keys ahead of the keymap—repairKeymap rescues missing tails in one atomic batch, and tests cover clean shutdown consistency.

Garbage collection is split across goroutines. A GC manager TTL-scans sealed segments (with optional filter), fsyncs a new gc-watermark file before scheduling keymap deletes, and enqueues deletes; the control loop only reclaims segment files after the manager’s deletion watermark covers them. RunGC orchestrates collection → keymap sync → file deletion. TTL lives on the GC manager; SetTTL forwards there.

Correctness and read paths floor iteration, oldest/newest keys, reload/repair at readableFloor from the watermark so logically deleted segments (files still on disk) are not read or repaired back. Startup purgeCollectedKeymapEntries removes stale keymap entries for collected-but-unreclaimed segments; Exists verifies the backing segment is still live (like Get). Open iterators no longer suspend GC; they reserve snapshot segments and release on Close.

API cleanup: CacheAwareGet is removed from the table interface and disk table; the cached wrapper inlines cache logic in Get. New config knobs for keymap manager channels/batching; default GCPeriod is 10s. Docs describe the gc-watermark layout; Destroy removes it.

Reviewed by Cursor Bugbot for commit 8e25454. Bugbot is set up for automated code reviews on this repo. Configure here.

@github-actions

github-actions Bot commented Jun 25, 2026

Copy link
Copy Markdown

The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).

BuildFormatLintBreakingUpdated (UTC)
✅ passed✅ passed✅ passed✅ passedJun 29, 2026, 7:39 PM

Comment thread sei-db/db_engine/litt/disktable/control_loop.go Outdated
if err != nil {
c.logger.Error("failed to drain keymap writer", "error", err)
return
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Close hangs after drain failure

High Severity

If keymapManager.drain() fails during shutdown, handleShutdownRequest returns after logging without sending on req.shutdownCompleteChan. Close() still waits on that channel, so shutdown can block forever even though the table never finished stopping.

Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 7b4c308. Configure here.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another false positive. Close() doesn't wait on that channel alone, it waits via util.Await, which selects on the channel or the error monitor's cancellation.

@codecov

codecov Bot commented Jun 25, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 58.14%. Comparing base (c528303) to head (8e25454).
⚠️ Report is 14 commits behind head on main.

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #3645      +/-   ##
==========================================
- Coverage   59.12%   58.14%   -0.99%     
==========================================
  Files        2259     2176      -83     
  Lines      186489   176733    -9756     
==========================================
- Hits       110255   102755    -7500     
+ Misses      66353    64900    -1453     
+ Partials     9881     9078     -803     
Flag Coverage Δ
sei-db 70.41% <ø> (ø)
sei-db-state-db ?

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
sei-db/db_engine/litt/dbcache/cached_table.go 57.50% <ø> (-5.27%) ⬇️
sei-db/db_engine/litt/disktable/control_loop.go 68.93% <ø> (-0.42%) ⬇️
sei-db/db_engine/litt/disktable/disk_table.go 64.22% <ø> (-0.29%) ⬇️
sei-db/db_engine/litt/disktable/flush_loop.go 44.89% <ø> (-0.39%) ⬇️
...ei-db/db_engine/litt/disktable/forward_iterator.go 69.23% <ø> (ø)
...ei-db/db_engine/litt/disktable/reverse_iterator.go 71.73% <ø> (ø)
sei-db/db_engine/litt/littdb_config.go 42.10% <ø> (-14.76%) ⬇️
sei-db/db_engine/litt/table.go 100.00% <ø> (ø)

... and 83 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

timer := time.NewTimer(m.maxFlushInterval)
defer timer.Stop()

for len(*puts)+len(*deletes) < m.maxBatchSize {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The loop checks the batch size at the top, then appends an entire request's key slice. A single scheduleDelete for a segment with 100,000 keys would blow past the 1024 maxBatchSize in one append. This doesn't affect correctness — deleteBatch chunks deletes into deleteBatchSize (10,000) sub-batches — but the put side has no chunking, so a single writeBatch with 100,000+ keys hits the keymap in one giant Put call, which could spike latency.

Consider Cap writeBatch size the same way deleteBatch is capped, deleteBatch chunks into deleteBatchSize sub-batches, but writeBatch doesn't. For consistency and to bound peak latency.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Change made (required a refactor)

Comment thread sei-db/db_engine/litt/disktable/disk_table.go

@seidroid seidroid Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A well-engineered, thoroughly-tested rework of LittDB keymap threading (async keymap manager, two-phase GC gated by a deletion watermark, crash-time keymap repair). No blocking correctness issues found; a couple of low-severity edge-case observations remain.

Findings: 0 blocking | 4 non-blocking | 2 posted inline

Blockers

  • None at the file/PR level.

Non-blocking

  • Cursor second-opinion pass produced no output (cursor-review.md is empty), so only the Codex pass and this review contributed external findings.
  • Startup now runs repairKeymap on every non-reload open. On a clean shutdown this is cheap (the first key checked is present, so it stops immediately), but it's worth confirming via the existing TestCleanShutdownLeavesKeymapConsistent that the zero-rescue fast path holds for large tables in CI, since the walk does a keymap.Get per missing key.
  • 2 suggestion(s)/nit(s) flagged inline on specific lines.

// could deadlock the seal path). If the channel is full the update is dropped: the watermark is monotonic and
// the control loop drains it each GC pass, so a later publish delivers an equal-or-newer value. A dropped
// update only delays file deletion, which is always safe.
func (m *keymapManager) publishDeletionWatermark(watermark int64) {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[suggestion] Codex flagged that a dropped watermark publish can strand GC'd segment files. The drop here is fire-and-forget (correctly, to avoid the seal-path deadlock), and the watermark is monotonic, but the control loop advances deletionScheduledIndex when a delete is scheduled and never reschedules those segments. So if the channel is full when the highest watermark is published and no later segment ever expires (e.g. writes stop on an idle table), keymapDeletionWatermark stays below the true highest and phase-2 leaves those segments' files on disk — they remain counted/readable, then get reclaimed on the next delete that advances the watermark, or on restart.

This is a disk-reclamation delay in an extreme edge case (the channel defaults to 1024 and is drained every GC pass), not data loss or resurrection of deleted keys, so I'd downgrade Codex's "permanently strand" framing to a low-severity note. Still, given the only cost of a fuller-fidelity signal is a small struct, consider publishing the watermark via a single-slot "latest value" (overwrite-on-full) rather than a bounded queue that can drop the newest value, which would close the gap entirely.

// caller without waiting for the keymap write. The segment data is already crash durable; a crash that
// loses the scheduled write is repaired from the segment key files on the next startup. A full manager
// channel backpressures here, which propagates back to Flush().
err = f.keymapManager.scheduleWrite(durableKeys)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nit] Behavioral change worth a note (per Codex): Flush() now returns success once the keymap put is scheduled, so a subsequent keymap.Put failure (e.g. a duplicate-key error under DoubleWriteProtection) surfaces asynchronously as an error-monitor panic on the manager goroutine rather than to the Flush() caller that produced it. The DB still enters the panicked state and later calls observe the error, so this isn't silent data loss — just a shift in which call observes the failure. Acceptable given the async design; flagging so the changed error-surfacing contract is intentional.

Kbhat1 added a commit that referenced this pull request Jun 29, 2026
With litt's async keymap flush (#3645), flushing ~once per block is cheap, so:
- littFlushInterval 100ms -> 5ms (~one block at Giga throughput), bounding
  crash loss to roughly a single block (near-WAL durability)
- ShardingFactor 16 -> 1 (flushing a single segment file is cheaper; sharding
  mainly helps across multiple disks)

DO NOT MERGE until #3645 (litt async keymap) lands: without it, flushing this
often regresses write throughput ~48%.

Validated via cryptosim (main littidx + #3645 + parallel getLogs):
- reads-off write ceiling 167k (vs 168k prior)
- reads-on ~153-157k writes / 58k point-reads/s / 520 getLogs/s @ 7ms
- 3h GC soak flat (old -27% control-loop step-down gone), 0 panics

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Comment thread sei-db/db_engine/litt/disktable/disk_table.go

@seidroid seidroid Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Solid, well-tested rework of LittDB keymap threading: async keymap manager, off-control-loop GC, reservation-based iterators, and a durable gc-watermark that correctly prevents post-crash resurrection of garbage-collected keys. No blocking correctness issues found; the two Codex findings are real but low-severity disk-reclamation lags that self-heal on restart.

Findings: 0 blocking | 5 non-blocking | 2 posted inline

Blockers

  • None at the file/PR level.

Non-blocking

  • Watermark-drop reclamation lag (Codex #1): publishDeletionWatermark drops updates when deletionWatermarkChan (cap 1024) is full. Because a full FIFO channel retains the older, smaller values and drops the newest (largest) watermark, if a collection burst publishes >1024 segment watermarks between two control-loop drains, the control loop reads a stale max and leaves the higher-indexed collected segments' files on disk until a later publish or restart. Unlikely in practice (deletes are fsync-bounded, control loop drains every GCPeriod), and it self-heals at startup via watermark seeding. Consider a latest-value (cap-1, drain-then-send / overwrite) channel like setTTL uses, so the freshest watermark is never lost.
  • Second-opinion passes: cursor-review.md is empty (Cursor produced no output) and REVIEW_GUIDELINES.md is empty/missing on the base branch, so no repo-specific standards were applied beyond AGENTS.md.
  • Positive: the crash-consistency regression test (TestGCWatermarkPreventsResurrectionAfterCrash) and the single-atomic-batch repair test (TestRepairKeymapSingleBatch) directly exercise the trickiest invariants; the new async paths are well covered.
  • 2 suggestion(s)/nit(s) flagged inline on specific lines.


// Schedule the segment's keymap entries for deletion. The manager applies it asynchronously and then
// advances the deletion watermark, which lets the control loop later delete the segment's files.
if err := g.keymapManager.scheduleDelete(g.gcCursorKeys, int64(index)); err != nil {

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[suggestion] Empty GC'd segment never advances the control loop's deletion watermark (Codex #2). For a sealed segment with no keys, scheduleDelete returns early (len(keys)==0), so no delete is enqueued and applyDeleteSubBatch never calls publishDeletionWatermark for this segment — yet advanceWatermark and deletionScheduledIndex have already moved past it. The control loop's keymapDeletionWatermark therefore never covers the empty segment, so deleteCollectedSegments won't reclaim its files until a later non-empty collected segment publishes a higher watermark, or until restart (where seeding recovers it). An empty sealed segment is rare but reachable (e.g. sealing an empty mutable segment). Consider publishing the deletion watermark for the segment even when its key set is empty.

is logically gone. This lets LittDB distinguish, after a crash, between a key that is missing from the keymap
because it was garbage collected (everything below the watermark) and a key that is missing because its
asynchronous keymap write was lost (a recently written key, which keymap repair must restore). Reads also
consult the watermark and treat any key in a below-watermark segment as absent.

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nit] This says "Reads also consult the watermark and treat any key in a below-watermark segment as absent," but the runtime read path (DiskTable.Get) does not consult the gc-watermark — gcManager.advanceWatermark explicitly notes there is no in-memory read barrier and the durable value is consulted only at boot (repair/reload). At runtime, a key in a below-watermark segment stays readable until its keymap entry is actually deleted. Functionally benign (the key is expiring anyway), but the doc claim is inaccurate; recommend rewording.

Comment on lines +221 to +234
// Load the durable GC watermark (lowest readable segment). It lives at the table root so it survives a
// keymap rebuild. Reconcile against the lowest segment actually present on disk: a value at or below it
// means those segments were already physically deleted (clamp up to the lowest present); a value above it
// means GC durably deleted some segments' keymap entries but crashed before their files were removed, so
// segments [lowestSegmentIndex, lowestReadableSegment) are still on disk but logically deleted (the control
// loop will reclaim their files, and repair/reload will not resurrect them).
gcWatermarkFile, err := LoadGCWatermarkFile(qualifiedRoots[0])
if err != nil {
return nil, fmt.Errorf("failed to load gc-watermark file: %w", err)
}
lowestReadableSegment := lowestSegmentIndex
if gcWatermarkFile.IsDefined() && gcWatermarkFile.LowestReadableSegment() > lowestReadableSegment {
lowestReadableSegment = gcWatermarkFile.LowestReadableSegment()
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 The gc-watermark file is loaded unconditionally from qualifiedRoots[0] (line 227), but the keymap directory is not guaranteed to live in config.Paths[0]FindKeymapLocation scans all roots, and the rebase CLI / config.Paths reordering between sessions can put it in any root. When the watermark and keymap diverge, a subsequent session sees IsDefined()==false, treats lowestReadableSegment as 0, and if a keymap loss then triggers reloadKeymap/repairKeymap, segments whose keys were durably GC-deleted will be scanned and their keys resurrected — exactly the failure mode this watermark was added to prevent. Additionally, cli/rebase.go:transferKeymap does not migrate the gc-watermark, so a post-rebase table where any GC has run will fail its later os.Remove(sourceTableDir) with "directory not empty". Fix: load via filepath.Dir(keymapPath) (the keymap's parent is already passed into NewDiskTable) or scan all qualified roots the way Drop does, and teach the rebase CLI to migrate the file.

Extended reasoning...

The bug

NewDiskTable (sei-db/db_engine/litt/disktable/disk_table.go:227) loads the gc-watermark unconditionally from the first root:

gcWatermarkFile, err := LoadGCWatermarkFile(qualifiedRoots[0])

But the design comment for this file (gc_watermark_file.go:22-23) and the filesystem-layout doc (filesystem_layout.md:87-93) both say the watermark lives at the table root that holds the keymap directory — and that root is not pinned to Paths[0]. littbuilder/build_utils.go:FindKeymapLocation (lines 32-78) explicitly scans every configured root looking for the keymap directory and returns whichever one contains it. Only new keymaps go to Paths[0] by convention (line 120).

The asymmetry with Drop() (lines 711-727) is the giveaway: Drop already iterates every root looking for and deleting this file, while the load path checks only one. If the design guaranteed Paths[0], Drop would not need the loop.

How a real session triggers it

The watermark can end up away from Paths[0] in several realistic ways:

  1. config.Paths reordering between sessions. filesystem_layout.md explicitly states "It doesn't matter which root directory contains the keymap directory." A user with Paths=[/a, /b] who reorders to Paths=[/b, /a] next session has nothing in LittDB telling them this is load-bearing — FindKeymapLocation will find the keymap regardless, but the watermark check looks only at Paths[0].
  2. Post-rebase. cli/rebase.go:transferKeymap (lines 385-424) moves the keymap to a destination root chosen by determineDestination (lines 524-532), which hashes source paths. Worse: rebase does NOT migrate the gc-watermark at all — grep -n 'GCWatermark\|gc-watermark' cli/rebase.go returns nothing. So the file stays in the old location while the keymap moves. The complementary failure: transferDataInTable ends with os.Remove(sourceTableDir) (rebase.go:304), which fails with directory not empty whenever any GC has run, because the gc-watermark file was never moved out.

Step-by-step proof of resurrection

Config: Paths=[/a, /b], single table T, TTL configured.

Session 1.

  1. Keymap is created under /a/T/keymap (the Paths[0] convention at first creation).
  2. Data is written and flushed. Segments 0..4 exist; some are sealed and TTL-expired.
  3. GC runs: it calls gcManager.advanceWatermark(5) (gc_manager.go), which Updates the watermark file at /a/T/gc-watermark to 5, then schedules keymap deletes for segments 0..4.
  4. Manager applies the deletes — keys for segments 0..4 are durably removed from the keymap at /a/T/keymap.
  5. Before the control loop deletes the segment files (or with files already deleted — either branch hits the same bug), the user shuts down cleanly.

Between sessions. The operator reorders to Paths=[/b, /a] (or runs the rebase CLI to move the keymap to /b).

Session 2.

  1. NewDiskTable runs: qualifiedRoots[0] = /b/T. LoadGCWatermarkFile(/b/T) returns an undefined file — there is no gc-watermark at /b/T. The real one is still at /a/T/gc-watermark.
  2. lowestReadableSegment = lowestSegmentIndex (clamped from the undefined watermark) = the lowest segment whose files are still on disk. If session 1 had not yet reclaimed segment files, that's 0.
  3. Suppose at some point afterward the keymap is lost (corruption, manual wipe of the keymap directory, migrating to a new keymap type — the exact case reloadKeymap exists for). On the next startup reloadKeymap runs with lowestReadableSegment = 0 and scans segments 0..highest, calling keymap.Put for every key — including the keys for segments 0..4 that GC durably deleted in session 1. Those keys are now resurrected.

The reloadKeymap doc comment (disk_table.go:454-461) explicitly acknowledges that an intact gc-watermark is what prevents this: "if BOTH the keymap and the gc-watermark file are lost, this rebuild has no record of in-flight GC and may resurrect keys." The bug is that even when the watermark file is fully intact on disk at /a/T/gc-watermark, the load path is looking in the wrong root and treating it as lost.

Why existing code does not save it

  • reloadKeymap/repairKeymap take a lowestReadableSegment parameter, but it is sourced from the misread watermark, so the floor they apply is 0 instead of 5.
  • The control loop's deletion watermark is seeded from int64(lowestReadableSegment)-1 = -1 instead of 4, so files in [0, 5) are never reclaimed by their post-restart pass even if they are still present.
  • Reads consult the watermark too: a session-2 Get of a session-1 resurrected key returns the on-disk value, not absent.

Fix

The keymapPath is already passed into NewDiskTable and is the keymap directory. Use its parent as the table root:

gcWatermarkFile, err := LoadGCWatermarkFile(filepath.Dir(keymapPath))

Or, mirroring the (already-defensive) Drop() loop, scan every qualified root for an existing watermark file and load whichever one exists. The first form is preferable because it matches the documented invariant exactly (co-located with the keymap) and survives config.Paths reordering without requiring a search.

The rebase CLI also needs a transferGCWatermark step alongside transferKeymap so that the file follows the keymap to its new root, otherwise post-rebase os.Remove(sourceTableDir) fails with directory not empty.

@seidroid seidroid Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A large but well-structured and well-tested refactor that moves keymap puts/deletes and GC collection off the control loop, adds a durable gc-watermark to prevent post-crash key resurrection, and switches iterators to reservations. No blocking correctness issues found; the one external (Codex) finding is unreachable under LittDB's documented no-overwrite contract.

Findings: 0 blocking | 6 non-blocking | 1 posted inline

Blockers

  • None at the file/PR level.

Non-blocking

  • Codex flagged a possible stale-read: an older in-flight put-batch's writeBatch deletes a key from unflushedDataCache, potentially evicting a newer Put(K) cache entry and surfacing the older keymap value until the newer batch applies. This is technically real but unreachable in correct usage — LittDB's Put contract (table.go:21) forbids overwriting an existing value, which is the only thing that puts two put-batches for the same key in flight concurrently. For non-overwrite usage the keymap-Put-then-cache-Delete ordering is correct. Worth a code comment noting writeBatch's correctness depends on the no-overwrite invariant, but not a blocker.
  • Cursor's second-opinion pass produced no output (cursor-review.md is empty); REVIEW_GUIDELINES.md was also empty, so no repo-specific standards were applied.
  • gcManager.collectExpiredSegments fsyncs the gc-watermark file once per expired segment (advanceWatermark before each scheduleDelete). With GCPeriod lowered to 10s and large collection bursts this is many sequential fsyncs; it's off the control loop so it won't block writes, but coalescing the watermark advance per pass (single fsync to the highest collected index) would reduce syncs. Minor.
  • The deletionWatermark channel is fire-and-forget and can drop updates if a single pass collects more segments than KeymapManagerWatermarkChannelSize (default 1024) before the control loop drains it, deferring some file reclamation to a later pass. This is monotonic-safe and documented, but explicit RunGC over a very large backlog may not reclaim everything in one call — acceptable given the documentation.
  • repairKeymap's all-or-nothing recovery now depends on keymap.Put being crash-atomic as a batch; the interface doc was updated to require this and a test asserts a single Put call, but the guarantee is only as strong as each Keymap implementation actually provides (PebbleDB batch). Worth confirming the unsafe/non-default keymap implementations honor it.
  • 1 suggestion(s)/nit(s) flagged inline on specific lines.

// The keys are now durable in both the segment and the keymap, so it is safe to drop them from the
// unflushed data cache.
for _, key := range keys {
m.unflushedDataCache.Delete(util.UnsafeBytesToString(key.Key))

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[nit] This unconditional cache prune is correct only because LittDB forbids overwriting an existing key (see Table.Put, table.go:21). If two put-batches for the same key were ever in flight (an overwrite), an older batch applied here would evict the newer cache entry while the keymap still points at the older segment, briefly surfacing a stale value via Get until the newer batch applies. Consider a comment pinning this to the no-overwrite invariant so the dependency is explicit. (Raised by Codex; not reachable under the current contract.)

Comment thread sei-db/db_engine/litt/disktable/disk_table.go
return nil, fmt.Errorf("failed to purge collected keymap entries: %w", err)
}

var upperBoundSnapshotFile *BoundaryFile

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔴 On startup repairSnapshot is called with lowestSegmentIndex (disk_table.go:265) instead of lowestReadableSegment, so its loop snapshots segments in [lowestSegmentIndex, lowestReadableSegment) that are logically deleted (the gc-watermark has advanced past them). After a crash mid-collection (or the TestGCWatermarkPreventsResurrectionAfterCrash clean-Close-without-reclaim state), the snapshot directory persistently contains symlinks to hard-linked copies of GC'd segment files; restoring that snapshot into a fresh LittDB has no gc-watermark to floor at (it lives at the table root, outside the snapshot dir) so reloadKeymap rescans those segments and resurrects collected keys — the exact failure mode the durable watermark was added to prevent. Every other read/repair path in this PR (reloadKeymap, repairKeymap, purgeCollectedKeymapEntries, readableFloor for iterators / boundary keys) already floors at lowestReadableSegmentrepairSnapshot is the lone exception. Fix: pass lowestReadableSegment into repairSnapshot and start its loop at max(firstSegmentToConsider, lowestReadableSegment).

Extended reasoning...

What the bug is

NewDiskTable calls repairSnapshot at disk_table.go:263-267 with lowestSegmentIndex — the lowest segment whose files are physically on disk — rather than lowestReadableSegment, the durable gc-watermark just computed on lines 227-234. Inside repairSnapshot (line 440), firstSegmentToConsider defaults to that lowestSegmentIndex (or lowerBoundSnapshotFile.BoundaryIndex()+1 if an external consumer has defined one), and the loop for i := firstSegmentToConsider; i < highestSegmentIndex; i++ calls seg.Snapshot() on every segment in that range. seg.Snapshot() creates a hard link under <root>/<table>/snapshot/ and a symlink under <SnapshotDirectory>/<table>/segments/ for the segment's files. The snapshot directory is also wiped (os.RemoveAll on line 419) and rebuilt from scratch each startup, so the snapshot's contents are entirely determined by this loop.

How it triggers in practice

This PR introduces the async two-phase GC: gcManager.advanceWatermark fsyncs the gc-watermark file before scheduling keymap deletes, and the control loop reclaims segment files only after the keymap manager publishes the deletion watermark. That creates a new, durable "logically deleted, files still on disk" state for segments in [lowestSegmentIndex, lowestReadableSegment). Two concrete reproductions:

  • Crash mid-collection. After advanceWatermark and the keymap delete, but before the control loop's deleteCollectedSegments removes the files, the process crashes. Segments [lowestSegmentIndex, lowestReadableSegment) remain on disk; the gc-watermark file persists at the table root.
  • Clean Close without reclaim. The test TestGCWatermarkPreventsResurrectionAfterCrash in this PR explicitly constructs this state by running gcManager.runOnce() + keymapManager.sync() + Close() (a clean Close does not delete segment files). The gc-watermark is durably 2; seg0 and seg1 files remain on disk.

On the next startup, repairSnapshot is invoked with lowestSegmentIndex (0), so it calls seg.Snapshot() on the logically-deleted seg0 and seg1, populating the snapshot directory with their symlinks. The control loop's later deleteCollectedSegments pass calls seg.Release() and removes only the segmentDirectory files (DeepDelete); the hard-linked copies under <root>/<table>/snapshot/ and the symlinks under <SnapshotDirectory>/<table>/segments/ are never touched. The snapshot directory persistently shows the GC'd segments.

Why the existing code does not prevent it

Every other consumer of the new logical floor already uses lowestReadableSegment:

  • reloadKeymap (disk_table.go:471, called at 238) iterates from lowestReadableSegment.
  • repairKeymap (disk_table.go:550, called at 243) stops at lowestReadableSegment.
  • purgeCollectedKeymapEntries (disk_table.go:642) walks [lowestSegmentIndex, lowestReadableSegment) deleting their keymap entries.
  • The control loop's runtime reads (computeOldestPrimaryKey, computeNewestPrimaryKey, handleOpenIteratorRequest) all floor at readableFloor.

repairSnapshot is the lone exception — it is the only place where a logically-deleted segment is treated as if it were still part of the table.

What the impact is

The snapshotting contract (littdb_config.go SnapshotDirectory doc): "If, at any point in time, an external process takes all data in the snapshot directory and loads it into a new LittDB instance, then that instance will have a consistent view of the database." This bug violates that contract for the collected-but-unreclaimed segments. An external backup process reads the snapshot directory and copies/restores those segments into a new LittDB instance. The destination has no gc-watermark file because the watermark lives at <root>/<table>/gc-watermark — at the table root, outside the snapshot directory (see filesystem_layout.md:87-93). So in the restored instance, LoadGCWatermarkFile returns IsDefined()==false, lowestReadableSegment defaults to lowestSegmentIndex==0, and reloadKeymap rescans every segment including the collected ones, calling keymap.Put for each of their keys. The result is silent data resurrection — collected keys reappear in the restored backup, exactly the failure mode reloadKeymap's own doc comment (lines 470-474) warns about: "if BOTH the keymap and the gc-watermark file are lost, this rebuild has no record of in-flight GC and may resurrect keys".

Step-by-step proof

Config: TTL configured, snapshotting enabled to <snapshotRoot>, table T.

  1. Session 1. Write k0..k4, sealing every 2 keys: seg0[k0,k1], seg1[k2,k3], seg2(mutable)[k4]. Flush.
  2. TTL expires the sealed segments. gcManager.collectExpiredSegments calls advanceWatermark(2) (gc-watermark file is fsynced to 2) then schedules the keymap delete for seg0 and seg1. The keymap manager applies those deletes and publishes deletion watermark 1.
  3. The process crashes (or cleanly closes — both work) before the control loop's deleteCollectedSegments pass removes seg0/seg1 files.
  4. Session 2. NewDiskTable runs. LoadGCWatermarkFile reads watermark 2, so lowestReadableSegment=2. lowestSegmentIndex is computed from on-disk segment metadata files and equals 0 (seg0/seg1 files still present).
  5. repairSnapshot is called with lowestSegmentIndex=0. After os.RemoveAll(symlinkSegmentsDirectory), the loop for i := 0; i < 3; i++ calls seg.Snapshot() on seg0, seg1, and seg2. seg0 and seg1 are now in the snapshot directory: their key/value/metadata files are hard-linked under <root>/<table>/snapshot/, and <snapshotRoot>/<table>/segments/ contains symlinks to them.
  6. Later in startup, purgeCollectedKeymapEntries deletes seg0/seg1's keymap entries, and on the first control-loop tick deleteCollectedSegments removes the segmentDirectory files for seg0/seg1 — but the hard links under <root>/<table>/snapshot/ are untouched. The snapshot now persistently shows seg0/seg1 as part of the database.
  7. An external backup process snapshots <snapshotRoot> and restores it into a fresh LittDB instance T'. The destination's <root'>/<table>/gc-watermark does not exist. On open, LoadGCWatermarkFile returns IsDefined()==false, so lowestReadableSegment=lowestSegmentIndex=0.
  8. The fresh instance has no keymap (the snapshot directory contains segments, not the keymap), so reloadKeymap runs over [0, highestSegmentIndex]. It reads seg0's keys (k0, k1) and seg1's keys (k2, k3) and calls keymap.Put for each. The four garbage-collected keys are now alive in T'. Resurrection complete.

Fix

Change line 265 to pass lowestReadableSegment instead of lowestSegmentIndex:

upperBoundSnapshotFile, err = table.repairSnapshot(
    config.SnapshotDirectory,
    lowestReadableSegment,
    highestSegmentIndex,
    segments)

Or, equivalently and defensively, change the loop start in repairSnapshot:

firstSegmentToConsider := lowestReadableSegment
if lowerBoundSnapshotFile.IsDefined() {
    if next := lowerBoundSnapshotFile.BoundaryIndex() + 1; next > firstSegmentToConsider {
        firstSegmentToConsider = next
    }
}

The second form preserves the existing max(externalLowerBound, internalFloor) semantics: external consumers may already have processed segments above the gc-watermark, in which case their bound still wins. This brings repairSnapshot in line with every other read/repair path in this PR and closes the resurrection gap for the snapshotting code path.

Comment thread sei-db/db_engine/litt/disktable/gc_watermark_file.go

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 8e25454. Configure here.

// The only error path is a panic-induced shutdown; nothing more to do.
return
}
g.deletionScheduledIndex = int64(index)

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Empty segments never reclaim files

Medium Severity

When GC collects a TTL-expired sealed segment with no keys, scheduleDelete returns without enqueueing work, so the keymap manager never publishes a deletion watermark for that index. The control loop only drops segment files once keymapDeletionWatermark covers the index, while the GC manager still advances deletionScheduledIndex, leaving empty sealed segments (e.g. after SetShardingFactor) on disk indefinitely.

Additional Locations (1)
Fix in Cursor Fix in Web

Reviewed by Cursor Bugbot for commit 8e25454. Configure here.

@seidroid seidroid Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A large but carefully designed refactor of LittDB keymap threading: keymap puts/deletes now run asynchronously off the control loop, GC is split into an off-loop manager with a durable gc-watermark, and crash recovery (repair + purge) is hardened against resurrecting collected keys. The design is sound and very well tested; the only issues are a documented-but-real watermark-drop edge case and a stale doc comment, neither of which blocks merge.

Findings: 0 blocking | 4 non-blocking | 1 posted inline

Blockers

  • None at the file/PR level.

Non-blocking

  • Stale doc comment (Codex finding): sei-db/db_engine/litt/table.go ~line 152 still states "iteration disables GC temporarily" (and has a typo, "Additinoally"). This PR changes iterators to reserve their snapshot segments while GC continues, so the comment is now misleading about iterator/GC/disk-retention semantics. Not flaggable inline because that line is not in the PR diff, but it should be updated alongside the other iterator doc updates in this PR.
  • GCPeriod default was lowered from 5m to 10s (littdb_config.go). This is a meaningful behavioral change for all LittDB consumers (much more frequent collection passes). It is justified in the comment (collection is now cheap on the control loop), but worth calling out explicitly in the PR description / release notes since it affects every deployment, not just the keymap-threading path.
  • REVIEW_GUIDELINES.md is empty/missing on the base branch, so no repo-specific review standards could be applied. The Cursor second-opinion pass (cursor-review.md) produced no output (empty file); only the Codex pass contributed findings, both of which are addressed here.
  • 1 suggestion(s)/nit(s) flagged inline on specific lines.

func (m *keymapManager) publishDeletionWatermark(watermark int64) {
select {
case m.deletionWatermarkChan <- watermark:
default:

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[suggestion] Watermark drop on a full channel (Codex finding, confirmed). During an explicit RunGC() over a backlog larger than KeymapManagerWatermarkChannelSize (default 1024), sync() publishes one watermark per collected segment while the control loop is not draining deletionWatermarkChan (no GC/iterator/boundary request is in flight until runGCPass() runs after sync()). Because the channel is FIFO and publishes are monotonically increasing, it retains the lowest 1024 watermarks and drops the highest ones. refreshDeletionWatermark() then floors at ~1023, so segments above that have their keymap entries durably deleted but their files (and the keyCount/Size they contribute) are not reclaimed. Since deletionScheduledIndex has already advanced past them, recovery only happens when a later segment is collected and publishes a higher watermark — if collection then goes idle, the lag persists. This is correctly documented and is never a premature deletion (the watermark is monotonic), so it's not a correctness blocker, but consider making the published watermark loss-proof — e.g. also store the latest value in an atomic.Int64 that refreshDeletionWatermark reads as a floor — so a single large RunGC pass always fully reclaims in one go regardless of channel size.

Sahil-4555 pushed a commit to Sahil-4555/sei-chain that referenced this pull request Jun 30, 2026
…ei-protocol#3660)

## Description
- NOTE: Only Merge post
[3645](sei-protocol#3645): That makes
per-block flush affordable, reclaiming near-per-block crash durability
for free.
- `littFlushInterval` 100ms -> **5ms** (~one block at Giga throughput) —
bounds crash loss to ~a single block (near-WAL durability for receipt
bodies).
- `ShardingFactor` 16 -> **1**. flushing one segment file is cheaper;
sharding mainly helps across multiple disks.

## Testing
- reads-off write ceiling **167k** (vs 168k prior)
- reads-on **~153–157k** writes / 58k point-reads/s / 520 getLogs/s @
7ms
- 3h GC soak **flat** (old −27% control-loop step-down gone), 0 panics

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants