LittDB: Keymap threading improvements#3645
Conversation
PR SummaryHigh Risk Overview Garbage collection is split across goroutines. A GC manager TTL-scans sealed segments (with optional filter), fsyncs a new Correctness and read paths floor iteration, oldest/newest keys, reload/repair at API cleanup: Reviewed by Cursor Bugbot for commit 8e25454. Bugbot is set up for automated code reviews on this repo. Configure here. |
|
The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).
|
| if err != nil { | ||
| c.logger.Error("failed to drain keymap writer", "error", err) | ||
| return | ||
| } |
There was a problem hiding this comment.
Close hangs after drain failure
High Severity
If keymapManager.drain() fails during shutdown, handleShutdownRequest returns after logging without sending on req.shutdownCompleteChan. Close() still waits on that channel, so shutdown can block forever even though the table never finished stopping.
Reviewed by Cursor Bugbot for commit 7b4c308. Configure here.
There was a problem hiding this comment.
Another false positive. Close() doesn't wait on that channel alone, it waits via util.Await, which selects on the channel or the error monitor's cancellation.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #3645 +/- ##
==========================================
- Coverage 59.12% 58.14% -0.99%
==========================================
Files 2259 2176 -83
Lines 186489 176733 -9756
==========================================
- Hits 110255 102755 -7500
+ Misses 66353 64900 -1453
+ Partials 9881 9078 -803
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
| timer := time.NewTimer(m.maxFlushInterval) | ||
| defer timer.Stop() | ||
|
|
||
| for len(*puts)+len(*deletes) < m.maxBatchSize { |
There was a problem hiding this comment.
The loop checks the batch size at the top, then appends an entire request's key slice. A single scheduleDelete for a segment with 100,000 keys would blow past the 1024 maxBatchSize in one append. This doesn't affect correctness — deleteBatch chunks deletes into deleteBatchSize (10,000) sub-batches — but the put side has no chunking, so a single writeBatch with 100,000+ keys hits the keymap in one giant Put call, which could spike latency.
Consider Cap writeBatch size the same way deleteBatch is capped, deleteBatch chunks into deleteBatchSize sub-batches, but writeBatch doesn't. For consistency and to bound peak latency.
There was a problem hiding this comment.
Change made (required a refactor)
There was a problem hiding this comment.
A well-engineered, thoroughly-tested rework of LittDB keymap threading (async keymap manager, two-phase GC gated by a deletion watermark, crash-time keymap repair). No blocking correctness issues found; a couple of low-severity edge-case observations remain.
Findings: 0 blocking | 4 non-blocking | 2 posted inline
Blockers
- None at the file/PR level.
Non-blocking
- Cursor second-opinion pass produced no output (cursor-review.md is empty), so only the Codex pass and this review contributed external findings.
- Startup now runs repairKeymap on every non-reload open. On a clean shutdown this is cheap (the first key checked is present, so it stops immediately), but it's worth confirming via the existing TestCleanShutdownLeavesKeymapConsistent that the zero-rescue fast path holds for large tables in CI, since the walk does a keymap.Get per missing key.
- 2 suggestion(s)/nit(s) flagged inline on specific lines.
| // could deadlock the seal path). If the channel is full the update is dropped: the watermark is monotonic and | ||
| // the control loop drains it each GC pass, so a later publish delivers an equal-or-newer value. A dropped | ||
| // update only delays file deletion, which is always safe. | ||
| func (m *keymapManager) publishDeletionWatermark(watermark int64) { |
There was a problem hiding this comment.
[suggestion] Codex flagged that a dropped watermark publish can strand GC'd segment files. The drop here is fire-and-forget (correctly, to avoid the seal-path deadlock), and the watermark is monotonic, but the control loop advances deletionScheduledIndex when a delete is scheduled and never reschedules those segments. So if the channel is full when the highest watermark is published and no later segment ever expires (e.g. writes stop on an idle table), keymapDeletionWatermark stays below the true highest and phase-2 leaves those segments' files on disk — they remain counted/readable, then get reclaimed on the next delete that advances the watermark, or on restart.
This is a disk-reclamation delay in an extreme edge case (the channel defaults to 1024 and is drained every GC pass), not data loss or resurrection of deleted keys, so I'd downgrade Codex's "permanently strand" framing to a low-severity note. Still, given the only cost of a fuller-fidelity signal is a small struct, consider publishing the watermark via a single-slot "latest value" (overwrite-on-full) rather than a bounded queue that can drop the newest value, which would close the gap entirely.
| // caller without waiting for the keymap write. The segment data is already crash durable; a crash that | ||
| // loses the scheduled write is repaired from the segment key files on the next startup. A full manager | ||
| // channel backpressures here, which propagates back to Flush(). | ||
| err = f.keymapManager.scheduleWrite(durableKeys) |
There was a problem hiding this comment.
[nit] Behavioral change worth a note (per Codex): Flush() now returns success once the keymap put is scheduled, so a subsequent keymap.Put failure (e.g. a duplicate-key error under DoubleWriteProtection) surfaces asynchronously as an error-monitor panic on the manager goroutine rather than to the Flush() caller that produced it. The DB still enters the panicked state and later calls observe the error, so this isn't silent data loss — just a shift in which call observes the failure. Acceptable given the async design; flagging so the changed error-surfacing contract is intentional.
With litt's async keymap flush (#3645), flushing ~once per block is cheap, so: - littFlushInterval 100ms -> 5ms (~one block at Giga throughput), bounding crash loss to roughly a single block (near-WAL durability) - ShardingFactor 16 -> 1 (flushing a single segment file is cheaper; sharding mainly helps across multiple disks) DO NOT MERGE until #3645 (litt async keymap) lands: without it, flushing this often regresses write throughput ~48%. Validated via cryptosim (main littidx + #3645 + parallel getLogs): - reads-off write ceiling 167k (vs 168k prior) - reads-on ~153-157k writes / 58k point-reads/s / 520 getLogs/s @ 7ms - 3h GC soak flat (old -27% control-loop step-down gone), 0 panics Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
There was a problem hiding this comment.
Solid, well-tested rework of LittDB keymap threading: async keymap manager, off-control-loop GC, reservation-based iterators, and a durable gc-watermark that correctly prevents post-crash resurrection of garbage-collected keys. No blocking correctness issues found; the two Codex findings are real but low-severity disk-reclamation lags that self-heal on restart.
Findings: 0 blocking | 5 non-blocking | 2 posted inline
Blockers
- None at the file/PR level.
Non-blocking
- Watermark-drop reclamation lag (Codex #1):
publishDeletionWatermarkdrops updates whendeletionWatermarkChan(cap 1024) is full. Because a full FIFO channel retains the older, smaller values and drops the newest (largest) watermark, if a collection burst publishes >1024 segment watermarks between two control-loop drains, the control loop reads a stale max and leaves the higher-indexed collected segments' files on disk until a later publish or restart. Unlikely in practice (deletes are fsync-bounded, control loop drains every GCPeriod), and it self-heals at startup via watermark seeding. Consider a latest-value (cap-1, drain-then-send / overwrite) channel likesetTTLuses, so the freshest watermark is never lost. - Second-opinion passes:
cursor-review.mdis empty (Cursor produced no output) andREVIEW_GUIDELINES.mdis empty/missing on the base branch, so no repo-specific standards were applied beyond AGENTS.md. - Positive: the crash-consistency regression test (
TestGCWatermarkPreventsResurrectionAfterCrash) and the single-atomic-batch repair test (TestRepairKeymapSingleBatch) directly exercise the trickiest invariants; the new async paths are well covered. - 2 suggestion(s)/nit(s) flagged inline on specific lines.
|
|
||
| // Schedule the segment's keymap entries for deletion. The manager applies it asynchronously and then | ||
| // advances the deletion watermark, which lets the control loop later delete the segment's files. | ||
| if err := g.keymapManager.scheduleDelete(g.gcCursorKeys, int64(index)); err != nil { |
There was a problem hiding this comment.
[suggestion] Empty GC'd segment never advances the control loop's deletion watermark (Codex #2). For a sealed segment with no keys, scheduleDelete returns early (len(keys)==0), so no delete is enqueued and applyDeleteSubBatch never calls publishDeletionWatermark for this segment — yet advanceWatermark and deletionScheduledIndex have already moved past it. The control loop's keymapDeletionWatermark therefore never covers the empty segment, so deleteCollectedSegments won't reclaim its files until a later non-empty collected segment publishes a higher watermark, or until restart (where seeding recovers it). An empty sealed segment is rare but reachable (e.g. sealing an empty mutable segment). Consider publishing the deletion watermark for the segment even when its key set is empty.
| is logically gone. This lets LittDB distinguish, after a crash, between a key that is missing from the keymap | ||
| because it was garbage collected (everything below the watermark) and a key that is missing because its | ||
| asynchronous keymap write was lost (a recently written key, which keymap repair must restore). Reads also | ||
| consult the watermark and treat any key in a below-watermark segment as absent. |
There was a problem hiding this comment.
[nit] This says "Reads also consult the watermark and treat any key in a below-watermark segment as absent," but the runtime read path (DiskTable.Get) does not consult the gc-watermark — gcManager.advanceWatermark explicitly notes there is no in-memory read barrier and the durable value is consulted only at boot (repair/reload). At runtime, a key in a below-watermark segment stays readable until its keymap entry is actually deleted. Functionally benign (the key is expiring anyway), but the doc claim is inaccurate; recommend rewording.
| // Load the durable GC watermark (lowest readable segment). It lives at the table root so it survives a | ||
| // keymap rebuild. Reconcile against the lowest segment actually present on disk: a value at or below it | ||
| // means those segments were already physically deleted (clamp up to the lowest present); a value above it | ||
| // means GC durably deleted some segments' keymap entries but crashed before their files were removed, so | ||
| // segments [lowestSegmentIndex, lowestReadableSegment) are still on disk but logically deleted (the control | ||
| // loop will reclaim their files, and repair/reload will not resurrect them). | ||
| gcWatermarkFile, err := LoadGCWatermarkFile(qualifiedRoots[0]) | ||
| if err != nil { | ||
| return nil, fmt.Errorf("failed to load gc-watermark file: %w", err) | ||
| } | ||
| lowestReadableSegment := lowestSegmentIndex | ||
| if gcWatermarkFile.IsDefined() && gcWatermarkFile.LowestReadableSegment() > lowestReadableSegment { | ||
| lowestReadableSegment = gcWatermarkFile.LowestReadableSegment() | ||
| } |
There was a problem hiding this comment.
🔴 The gc-watermark file is loaded unconditionally from qualifiedRoots[0] (line 227), but the keymap directory is not guaranteed to live in config.Paths[0] — FindKeymapLocation scans all roots, and the rebase CLI / config.Paths reordering between sessions can put it in any root. When the watermark and keymap diverge, a subsequent session sees IsDefined()==false, treats lowestReadableSegment as 0, and if a keymap loss then triggers reloadKeymap/repairKeymap, segments whose keys were durably GC-deleted will be scanned and their keys resurrected — exactly the failure mode this watermark was added to prevent. Additionally, cli/rebase.go:transferKeymap does not migrate the gc-watermark, so a post-rebase table where any GC has run will fail its later os.Remove(sourceTableDir) with "directory not empty". Fix: load via filepath.Dir(keymapPath) (the keymap's parent is already passed into NewDiskTable) or scan all qualified roots the way Drop does, and teach the rebase CLI to migrate the file.
Extended reasoning...
The bug
NewDiskTable (sei-db/db_engine/litt/disktable/disk_table.go:227) loads the gc-watermark unconditionally from the first root:
gcWatermarkFile, err := LoadGCWatermarkFile(qualifiedRoots[0])But the design comment for this file (gc_watermark_file.go:22-23) and the filesystem-layout doc (filesystem_layout.md:87-93) both say the watermark lives at the table root that holds the keymap directory — and that root is not pinned to Paths[0]. littbuilder/build_utils.go:FindKeymapLocation (lines 32-78) explicitly scans every configured root looking for the keymap directory and returns whichever one contains it. Only new keymaps go to Paths[0] by convention (line 120).
The asymmetry with Drop() (lines 711-727) is the giveaway: Drop already iterates every root looking for and deleting this file, while the load path checks only one. If the design guaranteed Paths[0], Drop would not need the loop.
How a real session triggers it
The watermark can end up away from Paths[0] in several realistic ways:
config.Pathsreordering between sessions. filesystem_layout.md explicitly states "It doesn't matter which root directory contains the keymap directory." A user withPaths=[/a, /b]who reorders toPaths=[/b, /a]next session has nothing in LittDB telling them this is load-bearing —FindKeymapLocationwill find the keymap regardless, but the watermark check looks only atPaths[0].- Post-rebase.
cli/rebase.go:transferKeymap(lines 385-424) moves the keymap to a destination root chosen bydetermineDestination(lines 524-532), which hashes source paths. Worse: rebase does NOT migrate the gc-watermark at all —grep -n 'GCWatermark\|gc-watermark' cli/rebase.goreturns nothing. So the file stays in the old location while the keymap moves. The complementary failure:transferDataInTableends withos.Remove(sourceTableDir)(rebase.go:304), which fails withdirectory not emptywhenever any GC has run, because the gc-watermark file was never moved out.
Step-by-step proof of resurrection
Config: Paths=[/a, /b], single table T, TTL configured.
Session 1.
- Keymap is created under
/a/T/keymap(thePaths[0]convention at first creation). - Data is written and flushed. Segments 0..4 exist; some are sealed and TTL-expired.
- GC runs: it calls
gcManager.advanceWatermark(5)(gc_manager.go), whichUpdates the watermark file at/a/T/gc-watermarkto 5, then schedules keymap deletes for segments 0..4. - Manager applies the deletes — keys for segments 0..4 are durably removed from the keymap at
/a/T/keymap. - Before the control loop deletes the segment files (or with files already deleted — either branch hits the same bug), the user shuts down cleanly.
Between sessions. The operator reorders to Paths=[/b, /a] (or runs the rebase CLI to move the keymap to /b).
Session 2.
NewDiskTableruns:qualifiedRoots[0] = /b/T.LoadGCWatermarkFile(/b/T)returns an undefined file — there is nogc-watermarkat/b/T. The real one is still at/a/T/gc-watermark.lowestReadableSegment = lowestSegmentIndex(clamped from the undefined watermark) = the lowest segment whose files are still on disk. If session 1 had not yet reclaimed segment files, that's 0.- Suppose at some point afterward the keymap is lost (corruption, manual wipe of the keymap directory, migrating to a new keymap type — the exact case
reloadKeymapexists for). On the next startupreloadKeymapruns withlowestReadableSegment = 0and scans segments 0..highest, callingkeymap.Putfor every key — including the keys for segments 0..4 that GC durably deleted in session 1. Those keys are now resurrected.
The reloadKeymap doc comment (disk_table.go:454-461) explicitly acknowledges that an intact gc-watermark is what prevents this: "if BOTH the keymap and the gc-watermark file are lost, this rebuild has no record of in-flight GC and may resurrect keys." The bug is that even when the watermark file is fully intact on disk at /a/T/gc-watermark, the load path is looking in the wrong root and treating it as lost.
Why existing code does not save it
reloadKeymap/repairKeymaptake alowestReadableSegmentparameter, but it is sourced from the misread watermark, so the floor they apply is 0 instead of 5.- The control loop's deletion watermark is seeded from
int64(lowestReadableSegment)-1 = -1instead of 4, so files in[0, 5)are never reclaimed by their post-restart pass even if they are still present. - Reads consult the watermark too: a session-2
Getof a session-1 resurrected key returns the on-disk value, not absent.
Fix
The keymapPath is already passed into NewDiskTable and is the keymap directory. Use its parent as the table root:
gcWatermarkFile, err := LoadGCWatermarkFile(filepath.Dir(keymapPath))Or, mirroring the (already-defensive) Drop() loop, scan every qualified root for an existing watermark file and load whichever one exists. The first form is preferable because it matches the documented invariant exactly (co-located with the keymap) and survives config.Paths reordering without requiring a search.
The rebase CLI also needs a transferGCWatermark step alongside transferKeymap so that the file follows the keymap to its new root, otherwise post-rebase os.Remove(sourceTableDir) fails with directory not empty.
There was a problem hiding this comment.
A large but well-structured and well-tested refactor that moves keymap puts/deletes and GC collection off the control loop, adds a durable gc-watermark to prevent post-crash key resurrection, and switches iterators to reservations. No blocking correctness issues found; the one external (Codex) finding is unreachable under LittDB's documented no-overwrite contract.
Findings: 0 blocking | 6 non-blocking | 1 posted inline
Blockers
- None at the file/PR level.
Non-blocking
- Codex flagged a possible stale-read: an older in-flight put-batch's writeBatch deletes a key from unflushedDataCache, potentially evicting a newer Put(K) cache entry and surfacing the older keymap value until the newer batch applies. This is technically real but unreachable in correct usage — LittDB's Put contract (table.go:21) forbids overwriting an existing value, which is the only thing that puts two put-batches for the same key in flight concurrently. For non-overwrite usage the keymap-Put-then-cache-Delete ordering is correct. Worth a code comment noting writeBatch's correctness depends on the no-overwrite invariant, but not a blocker.
- Cursor's second-opinion pass produced no output (cursor-review.md is empty); REVIEW_GUIDELINES.md was also empty, so no repo-specific standards were applied.
- gcManager.collectExpiredSegments fsyncs the gc-watermark file once per expired segment (advanceWatermark before each scheduleDelete). With GCPeriod lowered to 10s and large collection bursts this is many sequential fsyncs; it's off the control loop so it won't block writes, but coalescing the watermark advance per pass (single fsync to the highest collected index) would reduce syncs. Minor.
- The deletionWatermark channel is fire-and-forget and can drop updates if a single pass collects more segments than KeymapManagerWatermarkChannelSize (default 1024) before the control loop drains it, deferring some file reclamation to a later pass. This is monotonic-safe and documented, but explicit RunGC over a very large backlog may not reclaim everything in one call — acceptable given the documentation.
- repairKeymap's all-or-nothing recovery now depends on keymap.Put being crash-atomic as a batch; the interface doc was updated to require this and a test asserts a single Put call, but the guarantee is only as strong as each Keymap implementation actually provides (PebbleDB batch). Worth confirming the unsafe/non-default keymap implementations honor it.
- 1 suggestion(s)/nit(s) flagged inline on specific lines.
| // The keys are now durable in both the segment and the keymap, so it is safe to drop them from the | ||
| // unflushed data cache. | ||
| for _, key := range keys { | ||
| m.unflushedDataCache.Delete(util.UnsafeBytesToString(key.Key)) |
There was a problem hiding this comment.
[nit] This unconditional cache prune is correct only because LittDB forbids overwriting an existing key (see Table.Put, table.go:21). If two put-batches for the same key were ever in flight (an overwrite), an older batch applied here would evict the newer cache entry while the keymap still points at the older segment, briefly surfacing a stale value via Get until the newer batch applies. Consider a comment pinning this to the no-overwrite invariant so the dependency is explicit. (Raised by Codex; not reachable under the current contract.)
| return nil, fmt.Errorf("failed to purge collected keymap entries: %w", err) | ||
| } | ||
|
|
||
| var upperBoundSnapshotFile *BoundaryFile |
There was a problem hiding this comment.
🔴 On startup repairSnapshot is called with lowestSegmentIndex (disk_table.go:265) instead of lowestReadableSegment, so its loop snapshots segments in [lowestSegmentIndex, lowestReadableSegment) that are logically deleted (the gc-watermark has advanced past them). After a crash mid-collection (or the TestGCWatermarkPreventsResurrectionAfterCrash clean-Close-without-reclaim state), the snapshot directory persistently contains symlinks to hard-linked copies of GC'd segment files; restoring that snapshot into a fresh LittDB has no gc-watermark to floor at (it lives at the table root, outside the snapshot dir) so reloadKeymap rescans those segments and resurrects collected keys — the exact failure mode the durable watermark was added to prevent. Every other read/repair path in this PR (reloadKeymap, repairKeymap, purgeCollectedKeymapEntries, readableFloor for iterators / boundary keys) already floors at lowestReadableSegment — repairSnapshot is the lone exception. Fix: pass lowestReadableSegment into repairSnapshot and start its loop at max(firstSegmentToConsider, lowestReadableSegment).
Extended reasoning...
What the bug is
NewDiskTable calls repairSnapshot at disk_table.go:263-267 with lowestSegmentIndex — the lowest segment whose files are physically on disk — rather than lowestReadableSegment, the durable gc-watermark just computed on lines 227-234. Inside repairSnapshot (line 440), firstSegmentToConsider defaults to that lowestSegmentIndex (or lowerBoundSnapshotFile.BoundaryIndex()+1 if an external consumer has defined one), and the loop for i := firstSegmentToConsider; i < highestSegmentIndex; i++ calls seg.Snapshot() on every segment in that range. seg.Snapshot() creates a hard link under <root>/<table>/snapshot/ and a symlink under <SnapshotDirectory>/<table>/segments/ for the segment's files. The snapshot directory is also wiped (os.RemoveAll on line 419) and rebuilt from scratch each startup, so the snapshot's contents are entirely determined by this loop.
How it triggers in practice
This PR introduces the async two-phase GC: gcManager.advanceWatermark fsyncs the gc-watermark file before scheduling keymap deletes, and the control loop reclaims segment files only after the keymap manager publishes the deletion watermark. That creates a new, durable "logically deleted, files still on disk" state for segments in [lowestSegmentIndex, lowestReadableSegment). Two concrete reproductions:
- Crash mid-collection. After
advanceWatermarkand the keymap delete, but before the control loop'sdeleteCollectedSegmentsremoves the files, the process crashes. Segments[lowestSegmentIndex, lowestReadableSegment)remain on disk; the gc-watermark file persists at the table root. - Clean Close without reclaim. The test
TestGCWatermarkPreventsResurrectionAfterCrashin this PR explicitly constructs this state by runninggcManager.runOnce()+keymapManager.sync()+Close()(a clean Close does not delete segment files). The gc-watermark is durably 2; seg0 and seg1 files remain on disk.
On the next startup, repairSnapshot is invoked with lowestSegmentIndex (0), so it calls seg.Snapshot() on the logically-deleted seg0 and seg1, populating the snapshot directory with their symlinks. The control loop's later deleteCollectedSegments pass calls seg.Release() and removes only the segmentDirectory files (DeepDelete); the hard-linked copies under <root>/<table>/snapshot/ and the symlinks under <SnapshotDirectory>/<table>/segments/ are never touched. The snapshot directory persistently shows the GC'd segments.
Why the existing code does not prevent it
Every other consumer of the new logical floor already uses lowestReadableSegment:
reloadKeymap(disk_table.go:471, called at 238) iterates fromlowestReadableSegment.repairKeymap(disk_table.go:550, called at 243) stops atlowestReadableSegment.purgeCollectedKeymapEntries(disk_table.go:642) walks[lowestSegmentIndex, lowestReadableSegment)deleting their keymap entries.- The control loop's runtime reads (
computeOldestPrimaryKey,computeNewestPrimaryKey,handleOpenIteratorRequest) all floor atreadableFloor.
repairSnapshot is the lone exception — it is the only place where a logically-deleted segment is treated as if it were still part of the table.
What the impact is
The snapshotting contract (littdb_config.go SnapshotDirectory doc): "If, at any point in time, an external process takes all data in the snapshot directory and loads it into a new LittDB instance, then that instance will have a consistent view of the database." This bug violates that contract for the collected-but-unreclaimed segments. An external backup process reads the snapshot directory and copies/restores those segments into a new LittDB instance. The destination has no gc-watermark file because the watermark lives at <root>/<table>/gc-watermark — at the table root, outside the snapshot directory (see filesystem_layout.md:87-93). So in the restored instance, LoadGCWatermarkFile returns IsDefined()==false, lowestReadableSegment defaults to lowestSegmentIndex==0, and reloadKeymap rescans every segment including the collected ones, calling keymap.Put for each of their keys. The result is silent data resurrection — collected keys reappear in the restored backup, exactly the failure mode reloadKeymap's own doc comment (lines 470-474) warns about: "if BOTH the keymap and the gc-watermark file are lost, this rebuild has no record of in-flight GC and may resurrect keys".
Step-by-step proof
Config: TTL configured, snapshotting enabled to <snapshotRoot>, table T.
- Session 1. Write k0..k4, sealing every 2 keys: seg0[k0,k1], seg1[k2,k3], seg2(mutable)[k4]. Flush.
- TTL expires the sealed segments.
gcManager.collectExpiredSegmentscallsadvanceWatermark(2)(gc-watermark file is fsynced to 2) then schedules the keymap delete for seg0 and seg1. The keymap manager applies those deletes and publishes deletion watermark 1. - The process crashes (or cleanly closes — both work) before the control loop's
deleteCollectedSegmentspass removes seg0/seg1 files. - Session 2.
NewDiskTableruns.LoadGCWatermarkFilereads watermark 2, solowestReadableSegment=2.lowestSegmentIndexis computed from on-disk segment metadata files and equals 0 (seg0/seg1 files still present). repairSnapshotis called withlowestSegmentIndex=0. Afteros.RemoveAll(symlinkSegmentsDirectory), the loopfor i := 0; i < 3; i++callsseg.Snapshot()on seg0, seg1, and seg2. seg0 and seg1 are now in the snapshot directory: their key/value/metadata files are hard-linked under<root>/<table>/snapshot/, and<snapshotRoot>/<table>/segments/contains symlinks to them.- Later in startup,
purgeCollectedKeymapEntriesdeletes seg0/seg1's keymap entries, and on the first control-loop tickdeleteCollectedSegmentsremoves thesegmentDirectoryfiles for seg0/seg1 — but the hard links under<root>/<table>/snapshot/are untouched. The snapshot now persistently shows seg0/seg1 as part of the database. - An external backup process snapshots
<snapshotRoot>and restores it into a fresh LittDB instanceT'. The destination's<root'>/<table>/gc-watermarkdoes not exist. On open,LoadGCWatermarkFilereturnsIsDefined()==false, solowestReadableSegment=lowestSegmentIndex=0. - The fresh instance has no keymap (the snapshot directory contains segments, not the keymap), so
reloadKeymapruns over[0, highestSegmentIndex]. It reads seg0's keys (k0, k1) and seg1's keys (k2, k3) and callskeymap.Putfor each. The four garbage-collected keys are now alive inT'. Resurrection complete.
Fix
Change line 265 to pass lowestReadableSegment instead of lowestSegmentIndex:
upperBoundSnapshotFile, err = table.repairSnapshot(
config.SnapshotDirectory,
lowestReadableSegment,
highestSegmentIndex,
segments)Or, equivalently and defensively, change the loop start in repairSnapshot:
firstSegmentToConsider := lowestReadableSegment
if lowerBoundSnapshotFile.IsDefined() {
if next := lowerBoundSnapshotFile.BoundaryIndex() + 1; next > firstSegmentToConsider {
firstSegmentToConsider = next
}
}The second form preserves the existing max(externalLowerBound, internalFloor) semantics: external consumers may already have processed segments above the gc-watermark, in which case their bound still wins. This brings repairSnapshot in line with every other read/repair path in this PR and closes the resurrection gap for the snapshotting code path.
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes using default effort and found 1 potential issue.
There are 2 total unresolved issues (including 1 from previous review).
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 8e25454. Configure here.
| // The only error path is a panic-induced shutdown; nothing more to do. | ||
| return | ||
| } | ||
| g.deletionScheduledIndex = int64(index) |
There was a problem hiding this comment.
Empty segments never reclaim files
Medium Severity
When GC collects a TTL-expired sealed segment with no keys, scheduleDelete returns without enqueueing work, so the keymap manager never publishes a deletion watermark for that index. The control loop only drops segment files once keymapDeletionWatermark covers the index, while the GC manager still advances deletionScheduledIndex, leaving empty sealed segments (e.g. after SetShardingFactor) on disk indefinitely.
Additional Locations (1)
Reviewed by Cursor Bugbot for commit 8e25454. Configure here.
There was a problem hiding this comment.
A large but carefully designed refactor of LittDB keymap threading: keymap puts/deletes now run asynchronously off the control loop, GC is split into an off-loop manager with a durable gc-watermark, and crash recovery (repair + purge) is hardened against resurrecting collected keys. The design is sound and very well tested; the only issues are a documented-but-real watermark-drop edge case and a stale doc comment, neither of which blocks merge.
Findings: 0 blocking | 4 non-blocking | 1 posted inline
Blockers
- None at the file/PR level.
Non-blocking
- Stale doc comment (Codex finding):
sei-db/db_engine/litt/table.go~line 152 still states "iteration disables GC temporarily" (and has a typo, "Additinoally"). This PR changes iterators to reserve their snapshot segments while GC continues, so the comment is now misleading about iterator/GC/disk-retention semantics. Not flaggable inline because that line is not in the PR diff, but it should be updated alongside the other iterator doc updates in this PR. - GCPeriod default was lowered from 5m to 10s (littdb_config.go). This is a meaningful behavioral change for all LittDB consumers (much more frequent collection passes). It is justified in the comment (collection is now cheap on the control loop), but worth calling out explicitly in the PR description / release notes since it affects every deployment, not just the keymap-threading path.
- REVIEW_GUIDELINES.md is empty/missing on the base branch, so no repo-specific review standards could be applied. The Cursor second-opinion pass (cursor-review.md) produced no output (empty file); only the Codex pass contributed findings, both of which are addressed here.
- 1 suggestion(s)/nit(s) flagged inline on specific lines.
| func (m *keymapManager) publishDeletionWatermark(watermark int64) { | ||
| select { | ||
| case m.deletionWatermarkChan <- watermark: | ||
| default: |
There was a problem hiding this comment.
[suggestion] Watermark drop on a full channel (Codex finding, confirmed). During an explicit RunGC() over a backlog larger than KeymapManagerWatermarkChannelSize (default 1024), sync() publishes one watermark per collected segment while the control loop is not draining deletionWatermarkChan (no GC/iterator/boundary request is in flight until runGCPass() runs after sync()). Because the channel is FIFO and publishes are monotonically increasing, it retains the lowest 1024 watermarks and drops the highest ones. refreshDeletionWatermark() then floors at ~1023, so segments above that have their keymap entries durably deleted but their files (and the keyCount/Size they contribute) are not reclaimed. Since deletionScheduledIndex has already advanced past them, recovery only happens when a later segment is collected and publishes a higher watermark — if collection then goes idle, the lag persists. This is correctly documented and is never a premature deletion (the watermark is monotonic), so it's not a correctness blocker, but consider making the published watermark loss-proof — e.g. also store the latest value in an atomic.Int64 that refreshDeletionWatermark reads as a floor — so a single large RunGC pass always fully reclaims in one go regardless of channel size.
…ei-protocol#3660) ## Description - NOTE: Only Merge post [3645](sei-protocol#3645): That makes per-block flush affordable, reclaiming near-per-block crash durability for free. - `littFlushInterval` 100ms -> **5ms** (~one block at Giga throughput) — bounds crash loss to ~a single block (near-WAL durability for receipt bodies). - `ShardingFactor` 16 -> **1**. flushing one segment file is cheaper; sharding mainly helps across multiple disks. ## Testing - reads-off write ceiling **167k** (vs 168k prior) - reads-on **~153–157k** writes / 58k point-reads/s / 520 getLogs/s @ 7ms - 3h GC soak **flat** (old −27% control-loop step-down gone), 0 panics Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>


Describe your changes and provide context
Several changes to the threading of the keymap, with potentially large impacts for some workloads.