Per-block littidx flush + single shard (gated on #3645)#3660
Conversation
With litt's async keymap flush (#3645), flushing ~once per block is cheap, so: - littFlushInterval 100ms -> 5ms (~one block at Giga throughput), bounding crash loss to roughly a single block (near-WAL durability) - ShardingFactor 16 -> 1 (flushing a single segment file is cheaper; sharding mainly helps across multiple disks) DO NOT MERGE until #3645 (litt async keymap) lands: without it, flushing this often regresses write throughput ~48%. Validated via cryptosim (main littidx + #3645 + parallel getLogs): - reads-off write ceiling 167k (vs 168k prior) - reads-on ~153-157k writes / 58k point-reads/s / 520 getLogs/s @ 7ms - 3h GC soak flat (old -27% control-loop step-down gone), 0 panics Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).
|
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #3660 +/- ##
==========================================
- Coverage 59.14% 58.18% -0.97%
==========================================
Files 2262 2178 -84
Lines 187031 177408 -9623
==========================================
- Hits 110625 103226 -7399
+ Misses 66456 65023 -1433
+ Partials 9950 9159 -791
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
PR SummaryMedium Risk Overview The receipts litt table Reviewed by Cursor Bugbot for commit 362413b. Bugbot is set up for automated code reviews on this repo. Configure here. |
There was a problem hiding this comment.
A small, well-documented tuning PR that lowers the receipt-store litt flush interval (100ms → 5ms) and drops the sharding factor (16 → 1). The changes are correct and internally consistent; the only notes are process-related (merge ordering vs. #3645) and that the second-opinion reviews produced no findings.
Findings: 0 blocking | 4 non-blocking | 0 posted inline
Blockers
- None at the file/PR level.
Non-blocking
- Merge-ordering dependency: the PR is only safe to merge after #3645 (async keymap flush), as the author notes and the code comments state. The 5ms per-block flush is described as regressing write throughput by ≈-48% without that change. Confirm #3645 has landed before merging this.
- The 5ms ticker calls receipts.Flush() ~200x/sec continuously, including when there are no new writes. This is fine since Flush on a clean table is a cheap no-op and time.Ticker coalesces missed ticks (no goroutine pileup), but it is worth confirming Flush() is genuinely cheap on an empty/already-flushed table to avoid idle-CPU overhead.
- Second-opinion reviews produced no actionable output: codex-review.md reports 'No material findings' and cursor-review.md is empty. REVIEW_GUIDELINES.md is also empty, so no repo-specific standards were applied.
- ShardingFactor is changed to 1 with the rationale that sharding mainly helps across multiple disks. If a future deployment uses multiple disks for the receipt store, this tuning would need to be revisited — consider whether this should eventually be config-driven rather than a hardcoded constant.
cody-littley
left a comment
There was a problem hiding this comment.
Comments are true only once I merge my branch, but IMO not a big deal since that's merging soon.
My only concern is that timing might not provide an ironclad crash safety guarantee. But this change by itself doesn't make the problem any worse, so no need to block it while we have this discussion.
| // littFlushInterval is roughly one flush per block at Giga throughput (a | ||
| // block is ~7ms), bounding crash loss to about a single block. Flushing this | ||
| // often is only cheap because litt flushes its keymap asynchronously, off the | ||
| // control loop; without that, a per-block flush regresses write throughput | ||
| // badly (≈-48% observed before the async keymap landed). | ||
| littFlushInterval = 5 * time.Millisecond |
There was a problem hiding this comment.
Are we requiring one flush per block for crash recovery safety? If so, doing it via wall clock time seems a bit fragile.
Description
littFlushInterval100ms -> 5ms (~one block at Giga throughput) — bounds crash loss to ~a single block (near-WAL durability for receipt bodies).ShardingFactor16 -> 1. flushing one segment file is cheaper; sharding mainly helps across multiple disks.Testing