fix(compactor): make DeleteLocalSyncFiles tests deterministic (pin ring tokens, drive compaction manually) by sandy2008 · Pull Request #7619 · cortexproject/cortex

sandy2008 · 2026-06-11T00:23:12Z

What this PR does

Makes TestCompactor_DeleteLocalSyncFiles and TestPartitionCompactor_DeleteLocalSyncFiles deterministic, fixing the second-generation arm64 flake of issue #7608 (32.80s ... compactor_test.go:1855: expected true, got false). Test-only change.

Lineage of this flake:

Cleanup meta sync files in compactor for un-owned users. #3851 (2021) introduced the test with CompactionInterval = 10 * time.Minute // We will only call compaction manually. — compaction cycles were driven synchronously by the test.
Hook up partition compaction end to end implementation #6510 (Jan 2025) removed that override, so the test fell back to prepareConfig()'s 5s timer with pre-run jitter, and assertions started sampling timer-driven cycles.
Flaky test: TestCompactor_DeleteLocalSyncFiles (arm64) — Should not be zero, but was 0 #7565 — first observed failure (Should not be zero, but was 0).
fix(compactor): fix flaky TestCompactor_DeleteLocalSyncFiles on arm64 #7567 diagnosed "transient ring-view skew at startup" and changed the wait to a 30s poll on CompactionRunsCompleted >= 2 && len(listTenantsWithMetaSyncDirectories()) > 0. That fixed a real-but-secondary mid-cycle sampling race while leaving the actual root cause in place.
Flaky test: TestCompactor_DeleteLocalSyncFiles (arm64) recurs after #7567 — 30s CompactionRunsCompleted>=2 poll times out #7608 — the same failure recurred on master after fix(compactor): fix flaky TestCompactor_DeleteLocalSyncFiles on arm64 #7567: total test time 32.80s = ~2.8s full-speed setup + the entire 30s poll budget burned. The runner was not slow; the polled condition was unsatisfiable.

Root cause

The compactor ring assigns each instance 512 random tokens (NumTokens is hardcoded in RingConfig.ToLifecyclerConfig), and the test has only 10 fixed users (fnv32a(userID) → ring.Get). With 2 instances × 512 random tokens, there is a measured ≈0.103% (~1-in-960 per twin run) probability that the second compactor owns zero of the 10 users (pooled over ≈11M Monte-Carlo trials across five independent simulations using the real fnv32a user hashes and the ring's strictly-greater token search).

In such a draw, every c2 compaction cycle completes successfully but skips all users, so no meta-sync directory is ever created and #7567's len(dirs) > 0 clause is permanently false — the poll burns its full 30s regardless of timeout. The same draw explains #7565 (c2Users was genuinely 0). No timeout bump can fix a permanently-false condition.

Deterministic reproduction (fail-without-fix evidence): pinning an adversarial token layout into the current test via ShardingRing.TokensFilePath (first compactor gets guard tokens for all 10 users, second only low filler tokens) reproduces the CI failure exactly and on every run: 32.81s locally vs CI's 32.80s, with the identical compactor_test.go:1855: expected true, got false message and c2 logging six clean cycles of "skipping user ... not owned". The adversarial variant is cited here as evidence and intentionally not committed. Under the new test design the same degenerate layout is unconstructible.

How the fix resolves it

Restores #3851's manual-drive structure and pins the ownership split, identically in both twins:

CompactionInterval = 10 * time.Minute with the original // We will only call compaction manually. comment — timer-driven cycles are out of the picture.
New pinnedTokens helper pins per-instance ring tokens via ShardingRing.TokensFilePath: exactly 512 tokens per instance (shorter files are silently topped up with random tokens by the lifecycler — guarded by require.Len), where a guard token fnv32a(user)+1 for every 2nd user makes compactor-1 own user-1,3,5,7,9 and compactor-2 own user-2,4,6,8,10 deterministically: the integer interval (h, h+1) is empty, so the guard is always the first token strictly greater than the user's hash, regardless of the low filler tokens. The split was validated by simulation against the ring's searchToken strictly-greater semantics (exact 5/5, no token collisions; the minimum pairwise hash gap between the test users is the FNV prime 16777619, so guards cannot shadow each other). The helper's comment warns against evenly-spaced token schemes: fnv32a("user-N") hashes step by the FNV prime and resonate with regular spacings, which can degenerate back to a 0/10 split.
The timed CompactionRunsCompleted polls are replaced by direct, synchronous compactUsers() calls on both compactors (the test already did this for c1's final cleanup cycle). This is a mandatory pairing with the 10m interval: c1's old startup poll would otherwise out-wait the now-jittered initial auto-run most of the time.
Before driving ownership-dependent cycles, a single poll waits until both compactors' ring views return two healthy ACTIVE instances via GetAllHealthy(RingOp) — the same operation ownUser queries. c2's own view is already barriered by starting() (it waits until c2 is ACTIVE in its own view, and every KV snapshot containing c2 also contains the earlier-registered c1), but c1's ring watcher ingests c2's registration asynchronously, and the final c1.compactUsers() cleanup depends on c1's view.
Final assertions are exact counts (numUsers/2 for c2, numUsers - numUsers/2 for c1, plus the historical sum invariant) instead of NotZero/NotEqual. This is self-protecting: if pinning were ever silently bypassed, a random draw lands exactly 5/5 only ~25% of the time, so a regression fails loudly and immediately rather than once in a thousand runs.
One consequence of the 10m interval needed handling beyond the original plan: since Hook up partition compaction end to end implementation #6510, running() returns ctx.Err() when the service is stopped during the initial jittered wait (up to 60s at a 10m interval), which StopAndAwaitTerminated reports as a service failure. Other long-interval tests in this suite ignore the stop error entirely (//nolint:errcheck); these tests now use a stricter form that tolerates only context.Canceled, keeping the cleanup's protective value for real failures.

Why not other approaches

Bumping the 30s timeout: the zero-ownership state is permanent, not slow — the satisfiable case completed in well under 10s even on loaded runners, while the unsatisfiable case fails at any timeout.
Synchronous driving without pinned tokens: re-ships the measured 1-in-960 hard failure (c2 owns zero users → exact-count or NotZero assertions fail) — i.e. generation-3 of the same flake, just with a clearer message.
Evenly-spaced / geometric token schemes: validated trap — fnv32a hashes of user-N step by the FNV prime 16777619 and alias regular spacings; a plausible-looking evenly-spaced scheme degenerates to a 0/10 split at 512 tokens. Hence guard tokens, which are provably correct in one sentence.
Keeping the 5s interval + counter polls: leaves assertions coupled to timer phase (the original Flaky test: TestCompactor_DeleteLocalSyncFiles (arm64) — Should not be zero, but was 0 #7565 sampling race) on top of the ownership draw.

Scope

Test-only: pkg/compactor/compactor_test.go, pkg/compactor/compactor_paritioning_test.go (both twins get byte-identical treatment), plus one CHANGELOG.md line. No production changes, no prepareConfig() changes.
Zero overlap with the in-flight work for Flaky test: TestPartitionCompactor_ShouldCompactOnlyUsersOwnedByTheInstanceOnShardingEnabledAndMultipleInstancesRunning — 60s CompactionRunsCompleted poll times out #7607 (TestPartitionCompactor_ShouldCompactOnlyUsersOwnedByTheInstance...), which touches different functions in the partition test file.
Follow-up note: the Flaky test: TestPartitionCompactor_ShouldCompactOnlyUsersOwnedByTheInstanceOnShardingEnabledAndMultipleInstancesRunning — 60s CompactionRunsCompleted poll times out #7607 test family shares the same random-token substrate; once both fixes land, a small shared test utility for pinned tokens could be factored out (kept out of this PR to avoid cross-PR churn).
The periodic-ticker code path loses no coverage: ~15 other tests in the package still exercise timer-driven compaction runs.

Which issue(s) this PR fixes:
Fixes #7608

References #7565, #7567 (supersedes #7567's poll-based approach).

Checklist

Tests updated
Documentation added — N/A (test-only; no flags or config changed, make doc not required)
CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]
docs/configuration/v1-guarantees.md updated if this PR introduces experimental flags — N/A

Test plan (all on the branch, Apple Silicon arm64)

go test -tags "netgo slicelabels" -count=20 -run 'TestCompactor_DeleteLocalSyncFiles$|TestPartitionCompactor_DeleteLocalSyncFiles$' ./pkg/compactor/ — ok, 99.1s total for 40 test executions ≈ 2.5s per test (previously ~8.5s healthy, 32.8s when the flake hit)
Same with -race -count=5 — ok, 27.1s
GOMAXPROCS=2 go test -tags "netgo slicelabels" -count=1 -run '...' ./pkg/compactor/ — ok, 5.5s
Full package go test -tags "netgo slicelabels" ./pkg/compactor/ — ok, 92.5s
go vet -tags "netgo slicelabels" ./pkg/compactor/ — clean; gofmt / goimports -local github.com/cortexproject/cortex — clean
Pinned ownership split verified by standalone simulation against searchToken semantics: c1 = user-1,3,5,7,9, c2 = user-2,4,6,8,10 (exact 5/5); all guards far below MaxUint32; max filler token 1014 vs minimum user hash 1986520588.

Per the project's GenAI policy: this contribution was developed with substantial AI assistance (Claude Code) — root-cause analysis, Monte-Carlo simulations, code, and verification — and was reviewed and submitted under the DCO by the contributor.

🤖 Generated with Claude Code

…ng tokens, drive compaction manually) TestCompactor_DeleteLocalSyncFiles and its partition twin relied on timer-driven compaction cycles and on the random per-instance ring tokens drawn at startup. With 2 compactors x 512 random tokens and only 10 fixed users there is a measured ~0.103% (~1-in-960 per run) chance that the second compactor owns zero of the users. In such runs every c2 cycle completes without ever creating a meta-sync directory, so the condition polled since cortexproject#7567 (CompactionRunsCompleted >= 2 && len(dirs) > 0) is permanently false and the test burns the entire 30s budget: the exact 32.8s arm64 CI failure of cortexproject#7608, and the true root cause of cortexproject#7565 (cortexproject#7567's "transient ring-view skew" was a misdiagnosis - no timeout can fix a permanently false condition). Restore the manual-drive structure the test had when introduced in cortexproject#3851 (removed by cortexproject#6510) and pin the ownership split, identically in both twins: - CompactionInterval = 10m: timers are out of the picture; the test drives compaction cycles itself. - Pin per-instance ring tokens via ShardingRing.TokensFilePath (new pinnedTokens helper): a guard token fnv32a(user)+1 for alternating users makes compactor-1 own user-1,3,5,7,9 and compactor-2 own user-2,4,6,8,10 deterministically, since the integer interval (h, h+1) is empty and ring lookup picks the first token strictly greater than the user hash. - Replace the timed CompactionRunsCompleted polls with direct compactUsers() calls on both compactors, gated by one poll waiting until BOTH compactors' ring views see two healthy ACTIVE instances. - Assert exact ownership counts (numUsers/2 each) instead of NotZero/NotEqual, so any silent un-pinning fails loudly. - Tolerate (only) context.Canceled from StopAndAwaitTerminated in the test cleanup: with a 10m interval the compactor is usually still in its initial jittered wait when stopped, and running() returns ctx.Err() there since cortexproject#6510. Supersedes cortexproject#7567's poll-based approach. Fixes cortexproject#7608 Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com>

pull-request-size Bot added the size/L label Jun 11, 2026

dosubot Bot added component/compactor go Pull requests that update Go code type/flaky-test labels Jun 11, 2026

sandy2008 mentioned this pull request Jun 11, 2026

Flaky test: TestCompactor_DeleteLocalSyncFiles (arm64) recurs after #7567 — 30s CompactionRunsCompleted>=2 poll times out #7608

Open

sandy2008 force-pushed the fix/flaky-compactor-delete-local-sync-files-deterministic branch 2 times, most recently from 76bfdf7 to a0d54c6 Compare June 11, 2026 07:28

sandy2008 force-pushed the fix/flaky-compactor-delete-local-sync-files-deterministic branch from a0d54c6 to 661d418 Compare June 11, 2026 07:47

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix(compactor): make DeleteLocalSyncFiles tests deterministic (pin ring tokens, drive compaction manually)#7619

fix(compactor): make DeleteLocalSyncFiles tests deterministic (pin ring tokens, drive compaction manually)#7619
sandy2008 wants to merge 1 commit into
cortexproject:masterfrom
sandy2008:fix/flaky-compactor-delete-local-sync-files-deterministic

sandy2008 commented Jun 11, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

sandy2008 commented Jun 11, 2026

What this PR does

Root cause

How the fix resolves it

Why not other approaches

Scope

Test plan (all on the branch, Apple Silicon arm64)

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant