Skip to content

fix(compactor): make DeleteLocalSyncFiles tests deterministic (pin ring tokens, drive compaction manually)#7619

Open
sandy2008 wants to merge 1 commit into
cortexproject:masterfrom
sandy2008:fix/flaky-compactor-delete-local-sync-files-deterministic
Open

fix(compactor): make DeleteLocalSyncFiles tests deterministic (pin ring tokens, drive compaction manually)#7619
sandy2008 wants to merge 1 commit into
cortexproject:masterfrom
sandy2008:fix/flaky-compactor-delete-local-sync-files-deterministic

Conversation

@sandy2008

Copy link
Copy Markdown
Contributor

What this PR does

Makes TestCompactor_DeleteLocalSyncFiles and TestPartitionCompactor_DeleteLocalSyncFiles deterministic, fixing the second-generation arm64 flake of issue #7608 (32.80s ... compactor_test.go:1855: expected true, got false). Test-only change.

Lineage of this flake:

  1. Cleanup meta sync files in compactor for un-owned users. #3851 (2021) introduced the test with CompactionInterval = 10 * time.Minute // We will only call compaction manually. — compaction cycles were driven synchronously by the test.
  2. Hook up partition compaction end to end implementation #6510 (Jan 2025) removed that override, so the test fell back to prepareConfig()'s 5s timer with pre-run jitter, and assertions started sampling timer-driven cycles.
  3. Flaky test: TestCompactor_DeleteLocalSyncFiles (arm64) — Should not be zero, but was 0 #7565 — first observed failure (Should not be zero, but was 0).
  4. fix(compactor): fix flaky TestCompactor_DeleteLocalSyncFiles on arm64 #7567 diagnosed "transient ring-view skew at startup" and changed the wait to a 30s poll on CompactionRunsCompleted >= 2 && len(listTenantsWithMetaSyncDirectories()) > 0. That fixed a real-but-secondary mid-cycle sampling race while leaving the actual root cause in place.
  5. Flaky test: TestCompactor_DeleteLocalSyncFiles (arm64) recurs after #7567 — 30s CompactionRunsCompleted>=2 poll times out #7608 — the same failure recurred on master after fix(compactor): fix flaky TestCompactor_DeleteLocalSyncFiles on arm64 #7567: total test time 32.80s = ~2.8s full-speed setup + the entire 30s poll budget burned. The runner was not slow; the polled condition was unsatisfiable.

Root cause

The compactor ring assigns each instance 512 random tokens (NumTokens is hardcoded in RingConfig.ToLifecyclerConfig), and the test has only 10 fixed users (fnv32a(userID)ring.Get). With 2 instances × 512 random tokens, there is a measured ≈0.103% (~1-in-960 per twin run) probability that the second compactor owns zero of the 10 users (pooled over ≈11M Monte-Carlo trials across five independent simulations using the real fnv32a user hashes and the ring's strictly-greater token search).

In such a draw, every c2 compaction cycle completes successfully but skips all users, so no meta-sync directory is ever created and #7567's len(dirs) > 0 clause is permanently false — the poll burns its full 30s regardless of timeout. The same draw explains #7565 (c2Users was genuinely 0). No timeout bump can fix a permanently-false condition.

Deterministic reproduction (fail-without-fix evidence): pinning an adversarial token layout into the current test via ShardingRing.TokensFilePath (first compactor gets guard tokens for all 10 users, second only low filler tokens) reproduces the CI failure exactly and on every run: 32.81s locally vs CI's 32.80s, with the identical compactor_test.go:1855: expected true, got false message and c2 logging six clean cycles of "skipping user ... not owned". The adversarial variant is cited here as evidence and intentionally not committed. Under the new test design the same degenerate layout is unconstructible.

How the fix resolves it

Restores #3851's manual-drive structure and pins the ownership split, identically in both twins:

  • CompactionInterval = 10 * time.Minute with the original // We will only call compaction manually. comment — timer-driven cycles are out of the picture.
  • New pinnedTokens helper pins per-instance ring tokens via ShardingRing.TokensFilePath: exactly 512 tokens per instance (shorter files are silently topped up with random tokens by the lifecycler — guarded by require.Len), where a guard token fnv32a(user)+1 for every 2nd user makes compactor-1 own user-1,3,5,7,9 and compactor-2 own user-2,4,6,8,10 deterministically: the integer interval (h, h+1) is empty, so the guard is always the first token strictly greater than the user's hash, regardless of the low filler tokens. The split was validated by simulation against the ring's searchToken strictly-greater semantics (exact 5/5, no token collisions; the minimum pairwise hash gap between the test users is the FNV prime 16777619, so guards cannot shadow each other). The helper's comment warns against evenly-spaced token schemes: fnv32a("user-N") hashes step by the FNV prime and resonate with regular spacings, which can degenerate back to a 0/10 split.
  • The timed CompactionRunsCompleted polls are replaced by direct, synchronous compactUsers() calls on both compactors (the test already did this for c1's final cleanup cycle). This is a mandatory pairing with the 10m interval: c1's old startup poll would otherwise out-wait the now-jittered initial auto-run most of the time.
  • Before driving ownership-dependent cycles, a single poll waits until both compactors' ring views return two healthy ACTIVE instances via GetAllHealthy(RingOp) — the same operation ownUser queries. c2's own view is already barriered by starting() (it waits until c2 is ACTIVE in its own view, and every KV snapshot containing c2 also contains the earlier-registered c1), but c1's ring watcher ingests c2's registration asynchronously, and the final c1.compactUsers() cleanup depends on c1's view.
  • Final assertions are exact counts (numUsers/2 for c2, numUsers - numUsers/2 for c1, plus the historical sum invariant) instead of NotZero/NotEqual. This is self-protecting: if pinning were ever silently bypassed, a random draw lands exactly 5/5 only ~25% of the time, so a regression fails loudly and immediately rather than once in a thousand runs.
  • One consequence of the 10m interval needed handling beyond the original plan: since Hook up partition compaction end to end implementation #6510, running() returns ctx.Err() when the service is stopped during the initial jittered wait (up to 60s at a 10m interval), which StopAndAwaitTerminated reports as a service failure. Other long-interval tests in this suite ignore the stop error entirely (//nolint:errcheck); these tests now use a stricter form that tolerates only context.Canceled, keeping the cleanup's protective value for real failures.

Why not other approaches

  • Bumping the 30s timeout: the zero-ownership state is permanent, not slow — the satisfiable case completed in well under 10s even on loaded runners, while the unsatisfiable case fails at any timeout.
  • Synchronous driving without pinned tokens: re-ships the measured 1-in-960 hard failure (c2 owns zero users → exact-count or NotZero assertions fail) — i.e. generation-3 of the same flake, just with a clearer message.
  • Evenly-spaced / geometric token schemes: validated trap — fnv32a hashes of user-N step by the FNV prime 16777619 and alias regular spacings; a plausible-looking evenly-spaced scheme degenerates to a 0/10 split at 512 tokens. Hence guard tokens, which are provably correct in one sentence.
  • Keeping the 5s interval + counter polls: leaves assertions coupled to timer phase (the original Flaky test: TestCompactor_DeleteLocalSyncFiles (arm64) — Should not be zero, but was 0 #7565 sampling race) on top of the ownership draw.

Scope

Which issue(s) this PR fixes:
Fixes #7608

References #7565, #7567 (supersedes #7567's poll-based approach).

Checklist

  • Tests updated
  • Documentation added — N/A (test-only; no flags or config changed, make doc not required)
  • CHANGELOG.md updated - the order of entries should be [CHANGE], [FEATURE], [ENHANCEMENT], [BUGFIX]
  • docs/configuration/v1-guarantees.md updated if this PR introduces experimental flags — N/A

Test plan (all on the branch, Apple Silicon arm64)

  • go test -tags "netgo slicelabels" -count=20 -run 'TestCompactor_DeleteLocalSyncFiles$|TestPartitionCompactor_DeleteLocalSyncFiles$' ./pkg/compactor/ok, 99.1s total for 40 test executions ≈ 2.5s per test (previously ~8.5s healthy, 32.8s when the flake hit)
  • Same with -race -count=5 — ok, 27.1s
  • GOMAXPROCS=2 go test -tags "netgo slicelabels" -count=1 -run '...' ./pkg/compactor/ — ok, 5.5s
  • Full package go test -tags "netgo slicelabels" ./pkg/compactor/ — ok, 92.5s
  • go vet -tags "netgo slicelabels" ./pkg/compactor/ — clean; gofmt / goimports -local github.com/cortexproject/cortex — clean
  • Pinned ownership split verified by standalone simulation against searchToken semantics: c1 = user-1,3,5,7,9, c2 = user-2,4,6,8,10 (exact 5/5); all guards far below MaxUint32; max filler token 1014 vs minimum user hash 1986520588.

Per the project's GenAI policy: this contribution was developed with substantial AI assistance (Claude Code) — root-cause analysis, Monte-Carlo simulations, code, and verification — and was reviewed and submitted under the DCO by the contributor.

🤖 Generated with Claude Code

…ng tokens, drive compaction manually)

TestCompactor_DeleteLocalSyncFiles and its partition twin relied on
timer-driven compaction cycles and on the random per-instance ring tokens
drawn at startup. With 2 compactors x 512 random tokens and only 10 fixed
users there is a measured ~0.103% (~1-in-960 per run) chance that the
second compactor owns zero of the users. In such runs every c2 cycle
completes without ever creating a meta-sync directory, so the condition
polled since cortexproject#7567 (CompactionRunsCompleted >= 2 && len(dirs) > 0) is
permanently false and the test burns the entire 30s budget: the exact
32.8s arm64 CI failure of cortexproject#7608, and the true root cause of cortexproject#7565
(cortexproject#7567's "transient ring-view skew" was a misdiagnosis - no timeout can
fix a permanently false condition).

Restore the manual-drive structure the test had when introduced in cortexproject#3851
(removed by cortexproject#6510) and pin the ownership split, identically in both twins:

- CompactionInterval = 10m: timers are out of the picture; the test
  drives compaction cycles itself.
- Pin per-instance ring tokens via ShardingRing.TokensFilePath (new
  pinnedTokens helper): a guard token fnv32a(user)+1 for alternating
  users makes compactor-1 own user-1,3,5,7,9 and compactor-2 own
  user-2,4,6,8,10 deterministically, since the integer interval (h, h+1)
  is empty and ring lookup picks the first token strictly greater than
  the user hash.
- Replace the timed CompactionRunsCompleted polls with direct
  compactUsers() calls on both compactors, gated by one poll waiting
  until BOTH compactors' ring views see two healthy ACTIVE instances.
- Assert exact ownership counts (numUsers/2 each) instead of
  NotZero/NotEqual, so any silent un-pinning fails loudly.
- Tolerate (only) context.Canceled from StopAndAwaitTerminated in the
  test cleanup: with a 10m interval the compactor is usually still in its
  initial jittered wait when stopped, and running() returns ctx.Err()
  there since cortexproject#6510.

Supersedes cortexproject#7567's poll-based approach.

Fixes cortexproject#7608

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com>
@sandy2008 sandy2008 force-pushed the fix/flaky-compactor-delete-local-sync-files-deterministic branch from a0d54c6 to 661d418 Compare June 11, 2026 07:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Flaky test: TestCompactor_DeleteLocalSyncFiles (arm64) recurs after #7567 — 30s CompactionRunsCompleted>=2 poll times out

1 participant