AI Tool Usage Notice
If you used an AI tool to help draft this issue,
please make sure you have reviewed and validated all content before submitting.
You are responsible for the accuracy and quality of everything in this report.
Low-quality or unreviewed AI-generated submissions may be closed without further investigation.
See our Generative AI Contribution Policy for details.
Describe the bug
TestCompactor_DeleteLocalSyncFiles (pkg/compactor) is flaky again on arm64, despite the fix in #7567 (which closed #7565). It times out on the poll that #7567 added:
Root cause (corrected 2026-06-11; this is NOT a timing problem): with 2 compactors × 512 randomly generated ring tokens and only 10 fixed test users, the probability that the second compactor's tokens own zero of the 10 users is ≈0.103%, about 1 in 960 per run (pooled ~11M Monte-Carlo trials over the real fnv32a user hashes). In such a run every cycle of the second compactor (c2) completes but skips all users, so no meta-sync directory is ever created and #7567's poll condition (CompactionRunsCompleted >= 2 && len(dirs) > 0) is permanently false. The test burns the full 30s and fails (CI fingerprint: fast 2.8s setup + exactly 30s = 32.8s). The failure was reproduced deterministically by adversarially pinning tokens (32.81s locally vs 32.80s in CI, identical message). The same mechanism explains the first-generation flake in #7565. #7567's "transient ring-view skew" diagnosis was a misdiagnosis, which is why its poll could not fix the flake; no finite timeout can.
Run repeatedly (expected failure rate ~1 in 960 per twin):
go test -tags "netgo slicelabels" -count=20 -run 'TestCompactor_DeleteLocalSyncFiles$|TestPartitionCompactor_DeleteLocalSyncFiles$' ./pkg/compactor/
(Deterministic repro: pin per-instance ring tokens so c2 owns zero users.)
Expected behavior
The test's user→compactor ownership is deterministic (pinned per-instance tokens), compaction cycles are driven synchronously instead of sampled from a timer loop, and the assertions check the exact expected split — eliminating the probabilistic failure class entirely.
Related: #7565 (original report, closed), #7567 (poll-based fix that could not address the real mechanism).
Fix proposed in #7619: pin per-instance ring tokens (fnv32a(user)+1 guard tokens, exact 5/5 split), restore the test's original manual-drive structure (#3851), and assert exact ownership counts — covering both twins.
Filed and later corrected from CI failure-log analysis with AI assistance; the run link, Monte-Carlo methodology, deterministic reproduction, and cited code paths were reviewed and verified against master before submitting.
To Reproduce
Steps to reproduce the behavior:
40a27ad or later (i.e. with "fix(compactor): fix flaky TestCompactor_DeleteLocalSyncFiles on arm64", #7567, merged)
Environment:
ubuntu-24.04-arm (arm64), test-no-race job
Additional Context
Observed on master CI run (2026-06-08), after #7567 merged: https://github.com/cortexproject/cortex/actions/runs/27123579544 (job test-no-race (arm64) 80046345033).