Flaky test: TestCompactor_DeleteLocalSyncFiles (arm64) recurs after #7567 — 30s CompactionRunsCompleted>=2 poll times out #7608

@sandy2008

Describe the bug

TestCompactor_DeleteLocalSyncFiles (pkg/compactor) is flaky again on arm64, despite the fix in #7567 (which closed #7565). It times out on the poll that #7567 added:

--- FAIL: TestCompactor_DeleteLocalSyncFiles (32.80s)
    compactor_test.go:1855: expected true, got false
FAIL	github.com/cortexproject/cortex/pkg/compactor

Root cause (corrected 2026-06-11; this is NOT a timing problem): with 2 compactors × 512 randomly generated ring tokens and only 10 fixed test users, the probability that the second compactor's (c2) tokens own none of the 10 users is ≈0.103%, about 1 in 960 per run (pooled over ~11M Monte-Carlo trials using the real fnv32a user hashes).

In such a run every c2 cycle completes but skips all users, so no meta-sync directory is ever created and #7567's poll condition (CompactionRunsCompleted >= 2 && len(dirs) > 0) is permanently false. The test burns the full 30s and fails (CI fingerprint: fast 2.8s setup + exactly 30s = 32.8s).

The failure was reproduced deterministically by adversarially pinning tokens (32.81s locally vs 32.80s in CI, identical message). The same mechanism explains the original flake reported in #7565; #7567's "transient ring-view skew" diagnosis was a misdiagnosis, which is why its poll could not fix the flake. No finite timeout can.

To Reproduce

Steps to reproduce the behavior:

  1. Check out Cortex master @ 40a27ad or later (i.e. with #7567, "fix(compactor): fix flaky TestCompactor_DeleteLocalSyncFiles on arm64", merged)
  2. Run repeatedly (expected failure rate ~1 in 960 per run of each twin test):
    go test -tags "netgo slicelabels" -count=20 -run 'TestCompactor_DeleteLocalSyncFiles$|TestPartitionCompactor_DeleteLocalSyncFiles$' ./pkg/compactor/
    
    (Deterministic repro: pin per-instance ring tokens so c2 owns zero users.)
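
As a rough guide to how often the command above should show a failure: assuming each of the two twin tests fails independently with probability p ≈ 1/960 per run, one `-count=20` invocation performs n = 40 test runs, so

$$
P(\text{at least one failure}) = 1 - (1 - p)^{n} = 1 - \left(1 - \tfrac{1}{960}\right)^{40} \approx 0.041
$$

That is about a 4% chance per invocation, or on the order of 25 invocations before a failure is expected. The independence assumption is mine; the per-run rate is the one stated in this issue.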

Expected behavior

The test's user→compactor ownership is deterministic (pinned per-instance tokens), compaction cycles are driven synchronously instead of sampled from a timer loop, and the assertions check the exact expected split — eliminating the probabilistic failure class entirely.

Environment:

  • Infrastructure: GitHub Actions CI, ubuntu-24.04-arm (arm64), test-no-race job
  • Deployment tool: N/A (Go unit test)

Additional Context

Observed on master CI run (2026-06-08), after #7567 merged: https://github.com/cortexproject/cortex/actions/runs/27123579544 (job test-no-race (arm64) 80046345033).

Related: #7565 (original report, closed), #7567 (poll-based fix that could not address the real mechanism).
Fix proposed in #7619: pin per-instance ring tokens (fnv32a(user)+1 guard tokens, exact 5/5 split), restore the test's original manual-drive structure (#3851), and assert exact ownership counts — covering both twins.

Filed and later corrected from CI failure-log analysis with AI assistance; the run link, Monte-Carlo methodology, deterministic reproduction, and cited code paths were reviewed and verified against master before submitting.
