
Flaky test: TestPartitionCompactor_ShouldCompactOnlyUsersOwnedByTheInstanceOnShardingEnabledAndMultipleInstancesRunning — 60s CompactionRunsCompleted poll times out #7607

@sandy2008

AI Tool Usage Notice
If you used an AI tool to help draft this issue,
please make sure you have reviewed and validated all content before submitting.
You are responsible for the accuracy and quality of everything in this report.
Low-quality or unreviewed AI-generated submissions may be closed without further investigation.
See our Generative AI Contribution Policy for details.

Describe the bug

TestPartitionCompactor_ShouldCompactOnlyUsersOwnedByTheInstanceOnShardingEnabledAndMultipleInstancesRunning (pkg/compactor) intermittently fails when a compactor does not complete a compaction run within the 60s poll window:

--- FAIL: TestPartitionCompactor_ShouldCompactOnlyUsersOwnedByTheInstanceOnShardingEnabledAndMultipleInstancesRunning (81.65s)
    compactor_paritioning_test.go:1217: expected true, got false

Root cause (corrected 2026-06-11): this issue originally framed the failure as a ring-convergence and timeout-sizing problem. Investigation showed that the ring is ACTIVE and stable before the poll starts (the compactor's starting() guarantees it). Critically, the non-partition sibling, which already uses a 120s poll, failed in the same CI run (124.41s on attempt 1), so a 60s→120s bump alone cannot be the fix.

The dominant cost is the shared testify bucket ClientMock used by these tests. With numUsers = 100, every bucket operation performs an O(expectations) reflective scan under a single mutex (~3,700/3,208 accumulated expectations; ~38% of CPU in findExpectedCall/Arguments.Diff). As a result, the first compaction cycle takes ~20s on an idle machine and >120s under the 6-7× starvation seen on busy CI runners.

An aggravating factor: syncMetasTimeout is coupled to CompactionInterval (5s in these tests), so a slow sync turns into a failed run, and failed runs never increment CompactionRunsCompleted. No poll budget can outlast that.
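To illustrate why a larger poll budget does not help once the sync deadline is tied to the interval, here is a minimal, deterministic model in Go. It is a sketch under stated assumptions, not the compactor's code: completedRuns, the 8s sync cost, and the simplified loop are all hypothetical; only the 5s interval and the 60s/120s budgets come from this report.

```go
package main

import (
	"fmt"
	"time"
)

// completedRuns models a compaction loop in which every run gets a sync
// deadline equal to the compaction interval. A run whose sync needs longer
// than that deadline fails and is not counted. Hypothetical helper: it only
// mirrors the coupling described in this issue, not the real compactor.
func completedRuns(syncCost, interval, pollBudget time.Duration) int {
	if syncCost <= 0 || interval <= 0 {
		return 0 // guard against a loop that never advances
	}
	completed := 0
	for elapsed := time.Duration(0); elapsed < pollBudget; {
		if syncCost > interval {
			// The sync hits its deadline: the run fails after `interval`
			// and the completed counter stays where it was.
			elapsed += interval
			continue
		}
		elapsed += syncCost
		if elapsed <= pollBudget {
			completed++
		}
	}
	return completed
}

func main() {
	interval := 5 * time.Second // CompactionInterval in these tests, per the issue
	slowSync := 8 * time.Second // assumed sync cost on a starved runner
	for _, budget := range []time.Duration{60 * time.Second, 120 * time.Second} {
		fmt.Printf("budget=%v completed=%d\n", budget, completedRuns(slowSync, interval, budget))
	}
}
```

In this model, any sync cost above the interval yields zero completed runs regardless of the budget, which is consistent with the 120s sibling failing in the same CI run.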

To Reproduce

Steps to reproduce the behavior:

  1. Check out Cortex (recent master); no running Cortex instance is needed, since this is a Go unit test
  2. Run repeatedly (flaky on starved runners):
    go test -tags "netgo slicelabels" -race -count=3 -run 'TestPartitionCompactor_ShouldCompactOnlyUsersOwnedByTheInstanceOnShardingEnabledAndMultipleInstancesRunning|TestCompactor_ShouldCompactOnlyUsersOwnedByTheInstanceOnShardingEnabledAndMultipleInstancesRunning' ./pkg/compactor/
    

Expected behavior

The first compaction cycle completes well inside the poll budget even on starved runners (workload proportional to what the test actually asserts), and the partition/non-partition twins use the same poll budget.

Environment:

  • Infrastructure: GitHub Actions CI, ubuntu-24.04 (amd64), test job
  • Deployment tool: N/A (Go unit test)

Additional Context

CI evidence: run 26632776611. Attempt 1: partition twin 81.65s + non-partition sibling 124.41s (both failed; job 78485485760). Attempt 2: 77.67s/122.49s. (gh run view --job serves the latest attempt; the attempt-1 logs were retrieved via the jobs API.)

Fix proposed in #7617: numUsers 100→20 in both twins (removes the O(N²) mock cost), 60s→120s alignment as a seatbelt, and WaitActiveInstanceTimeout per the #7503 pattern. Note: the shuffle-sharding twins are intentionally not covered. In the same run, the arm64 job failed one of them with a different signature (group ... owned by multiple compactors, an ownership-exclusivity race) that deserves its own issue.
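On the O(N²) mock cost that the numUsers reduction targets: the sketch below models a mock that stores expectations in a slice and resolves every call by scanning from the front, which is the general shape of the findExpectedCall hot spot reported above. It is not testify's implementation. scanMock, totalComparisons, and the 30 expectations per user are hypothetical, and the mutex and reflective argument diffing are left out so that the example only counts comparisons.

```go
package main

import "fmt"

// scanMock is a stand-in for a mock that keeps expectations in a slice and
// resolves each call with a linear scan. Illustrative only.
type scanMock struct {
	expectations []string
	comparisons  int
}

// on registers one expectation.
func (m *scanMock) on(key string) {
	m.expectations = append(m.expectations, key)
}

// call scans the expectations from the front until it finds a match,
// counting every comparison it makes along the way.
func (m *scanMock) call(key string) bool {
	for _, e := range m.expectations {
		m.comparisons++
		if e == key {
			return true
		}
	}
	return false
}

// totalComparisons registers perUser expectations for each user, calls each
// of them once, and returns the number of comparisons performed.
func totalComparisons(users, perUser int) int {
	m := &scanMock{}
	for u := 0; u < users; u++ {
		for i := 0; i < perUser; i++ {
			m.on(fmt.Sprintf("user-%d/op-%d", u, i))
		}
	}
	for u := 0; u < users; u++ {
		for i := 0; i < perUser; i++ {
			m.call(fmt.Sprintf("user-%d/op-%d", u, i))
		}
	}
	return m.comparisons
}

func main() {
	const perUser = 30 // assumed number of expectations per user
	for _, users := range []int{100, 20} {
		fmt.Printf("users=%d comparisons=%d\n", users, totalComparisons(users, perUser))
	}
}
```

With n expectations and one call per expectation, this model performs n(n+1)/2 comparisons, so cutting the user count by 5× reduces the scan work by roughly 25×. Real numbers depend on how many expectations each user contributes and how often each is hit.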

Filed and later corrected from CI failure-log analysis with AI assistance; run links, CPU-profile findings, and cited code paths were reviewed and verified against master before submitting.
