AI Tool Usage Notice
If you used an AI tool to help draft this issue,
please make sure you have reviewed and validated all content before submitting.
You are responsible for the accuracy and quality of everything in this report.
Low-quality or unreviewed AI-generated submissions may be closed without further investigation.
See our Generative AI Contribution Policy for details.
Describe the bug
TestPartitionCompactor_ShouldCompactOnlyUsersOwnedByTheInstanceOnShardingEnabledAndMultipleInstancesRunning (pkg/compactor) intermittently fails when a compactor does not complete a compaction run within the 60s poll window:
--- FAIL: TestPartitionCompactor_ShouldCompactOnlyUsersOwnedByTheInstanceOnShardingEnabledAndMultipleInstancesRunning (81.65s)
compactor_paritioning_test.go:1217: expected true, got false
Root cause (corrected 2026-06-11): this issue originally framed the failure as a ring-convergence/timeout-sizing problem. Investigation showed the ring is ACTIVE and stable before the poll starts (compactor starting() guarantees it). Critically, the non-partition sibling, which already uses a 120s poll, failed in the same CI run (124.41s on attempt 1), so a 60→120s bump alone cannot be the fix. The dominant cost is the tests' shared testify bucket ClientMock: with numUsers = 100, every bucket operation does an O(expectations) reflective scan under a single mutex (~3,700/3,208 accumulated expectations; ~38% of CPU in findExpectedCall/Arguments.Diff). That makes the first compaction cycle take ~20s on an idle machine and >120s under the 6-7× starvation seen on busy CI runners. An aggravating factor: syncMetasTimeout is coupled to CompactionInterval (5s in these tests), so slow syncs become failed runs that never increment CompactionRunsCompleted, and no poll budget can outlast that.
To Reproduce
Steps to reproduce the behavior:
- Start Cortex (recent master)
- Run repeatedly (flaky on starved runners):
go test -tags "netgo slicelabels" -race -count=3 -run 'TestPartitionCompactor_ShouldCompactOnlyUsersOwnedByTheInstanceOnShardingEnabledAndMultipleInstancesRunning|TestCompactor_ShouldCompactOnlyUsersOwnedByTheInstanceOnShardingEnabledAndMultipleInstancesRunning' ./pkg/compactor/
Expected behavior
The first compaction cycle completes well inside the poll budget even on starved runners (workload proportional to what the test actually asserts), and the partition/non-partition twins use the same poll budget.
Environment:
- Infrastructure: GitHub Actions CI, ubuntu-24.04 (amd64), test job
- Deployment tool: N/A (Go unit test)
Additional Context
CI evidence: run 26632776611 — attempt 1: partition twin 81.65s + non-partition sibling 124.41s (both failed; job 78485485760); attempt 2: 77.67s/122.49s. (gh run view --job serves the latest attempt; attempt-1 logs via the jobs API.)
Fix proposed in #7617: numUsers 100→20 in both twins (removes the O(N²) mock cost), 60s→120s alignment as a seatbelt, and WaitActiveInstanceTimeout per the #7503 pattern. Note: the shuffle-sharding twins are intentionally not covered — the same run's arm64 job failed one of them with a different signature (group ... owned by multiple compactors, an ownership-exclusivity race) that deserves its own issue.
Filed and later corrected from CI failure-log analysis with AI assistance; run links, CPU-profile findings, and cited code paths were reviewed and verified against master before submitting.