fix(querier): de-flake TestQuerierWithBlocksStorageOnMissingBlocksFromStorage by waiting for ACTIVE store-gateway in querier ring view#7615
Open
sandy2008 wants to merge 1 commit into
Conversation
…estQuerierWithBlocksStorageOnMissingBlocksFromStorage The first happy-path query in this integration test intermittently failed with a 500 on arm64 CI (issue cortexproject#7605). Decoding the gzipped 500 response body logged by the querier in both failing runs shows the failure is querier-local and the query never reached the store-gateway: expanding series: failed to get store-gateway replication set owning the block <ULID>: at least 1 healthy replica required, could only find 0 - unhealthy instances: 172.18.0.8:9095 The store-gateway registers in the ring with all its tokens in JOINING state (pkg/storegateway/gateway.go:461), runs the initial blocks sync (which is what drives cortex_bucket_store_blocks_loaded to 1), and only then switches to ACTIVE (gateway.go:333). The test's existing waits (querier ring tokens 512*2, store-gateway ring tokens 512, blocks_loaded 1) are therefore all satisfiable while the querier's view of the store-gateway ring still holds the instance in JOINING, and the BlocksRead ring operation admits ACTIVE instances only (pkg/storegateway/gateway_ring.go:49-55), so the first query fails with the error above (pkg/ring/replication_strategy.go:93, wrapped at pkg/querier/blocks_store_replicated_set.go:127). The window is structural: the querier's consul watch is rate-limited to 1 req/s (pkg/ring/kv/consul/client.go:79), so the JOINING->ACTIVE flip can reach the querier's ring view over a second after the store-gateway CASed it, while all three metric waits can pass on their first poll. Close the race by additionally waiting until the querier exposes cortex_ring_members{name="store-gateway-client",state="ACTIVE"} == 1 before the first query - the same inline readiness wait used to fix the identical race in backward_compatibility_test.go (cortexproject#5975, 32bd46f). The waited gauge is computed from the same mutex-guarded ring descriptor that the querier's GetClientsFor consults, so the wait condition is the exact negation of the failure condition. The deliberate post-deletion 500 assertions are unchanged. Fixes cortexproject#7605 Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com>
73bc245 to
9101b6f
Compare
SungJin1212
reviewed
Jun 12, 2026
| * [BUGFIX] Query Frontend: Fix native histogram responses not being handled correctly in `minTime()` sort ordering for split_by_interval merge. #7555 | ||
| * [BUGFIX] Distributor: Release the push worker pool goroutines on shutdown by stopping the async executor during the stopping phase when `-distributor.num-push-workers` is set. #7602 | ||
| * [BUGFIX] Querier: Fix unbounded resource leak in the bucket-scan blocks finder (used when the bucket index is disabled). Per-tenant metadata fetchers, their Prometheus registries, and on-disk meta caches are now evicted once a tenant is no longer active, instead of being retained for the lifetime of the process. #7573 | ||
| * [BUGFIX] Querier: Fix flake in integration test TestQuerierWithBlocksStorageOnMissingBlocksFromStorage by waiting for the querier to see the store-gateway ACTIVE in the ring before the first query. #7615 |
Member
There was a problem hiding this comment.
Can you delete the changelog since this PR only fixes tests with no user-facing behavior change?
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What this PR does
De-flakes
TestQuerierWithBlocksStorageOnMissingBlocksFromStorage(integration_querier, observed on arm64): the first happy-path query — before any blocks are deleted — intermittently failed withserver_error: server error: 500atquerier_test.go:367. This is a test-only change: one additional readiness wait plus aCHANGELOG.mdline. No production code is touched, and the test's deliberate post-deletion 500 assertions are unchanged.Root cause (decoded from the CI logs)
The querier logs the 500 response body gzipped inside the
Response:field of its request log line. Unescaping the Go-quoted bytes and gunzipping them yields the same body in both failing runs (modulo block ULID):16:20:11.513, server latency 5.19ms, block01KT4J55J7SX01JD4N5RD8F3WP.01:15:25.864, server latency 4.13ms, block01KSV76AWZPKY9H51G4JCGTRYX.The 4–5ms latency means the error is querier-local: the query never reached the store-gateway. The mechanism:
pkg/storegateway/gateway.go:461,OnRingInstanceRegister), runsInitialSync(gateway.go:325— this is what drivescortex_bucket_store_blocks_loadedto 1), and only then CASes itself to ACTIVE (gateway.go:333).cortex_ring_tokens_total) already at JOINING, and the other two waits are store-gateway-local metrics.BlocksReadring operation admits ACTIVE instances only (pkg/storegateway/gateway_ring.go:49-55); withIgnoreUnhealthyInstancesReplicationStrategy(pkg/querier/blocks_store_queryable.go:236) and the single store-gateway this test runs, a JOINING instance yields exactlyat least 1 healthy replica required, could only find 0 - unhealthy instances: ...(pkg/ring/replication_strategy.go:93), wrapped atpkg/querier/blocks_store_replicated_set.go:127into the decoded message above → HTTP 500.WatchKeypoll is rate-limited to 1 req/s by default (pkg/ring/kv/consul/client.go:79), so after consuming the JOINING registration update the querier learns about the ACTIVE flip ≥ ~1s later (longer under arm64 runner contention), while the test's three metric waits can all pass on their first poll (300–600ms backoff).This is not a production bug: JOINING exists precisely to keep
BlocksReadaway from store-gateways that have not completed their initial blocks sync. The gap is purely in the test's readiness conditions — they assert producer-side (store-gateway) progress but never the consumer-side state (the querier's ring view) that the query path actually consults.Correcting the original triage in #7605
The
ResourceExhausted ... exceeded series limit ... limit 1 violated (got 2)store-gateway log lines quoted in the issue were misattributed: they appear at16:20:23— after this test had already failed (16:20:11) — and belong to the next test,TestQuerierWithBlocksStorageLimits, which deliberately sets-querier.max-fetched-series-per-query=1(and passed). This test sets no series limit. Two more negatives ruled out while decoding: the "reuse grpc buffer in querier's store-gateway stream" change is unrelated (the failing query never reaches the store-gateway stream), andcortex_bucket_store_blocks_loaded == 1is the correct expected count (cortex_ingester_shipper_uploads_total == 1is asserted earlier in the test).The fix
One wait inserted after the existing ring/blocks waits and before the first query, so the test waits until the querier's own view of the store-gateway ring shows 1 ACTIVE member:
Why this deterministically closes the window rather than just shrinking it:
cortex_ring_members{name="store-gateway-client",state="ACTIVE"}gauge is computed byupdateRingMetricsfrom the same mutex-guardedringDescthatGetClientsFor → ring.Get → Filterconsults (pkg/ring/ring.go), so the wait condition is the exact negation of the failure condition.cortex_ring_tokens_total == 512*2wait, which guarantees the querier has already processed a non-empty store-gateway ring snapshot;updateRingMetricszero-fills the per-state series (includingACTIVE) on that first snapshot (pkg/ring/ring.go:670-681), so the bare matcher form (noe2e.WaitMissingMetrics) cannot hit "metric not found" here — and if the waits are ever reordered, it fails loudly on the first poll instead of reintroducing a flake (hence the ordering note in the comment).Precedent: commit 32bd46f (#5975) fixed this same store-gateway-not-yet-ACTIVE race with the identical inline wait in
integration/backward_compatibility_test.go.Why not other approaches
integration/util.go: this race is being fixed in two independent PRs (see Scope below); a helper introduced by either would couple their merge order. Three one-line call sites don't yet earn the abstraction, and the in-tree precedent (Fix flakyTestNewDistributorsCanPushToOldIngestersWithReplicationtest due SG not ready #5975) is inline. A helper can be promoted in a later sweep.e2e.WaitMissingMetrics: provably unnecessary after the tokens wait (the ACTIVE series exists, zero-filled, from the first snapshot onward), and it would trade an immediate loud failure on misplacement for silent busy-polling.BlocksRead, querier retry on 0 healthy replicas, register tokens only at ACTIVE): the observed behavior is the designed cold-start guarantee — changing it for a test race would weaken the consistency model during rollouts.Scope
integration/querier_test.go(this test only) + oneCHANGELOG.mdline.16:20:34.929, block01KT4J5W1MDTKP13EGBC8GEH3B, inTestQuerierWithStoreGatewayDataBytesLimits). It is being fixed in a separate PR that adds the same wait toTestQuerierWithStoreGatewayDataBytesLimitsandTestQuerierWithBlocksStorageLimits. The two PRs have zero overlapping hunks and are merge-order independent.integration/query_frontend_test.go:533-536(TestQueryFrontendNoRetryChunkPool) andintegration/native_histogram_test.go:184-186. They are candidates for a follow-up readiness sweep (where a shared helper becomes justified by call-site count), coordinated with the flaky-test umbrellas Flaky TestQuerierWithBlocksStorageRunningInSingleBinaryMode #2856 / TestQuerierWithBlocksStorageRunningInMicroservicesMode is flaky #5306 — without claiming those issues share this root cause.Which issue(s) this PR fixes
Fixes #7605
Checklist
make docnot required)CHANGELOG.mdupdated - the order of entries should be[CHANGE],[FEATURE],[ENHANCEMENT],[BUGFIX]docs/configuration/v1-guarantees.mdupdated if this PR introduces experimental flags — N/ATest plan
gofmt -l/goimports -local github.com/cortexproject/cortex -l integration/querier_test.go— clean.go vet -tags "slicelabels,integration,integration_querier" ./integration/— clean.go test -c -tags "slicelabels,integration,integration_querier" -o /dev/null ./integration/— compiles.integration_querierjob). Since the flake is a race, a flake-count run can't prove absence; the determinism argument above (the waited gauge and the query path read the same lock-guarded ring state) is the stability proof. Reviewer repro recipe (not run here; requires Docker): temporarily add-consul.watch-rate-limit=0.2to the querier's flags to widen the JOINING-view window to ~5s — the unpatched test should then fail near-deterministically while the patched test passes. The knob is intentionally not part of this PR.AI assistance disclosure (per GENAI_POLICY.md): the bulk of this contribution — CI log decoding, root-cause analysis, the test change, and this description — was produced with AI assistance and reviewed by the contributor.
🤖 Generated with Claude Code