From 9101b6f7fbcaa888b9e92b57d5f0ce31572de10a Mon Sep 17 00:00:00 2001
From: Sandy Chen
Date: Thu, 11 Jun 2026 09:01:49 +0900
Subject: [PATCH] fix(querier): wait for store-gateway ACTIVE in querier ring view in TestQuerierWithBlocksStorageOnMissingBlocksFromStorage

The first happy-path query in this integration test intermittently
failed with a 500 on arm64 CI (issue #7605). Decoding the gzipped 500
response body logged by the querier in both failing runs shows the
failure is querier-local and the query never reached the store-gateway:

  expanding series: failed to get store-gateway replication set owning
  the block : at least 1 healthy replica required, could only find 0 -
  unhealthy instances: 172.18.0.8:9095

The store-gateway registers in the ring with all its tokens in JOINING
state (pkg/storegateway/gateway.go:461), runs the initial blocks sync
(which is what drives cortex_bucket_store_blocks_loaded to 1), and only
then switches to ACTIVE (gateway.go:333). The test's existing waits
(querier ring tokens 512*2, store-gateway ring tokens 512,
blocks_loaded 1) are therefore all satisfiable while the querier's view
of the store-gateway ring still holds the instance in JOINING, and the
BlocksRead ring operation admits ACTIVE instances only
(pkg/storegateway/gateway_ring.go:49-55), so the first query fails with
the error above (pkg/ring/replication_strategy.go:93, wrapped at
pkg/querier/blocks_store_replicated_set.go:127).

The window is structural: the querier's consul watch is rate-limited to
1 req/s (pkg/ring/kv/consul/client.go:79), so the JOINING->ACTIVE flip
can reach the querier's ring view over a second after the store-gateway
CASed it, while all three metric waits can pass on their first poll.

Close the race by additionally waiting until the querier exposes
cortex_ring_members{name="store-gateway-client",state="ACTIVE"} == 1
before the first query - the same inline readiness wait used to fix the
identical race in backward_compatibility_test.go (#5975, 32bd46f27a).
The waited gauge is computed from the same mutex-guarded ring
descriptor that the querier's GetClientsFor consults, so the wait
condition is the exact negation of the failure condition. The
deliberate post-deletion 500 assertions are unchanged.

Fixes #7605

Co-Authored-By: Claude Fable 5
Signed-off-by: Sandy Chen
---
 CHANGELOG.md                | 1 +
 integration/querier_test.go | 9 +++++++++
 2 files changed, 10 insertions(+)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 0756d48d25..d323d7aad6 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -57,6 +57,7 @@
 * [BUGFIX] Query Frontend: Fix native histogram responses not being handled correctly in `minTime()` sort ordering for split_by_interval merge. #7555
 * [BUGFIX] Distributor: Release the push worker pool goroutines on shutdown by stopping the async executor during the stopping phase when `-distributor.num-push-workers` is set. #7602
 * [BUGFIX] Querier: Fix unbounded resource leak in the bucket-scan blocks finder (used when the bucket index is disabled). Per-tenant metadata fetchers, their Prometheus registries, and on-disk meta caches are now evicted once a tenant is no longer active, instead of being retained for the lifetime of the process. #7573
+* [BUGFIX] Querier: Fix flake in integration test TestQuerierWithBlocksStorageOnMissingBlocksFromStorage by waiting for the querier to see the store-gateway ACTIVE in the ring before the first query. #7615
 
 ## 1.21.0 2026-04-24
 
diff --git a/integration/querier_test.go b/integration/querier_test.go
index 2bba87703f..8415db2996 100644
--- a/integration/querier_test.go
+++ b/integration/querier_test.go
@@ -376,6 +376,15 @@ func TestQuerierWithBlocksStorageOnMissingBlocksFromStorage(t *testing.T) {
 	require.NoError(t, storeGateway.WaitSumMetrics(e2e.Equals(512), "cortex_ring_tokens_total"))
 	require.NoError(t, storeGateway.WaitSumMetrics(e2e.Equals(1), "cortex_bucket_store_blocks_loaded"))
 
+	// Wait until the querier observes the store-gateway as ACTIVE in its view of the store-gateway
+	// ring: the store-gateway registers as JOINING and switches to ACTIVE only after the initial
+	// blocks sync, so the waits above can all pass while queries would still fail with
+	// "at least 1 healthy replica required, could only find 0" (500). Keep after the tokens wait.
+	require.NoError(t, querier.WaitSumMetricsWithOptions(e2e.Equals(1), []string{"cortex_ring_members"}, e2e.WithLabelMatchers(
+		labels.MustNewMatcher(labels.MatchEqual, "name", "store-gateway-client"),
+		labels.MustNewMatcher(labels.MatchEqual, "state", "ACTIVE"),
+	)))
+
 	// Query back the series.
 	c, err = e2ecortex.NewClient("", querier.HTTPEndpoint(), "", "", "user-1")
 	require.NoError(t, err)