fix(querier): de-flake store-gateway limits integration tests by waiting for ACTIVE store-gateway in querier ring view#7614
Open
sandy2008 wants to merge 1 commit into
Conversation
7d732db to
9fcac35
Compare
…tore-gateway limits integration tests TestQuerierWithStoreGatewayDataBytesLimits intermittently fails with HTTP 500 instead of the expected 422 (cortexproject#7606, arm64 CI). The decoded (gzipped) 500 response body from the failing run is the querier-local ring error: expanding series: failed to get store-gateway replication set owning the block <ULID>: at least 1 healthy replica required, could only find 0 - unhealthy instances: 172.18.0.8:9095 i.e. the ring lookup failed before any store-gateway RPC was made. The store-gateway registers in the ring as JOINING (already owning tokens) and switches to ACTIVE only after its initial blocks sync, while the querier's BlocksRead ring operation only selects ACTIVE instances and its consul watch is rate-limited (1 rps by default). So the existing waits (ring tokens registered, blocks loaded on the store-gateway) can all pass while the querier's view of the store-gateway ring still says JOINING, and the first query 500s. The hypothesis originally filed on the issue - that the bytes-limit error loses its 422/ResourceExhausted coding in the vendored Thanos refetch ("series size exceeded expected size; refetching") path - was falsified during investigation: those log lines belong to an earlier, passing test in the same CI job; the failing query never reached store-gateway limiter code at all; and all 10 vendored limiter consumption sites (including the refetch recursion) re-code the error as ResourceExhausted, which the querier maps to a 422 LimitError (cortexproject#5286). Fix the race in the tests by waiting until the querier sees the store-gateway ACTIVE in its store-gateway ring view before querying (same idiom as backward_compatibility_test.go, cortexproject#5975). Apply the same wait to the sibling TestQuerierWithBlocksStorageLimits, which has the identical vulnerable shape (every query expected to hit a 422 limit against a freshly started store-gateway). Same root cause as cortexproject#7605, which is fixed separately for TestQuerierWithBlocksStorageOnMissingBlocksFromStorage in a non-overlapping PR. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com>
9fcac35 to
366ffc3
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What this PR does
De-flakes two
integration_queriertests that expect every query to fail with422from a store-gateway limit but intermittently get a500instead (issue #7606, observed on arm64 CI, run 26832378915):The fix is test-only: after the existing "ring tokens registered + blocks loaded" waits, also wait until the querier's view of the store-gateway ring reports the store-gateway
ACTIVE, in:TestQuerierWithStoreGatewayDataBytesLimits— the test reported in Flaky test: TestQuerierWithStoreGatewayDataBytesLimits (integration_querier, arm64) — got 500 instead of expected 422 #7606, andTestQuerierWithBlocksStorageLimits— the other all-queries-expect-422 limits test in this file, with the identical vulnerable shape (queries a freshly started store-gateway); a flake there would present verbatim as this issue's symptom and be dup-filed against Flaky test: TestQuerierWithStoreGatewayDataBytesLimits (integration_querier, arm64) — got 500 instead of expected 422 #7606.The wait is the same idiom already used in
integration/backward_compatibility_test.go(#5975):Root cause (decoded from the failing CI runs)
We decoded the gzipped 500 response bodies from the failing CI runs. All occurrences (#7606 run 26832378915; #7605 runs 26832378915 and 26632776611) are the byte-identical querier-local error:
produced by the ring lookup (
pkg/querier/blocks_store_replicated_set.go:127←pkg/ring/replication_strategy.go:93) before any store-gateway RPC. The failing query took ~4ms and has zero store-gateway log lines in its window — it never reached limiter code on the store-gateway.Why the existing waits don't close the window:
JOININGalready owning tokens, and flips toACTIVEonly after its initial blocks sync.BlocksReadring operation only selectsACTIVEinstances (pkg/storegateway/gateway_ring.go:49).pkg/ring/kv/consul/client.go:79), so its ring view lags the store-gateway's state flip.So
cortex_ring_tokens_total == 512*2on the querier andcortex_bucket_store_blocks_loaded == 1on the store-gateway can all pass while the querier's ring view still saysJOINING→ the replication-set lookup fails → 500 instead of 422.Why this wait is deterministic (and not just a bigger timeout): the
cortex_ring_membersgauge is computed under the ring's write lock from the sameringDescthatRing.Getconsults (pkg/ring/ring.go:380,updateRingMetricsat:662); once a scrape observesACTIVE == 1, the failing predicate is permanently false (the state cannot regress absent a store-gateway crash). Thestate="ACTIVE"series is zero-filled in the same locked ring update, and the preceding 1024-token wait proves the querier has already processed a ring update containing the store-gateway — so the bare matcher form (noe2e.WaitMissingMetrics) is safe and matches the #5975 precedent.Correcting the originally filed hypothesis (falsified)
Issue #7606 was filed with the hypothesis that the store-gateway log line
series size exceeded expected size; refetching ... maxSeriesSize=1near the failure meant the bytes-limit error was re-wrapped on the vendored Thanos refetch path and lost its 422/ResourceExhaustedcoding, so the querier mapped it to 500. Investigation falsified that hypothesis; this PR corrects the record:pkg/storegateway/limiter.go:31); (2) all 10 Thanos consumption sites immediately re-code itResourceExhausted(vendor/github.com/thanos-io/thanos/pkg/store/bucket.go1256/1408/1417/3091/3168/3226/3403/3430/3743/3848), including the refetch recursion (3459 → 3430) and the lazy-postings limits; (3) Thanos' final conversions preserve the code —status.FromError(errors.Cause(err))atbucket.go:1795-1798, and the streaming warning path (proxy_merge.go→NewWarnSeriesResponse→GRPCCodeFromWarn,storepb/custom.go:66-74) always sees the substringrpc error: code = ResourceExhaustedembedded by the re-wrap, so itsUnknownfallback cannot fire for limiter errors; (4) the querier mapsResourceExhausted→validation.LimitError→ 422 (pkg/querier/blocks_store_queryable.go:733-748, using grpc-gostatus.FromError, which unwraps wrapped chains), and even a hypothetical wireCode(422)is preserved by the translator's 4xx branch (pkg/querier/error_translate_queryable.go:57-69); (5) the same CI job shows the 422 path working end-to-end (TestQuerierWithBlocksStorageLimitspassed with nestedResourceExhaustedstatuses moments before the failure).Why not other approaches
ResourceExhaustedfor "message larger than max", which must stay 5xx, so remappingResourceExhaustedmore aggressively in production risks mislabeling genuine server errors as 422.JOININGexists precisely to keepBlocksReadaway until the initial blocks sync (pkg/storegateway/gateway_ring.go); RF>1 and query-frontend retries cover the window in production. Retrying/blockingGetClientsFor, admittingJOININGintoBlocksRead, or gating querier readiness on ACTIVE store-gateways would mask outages or break bootstrap orderings for what is a test-synchronization gap.Ring.Getevaluates, observed from the same locked snapshot, so it closes the race rather than racing it more slowly.Scope
integration/querier_test.go+ oneCHANGELOG.mdline. No production code, no vendored code, no e2e-framework changes, no shared helper.ResourceExhausted→validation.LimitError→ 422 path stays covered end-to-end.TestQuerierWithBlocksStorageOnMissingBlocksFromStorage. The two PRs touch disjoint hunks in this file (no shared diff context) and are merge-order independent; this PR deliberately does not touch that test.GRPCCodeFromWarnstring contract. It passes today (there is no current bug), so if maintainers want it, it belongs in a separate tracking issue explicitly labeled not-a-current-bug.Which issue(s) this PR fixes
Fixes #7606
Checklist
make docnot required)CHANGELOG.mdupdated - the order of entries should be[CHANGE],[FEATURE],[ENHANCEMENT],[BUGFIX]docs/configuration/v1-guarantees.mdupdated if this PR introduces experimental flags — N/ATest plan
Run locally on this branch:
gofmt -l integration/querier_test.go— cleangoimports -local github.com/cortexproject/cortex -l integration/querier_test.go— clean (no new imports;labelswas already imported)go vet -tags "slicelabels,integration,integration_querier" ./integration/— cleango test -c -tags "slicelabels,integration,integration_querier" -o /dev/null ./integration/— compilesDocker-based integration runs were not executed in this environment; for reviewers who want to reproduce:
go test -v -tags=integration,requires_docker,integration_querier -timeout 2400s -count=20 ./integration/ -run '^(TestQuerierWithStoreGatewayDataBytesLimits|TestQuerierWithBlocksStorageLimits)$'-consul.watch-rate-limit=0.2to these tests — unpatched they hit the 500 near-deterministically; with this PR's wait they pass.AI assistance disclosure
Per GENAI_POLICY.md: the bulk of this contribution (root-cause investigation, code change, and PR description) was produced with AI assistance (Claude Code). It has been reviewed and validated by the submitter, who takes full responsibility for it.
🤖 Generated with Claude Code