
Flaky test: TestQuerierWithStoreGatewayDataBytesLimits (integration_querier, arm64) — got 500 instead of expected 422 #7606

@sandy2008



Describe the bug

The integration_querier test TestQuerierWithStoreGatewayDataBytesLimits intermittently fails on arm64. It sets -store-gateway.max-downloaded-bytes-per-request: 1 and expects every query to be rejected with HTTP 422 ("exceeded bytes limit"), but occasionally receives 500:

querier_test.go:564:
    Error: Not equal: expected: 422  actual: 500
    Test:  TestQuerierWithStoreGatewayDataBytesLimits

Root cause (corrected 2026-06-11, from the decoded CI response): this issue originally hypothesized that the bytes-limit error loses its 422 mapping in the store-gateway's "series size exceeded expected size; refetching" path. That hypothesis is falsified: the gzipped 500 body in the CI log decodes to a querier-local ring error — the query never reached the store-gateway at all:

expanding series: failed to get store-gateway replication set owning the block <ULID>:
at least 1 healthy replica required, could only find 0 - unhealthy instances: ...
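
For anyone repeating the decode step, here is a minimal Go sketch of recovering a gzipped response body from a log. It assumes the body was captured as base64 text, which may not match how the CI log prints it; swap the first step for hex or raw bytes as needed. The names `decodeGzipBody` and `gzipAndEncode` are hypothetical helpers for illustration, not part of Cortex.

```go
package main

import (
	"bytes"
	"compress/gzip"
	"encoding/base64"
	"fmt"
	"io"
)

// decodeGzipBody turns a base64-encoded, gzip-compressed HTTP body into text.
// Assumption: the log holds the body as standard base64.
func decodeGzipBody(encoded string) (string, error) {
	raw, err := base64.StdEncoding.DecodeString(encoded)
	if err != nil {
		return "", fmt.Errorf("base64 decode: %w", err)
	}
	zr, err := gzip.NewReader(bytes.NewReader(raw))
	if err != nil {
		return "", fmt.Errorf("gzip header: %w", err)
	}
	defer zr.Close()
	plain, err := io.ReadAll(zr)
	if err != nil {
		return "", fmt.Errorf("gzip read: %w", err)
	}
	return string(plain), nil
}

// gzipAndEncode builds sample input for the demo below (hypothetical helper).
func gzipAndEncode(s string) (string, error) {
	var buf bytes.Buffer
	zw := gzip.NewWriter(&buf)
	if _, err := zw.Write([]byte(s)); err != nil {
		return "", err
	}
	if err := zw.Close(); err != nil {
		return "", err
	}
	return base64.StdEncoding.EncodeToString(buf.Bytes()), nil
}

func main() {
	// Round-trip a sample message to exercise the decode path end to end.
	enc, err := gzipAndEncode("sample error body")
	if err != nil {
		fmt.Println("encode failed:", err)
		return
	}
	body, err := decodeGzipBody(enc)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Println(body)
}
```

Reading the decoded text rather than the status code alone is what separates a ring error from a limiter error here, since both can surface as a 500.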

The "refetching" log lines belonged to a different (passing) test in the same job's container dump. The actual mechanism is the same store-gateway JOINING-vs-ACTIVE ring-readiness race as #7605: the test's waits can be satisfied while the store-gateway is still JOINING in the querier's ring view, and BlocksRead only selects ACTIVE instances, so the querier finds no usable replica and returns 500 before any bytes limit is evaluated. An audit of the bytes-limit error propagation found it sound on every current path (the Cortex limiter's ResourceExhausted survives all 10 vendored consumption sites, including the refetch recursion, and the querier mapping from #5286 handles it).

To Reproduce

Steps to reproduce the behavior:

  1. Start Cortex (recent master)
  2. Run the integration test on arm64 (flaky):
    go test -tags=slicelabels,integration,integration_querier -count=5 -run TestQuerierWithStoreGatewayDataBytesLimits ./integration/...
    

Expected behavior

Queries deterministically reach the store-gateway (querier's ring view ACTIVE) and return 422 with "exceeded bytes limit"; the test should wait on the querier's own ring view before asserting.

Environment:

  • Infrastructure: GitHub Actions CI, ubuntu-24.04-arm (arm64), integration job, tag integration_querier
  • Deployment tool: N/A (Docker-based integration test)

Additional Context

Observed on CI (arm64, on a PR unrelated to the querier), 2026-06-02:
https://github.com/cortexproject/cortex/actions/runs/26832378915/job/79118070108

Fix proposed in #7614 (querier-side ACTIVE-ring wait in this test and its identically-shaped sibling TestQuerierWithBlocksStorageLimits). Same decoded root cause as #7605 (separate PR, non-overlapping changes).

Filed and later corrected from CI failure-log analysis with AI assistance; the run link, decoded response body, and the bytes-limit propagation audit were reviewed and verified against master before submitting.
