test: bump brpc_channel_unittest to size=large to fix CI timeout flakiness #3339

Open
rajvarun77 wants to merge 1 commit into apache:master from rajvarun77:deflake-backup-request-virtual-time

@rajvarun77

Contributor

Problem

brpc_channel_unittest bundles dozens of timing-sensitive TEST_F cases (backup
request, retry/backoff, timeouts, connection-failure) into a single test
binary. Each test does real-time waits (server-side sleep_us, backup-request
timers, connection retries). gtest runs them serially in one process, so the
binary's wall time is at least the sum of all those waits.

On contended CI runners (GitHub-hosted ubuntu-22.04, ~4 shared vCPU with
hypervisor steal) that cumulative time exceeds Bazel's default per-test 300s
limit (size = "medium"), so the binary intermittently fails with TIMEOUT
even though every assertion would pass given enough time.

Evidence (reproduced on GitHub Actions)

Measured on ubuntu-22.04, --nocache_test_results:

| Configuration | Result |
| --- | --- |
| current (size=medium, 300s), 5 runs under load | TIMEOUT in 4/5 @ 300.0s |
| size=large (900s), single run | PASSED in 91.7s |
| size=large (900s), 20 serialized no-cache runs | 20/20 PASSED, slowest 114.0s |

The nominal run is ~92–114s, but under parallel-job contention the same binary
balloons past 300s, a slowdown of at least ~3× that crosses the medium ceiling.
Raising the limit to large (900s) gives ~8× headroom over the slowest nominal
run (114.0s) and absorbs the spike.
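
For reference, the headroom figures can be re-derived with a few lines of Python. This is only a sketch: the table holds Bazel's default timeouts per test size, the durations are the ones measured above, and `headroom` is a hypothetical helper name, not something in the repo.

```python
# Bazel's default test timeouts in seconds, keyed by the `size` attribute.
BAZEL_DEFAULT_TIMEOUT_S = {"small": 60, "medium": 300, "large": 900, "enormous": 3600}


def headroom(size: str, observed_s: float) -> float:
    """Return how many times the observed wall time fits into the size's timeout."""
    return BAZEL_DEFAULT_TIMEOUT_S[size] / observed_s


# Compare both size limits against the fastest and slowest nominal runs.
for size in ("medium", "large"):
    for observed_s in (91.7, 114.0):
        print(f"{size:>6} vs {observed_s:5.1f}s: {headroom(size, observed_s):.1f}x")
```

With these inputs, medium leaves roughly 2.6× to 3.3× over a nominal run, while large leaves roughly 7.9× to 9.8×.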

Bench runs (throwaway branch, not part of this PR):

Fix

Add an optional per_test_size override to the generate_unittests macro and
set brpc_channel_unittest to size = "large". No test source changes.
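
The override logic can be modelled roughly as follows. This is a plain-Python sketch of the idea, not the Starlark in this PR: the parameter names other than `per_test_size`, the `default_size` fallback, and the returned dict shape are all assumptions made for illustration. The real `generate_unittests` macro would pass the resolved size to a test rule instead of returning dicts.

```python
def generate_unittests(test_names, default_size="medium", per_test_size=None):
    """Model of a macro that emits one test target per name.

    per_test_size is an optional {test_name: size} mapping; any test not
    listed in it keeps default_size. (Hypothetical signature.)
    """
    overrides = per_test_size or {}
    return [
        {"name": name, "size": overrides.get(name, default_size)}
        for name in test_names
    ]


targets = generate_unittests(
    ["brpc_channel_unittest", "some_other_unittest"],  # second name is made up
    per_test_size={"brpc_channel_unittest": "large"},
)
for target in targets:
    print(target["name"], target["size"])
```

Keeping the override optional means every other test keeps its existing size, so the change is confined to the one slow binary.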

Why not shard it?

Sharding (shard_count) was tried first and rejected: it fails
deterministically (20/20). The TEST_F cases in brpc_channel_unittest share
fixed loopback endpoints and global state, so running shards as parallel
processes makes a "connection should be refused" test
(ChannelTest.connection_failed_selective) observe another shard's live server
on the same port and see a successful connection instead of ECONNREFUSED.
The tests are not shard-safe; raising the size limit is the only safe lever
without rewriting the suite for isolation.
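
The failure mode can be reproduced in miniature with plain sockets. The sketch below is illustrative only and does not use brpc: it assumes a POSIX-like TCP stack, and `connection_refused` and `demo` are hypothetical names. The `listen()` call stands in for another shard starting its server on the port that a "connection refused" test expects to be dead.

```python
import socket


def connection_refused(port: int) -> bool:
    """Return True if connecting to 127.0.0.1:port fails with ECONNREFUSED."""
    try:
        with socket.create_connection(("127.0.0.1", port), timeout=2):
            return False
    except ConnectionRefusedError:
        return True


def demo():
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        # Reserve a loopback port. Nothing is listening on it yet.
        sock.bind(("127.0.0.1", 0))
        port = sock.getsockname()[1]
        refused_when_alone = connection_refused(port)

        # Stand-in for a parallel shard bringing up a server on the same port.
        sock.listen(1)
        refused_with_other_shard = connection_refused(port)
    return refused_when_alone, refused_with_other_shard


alone, with_other_shard = demo()
print("refused when alone:", alone)
print("refused with another shard listening:", with_other_shard)
```

A test that asserts the refusal passes in the first case and fails in the second, which is what happens to a fixed-port test once shards run side by side.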


cc @chenBright for review.

brpc_channel_unittest packs dozens of timing-sensitive TEST_F (backup
request, retry/backoff, timeouts) into a single binary. On contended CI
runners its cumulative real-time waits exceed Bazel's default per-test
300s (size=medium) limit, producing flaky TIMEOUT failures (observed
TIMEOUT in 4/5 no-cache runs on a GitHub ubuntu-22.04 runner).

Add an optional per_test_size override to the generate_unittests macro
and set brpc_channel_unittest to size=large (900s). No test source changes.

Sharding was rejected: the binary's TEST_F share fixed loopback endpoints
and global state, so parallel shards make a 'connection refused' test
observe another shard's live server and fail deterministically.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>