fix(sei-tendermint): prevent readRoutine goroutine leak on /websocket when writeChan is full (PLT-707)#3664
Conversation
PR SummaryMedium Risk Overview
Adds two Reviewed by Cursor Bugbot for commit 1d2ba26. Bugbot is set up for automated code reviews on this repo. Configure here. |
|
The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).
|
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #3664 +/- ##
==========================================
- Coverage 59.17% 58.20% -0.97%
==========================================
Files 2262 2178 -84
Lines 187053 177431 -9622
==========================================
- Hits 110682 103280 -7402
+ Misses 66433 65001 -1432
+ Partials 9938 9150 -788
Flags with carried forward coverage won't be shown. Click here to find out more.
🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
A focused, correct fix for a real goroutine leak in the JSON-RPC /websocket readRoutine: replacing the hardcoded context.Background() used for writeChan sends with a connection-scoped context that cancels when writeRoutine exits, so a send blocked on a full writeChan unwinds instead of leaking. The change is well-targeted, the panic-recovery path is handled, and two deterministic leaktest regression tests cover both paths.
Findings: 0 blocking | 3 non-blocking | 0 posted inline
Blockers
- None at the file/PR level.
Non-blocking
- Cursor's second-opinion review file (cursor-review.md) was empty — that pass produced no output.
- Codex reported no material findings but noted it could not execute the tests because the environment only has Go 1.24.13 and cannot download the required Go 1.25.6 toolchain; the test verification claims in the PR description were therefore not independently confirmed by Codex.
- Nit: newEchoWSServer registers t.Cleanup(srv.Close) and the test bodies also defer s.Close(), so srv.Close() is called twice. httptest.Server.Close is idempotent so this is harmless, but the redundant defer could be dropped for clarity.
Problem
readRoutineon the JSON-RPC/websocketendpoint could leak permanently whenwriteChanis full.readRoutinepushes every response throughWriteRPCResponse, which does a blocking send onwriteChan(capacitydefaultWSWriteChanCapacity = 100), guarded only by a context:The context passed was hardcoded to
context.Background(), so the<-ctx.Done()branch could never fire.writeChanis the only consumer's input, drained solely bywriteRoutine. WhenwriteRoutineexits independently ofreadRoutine— a socket write/ping failure against a slow or dead client, or context cancellation — nothing drainswriteChananymore. Once full,readRoutine's nextWriteRPCResponseblocks forever:Stop()/cancel()only cancelwsc.ctx, which the blocked send doesn't observe.writeChanis intentionally never closed.The same defect existed on the panic-recovery path, which also called
WriteRPCResponse(context.Background(), ...)before relaunchingreadRoutine.Fix
Give
readRoutinea context that is cancelled when the write side goes away:Startderives a connection-scoped, cancellable context withdefer cancel(), so it is cancelled the momentStartreturns (i.e. oncewriteRoutineexits).readRoutineuses that context (instead ofcontext.Background()) for itsWriteRPCResponsecalls, including the panic-recovery path.When
writeRoutineexits for any reason, the context cancels, a blockedWriteRPCResponsereturnsctx.Err(), andreadRoutineunwinds via its<-ctx.Done()case instead of leaking. The relaunchedreadRoutineon the panic path inherits the same cancellable context and terminates on teardown as well.Tests
Two
leaktest-based regression tests inws_handler_test.go, both deterministic (a 32 MB response blockswriteRoutineon the socket write; tiny follow-up requests fillwriteChanto capacity):TestWebsocketReadRoutineNoLeakOnFullWriteChan— readRoutine blocked on a fullwriteChanafterwriteRoutineexits.TestWebsocketReadRoutineNoLeakOnPanicWithFullWriteChan— the panic→recover()path with a fullwriteChan.Both were verified to fail against the pre-fix code (leaked goroutine reported by
leaktest) and pass after the fix. Full package passes with-race;gofmt -sandgo vetare clean.