Skip to content

fix(sei-tendermint): prevent readRoutine goroutine leak on /websocket when writeChan is full (PLT-707)#3664

Merged
amir-deris merged 3 commits into
mainfrom
amir/plt-707-go-routine-leak-websocket
Jun 30, 2026
Merged

fix(sei-tendermint): prevent readRoutine goroutine leak on /websocket when writeChan is full (PLT-707)#3664
amir-deris merged 3 commits into
mainfrom
amir/plt-707-go-routine-leak-websocket

Conversation

@amir-deris

@amir-deris amir-deris commented Jun 29, 2026

Copy link
Copy Markdown
Contributor

Problem

readRoutine on the JSON-RPC /websocket endpoint could leak permanently when writeChan is full.

readRoutine pushes every response through WriteRPCResponse, which does a blocking send on writeChan (capacity defaultWSWriteChanCapacity = 100), guarded only by a context:

select {
case <-ctx.Done():
    return ctx.Err()
case wsc.writeChan <- resp:
    return nil
}

The context passed was hardcoded to context.Background(), so the <-ctx.Done() branch could never fire. writeChan is the only consumer's input, drained solely by writeRoutine. When writeRoutine exits independently of readRoutine — a socket write/ping failure against a slow or dead client, or context cancellation — nothing drains writeChan anymore. Once full, readRoutine's next WriteRPCResponse blocks forever:

  • Stop()/cancel() only cancel wsc.ctx, which the blocked send doesn't observe.
  • Closing the underlying connection only affects socket reads, not a channel send.
  • writeChan is intentionally never closed.

The same defect existed on the panic-recovery path, which also called WriteRPCResponse(context.Background(), ...) before relaunching readRoutine.

Fix

Give readRoutine a context that is cancelled when the write side goes away:

  • Start derives a connection-scoped, cancellable context with defer cancel(), so it is cancelled the moment Start returns (i.e. once writeRoutine exits).
  • readRoutine uses that context (instead of context.Background()) for its WriteRPCResponse calls, including the panic-recovery path.

When writeRoutine exits for any reason, the context cancels, a blocked WriteRPCResponse returns ctx.Err(), and readRoutine unwinds via its <-ctx.Done() case instead of leaking. The relaunched readRoutine on the panic path inherits the same cancellable context and terminates on teardown as well.

Tests

Two leaktest-based regression tests in ws_handler_test.go, both deterministic (a 32 MB response blocks writeRoutine on the socket write; tiny follow-up requests fill writeChan to capacity):

  • TestWebsocketReadRoutineNoLeakOnFullWriteChan — readRoutine blocked on a full writeChan after writeRoutine exits.
  • TestWebsocketReadRoutineNoLeakOnPanicWithFullWriteChan — the panic→recover() path with a full writeChan.

Both were verified to fail against the pre-fix code (leaked goroutine reported by leaktest) and pass after the fix. Full package passes with -race; gofmt -s and go vet are clean.

@cursor

cursor Bot commented Jun 29, 2026

Copy link
Copy Markdown

PR Summary

Medium Risk
Touches live WebSocket RPC connection teardown under backpressure; behavior change is narrow and covered by leak regression tests, but affects every WS client when the write side fails.

Overview
Fixes a permanent goroutine leak on the JSON-RPC /websocket path when writeRoutine stops draining writeChan (slow/dead client, failed socket write) while readRoutine still tries to enqueue responses.

Start now derives a connection-scoped context and defer cancel() so cancellation runs as soon as writeRoutine returns. readRoutine uses that context for all WriteRPCResponse calls (including panic recovery) instead of context.Background(), so blocked sends on a full channel can exit via <-ctx.Done() and unwind cleanly.

Adds two leaktest regression tests that fill writeChan and force writeRoutine to exit (normal path and panic-recovery path).

Reviewed by Cursor Bugbot for commit 1d2ba26. Bugbot is set up for automated code reviews on this repo. Configure here.

@amir-deris amir-deris changed the title Amir/plt 707 go routine leak websocket fix(sei-tendermint): prevent readRoutine goroutine leak on /websocket when writeChan is full (PLT-707) Jun 29, 2026
@github-actions

github-actions Bot commented Jun 29, 2026

Copy link
Copy Markdown

The latest Buf updates on your PR. Results from workflow Buf / buf (pull_request).

BuildFormatLintBreakingUpdated (UTC)
✅ passed✅ passed✅ passed✅ passedJun 29, 2026, 9:54 PM

@codecov

codecov Bot commented Jun 29, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 58.20%. Comparing base (bb69d38) to head (1d2ba26).

Additional details and impacted files

Impacted file tree graph

@@            Coverage Diff             @@
##             main    #3664      +/-   ##
==========================================
- Coverage   59.17%   58.20%   -0.97%     
==========================================
  Files        2262     2178      -84     
  Lines      187053   177431    -9622     
==========================================
- Hits       110682   103280    -7402     
+ Misses      66433    65001    -1432     
+ Partials     9938     9150     -788     
Flag Coverage Δ
sei-chain-pr 64.73% <100.00%> (?)
sei-db 70.41% <ø> (ø)
sei-db-state-db ?

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
sei-tendermint/rpc/jsonrpc/server/ws_handler.go 67.21% <100.00%> (+10.85%) ⬆️

... and 111 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@seidroid seidroid Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A focused, correct fix for a real goroutine leak in the JSON-RPC /websocket readRoutine: replacing the hardcoded context.Background() used for writeChan sends with a connection-scoped context that cancels when writeRoutine exits, so a send blocked on a full writeChan unwinds instead of leaking. The change is well-targeted, the panic-recovery path is handled, and two deterministic leaktest regression tests cover both paths.

Findings: 0 blocking | 3 non-blocking | 0 posted inline

Blockers

  • None at the file/PR level.

Non-blocking

  • Cursor's second-opinion review file (cursor-review.md) was empty — that pass produced no output.
  • Codex reported no material findings but noted it could not execute the tests because the environment only has Go 1.24.13 and cannot download the required Go 1.25.6 toolchain; the test verification claims in the PR description were therefore not independently confirmed by Codex.
  • Nit: newEchoWSServer registers t.Cleanup(srv.Close) and the test bodies also defer s.Close(), so srv.Close() is called twice. httptest.Server.Close is idempotent so this is harmless, but the redundant defer could be dropped for clarity.

@amir-deris amir-deris added this pull request to the merge queue Jun 29, 2026
@github-merge-queue github-merge-queue Bot removed this pull request from the merge queue due to failed status checks Jun 29, 2026
@amir-deris amir-deris added this pull request to the merge queue Jun 29, 2026
Merged via the queue into main with commit b8c3929 Jun 30, 2026
68 checks passed
@amir-deris amir-deris deleted the amir/plt-707-go-routine-leak-websocket branch June 30, 2026 00:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants