Graceful shutdown#58
Conversation
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: defaults Review profile: CHILL Plan: Pro Run ID: 📒 Files selected for processing (6)
✅ Files skipped from review due to trivial changes (2)
🚧 Files skipped from review as they are similar to previous changes (2)
📝 WalkthroughWalkthroughAdds configurable shutdown timeouts and container grace period; registers SIGTERM/SIGINT to set an asyncio shutdown event; makes Scheduler and MessageQueue shutdown-aware; introduces shutdown_services to drain MQ, stop health/Bolt servers, join threads with timeouts, and logs the shutdown reason and drained count. ChangesGraceful Shutdown with Signal Handling and Queue Drain
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested reviewers
🚥 Pre-merge checks | ✅ 4 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (4 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Comment |
There was a problem hiding this comment.
Actionable comments posted: 4
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/paperscout/scout.py (1)
317-320:⚠️ Potential issue | 🟠 Major | ⚡ Quick winBypass the breaker when draining already-queued messages.
Lines 318-320 still dead-letter queued items once the breaker is OPEN. That means messages accepted before shutdown can be dropped during
drain()if the breaker opens mid-backlog, which breaks the graceful-drain contract.Suggested fix
def _send_with_retry(self, channel: str, text: str, kwargs: dict) -> None: - if not self._breaker.allow_send(): + if not self._stop_requested.is_set() and not self._breaker.allow_send(): self._dead_letter(channel, text, reason="circuit_open", kwargs=kwargs) return🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/paperscout/scout.py` around lines 317 - 320, The _send_with_retry method currently dead-letters messages whenever the circuit breaker disallows sending, which causes queued messages to be dropped during drain(); change _send_with_retry to bypass the breaker when the instance is in draining mode so already-queued messages are sent (or retried) instead of being dead-lettered: inside _send_with_retry (method name) check a draining flag (e.g. self._draining or call drain() state accessor) before calling self._breaker.allow_send(), and only call self._dead_letter(channel, text, reason="circuit_open", kwargs=kwargs) when not draining and allow_send() returns False; ensure any references to _dead_letter and drain() behavior remain consistent with the class's shutdown/drain lifecycle.
🧹 Nitpick comments (1)
src/paperscout/shutdown.py (1)
34-64: ⚡ Quick winConsider wrapping each service shutdown in try/except for robustness.
If
mq.drain(),health_server.shutdown(), orstop_bolt_server()raises an exception, subsequent services won't be shut down. While the process is exiting anyway (limiting impact), wrapping each service in a try/except block ensures all services get a chance to clean up and provides better error visibility during shutdown.🛡️ Proposed defensive shutdown pattern
def shutdown_services(...) -> int: """Ordered teardown. Returns the number of messages drained from the queue.""" drained = 0 if mq is not None: - drained = mq.drain(timeout=mq_drain_timeout) + try: + drained = mq.drain(timeout=mq_drain_timeout) + except Exception: + log.exception("shutdown: MQ drain failed") if health_server is not None: - health_server.shutdown() - _join_thread(health_thread, thread_join_timeout, "health") + try: + health_server.shutdown() + except Exception: + log.exception("shutdown: health server shutdown failed") + try: + _join_thread(health_thread, thread_join_timeout, "health") + except Exception: + log.exception("shutdown: health thread join failed") if app is not None: - stop_bolt_server(app) - _join_thread(bolt_thread, thread_join_timeout, "bolt") + try: + stop_bolt_server(app) + except Exception: + log.exception("shutdown: bolt server shutdown failed") + try: + _join_thread(bolt_thread, thread_join_timeout, "bolt") + except Exception: + log.exception("shutdown: bolt thread join failed") log.info(...) return drained🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/paperscout/shutdown.py` around lines 34 - 64, In shutdown_services, wrap each external call (mq.drain(), health_server.shutdown(), stop_bolt_server(app)) in its own try/except so one failing shutdown doesn’t skip others; for mq.drain() catch exceptions around the assignment to drained and leave drained at 0 on failure, log the exception with context (use log.exception or log.error with the exception), and always proceed to call _join_thread for health and bolt threads and log their failures similarly; keep function signature and return drained as before but ensure each service shutdown is guarded and errors are logged without raising.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In @.env.example:
- Around line 77-79: Add documentation for the missing
SHUTDOWN_THREAD_JOIN_TIMEOUT_SECONDS environment variable to the .env.example:
describe that it controls how long the app waits for health and Bolt server
threads to exit during shutdown, state the default value (5.0 seconds) and that
it accepts a float/seconds, and place it alongside the existing
SHUTDOWN_MQ_DRAIN_TIMEOUT_SECONDS entry; reference the config symbol
SHUTDOWN_THREAD_JOIN_TIMEOUT_SECONDS (defined in config.py around where DEFAULT
is set) to ensure the name and default match the code.
In `@src/paperscout/monitor.py`:
- Around line 506-509: In Scheduler.run_forever(), don't unconditionally swallow
asyncio.CancelledError raised by poll_once(): check the scheduler's shutdown
condition (shutdown_event or equivalent) and only treat CancelledError as
shutdown when that event is set; otherwise re-raise the CancelledError so caller
cancellation is preserved. Locate the try/except around self.poll_once() in
run_forever(), use the shutdown flag (shutdown_event or shutdown_requested) to
decide whether to set shutdown_requested and break, and re-raise the exception
when shutdown was not actually requested.
In `@src/paperscout/scout.py`:
- Around line 208-211: The _put_shutdown_sentinel method currently calls
blocking self._q.put(_MQ_SENTINEL) which can make stop() block indefinitely if
the queue is full; change _put_shutdown_sentinel to perform a non‑blocking
enqueue (use queue.put_nowait or queue.put with timeout=0) and handle queue.Full
by evicting an item (e.g., get_nowait to drop oldest) or otherwise logging and
abandoning the enqueue so stop() never blocks; update references in stop(),
drain(), and shutdown_services() behavior expectations if needed and ensure any
exception (queue.Full) is caught and ignored or logged so the shutdown sentinel
write is non-blocking.
In `@src/paperscout/shutdown.py`:
- Around line 16-24: stop_bolt_server currently relies on Slack Bolt private
internals (app._development_server and dev._server.shutdown()) which are
fragile; update the function to be defensive: keep the existing
private-attribute shutdown as a best-effort fallback (using getattr checks you
already have), add try/except around dev._server.shutdown() to catch and log
exceptions, and add a clear comment referencing the fragility and recommending
pinning slack-bolt in requirements; also add an alternative fallback that
attempts to call SocketModeHandler.close/close_async if a known
SocketModeHandler instance is available on the app (or document that process
termination is the supported mechanism) so callers have a documented
expectation.
---
Outside diff comments:
In `@src/paperscout/scout.py`:
- Around line 317-320: The _send_with_retry method currently dead-letters
messages whenever the circuit breaker disallows sending, which causes queued
messages to be dropped during drain(); change _send_with_retry to bypass the
breaker when the instance is in draining mode so already-queued messages are
sent (or retried) instead of being dead-lettered: inside _send_with_retry
(method name) check a draining flag (e.g. self._draining or call drain() state
accessor) before calling self._breaker.allow_send(), and only call
self._dead_letter(channel, text, reason="circuit_open", kwargs=kwargs) when not
draining and allow_send() returns False; ensure any references to _dead_letter
and drain() behavior remain consistent with the class's shutdown/drain
lifecycle.
---
Nitpick comments:
In `@src/paperscout/shutdown.py`:
- Around line 34-64: In shutdown_services, wrap each external call (mq.drain(),
health_server.shutdown(), stop_bolt_server(app)) in its own try/except so one
failing shutdown doesn’t skip others; for mq.drain() catch exceptions around the
assignment to drained and leave drained at 0 on failure, log the exception with
context (use log.exception or log.error with the exception), and always proceed
to call _join_thread for health and bolt threads and log their failures
similarly; keep function signature and return drained as before but ensure each
service shutdown is guarded and errors are logged without raising.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 86dcbea7-3778-446c-9166-390a4062351f
📒 Files selected for processing (11)
.env.exampledocker-compose.ymlsrc/paperscout/__main__.pysrc/paperscout/config.pysrc/paperscout/health.pysrc/paperscout/monitor.pysrc/paperscout/scout.pysrc/paperscout/shutdown.pytests/test_message_queue.pytests/test_monitor.pytests/test_shutdown.py
|
@coderabbitai full review again |
|
✅ Action performedFull review finished. |
There was a problem hiding this comment.
Actionable comments posted: 3
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
src/paperscout/scout.py (1)
234-259:⚠️ Potential issue | 🟠 Major | ⚡ Quick winReject new enqueues once shutdown starts.
After
stop()requests shutdown,enqueue()still accepts new items. A late producer can land behind_MQ_SENTINELand never be sent, and if the queue is full the drop-oldest path can pop_MQ_SENTINELand crash on tuple unpack.Suggested change
def enqueue(self, channel: str, text: str, **kwargs) -> bool: """Queue a ``chat.postMessage``; return False when the circuit breaker rejects.""" + if self._stop_requested.is_set(): + log.warning( + "MQ enqueue-rejected shutdown %s %s", + _redact_channel(channel), + _payload_meta(text, kwargs), + ) + return False + if not self._breaker.allow_send(): log.warning( "MQ enqueue-rejected circuit=open %s %s", _redact_channel(channel), _payload_meta(text, kwargs),🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@src/paperscout/scout.py` around lines 234 - 259, The enqueue method accepts new items after shutdown starts which can leave producers placing items after the _MQ_SENTINEL or cause the drop-oldest path to pop the sentinel and crash; modify enqueue (and its use of self._q and self._queue_lock) to check a shutdown flag set by stop() (or test for the presence of _MQ_SENTINEL) before accepting items and immediately return False when shutdown is in progress, and when removing oldest entries handle the case where the popped value equals _MQ_SENTINEL by re-queuing it (or putting it back) and continuing the loop instead of unpacking, ensuring enqueue never accepts or drops the sentinel and cleanly rejects new enqueues once shutdown starts.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docker-compose.yml`:
- Line 20: Update stop_grace_period to derive from or be set above the sum of
the configurable shutdown timeouts used by shutdown_services (ensure it exceeds
the MQ drain + health join + Bolt join budgets) by making stop_grace_period
reference the corresponding .env timeout variables (or increase the hard-coded
value to at least the sum of those env-configured timeouts) and add
documentation noting that stop_grace_period must remain greater than the
combined drain/join budget; also add a runtime check in shutdown_services (or
startup validation) that compares the effective Docker stop_grace_period and the
configured shutdown timeouts and emits an error/warning if the stop_grace_period
is too small.
In `@src/paperscout/__main__.py`:
- Around line 270-284: The startup sequence starts MQ and other services before
the existing try/finally that calls shutdown_services, so partial failures
(e.g., in register_handlers, start_health_server, bolt_thread.start, or
enqueue_startup_status) skip cleanup; move/extend the try/finally so that once
mq is created the remaining startup calls and the call to
scheduler.run_forever(shutdown_event) are inside the same try block, and the
finally always calls shutdown_services(mq=mq, health_server=health_server,
app=app, bolt_thread=bolt_thread,
mq_drain_timeout=settings.shutdown_mq_drain_timeout_seconds,
thread_join_timeout=settings.shutdown_thread_join_timeout_seconds,
reason=shutdown_reason[0] or "unknown") to ensure MQ is drained/stopped and
threads are joined on any partial startup failure.
In `@tests/test_message_queue.py`:
- Around line 98-110: The test's sleep is flaky because it doesn't ensure the
sender is blocked in gated_send; modify gated_send (the side_effect for
mq._app.client.chat_postMessage) to set an Event (e.g., sender_started.set())
just before gate.wait(), then replace time.sleep(0.05) with
sender_started.wait(timeout=2.0) so the test only proceeds once the send call is
actually blocked; keep existing calls to mq.start(), mq.enqueue("C1", "queued"),
breaker failure recording (mq._breaker.record_failure()), mq.stop(), gate.set(),
and mq.drain(timeout=5.0) unchanged.
---
Outside diff comments:
In `@src/paperscout/scout.py`:
- Around line 234-259: The enqueue method accepts new items after shutdown
starts which can leave producers placing items after the _MQ_SENTINEL or cause
the drop-oldest path to pop the sentinel and crash; modify enqueue (and its use
of self._q and self._queue_lock) to check a shutdown flag set by stop() (or test
for the presence of _MQ_SENTINEL) before accepting items and immediately return
False when shutdown is in progress, and when removing oldest entries handle the
case where the popped value equals _MQ_SENTINEL by re-queuing it (or putting it
back) and continuing the loop instead of unpacking, ensuring enqueue never
accepts or drops the sentinel and cleanly rejects new enqueues once shutdown
starts.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 5ec57714-c93c-4393-afd7-e07cb696b384
📒 Files selected for processing (11)
.env.exampledocker-compose.ymlsrc/paperscout/__main__.pysrc/paperscout/config.pysrc/paperscout/health.pysrc/paperscout/monitor.pysrc/paperscout/scout.pysrc/paperscout/shutdown.pytests/test_message_queue.pytests/test_monitor.pytests/test_shutdown.py
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@src/paperscout/scout.py`:
- Around line 236-243: The shutdown race occurs because enqueue() checks
_stop_requested before acquiring _queue_lock so a caller can pass the first
check and append a message after stop() has inserted _MQ_SENTINEL; fix by
performing the shutdown check while holding _queue_lock (or re-checking
_stop_requested immediately after acquiring _queue_lock) and, if set, log the
same "enqueue-rejected shutdown" warning and return False instead of appending
to the queue; apply the same change to the second similar block around lines
handling the breaker (the code paths referencing _queue_lock, _stop_requested,
_MQ_SENTINEL, enqueue(), stop(), and _run()) so no messages are appended after
the sentinel is enqueued.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 92f6aa42-26ee-436c-a325-081253603bd4
📒 Files selected for processing (4)
docker-compose.ymlsrc/paperscout/__main__.pysrc/paperscout/scout.pytests/test_message_queue.py
✅ Files skipped from review due to trivial changes (1)
- docker-compose.yml
🚧 Files skipped from review as they are similar to previous changes (2)
- src/paperscout/main.py
- tests/test_message_queue.py
Summary
Add cooperative process shutdown for container rolling restarts. SIGTERM and SIGINT now set an
asyncio.Eventthat stops the scheduler (including mid-poll cancellation), drains the Slack message queue, shuts down the health and Bolt HTTP servers, joins worker threads, and logs the shutdown reason plus the number of messages sent during drain.Changes
MessageQueue(scout.py):stop(),drain(),join()with sentinel-based drain; double-start guard; circuit-breaker bypass on shutdown.Scheduler(monitor.py):run_forever(shutdown_event)with interruptible inter-poll sleep and_poll_once_or_cancel()for prompt exit during long probe cycles.shutdown.py(new):shutdown_services()orchestrates MQ drain → health shutdown → Bolt shutdown → thread joins.__main__.py: SIGTERM/SIGINT handlers vialoop.add_signal_handler(withsignal.signalfallback);try/finallyteardown.health.py: Exposes_paperscout_threadon the returnedHTTPServerfor join during teardown.config.py:SHUTDOWN_MQ_DRAIN_TIMEOUT_SECONDS(default 30s),SHUTDOWN_THREAD_JOIN_TIMEOUT_SECONDS(default 5s).docker-compose.yml:stop_grace_period: 45saligned with drain + join timeouts.tests/test_message_queue.py,tests/test_shutdown.py; updatedtests/test_monitor.pyshutdown/cancel coverage.Shutdown order
shutdown_event→ scheduler exits (sleep interrupt or poll cancel)MessageQueue.drain(timeout)— flush queued Slack postshealth_server.shutdown()+ thread joinshutdown()+ thread joinSIGTERM/SIGINT) and drained message countTest plan
uv run pytest tests/— 347 passeduv run ruff check src tests— clean./run check(ormake check) in CIpre-commit run --all-filesdocker compose up -d && docker compose kill -s SIGTERM paperscout— confirm shutdown log in container outputConfig
SHUTDOWN_MQ_DRAIN_TIMEOUT_SECONDS30SHUTDOWN_THREAD_JOIN_TIMEOUT_SECONDS5stop_grace_period(compose)45sRelated issues
Summary by CodeRabbit
New Features
Chores
Tests