fix(alertmanager): fix ApplyConfig/Stop data race and shutdown-vs-lazy-create window#7618
Open
sandy2008 wants to merge 3 commits into
Open
Conversation
ApplyConfig spawned the per-tenant dispatcher and inhibitor Run goroutines fire-and-forget. The vendored alertmanager v0.32.1 Dispatcher performs its WaitGroup's first finished.Add(1) inside the Run goroutine (dispatch.go:190) while Dispatcher.Stop() calls finished.Wait() (dispatch.go:449), so a Stop executing before the spawned goroutine gets scheduled violates the sync.WaitGroup contract (a positive first Add must happen before Wait) and is reported as a data race under -race. This is reachable in production on every path that stops a tenant Alertmanager shortly after (re)configuring it: shutdown (stopping), tenant deactivation (syncConfigs), and config reload (ApplyConfig stopping the previous generation). No mutex can fix this: ApplyConfig and Stop bodies are already serialized by MultitenantAlertmanager.alertmanagersMtx, and the racing actor is the unstarted goroutine itself, whose first WaitGroup.Add cannot be ordered by any lock held by the callers. Fix by adopting upstream's own embedder protocol (cmd/alertmanager/main.go in prometheus/alertmanager, reload path): start the inhibitor, wait for it to finish loading, then start the dispatcher and wait for it to finish loading, before ApplyConfig returns. WaitForLoading() returns only after the Run goroutine has performed its initial load (program-ordered after the first finished.Add and after the inhibitor installs its cancel function), so the existing caller serialization then orders every later Stop() after startup. This also fixes Inhibitor.Stop() being silently ignored - leaking the inhibitor goroutine forever - when it executed before Run() installed ih.cancel, and matches upstream's inhibitor-first ordering so the inhibitor cache is populated before the dispatcher can notify. The waits live in ApplyConfig (success path) and deliberately NOT in Stop(): a failed reload leaves a never-Run inhibitor, and waiting for it in Stop would deadlock shutdown while holding alertmanagersMtx. The regression test mirrors the supported serialized lifecycle contract (create/reload via ApplyConfig, then StopAndWait) and fails deterministically under -race without this fix, including one double-ApplyConfig iteration to cover the reload path. Also adapt TestDispatcherGroupLimits to put its alerts before ApplyConfig: it asserts an exact aggregation-group-limit counter value, but the vendored dispatcher's limit check reads the group counter without synchronization with concurrent group creation, so alerts routed through the concurrent ingestion workers produce a nondeterministic count. The test was already flaky on master for this reason (7/20 runs on a 12-core host) and the startup barrier increases the exposure by guaranteeing the initial slurp is empty under the old ordering. With the alerts put first, the dispatcher routes all of them sequentially from its initial slurp - which the barrier guarantees completes before ApplyConfig returns - making the count exact on every run. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com>
MultitenantAlertmanager.stopping() stops only the per-tenant alertmanagers present in the map at the time it runs. A request on the fallback lazy-create path (serveRequest -> alertmanagerFromFallbackConfig -> setConfig) that loses the race with shutdown could still register a brand-new per-tenant Alertmanager afterwards - or restart the dispatcher/inhibitor of an already stopped one via existing.ApplyConfig - and nothing would ever stop it again: its goroutines and its (closed) alert provider would leak past shutdown. This is reachable because serveRequest can be entered without ServeHTTP's service-state check (the gRPC HandleRequest path when sharding is enabled), or after passing that check just before the state transition. Gate it with a shuttingDown flag guarded by the existing alertmanagersMtx: stopping() sets it under the lock before stopping the tenants; setConfig performs the authoritative check right after taking the lock and returns the errAlertmanagerShuttingDown sentinel; alertmanagerFromFallbackConfig makes a cheap best-effort check before uploading the empty config to the alert store; serveRequest maps the sentinel to HTTP 503, consistent with ServeHTTP's existing 'Alertmanager not ready' response - a request losing this race gets the same answer it would have gotten a moment later. The regression test drives a real service shutdown and then exercises the fallback lazy-create path directly: without the gate it deterministically registers (and leaks) a new tenant Alertmanager and returns 200; with the gate it returns 503 and the alertmanagers map stays empty. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com>
… CHANGELOG The check-modernize lint step rewrites the T1 loop to range-over-int; also point the CHANGELOG entry at this PR (cortexproject#7618) instead of the issue, matching the repo's CHANGELOG convention. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
What this PR does
Fixes the data race behind the flaky
TestMultitenantAlertmanager_ServeHTTPWithFallbackConfig(issue #7603) — which is a production bug, not a test artifact.Alertmanager.ApplyConfig(pkg/alertmanager/alertmanager.go) spawned the per-tenant dispatcher and inhibitorRungoroutines fire-and-forget. The vendored alertmanager v0.32.1dispatch.Dispatcherperforms its WaitGroup's firstfinished.Add(1)inside theRungoroutine (dispatch.go:190), whileDispatcher.Stop()callsfinished.Wait()(dispatch.go:449). AStop()that executes before the spawned goroutine is scheduled violates thesync.WaitGroupcontract ("calls with a positive delta that start when the counter is zero must happen before a Wait") — exactly what-racereports. The same window makesInhibitor.Stop()a silent no-op (the inhibitor goroutine then leaks forever) when it runs beforeRun()has installedih.cancel.Two commits:
ApplyConfig— replace the bare spawns with upstreamcmd/alertmanager's own embedder protocol (its reload path, inhibitor first "so we don't accidentally notify for an alert that will be inhibited"):go inhibitor.Run(); inhibitor.WaitForLoading(); go dispatcher.Run(now); dispatcher.WaitForLoading().MultitenantAlertmanager— ashuttingDownflag under the existingalertmanagersMtxso the fallback lazy-create path can no longer register a brand-new per-tenant Alertmanager (which would never be stopped) afterstopping()has begun.Why this is reachable
Every per-tenant stop path can interleave with a recent
ApplyConfig: process shutdown (stopping()), tenant deactivation (syncConfigs), and config reload (ApplyConfigstops the previous generation's dispatcher/inhibitor, hitting the same unorderedWait-vs-Addpair if the previousApplyConfigjust returned). The CI flake was the lazy-create (fallback) path racing test shutdown; the same shape occurs in production whenever a tenant Alertmanager is created or reconfigured shortly before it is stopped. TheAdd-inside-Rundispatcher shape ships with the v0.32.1 upgrade (#7462).The shutdown-vs-lazy-create hole is independent:
stopping()stops only the alertmanagers present in the map when it runs, andserveRequestcan be reached withoutServeHTTP's service-state check (the gRPCHandleRequestpath when sharding is enabled) or after passing that check just before the state transition. A request losing that race would upload a config, register a new Alertmanager after shutdown stopped the others, return 200 from a shutting-down process, and leak the new instance's goroutines.How the fix resolves it
Barrier:
WaitForLoading()returns only after theRungoroutine has performed its initial load — program-ordered after the dispatcher's firstfinished.Add(1)and after the inhibitor installsih.cancel.ApplyConfigandStopare serialized by their callers (alertmanagersMtx), so onceApplyConfigreturns, every laterStop()is ordered after startup: the WaitGroup misuse and the ignored-Stopinhibitor leak are both gone. The waits live onApplyConfig's success path only — deliberately not inStop()(see below).Gate:
stopping()setsshuttingDownunder the already-held lock before stopping tenants; the authoritative check sits insetConfigimmediately after it takes the same lock and returns anerrAlertmanagerShuttingDownsentinel (this also prevents a post-stop reload from restarting goroutines on an already-stopped Alertmanager);alertmanagerFromFallbackConfigmakes a cheap best-effort check before uploading the empty config to the alert store;serveRequestmaps the sentinel viaerrors.Isto HTTP 503 — consistent withServeHTTP's existing 503 when the service is not Running, so a request losing the race gets the same answer it would have gotten a moment later.Tests
Both regression tests were verified to fail without their fix and pass with it (under
-race):TestAlertmanagerStopBeforeDispatcherStart(commit 1,alertmanager_test.go): a sequential, sleep-free 30× loop of {New→ApplyConfig→StopAndWait}, with one double-ApplyConfigiteration to cover the reload path. Without the fix it fails on the first-count=1run with the exact CI signature (race at vendoreddispatch.go:190vs:449); with the fix it passes-count=30. It is sequential by design: callers are serialized in production and the racing actor is the spawnedRungoroutine itself, so a concurrentApplyConfig∥Stoptest would exercise an unsupported contract.TestMultitenantAlertmanager_NoFallbackCreateAfterShutdown(commit 2,multitenant_test.go): drives a real service shutdown, then exercises the fallback lazy-create path directly. Without the gate it deterministically registers (and leaks) a new tenant Alertmanager and returns 200; with the gate it returns 503 and assertsam.alertmanagersstays empty.TestMultitenantAlertmanager_ServeHTTPWithFallbackConfig: green at-race -count=30, both plain and withGOMAXPROCS=2 -cpu 2(mimicking CI).TestDispatcherGroupLimitsasserts an exact aggregation-group-limit counter value, but the vendored v0.32.1 dispatcher's limit check reads the group counter without synchronization with concurrent group creation, so alerts routed through its concurrent ingestion workers produce a nondeterministic count. This test is already flaky on unmodified master (measured 7/20 failed runs at the base commit on a 12-core host; CI's lower core count shrinks the window), and the barrier increases the exposure (14/20) because the initial slurp is now provably empty under the old alert/config ordering. The test now puts its alerts beforeApplyConfig, so the dispatcher routes all of them sequentially from its initial slurp — which the barrier guarantees completes beforeApplyConfigreturns — making the asserted count exact on every run (30/30 green, poll satisfied immediately).pkg/alertmanagersuite under-race: green.Why not other approaches
stoppedflag): empirically falsified — a faithful implementation of that design still races on the first run.alertmanagersMtxalready serialized both synchronous bodies in the CI failure; no lock held by the callers can order the unstartedRungoroutine's firstWaitGroup.Add. Only a happens-before edge to a post-Addevent in theRungoroutine fixes it, which is whatWaitForLoading()provides.Stop()instead ofApplyConfig: a failed reload (e.g.buildIntegrationsMaperror after the inhibitor was already swapped) leaves a never-Runinhibitor whose loading counter is neverDoned — a Stop-sideWaitForLoading()would deadlockstopping()while holdingalertmanagersMtx. The ApplyConfig-side barrier is immune by construction: it only waits for goroutines it just spawned.Addbefore the goroutine starts): vendored code must not be modified; the protocol used here is exactly how upstream's owncmd/alertmanagerconsumes these APIs.Dispatcher.Run/Stop, would not help.Scope
Production changes are small and flag-free: the barrier in
pkg/alertmanager/alertmanager.goand the ~15-line gate inpkg/alertmanager/multitenant.go(existing mutex, no new locks). Behavior changes:ApplyConfignow returns after the dispatcher/inhibitor finish their initial in-memory load (the same work upstream performs inline on reload), and a fallback request that loses the race with shutdown now gets a 503 instead of silently creating an Alertmanager that leaks. Test changes: two regression tests added, plus theTestDispatcherGroupLimitsdeterminism adaptation described above (test-only; the underlying unsynchronized limit accounting is in the vendored dispatcher and is upstream's to fix).Known-separate and deliberately not addressed here: the unsynchronized
am.dispatcherread in the APIGroupFuncclosure (pkg/alertmanager/alertmanager.go:291-293) can race a concurrent config reload from API-handler goroutines. That is a distinct race (a pointer-field read, not the WaitGroup internals fixed here) with its own fix shape (upstream publishes the dispatcher through an atomic pointer afterWaitForLoading), and is left for a follow-up issue.Which issue(s) this PR fixes
Fixes #7603
Checklist
CHANGELOG.mdupdated — single[BUGFIX] Alertmanagerentry (the entry whose name doesn't end inmaster).make docnot required).Test plan
gofmt -l/goimports -local github.com/cortexproject/cortex -lon changed files — cleango vet -tags "netgo slicelabels" ./pkg/alertmanager/...— cleango test -tags "netgo slicelabels" -race -count=1 -run 'TestAlertmanagerStopBeforeDispatcherStart$' ./pkg/alertmanager/— FAILS before commit 1 (data race,dispatch.go:449Stopvsdispatch.go:190Run), PASSES after;-count=30PASSgo test -tags "netgo slicelabels" -race -count=1 -run 'TestMultitenantAlertmanager_NoFallbackCreateAfterShutdown$' ./pkg/alertmanager/— FAILS before commit 2 (200 + leaked tenant Alertmanager), PASSES afterGOMAXPROCS=2 go test -tags "netgo slicelabels" -race -cpu 2 -count=30 -run 'TestMultitenantAlertmanager_ServeHTTPWithFallbackConfig$' ./pkg/alertmanager/— PASS (and plain-count=30— PASS)go test -tags "netgo slicelabels" -race -count=30 -run 'TestDispatcherGroupLimits$' ./pkg/alertmanager/— PASS after the determinism adaptation (baseline on unmodified master: 7/20 FAIL)go test -tags "netgo slicelabels" -race ./pkg/alertmanager/— PASS (full package)This PR was developed with AI assistance (multi-agent design review + implementation); all changes, races, and test results were verified as described in the test plan.
🤖 Generated with Claude Code