Skip to content

fix(alertmanager): fix ApplyConfig/Stop data race and shutdown-vs-lazy-create window#7618

Open
sandy2008 wants to merge 3 commits into
cortexproject:masterfrom
sandy2008:fix/alertmanager-applyconfig-stop-race
Open

fix(alertmanager): fix ApplyConfig/Stop data race and shutdown-vs-lazy-create window#7618
sandy2008 wants to merge 3 commits into
cortexproject:masterfrom
sandy2008:fix/alertmanager-applyconfig-stop-race

Conversation

@sandy2008

Copy link
Copy Markdown
Contributor

What this PR does

Fixes the data race behind the flaky TestMultitenantAlertmanager_ServeHTTPWithFallbackConfig (issue #7603) — which is a production bug, not a test artifact.

Alertmanager.ApplyConfig (pkg/alertmanager/alertmanager.go) spawned the per-tenant dispatcher and inhibitor Run goroutines fire-and-forget. The vendored alertmanager v0.32.1 dispatch.Dispatcher performs its WaitGroup's first finished.Add(1) inside the Run goroutine (dispatch.go:190), while Dispatcher.Stop() calls finished.Wait() (dispatch.go:449). A Stop() that executes before the spawned goroutine is scheduled violates the sync.WaitGroup contract ("calls with a positive delta that start when the counter is zero must happen before a Wait") — exactly what -race reports. The same window makes Inhibitor.Stop() a silent no-op (the inhibitor goroutine then leaks forever) when it runs before Run() has installed ih.cancel.

Two commits:

  1. Startup barrier in ApplyConfig — replace the bare spawns with upstream cmd/alertmanager's own embedder protocol (its reload path, inhibitor first "so we don't accidentally notify for an alert that will be inhibited"): go inhibitor.Run(); inhibitor.WaitForLoading(); go dispatcher.Run(now); dispatcher.WaitForLoading().
  2. Shutdown gate in MultitenantAlertmanager — a shuttingDown flag under the existing alertmanagersMtx so the fallback lazy-create path can no longer register a brand-new per-tenant Alertmanager (which would never be stopped) after stopping() has begun.

Why this is reachable

Every per-tenant stop path can interleave with a recent ApplyConfig: process shutdown (stopping()), tenant deactivation (syncConfigs), and config reload (ApplyConfig stops the previous generation's dispatcher/inhibitor, hitting the same unordered Wait-vs-Add pair if the previous ApplyConfig just returned). The CI flake was the lazy-create (fallback) path racing test shutdown; the same shape occurs in production whenever a tenant Alertmanager is created or reconfigured shortly before it is stopped. The Add-inside-Run dispatcher shape ships with the v0.32.1 upgrade (#7462).

The shutdown-vs-lazy-create hole is independent: stopping() stops only the alertmanagers present in the map when it runs, and serveRequest can be reached without ServeHTTP's service-state check (the gRPC HandleRequest path when sharding is enabled) or after passing that check just before the state transition. A request losing that race would upload a config, register a new Alertmanager after shutdown stopped the others, return 200 from a shutting-down process, and leak the new instance's goroutines.

How the fix resolves it

Barrier: WaitForLoading() returns only after the Run goroutine has performed its initial load — program-ordered after the dispatcher's first finished.Add(1) and after the inhibitor installs ih.cancel. ApplyConfig and Stop are serialized by their callers (alertmanagersMtx), so once ApplyConfig returns, every later Stop() is ordered after startup: the WaitGroup misuse and the ignored-Stop inhibitor leak are both gone. The waits live on ApplyConfig's success path only — deliberately not in Stop() (see below).

Gate: stopping() sets shuttingDown under the already-held lock before stopping tenants; the authoritative check sits in setConfig immediately after it takes the same lock and returns an errAlertmanagerShuttingDown sentinel (this also prevents a post-stop reload from restarting goroutines on an already-stopped Alertmanager); alertmanagerFromFallbackConfig makes a cheap best-effort check before uploading the empty config to the alert store; serveRequest maps the sentinel via errors.Is to HTTP 503 — consistent with ServeHTTP's existing 503 when the service is not Running, so a request losing the race gets the same answer it would have gotten a moment later.

Tests

Both regression tests were verified to fail without their fix and pass with it (under -race):

  • TestAlertmanagerStopBeforeDispatcherStart (commit 1, alertmanager_test.go): a sequential, sleep-free 30× loop of {NewApplyConfigStopAndWait}, with one double-ApplyConfig iteration to cover the reload path. Without the fix it fails on the first -count=1 run with the exact CI signature (race at vendored dispatch.go:190 vs :449); with the fix it passes -count=30. It is sequential by design: callers are serialized in production and the racing actor is the spawned Run goroutine itself, so a concurrent ApplyConfigStop test would exercise an unsupported contract.
  • TestMultitenantAlertmanager_NoFallbackCreateAfterShutdown (commit 2, multitenant_test.go): drives a real service shutdown, then exercises the fallback lazy-create path directly. Without the gate it deterministically registers (and leaks) a new tenant Alertmanager and returns 200; with the gate it returns 503 and asserts am.alertmanagers stays empty.
  • The originally flaky TestMultitenantAlertmanager_ServeHTTPWithFallbackConfig: green at -race -count=30, both plain and with GOMAXPROCS=2 -cpu 2 (mimicking CI).
  • One existing test adapted: TestDispatcherGroupLimits asserts an exact aggregation-group-limit counter value, but the vendored v0.32.1 dispatcher's limit check reads the group counter without synchronization with concurrent group creation, so alerts routed through its concurrent ingestion workers produce a nondeterministic count. This test is already flaky on unmodified master (measured 7/20 failed runs at the base commit on a 12-core host; CI's lower core count shrinks the window), and the barrier increases the exposure (14/20) because the initial slurp is now provably empty under the old alert/config ordering. The test now puts its alerts before ApplyConfig, so the dispatcher routes all of them sequentially from its initial slurp — which the barrier guarantees completes before ApplyConfig returns — making the asserted count exact on every run (30/30 green, poll satisfied immediately).
  • Full pkg/alertmanager suite under -race: green.

Why not other approaches

  • Per-AM lifecycle mutex (+ stopped flag): empirically falsified — a faithful implementation of that design still races on the first run. alertmanagersMtx already serialized both synchronous bodies in the CI failure; no lock held by the callers can order the unstarted Run goroutine's first WaitGroup.Add. Only a happens-before edge to a post-Add event in the Run goroutine fixes it, which is what WaitForLoading() provides.
  • Waiting in Stop() instead of ApplyConfig: a failed reload (e.g. buildIntegrationsMap error after the inhibitor was already swapped) leaves a never-Run inhibitor whose loading counter is never Doned — a Stop-side WaitForLoading() would deadlock stopping() while holding alertmanagersMtx. The ApplyConfig-side barrier is immune by construction: it only waits for goroutines it just spawned.
  • Editing the vendored alertmanager (moving Add before the goroutine starts): vendored code must not be modified; the protocol used here is exactly how upstream's own cmd/alertmanager consumes these APIs.
  • Bumping alertmanager to v0.32.2: diff inspected — no change to Dispatcher.Run/Stop, would not help.

Scope

Production changes are small and flag-free: the barrier in pkg/alertmanager/alertmanager.go and the ~15-line gate in pkg/alertmanager/multitenant.go (existing mutex, no new locks). Behavior changes: ApplyConfig now returns after the dispatcher/inhibitor finish their initial in-memory load (the same work upstream performs inline on reload), and a fallback request that loses the race with shutdown now gets a 503 instead of silently creating an Alertmanager that leaks. Test changes: two regression tests added, plus the TestDispatcherGroupLimits determinism adaptation described above (test-only; the underlying unsynchronized limit accounting is in the vendored dispatcher and is upstream's to fix).

Known-separate and deliberately not addressed here: the unsynchronized am.dispatcher read in the API GroupFunc closure (pkg/alertmanager/alertmanager.go:291-293) can race a concurrent config reload from API-handler goroutines. That is a distinct race (a pointer-field read, not the WaitGroup internals fixed here) with its own fix shape (upstream publishes the dispatcher through an atomic pointer after WaitForLoading), and is left for a follow-up issue.

Which issue(s) this PR fixes

Fixes #7603

Checklist

  • CHANGELOG.md updated — single [BUGFIX] Alertmanager entry (the entry whose name doesn't end in master).
  • Documentation updated — N/A; no flags or config changed (make doc not required).
  • Tests: two regression tests added, each verified to fail without its fix.
  • Commits signed off (DCO).

Test plan

  • gofmt -l / goimports -local github.com/cortexproject/cortex -l on changed files — clean
  • go vet -tags "netgo slicelabels" ./pkg/alertmanager/... — clean
  • go test -tags "netgo slicelabels" -race -count=1 -run 'TestAlertmanagerStopBeforeDispatcherStart$' ./pkg/alertmanager/ — FAILS before commit 1 (data race, dispatch.go:449 Stop vs dispatch.go:190 Run), PASSES after; -count=30 PASS
  • go test -tags "netgo slicelabels" -race -count=1 -run 'TestMultitenantAlertmanager_NoFallbackCreateAfterShutdown$' ./pkg/alertmanager/ — FAILS before commit 2 (200 + leaked tenant Alertmanager), PASSES after
  • GOMAXPROCS=2 go test -tags "netgo slicelabels" -race -cpu 2 -count=30 -run 'TestMultitenantAlertmanager_ServeHTTPWithFallbackConfig$' ./pkg/alertmanager/ — PASS (and plain -count=30 — PASS)
  • go test -tags "netgo slicelabels" -race -count=30 -run 'TestDispatcherGroupLimits$' ./pkg/alertmanager/ — PASS after the determinism adaptation (baseline on unmodified master: 7/20 FAIL)
  • go test -tags "netgo slicelabels" -race ./pkg/alertmanager/ — PASS (full package)

This PR was developed with AI assistance (multi-agent design review + implementation); all changes, races, and test results were verified as described in the test plan.

🤖 Generated with Claude Code

sandy2008 and others added 2 commits June 11, 2026 09:12
ApplyConfig spawned the per-tenant dispatcher and inhibitor Run goroutines
fire-and-forget. The vendored alertmanager v0.32.1 Dispatcher performs its
WaitGroup's first finished.Add(1) inside the Run goroutine (dispatch.go:190)
while Dispatcher.Stop() calls finished.Wait() (dispatch.go:449), so a Stop
executing before the spawned goroutine gets scheduled violates the
sync.WaitGroup contract (a positive first Add must happen before Wait) and is
reported as a data race under -race. This is reachable in production on every
path that stops a tenant Alertmanager shortly after (re)configuring it:
shutdown (stopping), tenant deactivation (syncConfigs), and config reload
(ApplyConfig stopping the previous generation).

No mutex can fix this: ApplyConfig and Stop bodies are already serialized by
MultitenantAlertmanager.alertmanagersMtx, and the racing actor is the
unstarted goroutine itself, whose first WaitGroup.Add cannot be ordered by
any lock held by the callers.

Fix by adopting upstream's own embedder protocol (cmd/alertmanager/main.go in
prometheus/alertmanager, reload path): start the inhibitor, wait for it to
finish loading, then start the dispatcher and wait for it to finish loading,
before ApplyConfig returns. WaitForLoading() returns only after the Run
goroutine has performed its initial load (program-ordered after the first
finished.Add and after the inhibitor installs its cancel function), so the
existing caller serialization then orders every later Stop() after startup.
This also fixes Inhibitor.Stop() being silently ignored - leaking the
inhibitor goroutine forever - when it executed before Run() installed
ih.cancel, and matches upstream's inhibitor-first ordering so the inhibitor
cache is populated before the dispatcher can notify.

The waits live in ApplyConfig (success path) and deliberately NOT in Stop():
a failed reload leaves a never-Run inhibitor, and waiting for it in Stop
would deadlock shutdown while holding alertmanagersMtx.

The regression test mirrors the supported serialized lifecycle contract
(create/reload via ApplyConfig, then StopAndWait) and fails deterministically
under -race without this fix, including one double-ApplyConfig iteration to
cover the reload path.

Also adapt TestDispatcherGroupLimits to put its alerts before ApplyConfig:
it asserts an exact aggregation-group-limit counter value, but the vendored
dispatcher's limit check reads the group counter without synchronization
with concurrent group creation, so alerts routed through the concurrent
ingestion workers produce a nondeterministic count. The test was already
flaky on master for this reason (7/20 runs on a 12-core host) and the
startup barrier increases the exposure by guaranteeing the initial slurp is
empty under the old ordering. With the alerts put first, the dispatcher
routes all of them sequentially from its initial slurp - which the barrier
guarantees completes before ApplyConfig returns - making the count exact on
every run.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com>
MultitenantAlertmanager.stopping() stops only the per-tenant alertmanagers
present in the map at the time it runs. A request on the fallback lazy-create
path (serveRequest -> alertmanagerFromFallbackConfig -> setConfig) that loses
the race with shutdown could still register a brand-new per-tenant
Alertmanager afterwards - or restart the dispatcher/inhibitor of an already
stopped one via existing.ApplyConfig - and nothing would ever stop it again:
its goroutines and its (closed) alert provider would leak past shutdown.
This is reachable because serveRequest can be entered without ServeHTTP's
service-state check (the gRPC HandleRequest path when sharding is enabled),
or after passing that check just before the state transition.

Gate it with a shuttingDown flag guarded by the existing alertmanagersMtx:
stopping() sets it under the lock before stopping the tenants; setConfig
performs the authoritative check right after taking the lock and returns the
errAlertmanagerShuttingDown sentinel; alertmanagerFromFallbackConfig makes a
cheap best-effort check before uploading the empty config to the alert store;
serveRequest maps the sentinel to HTTP 503, consistent with ServeHTTP's
existing 'Alertmanager not ready' response - a request losing this race gets
the same answer it would have gotten a moment later.

The regression test drives a real service shutdown and then exercises the
fallback lazy-create path directly: without the gate it deterministically
registers (and leaks) a new tenant Alertmanager and returns 200; with the
gate it returns 503 and the alertmanagers map stays empty.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com>
@dosubot dosubot Bot added component/alertmanager go Pull requests that update Go code type/bug type/production Issues related to the production use of Cortex, inc. configuration, alerting and operating. labels Jun 11, 2026
… CHANGELOG

The check-modernize lint step rewrites the T1 loop to range-over-int;
also point the CHANGELOG entry at this PR (cortexproject#7618) instead of the issue,
matching the repo's CHANGELOG convention.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

component/alertmanager go Pull requests that update Go code size/L type/bug type/production Issues related to the production use of Cortex, inc. configuration, alerting and operating.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Flaky test: TestMultitenantAlertmanager_ServeHTTPWithFallbackConfig — data race between ApplyConfig goroutine spawn and Stop()

1 participant