From ba0edf7d286fd74a5a7cfd70532ec71821f9b3ff Mon Sep 17 00:00:00 2001 From: Sandy Chen Date: Thu, 11 Jun 2026 08:56:43 +0900 Subject: [PATCH 1/3] fix(alertmanager): wait for dispatcher/inhibitor startup in ApplyConfig ApplyConfig spawned the per-tenant dispatcher and inhibitor Run goroutines fire-and-forget. The vendored alertmanager v0.32.1 Dispatcher performs its WaitGroup's first finished.Add(1) inside the Run goroutine (dispatch.go:190) while Dispatcher.Stop() calls finished.Wait() (dispatch.go:449), so a Stop executing before the spawned goroutine gets scheduled violates the sync.WaitGroup contract (a positive first Add must happen before Wait) and is reported as a data race under -race. This is reachable in production on every path that stops a tenant Alertmanager shortly after (re)configuring it: shutdown (stopping), tenant deactivation (syncConfigs), and config reload (ApplyConfig stopping the previous generation). No mutex can fix this: ApplyConfig and Stop bodies are already serialized by MultitenantAlertmanager.alertmanagersMtx, and the racing actor is the unstarted goroutine itself, whose first WaitGroup.Add cannot be ordered by any lock held by the callers. Fix by adopting upstream's own embedder protocol (cmd/alertmanager/main.go in prometheus/alertmanager, reload path): start the inhibitor, wait for it to finish loading, then start the dispatcher and wait for it to finish loading, before ApplyConfig returns. WaitForLoading() returns only after the Run goroutine has performed its initial load (program-ordered after the first finished.Add and after the inhibitor installs its cancel function), so the existing caller serialization then orders every later Stop() after startup. This also fixes Inhibitor.Stop() being silently ignored - leaking the inhibitor goroutine forever - when it executed before Run() installed ih.cancel, and matches upstream's inhibitor-first ordering so the inhibitor cache is populated before the dispatcher can notify. The waits live in ApplyConfig (success path) and deliberately NOT in Stop(): a failed reload leaves a never-Run inhibitor, and waiting for it in Stop would deadlock shutdown while holding alertmanagersMtx. The regression test mirrors the supported serialized lifecycle contract (create/reload via ApplyConfig, then StopAndWait) and fails deterministically under -race without this fix, including one double-ApplyConfig iteration to cover the reload path. Also adapt TestDispatcherGroupLimits to put its alerts before ApplyConfig: it asserts an exact aggregation-group-limit counter value, but the vendored dispatcher's limit check reads the group counter without synchronization with concurrent group creation, so alerts routed through the concurrent ingestion workers produce a nondeterministic count. The test was already flaky on master for this reason (7/20 runs on a 12-core host) and the startup barrier increases the exposure by guaranteeing the initial slurp is empty under the old ordering. With the alerts put first, the dispatcher routes all of them sequentially from its initial slurp - which the barrier guarantees completes before ApplyConfig returns - making the count exact on every run. Co-Authored-By: Claude Fable 5 Signed-off-by: Sandy Chen --- CHANGELOG.md | 1 + pkg/alertmanager/alertmanager.go | 17 +++++++- pkg/alertmanager/alertmanager_test.go | 60 ++++++++++++++++++++++++++- 3 files changed, 76 insertions(+), 2 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index 0756d48d25..fe8836c0ab 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -57,6 +57,7 @@ * [BUGFIX] Query Frontend: Fix native histogram responses not being handled correctly in `minTime()` sort ordering for split_by_interval merge. #7555 * [BUGFIX] Distributor: Release the push worker pool goroutines on shutdown by stopping the async executor during the stopping phase when `-distributor.num-push-workers` is set. #7602 * [BUGFIX] Querier: Fix unbounded resource leak in the bucket-scan blocks finder (used when the bucket index is disabled). Per-tenant metadata fetchers, their Prometheus registries, and on-disk meta caches are now evicted once a tenant is no longer active, instead of being retained for the lifetime of the process. #7573 +* [BUGFIX] Alertmanager: Fix data race between ApplyConfig's dispatcher/inhibitor startup and Stop during config reload and shutdown. #7603 ## 1.21.0 2026-04-24 diff --git a/pkg/alertmanager/alertmanager.go b/pkg/alertmanager/alertmanager.go index 602f6b40e0..b9b776cbfb 100644 --- a/pkg/alertmanager/alertmanager.go +++ b/pkg/alertmanager/alertmanager.go @@ -417,8 +417,23 @@ func (am *Alertmanager) ApplyConfig(userID string, conf *config.Config, rawCfg s am.dispatcherMetrics, ) - go am.dispatcher.Run(time.Now()) + // Start the inhibitor and dispatcher and wait for each to finish loading + // before returning, mirroring upstream cmd/alertmanager (inhibitor first, + // so its cache is populated before the dispatcher can notify). + // The waits create the happens-before edge between the startup of the + // goroutines spawned here and any later Stop(): Dispatcher.Stop() calls + // finished.Wait(), and running that concurrently with the WaitGroup's first + // Add() inside Dispatcher.Run() is a data race by the Go memory model (what + // -race reports in #7603). They likewise prevent Stop() from being silently + // ignored - leaking the inhibitor goroutine forever - when it executes + // before Run() has installed ih.cancel. ApplyConfig and Stop are serialized + // by the callers (MultitenantAlertmanager.alertmanagersMtx), so once this + // returns any later Stop() is ordered after dispatcher/inhibitor startup. go am.inhibitor.Run() + am.inhibitor.WaitForLoading() + + go am.dispatcher.Run(time.Now()) + am.dispatcher.WaitForLoading() am.configHashMetric.Set(md5HashAsMetricValue([]byte(rawCfg))) return nil diff --git a/pkg/alertmanager/alertmanager_test.go b/pkg/alertmanager/alertmanager_test.go index da478d4fa9..918c5959f2 100644 --- a/pkg/alertmanager/alertmanager_test.go +++ b/pkg/alertmanager/alertmanager_test.go @@ -135,7 +135,6 @@ route: cfg, err := config.Load(cfgRaw) require.NoError(t, err) - require.NoError(t, am.ApplyConfig(user, cfg, cfgRaw)) now := time.Now() @@ -176,6 +175,15 @@ route: require.NoError(t, am.alerts.Put(context.Background(), inputAlerts...)) } + // Apply the config after the alerts were put: the dispatcher then routes all + // of them synchronously, from its initial slurp (a single goroutine), before + // ApplyConfig returns. Routing them through the concurrent post-loading + // ingestion workers instead would make the aggregation-group limit + // accounting - and so the metric asserted below - nondeterministic, because + // the vendored dispatcher's limit check reads the group counter without + // synchronization with concurrent group creation. + require.NoError(t, am.ApplyConfig(user, cfg, cfgRaw)) + // Give it some time, as alerts are sent to dispatcher asynchronously. test.Poll(t, 3*time.Second, nil, func() any { return testutil.GatherAndCompare(reg, strings.NewReader(fmt.Sprintf(` @@ -186,6 +194,56 @@ route: }) } +// TestAlertmanagerStopBeforeDispatcherStart is a regression test for the data +// race between ApplyConfig and Stop (issue #7603): ApplyConfig used to spawn +// the dispatcher and inhibitor Run goroutines without waiting for them to +// start, so a Stop shortly after could run Dispatcher.Stop's finished.Wait() +// concurrently with the WaitGroup's first Add() inside Dispatcher.Run() - a +// WaitGroup contract violation reported under -race - and could be silently +// ignored by an inhibitor whose Run had not yet installed its cancel function. +// The loop intentionally mirrors the supported, serialized lifecycle contract +// (callers of ApplyConfig and Stop are serialized by +// MultitenantAlertmanager.alertmanagersMtx): no concurrency is needed to +// trigger the race because the racing actor is the spawned Run goroutine +// itself. +func TestAlertmanagerStopBeforeDispatcherStart(t *testing.T) { + const user = "test" + + cfgRaw := `receivers: +- name: 'prod' + +route: + group_by: ['alertname'] + receiver: 'prod'` + + cfg, err := config.Load(cfgRaw) + require.NoError(t, err) + + for i := 0; i < 30; i++ { + am, err := New(&Config{ + UserID: user, + Logger: log.NewNopLogger(), + Limits: &mockAlertManagerLimits{}, + TenantDataDir: t.TempDir(), + ExternalURL: &url.URL{Path: "/am"}, + ShardingEnabled: false, + GCInterval: 30 * time.Minute, + }, prometheus.NewPedanticRegistry()) + require.NoError(t, err) + + require.NoError(t, am.ApplyConfig(user, cfg, cfgRaw)) + + // One iteration also exercises the reload path: a second ApplyConfig + // stops the previous generation's dispatcher and inhibitor right after + // their Run goroutines were spawned. + if i == 0 { + require.NoError(t, am.ApplyConfig(user, cfg, cfgRaw)) + } + + am.StopAndWait() + } +} + var ( alert1 = model.Alert{ Labels: model.LabelSet{"alert": "first", "alertname": "alert1"}, From 3035541d9cabf0831debdafebf65b81dfb809f86 Mon Sep 17 00:00:00 2001 From: Sandy Chen Date: Thu, 11 Jun 2026 09:00:32 +0900 Subject: [PATCH 2/3] fix(alertmanager): reject lazy tenant creation after shutdown begins MultitenantAlertmanager.stopping() stops only the per-tenant alertmanagers present in the map at the time it runs. A request on the fallback lazy-create path (serveRequest -> alertmanagerFromFallbackConfig -> setConfig) that loses the race with shutdown could still register a brand-new per-tenant Alertmanager afterwards - or restart the dispatcher/inhibitor of an already stopped one via existing.ApplyConfig - and nothing would ever stop it again: its goroutines and its (closed) alert provider would leak past shutdown. This is reachable because serveRequest can be entered without ServeHTTP's service-state check (the gRPC HandleRequest path when sharding is enabled), or after passing that check just before the state transition. Gate it with a shuttingDown flag guarded by the existing alertmanagersMtx: stopping() sets it under the lock before stopping the tenants; setConfig performs the authoritative check right after taking the lock and returns the errAlertmanagerShuttingDown sentinel; alertmanagerFromFallbackConfig makes a cheap best-effort check before uploading the empty config to the alert store; serveRequest maps the sentinel to HTTP 503, consistent with ServeHTTP's existing 'Alertmanager not ready' response - a request losing this race gets the same answer it would have gotten a moment later. The regression test drives a real service shutdown and then exercises the fallback lazy-create path directly: without the gate it deterministically registers (and leaks) a new tenant Alertmanager and returns 200; with the gate it returns 503 and the alertmanagers map stays empty. Co-Authored-By: Claude Fable 5 Signed-off-by: Sandy Chen --- CHANGELOG.md | 2 +- pkg/alertmanager/multitenant.go | 31 ++++++++++++++++ pkg/alertmanager/multitenant_test.go | 54 ++++++++++++++++++++++++++++ 3 files changed, 86 insertions(+), 1 deletion(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index fe8836c0ab..f18fa0c87e 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -57,7 +57,7 @@ * [BUGFIX] Query Frontend: Fix native histogram responses not being handled correctly in `minTime()` sort ordering for split_by_interval merge. #7555 * [BUGFIX] Distributor: Release the push worker pool goroutines on shutdown by stopping the async executor during the stopping phase when `-distributor.num-push-workers` is set. #7602 * [BUGFIX] Querier: Fix unbounded resource leak in the bucket-scan blocks finder (used when the bucket index is disabled). Per-tenant metadata fetchers, their Prometheus registries, and on-disk meta caches are now evicted once a tenant is no longer active, instead of being retained for the lifetime of the process. #7573 -* [BUGFIX] Alertmanager: Fix data race between ApplyConfig's dispatcher/inhibitor startup and Stop during config reload and shutdown. #7603 +* [BUGFIX] Alertmanager: Fix data race between ApplyConfig's dispatcher/inhibitor startup and Stop during config reload and shutdown, and reject lazy tenant creation after shutdown begins. #7603 ## 1.21.0 2026-04-24 diff --git a/pkg/alertmanager/multitenant.go b/pkg/alertmanager/multitenant.go index 064abc42a0..6f709cb2af 100644 --- a/pkg/alertmanager/multitenant.go +++ b/pkg/alertmanager/multitenant.go @@ -58,6 +58,9 @@ var ( errInvalidExternalURL = errors.New("the configured external URL is invalid: should not end with /") errShardingUnsupportedStorage = errors.New("the configured alertmanager storage backend is not supported when sharding is enabled") errZoneAwarenessEnabledWithoutZoneInfo = errors.New("the configured alertmanager has zone awareness enabled but zone is not set") + // errAlertmanagerShuttingDown is returned when a per-tenant alertmanager cannot be + // created or reconfigured because the MultitenantAlertmanager is shutting down. + errAlertmanagerShuttingDown = errors.New("alertmanager is shutting down") ) // MultitenantAlertmanagerConfig is the configuration for a multitenant Alertmanager. @@ -273,6 +276,10 @@ type MultitenantAlertmanager struct { // Stores the current set of configurations we're running in each tenant's Alertmanager. // Used for comparing configurations as we synchronize them. cfgs map[string]alertspb.AlertConfigDesc + // shuttingDown is set by stopping() before it stops the per-tenant alertmanagers, and + // prevents new per-tenant alertmanagers from being created or reconfigured afterwards + // (they would never be stopped). Guarded by alertmanagersMtx. + shuttingDown bool logger log.Logger alertmanagerMetrics *alertmanagerMetrics @@ -656,6 +663,10 @@ func (am *MultitenantAlertmanager) waitInitialStateSync(ctx context.Context) err // stopping runs when MultitenantAlertmanager transitions to Stopping state. func (am *MultitenantAlertmanager) stopping(_ error) error { am.alertmanagersMtx.Lock() + // Reject any further per-tenant alertmanager creation (e.g. the fallback + // lazy-create path of an in-flight request): only the alertmanagers present + // in the map right now get stopped below. + am.shuttingDown = true for _, am := range am.alertmanagers { am.StopAndWait() } @@ -820,6 +831,13 @@ func (am *MultitenantAlertmanager) setConfig(cfg alertspb.AlertConfigDesc) error am.alertmanagersMtx.Lock() defer am.alertmanagersMtx.Unlock() + + // Once shutdown has begun, don't create a new per-tenant alertmanager (it would + // never be stopped) nor restart the dispatcher/inhibitor of an already stopped one. + if am.shuttingDown { + return errAlertmanagerShuttingDown + } + existing, hasExisting := am.alertmanagers[cfg.User] rawCfg := cfg.RawConfig @@ -999,6 +1017,10 @@ func (am *MultitenantAlertmanager) serveRequest(w http.ResponseWriter, req *http if am.fallbackConfig != "" { userAM, err = am.alertmanagerFromFallbackConfig(userID) + if errors.Is(err, errAlertmanagerShuttingDown) { + http.Error(w, "Alertmanager is shutting down", http.StatusServiceUnavailable) + return + } if err != nil { level.Error(am.logger).Log("msg", "unable to initialize the Alertmanager with a fallback configuration", "user", userID, "err", err) http.Error(w, "Failed to initialize the Alertmanager", http.StatusInternalServerError) @@ -1014,6 +1036,15 @@ func (am *MultitenantAlertmanager) serveRequest(w http.ResponseWriter, req *http } func (am *MultitenantAlertmanager) alertmanagerFromFallbackConfig(userID string) (*Alertmanager, error) { + // Avoid the config upload below if shutdown has already begun. This check is + // best-effort (the authoritative one is in setConfig, under the same lock). + am.alertmanagersMtx.Lock() + shuttingDown := am.shuttingDown + am.alertmanagersMtx.Unlock() + if shuttingDown { + return nil, errAlertmanagerShuttingDown + } + // Upload an empty config so that the Alertmanager is no de-activated in the next poll cfgDesc := alertspb.ToProto("", nil, userID) err := am.store.SetAlertConfig(context.Background(), cfgDesc) diff --git a/pkg/alertmanager/multitenant_test.go b/pkg/alertmanager/multitenant_test.go index fb13da8745..26c17c720c 100644 --- a/pkg/alertmanager/multitenant_test.go +++ b/pkg/alertmanager/multitenant_test.go @@ -1181,6 +1181,60 @@ receivers: require.Equal(t, http.StatusOK, resp.StatusCode) } +// TestMultitenantAlertmanager_NoFallbackCreateAfterShutdown verifies that a +// request hitting the fallback lazy-create path after shutdown has begun does +// not register (and leak) a new per-tenant Alertmanager: stopping() only stops +// the alertmanagers present in the map when it runs, so one created afterwards +// would never be stopped. serveRequest is exercised directly because requests +// can reach it without passing ServeHTTP's service-state check (e.g. via the +// gRPC HandleRequest path when sharding is enabled), or after passing that +// check just before the state transition. +func TestMultitenantAlertmanager_NoFallbackCreateAfterShutdown(t *testing.T) { + ctx := context.Background() + amConfig := mockAlertmanagerConfig(t) + + // Run this test using a real storage client. + store, err := prepareInMemoryAlertStore() + require.NoError(t, err) + + externalURL := flagext.URLValue{} + err = externalURL.Set("http://localhost:8080/alertmanager") + require.NoError(t, err) + + fallbackCfg := ` +global: + smtp_smarthost: 'localhost:25' + smtp_from: 'youraddress@example.org' +route: + receiver: example-email +receivers: + - name: example-email + email_configs: + - to: 'youraddress@example.org' +` + amConfig.ExternalURL = externalURL + + // Create the Multitenant Alertmanager. + am, err := createMultitenantAlertmanager(amConfig, nil, nil, store, nil, nil, log.NewNopLogger(), nil) + require.NoError(t, err) + am.fallbackConfig = fallbackCfg + + require.NoError(t, services.StartAndAwaitRunning(ctx, am)) + + // Shut it down. No per-tenant alertmanager was ever created. + require.NoError(t, services.StopAndAwaitTerminated(ctx, am)) + + // A request for an unconfigured tenant must not lazily create a new + // per-tenant Alertmanager anymore. + req := httptest.NewRequest("GET", externalURL.String()+"/api/v2/status", nil) + w := httptest.NewRecorder() + am.serveRequest(w, req.WithContext(user.InjectOrgID(req.Context(), "user1"))) + + resp := w.Result() + require.Equal(t, http.StatusServiceUnavailable, resp.StatusCode) + require.Empty(t, am.alertmanagers) +} + func TestMultitenantAlertmanager_InitialSyncWithSharding(t *testing.T) { tg := ring.NewRandomTokenGenerator() tc := []struct { From 26f45000061f3213cdc3c393bd358facfc94bcf0 Mon Sep 17 00:00:00 2001 From: Sandy Chen Date: Thu, 11 Jun 2026 15:55:49 +0900 Subject: [PATCH 3/3] fix(alertmanager): modernize regression test loop and reference PR in CHANGELOG The check-modernize lint step rewrites the T1 loop to range-over-int; also point the CHANGELOG entry at this PR (#7618) instead of the issue, matching the repo's CHANGELOG convention. Co-Authored-By: Claude Fable 5 Signed-off-by: Sandy Chen --- CHANGELOG.md | 2 +- pkg/alertmanager/alertmanager_test.go | 2 +- 2 files changed, 2 insertions(+), 2 deletions(-) diff --git a/CHANGELOG.md b/CHANGELOG.md index f18fa0c87e..2552993b3c 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -57,7 +57,7 @@ * [BUGFIX] Query Frontend: Fix native histogram responses not being handled correctly in `minTime()` sort ordering for split_by_interval merge. #7555 * [BUGFIX] Distributor: Release the push worker pool goroutines on shutdown by stopping the async executor during the stopping phase when `-distributor.num-push-workers` is set. #7602 * [BUGFIX] Querier: Fix unbounded resource leak in the bucket-scan blocks finder (used when the bucket index is disabled). Per-tenant metadata fetchers, their Prometheus registries, and on-disk meta caches are now evicted once a tenant is no longer active, instead of being retained for the lifetime of the process. #7573 -* [BUGFIX] Alertmanager: Fix data race between ApplyConfig's dispatcher/inhibitor startup and Stop during config reload and shutdown, and reject lazy tenant creation after shutdown begins. #7603 +* [BUGFIX] Alertmanager: Fix data race between ApplyConfig's dispatcher/inhibitor startup and Stop during config reload and shutdown, and reject lazy tenant creation after shutdown begins. #7618 ## 1.21.0 2026-04-24 diff --git a/pkg/alertmanager/alertmanager_test.go b/pkg/alertmanager/alertmanager_test.go index 918c5959f2..04352c420b 100644 --- a/pkg/alertmanager/alertmanager_test.go +++ b/pkg/alertmanager/alertmanager_test.go @@ -219,7 +219,7 @@ route: cfg, err := config.Load(cfgRaw) require.NoError(t, err) - for i := 0; i < 30; i++ { + for i := range 30 { am, err := New(&Config{ UserID: user, Logger: log.NewNopLogger(),