EV-6666: Surface Alertmanager alerts on the manager Alerts page #4879
rene-dekker wants to merge 12 commits into
electricjesus left a comment
Drive-by review, courtesy of a quest from Tigera Town 🤣. Mostly looks good. One thing I think blocks merge, plus a few mediums, left inline.
Cross-PR ordering: these two have to ship together, and this PR can't stand alone. The operator's own RBAC for alertmanagerconfigs lives in the calico-private charts in tigera/calico-private#12184, so if this vendors into the operator ahead of that, the monitor controller can't create the AlertmanagerConfig and goes degraded. Same on the receiving end: without #12184 the /api/v1/events/alertmanager endpoint 404s and Linseed rejects the prometheus_alert type. Worth pinning both to the same release and noting the dependency on each PR while they're still draft.
One nice-to-have I noticed but won't block on: the config-hash annotation that rolls the pod doesn't include the token secret data, so the pod won't roll when Kubernetes first populates the token. It relies on the config-reloader watching the mounted secret. Probably fine, worth a sanity check.
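To make the nice-to-have concrete, here is a minimal sketch of a config hash that also folds in secret data, so the annotation changes once the token is populated. `configHash` and its inputs are hypothetical names for illustration, not the operator's actual helper.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
	"sort"
)

// configHash is a hypothetical helper (not the operator's real function).
// It hashes the rendered config together with extra secret data, so a change
// in either one yields a different pod annotation value.
func configHash(config []byte, secretData map[string][]byte) string {
	h := sha256.New()
	h.Write(config)

	// Sort the keys so the hash does not depend on map iteration order.
	keys := make([]string, 0, len(secretData))
	for k := range secretData {
		keys = append(keys, k)
	}
	sort.Strings(keys)

	for _, k := range keys {
		h.Write([]byte{0}) // separator between fields
		h.Write([]byte(k))
		h.Write([]byte{0})
		h.Write(secretData[k])
	}
	return hex.EncodeToString(h.Sum(nil))
}

func main() {
	cfg := []byte("route: {receiver: linseed}")
	empty := configHash(cfg, map[string][]byte{})
	populated := configHash(cfg, map[string][]byte{"token": []byte("populated")})
	fmt.Println("hash changed after token was populated:", empty != populated)
}
```

Whether the extra roll is worth it depends on how reliably the config-reloader picks up the mounted secret, which is the sanity check suggested above.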
// monitor.AlertmanagerConfigName), the operator renders a copy of it in tigera-prometheus.
// Otherwise it renders the operator's default config (the Linseed webhook receiver when the UI
// alerts integration is enabled, or a null receiver when disabled).
func (r *ReconcileMonitor) readAlertmanagerConfig(ctx context.Context, uiAlertsEnabled bool) (*monitoringv1alpha1.AlertmanagerConfig, error) {
This drops existing customer Alertmanager config on upgrade, and I don't think we can ship it that way.
Today the customization path is the raw alertmanager-calico-node-alertmanager secret. It's documented, and the old readAlertmanagerConfigSecret carried all that owner-ref logic precisely to leave a user-modified secret alone. This PR deletes that secret and only reads config from an AlertmanagerConfig CR the customer has never created. So on upgrade, anyone who set their own receivers (PagerDuty, Slack, email) loses them and falls back to the default Linseed webhook. Their external paging stops and the alerts quietly reroute to the manager UI instead.
Options, in order of how much I'd trust them:
- Migrate: if the legacy secret exists and differs from the old default, parse it and seed the AlertmanagerConfig before deleting the secret.
- Failing that, detect a non-default legacy secret and SetDegraded with a clear message instead of silently replacing it, so the upgrade isn't invisible.
Either way the release note has to call this out as a breaking change. Right now it only describes the new feature.
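Both options start from the same check: is the legacy secret a known default or has the user changed it? A rough sketch of that classification follows. Every name here (`legacyAction`, `classifyLegacySecret`, the `knownDefaults` list) is hypothetical, and the byte comparison assumes the defaults can be matched after trimming whitespace.

```go
package main

import (
	"bytes"
	"fmt"
)

// legacyAction is a hypothetical enum describing what to do with the
// legacy alertmanager-calico-node-alertmanager secret on upgrade.
type legacyAction int

const (
	actionNone             legacyAction = iota // no legacy secret present
	actionDelete                               // matches a known default; safe to delete
	actionMigrateOrDegrade                     // user-modified; migrate it or set degraded
)

// classifyLegacySecret compares the legacy config against every default the
// operator (or the stock install) is known to have written. Anything else is
// treated as a customer customization that must not be silently dropped.
func classifyLegacySecret(legacy []byte, knownDefaults [][]byte) legacyAction {
	if legacy == nil {
		return actionNone
	}
	for _, d := range knownDefaults {
		if bytes.Equal(bytes.TrimSpace(legacy), bytes.TrimSpace(d)) {
			return actionDelete
		}
	}
	return actionMigrateOrDegrade
}

func main() {
	defaults := [][]byte{[]byte("receivers: [{name: default}]")}
	custom := []byte("receivers: [{name: pagerduty}]")
	fmt.Println(classifyLegacySecret(custom, defaults) == actionMigrateOrDegrade)
}
```

The list of defaults would need to include the stock config as well as the operator's own variants, otherwise an untouched install would be misread as customized.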
// The Linseed bearer-token secret is only needed when Alertmanager is running and forwarding
// alerts to Linseed (the UI alerts integration is enabled); otherwise remove it.
if mc.alertmanagerReplicas() > 0 && mc.cfg.Monitor.UIAlertsEnabled() {
Two things about the disable toggle when a user brings their own AlertmanagerConfig.
The toggle only swaps the default. If a user has their own AlertmanagerConfig in the operator namespace, uiAlertsIntegration: Disabled does nothing, since we copy their spec verbatim. The field doc says it "controls whether alerts are forwarded to Linseed," which won't hold for those users. Worth documenting the precedence, or deciding whether disable should win regardless.
Separately, the Linseed token secret and the tigera-alertmanager-linseed ClusterRole/Binding get created whenever Alertmanager runs with the integration enabled, even if the user's own config never talks to Linseed. That leaves a token secret and an event-create grant nothing uses. Not harmful, but it's a dangling credential. Could gate those on the default-config path rather than on UIAlertsEnabled alone.
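The two points can be sketched as a pair of small decision functions. This is only an illustration of the precedence as I read it from the diff and of the suggested gating; `monitorState`, `configSource` and `needsLinseedCredentials` are hypothetical names, not code from the PR.

```go
package main

import "fmt"

// monitorState is a hypothetical summary of the inputs discussed above.
type monitorState struct {
	alertmanagerReplicas int
	uiAlertsEnabled      bool
	userConfigPresent    bool // user supplied their own AlertmanagerConfig
}

// configSource mirrors the precedence described in the comment: a
// user-supplied config is copied verbatim, and the toggle only chooses
// between the two operator defaults.
func configSource(s monitorState) string {
	switch {
	case s.userConfigPresent:
		return "user-config"
	case s.uiAlertsEnabled:
		return "linseed-webhook"
	default:
		return "null-receiver"
	}
}

// needsLinseedCredentials sketches the suggested gating: only render the
// token secret and ClusterRole/Binding on the default-config path.
func needsLinseedCredentials(s monitorState) bool {
	return s.alertmanagerReplicas > 0 && s.uiAlertsEnabled && !s.userConfigPresent
}

func main() {
	s := monitorState{alertmanagerReplicas: 1, uiAlertsEnabled: true, userConfigPresent: true}
	fmt.Println(configSource(s), needsLinseedCredentials(s))
}
```

If the decision goes the other way and Disabled should win regardless, `configSource` would check the toggle before the user config instead.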
}

// +kubebuilder:validation:Enum=Enabled;Disabled
type UIAlertsIntegrationStatusType string
UIAlertsIntegrationStatusType reads like a status field, but this is a spec enum. It's public API and awkward to rename after release, so I'd fix it now. UIAlertsIntegrationType or UIAlertsIntegrationMode matches what it actually is.
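For reference, a sketch of how the renamed enum could look. The constant names and the cut-down `MonitorSpec` are assumptions for illustration; only the type rename and the Enabled/Disabled values come from the PR.

```go
package main

import "fmt"

// UIAlertsIntegrationType is the suggested name for the spec enum.
// +kubebuilder:validation:Enum=Enabled;Disabled
type UIAlertsIntegrationType string

// The constant names below are hypothetical.
const (
	UIAlertsIntegrationEnabled  UIAlertsIntegrationType = "Enabled"
	UIAlertsIntegrationDisabled UIAlertsIntegrationType = "Disabled"
)

// MonitorSpec is a cut-down stand-in for the real spec type; the pointer
// field is an assumption so that "unset" can be told apart from "Disabled".
type MonitorSpec struct {
	UIAlertsIntegration *UIAlertsIntegrationType
}

// UIAlertsEnabled treats an unset field as Enabled, matching the documented default.
func (s *MonitorSpec) UIAlertsEnabled() bool {
	return s == nil || s.UIAlertsIntegration == nil ||
		*s.UIAlertsIntegration != UIAlertsIntegrationDisabled
}

func main() {
	spec := &MonitorSpec{}
	fmt.Println("enabled when unset:", spec.UIAlertsEnabled())
}
```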
Add a Linseed network policy ingress rule permitting traffic from the Alertmanager pods in the tigera-prometheus namespace, so Alertmanager can push Prometheus alerts to Linseed as events. The Alertmanager egress policy already allows all TCP egress, so only the Linseed ingress side was missing. Exports monitor.AlertmanagerSourceEntityRule as the single source of truth for the Alertmanager pod selector. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add a ClusterRole granting create on events (linseed.tigera.io), bound to the prometheus service account that Alertmanager runs as. Linseed authorizes writes via SubjectAccessReview, so this lets Alertmanager push Prometheus alerts to Linseed as events using its existing service account token. The role/binding are rendered only when Alertmanager is enabled and removed otherwise. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replace the placeholder Alertmanager webhook receiver with one that posts to Linseed's /api/v1/events/alertmanager endpoint, so Prometheus alerts surface on the Alerts UI page. Linseed requires mTLS plus a bearer token, so the Alertmanager spec now mounts the prometheus client TLS key pair and the trusted CA bundle, and the webhook http_config references them along with the service account token. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Add a UIAlertsIntegration (Enabled|Disabled) field to the Monitor spec that controls whether Prometheus/Alertmanager alerts are forwarded to Linseed and surfaced on the manager Alerts page (defaults to Enabled). When disabled, the operator renders an Alertmanager config that routes to a null receiver instead of the Linseed webhook. The config secret is regenerated to the selected variant when the operator owns it, so the toggle takes effect at runtime. A hash of the Alertmanager config is added as a pod annotation so that config changes roll the Alertmanager pod and reload the new config. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replace the raw alertmanager.yaml config secret with an AlertmanagerConfig custom resource referenced by Alertmanager.spec.alertmanagerConfiguration:
- If the user supplies an AlertmanagerConfig named calico-node-alertmanager in the tigera-operator namespace, the operator renders a copy of it in tigera-prometheus. Otherwise it renders a default: the Linseed webhook receiver when the UI alerts integration is enabled, or a null receiver when disabled.
- The webhook authenticates to Linseed with the Linseed-issued bearer token secret for the prometheus service account (prometheus-tigera-linseed-token) and the client cert / CA bundle, all referenced from the CR; the prometheus-operator mounts them into the Alertmanager pod, so the explicit Secrets/ConfigMaps mounts are removed.
- The pod is annotated with a hash of the AlertmanagerConfig spec, client cert and CA bundle so any config change rolls the pod.
- The legacy alertmanager-calico-node-alertmanager config secret is now deleted.
This also fixes the upgrade gap where a pre-existing (stock) config secret was left untouched because it matched neither operator default, so the integration never wired up.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
(cherry picked from commit 044b64c)
Add IPPoolNearlyExhausted (>=90%, warning) and IPPoolExhausted (>=100%, critical) rules to the rendered calico PrometheusRule. Utilisation is computed per pool as sum by (ippool) (ipam_allocations_in_use) / sum by (ippool) (ipam_ippool_size); summing both sides on ippool aggregates the per-node allocations and collapses scrape labels so the metrics match on ippool alone. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
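The per-pool arithmetic behind the two rules can be sketched outside PromQL as follows. The `sample` type and function names are hypothetical. Unlike the Prometheus rules, where both alerts fire at 100%, this sketch reports only the most severe one per pool.

```go
package main

import "fmt"

// sample is a hypothetical stand-in for one scraped series value,
// keeping only the ippool label.
type sample struct {
	pool  string
	value float64
}

// sumByPool mirrors "sum by (ippool) (...)": all other labels are dropped
// and values for the same pool are added together.
func sumByPool(samples []sample) map[string]float64 {
	out := map[string]float64{}
	for _, s := range samples {
		out[s.pool] += s.value
	}
	return out
}

// classifyPools divides in-use by size per pool and applies the
// thresholds from the commit message (>=90% warning, >=100% critical).
func classifyPools(inUse, size []sample) map[string]string {
	used, total := sumByPool(inUse), sumByPool(size)
	alerts := map[string]string{}
	for pool, t := range total {
		if t == 0 {
			continue // avoid dividing by zero for an empty pool
		}
		switch u := used[pool] / t; {
		case u >= 1.0:
			alerts[pool] = "IPPoolExhausted"
		case u >= 0.9:
			alerts[pool] = "IPPoolNearlyExhausted"
		}
	}
	return alerts
}

func main() {
	// Two nodes report allocations for pool-a; together they use 95 of 100.
	inUse := []sample{{"pool-a", 60}, {"pool-a", 35}, {"pool-b", 10}}
	size := []sample{{"pool-a", 100}, {"pool-b", 256}}
	fmt.Println(classifyPools(inUse, size))
}
```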
Replace the single UIAlertsIntegration toggle with a per-alert config under
Monitor.spec.alerts. The alert set is a curated, closed catalog (one field per
alert), each {Status: Enabled|Disabled}, defaulting to Enabled when unset so new
alerts ship on automatically. Warning+critical rule pairs are folded into one
logical alert. Fields: deniedPackets, tigeraStatus, tlsCertExpiry, licenseExpiry,
ipPoolExhaustion.
Also rename the DeniedPacketsRate rule to DeniedPackets (drop "high rate" wording)
and default Alertmanager to 1 replica so the alerts feature is on by default.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
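A rough sketch of the shape this commit describes: one field per alert, each defaulting to Enabled when unset. The Go type and field names below are guesses derived from the JSON field names in the message, not the actual API types.

```go
package main

import "fmt"

// AlertStatus and the types below are hypothetical; only the field set and
// the Enabled/Disabled values come from the commit message.
type AlertStatus string

const (
	AlertEnabled  AlertStatus = "Enabled"
	AlertDisabled AlertStatus = "Disabled"
)

type AlertConfig struct {
	Status AlertStatus
}

// Alerts is the closed catalog: one optional field per logical alert.
type Alerts struct {
	DeniedPackets    *AlertConfig
	TigeraStatus     *AlertConfig
	TLSCertExpiry    *AlertConfig
	LicenseExpiry    *AlertConfig
	IPPoolExhaustion *AlertConfig
}

// isEnabled treats a missing entry or empty status as Enabled, so alerts
// added in a later release are on without any spec change.
func isEnabled(c *AlertConfig) bool {
	return c == nil || c.Status != AlertDisabled
}

// Enabled lists the names of the alerts that are currently on.
func (a Alerts) Enabled() []string {
	catalog := []struct {
		name string
		cfg  *AlertConfig
	}{
		{"deniedPackets", a.DeniedPackets},
		{"tigeraStatus", a.TigeraStatus},
		{"tlsCertExpiry", a.TLSCertExpiry},
		{"licenseExpiry", a.LicenseExpiry},
		{"ipPoolExhaustion", a.IPPoolExhaustion},
	}
	var out []string
	for _, e := range catalog {
		if isEnabled(e.cfg) {
			out = append(out, e.name)
		}
	}
	return out
}

func main() {
	a := Alerts{LicenseExpiry: &AlertConfig{Status: AlertDisabled}}
	fmt.Println(a.Enabled())
}
```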
…account
Give Alertmanager its own service account (calico-alertmanager) instead of reusing the shared prometheus SA: create the SA, run Alertmanager under it, mint its Linseed bearer-token secret for it, and bind the events ClusterRole to it.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…oken RBAC
Revert AlertmanagerConfig-CR delivery to a raw alertmanager.yaml config Secret, folding Monitor.spec.alerts enable/disable into routing (enabled alerts -> the linseed webhook, everything else -> a null receiver). Drops the operator's AlertmanagerConfig watch (and the need for alertmanagerconfigs RBAC).
On managed clusters, also grant the management cluster's guardian service account permission to manage secrets in tigera-prometheus (the tigera-linseed RoleBinding), so Linseed's token controller can provision the calico-alertmanager token there.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…clusters
On a managed cluster the Alertmanager Linseed token is a Linseed-issued JWT that
Linseed's token controller pushes into tigera-prometheus as an Opaque secret. The
operator was also creating a service-account-token secret of the same name, and
since a Secret's type is immutable the two collide ("type Opaque is immutable"),
so the managed-cluster token never provisions. Gate the operator's SA-token secret
on !ManagedCluster (and delete it on managed clusters) so the token controller owns it there.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
On a managed cluster the Alertmanager webhook could not reach Linseed, so UI alerts never made it to the management cluster. Three gaps:
- The webhook URL/SNI hardcoded the in-cluster service (tigera-linseed.tigera-elasticsearch.svc), which does not resolve on a managed cluster. Address Linseed via a namespace-local "tigera-linseed" ExternalName service that redirects to Guardian (SNI "tigera-linseed"), mirroring fluentd.
- The operator deleted the Alertmanager Linseed token Secret on managed clusters, wiping the Linseed-issued JWT that the token controller owns (same name, different immutable type). The operator now neither creates nor deletes that Secret on managed clusters.
- Alertmanager's trusted bundle lacked the management cluster's Linseed CA (the webhook TLS connection terminates at the management Linseed through the tunnel). Add VoltronLinseedPublicCert to the bundle on managed clusters, mirroring fluentd.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
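The first gap comes down to choosing a different host and SNI per cluster type. A minimal sketch under stated assumptions: the `https` scheme with no explicit port is a guess, and `linseedWebhook` is a hypothetical helper, while the hostnames and path are the ones named in the commits.

```go
package main

import "fmt"

// Path of the Linseed ingest endpoint named in the PR.
const alertmanagerEventsPath = "/api/v1/events/alertmanager"

// linseedWebhook is a hypothetical helper returning the webhook URL and the
// TLS server name. The scheme and the absence of a port are assumptions.
func linseedWebhook(managedCluster bool) (url, serverName string) {
	// In-cluster service, reachable on standalone and management clusters.
	host := "tigera-linseed.tigera-elasticsearch.svc"
	if managedCluster {
		// Namespace-local ExternalName service that redirects to Guardian.
		host = "tigera-linseed"
	}
	return "https://" + host + alertmanagerEventsPath, host
}

func main() {
	for _, managed := range []bool{false, true} {
		url, sni := linseedWebhook(managed)
		fmt.Printf("managed=%v url=%s sni=%s\n", managed, url, sni)
	}
}
```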
… clusters
The monitor controller read the existing Alertmanager Linseed token secret on every reconcile to carry its Kubernetes-populated data forward. On a managed cluster the operator does not render that secret (Linseed's token controller owns it as an Opaque JWT), so the read was pointless and, being type-blind, could copy Linseed's JWT into the operator's desired secret. Gate the read on !managedCluster so the controller only reads it where the renderer creates it.
Also make MonitorSpec.UIAlertsEnabled use a pointer receiver to match the generated DeepCopy methods (staticcheck ST1016), and trim verbose doc comments.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Wires Prometheus/Alertmanager alerts through to the manager Alerts page, with a toggle to enable/disable the integration.
- /api/v1/events/alertmanager
- Monitor.spec.uiAlertsIntegration (Enabled|Disabled, default Enabled). When disabled, the rendered Alertmanager config routes to a null receiver. The operator regenerates the config secret when it owns it, so toggling takes effect at runtime.
Companion PRs: calico-private (Linseed ingest + dedup), ui-modules (Alerts page toggle + prometheus_alert rendering).
🤖 Generated with Claude Code