Fix runtime-rule (MAL/LAL) hot-update in no-init mode and k8s cluster node identity by wu-sheng · Pull Request #13909 · apache/skywalking

wu-sheng · 2026-06-13T15:39:57Z

Fix runtime-rule (MAL/LAL hot-update) schema changes in `no-init` mode, and the runtime-rule cluster node-identity collision on Kubernetes

Add a unit test to verify that the fix works.
Explain briefly why the bug exists and how to fix it.

Two bugs in the runtime-rule (DSL hot-update) cluster path, both confirmed end-to-end on a local kind cluster:

1. Runtime-rule schema changes were inoperative in `no-init` mode, the mode every production OAP cluster runs (a one-shot `-Dmode=init` Job creates the static schema; the OAP Deployment runs `-Dmode=no-init`). A runtime `addOrUpdate` introducing a new metric blocked forever in the storage installer's init-node poll loop (`ModelInstaller.whenCreating`), because the loop was gated on `RunningMode` rather than on the operation's intent. The `/delete?mode=revertToBundled` recreate and BanyanDB in-place shape updates were dead for the same reason. Fix: a new `StorageManipulationOpt.Flags.deferDDLToInitNode` bit, set only on the static boot-time `schemaCreateIfAbsent()` opt. The check is DRYed into `ModelInstaller.deferDDLToInitNode(opt)` and reused by the BanyanDB shape-check / group-DDL gates. The runtime-rule opts (`withSchemaChange` / `verifySchemaOnly` / `withoutSchemaChange`) are now driven by their flags and by cluster main-ness: `no-init` and `default` no longer differ for DSL DDL, while `init` stays the dedicated initializer. `DSLManager.tickStorageOpt` is collapsed accordingly.

2. Runtime-rule cross-node writes failed with HTTP 400 `forward_self_loop` on a multi-replica Kubernetes cluster. Every OAP replica shared the cluster `selfNodeId` `0.0.0.0_11800` (derived from the `0.0.0.0` agent gRPC bind host via `TelemetryRelatedContext`), so the main's self-loop guard rejected a legitimate peer-to-peer Forward as if it had looped back. Fix: resolve the runtime-rule node identity from the unique per-pod `SKYWALKING_COLLECTOR_UID` (the pod UID injected by the helm chart / swck operator from `metadata.uid`). Resolution happens in `start()` before any apply, and falls back to the telemetry id off-Kubernetes. `MainRouter` already routes correctly off the cluster peer addresses (pod IPs); only the self-loop identity needed to be unique.
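The two decisions described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `RunningMode`, `ManipulationOpt`, `InstallerGate`, and `NodeIdentity` are simplified, hypothetical stand-ins, and the method names `waitsBeforeFix`, `resolveSelfNodeId`, and `isSelfLoop` are assumptions made for the example. Only the `deferDDLToInitNode` flag, the opt names, and the `SKYWALKING_COLLECTOR_UID` variable come from the description.

```java
import java.util.Map;

// Hypothetical, simplified stand-ins; not the actual SkyWalking classes.
enum RunningMode { DEFAULT, INIT, NO_INIT }

final class ManipulationOpt {
    final boolean deferDDLToInitNode;

    private ManipulationOpt(boolean deferDDLToInitNode) {
        this.deferDDLToInitNode = deferDDLToInitNode;
    }

    // Static boot-time opt: the only one that carries the defer bit.
    static ManipulationOpt schemaCreateIfAbsent() { return new ManipulationOpt(true); }

    // Runtime-rule opt: never defers to the init node.
    static ManipulationOpt withSchemaChange() { return new ManipulationOpt(false); }
}

final class InstallerGate {
    // Old behavior as described: gated on the running mode alone,
    // so every no-init operation waited for the init node.
    static boolean waitsBeforeFix(RunningMode mode) {
        return mode == RunningMode.NO_INIT;
    }

    // New behavior as described: wait only when the opt asks for it.
    static boolean deferDDLToInitNode(RunningMode mode, ManipulationOpt opt) {
        return mode == RunningMode.NO_INIT && opt.deferDDLToInitNode;
    }
}

final class NodeIdentity {
    // Prefer the per-pod UID; fall back to the telemetry id off-Kubernetes.
    static String resolveSelfNodeId(Map<String, String> env, String telemetryId) {
        String uid = env.get("SKYWALKING_COLLECTOR_UID");
        if (uid != null && !uid.trim().isEmpty()) {
            return uid.trim();
        }
        return telemetryId;
    }

    // Self-loop guard: a Forward whose origin equals our own id is rejected.
    static boolean isSelfLoop(String selfNodeId, String originNodeId) {
        return selfNodeId.equals(originNodeId);
    }
}
```

Under these assumptions, a runtime `withSchemaChange` opt on a `no-init` node no longer waits for the init node, and two replicas with distinct pod UIDs no longer trip the self-loop guard, while a node without the environment variable keeps the telemetry id.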

Tests: a new `ModelInstallerNoInitTest` (UT) covers the no-init create chokepoint. The runtime-rule cluster e2e is converted from docker-compose (default mode, which never exercised either bug) to a kind + skywalking-helm `no-init` cluster (`oap.replicas=2`). It drives the apply / STRUCTURAL / inactivate / delete lifecycle, cross-node convergence, and the cross-node Forward path.

If this pull request closes/resolves/fixes an existing issue, replace the issue number. Closes #.
Update the CHANGES log.

…ode identity

Runtime-rule schema changes were inoperative in no-init mode (the mode every production OAP cluster runs), and runtime-rule cross-node writes failed on multi-replica Kubernetes clusters. Both are fixed here.

* no-init schema change: the storage installer's init-node poll loop (ModelInstaller.whenCreating) was gated on RunningMode, so a runtime withSchemaChange create / update / revert blocked forever on a no-init OAP. Gate it instead on a new StorageManipulationOpt.Flags.deferDDLToInitNode bit, set only on the static-boot schemaCreateIfAbsent opt and DRYed into ModelInstaller.deferDDLToInitNode(opt) (reused by the BanyanDB shape-check and group-DDL gates). The runtime-rule opts (withSchemaChange / verifySchemaOnly / withoutSchemaChange) are now driven by their flags and by cluster main-ness: no-init and default no longer differ for DSL DDL; init stays the dedicated initializer. DSLManager.tickStorageOpt is collapsed accordingly.
* k8s node identity: resolve the runtime-rule selfNodeId from the unique per-pod SKYWALKING_COLLECTOR_UID (pod UID, injected from metadata.uid) instead of the colliding telemetry id (0.0.0.0_11800 under a 0.0.0.0 gRPC bind host), in start() before any apply. This fixes HTTP 400 forward_self_loop on the cross-node Forward path; MainRouter already routes correctly off pod IPs.
* tests: add ModelInstallerNoInitTest (UT); convert the runtime-rule/cluster e2e from docker-compose (default mode, which exercised neither bug) to a kind + skywalking-helm no-init cluster (oap.replicas=2) covering the apply / STRUCTURAL / inactivate / delete lifecycle, cross-node convergence, and the Forward path.

wu-sheng added this to the 11.0.0 milestone Jun 13, 2026

wu-sheng added the bug label Jun 13, 2026

hanahmily approved these changes Jun 13, 2026

wu-sheng merged commit 53baf8e into master Jun 14, 2026
434 of 440 checks passed

wu-sheng deleted the fix/runtime-rule-no-init-schema-change branch June 14, 2026 00:04

wu-sheng merged 1 commit into master from fix/runtime-rule-no-init-schema-change