Fix runtime-rule (MAL/LAL) hot-update in no-init mode and k8s cluster node identity #13909

Merged
wu-sheng merged 1 commit into master from fix/runtime-rule-no-init-schema-change on Jun 14, 2026
@wu-sheng

Fix runtime-rule (MAL/LAL hot-update) schema changes in no-init mode, and the runtime-rule cluster node-identity collision on Kubernetes

  • Add a unit test to verify that the fix works.
  • Explain briefly why the bug exists and how to fix it.

Two bugs in the runtime-rule (DSL hot-update) cluster path, both confirmed end-to-end on a local kind cluster:

1. Runtime-rule schema changes were inoperative in no-init mode, the mode every production OAP cluster runs (a one-shot -Dmode=init Job creates the static schema; the OAP Deployment runs -Dmode=no-init). A runtime addOrUpdate that introduced a new metric blocked forever in the storage installer's init-node poll loop (ModelInstaller.whenCreating), because the loop was gated on RunningMode rather than on the operation's intent. The recreate triggered by /delete?mode=revertToBundled and BanyanDB in-place shape updates blocked the same way. Fix: add a StorageManipulationOpt.Flags.deferDDLToInitNode bit, set only on the static boot-time schemaCreateIfAbsent() opt. The check is factored into ModelInstaller.deferDDLToInitNode(opt) and reused by the BanyanDB shape-check and group-DDL gates. The runtime-rule opts (withSchemaChange / verifySchemaOnly / withoutSchemaChange) are now driven by their flags and by whether the node is the cluster main, so no-init and default no longer differ for DSL DDL; init stays the dedicated initializer. DSLManager.tickStorageOpt is collapsed accordingly.

2. Runtime-rule cross-node writes failed with HTTP 400 forward_self_loop on a multi-replica Kubernetes cluster. Every OAP replica shared the cluster selfNodeId 0.0.0.0_11800 (derived from the 0.0.0.0 agent gRPC bind host via TelemetryRelatedContext), so the main's self-loop guard rejected a legitimate peer-to-peer Forward as if it had looped back. Fix: resolve the runtime-rule node identity from the unique per-pod SKYWALKING_COLLECTOR_UID (the pod UID injected by the helm chart / swck operator from metadata.uid) in start(), before any apply; off Kubernetes, it falls back to the telemetry id. MainRouter already routes correctly off the cluster peer addresses (pod IPs); only the self-loop identity needed to be unique.
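
To illustrate item 2, here is a minimal, self-contained sketch of the identity resolution and the self-loop guard. It is not the code in this PR: the class `NodeIdentitySketch` and the methods `resolveSelfNodeId` and `isSelfLoop` are hypothetical names, and the environment is passed in as a map so the example can run anywhere. Only the `SKYWALKING_COLLECTOR_UID` variable and the `0.0.0.0_11800` telemetry id come from the description above.

```java
import java.util.Collections;
import java.util.Map;

/** Illustrative only; names other than SKYWALKING_COLLECTOR_UID are hypothetical. */
public class NodeIdentitySketch {
    static final String UID_ENV = "SKYWALKING_COLLECTOR_UID";

    /**
     * Prefer the per-pod UID when it is present and non-blank;
     * otherwise fall back to the telemetry id (the off-Kubernetes case).
     */
    static String resolveSelfNodeId(Map<String, String> env, String telemetryId) {
        String podUid = env.get(UID_ENV);
        if (podUid != null && !podUid.trim().isEmpty()) {
            return podUid.trim();
        }
        return telemetryId;
    }

    /** A Forward is treated as a self-loop when the sender id equals our own id. */
    static boolean isSelfLoop(String selfNodeId, String senderNodeId) {
        return selfNodeId.equals(senderNodeId);
    }

    public static void main(String[] args) {
        String telemetryId = "0.0.0.0_11800"; // identical on every replica

        // Before the fix: both replicas identify themselves by the telemetry id.
        String oldMain = resolveSelfNodeId(Collections.<String, String>emptyMap(), telemetryId);
        String oldPeer = resolveSelfNodeId(Collections.<String, String>emptyMap(), telemetryId);
        System.out.println("shared id, peer Forward flagged as loop: " + isSelfLoop(oldMain, oldPeer));

        // After the fix: each pod carries its own UID (the values here are made up).
        String newMain = resolveSelfNodeId(Collections.singletonMap(UID_ENV, "uid-a"), telemetryId);
        String newPeer = resolveSelfNodeId(Collections.singletonMap(UID_ENV, "uid-b"), telemetryId);
        System.out.println("unique ids, peer Forward flagged as loop: " + isSelfLoop(newMain, newPeer));
    }
}
```

With a shared id the guard cannot tell a peer from itself, which is the collision behind forward_self_loop. With per-pod UIDs the same guard lets the peer Forward through.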

Tests: new ModelInstallerNoInitTest (UT) for the no-init create chokepoint; the runtime-rule cluster e2e is converted from docker-compose (default mode — which never exercised either bug) to a kind + skywalking-helm no-init cluster (oap.replicas=2) driving the apply / STRUCTURAL / inactivate / delete lifecycle, cross-node convergence, and the cross-node Forward path.
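
The no-init create chokepoint that the new unit test targets can be pictured with the sketch below. It is a simplified model under stated assumptions, not the real `ModelInstaller` or `StorageManipulationOpt`: the `Opt` class, the `RunningMode` enum, and the two gate methods are hypothetical stand-ins. It also assumes the new gate still applies only in no-init mode, which the description implies but does not spell out.

```java
/** Illustrative model of the gate change; all types here are hypothetical stand-ins. */
public class DeferDdlGateSketch {
    enum RunningMode { DEFAULT, INIT, NO_INIT }

    /** Stand-in for a storage manipulation opt carrying the new flag. */
    static final class Opt {
        final String name;
        final boolean deferDDLToInitNode;

        Opt(String name, boolean deferDDLToInitNode) {
            this.name = name;
            this.deferDDLToInitNode = deferDDLToInitNode;
        }
    }

    // Only the static boot-time opt defers DDL to the init node.
    static final Opt SCHEMA_CREATE_IF_ABSENT = new Opt("schemaCreateIfAbsent", true);
    // Runtime-rule opts do not set the flag.
    static final Opt WITH_SCHEMA_CHANGE = new Opt("withSchemaChange", false);

    /** Old behaviour: every create in no-init mode waits for the init node. */
    static boolean oldGateWaits(RunningMode mode, Opt opt) {
        return mode == RunningMode.NO_INIT;
    }

    /** New behaviour (assumed shape): wait only when the opt asks for it. */
    static boolean newGateWaits(RunningMode mode, Opt opt) {
        return mode == RunningMode.NO_INIT && opt.deferDDLToInitNode;
    }

    public static void main(String[] args) {
        RunningMode mode = RunningMode.NO_INIT;
        for (Opt opt : new Opt[] {SCHEMA_CREATE_IF_ABSENT, WITH_SCHEMA_CHANGE}) {
            System.out.println(opt.name
                + " old gate waits=" + oldGateWaits(mode, opt)
                + " new gate waits=" + newGateWaits(mode, opt));
        }
    }
}
```

Under the old gate a runtime `withSchemaChange` create waits for an init node that has already exited, so it never returns. Under the flag-based gate only the boot-time opt waits.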

  • If this pull request closes/resolves/fixes an existing issue, replace the issue number. Closes #.
  • Update the CHANGES log.

…ode identity

Runtime-rule schema changes were inoperative in no-init mode (the mode every
production OAP cluster runs), and runtime-rule cross-node writes failed on
multi-replica Kubernetes clusters. Both are fixed here.

* no-init schema change: the storage installer's init-node poll loop
  (ModelInstaller.whenCreating) was gated on RunningMode, so a runtime
  withSchemaChange create / update / revert blocked forever on a no-init OAP.
  Gate it instead on a new StorageManipulationOpt.Flags.deferDDLToInitNode bit,
  set only on the static-boot schemaCreateIfAbsent opt and DRYed into
  ModelInstaller.deferDDLToInitNode(opt) (reused by the BanyanDB shape-check and
  group-DDL gates). The runtime-rule opts (withSchemaChange / verifySchemaOnly /
  withoutSchemaChange) are now driven by their flags and by cluster main-ness:
  no-init and default no longer differ for DSL DDL; init stays the dedicated
  initializer. DSLManager.tickStorageOpt is collapsed accordingly.

* k8s node identity: resolve the runtime-rule selfNodeId from the unique per-pod
  SKYWALKING_COLLECTOR_UID (pod UID, injected from metadata.uid) instead of the
  colliding telemetry id (0.0.0.0_11800 under a 0.0.0.0 gRPC bind host), in
  start() before any apply. This fixes HTTP 400 forward_self_loop on the
  cross-node Forward path; MainRouter already routes correctly off pod IPs.

* tests: add ModelInstallerNoInitTest (UT); convert the runtime-rule/cluster e2e
  from docker-compose (default mode, which exercised neither bug) to a kind +
  skywalking-helm no-init cluster (oap.replicas=2) covering the apply / STRUCTURAL
  / inactivate / delete lifecycle, cross-node convergence, and the Forward path.
@wu-sheng added this to the 11.0.0 milestone on Jun 13, 2026
@wu-sheng added the bug label (Something isn't working and you are sure it's a bug!) on Jun 13, 2026
@wu-sheng merged commit 53baf8e into master on Jun 14, 2026
434 of 440 checks passed
@wu-sheng deleted the fix/runtime-rule-no-init-schema-change branch on June 14, 2026 00:04