
SWIP-15: BanyanDB instance-relation deployment topology + category-separated self-observability rules #13905

Merged
wu-sheng merged 7 commits into master from feat/mal-service-instance-relation
Jun 12, 2026

@wu-sheng wu-sheng commented Jun 12, 2026

Member

SWIP-15: BanyanDB self-observability — pod-to-pod deployment topology + category-separated rules

  • If this is a non-trivial feature, paste the links/URLs to the design doc. — SWIP-15 (docs/en/swip/SWIP-15.md).
  • Update the documentation to include this new feature. — SWIP-15, operator doc, changelog (below).
  • Tests (including UT, IT, E2E) are added to verify the new feature. — MAL execution test (1350/0), boot-check (1/0), banyandb e2e (28/28, live cluster).
  • If it's UI related, attach the screenshots below. — UI consumes these via a paired skywalking-horizon-ui PR; no UI change in this repo.

What this includes

1. SERVICE_INSTANCE_RELATION MAL scope + deployment topology

  • New SERVICE_INSTANCE_RELATION scope + serviceInstanceRelation(...) builder; the meter Analyzer bridges to the ServiceInstanceRelation server/client-side topology metrics, so getServiceInstanceTopology renders the edges.
  • New banyandb-instance-relation.yaml: the pod-to-pod flow graph (the Horizon UI "deployment" component) with per-edge, per-operation publish_* / queue_sub_* / migration_* metrics (throughput / p99 / error / bytes).
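
To illustrate the shape of the data behind the deployment graph, here is a minimal Python sketch of how per-operation samples could be grouped into pod-to-pod edges. It is not the MAL builder or the Analyzer bridge: the `aggregate_edges` function, the sample field names, and the pod names are all hypothetical, chosen only to show that each edge is keyed by a source instance, a destination instance, and an operation.

```python
from collections import defaultdict

def aggregate_edges(samples):
    """Sum sample values per (source instance, destination instance, operation)."""
    totals = defaultdict(float)
    for s in samples:
        totals[(s["source"], s["dest"], s["operation"])] += s["value"]
    return dict(totals)

# Hypothetical samples: two publish observations on one liaison->data edge,
# plus one migration observation on a lifecycle->data edge.
samples = [
    {"source": "liaison-0", "dest": "data-hot-0", "operation": "publish", "value": 3.0},
    {"source": "liaison-0", "dest": "data-hot-0", "operation": "publish", "value": 2.0},
    {"source": "lifecycle-0", "dest": "data-warm-0", "operation": "migration", "value": 1.0},
]
edges = aggregate_edges(samples)
```

The two publish samples fold into a single edge entry, while the migration sample stays a separate edge, which is the per-edge, per-operation granularity described above.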

2. Category-separated BanyanDB self-observability rules

  • banyandb-instance.yaml → role-separated (node_* shared, liaison_*, data_*, lifecycle_*).
  • banyandb-endpoint.yaml → data-type-separated (measure_* / stream_* / stream_tst_* / trace_* / property_*; operation-keyed queue_* stay type-agnostic). Adds the previously-unmodeled property type (so sw_property groups stop rendering all-empty) and the trace storage inverted-index series.
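
As a rough sketch of the naming convention, the Python snippet below maps a rule name to its category by prefix. The prefixes come from the lists above; the `category` helper and the example rule names are hypothetical and are not part of the rule files.

```python
# Role prefixes used by the instance rules.
ROLE_PREFIXES = ("node_", "liaison_", "data_", "lifecycle_")
# Data-type prefixes used by the endpoint rules. "stream_tst_" is listed
# before "stream_" so the longer prefix wins.
TYPE_PREFIXES = ("measure_", "stream_tst_", "stream_", "trace_", "property_")

def category(rule_name, prefixes):
    """Return the matching category (prefix without the trailing underscore), or None."""
    for prefix in prefixes:
        if rule_name.startswith(prefix):
            return prefix.rstrip("_")
    return None

# Hypothetical rule names, for illustration only.
role = category("liaison_example_rate", ROLE_PREFIXES)
tst = category("stream_tst_example_total", TYPE_PREFIXES)
agnostic = category("queue_example_total", TYPE_PREFIXES)
```

A `queue_*` name matches no data-type prefix, which mirrors how the operation-keyed queue metrics stay type-agnostic.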

3. Fix: percent metrics rendered as 100%

  • system_memory_percent / disk_usage_percent / disk_used_percent_by_path emitted a 0–1 fraction, which the integer meter store collapsed to 0 or 1, so a partly used disk rendered as 100%. They now use BanyanDB's used_percent × 100. Verified live on a kind cluster: a ~51% disk now reads 51%.
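
The arithmetic behind the fix can be sketched in a few lines of Python. This assumes the integer store rounds to the nearest whole number; the `store_as_integer` helper is a hypothetical stand-in for that behaviour, not the OAP meter store itself.

```python
def store_as_integer(value):
    # Assumption: the integer meter store keeps only a rounded whole number.
    return int(round(value))

fraction = 0.51  # a disk that is about 51% used, reported as a 0-1 fraction

collapsed_high = store_as_integer(fraction)    # the fraction collapses to 1
collapsed_low = store_as_integer(0.49)         # a slightly lower one collapses to 0
scaled = store_as_integer(fraction * 100)      # scaling first keeps the value
```

With the raw fraction only 0 or 1 survives, whereas scaling to a 0-100 percentage before storing preserves the 51.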

4. Docs

  • SWIP-15 synced to the implemented metrics (then treated as the stable design doc); operator catalog docs/en/banyandb/dashboards-banyandb.md refreshed; the SWIP-15 changelog consolidated into one concise entry.
  • guides/How-to-release.md: new "Publish the GitHub release" section documenting the changelog-wrapping rule (GitHub renders a bullet's prose continuation lines as <br>, so keep prose on one line; sub-bullets are fine — verified on v10.4.0). Related conventions added to CLAUDE.md.
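
The one-line-per-bullet rule could be applied mechanically with something like the Python sketch below. It is a hypothetical helper, not a script from this repository, and it assumes that any indented non-bullet line directly following a bullet is a prose continuation of that bullet.

```python
import re

# A list item starts with optional indentation, a bullet marker, and a space.
BULLET = re.compile(r"^\s*[*+-]\s+")

def unwrap_bullets(text):
    """Join hard-wrapped prose continuation lines onto their bullet; keep sub-bullets."""
    out = []
    in_item = False
    for line in text.splitlines():
        if BULLET.match(line):
            # A top-level bullet or a nested sub-bullet starts its own line.
            out.append(line.rstrip())
            in_item = True
        elif line.strip() and in_item:
            # Prose continuation: fold it into the current bullet.
            out[-1] += " " + line.strip()
        else:
            # Blank lines and text outside a list are passed through unchanged.
            out.append(line)
            in_item = False
    return "\n".join(out)

wrapped = "* Add X for\n  rebalancing.\n  * sub-bullet stays\n* Next"
unwrapped = unwrap_bullets(wrapped)
```

Here the wrapped first bullet becomes a single line, and the nested sub-bullet is left as its own list item.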

Validation

MAL test 1350/0 · boot-check 1/0 · checkstyle / license-eye / FQCN clean · banyandb e2e 28/28 (live cluster) · disk-percent fix verified live (51%).

  • If this pull request closes/resolves/fixes an existing issue, replace the issue number. Closes #.
  • Update the CHANGES log.

🤖 Generated with Claude Code

…pology and category-separated so11y rules

Add a SERVICE_INSTANCE_RELATION scope to the MAL engine and a
serviceInstanceRelation(...) builder so MAL rules can emit intra-cluster
(same-service) instance topology. The meter Analyzer bridges these to the
ServiceInstanceRelation server/client-side topology metrics, so
getServiceInstanceTopology renders the edges.

SWIP-15 uses this for the BanyanDB deployment view (new
banyandb-instance-relation.yaml): the pod-to-pod flow graph with per-edge,
per-operation metrics -- write distribution (liaison<->data via publish_* /
queue_sub_*) and tier migration (lifecycle->data via migration_*), each
carrying throughput / p99 latency / error rate / bytes rate.

Category-separate the BanyanDB self-observability rules so a metric reads only
the families that genuinely exist for its category (instead of one unified rule
that left empty panels): instance rules carry a role prefix on the rule name
(node_* shared resource/runtime, liaison_* front-door gRPC/publish, data_*
storage/index/subscribe-queue/retention, lifecycle_* migration health);
endpoint rules carry a data-type prefix (measure_* / stream_* / stream_tst_* /
trace_* / property_*, with operation-keyed queue_* / publish_bytes staying
type-agnostic). This adds the previously-unmodeled property data type (so
sw_property groups stop rendering all-empty) and the trace storage
inverted-index series/term-search/total-series (previously silently dropped).
Scope and entity keys are unchanged.

Validation: MAL execution test 1350/0, boot-check 1/0, banyandb e2e 28/28 live.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@wu-sheng wu-sheng added the so11y (Self Observability) and enhancement (Enhancement on performance or codes) labels Jun 12, 2026
@wu-sheng wu-sheng added this to the 11.0.0 milestone Jun 12, 2026
wu-sheng and others added 6 commits June 12, 2026 14:23
…metrics

Update SWIP-15 to what was actually built (then it is the stable design doc):
the SERVICE_INSTANCE_RELATION scope is now implemented (no longer future work /
out of config-only scope), and the instance/endpoint catalogs are
category-separated (node_*/liaison_*/data_*/lifecycle_* and
measure_*/stream_*/stream_tst_*/trace_*/property_*). Resolve the stale "should
pin the closure with a compile test" note (it ships + is boot-compile-tested),
and reference sections via Markdown anchor links instead of the section sign.

Refresh docs/en/banyandb/dashboards-banyandb.md (the living operator catalog)
with the per-role / per-type / property metric names and a new deployment
(instance-relation) topology section.

Add a CLAUDE.md note: a SWIP is the stable design doc synced once at
implementation and then frozen (further metrics go to the operator doc); use
Markdown anchor links in docs, not the section sign.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
system_memory_percent / disk_usage_percent / disk_used_percent_by_path produced
a raw 0-1 fraction (memory used_percent gauge, or disk used/total), which the
integer meter-value store collapses toward 0/1 — so a ~51% disk rendered as
100%. Scale to a 0-100 percentage (* 100), the convention the other otel-rules
already follow.

For disk, use BanyanDB's own kind='used_percent' (gopsutil used/(used+free))
instead of recomputing used/total (which ignores reserved blocks and
under-reports); the node's data paths share one filesystem and report the same
value, so avg() collapses them without the per-path sum inflation. retention
*_disk_usage_percent already emits 0-100 and is unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
GitHub renders release notes as GFM, where a single newline inside a list item
becomes a <br> -- so a hard-wrapped changelog bullet shows jagged mid-sentence
breaks on the release page (the docs website reflows and hides this). Add a
"Publish the GitHub release" section to How-to-release.md explaining the gh
release step and the one-line-per-bullet rule, note it in CLAUDE.md, and
un-wrap this branch's SWIP-15 changelog entry to a single line.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…rified on v10.4.0

Confirmed against the v10.4.0 release page's rendered HTML: a prose continuation
line becomes a <br> (the DrainBalancer entry renders '...rebalancing
(DrainBalancer).<br>Designed to replace DataCarrier...'), while nested
sub-bullets render as a clean nested list. Refine the guidance accordingly --
only a bullet's prose must stay on one line; sub-bullets are fine.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Lead the instance name with the role: instance() keys become
['container_name','pod_name'] (e.g. data@demo-banyandb-data-hot-0). The
instance-relation endpoint keys are flipped in lockstep (local + remote) so the
deployment topology still resolves to the same instances. Fixtures, e2e cases /
expected topology, SWIP-15, and the operator doc updated to match.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
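
A small Python sketch of the naming rule in the last commit, for illustration only: the `instance_name` helper is hypothetical, and the `@` separator is inferred from the example name in the commit message rather than taken from the rule files.

```python
def instance_name(labels, keys=("container_name", "pod_name"), sep="@"):
    """Build an instance name by joining the label values in key order."""
    return sep.join(labels[k] for k in keys)

labels = {"container_name": "data", "pod_name": "demo-banyandb-data-hot-0"}

# Role-first key order, as in the commit message example.
name = instance_name(labels)

# If the relation endpoints kept the opposite key order, the edge endpoint
# would no longer match the instance name, which is why both are flipped together.
mismatched = instance_name(labels, keys=("pod_name", "container_name"))
```

Using one key order for `instance()` and for both ends of each relation keeps the topology edges pointing at the same instance names.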
@wu-sheng wu-sheng merged commit 4185f50 into master Jun 12, 2026
436 of 439 checks passed
@wu-sheng wu-sheng deleted the feat/mal-service-instance-relation branch June 12, 2026 12:19
