feat(mariadb): topology merge + addon-api conformance (alpha.86 → alpha.90)#2633
feat(mariadb): topology merge + addon-api conformance (alpha.86 → alpha.90)#2633weicao wants to merge 150 commits into
Conversation
weicao
left a comment
There was a problem hiding this comment.
Review: PR #2633 — feat(mariadb): topology merge + addon-api conformance (alpha.86 -> alpha.90)
Reviewer: @JADE (PR Reviewer)
Stats: +24,151 / -34 across 54 files. CI RED (28 ShellSpec failures).
Summary
Mega-rollup of MariaDB addon evolution spanning ~25 alpha versions. Delivers:
- Topology merge: async + semi-sync consolidated into single
replicationCmpD with mode selector - Galera topology (new CmpD + scripts)
- Version expansion: 1 -> 6 declared versions
- Addon-API conformance (PD, PCR, ClusterDefinition, BPT)
- Backup modernization (datasafed)
- Security hardening (kb_internal_root, SUPER stripping)
Blocking Issues
B1. CI ShellSpec tests RED — 28 failures out of 1197.
Multiple categories: test-vs-code desync (parametersSchema removed but tests still expect it), hardcoded version literals not updated, new env-wire tests failing, template content tests stale. Tests must pass before merge.
B2. Image tag mismatch for 11.4.10.
cmpv.yaml maps 11.4.10 release to floating tag mariadb:11.4 instead of pinned mariadb:11.4.10. This means the release silently drifts to whatever Docker Hub resolves. Either pin to 11.4.10 or document the floating-tag decision.
B3. Chart.yaml contains ~2,500 lines of development journal.
Every alpha version's root cause, fix rationale, and design discussion references are inlined. This ships in the chart artifact — every helm show chart dumps pages of internal notes. Move to a separate changelog or commit messages.
Non-blocking
-
Scope mixing — this is at least 4 PRs in one (topology merge, Galera, version expansion, addon-API conformance, backup modernization, security hardening, standalone overhaul, README rewrite). 24K lines across 54 files makes thorough review impractical. Strongly recommend splitting.
-
Legacy CmpDs rendered unconditionally —
cmpd-replication.yaml(1K lines) andcmpd-semisync.yaml(2.4K lines) retained for upgrade compat but no.Valuesgate to disable them. -
Excessive inline comments — templates contain hundreds of lines of rationale referencing specific people, timestamps, message IDs. Not meaningful to future maintainers.
-
values.yamldefaultreplication.mode: ""is ambiguous — empty string means async behavior but isn't declared. New users get no signal. -
BPT uses broad CmpD regex (
^mariadb-) matching all topologies. If Galera has different backup requirements, this is a problem.
Questions
- Why is Galera included in a PR titled "topology merge"? It's entirely new, not a merge.
- For the merged CmpD, does mode "async" still provision semi-sync grants? Wasteful or harmful?
- Was the PR rebased after live validation? CI is red against actual branch head.
Verdict: REQUEST_CHANGES
Mandatory:
- Fix all 28 ShellSpec test failures
- Pin
mariadb:11.4->mariadb:11.4.10or document floating-tag intent - Remove 2,500-line Chart.yaml development journal
Strongly recommended:
4. Split into smaller PRs. At minimum: (a) standalone + addon-API, (b) topology merge, (c) Galera, (d) version expansion. The current PR is unreviewable as a single unit.
Add MariaDB 11.4 standalone, replication, semisync, and Galera chart resources. Harden semisync startup, role publication, switchover fencing, and script distribution. Add shell specs for replication member join, role probe, switchover, and standalone template mapping.
…ut preservation (#2757)
Preserve the syncer-promoted rc=2 path during semisync replica rejoin. When the local pod is promoted while rejoin is waiting, publish the primary SQL listener path instead of continuing replica fail-closed cleanup. Bump the MariaDB chart to 1.2.0-alpha.8.
… tech-debt tracker
b8f6d56 to
11f7464
Compare
Require pending secondary roleProbe publication to prove healthy replica IO/SQL threads before reporting secondary. Includes r28 regression coverage and alpha.10 chart bump.
Co-authored-by: weicao <weicao@users.noreply.github.com>
Recover existing-slave-config startup when the persisted slave metadata exists but the runtime slave channel has disappeared. Reuse the existing primary-service replication configure path and keep normal unhealthy-channel retries fail-closed.
After a rolling restart (e.g. static reconfigure), the IO thread auto-reconnects but semisync negotiation can take >90s to self-heal. Add recover_semisync_slave_health_after_rejoin() that polls for 10s, then forces IO thread restart if Rpl_semi_sync_slave_status is still OFF, with a 15s recovery window. Called from both success paths of finalize_replication_rejoin_ready_gate(). No-op for async mode.
Co-authored-by: weicao <weicao@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
mariadb-backup --slave-info internally runs SHOW ALL SLAVES STATUS which requires SLAVE MONITOR privilege. During initial pod startup there is a brief window after mariadbd restarts on 0.0.0.0 where privilege tables may not yet be fully loaded, causing the backup to fail with access denied. Add a bounded 30s retry that verifies SHOW ALL SLAVES STATUS succeeds before invoking mariadb-backup, closing the race window.
During switchover, the runtime-primary reconcile loop can race with the switchover action's fence_current_primary_local_writes_after_dcs: the reconcile sets read_only=OFF (thinking it's still primary in the same second the switchover action sets read_only=ON. The single-shot verification fails because read_only was toggled back. Replace the single-shot check with a 10-attempt bounded retry (1s each). The reconcile loop discovers the new role from syncer within ~3 seconds and stops fighting, so read_only=ON stabilizes well within the budget. EOF )
After Restart/Stop-Start the candidate pod may have a brief window where 3306 is not yet listening. Syncer does a single-shot TCP read-check before creating the DCS switchover; if that single check hits the window, the entire switchover fails with "syncerctl could not create DCS switchover". This bounded pre-DCS gate absorbs the transient window (default 12s budget, 1s poll, 1s connect timeout).
Summary
MariaDB addon evolution: alpha.37 semisync fencing baseline through 1.2.0-alpha.9. Topology merge, addon-api conformance, galera stabilization, replication hardening, syncer-side bug fixes, and full-topology acceptance.
Key changes
replicationtopology with merged CmpD covers async + semisync; legacy CmpDs retained for upgrade compat.replicationMode={async,semisync}user switch; CUE validates, addon mapper translates torpl_semi_sync_*variables.kb_internal_rootprovisioning, declarative account Phase A Path C.Validation
Follow-up
Child PRs (merged into dev branch)