feat(mariadb): topology merge + addon-api conformance (alpha.86 → alpha.90)#2633

Open
weicao wants to merge 150 commits into main from feat/mariadb-alpha37-semisync-fencing-pr

Conversation

@weicao
Contributor

@weicao weicao commented May 9, 2026

Summary

MariaDB addon evolution: alpha.37 semisync fencing baseline through 1.2.0-alpha.9. Topology merge, addon-api conformance, galera stabilization, replication hardening, syncer-side bug fixes, and full-topology acceptance.

Key changes

  • Topology merge (alpha.89): single replication topology with merged CmpD covers async + semisync; legacy CmpDs retained for upgrade compat.
  • Synthetic-parameter mapper (alpha.89): replicationMode={async,semisync} user switch; CUE validates, addon mapper translates to rpl_semi_sync_* variables.
  • Account model (alpha.108-114): BINLOG ADMIN restore, kb_internal_root provisioning, declarative account Phase A Path C.
  • Galera stabilization (alpha.115-122): single-owner bootstrap, wsrep-recover crash recovery, self-healing watcher, peer check, reconfigure auth fixes.
  • Replication hardening (alpha.124-129): host-scoped BINLOG ADMIN fence, semisync mode guard, non-mutating switchover probe, sql_log_bin suppression.
  • Syncer fixes (PR #170/#192): WriteCheck binlog guard outside transactions (Error 1694), async mode guard (no semisync in async), fallback lag check, EnableSemiSyncSource runtime timeout fix.
  • addon-api/12b conformance: README capability matrix, explicit unsupported declarations, BackupPolicyTemplate target block.
  • CmpV pin (alpha.9): 11.4.10 image tag pinned (was floating :11.4), Chart.yaml development journal stripped.
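
A minimal sketch of what the synthetic-parameter mapping could look like, for readers who have not seen the addon mapper. The function name `map_replication_mode` and the `key=value` output format are assumptions made for illustration; only the `replicationMode={async,semisync}` switch and the `rpl_semi_sync_*` variable family come from this description.

```sh
#!/bin/sh
# Hypothetical mapper: translate the user-facing replicationMode switch into
# MariaDB semi-sync server variables. Name and output format are illustrative
# assumptions, not the addon's actual implementation.
map_replication_mode() {
  mode="$1"
  case "$mode" in
    async)
      printf '%s\n' \
        'rpl_semi_sync_master_enabled=OFF' \
        'rpl_semi_sync_slave_enabled=OFF'
      ;;
    semisync)
      printf '%s\n' \
        'rpl_semi_sync_master_enabled=ON' \
        'rpl_semi_sync_slave_enabled=ON'
      ;;
    *)
      # Anything else is rejected, mirroring the validation step.
      echo "unsupported replicationMode: $mode" >&2
      return 1
      ;;
  esac
}
```

Rejecting unknown values in the mapper as well as in CUE keeps a typo from silently falling back to async.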

Validation

Follow-up

Child PRs (merged into dev branch)

@weicao weicao requested review from a team and leon-ape as code owners May 9, 2026 07:06
@codecov-commenter

codecov-commenter commented May 9, 2026

Codecov Report

❌ Patch coverage is 0% with 3601 lines in your changes missing coverage. Please review.
✅ Project coverage is 0.00%. Comparing base (d784c97) to head (8a9d3cf).
⚠️ Report is 1 commits behind head on main.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...iadb/scripts-ut-spec/replication_roleprobe_spec.sh | 0.00% | 654 Missing ⚠️ |
| ...db/scripts-ut-spec/replication_member_join_spec.sh | 0.00% | 620 Missing ⚠️ |
| ...pts-ut-spec/semisync_rejoin_fence_template_spec.sh | 0.00% | 389 Missing ⚠️ |
| ...ipts-ut-spec/reconfigure_persisted_alpha86_spec.sh | 0.00% | 346 Missing ⚠️ |
| ...db/scripts-ut-spec/replication_mode_mapper_spec.sh | 0.00% | 269 Missing ⚠️ |
| ...plication_merged_semisync_startup_recovery_spec.sh | 0.00% | 224 Missing ⚠️ |
| ...ts-ut-spec/seed_replication_mode_overrides_spec.sh | 0.00% | 213 Missing ⚠️ |
| ...plication_merged_replication_mode_env_wire_spec.sh | 0.00% | 166 Missing ⚠️ |
| ...ripts-ut-spec/replication_user_convergence_spec.sh | 0.00% | 142 Missing ⚠️ |
| ...replication_merged_pd_regex_disambiguation_spec.sh | 0.00% | 123 Missing ⚠️ |

... and 8 more
Additional details and impacted files
@@           Coverage Diff           @@
##            main   #2633     +/-   ##
=======================================
  Coverage   0.00%   0.00%             
=======================================
  Files         73      92     +19     
  Lines       9270   14942   +5672     
=======================================
- Misses      9270   14942   +5672     


Contributor Author

@weicao weicao left a comment

Review: PR #2633 — feat(mariadb): topology merge + addon-api conformance (alpha.86 -> alpha.90)

Reviewer: @JADE (PR Reviewer)
Stats: +24,151 / -34 across 54 files. CI RED (28 ShellSpec failures).


Summary

Mega-rollup of MariaDB addon evolution spanning ~25 alpha versions. Delivers:

  1. Topology merge: async + semi-sync consolidated into single replication CmpD with mode selector
  2. Galera topology (new CmpD + scripts)
  3. Version expansion: 1 -> 6 declared versions
  4. Addon-API conformance (PD, PCR, ClusterDefinition, BPT)
  5. Backup modernization (datasafed)
  6. Security hardening (kb_internal_root, SUPER stripping)

Blocking Issues

B1. CI ShellSpec tests RED — 28 failures out of 1197.
Multiple categories: test-vs-code desync (parametersSchema removed but tests still expect it), hardcoded version literals not updated, new env-wire tests failing, template content tests stale. Tests must pass before merge.

B2. Image tag mismatch for 11.4.10.
cmpv.yaml maps 11.4.10 release to floating tag mariadb:11.4 instead of pinned mariadb:11.4.10. This means the release silently drifts to whatever Docker Hub resolves. Either pin to 11.4.10 or document the floating-tag decision.
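
A check along these lines could catch floating tags before they ship. This is a sketch only: the helper name `is_pinned_tag` and the rule "a pinned tag is exactly MAJOR.MINOR.PATCH" are assumptions, not an existing lint in this repository.

```sh
#!/bin/sh
# Hypothetical lint helper: succeed only when the image reference carries a
# full MAJOR.MINOR.PATCH tag. The "three numeric parts" rule is an assumption.
is_pinned_tag() {
  image="$1"
  case "$image" in
    *:*) tag="${image##*:}" ;;
    *) return 1 ;;   # no tag at all resolves to "latest", so not pinned
  esac
  printf '%s\n' "$tag" | grep -Eq '^[0-9]+\.[0-9]+\.[0-9]+$'
}

# Example: is_pinned_tag mariadb:11.4.10 succeeds, is_pinned_tag mariadb:11.4 fails.
```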

B3. Chart.yaml contains ~2,500 lines of development journal.
Every alpha version's root cause, fix rationale, and design discussion references are inlined. This ships in the chart artifact — every helm show chart dumps pages of internal notes. Move to a separate changelog or commit messages.


Non-blocking

  1. Scope mixing — this is at least 4 PRs in one (topology merge, Galera, version expansion, addon-API conformance, backup modernization, security hardening, standalone overhaul, README rewrite). 24K lines across 54 files makes thorough review impractical. Strongly recommend splitting.

  2. Legacy CmpDs rendered unconditionally: cmpd-replication.yaml (1K lines) and cmpd-semisync.yaml (2.4K lines) are retained for upgrade compat, but there is no .Values gate to disable them.

  3. Excessive inline comments — templates contain hundreds of lines of rationale referencing specific people, timestamps, message IDs. Not meaningful to future maintainers.

  4. values.yaml default replication.mode: "" is ambiguous — empty string means async behavior but isn't declared. New users get no signal.

  5. BPT uses broad CmpD regex (^mariadb-) matching all topologies. If Galera has different backup requirements, this is a problem.
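
To make item 5 concrete, the sketch below shows how the broad prefix behaves against different component definition names. The names `mariadb-galera-1.2.0` and `mariadb-replication-1.2.0` and the narrower pattern are hypothetical examples chosen for illustration; only the `^mariadb-` regex comes from the PR.

```sh
#!/bin/sh
# matches_pattern REGEX NAME: succeed when NAME matches the extended regex.
matches_pattern() {
  printf '%s\n' "$2" | grep -Eq -- "$1"
}

# The broad prefix also selects the (hypothetically named) Galera CmpD:
matches_pattern '^mariadb-' 'mariadb-galera-1.2.0' && echo "broad: galera matched"

# A narrower, assumed alternative would leave Galera out:
matches_pattern '^mariadb-(replication|standalone)-' 'mariadb-galera-1.2.0' \
  || echo "narrow: galera not matched"
```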


Questions

  1. Why is Galera included in a PR titled "topology merge"? It's entirely new, not a merge.
  2. For the merged CmpD, does mode "async" still provision semi-sync grants? Wasteful or harmful?
  3. Was the PR rebased after live validation? CI is red against actual branch head.

Verdict: REQUEST_CHANGES

Mandatory:

  1. Fix all 28 ShellSpec test failures
  2. Pin mariadb:11.4 -> mariadb:11.4.10 or document floating-tag intent
  3. Remove 2,500-line Chart.yaml development journal

Strongly recommended:
4. Split into smaller PRs. At minimum: (a) standalone + addon-API, (b) topology merge, (c) Galera, (d) version expansion. The current PR is unreviewable as a single unit.

weicao and others added 15 commits June 5, 2026 06:14
Preserve the syncer-promoted rc=2 path during semisync replica rejoin. When the local pod is promoted while rejoin is waiting, publish the primary SQL listener path instead of continuing replica fail-closed cleanup. Bump the MariaDB chart to 1.2.0-alpha.8.
@weicao weicao force-pushed the feat/mariadb-alpha37-semisync-fencing-pr branch from b8f6d56 to 11f7464 Compare June 4, 2026 22:14
weicao and others added 14 commits June 5, 2026 07:59
Require pending secondary roleProbe publication to prove healthy replica IO/SQL threads before reporting secondary. Includes r28 regression coverage and alpha.10 chart bump.
Co-authored-by: weicao <weicao@users.noreply.github.com>
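
The health proof described in this commit can be sketched as a check over `SHOW SLAVE STATUS\G` output. The function name `replica_threads_healthy` and the choice to read the status text from stdin are assumptions; the actual roleProbe code may gather and evaluate this differently.

```sh
#!/bin/sh
# Hypothetical check: read SHOW SLAVE STATUS\G text on stdin and succeed only
# when both replication threads report "Yes". Empty input counts as unhealthy.
replica_threads_healthy() {
  status="$(cat)"
  printf '%s\n' "$status" |
    grep -Eq '^[[:space:]]*Slave_IO_Running:[[:space:]]*Yes[[:space:]]*$' &&
  printf '%s\n' "$status" |
    grep -Eq '^[[:space:]]*Slave_SQL_Running:[[:space:]]*Yes[[:space:]]*$'
}

# Assumed usage, with client credentials supplied by the environment:
#   mariadb -e 'SHOW SLAVE STATUS\G' | replica_threads_healthy && echo secondary
```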
Recover existing-slave-config startup when the persisted slave metadata exists but the runtime slave channel has disappeared. Reuse the existing primary-service replication configure path and keep normal unhealthy-channel retries fail-closed.
After a rolling restart (e.g. static reconfigure), the IO thread
auto-reconnects but semisync negotiation can take >90s to self-heal.
Add recover_semisync_slave_health_after_rejoin() that polls for 10s,
then forces IO thread restart if Rpl_semi_sync_slave_status is still
OFF, with a 15s recovery window. Called from both success paths of
finalize_replication_rejoin_ready_gate(). No-op for async mode.
Co-authored-by: weicao <weicao@users.noreply.github.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
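
The two-phase recovery described above (poll, then force an IO thread restart, then poll again) might look roughly like this. Everything here is a sketch under assumptions: `recover_semisync_sketch`, `wait_for`, the `REPLICATION_MODE` and `POLL_INTERVAL` variables, and the two hook functions are hypothetical names, and the real `recover_semisync_slave_health_after_rejoin()` is not shown in this PR page.

```sh
#!/bin/sh
# Hooks (assumed): how status is read and how the IO thread is restarted.
semisync_is_on() {
  mariadb -N -B -e "SHOW STATUS LIKE 'Rpl_semi_sync_slave_status'" | grep -q 'ON$'
}
restart_io_thread() {
  mariadb -e "STOP SLAVE IO_THREAD; START SLAVE IO_THREAD;"
}

# wait_for CHECK ATTEMPTS INTERVAL: run CHECK until it succeeds or attempts run out.
wait_for() {
  check="$1"; attempts="$2"; interval="$3"
  i=0
  while [ "$i" -lt "$attempts" ]; do
    if "$check"; then return 0; fi
    i=$((i + 1))
    sleep "$interval"
  done
  return 1
}

recover_semisync_sketch() {
  # No-op outside semisync mode (REPLICATION_MODE is a hypothetical variable).
  [ "${REPLICATION_MODE:-async}" = "semisync" ] || return 0
  poll="${POLL_INTERVAL:-1}"
  # Phase 1: give the channel about 10s to self-heal.
  wait_for semisync_is_on 10 "$poll" && return 0
  # Phase 2: force renegotiation, then allow about 15s to recover.
  restart_io_thread || return 1
  wait_for semisync_is_on 15 "$poll"
}
```

Keeping the status read and the restart behind small hook functions lets a ShellSpec-style test replace them without a live server.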
mariadb-backup --slave-info internally runs SHOW ALL SLAVES STATUS which
requires SLAVE MONITOR privilege. During initial pod startup there is a
brief window after mariadbd restarts on 0.0.0.0 where privilege tables
may not yet be fully loaded, causing the backup to fail with access denied.

Add a bounded 30s retry that verifies SHOW ALL SLAVES STATUS succeeds
before invoking mariadb-backup, closing the race window.
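
A bounded retry of this kind can be sketched with a small helper. The names `retry_with_budget`, `can_read_slave_status`, and `RETRY_SLEEP` are hypothetical, and the budget is counted in attempts at roughly one second each rather than wall-clock time, which is an assumption about how the 30s limit is enforced.

```sh
#!/bin/sh
# retry_with_budget ATTEMPTS CMD [ARGS...]: rerun CMD until it succeeds,
# giving up after ATTEMPTS failures. Sleeps RETRY_SLEEP seconds (default 1).
retry_with_budget() {
  budget="$1"; shift
  failures=0
  until "$@"; do
    failures=$((failures + 1))
    if [ "$failures" -ge "$budget" ]; then
      return 1
    fi
    sleep "${RETRY_SLEEP:-1}"
  done
  return 0
}

# Assumed probe: succeeds once the privilege needed by --slave-info is usable.
can_read_slave_status() {
  mariadb -e 'SHOW ALL SLAVES STATUS' >/dev/null 2>&1
}

# Assumed usage before the backup step:
#   retry_with_budget 30 can_read_slave_status || exit 1
```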
During switchover, the runtime-primary reconcile loop can race with the
switchover action's fence_current_primary_local_writes_after_dcs: the
reconcile sets read_only=OFF (thinking it's still primary) in the same
second the switchover action sets read_only=ON. The single-shot
verification fails because read_only was toggled back.

Replace the single-shot check with a 10-attempt bounded retry (1s each).
The reconcile loop discovers the new role from syncer within ~3 seconds
and stops fighting, so read_only=ON stabilizes well within the budget.
After Restart/Stop-Start the candidate pod may have a brief window
where 3306 is not yet listening. Syncer does a single-shot TCP
read-check before creating the DCS switchover; if that single check
hits the window, the entire switchover fails with "syncerctl could
not create DCS switchover". This bounded pre-DCS gate absorbs the
transient window (default 12s budget, 1s poll, 1s connect timeout).
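
One way to picture the pre-DCS gate is a polling loop around a TCP connect probe. This is a sketch only: `wait_for_listener`, `tcp_open`, and the `GATE_BUDGET`, `GATE_POLL`, `CONNECT_TIMEOUT`, and `TCP_PROBE` variables are hypothetical names, and the probe assumes `bash` and coreutils `timeout` are available in the container.

```sh
#!/bin/sh
# tcp_open HOST PORT: succeed if a TCP connection opens within the timeout.
# Uses bash's /dev/tcp pseudo-device via an external bash process.
tcp_open() {
  timeout "${CONNECT_TIMEOUT:-1}" \
    bash -c 'exec 3<>"/dev/tcp/$0/$1"' "$1" "$2" 2>/dev/null
}

# wait_for_listener HOST PORT: poll until the port accepts connections or the
# budget (default 12s, polled every 1s) is spent.
wait_for_listener() {
  host="$1"; port="$2"
  budget="${GATE_BUDGET:-12}"; poll="${GATE_POLL:-1}"
  waited=0
  # TCP_PROBE is a test hook that swaps in a different probe command.
  while ! "${TCP_PROBE:-tcp_open}" "$host" "$port"; do
    waited=$((waited + poll))
    if [ "$waited" -ge "$budget" ]; then
      return 1
    fi
    sleep "$poll"
  done
  return 0
}

# Assumed usage before asking syncer to create the switchover:
#   wait_for_listener "$candidate_host" 3306 || exit 1
```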
Labels

nopick Not auto cherry-pick when PR merged

2 participants