KVM HA: fence by confirming host power state (fix host stuck in Fencing when already powered off) #13377
KVMHAProvider.fence() declared a host fenced only when the out-of-band power-off command reported success. Against an already-off chassis the BMC rejects the power-off (e.g. Redfish returns HTTP 409), so fence() failed and the host stayed stuck in the Fencing HA state, which maps to Disconnected (not Down). VM-HA therefore never restarted the VMs until the dead host was powered back on.

Fencing now succeeds based on the actual chassis power state:
- if the host is already powered off (OOBM STATUS == Off), treat it as fenced;
- otherwise issue a best-effort power-off and confirm via OOBM STATUS;
- only a confirmed Off state counts as success; if the state cannot be confirmed (e.g. unreachable BMC) the fence fails and is retried, to avoid split-brain.

Also map Redfish PowerOperation.OFF to ForceOff (hard power-off) instead of GracefulShutdown, consistent with the ipmitool driver and appropriate for fencing an unresponsive host (SOFT remains the graceful ACPI shutdown).

Fixes apache#13376
Codecov Report

❌ Patch coverage is

Additional details and impacted files

| Coverage Diff | 4.22 | #13377 | +/- |
|---|---|---|---|
| Coverage | 17.67% | 17.68% | |
| Complexity | 15792 | 15798 | +6 |
| Files | 5922 | 5922 | |
| Lines | 533165 | 533184 | +19 |
| Branches | 65208 | 65211 | +3 |
| Hits | 94242 | 94273 | +31 |
| Misses | 428276 | 428264 | -12 |
| Partials | 10647 | 10647 | |
@blueorangutan package KVM

@andrijapanicsb a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM SystemVM template(s). I'll keep you posted as I make progress.

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 18194

@blueorangutan test r9 kvm-r9

@andrijapanicsb [SL] unsupported parameters provided. Supported mgmt server os are:

@blueorangutan test rocky9 kvm-rocky9

@andrijapanicsb a [SL] Trillian-Jenkins test job (rocky9 mgmt + kvm-rocky9) has been kicked to run smoke tests
Description
When a KVM host with host-HA + out-of-band management (OOBM) enabled is hard powered off (forced chassis-off from the BMC, or a real power/cable failure), CloudStack never transitions the host to `Down` and therefore never restarts its VMs on other hosts; the host stays in `Alert`/`Disconnected` indefinitely.

Root cause: the host-HA state machine declares a host dead (`HAState.Fenced` → investigator `Status.Down`) only after a successful OOBM power-off. Against an already-off chassis the BMC rejects the power-off (the Redfish driver maps `OFF` to `GracefulShutdown`, which returns HTTP 409 when the system is already off), so `KVMHAProvider.fence()` reports failure and the host stays stuck in the `Fencing` state, which `HAManagerImpl.getHostStatusFromHAConfig()` maps to `Status.Disconnected`, not `Status.Down`. VM-HA is therefore never invoked, and the VMs are only recovered once the original (dead) host is powered back on, at which point the pending power-off finally succeeds.

Observed in production with Redfish/iDRAC. Full root-cause analysis and management-server log evidence are in #13376.
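The consequence of that mapping can be illustrated with a minimal Java sketch. The enums and method names below are hypothetical simplifications, not CloudStack's actual types, and only the two HA states named above are modelled.

```java
/** Hypothetical, simplified model of the HA-state to host-status mapping described above. */
final class HaStatusSketch {
    // Only the two HA states discussed in this description are modelled.
    enum HaState { FENCING, FENCED }
    enum HostStatus { DISCONNECTED, DOWN }

    /** Fenced maps to Down; Fencing maps to Disconnected. */
    static HostStatus hostStatusFor(HaState state) {
        switch (state) {
            case FENCED:
                return HostStatus.DOWN;
            case FENCING:
                return HostStatus.DISCONNECTED;
            default:
                throw new IllegalArgumentException("unmodelled state: " + state);
        }
    }

    /** Assumption from the description: VM-HA is only invoked for a host reported as Down. */
    static boolean vmHaCanRestartVms(HaState state) {
        return hostStatusFor(state) == HostStatus.DOWN;
    }

    public static void main(String[] args) {
        System.out.println(vmHaCanRestartVms(HaState.FENCING));
        System.out.println(vmHaCanRestartVms(HaState.FENCED));
    }
}
```

A host that never leaves `FENCING` therefore never becomes eligible for VM restart in this model, which matches the observed symptom.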
Fix
Fencing now succeeds based on the actual chassis power state, not the power-off command's return code:
- if the host is already powered off (`OOBM STATUS == Off`) → treat it as fenced (no power-off issued);
- otherwise issue a best-effort power-off and confirm via OOBM `STATUS`;
- only a confirmed `Off` state counts as a successful fence; if the state cannot be confirmed (e.g. an unreachable BMC) the fence fails and is retried, to avoid split-brain.

This is OOBM-driver-agnostic (works for ipmitool, Redfish and nested-cloudstack drivers).
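The fencing decision described above can be sketched roughly as follows. This is not the code in this PR: `Oobm`, `PowerState` and `fence` are hypothetical stand-ins for the OOBM driver and `KVMHAProvider.fence()`, and an unreachable BMC is modelled as an empty `Optional` or a thrown exception.

```java
import java.util.Optional;

/** Hypothetical sketch: fence by confirmed power state, not by the power-off return code. */
final class FenceSketch {
    enum PowerState { ON, OFF }

    /** Hypothetical stand-in for an out-of-band management driver. */
    interface Oobm {
        /** Chassis power state, or empty when the BMC cannot be reached. */
        Optional<PowerState> status();

        /** Hard power-off; may throw when the BMC rejects the request. */
        void powerOff();
    }

    /** True only when the BMC answers and reports Off. */
    private static boolean isConfirmedOff(Oobm oobm) {
        try {
            return oobm.status().map(s -> s == PowerState.OFF).orElse(false);
        } catch (RuntimeException e) {
            return false; // state unknown: never treat as fenced
        }
    }

    static boolean fence(Oobm oobm) {
        if (isConfirmedOff(oobm)) {
            return true; // already off: no power-off issued
        }
        try {
            oobm.powerOff(); // best effort; the result alone is not trusted
        } catch (RuntimeException e) {
            // ignored: the status check below decides the outcome
        }
        return isConfirmedOff(oobm); // false means the caller should retry
    }
}
```

Returning `false` whenever the state cannot be confirmed keeps the split-brain guarantee: the host is only reported as fenced when the BMC positively reports `OFF`.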
Additionally, the Redfish driver now maps `PowerOperation.OFF` to `ForceOff` (a hard power-off) instead of `GracefulShutdown`, consistent with the ipmitool driver and appropriate for fencing an unresponsive host; `SOFT` remains the graceful ACPI shutdown. Also fixes a latent `String.format` argument-count bug on the Redfish `STATUS` branch.

Fixes: #13376
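The Redfish mapping change can be summarised with a small Java sketch. The enum and method here are hypothetical and model only the two operations discussed in this description; the strings are the Redfish reset types named above.

```java
/** Hypothetical sketch of the post-fix mapping from power operation to Redfish reset type. */
final class RedfishResetSketch {
    // Only the two operations discussed in this description are modelled.
    enum PowerOperation { OFF, SOFT }

    static String resetTypeFor(PowerOperation op) {
        switch (op) {
            case OFF:
                return "ForceOff"; // hard power-off, suitable for an unresponsive host
            case SOFT:
                return "GracefulShutdown"; // graceful ACPI shutdown
            default:
                throw new IllegalArgumentException("unmodelled operation: " + op);
        }
    }

    public static void main(String[] args) {
        System.out.println(resetTypeFor(PowerOperation.OFF));
        System.out.println(resetTypeFor(PowerOperation.SOFT));
    }
}
```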
Types of changes
Bug Severity
How Has This Been Tested?
Unit tests added to `KVMHostHATest` (all green) covering the fence behaviour:
- `Off` → fenced;
- `Off` → still fenced (the regression for this issue);

Note on reproduction: the original symptom reproduces on real Redfish hardware (power-off-when-off → HTTP 409). Software/nested OOBM drivers whose power-off is idempotent (e.g. the nested-cloudstack driver's `stopVirtualMachine`, which is a no-op on an already-stopped VM) do not exhibit the bug, so the deterministic coverage is provided by the unit tests above.
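Why idempotent drivers hide the bug can be shown with a toy model of the pre-fix criterion (fenced only if the power-off command succeeded). Everything below is a hypothetical illustration, not CloudStack or driver code.

```java
/** Hypothetical model of why only non-idempotent power-off drivers reproduce the issue. */
final class ReproductionSketch {
    /** Returns true when the driver accepts the power-off request. */
    interface PowerOffDriver {
        boolean powerOff(boolean chassisAlreadyOff);
    }

    // Redfish-like behaviour: the request is rejected (HTTP 409) when the system is already off.
    static final PowerOffDriver REJECTS_WHEN_OFF = alreadyOff -> !alreadyOff;

    // Idempotent behaviour: powering off an already-stopped target is a successful no-op.
    static final PowerOffDriver IDEMPOTENT = alreadyOff -> true;

    /** Pre-fix criterion: the host counts as fenced only if the command succeeded. */
    static boolean fencedByCommandResult(PowerOffDriver driver, boolean chassisAlreadyOff) {
        return driver.powerOff(chassisAlreadyOff);
    }

    public static void main(String[] args) {
        System.out.println(fencedByCommandResult(REJECTS_WHEN_OFF, true));
        System.out.println(fencedByCommandResult(IDEMPOTENT, true));
    }
}
```

Under the old criterion only the rejecting driver leaves an already-off host unfenced, which is why the symptom needs real Redfish hardware or a unit test that fakes the rejection.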