KVM HA: fence by confirming host power state (fix host stuck in Fencing when already powered off) #13377

Open
andrijapanicsb wants to merge 1 commit into apache:4.22 from andrijapanicsb:fix/kvm-ha-fence-already-off

@andrijapanicsb
Contributor

Description

When a KVM host with host-HA + out-of-band management (OOBM) enabled is hard powered off (forced chassis-off from the BMC, or a real power/cable failure), CloudStack never transitions the host to Down and therefore never restarts its VMs on other hosts — the host stays in Alert/Disconnected indefinitely.

Root cause: the host-HA state machine declares a host dead (HAState.Fenced → investigator Status.Down) only after a successful OOBM power-off. Against an already-off chassis the BMC rejects the power-off (the Redfish driver maps OFF to GracefulShutdown, which returns HTTP 409 when the system is already off), so KVMHAProvider.fence() reports failure and the host stays stuck in the Fencing state — which HAManagerImpl.getHostStatusFromHAConfig() maps to Status.Disconnected, not Status.Down. VM-HA is therefore never invoked, and the VMs are only recovered once the original (dead) host is powered back on, at which point the pending power-off finally succeeds.
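
The chain described above can be sketched as follows. This is a simplified illustration, not CloudStack code: the enums and the `HaMappingSketch` class are hypothetical, and only the two HA states named in this description (Fencing and Fenced) are modelled.

```java
/** Hypothetical, reduced model of the HA states named in this description. */
enum HaStateSketch { FENCING, FENCED }

/** Hypothetical, reduced model of the host statuses named in this description. */
enum HostStatusSketch { DISCONNECTED, DOWN }

final class HaMappingSketch {
    /** Pre-fix rule: the host only leaves Fencing if the power-off command succeeded. */
    static HaStateSketch afterFenceAttempt(boolean powerOffSucceeded) {
        return powerOffSucceeded ? HaStateSketch.FENCED : HaStateSketch.FENCING;
    }

    /** Fenced is reported as Down; Fencing is reported as Disconnected. */
    static HostStatusSketch hostStatus(HaStateSketch state) {
        return state == HaStateSketch.FENCED
                ? HostStatusSketch.DOWN
                : HostStatusSketch.DISCONNECTED;
    }

    /** VM-HA restarts VMs elsewhere only once the host is reported Down. */
    static boolean vmHaInvoked(HostStatusSketch status) {
        return status == HostStatusSketch.DOWN;
    }
}
```

With a rejected power-off, `afterFenceAttempt(false)` stays at `FENCING`, `hostStatus` yields `DISCONNECTED`, and `vmHaInvoked` is false, which matches the stuck behaviour reported here.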

Observed in production with Redfish/iDRAC. Full root-cause analysis and management-server log evidence are in #13376.

Fix

Fencing now succeeds based on the actual chassis power state, not the power-off command's return code:

  • if the host is already powered off (OOBM STATUS == Off) → treat it as fenced (no power-off issued);
  • otherwise issue a best-effort power-off and then confirm via OOBM STATUS;
  • only a confirmed Off state counts as a successful fence; if the state cannot be confirmed (e.g. an unreachable BMC) the fence fails and is retried, to avoid split-brain.
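
The three rules above can be sketched roughly as below. This is a minimal illustration under stated assumptions, not the code in this PR: `ChassisPower`, `OobmClient`, `FenceSketch` and `FakeBmc` are hypothetical names, and the real OOBM service API differs.

```java
/** Hypothetical stand-in for the OOBM power state; not CloudStack's real type. */
enum ChassisPower { ON, OFF, UNKNOWN }

/** Hypothetical minimal view of an out-of-band management driver. */
interface OobmClient {
    ChassisPower status();  // may throw if the BMC is unreachable
    boolean powerOff();     // may return false or throw (e.g. on HTTP 409)
}

final class FenceSketch {
    /** Returns true only when the chassis is confirmed OFF. */
    static boolean fence(OobmClient oobm) {
        // Rule 1: already off, so treat as fenced without issuing a power-off.
        if (confirmedOff(oobm)) {
            return true;
        }
        // Rule 2: best-effort power-off; its result is deliberately ignored.
        try {
            oobm.powerOff();
        } catch (RuntimeException e) {
            // A rejected or failed command is not decisive on its own.
        }
        // Rule 3: only a confirmed OFF state counts as a successful fence.
        return confirmedOff(oobm);
    }

    private static boolean confirmedOff(OobmClient oobm) {
        try {
            return oobm.status() == ChassisPower.OFF;
        } catch (RuntimeException e) {
            return false; // unconfirmed state: fail the fence so it is retried
        }
    }
}

/** Scripted fake BMC, used only to exercise the sketch. */
final class FakeBmc implements OobmClient {
    private ChassisPower current;
    private final ChassisPower afterPowerOff;
    private final boolean powerOffResult;
    int powerOffCalls = 0;

    FakeBmc(ChassisPower initial, ChassisPower afterPowerOff, boolean powerOffResult) {
        this.current = initial;
        this.afterPowerOff = afterPowerOff;
        this.powerOffResult = powerOffResult;
    }

    @Override
    public ChassisPower status() {
        return current;
    }

    @Override
    public boolean powerOff() {
        powerOffCalls++;
        current = afterPowerOff;
        return powerOffResult;
    }
}
```

For example, `FenceSketch.fence(new FakeBmc(ChassisPower.ON, ChassisPower.OFF, false))` returns true even though the power-off command reports failure, because the follow-up status check confirms Off.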

This is OOBM-driver-agnostic (works for ipmitool, Redfish and nested-cloudstack drivers).

Additionally, the Redfish driver now maps PowerOperation.OFF to ForceOff (a hard power-off) instead of GracefulShutdown — consistent with the ipmitool driver and appropriate for fencing an unresponsive host; SOFT remains the graceful ACPI shutdown. Also fixes a latent String.format argument-count bug on the Redfish STATUS branch.
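
As a rough illustration of the new mapping, the sketch below translates the two operations mentioned above into Redfish `ResetType` values and builds the JSON body for the standard `ComputerSystem.Reset` action. `PowerOp` and `RedfishResetSketch` are hypothetical names; how `RedfishWrapper` actually builds its requests is not shown in this description and is assumed here.

```java
/** Hypothetical subset of power operations; only the two discussed here. */
enum PowerOp { OFF, SOFT }

final class RedfishResetSketch {
    /** Maps a power operation to a Redfish ResetType value. */
    static String resetType(PowerOp op) {
        switch (op) {
            case OFF:
                return "ForceOff";          // hard power-off, suitable for fencing
            case SOFT:
                return "GracefulShutdown";  // graceful ACPI shutdown
            default:
                throw new IllegalArgumentException("Unmapped operation: " + op);
        }
    }

    /** Builds the JSON body for the ComputerSystem.Reset action. */
    static String resetBody(PowerOp op) {
        // One format specifier, one argument.
        return String.format("{\"ResetType\": \"%s\"}", resetType(op));
    }
}
```

Here `RedfishResetSketch.resetBody(PowerOp.OFF)` yields `{"ResetType": "ForceOff"}`.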

Fixes: #13376

Types of changes

  • Breaking change (fix or feature that would cause existing functionality to change)
  • New feature/enhancement (non-breaking change which adds functionality)
  • Bug fix (non-breaking change which fixes an issue)
  • Enhancement (improves an existing feature and functionality)
  • Cleanup (Code refactoring and cleanup, that may add test cases)
  • build/CI

Bug Severity

  • BLOCKER
  • Critical
  • Major
  • Minor
  • Trivial

How Has This Been Tested?

Unit tests added to KVMHostHATest (all green) covering the fence behaviour:

  • host already off → fenced without issuing a power-off;
  • power-off succeeds, STATUS confirms Off → fenced;
  • power-off command fails (HTTP 409) but STATUS confirms Off → still fenced (the regression for this issue);
  • power state cannot be confirmed (unreachable BMC) → fence fails (no split-brain);
  • OOBM not enabled → fence fails.
mvn -pl plugins/hypervisors/kvm -Dtest=KVMHostHATest test
=> Tests run: 9, Failures: 0, Errors: 0, Skipped: 0

Note on reproduction: the original symptom reproduces on real Redfish hardware (power-off-when-off → HTTP 409). Software/nested OOBM drivers whose power-off is idempotent (e.g. the nested-cloudstack driver's stopVirtualMachine, which is a no-op on an already-stopped VM) do not exhibit the bug, so the deterministic coverage is provided by the unit tests above.
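
The difference between the two kinds of driver can be illustrated with a small sketch. The names below (`PowerOffDriver`, `IdempotencySketch`) are hypothetical, and both drivers are reduced to a single boolean result; this is not how the real drivers are implemented.

```java
/** Hypothetical one-method view of a driver's power-off behaviour. */
interface PowerOffDriver {
    /** Returns whether the power-off command reports success. */
    boolean powerOff(boolean chassisAlreadyOff);
}

final class IdempotencySketch {
    /** Redfish-like: power-off against an already-off system is rejected (HTTP 409). */
    static final PowerOffDriver REJECTING = alreadyOff -> !alreadyOff;

    /** Nested-like: power-off against an already-stopped target is a successful no-op. */
    static final PowerOffDriver IDEMPOTENT = alreadyOff -> true;

    /** Pre-fix rule: the host is fenced only if the power-off command succeeded. */
    static boolean oldRuleFenced(PowerOffDriver driver, boolean chassisAlreadyOff) {
        return driver.powerOff(chassisAlreadyOff);
    }
}
```

Under the pre-fix rule, `oldRuleFenced(IDEMPOTENT, true)` is true while `oldRuleFenced(REJECTING, true)` is false, which is why the symptom only shows up with drivers whose power-off is not idempotent.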

KVMHAProvider.fence() declared a host fenced only when the out-of-band power-off
command reported success. Against an already-off chassis the BMC rejects the
power-off (e.g. Redfish returns HTTP 409), so fence() failed and the host stayed
stuck in the Fencing HA state, which maps to Disconnected (not Down). VM-HA
therefore never restarted the VMs until the dead host was powered back on.

Fencing now succeeds based on the actual chassis power state:
 - if the host is already powered off (OOBM STATUS == Off), treat it as fenced;
 - otherwise issue a best-effort power-off and confirm via OOBM STATUS;
 - only a confirmed Off state counts as success; if the state cannot be confirmed
   (e.g. unreachable BMC) the fence fails and is retried, to avoid split-brain.

Also map Redfish PowerOperation.OFF to ForceOff (hard power-off) instead of
GracefulShutdown, consistent with the ipmitool driver and appropriate for fencing
an unresponsive host (SOFT remains the graceful ACPI shutdown).

Fixes apache#13376
@codecov

codecov Bot commented Jun 8, 2026

Codecov Report

❌ Patch coverage is 79.31034% with 6 lines in your changes missing coverage. Please review.
✅ Project coverage is 17.68%. Comparing base (21b2025) to head (65a3e99).

| Files with missing lines | Patch % | Lines |
|---|---|---|
| ...va/org/apache/cloudstack/kvm/ha/KVMHAProvider.java | 85.18% | 3 Missing and 1 partial ⚠️ |
| ...fbandmanagement/driver/redfish/RedfishWrapper.java | 0.00% | 2 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff            @@
##               4.22   #13377   +/-   ##
=========================================
  Coverage     17.67%   17.68%           
- Complexity    15792    15798    +6     
=========================================
  Files          5922     5922           
  Lines        533165   533184   +19     
  Branches      65208    65211    +3     
=========================================
+ Hits          94242    94273   +31     
+ Misses       428276   428264   -12     
  Partials      10647    10647           
| Flag | Coverage Δ |
|---|---|
| uitests | 3.69% <ø> (ø) |
| unittests | 18.75% <79.31%> (+<0.01%) ⬆️ |


@andrijapanicsb
Contributor Author

@blueorangutan package KVM

@blueorangutan

@andrijapanicsb a [SL] Jenkins job has been kicked to build packages. It will be bundled with KVM SystemVM template(s). I'll keep you posted as I make progress.

@blueorangutan

Packaging result [SF]: ✔️ el8 ✔️ el9 ✔️ el10 ✔️ debian ✔️ suse15. SL-JID 18194

@andrijapanicsb
Contributor Author

@blueorangutan test r9 kvm-r9

@blueorangutan

@andrijapanicsb [SL] unsupported parameters provided. Supported mgmt server os are: suse15, alma10, ol10, rocky10, alma9, centos7, centos6, rocky9, alma8, ubuntu18, ol9, ol8, ubuntu22, debian12, ubuntu20, rocky8, ubuntu24. Supported hypervisors are: kvm-centos6, kvm-centos7, kvm-rocky8, kvm-rocky9, kvm-rocky10, kvm-ol8, kvm-ol9, kvm-ol10, kvm-alma8, kvm-alma9, kvm-alma10, kvm-ubuntu18, kvm-ubuntu20, kvm-ubuntu22, kvm-ubuntu24, kvm-debian12, kvm-suse15, vmware-55u3, vmware-60u2, vmware-65u2, vmware-67u3, vmware-70u1, vmware-70u2, vmware-70u3, vmware-80, vmware-80u1, vmware-80u2, vmware-80u3, vmware-80u3e, xenserver-65sp1, xenserver-71, xenserver-74, xenserver-84, xcpng74, xcpng76, xcpng80, xcpng81, xcpng82, xcpng83

@andrijapanicsb
Contributor Author

@blueorangutan test rocky9 kvm-rocky9

@blueorangutan

@andrijapanicsb a [SL] Trillian-Jenkins test job (rocky9 mgmt + kvm-rocky9) has been kicked to run smoke tests
