ROX-35008: Add GH action to add VMs to existing OCP clusters #21060

Open
vikin91 wants to merge 4 commits into master from piotr/ROX-35008-action-add-VMs
vikin91 commented Jun 10, 2026

Contributor

Description

Add a GitHub Actions workflow and supporting shell scripts to provision RHEL VMs on an existing OCP cluster and install the native roxagent on them.

This PR is now the native-only base of the stack:

  • installs OpenShift Virtualization when needed and waits for required rollout
  • deploys or adopts RHEL VMs and configures automation SSH access
  • installs roxagent natively and mounts a curated /tmp/roxroot view for scanning
  • exposes a workflow_dispatch action for running the flow from GitHub Actions

Follow-ups

User-facing documentation

Testing and quality

  • the change is production ready: the change is GA, or otherwise the functionality is gated by a feature flag
  • CI results are inspected

Automated testing

  • added unit tests
  • added e2e tests
  • added regression tests
  • added compatibility tests
  • modified existing tests

How I validated my change

Running the action:

Adding a new VM to an existing cluster that already runs another VM (KVM and the virt operator are already installed):

gh workflow run add-vms-to-cluster.yml \
    --ref piotr/ROX-35008-action-add-VMs \
    -f cluster-name=pr-06-08-vm \
    -f num-vms=1 \
    -f vm-os=rhel10

Logs: https://github.com/stackrox/stackrox/actions/runs/27331925416
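
`gh workflow run` only dispatches the workflow, so following the run to completion takes a second step. Below is a minimal sketch, assuming an authenticated `gh` CLI and assuming that the newest run of the workflow on the branch is the one just dispatched (this may not hold if several people dispatch at the same time). The function name `watch_latest_run` is hypothetical and not part of this PR.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical helper: look up the newest run of a workflow on a branch and
# block until it finishes. Assumes the newest run is the one just dispatched.
watch_latest_run() {
  local workflow="$1" branch="$2" run_id
  run_id="$(gh run list --workflow "$workflow" --branch "$branch" \
    --limit 1 --json databaseId --jq '.[0].databaseId')"
  if [ -z "$run_id" ] || [ "$run_id" = "null" ]; then
    echo "no run found for ${workflow} on ${branch}" >&2
    return 1
  fi
  # --exit-status makes gh return non-zero when the run fails.
  gh run watch "$run_id" --exit-status
}
```

It would be called as `watch_latest_run add-vms-to-cluster.yml piotr/ROX-35008-action-add-VMs` right after the dispatch.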

Copying to a totally fresh cluster (no ACS):

gh workflow run add-vms-to-cluster.yml \
    --ref piotr/ROX-35008-action-add-VMs \
    -f cluster-name=pr-06-11-action1 \
    -f num-vms=1 \
    -f vm-os=rhel10

Logs: https://github.com/stackrox/stackrox/actions/runs/27337049006 (the action passed ✅, but roxagent failed because the wrong binary was uploaded)
Logs for rerun (to overwrite the roxagent): https://github.com/stackrox/stackrox/actions/runs/27338863169/job/80770148617#logs

Manually by running the bash scripts:

  1. Creating a new VM on an existing cluster
  2. Taking over an existing VM (created manually beforehand) and updating roxagent

QUAY_RHACS_ENG_RO_USERNAME=user QUAY_RHACS_ENG_RO_PASSWORD=xxx \
    scripts/ci/add-vms/add-vms.sh \
        --num-vms 1 \
        --os rhel10 \
        --ssh-key ~/.ssh/id_ed25519.pub
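
The manual invocation above depends on both `QUAY_RHACS_ENG_RO_*` variables being set in the environment. A pre-flight check for that could look like the sketch below. Whether `add-vms.sh` already performs such a check is not stated in this PR, and the helper name `require_env` is hypothetical.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical pre-flight check: report every missing variable, then fail once.
require_env() {
  local name missing=0
  for name in "$@"; do
    # ${!name:-} is bash indirect expansion: it reads the variable named by $name
    # and yields an empty string when that variable is unset.
    if [ -z "${!name:-}" ]; then
      echo "missing required environment variable: ${name}" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

A caller would run `require_env QUAY_RHACS_ENG_RO_USERNAME QUAY_RHACS_ENG_RO_PASSWORD || exit 1` before doing any cluster work, so that a missing credential is reported up front instead of midway through provisioning.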

vikin91 commented Jun 10, 2026

Contributor Author

openshift-ci Bot commented Jun 10, 2026

Skipping CI for Draft Pull Request.
If you want CI signal for your change, please convert it to an actual PR.
You can still manually trigger a test run with /test all

coderabbitai Bot commented Jun 10, 2026

Contributor

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 9ff20a4c-1cdb-4e8e-a538-58e17723b144

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

📝 Walkthrough

This PR adds a complete GitHub Actions-driven workflow for provisioning RHEL VMs on ACS clusters and installing a StackRox VM agent. It includes a dispatch-triggered workflow, composite action for infrastructure orchestration, Bash scripts for VM creation/SSH access/agent installation, and Quadlet systemd unit files for container-based agent deployment.

Changes

VM Provisioning and Agent Installation Workflow

| Layer | File(s) | Summary |
| --- | --- | --- |
| GitHub Actions Workflow Entry Point | .github/workflows/add-vms-to-cluster.yml, scripts/ci/add-vms/action.yml | Manual workflow trigger accepts cluster name, VM parameters, agent type, and optional image tag/SSH key; composite action orchestrates infractl artifact download, kubectl verification, virt-operator installation, virtctl download, and invokes add-vms.sh with constructed arguments and secrets. |
| OpenShift Virtualization Setup | scripts/ci/add-vms/install-virt-operator.sh | Idempotent KubeVirt operator installation: creates namespace/operatorgroup/subscription, waits for CSV and HyperConverged readiness, patches HyperConverged to enable VSOCK feature gate, and patches Subscription to enable KVM emulation. |
| VM Deployment & SSH Access | scripts/ci/add-vms/add-vms.sh, scripts/ci/add-vms/deploy-vms.sh | Provisions RHEL VirtualMachines via CloudInit with automation and optional user SSH keys, manages Kubernetes secrets for SSH keypairs and image pull, probes SSH readiness, and adopts pre-existing VMs via password-based authentication when SSH probe fails. |
| Native Agent Installation | scripts/ci/add-vms/install-agent-native.sh | Builds roxagent native Go binary for linux/amd64, generates systemd unit and timer files, and deploys binary plus systemd artifacts to VMs via virtctl ssh and virtctl scp. |
| Quadlet Agent Installation & Image Handling | scripts/ci/add-vms/install-agent-quadlet.sh | Detects roxagent container image tag from central deployment or uses explicit tag, renders Quadlet templates with correct image, and performs idempotent installation by comparing installed vs desired image before deployment. |
| Main Orchestration & Summary | scripts/ci/add-vms/add-vms.sh (main/summary/cleanup) | Selects agent installer based on user input (native or quadlet), aggregates VM deployment results from managed/adopted/skipped arrays, prints summary with SSH access instructions and kubectl/virtctl commands, and cleans up temporary artifacts on exit. |
| Quadlet Installation Transport & Staging | scripts/ci/add-vms/quadlet/install.sh | Supports local, SSH, and virtctl-based installation modes; stages Quadlet files to target machine, validates staged content, filters container mounts for missing optional host paths, applies systemd configuration, and enables timer. |
| Quadlet Unit Definitions & Documentation | scripts/ci/add-vms/quadlet/roxagent.container, scripts/ci/add-vms/quadlet/roxagent.timer, scripts/ci/add-vms/quadlet/roxagent-prep.service, scripts/ci/add-vms/quadlet/roxagent-tmpfiles.conf, scripts/ci/add-vms/quadlet/README.md | Defines Quadlet container unit with vsock and RPM database mounts, timer for 3h40m periodic execution, prep service for RPM database setup, tmpfiles config for lock directory, and comprehensive documentation for prerequisites, installation, configuration, troubleshooting, and uninstallation. |
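
The "comparing installed vs desired image" step in the Quadlet row can be illustrated with a small exact-match check on the unit file's `Image=` line. This is only a sketch of the idea, not the code from `install-agent-quadlet.sh`: the function name `unit_has_image` and the example image reference are hypothetical.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical check: succeeds only if the unit file pins exactly the desired image.
# grep -F matches literally, -x requires the whole line to match, -q prints nothing.
unit_has_image() {
  local unit_file="$1" desired_image="$2"
  [ -f "$unit_file" ] && grep -Fxq -- "Image=${desired_image}" "$unit_file"
}
```

Matching the whole line keeps a tag such as `1.2` from being accepted when `1.2.3` is installed. As the review further down points out, a matching image line alone does not show that the timer and prep service are installed, so a check like this would be only one part of an idempotency test.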

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 12.73% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly describes the main objective of the changeset: adding a GitHub Action to automate adding VMs to OpenShift clusters, which aligns with all the files added in the PR. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Description check | ✅ Passed | The PR description is substantially complete with clear objectives, validation evidence, and proper checkbox usage matching the template. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

Comment @coderabbitai help to get the list of available commands and usage tips.

github-actions Bot commented Jun 10, 2026

Contributor

🚀 Build Images Ready

Images are ready for commit 332f73c. To use with deploy scripts:

export MAIN_IMAGE_TAG=4.12.x-136-g332f73ce15

vikin91 commented Jun 11, 2026

Contributor Author

@coderabbitai full review

coderabbitai Bot commented Jun 11, 2026

Contributor
✅ Action performed

Full review finished.

coderabbitai Bot left a comment

Contributor

Actionable comments posted: 4

🧹 Nitpick comments (1)
scripts/ci/add-vms/action.yml (1)

71-86: 💤 Low value

Consider certificate verification when downloading virtctl.

Line 83 uses curl -k (insecure) to download virtctl from the cluster's ConsoleCLIDownload URL. While this is typically necessary for in-cluster service endpoints, it bypasses certificate validation and could expose the download to MitM attacks if the cluster or network is compromised.

If the cluster provides a valid certificate or a trusted CA bundle is available, prefer validating the connection. Otherwise, document this security tradeoff.
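
One way to express the tradeoff described above is to derive curl's TLS options from the environment, so that skipping verification becomes an explicit opt-in. The sketch below is an assumption about how this could be structured, not the action's current behavior. `CA_BUNDLE` and `SKIP_CERT_VERIFY` are the example variable names from this review and do not exist as inputs today, and `curl_tls_args` is a hypothetical helper.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical helper: print the TLS-related curl options, one per line.
# CA_BUNDLE and SKIP_CERT_VERIFY are assumed variable names, not existing inputs.
curl_tls_args() {
  if [ -n "${CA_BUNDLE:-}" ] && [ -r "${CA_BUNDLE}" ]; then
    # Verify the server against the supplied CA bundle.
    printf '%s\n' --cacert "${CA_BUNDLE}"
  elif [ "${SKIP_CERT_VERIFY:-false}" = "true" ]; then
    # Explicit opt-in only; --insecure is the long form of -k.
    printf '%s\n' --insecure
  fi
  # With neither variable set, nothing is printed and curl uses the system trust store.
}
```

The printed options can be read into an array, for example with `mapfile -t tls_args < <(curl_tls_args)`, and passed to `curl` ahead of the download URL.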

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@scripts/ci/add-vms/action.yml` around lines 71 - 86, The download step
"Install virtctl from cluster" currently uses curl -k which disables certificate
verification; change it to validate TLS by removing -k and instead allow
supplying a CA bundle (e.g., via an environment variable like CA_BUNDLE or
KUBECONFIG_CACERT) to curl with --cacert, and only fall back to an explicit
opt-in (e.g., SKIP_CERT_VERIFY=true) to keep -k for environments that truly
require it; update the shell logic around the DOWNLOAD_URL retrieval and the
curl invocation to use the CA_BUNDLE variable when present, and ensure the step
still writes the downloaded binary to /usr/local/bin/virtctl and sets executable
permissions for virtctl.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @.github/workflows/add-vms-to-cluster.yml:
- Around line 49-50: The Checkout step currently uses actions/checkout without
disabling credential persistence; update the step that has uses:
actions/checkout@df4cb1c069e1874edd31b4311f1884172cec0e10 (the "Checkout" step)
to add the input persist-credentials: false so the GitHub token is not written
to .git/config and credentials are not persisted in the workspace.

In `@scripts/ci/add-vms/install-agent-quadlet.sh`:
- Around line 60-61: The temporary directory created in install-agent-quadlet.sh
(RENDERED_QUADLET_DIR="$(mktemp -d)") is never removed; add a cleanup trap that
rm -rfs "$RENDERED_QUADLET_DIR" on EXIT (and on ERR if desired) so the temp dir
is always removed after cp -a "${QUADLET_DIR}/." "${RENDERED_QUADLET_DIR}/";
implement this by defining a cleanup function and registering it with trap
(e.g., trap cleanup EXIT) near where RENDERED_QUADLET_DIR is created to ensure
no resource leak.
- Around line 75-94: The idempotency check in installed_image_tag_matches only
verifies the container Image= line and can skip reinstallation if the container
file exists but the timer/service or prep script weren't installed; modify
installed_image_tag_matches to also verify the presence and/or enabled state of
the agent systemd artifacts before returning true: after validating the
Image=${IMAGE_TAG} line (use IMAGE_TAG/installed_image/desired_line as currently
done), SSH into the VM (reuse the existing virtctl SSH invocation parameters:
NAMESPACE, AUTOMATION_SSH_PRIVKEY, SSH_USER) and assert that the roxagent timer
and service unit files (e.g., the roxagent.timer and roxagent-prep.service names
used in install.sh) and the prep script/binary installed by install.sh exist and
are executable (or that systemctl reports the timer/service enabled), and only
return success when both the image line and these files/services are present and
correct.
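
For the temporary-directory finding (lines 60-61), the cleanup pattern can be sketched as follows. It is a minimal illustration under stated assumptions, not the fix applied in this PR: `with_scratch_dir` is a hypothetical name, and the file written into the directory stands in for the rendered Quadlet files.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical helper: stage files in a scratch directory that is removed
# again no matter how the staging step ends.
with_scratch_dir() {
  (
    scratch_dir="$(mktemp -d)"
    # The EXIT trap runs when this subshell ends, on success or on failure.
    trap 'rm -rf "$scratch_dir"' EXIT
    printf 'placeholder unit content\n' > "${scratch_dir}/roxagent.container"
    # Print the path only so a caller can confirm it is gone afterwards.
    printf '%s\n' "$scratch_dir"
  )
}
```

Scoping the trap to a subshell is a design choice made for this sketch: it avoids replacing an EXIT trap that the calling script may already have registered. A script-level `trap cleanup EXIT`, as the finding suggests, works as well when no other EXIT trap is in use.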

In `@scripts/ci/add-vms/install-virt-operator.sh`:
- Around line 129-156: After annotating the HyperConverged CR to add the VSOCK
gate (use the vsock_patch / HCO_NAME / OLM_NAMESPACE block), add a wait that
ensures the virt-handler DaemonSet has rolled out with the new config before
proceeding: after detecting VSOCK in kv_gates, call the same rollout check used
in tests/e2e/lib.sh (i.e., wait for the daemonset/virt-handler in the KubeVirt
namespace to reach desired number of available pods or use kubectl rollout
status) and fail via die if the rollout does not complete within a timeout;
ensure this check references the virt-handler daemonset name and the appropriate
KubeVirt namespace.
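
The rollout wait requested above could be built on `kubectl rollout status`, as in the sketch below. The default namespace `openshift-cnv` is an assumption about where OpenShift Virtualization is installed, the default timeout is arbitrary, and `wait_for_virt_handler` is a hypothetical name. The script's own `die` helper is replaced here by a plain error message and return code.

```shell
#!/usr/bin/env bash
set -eu

# Hypothetical wrapper around `kubectl rollout status` for the virt-handler DaemonSet.
# The default namespace openshift-cnv is an assumption; pass another one if needed.
wait_for_virt_handler() {
  local namespace="${1:-openshift-cnv}" timeout="${2:-10m}"
  if ! kubectl rollout status daemonset/virt-handler \
      --namespace "$namespace" --timeout="$timeout"; then
    echo "virt-handler did not roll out within ${timeout}" >&2
    return 1
  fi
}
```

One caveat: if the operator has not yet propagated the feature-gate change to the DaemonSet, `kubectl rollout status` may report success for the old revision, so an additional wait for the change to reach the DaemonSet may be needed first.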

---

Nitpick comments:
In `@scripts/ci/add-vms/action.yml`:
- Around line 71-86: The download step "Install virtctl from cluster" currently
uses curl -k which disables certificate verification; change it to validate TLS
by removing -k and instead allow supplying a CA bundle (e.g., via an environment
variable like CA_BUNDLE or KUBECONFIG_CACERT) to curl with --cacert, and only
fall back to an explicit opt-in (e.g., SKIP_CERT_VERIFY=true) to keep -k for
environments that truly require it; update the shell logic around the
DOWNLOAD_URL retrieval and the curl invocation to use the CA_BUNDLE variable
when present, and ensure the step still writes the downloaded binary to
/usr/local/bin/virtctl and sets executable permissions for virtctl.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Enterprise

Run ID: 02e70d78-3b1b-4b33-8de7-22777da1e1a6

📥 Commits

Reviewing files that changed from the base of the PR and between f9dc960 and c5addcb.

📒 Files selected for processing (13)
  • .github/workflows/add-vms-to-cluster.yml
  • scripts/ci/add-vms/action.yml
  • scripts/ci/add-vms/add-vms.sh
  • scripts/ci/add-vms/deploy-vms.sh
  • scripts/ci/add-vms/install-agent-native.sh
  • scripts/ci/add-vms/install-agent-quadlet.sh
  • scripts/ci/add-vms/install-virt-operator.sh
  • scripts/ci/add-vms/quadlet/README.md
  • scripts/ci/add-vms/quadlet/install.sh
  • scripts/ci/add-vms/quadlet/roxagent-prep.service
  • scripts/ci/add-vms/quadlet/roxagent-tmpfiles.conf
  • scripts/ci/add-vms/quadlet/roxagent.container
  • scripts/ci/add-vms/quadlet/roxagent.timer

Comment thread .github/workflows/add-vms-to-cluster.yml
Comment thread scripts/ci/add-vms/install-agent-quadlet.sh Outdated
Comment thread scripts/ci/add-vms/install-agent-quadlet.sh Outdated
Comment thread scripts/ci/add-vms/install-virt-operator.sh

Split the add-vms work so the base action branch contains the workflow, VM provisioning, virt operator setup, and native agent path only, leaving the Quadlet mode for a stacked follow-up PR.

User request: "okay, we split that. All the non-quadlet action goes to `piotr/ROX-35008-action-add-VMs` all quadlet related to `piotr/ROX-35008-VM-action-quadlet`. Note that the branches are on top of each other. Split that cleanly. You can commit".
Partially generated by AI.

Co-authored-by: Cursor <cursoragent@cursor.com>

vikin91 force-pushed the piotr/ROX-35008-action-add-VMs branch from bb8db35 to ff281a0 on June 11, 2026 15:18
vikin91 commented Jun 11, 2026

Contributor Author

@coderabbitai full review

coderabbitai Bot commented Jun 11, 2026

Contributor
✅ Action performed

Full review finished.

openshift-ci Bot commented Jun 11, 2026

@vikin91: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/gke-ui-e2e-tests | 332f73c | link | true | /test gke-ui-e2e-tests |
| ci/prow/ocp-4-21-nongroovy-e2e-tests | 332f73c | link | false | /test ocp-4-21-nongroovy-e2e-tests |
| ci/prow/ocp-4-21-scanner-v4-install-tests | 332f73c | link | false | /test ocp-4-21-scanner-v4-install-tests |
| ci/prow/ocp-4-12-scanner-v4-install-tests | 332f73c | link | false | /test ocp-4-12-scanner-v4-install-tests |
| ci/prow/ocp-4-22-scanner-v4-install-tests | 332f73c | link | false | /test ocp-4-22-scanner-v4-install-tests |
| ci/prow/ocp-4-22-operator-e2e-tests | 332f73c | link | false | /test ocp-4-22-operator-e2e-tests |
| ci/prow/ocp-4-12-operator-e2e-tests | 332f73c | link | false | /test ocp-4-12-operator-e2e-tests |
| ci/prow/ocp-4-21-operator-e2e-tests | 332f73c | link | false | /test ocp-4-21-operator-e2e-tests |
| ci/prow/ocp-4-22-qa-e2e-tests | 332f73c | link | false | /test ocp-4-22-qa-e2e-tests |

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
