feat(cli): dstack + dstackup — hands-off single-node onboarding#731

Open
h4x3rotab wants to merge 8 commits into master from research/onboarding-friction

@h4x3rotab

What

Adds a hands-off single-node onboarding path for dstack, cutting the route to a first app from ~20 manual steps to two commands:

# host: SGX preflight → render configs → systemd units → bootstrap a KMS-in-CVM
dstackup install --image <ver>

# deploy an app
dstack run <compose> --image <ver> --port <p> --allowlist <path>

Four new crates under crates/ (no existing crate is modified):

  • dstack-core — shared lib: typed VMM prpc client, config rendering (vmm.toml / kms.toml / auth-allowlist.json), app-compose build, host/SGX detection + CID/port coexistence scan, atomic-write + flock helpers.
  • dstack — client CLI: run / ls / logs (info/upgrade/init are scaffolds). Talks to a local VMM over its unix socket, or a remote one over http(s).
  • dstackup — host-setup CLI: install / status / destroy. Privileged; manages dstack-vmm + dstack-auth as systemd units; deploys and bootstraps a single-node KMS-in-CVM. Idempotent install, deterministic cgroup teardown.
  • dstack-auth — Rust reimplementation of the single-operator KMS auth webhook (compose-hash + os-image allowlist; fails closed).

Design doc: docs/onboarding-redesign.md. Part of #699.

Architecture

Two binaries, mirroring kubeadm / kubectl: dstackup (host setup, local + privileged) and dstack (client, local-or-remote). The dashboard binds localhost by default and is reached via an SSH tunnel, which gives the browser a secure context for the env-encryption crypto without needing a cert.

Security model — and the deliberate single-node trade-offs

The single-node flow makes a few scoped relaxations. Each is confined to the single-node KMS-in-CVM config and does not touch the per-app key path or multi-node replication (verified against kms/):

  • enforce_self_authorization=false — removes only the KMS's self-attestation gate at bootstrap; per-app quote verification + the compose-hash allowlist stay fully enforced.
  • verify_os_image=false (KMS download-verify) is compensated by pinning the app's measured OS image (digest.txt) into the webhook's osImages — and it's fail-closed: a missing pin aborts install unless --allow-unpinned-image is passed.
  • --expose is intentionally disabled — the VMM management RPCs are unauthenticated and the TLS+token transport isn't built yet, so it fails with SSH-tunnel guidance rather than opening an unauthenticated control plane.
  • TCB-status enforcement is intentionally deferred vs auth-simple (the single-node operator controls/trusts their own host, and real TDX hosts routinely report a non-UpToDate TCB) — documented in dstack-auth.
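To make the fail-closed allowlist behaviour concrete, here is a minimal sketch of the decision, not the code in dstack-auth. The `Allowlist` struct, its field names, and the `is_allowed` / `normalize_hex` helpers are hypothetical. It assumes the semantics described in this PR: an unreadable allowlist denies everything, hex digests are compared after stripping `0x` and lowercasing, and an empty `osImages` list means "any image" (which is why install refuses to proceed without a pin).

```rust
/// Hypothetical in-memory form of the allowlist; names are illustrative only.
pub struct Allowlist {
    pub compose_hashes: Vec<String>,
    pub os_images: Vec<String>,
}

/// Normalize a hex digest: trim, drop an optional 0x/0X prefix, lowercase.
pub fn normalize_hex(s: &str) -> String {
    let t = s.trim();
    let t = t
        .strip_prefix("0x")
        .or_else(|| t.strip_prefix("0X"))
        .unwrap_or(t);
    t.to_ascii_lowercase()
}

/// Decide whether keys may be released. `None` stands for an allowlist that
/// could not be read or parsed, and is always denied (fail closed).
pub fn is_allowed(allowlist: Option<&Allowlist>, compose_hash: &str, os_image: &str) -> bool {
    let Some(list) = allowlist else {
        return false;
    };
    let compose = normalize_hex(compose_hash);
    let image = normalize_hex(os_image);

    // An empty compose-hash list matches nothing, so it denies every app.
    let compose_ok = list
        .compose_hashes
        .iter()
        .any(|h| normalize_hex(h) == compose);

    // Assumption taken from the commit notes: an empty osImages list allows
    // any image, so the pin only protects when it is actually present.
    let image_ok = list.os_images.is_empty()
        || list.os_images.iter().any(|h| normalize_hex(h) == image);

    compose_ok && image_ok
}
```

Passing the parse result as an `Option` keeps the "invalid file means deny" rule in one place instead of relying on every caller to remember it.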

The auth webhook is fail-closed by construction; the allowlist read-modify-write is atomic (temp + rename + dir fsync) under an flock; and install runs all CID/port collision checks in a preflight before any side effect, so a clash refuses cleanly instead of half-installing.
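The temp + rename + dir fsync sequence can be sketched with the standard library alone. This is an illustration of the technique, not the `write_atomic` in `dstack-core::fsutil`: the temp-file naming is an assumption, the exclusive flock around the read-modify-write is deliberately left out, and the directory fsync relies on Unix behaviour (opening a directory read-only and calling `sync_all` on it).

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

/// Write `contents` to `path` so readers see either the old file or the new
/// one, never a partial write. Sketch only; Unix semantics assumed.
pub fn write_atomic(path: &Path, contents: &[u8]) -> io::Result<()> {
    // The temp file must live in the same directory so rename stays on one
    // filesystem and is therefore atomic.
    let dir = match path.parent() {
        Some(p) if !p.as_os_str().is_empty() => p,
        _ => Path::new("."),
    };
    let file_name = path
        .file_name()
        .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidInput, "path has no file name"))?;
    let mut tmp_name = file_name.to_os_string();
    tmp_name.push(format!(".tmp.{}", std::process::id())); // hypothetical naming
    let tmp_path = dir.join(tmp_name);

    // 1. Write and fsync the temp file, 2. rename it over the target.
    let result = File::create(&tmp_path)
        .and_then(|mut f| {
            f.write_all(contents)?;
            f.sync_all()
        })
        .and_then(|()| fs::rename(&tmp_path, path));
    if let Err(e) = result {
        // Best-effort cleanup; the original target is untouched.
        let _ = fs::remove_file(&tmp_path);
        return Err(e);
    }

    // 3. fsync the parent directory so the rename itself survives a crash.
    File::open(dir)?.sync_all()
}
```

A caller that also needs mutual exclusion between concurrent writers would take the lock before reading the old contents and hold it until this function returns.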

Validation

Hardware-validated end-to-end on an Intel TDX host (alongside an unrelated VMM, left undisturbed throughout):

dstackup install (real dstack-vmm built from this repo + the official meta-dstack v0.5.11 image) → KMS-in-CVM bootstraps → dstack run nginx → HTTP 200 at the default 1 GB with the os-image pin enforced → clean destroy to baseline. Fail-closed paths (missing pin, port collision) verified to bail with zero side effects, and CID/host-api coexistence verified against a second running VMM. cargo fmt, cargo clippy -- -D warnings, and the unit tests pass.

Out of scope / follow-ups (tracked in #699)

Remote transport (TLS+token for --expose, remote dstack logs, --token), the gateway tier, env-var encryption for dstack run, and OS packaging are deliberately not in this PR.

This went through two rounds of adversarial ("Linus-style") review; the findings are addressed across the commit history.

🤖 Generated with Claude Code

h4x3rotab and others added 8 commits June 17, 2026 00:18
Captures the problem (the ~22-step path to a first app), the
hardware-validated findings, and the locked design for hands-off
single-node onboarding: dstackup (host setup) + dstack (client),
single-node KMS bootstrap, a Rust auth webhook, and the crates/ layout.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds four crates under crates/ (no changes to existing crates):

- dstack-core: shared library — typed VMM prpc client, config rendering
  (vmm.toml / kms.toml / auth-allowlist.json), app-compose build,
  host/SGX detection, free-port + --port spec helpers.
- dstack: client CLI — run / ls / logs (info / upgrade / init are
  scaffolds); talks to a local VMM over its unix socket or a remote
  one over http(s).
- dstackup: host setup CLI — install / status / destroy. SGX preflight,
  renders configs, manages dstack-vmm + dstack-auth as systemd units,
  deploys and bootstraps a single-node KMS-in-CVM. Idempotent install,
  deterministic cgroup teardown on destroy.
- dstack-auth: Rust reimplementation of the single-operator KMS auth
  webhook (compose-hash allowlist, re-read per request, fails closed).

Validated end-to-end on a TDX host against the official meta-dstack
v0.5.11 release image: dstackup install -> KMS bootstrap -> dstack run
-> app serves HTTP 200 -> dstackup destroy, all at the default 1 GB.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Addresses code-review blockers:

- allowlist + state files are now written atomically (temp + rename), and the
  allowlist read-modify-write holds an exclusive flock — so a concurrent
  `dstack run` or a crash can no longer corrupt it. A torn allowlist matters:
  the webhook fails closed on invalid JSON, i.e. denies keys to every app on
  the host. New `dstack-core::fsutil` (write_atomic, lock_exclusive), tested.
- `dstackup destroy` now finds the KMS CVM by recorded id OR by name, so an
  install that died before persisting kms_vm_id can't orphan the CVM.
- `--expose` fails fast with guidance (use an SSH tunnel): it would otherwise
  bind the VM-control plane with neither TLS nor an auth token.

Minors: align hex normalization between `dstack run` and the webhook and store
the normalized hash; command stubs exit non-zero; dedupe the KMS image default
against `config::DEFAULT_KMS_IMAGE`; fix a stale doc comment and duplicate
step numbering.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
`dstackup install` now reads the guest image's digest.txt and renders it into
the webhook allowlist's `osImages`, so the auth webhook enforces which OS image
an app may boot — even though the KMS's own download-verify stays off for the
single-node flow. Previously both gates were open: an app could boot under a
different, unmeasured image and still receive keys. `bootAuth/kms` ignores
`osImages`, so the KMS bootstrap is unaffected.

Validated on a TDX host with the official meta-dstack v0.5.11 image: the pin
(digest.txt c2aa0186…) matches the KMS-reported image hash — nginx still gets
keys and serves HTTP 200 with the pin active — while a wrong image hash is now
denied ("os image not allowed"); 0x/case variants normalize correctly.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- `dstack run`: the "registered" line no longer implies the KMS will honor it
  regardless of path — it states keys are issued only if the file is the
  allowlist the auth webhook actually serves.
- `dstack logs`: clearer gate for a remote endpoint (remote support lands with
  the TLS+token transport) instead of a terse "unix only".
- `dstackup`: document that the auth webhook's 127.0.0.1 bind is deliberate (it
  decides key release; CVMs still reach it at 10.0.2.2 via user-mode networking).

Message/comment-only; no behavior change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ust readiness

- `dstackup install` auto-picks a CID window that avoids any VMM already
  running on the host: it reads other `dstack-vmm` processes' configs for their
  reserved `[cid_start, cid_start+pool)` and any live `guest-cid`, then offsets
  past them. `--cid-start` is now optional (auto by default) and refuses an
  explicit value that overlaps a reserved range.
- external tools (systemctl/docker/curl) run with a sanitized `PATH`, so a
  hijacked environment can't substitute a binary while we run as root.
- KMS readiness now requires curl to succeed AND a parsed, non-empty `ca_cert`
  field rather than a substring match (an error body can't read as "ready"),
  which also confirms our KMS is bound to the expected port.
- dstack-auth: a BootInfo wire-contract test pins the camelCase field names the
  webhook depends on, so a future KMS rename breaks a test, not production.
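For the sanitized-`PATH` point in the list above, one way to do it with `std::process::Command` is sketched below. The `SAFE_PATH` value and the `run_sanitized` helper are assumptions for illustration; the commit does not say which directories it pins or whether other environment variables are also cleared.

```rust
use std::io;
use std::process::{Command, Output};

/// Hypothetical fixed search path; the real value used by dstackup may differ.
pub const SAFE_PATH: &str = "/usr/sbin:/usr/bin:/sbin:/bin";

/// Run an external tool with PATH overridden, so the caller's environment
/// cannot redirect `systemctl`, `docker`, or `curl` to another binary.
pub fn run_sanitized(program: &str, args: &[&str]) -> io::Result<Output> {
    Command::new(program)
        .args(args)
        // Only PATH is replaced in this sketch; the rest of the environment
        // is inherited unchanged.
        .env("PATH", SAFE_PATH)
        .output()
}
```

A stricter variant could call `env_clear()` first and re-add only the variables the tool needs.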

Re-validated end-to-end on a TDX host: with an existing VMM reserving
[1000,2000), install with no --cid-start auto-picks 2000; KMS + app CVMs land
at 2001/2002 (no collision), nginx serves HTTP 200 with the os-image pin
active, clean destroy to baseline.
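The auto-pick can be pictured as interval arithmetic over half-open ranges. The sketch below is an assumption about the shape of the logic, not the code in dstackup: `CidRange`, `pick_cid_start`, and the `preferred` parameter are hypothetical names, and a live `guest-cid` is modelled as a one-element range `[cid, cid+1)`.

```rust
/// Half-open CID range [start, end).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct CidRange {
    pub start: u32,
    pub end: u32,
}

impl CidRange {
    fn overlaps(&self, other: &CidRange) -> bool {
        self.start < other.end && other.start < self.end
    }
}

/// Find the lowest start >= `preferred` such that [start, start + pool)
/// overlaps none of the reserved ranges. Returns None on u32 overflow.
pub fn pick_cid_start(reserved: &[CidRange], preferred: u32, pool: u32) -> Option<u32> {
    let mut start = preferred;
    loop {
        let candidate = CidRange {
            start,
            end: start.checked_add(pool)?,
        };
        // Jump past the furthest-reaching range the candidate collides with.
        // Each jump strictly increases `start`, so the loop terminates.
        let next = reserved
            .iter()
            .filter(|r| candidate.overlaps(r))
            .map(|r| r.end)
            .max();
        match next {
            Some(end) => start = end,
            None => return Some(start),
        }
    }
}
```

With one reserved range `[1000, 2000)`, a preferred start of 1000, and a pool of 1000, this returns 2000, which matches the validated case above. An explicit `--cid-start` check would reuse `overlaps` and refuse instead of offsetting.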

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…coexistence

From a second clean-context review of the hardened branch:

- B1 (fail-open): a missing/empty digest.txt silently disabled the OS-image pin
  (osImages=[] => webhook allow-any-image) while install only warned. Now the
  pin is resolved in preflight and `dstackup install` BAILS in KMS mode unless
  --allow-unpinned-image is passed. Fail-closed.
- B2 (half-install): coexistence was handled only for CIDs; the dashboard/auth
  TCP ports and the host-api vsock port had fixed defaults with no detection, so
  a second install half-installed then failed on bind. All collision checks
  (CIDs + ports) now run in a preflight BEFORE any side effect — TCP ports are
  bind-tested, the host-api port is checked against other dstack-vmm configs —
  and refuse with guidance.
- M1: `dstack run --allowlist <missing>` no longer misreports ENOENT as a
  permissions problem; distinct "run dstackup install first" message.
- M2: TCB-status enforcement intentionally NOT added (single-node operator
  trusts their own host; real TDX hosts often report non-UpToDate) — documented
  as a deliberate deviation from auth-simple.
- minors: device_ok matches auth-simple (empty list = any device); write_atomic
  fsyncs the parent dir (rename durability); lowercase two error messages;
  comments on the CID-block math, the compose-hash clone, and the /logs transport.

Validated on a TDX host: missing-pin and port-collision installs both bail in
preflight with zero side effects; --allow-unpinned-image opt-out works; happy
path (real image) -> KMS bootstrap -> nginx HTTP 200 -> clean destroy.
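The TCP bind test from B2 can be sketched as follows. The function names and the error text are hypothetical, and the host-api vsock check (done against other dstack-vmm configs) is not covered. The point is that every port is probed and all clashes are reported before anything is written or started.

```rust
use std::net::{Ipv4Addr, SocketAddrV4, TcpListener};

/// True if a listener can be bound on 127.0.0.1:`port` right now. The probe
/// listener is dropped immediately, so this is a point-in-time check.
pub fn tcp_port_is_free(port: u16) -> bool {
    TcpListener::bind(SocketAddrV4::new(Ipv4Addr::LOCALHOST, port)).is_ok()
}

/// Check every (name, port) pair up front and report all clashes at once,
/// so the caller can refuse before performing any side effect.
pub fn preflight_ports(ports: &[(&str, u16)]) -> Result<(), String> {
    let busy: Vec<String> = ports
        .iter()
        .filter(|(_, p)| !tcp_port_is_free(*p))
        .map(|(name, p)| format!("{name} port {p} is already in use"))
        .collect();
    if busy.is_empty() {
        Ok(())
    } else {
        Err(busy.join("; "))
    }
}
```

A bind probe cannot rule out another process taking the port between preflight and the real bind; it only turns the common collision into a clean refusal.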

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
CI runs `cargo clippy -- -D warnings -D clippy::expect_used -D clippy::unwrap_used`
(stricter than the `-D warnings` documented in CLAUDE.md). The three infallible
`to_string_pretty(&value).expect(...)` calls now pretty-print via the Value's
Display (`{:#}`), which is byte-identical output — so the compose hash and the
rendered configs are unchanged — and trips neither lint.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>