feat(cli): dstack + dstackup — hands-off single-node onboarding#731
Open
h4x3rotab wants to merge 8 commits into
Captures the problem (the ~22-step path to a first app), the hardware-validated findings, and the locked design for hands-off single-node onboarding: dstackup (host setup) + dstack (client), single-node KMS bootstrap, a Rust auth webhook, and the crates/ layout.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds four crates under crates/ (no changes to existing crates):
- dstack-core (shared library): typed VMM prpc client, config rendering (vmm.toml / kms.toml / auth-allowlist.json), app-compose build, host/SGX detection, free-port + --port spec helpers.
- dstack (client CLI): run / ls / logs (info / upgrade / init are scaffolds); talks to a local VMM over its unix socket or a remote one over http(s).
- dstackup (host setup CLI): install / status / destroy. SGX preflight, renders configs, manages dstack-vmm + dstack-auth as systemd units, deploys and bootstraps a single-node KMS-in-CVM. Idempotent install, deterministic cgroup teardown on destroy.
- dstack-auth: Rust reimplementation of the single-operator KMS auth webhook (compose-hash allowlist, re-read per request, fails closed).

Validated end-to-end on a TDX host against the official meta-dstack v0.5.11 release image: dstackup install -> KMS bootstrap -> dstack run -> app serves HTTP 200 -> dstackup destroy, all at the default 1 GB.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Addresses code-review blockers:
- allowlist + state files are now written atomically (temp + rename), and the allowlist read-modify-write holds an exclusive flock, so a concurrent `dstack run` or a crash can no longer corrupt it. A torn allowlist matters: the webhook fails closed on invalid JSON, i.e. denies keys to every app on the host. New `dstack-core::fsutil` (write_atomic, lock_exclusive), tested.
- `dstackup destroy` now finds the KMS CVM by recorded id OR by name, so an install that died before persisting kms_vm_id can't orphan the CVM.
- `--expose` fails fast with guidance (use an SSH tunnel): it would otherwise bind the VM-control plane with neither TLS nor an auth token.

Minors: align hex normalization between `dstack run` and the webhook and store the normalized hash; command stubs exit non-zero; dedupe the KMS image default against `config::DEFAULT_KMS_IMAGE`; fix a stale doc comment and duplicate step numbering.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
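For reviewers who want the shape of the temp + rename pattern at a glance, here is a minimal std-only sketch. It is not the PR's `fsutil::write_atomic`: the function body, the temp-file naming, and the error handling are assumptions for illustration, and the exclusive `flock` around the read-modify-write is left out. The parent-directory fsync assumes a Unix host.

```rust
use std::fs::{self, File};
use std::io::{self, Write};
use std::path::Path;

/// Write `data` to `path` so readers see either the old file or the new one,
/// never a torn mix. Illustrative sketch only (hypothetical implementation).
pub fn write_atomic(path: &Path, data: &[u8]) -> io::Result<()> {
    // A bare file name has an empty parent; treat that as the current directory.
    let dir = match path.parent() {
        Some(p) if !p.as_os_str().is_empty() => p,
        _ => Path::new("."),
    };
    let name = path
        .file_name()
        .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidInput, "path has no file name"))?;

    // The temp file lives in the same directory so the rename stays on one filesystem.
    let mut tmp_name = name.to_os_string();
    tmp_name.push(format!(".tmp.{}", std::process::id()));
    let tmp = dir.join(tmp_name);

    let mut f = File::create(&tmp)?;
    f.write_all(data)?;
    f.sync_all()?; // flush the contents before publishing the file
    drop(f);

    if let Err(e) = fs::rename(&tmp, path) {
        let _ = fs::remove_file(&tmp); // best-effort cleanup of the temp file
        return Err(e);
    }

    // fsync the parent directory so the rename itself survives a crash (Unix).
    File::open(dir)?.sync_all()?;
    Ok(())
}
```

Because the rename replaces the target in one step, a crash mid-write leaves the previous allowlist in place rather than invalid JSON, which is the failure the webhook would otherwise turn into a host-wide key denial.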
`dstackup install` now reads the guest image's digest.txt and renders it into
the webhook allowlist's `osImages`, so the auth webhook enforces which OS image
an app may boot — even though the KMS's own download-verify stays off for the
single-node flow. Previously both gates were open: an app could boot under a
different, unmeasured image and still receive keys. `bootAuth/kms` ignores
`osImages`, so the KMS bootstrap is unaffected.
Validated on a TDX host with the official meta-dstack v0.5.11 image: the pin
(digest.txt c2aa0186…) matches the KMS-reported image hash — nginx still gets
keys and serves HTTP 200 with the pin active — while a wrong image hash is now
denied ("os image not allowed"); 0x/case variants normalize correctly.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
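The pin check described above can be sketched as follows. This is an illustration under assumptions, not the webhook's code: the function names are hypothetical, the whitespace trim is an assumption, and the hash values in the usage note are placeholders rather than real digests. The "empty list means no pin" behavior follows the `osImages=[]` semantics described elsewhere in this PR.

```rust
/// Normalize a hex digest for comparison: trim whitespace (assumption),
/// drop an optional 0x/0X prefix, and lowercase.
pub fn normalize_hex(s: &str) -> String {
    let t = s.trim();
    let t = t
        .strip_prefix("0x")
        .or_else(|| t.strip_prefix("0X"))
        .unwrap_or(t);
    t.to_ascii_lowercase()
}

/// Decide whether a reported OS image hash passes the `osImages` pin.
/// An empty list means "no pin", so any image is allowed.
pub fn os_image_allowed(os_images: &[String], reported: &str) -> bool {
    if os_images.is_empty() {
        return true;
    }
    let want = normalize_hex(reported);
    os_images.iter().any(|h| normalize_hex(h) == want)
}
```

With a pin of `abcd12` (placeholder), a reported `0xABCD12` is accepted and `ffff00` is denied. Normalizing both sides at comparison time means a stored hash and a reported hash cannot disagree only because of prefix or case.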
- `dstack run`: the "registered" line no longer implies the KMS will honor it regardless of path; it states keys are issued only if the file is the allowlist the auth webhook actually serves.
- `dstack logs`: clearer gate for a remote endpoint (remote support lands with the TLS+token transport) instead of a terse "unix only".
- `dstackup`: document that the auth webhook's 127.0.0.1 bind is deliberate (it decides key release; CVMs still reach it at 10.0.2.2 via user-mode networking).

Message/comment-only; no behavior change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ust readiness

- `dstackup install` auto-picks a CID window that avoids any VMM already running on the host: it reads other `dstack-vmm` processes' configs for their reserved `[cid_start, cid_start+pool)` and any live `guest-cid`, then offsets past them. `--cid-start` is now optional (auto by default) and refuses an explicit value that overlaps a reserved range.
- external tools (systemctl/docker/curl) run with a sanitized `PATH`, so a hijacked environment can't substitute a binary while we run as root.
- KMS readiness now requires curl to succeed AND a parsed, non-empty `ca_cert` field rather than a substring match (an error body can't read as "ready"), which also confirms our KMS is bound to the expected port.
- dstack-auth: a BootInfo wire-contract test pins the camelCase field names the webhook depends on, so a future KMS rename breaks a test, not production.

Re-validated end-to-end on a TDX host: with an existing VMM reserving [1000,2000), install with no --cid-start auto-picks 2000; KMS + app CVMs land at 2001/2002 (no collision), nginx serves HTTP 200 with the os-image pin active, clean destroy to baseline.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
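The CID auto-pick can be pictured as an interval-avoidance pass. The sketch below is an assumption about how such a scan might look, not the code in this PR: `CidRange`, `pick_cid_start`, `overlaps_reserved`, and the default start of 1000 used in the usage note are all hypothetical.

```rust
/// A reserved CID window, half-open: [start, start + pool).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct CidRange {
    pub start: u32,
    pub pool: u32,
}

impl CidRange {
    fn end(&self) -> u32 {
        self.start.saturating_add(self.pool)
    }
    fn overlaps(&self, other: &CidRange) -> bool {
        self.start < other.end() && other.start < self.end()
    }
}

/// Pick a start for a window of `pool` CIDs that avoids every reserved range
/// and every live guest CID. Returns None if the window would overflow u32.
pub fn pick_cid_start(
    default_start: u32,
    pool: u32,
    reserved: &[CidRange],
    live: &[u32],
) -> Option<u32> {
    // Treat each live guest CID as a one-element reserved range.
    let mut taken: Vec<CidRange> = reserved.to_vec();
    taken.extend(live.iter().map(|&c| CidRange { start: c, pool: 1 }));
    taken.sort_by_key(|r| r.start);

    // Walk the ranges in ascending order, hopping past each one we collide with.
    let mut start = default_start;
    for r in &taken {
        if (CidRange { start, pool }).overlaps(r) {
            start = r.end();
        }
    }
    start.checked_add(pool).map(|_| start)
}

/// True if an explicit start value would overlap any reserved range.
pub fn overlaps_reserved(start: u32, pool: u32, reserved: &[CidRange]) -> bool {
    let want = CidRange { start, pool };
    reserved.iter().any(|r| want.overlaps(r))
}
```

With one existing reservation of `[1000, 2000)` and a hypothetical default start of 1000, `pick_cid_start` returns 2000, matching the re-validation result above, while an explicit start of 1500 would be reported as overlapping.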
…coexistence

From a second clean-context review of the hardened branch:
- B1 (fail-open): a missing/empty digest.txt silently disabled the OS-image pin (osImages=[] => webhook allow-any-image) while install only warned. Now the pin is resolved in preflight and `dstackup install` BAILS in KMS mode unless --allow-unpinned-image is passed. Fail-closed.
- B2 (half-install): coexistence was handled only for CIDs; the dashboard/auth TCP ports and the host-api vsock port had fixed defaults with no detection, so a second install half-installed then failed on bind. All collision checks (CIDs + ports) now run in a preflight BEFORE any side effect (TCP ports are bind-tested, the host-api port is checked against other dstack-vmm configs) and refuse with guidance.
- M1: `dstack run --allowlist <missing>` no longer misreports ENOENT as a permissions problem; distinct "run dstackup install first" message.
- M2: TCB-status enforcement intentionally NOT added (single-node operator trusts their own host; real TDX hosts often report non-UpToDate); documented as a deliberate deviation from auth-simple.
- minors: device_ok matches auth-simple (empty list = any device); write_atomic fsyncs the parent dir (rename durability); lowercase two error messages; comments on the CID-block math, the compose-hash clone, and the /logs transport.

Validated on a TDX host: missing-pin and port-collision installs both bail in preflight with zero side effects; --allow-unpinned-image opt-out works; happy path (real image) -> KMS bootstrap -> nginx HTTP 200 -> clean destroy.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
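A bind-test preflight for the TCP ports might look roughly like this. It is a sketch under assumptions: the function names and the error wording are hypothetical, and only localhost binds are probed. The point it illustrates is that every probe runs, and releases its socket, before any side effect happens.

```rust
use std::net::{Ipv4Addr, SocketAddrV4, TcpListener};

/// Return the ports from `ports` that cannot be bound on localhost right now.
/// Each probe listener is dropped immediately, so nothing stays bound.
pub fn busy_ports(ports: &[u16]) -> Vec<u16> {
    ports
        .iter()
        .copied()
        .filter(|&p| TcpListener::bind(SocketAddrV4::new(Ipv4Addr::LOCALHOST, p)).is_err())
        .collect()
}

/// Refuse up front, with guidance, if any required port is unavailable.
pub fn preflight_ports(ports: &[u16]) -> Result<(), String> {
    let busy = busy_ports(ports);
    if busy.is_empty() {
        Ok(())
    } else {
        Err(format!(
            "ports unavailable: {:?}; choose different ports or stop the service holding them",
            busy
        ))
    }
}
```

A bind test is inherently racy (another process can take the port between the probe and the real bind), so it works as an early, clean refusal rather than a guarantee; the later bind still has to handle failure.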
CI runs `cargo clippy -- -D warnings -D clippy::expect_used -D clippy::unwrap_used`
(stricter than the `-D warnings` documented in CLAUDE.md). The three infallible
`to_string_pretty(&value).expect(...)` calls now pretty-print via the Value's
Display (`{:#}`), which is byte-identical output — so the compose hash and the
rendered configs are unchanged — and trips neither lint.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
What
Adds a hands-off single-node onboarding path for dstack, cutting the route to a first app from ~20 manual steps to two commands:
Four new crates under `crates/` (no existing crate is modified):
- `dstack-core` (shared lib): typed VMM `prpc` client, config rendering (`vmm.toml` / `kms.toml` / `auth-allowlist.json`), app-compose build, host/SGX detection + CID/port coexistence scan, atomic-write + `flock` helpers.
- `dstack` (client CLI): `run` / `ls` / `logs` (`info` / `upgrade` / `init` are scaffolds). Talks to a local VMM over its unix socket, or a remote one over http(s).
- `dstackup` (host-setup CLI): `install` / `status` / `destroy`. Privileged; manages `dstack-vmm` + `dstack-auth` as systemd units; deploys and bootstraps a single-node KMS-in-CVM. Idempotent install, deterministic cgroup teardown.
- `dstack-auth`: Rust reimplementation of the single-operator KMS auth webhook (compose-hash + os-image allowlist; fails closed).

Design doc: `docs/onboarding-redesign.md`. Part of #699.

Architecture
Two binaries, mirroring `kubeadm`/`kubectl`: `dstackup` (host setup, local + privileged) and `dstack` (client, local-or-remote). The dashboard binds localhost by default and is reached via an SSH tunnel, which gives the browser a secure context for the env-encryption crypto without needing a cert.

Security model, and the deliberate single-node trade-offs
The single-node flow makes a few scoped relaxations. Each is confined to the single-node KMS-in-CVM config and does not touch the per-app key path or multi-node replication (verified against `kms/`):
- `enforce_self_authorization=false`: removes only the KMS's self-attestation gate at bootstrap; per-app quote verification + the compose-hash allowlist stay fully enforced.
- `verify_os_image=false` (KMS download-verify) is compensated by pinning the app's measured OS image (`digest.txt`) into the webhook's `osImages`, and it's fail-closed: a missing pin aborts `install` unless `--allow-unpinned-image` is passed.
- `--expose` is intentionally disabled: the VMM management RPCs are unauthenticated and the TLS+token transport isn't built yet, so it fails with SSH-tunnel guidance rather than opening an unauthenticated control plane.
- TCB-status enforcement is intentionally not added, a deliberate deviation from `auth-simple` (the single-node operator controls/trusts their own host, and real TDX hosts routinely report a non-`UpToDate` TCB); documented in `dstack-auth`.

The auth webhook is fail-closed by construction; the allowlist read-modify-write is atomic (temp + rename + dir fsync) under an `flock`; and `install` runs all CID/port collision checks in a preflight before any side effect, so a clash refuses cleanly instead of half-installing.

Validation
Hardware-validated end-to-end on an Intel TDX host (alongside an unrelated VMM, left undisturbed throughout): `dstackup install` (real `dstack-vmm` built from this repo + the official meta-dstack v0.5.11 image) → KMS-in-CVM bootstraps → `dstack run nginx` → HTTP 200 at the default 1 GB with the os-image pin enforced → clean `destroy` to baseline. Fail-closed paths (missing pin, port collision) verified to bail with zero side effects, and CID/host-api coexistence verified against a second running VMM. `cargo fmt`, `cargo clippy -- -D warnings`, and the unit tests pass.

Out of scope / follow-ups (tracked in #699)
Remote transport (TLS+token for `--expose`, remote `dstack logs`, `--token`), the gateway tier, env-var encryption for `dstack run`, and OS packaging are deliberately not in this PR.

This went through two rounds of adversarial ("Linus-style") review; the findings are addressed across the commit history.
🤖 Generated with Claude Code