feat(droid-control): add desktop-control atom backed by cua-driver#27

Open
factory-ain3sh wants to merge 1 commit into master from ainesh/cli-900-desktop-control
@factory-ain3sh

Contributor

Description

droid-control routes terminals (tuistory/true-input) and web/Electron (agent-browser), but native desktop GUI apps had no atom -- the gap was papered over by a standalone, hand-maintained mac-control skill that was macOS-only and already stale against upstream. This PR adds a desktop-control driver atom backed by upstream trycua/cua cua-driver (cross-platform Rust rewrite, v0.5.x): one routing row, an atom owning the droid-control integration (RUN_ID -> cua-session isolation, delegation, evidence handoff), and per-platform mechanics files. The atom is deliberately a thin overlay: the deep tool reference defers to upstream's versioned skill pack (cua-driver skills install), which updates with the binary -- the anti-staleness property mac-control lacked.

Deviation from the ticket's written scope: Linux ships as a supported third platform (user call mid-implementation) at upstream's pre-release tier, with empirically verified caveats instead of the originally planned hard exclusion.

Out of scope: filing the Linux input-loss/AT-SPI findings upstream; teaching the command docs (/qa-test etc.) to enumerate native desktop targets.

Related Issue

Closes CLI-900

Reviewer Guide

Read order: skills/desktop-control/SKILL.md > platforms/linux.md > skills/droid-control/SKILL.md (routing hunks) > platforms/macos.md > platforms/windows.md > capture/ARCHITECTURE/README touches.
Review depth: Standard -- docs-only diff, but the platform files encode operational claims; linux.md's are the ones to scrutinize (each is backed by a smoke observation).
Open for pushback: linux.md recommends pixel-first + a probe-one-keystroke pattern on the pre-release tier rather than refusing the Act stage outright (skills/desktop-control/platforms/linux.md, toolkit-boundary section).
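
For reviewers weighing the probe-one-keystroke recommendation, here is a minimal sketch of the idea, not the text of linux.md. `send_key` and `read_value` are hypothetical stand-ins that simulate a target in-process; they are not cua-driver commands, and a real probe would go through whatever tool calls the upstream skill pack documents.

```shell
#!/usr/bin/env bash
# Sketch only: send_key and read_value are hypothetical stand-ins that
# simulate a text field, so the pattern can run without any desktop driver.

TARGET_BUFFER=""          # simulated contents of the target field
TARGET_ACCEPTS_INPUT=1    # set to 0 to simulate a toolkit that drops events

send_key() {
  # Hypothetical: deliver one character; a dropping toolkit is a silent no-op.
  if [ "$TARGET_ACCEPTS_INPUT" -eq 1 ]; then
    TARGET_BUFFER="${TARGET_BUFFER}$1"
  fi
}

read_value() {
  # Hypothetical: read the field back (a tree value or screenshot in practice).
  printf '%s' "$TARGET_BUFFER"
}

probe_input() {
  # Send a single character and check whether it actually landed.
  local before after
  before="$(read_value)"
  send_key "x"
  after="$(read_value)"
  if [ "$after" = "${before}x" ]; then
    echo "input-ok"
  else
    echo "input-dropped"
  fi
}
```

The point of the probe is to make a silent no-op visible: an `input-dropped` result would be the cue to fall back rather than continue with the Act stage. A real probe would also need to remove the probe character afterwards.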

Risk & Impact

Markdown-only; no code paths change. (1) Routing drift: Linux droids may route native-GUI work to desktop-control and hit pre-release limits -- contained by the caveat in the routing note plus explicit per-symptom fallbacks to agent-browser/true-input in linux.md. (2) Upstream drift: the atom pins only stable CLI verbs (verified against upstream cli.rs) and defers the volatile tool surface to the auto-updating upstream pack. Retiring the superseded mac-control skill happened in a personal skills repo, not this diff.

Verification

Behavior verified. Windows (production tier): end-to-end smoke on a Win11 KVM VM -- install via upstream install.ps1, daemon via autostart kick, then start_session -> launch_app notepad -> get_window_state (19-element UIA tree) -> type_text -> re-snapshot; all 46 chars were written losslessly via UIA ValuePattern, confirmed by before/after PNGs and the tree value (verified @ 99720f4). Linux (pre-release tier): same loop locally on CachyOS Plasma Wayland -- lifecycle, doctor, sessions, window discovery, and per-window screenshots all green; the smoke also disproved the initially drafted clipboard mitigation and surfaced the real constraint (Qt/GTK4 silently drop XSendEvent input; alacritty accepts it but drops trailing chars), so linux.md was rewritten claim-by-claim to match observed behavior.
Regression coverage. The repo's CI gate (check-skills.yml: every SKILL.md symlinked or .skillsrc-listed) mirrored locally: PASS. No other test harness exists for skill prose.
Not tested. platforms/macos.md -- no Mac in the loop this session; content derives from upstream MACOS.md semantics plus patterns carried over from the battle-tested mac-control skill. Follow-up: run the recipe below on real macOS.
Standard validators. Markdown-only repo: format/lint/typecheck/tests genuinely unconfigured (no signal); check-skills clean.

Repro Recipe

# any platform (Windows: irm .../scripts/install.ps1 | iex); macOS: run `cua-driver permissions grant` first
curl -fsSL https://raw.githubusercontent.com/trycua/cua/main/libs/cua-driver/scripts/install.sh | bash
cua-driver doctor
cua-driver serve &
cua-driver launch_app '{"name":"TextEdit"}'   # any editor; note returned pid + window_id
cua-driver get_window_state '{"pid":<pid>,"window_id":<wid>}' --screenshot-out-file /tmp/before.png
cua-driver type_text '{"pid":<pid>,"window_id":<wid>,"element_index":0,"text":"hello"}'
cua-driver get_window_state '{"pid":<pid>,"window_id":<wid>}' --screenshot-out-file /tmp/after.png
# Expect: after.png and the tree's value field show "hello"

Implementation Notes

Source: .agents/specs/2026-06-10-droid-control-desktop-control-atom-backed-by-cua-driver.notes.md

Deviations from spec: Linux promoted from excluded to supported mid-implementation (user decision); the ticket AC "does NOT route Linux there" is intentionally superseded. platforms/linux.md added; routing row lost its macOS/Windows qualifier.
Tradeoffs: thin-overlay atom over upstream's skill pack instead of a self-contained tool reference -- duplicating the fast-moving surface recreates exactly the staleness that killed mac-control.
Discovered constraints: upstream Linux input is XSendEvent; Qt/GTK4 drop send_event-flagged events entirely (silent no-op), and paste hotkeys/middle-click don't land even where typing does -- linux.md documents the probe + verify-and-repair patterns instead of a clipboard workaround. Upside: XID-targeted events cannot leak into the user's focused app. Windows installer corrections (binary in %LOCALAPPDATA%\Programs\Cua, install.ps1 path) came out of the VM smoke.
Follow-ups not in this PR: file the Linux findings upstream (optional per ticket); macOS smoke on real hardware; command docs could enumerate native desktop targets.
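
To make the verify-and-repair idea concrete, here is a small sketch under stated assumptions. `type_text_lossy` and `read_value` are hypothetical stand-ins that simulate a target dropping the trailing character of longer sends. They are not cua-driver commands, and the loss model is invented for illustration rather than measured.

```shell
#!/usr/bin/env bash
# Sketch only: type_text_lossy and read_value are hypothetical stand-ins.
# The loss model (drop the last char of any send longer than 3 chars) is
# an illustrative assumption, not observed driver behavior.

FIELD=""   # simulated contents of the target field

type_text_lossy() {
  # Simulated typing that loses the trailing character of longer inputs.
  local text
  text="$1"
  if [ "${#text}" -gt 3 ]; then
    text="${text%?}"
  fi
  FIELD="${FIELD}${text}"
}

read_value() {
  # Hypothetical read-back of the field.
  printf '%s' "$FIELD"
}

type_verified() {
  # Type $1 into an empty field, re-reading after each send and retyping
  # only the missing suffix. Gives up after $2 attempts (default 5).
  local want max_attempts attempt have missing
  want="$1"
  max_attempts="${2:-5}"
  attempt=0
  while [ "$attempt" -lt "$max_attempts" ]; do
    have="$(read_value)"
    [ "$have" = "$want" ] && return 0
    case "$want" in
      "$have"*) missing="${want#"$have"}" ;;  # field is a prefix: repairable
      *) return 1 ;;                          # field diverged: cannot append-repair
    esac
    type_text_lossy "$missing"
    attempt=$((attempt + 1))
  done
  [ "$(read_value)" = "$want" ]
}
```

The loop never assumes a send succeeded. It always reads back first, and it stops rather than appending when the field no longer matches a prefix of the intended text.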

… cua-driver (CLI-900)

droid-control routed terminals and web/Electron but had no atom for native
desktop GUI apps; that gap was papered over by a stale, macOS-only standalone
mac-control skill. The new atom is a thin overlay over upstream's maintained
cua-driver skill pack: it owns droid-control integration (routing, RUN_ID ->
cua-session isolation, delegation, evidence handoff) and defers the fast-moving
tool surface upstream, so it cannot go stale the way mac-control did.

Platform files carry verified mechanics: macOS TCC/LaunchServices flow plus
patterns absorbed from mac-control (now retired); Windows UIA + Session 0 +
PS 5.1 quoting, smoke-tested end-to-end on a Win11 KVM VM (lossless UIA
type_text with before/after evidence); Linux included at upstream's
pre-release tier with empirically verified caveats (Qt/GTK4 silently drop
XSendEvent input, lossy typing -> verify-and-repair loop).

Co-authored-by: factory-droid[bot] <138933559+factory-droid[bot]@users.noreply.github.com>
Labels

enhancement New feature or request
