diff --git a/docs/adr/0038-build-verification-loop.md b/docs/adr/0038-build-verification-loop.md
new file mode 100644
index 000000000..4955110ec
--- /dev/null
+++ b/docs/adr/0038-build-verification-loop.md
@@ -0,0 +1,155 @@
+# ADR-0038: Build Verification Loop — the agent builds, verifies, and corrects itself
+
+**Status**: Proposed (2026-06-11)
+**Deciders**: ObjectStack Protocol Architects
+**Builds on**: [ADR-0033](./0033-ai-assisted-metadata-authoring.md) (drafts as the staging layer — this ADR **replaces its human-approval assumption for AI builds** with a machine gate; HITL stays for destructive actions), [ADR-0021](./0021-analytics-dataset-semantic-layer.md) (datasets — what most verification probes exercise), ADR-0037 / [framework#1694](https://github.com/objectstack-ai/framework/pull/1694) (Live Canvas — the *human-visibility* complement to this ADR's *machine-verification*)
+**Consumers**: `@objectstack/service-analytics` + `@objectstack/objectql` (verification probes), `../cloud/service-ai-studio` (graph lint, `verify_build` tool, self-correction protocol, eval intelligence), `../objectui` (build health card in chat), `ai_eval_cases`/`ai_eval_runs` (existing, currently-unused storage)
+
+**Premise**: pre-launch, no back-compat debt — specify the target end-state directly.
+
+**Design center**: **never make correctness depend on a human looking.** Humans are the laziest component in the system — they will not review, and the magic moment auto-publishes before they could. The agent that builds an app must be the same loop that verifies it and corrects it; a human is *informed* of the outcome, never *required* for it. The only reviewer that scales with AI-speed authorship is the machine.
+
+---
+
+## TL;DR
+
+**The problem, measured.** In one day of live verification (2026-06-10/11), six agent-authored defects shipped to a staging tenant — every one of them **passed schema validation** (`_diagnostics: valid`) and every one was found by a *human manually browsing*:
+
+| # | Defect | Why validation missed it |
+|---|---|---|
+| 1 | Dashboard widgets bound a `dataset` that didn't exist | reference *between* artifacts — single-artifact Zod can't see it |
+| 2 | Widget `values:["amount"]` matched no measure name in the dataset | cross-artifact name agreement |
+| 3 | Seed staged but rows never materialized on publish | runtime effect, invisible to schema |
+| 4 | Dataset queries returned 0/empty on populated objects (4 stacked infra bugs) | only a *real query* reveals it |
+| 5 | "Published!" while sample data silently failed to load | result-reporting gap |
+| 6 | View metadata `type:'list'` rendered as a red "Unknown component type" box | renderability is a *renderer* contract, not a schema |
+
+Schema-valid ≠ renders ≠ returns data ≠ matches intent. Each is a separate verification plane, and today only the first exists.
+
+**Decision.** Ship a five-layer **Build Verification Loop (BVL)** that runs *inside* the build turn and *after* publish, feeding every failure back to the agent for bounded self-correction:
+
+- **L1 Graph lint** (draft-time, deterministic): cross-artifact reference resolution over the staged set.
+- **L2 Renderability check** (pre-publish, deterministic): every artifact's renderer translation produces a *registered, typed* schema; every dataset compiles.
+- **L3 Runtime probes** (post-publish): row counts per seeded object; one real query per widget; generalizes the `seedApplied` pattern.
+- **L4 Self-correction protocol**: L1–L3 results are returned **to the agent** (`issues[]` in tool envelopes + a `verify_build` tool); the agent fixes and re-verifies, bounded retries; the chat shows a build-health card, not a plea for review.
+- **L5 Agent CI**: golden-prompt eval suite run headlessly per deploy/nightly against an ephemeral environment, persisted in the existing `ai_eval_cases`/`ai_eval_runs` objects.
+
+**Gate semantics change**: for AI whole-app builds, **the verification loop replaces the human approval gate**. A build auto-publishes only when L1+L2 pass; L3 failures trigger self-correction; an unrecoverable build *stays draft* and says so honestly. HITL approval remains **only** for destructive/irreversible actions (ADR-0033 pending-actions) — that is safety, not quality review.
+
+**Open-core boundary**: verification *mechanisms* (graph resolution, render contracts, query probes, eval storage) are open framework; *what to verify and how to judge intent* (lint rule packs, golden prompts, the LLM intent-review) is cloud intelligence.
+
+---
+
+## Context
+
+### Why the existing defenses don't compose into a loop
+
+| Defense (exists today) | Catches | Systemic gap |
+|---|---|---|
+| Per-type Zod at `stageDraft` (ADR-0033) | malformed single artifacts | all six defects were single-artifact-valid |
+| Draft gate + human Publish | nothing in practice | magic moment auto-publishes; humans don't review |
+| propose → confirm → apply | wrong *plan* | confirms intent, not product quality |
+| Deterministic normalization (dataset auto-create, widget-ref derivation, viewType→grid; shipped 2026-06-10/11) | makes *known* mistakes impossible | reactive: one fix per discovered failure mode |
+| `seedApplied` reporting | seed materialization failures | one probe, one artifact type — the pattern, not the system |
+| `ai_eval_cases` / `ai_eval_runs` objects | — | empty skeleton, wired to nothing |
+
+The strongest defense — deterministic normalization — should remain the first resort ("make the mistake impossible"). The BVL is for the unbounded remainder: an agent is a generator of *novel* mistakes, so the system needs a *general* verifier, not an ever-growing list of special cases.
+
+### The correction loop already works — it's just manual
+
+The live incident that motivates L4: a human told the agent *"the Spending Dashboard shows Dataset 'expense' not found — fix it"*, and the agent diagnosed, created the missing dataset with exactly-referenced measure names, and offered Publish — correctly, in one turn. The agent's repair capability is not the gap. **The gap is that a human had to be the error transport.** The BVL is, at its core, replacing that human with the build pipeline.
+
+---
+
+## Decision
+
+### The verification contract
+
+Every layer emits the same shape, so the agent, the chat UI, and the eval harness consume one stream:
+
+```ts
+interface BuildIssue {
+  layer: 'graph' | 'render' | 'runtime' | 'intent';
+  severity: 'error' | 'warning';
+  artifact: { type: string; name: string };               // what is broken
+  ref?: { type: string; name: string; member?: string };  // what it points at
+  code: string;     // e.g. 'dangling_dataset', 'unknown_measure', 'typeless_schema',
+                    //      'empty_query', 'seed_not_applied', 'intent_mismatch'
+  message: string;  // agent-actionable, names the exact artifact + member
+  fix?: string;     // machine hint, e.g. 'create dataset "expense" with measure "amount"'
+}
+```
+
+`issues[]` is carried (a) in every authoring tool's result envelope, (b) in the `verify_build` tool result, (c) on the build-health card in chat, (d) in `ai_eval_runs` rows.
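
To illustrate how one consumer might read this stream, the sketch below builds a single issue modeled on incident #1 and derives two things from an `issues[]` array: whether anything blocks, and a one-line-per-issue summary for a tool-result envelope. The helper names `hasBlockingIssues` and `formatForAgent`, the dashboard name, and the rule that only `severity: 'error'` blocks are assumptions made for this example, not part of any ObjectStack package.

```typescript
// Shape copied from the verification contract.
interface BuildIssue {
  layer: 'graph' | 'render' | 'runtime' | 'intent';
  severity: 'error' | 'warning';
  artifact: { type: string; name: string };
  ref?: { type: string; name: string; member?: string };
  code: string;
  message: string;
  fix?: string;
}

// Hypothetical helper. Assumption: errors block, warnings do not.
function hasBlockingIssues(issues: BuildIssue[]): boolean {
  return issues.some((issue) => issue.severity === 'error');
}

// Hypothetical helper: one line per issue, with the machine hint when present.
function formatForAgent(issues: BuildIssue[]): string {
  return issues
    .map((issue) => {
      const hint = issue.fix ? ` (fix: ${issue.fix})` : '';
      return `[${issue.layer}/${issue.severity}] ${issue.code}: ${issue.message}${hint}`;
    })
    .join('\n');
}

// Sample issue modeled on incident #1 (the artifact name is illustrative).
const sampleIssues: BuildIssue[] = [
  {
    layer: 'graph',
    severity: 'error',
    artifact: { type: 'dashboard', name: 'spending_dashboard' },
    ref: { type: 'dataset', name: 'expense' },
    code: 'dangling_dataset',
    message: 'Dashboard "spending_dashboard" binds dataset "expense", which does not exist.',
    fix: 'create dataset "expense" with measure "amount"',
  },
];

const agentSummary = formatForAgent(sampleIssues);
```

Because every layer shares the shape, the same two helpers would serve the chat card and the eval harness without per-layer branching.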
+
+### L1 — Graph lint at draft time (cloud `service-ai-studio`, deterministic)
+
+After `apply_blueprint` / `create_metadata` stage their drafts and **before** the envelope returns, resolve every cross-artifact reference over the *draft-overlaid* registry (`previewDrafts` reads):
+
+- `widget.dataset` exists; every `values[]` name ∈ dataset measures; every `dimensions[]` name ∈ dataset dimensions;
+- `view.objectName`/`object` exists; `fields[]` ⊆ object fields; kanban `groupField` exists and is a select;
+- `app` navigation targets (object/dashboard/view) all exist;
+- `seed.object` exists; record keys ⊆ object fields; lookup/external-id references resolvable within the staged set;
+- `dataset.object` exists; measure/dimension `field`s exist on it.
+
+Violations return as `issues[]` **in the same tool result**, so the agent sees them in the turn that caused them and fixes before the user ever could. Incidents #1 and #2 die here.
+
+### L2 — Renderability check, pre-publish (framework + objectui contract)
+
+"Will this artifact mount, or red-box?" is decidable without a browser because both halves are deterministic:
+
+- the **view → component schema** translation and the **component registry** (the `'list'`→typeless-schema bug was exactly this contract breaking) — export the translation as a pure function and ship the registry's type list as data, so the check runs server-side;
+- **dataset compilation** (`compileDataset`) — compile every drafted dataset; compile errors are issues;
+- dashboards: every widget translates to a registered type with a satisfiable query shape.
+
+Incident #6 dies here. (The renderer keeps its own ErrorBoundary fallbacks — defense in depth, not the primary net.)
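
The two deterministic layers above can be sketched together. The example below is a minimal illustration under stated assumptions: the `Dataset`, `Widget`, and `View` shapes are simplified stand-ins for real metadata, `translateView` and `REGISTERED_TYPES` are hypothetical placeholders for the exported translation function and the registry type list, and the `unregistered_type` code is invented for the example. Only the widget-to-dataset rule from L1 and the view check from L2 are shown; the dimension, seed, and navigation rules would follow the same pattern.

```typescript
interface BuildIssue {
  layer: 'graph' | 'render' | 'runtime' | 'intent';
  severity: 'error' | 'warning';
  artifact: { type: string; name: string };
  ref?: { type: string; name: string; member?: string };
  code: string;
  message: string;
  fix?: string;
}

// Hypothetical, simplified artifact shapes.
interface Dataset { name: string; measures: string[]; dimensions: string[] }
interface Widget { name: string; dataset: string; values: string[] }
interface View { name: string; type: string }

function findDataset(datasets: Dataset[], name: string): Dataset | undefined {
  for (const dataset of datasets) {
    if (dataset.name === name) return dataset;
  }
  return undefined;
}

// L1: resolve a widget's references against the staged (draft-overlaid) datasets.
function lintWidget(widget: Widget, datasets: Dataset[]): BuildIssue[] {
  const artifact = { type: 'widget', name: widget.name };
  const dataset = findDataset(datasets, widget.dataset);
  if (dataset === undefined) {
    return [{
      layer: 'graph', severity: 'error', artifact,
      ref: { type: 'dataset', name: widget.dataset },
      code: 'dangling_dataset',
      message: `Widget "${widget.name}" references missing dataset "${widget.dataset}".`,
    }];
  }
  const issues: BuildIssue[] = [];
  for (const value of widget.values) {
    if (dataset.measures.indexOf(value) === -1) {
      issues.push({
        layer: 'graph', severity: 'error', artifact,
        ref: { type: 'dataset', name: dataset.name, member: value },
        code: 'unknown_measure',
        message: `Widget "${widget.name}" uses "${value}", which is not a measure of dataset "${dataset.name}".`,
      });
    }
  }
  return issues;
}

// L2: the registry type list shipped as plain data (illustrative values).
const REGISTERED_TYPES: string[] = ['grid', 'kanban', 'chart'];

// Stand-in for the pure view -> component schema translation.
function translateView(view: View): { type: string | undefined } {
  switch (view.type) {
    case 'grid': return { type: 'grid' };
    case 'kanban': return { type: 'kanban' };
    default: return { type: undefined }; // typeless schema, as in incident #6
  }
}

function checkRenderable(view: View): BuildIssue[] {
  const schema = translateView(view);
  if (schema.type !== undefined && REGISTERED_TYPES.indexOf(schema.type) !== -1) return [];
  return [{
    layer: 'render', severity: 'error',
    artifact: { type: 'view', name: view.name },
    code: schema.type === undefined ? 'typeless_schema' : 'unregistered_type',
    message: `View "${view.name}" (type "${view.type}") does not translate to a registered component type.`,
  }];
}
```

With a staged dataset `expense` whose only measure is `total_amount`, a widget asking for `amount` yields one `unknown_measure` issue, and a view of type `list` yields one `typeless_schema` issue, which correspond to incidents #2 and #6.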
+
+### L3 — Runtime probes, post-publish (framework mechanisms, generalizing `seedApplied`)
+
+Immediately after an auto-publish, the build pipeline (not the user) exercises the published app:
+
+- **per seeded object**: row count > 0 (else issue `seed_not_applied`, carrying the existing `seedApplied` error detail);
+- **per dashboard widget**: execute its real dataset selection once (the same `/analytics/dataset/query` path users hit); empty-on-populated-object or error → issue `empty_query` with the compiled SQL/strategy detail;
+- **per view**: a `limit 1` list read through the same governed path.
+
+Incidents #3, #4, #5 die here — they were all invisible until something *actually queried*. Probe results attach to the publish response (like `seedApplied` today) and flow into the same `issues[]` stream.
+
+### L4 — Self-correction protocol (cloud, the loop itself)
+
+- A `verify_build` tool (cloud service-ai-studio) runs L1+L2 on demand and L3 when the build is published; the system prompt instructs the agent: **after building, verify; if issues, fix and verify again** — at most N rounds (default 2), then stop and report honestly.
+- The **auto-publish gate becomes machine-conditional**: publish fires only when L1+L2 are clean. A failing build stays draft, the agent attempts repair; if still failing, the chat shows the health card with remaining issues and a Review affordance. *No silent broken publishes, no waiting on a human either.*
+- An **intent review** (cloud intelligence, cheap model): after mechanical verification passes, one LLM pass judges the built app against the user's original goal — nav coherent, labels sensible, fields match the domain, dashboards answer the asked questions. Output: `intent` issues (warnings, non-blocking by default). This is the "AI reviews itself" half that no deterministic check covers.
+- The chat renders one **build-health card**: ✓ structure, ✓ renders, ✓ 12 rows, ✓ 3/3 widgets return data, ⚠ warnings — replacing both the silent success and the human-review plea.
+
+### L5 — Agent CI: the golden-prompt eval suite (cloud, uses the existing skeleton)
+
+- `ai_eval_cases`: golden prompts ("build an expense tracker with a spending dashboard", "build a recruiting app", …) each with machine-checkable assertions: app exists; objects/views/dashboards present; row counts > 0; every widget query returns rows; zero render-error boxes (headless DOM sweep — the same browser automation + `mint-session.mjs` used to find this week's bugs).
+- Runs headlessly against an **ephemeral environment** per cloud deploy (and nightly), recording to `ai_eval_runs`. A red run blocks nothing initially (report-only), then graduates to a deploy gate once stable.
+- Every production incident becomes a new eval case — the suite is the immune system's memory: none of this week's six defects can recur silently once encoded.
+
+### Sequencing
+
+| Phase | Scope | Effort |
+|---|---|---|
+| 1 | L1 graph lint + envelope `issues[]` + L4 wiring (verify-fix-reverify, gate-on-clean) | ~1 week, cloud |
+| 2 | L3 runtime probes + build-health card in chat | ~3–4 days, framework + objectui |
+| 3 | L2 renderability contract | ~3–5 days, framework + objectui |
+| 4 | L5 eval suite on `ai_eval_*` + intent review | ~1 week, cloud |
+
+Each phase is independently shippable; Phase 1 alone converts this week's discovery latency from *human-hours* to *same-turn*.
+
+## Non-goals
+
+- **Not** removing HITL for destructive/irreversible actions — ADR-0033's pending-action approvals remain; that is safety, not quality review.
+- **Not** a general-purpose test runner for user code — the BVL verifies *agent-authored metadata and its runtime behavior*, nothing else.
+- **Not** a replacement for deterministic normalization — "make the mistake impossible" stays the first resort; the BVL catches what normalization hasn't met yet.
+
+## Risks
+
+| Risk | Mitigation |
+|---|---|
+| Verification latency inflates the magic moment | L1/L2 are in-memory and sub-second; L3 runs post-publish in parallel with the user exploring; probes are `limit 1`/single-aggregate |
+| Self-correction loops forever | bounded rounds (2), then honest surface |
+| Probes mutate state | all probes are reads; seeds are upserts keyed on externalId (idempotent) |
+| Eval env drift vs prod | eval runs on the same image staging runs; ephemeral env per run |
+| LLM intent-review cost | cheap model, single pass, warnings-only |
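
The "bounded rounds (2), then honest surface" mitigation, combined with the machine-conditional publish gate from L4, can be sketched as a small control loop. This is an illustration only: the `BuildHooks` interface and `buildWithVerification` are hypothetical names, the hooks are synchronous for brevity where real tool calls would be asynchronous, and two readings are assumed: a "round" means one repair attempt followed by re-verification, and only L1/L2 errors hold the gate. L3 probes, which run after publish, are out of scope here.

```typescript
interface BuildIssue {
  layer: 'graph' | 'render' | 'runtime' | 'intent';
  severity: 'error' | 'warning';
  artifact: { type: string; name: string };
  code: string;
  message: string;
}

// Hypothetical hooks standing in for verify_build (L1+L2), the agent's fix, and publish.
interface BuildHooks {
  verifyPrePublish: () => BuildIssue[];
  repair: (issues: BuildIssue[]) => void;
  publish: () => void;
}

type BuildOutcome =
  | { status: 'published'; rounds: number; warnings: BuildIssue[] }
  | { status: 'draft'; rounds: number; remaining: BuildIssue[] };

// Assumption: only graph/render errors keep the gate closed.
function gateBlockers(issues: BuildIssue[]): BuildIssue[] {
  return issues.filter(
    (issue) =>
      issue.severity === 'error' &&
      (issue.layer === 'graph' || issue.layer === 'render'),
  );
}

function buildWithVerification(hooks: BuildHooks, maxRounds: number = 2): BuildOutcome {
  let issues = hooks.verifyPrePublish();
  let rounds = 0;
  // Repair and re-verify until the gate is clean or the round budget is spent.
  while (gateBlockers(issues).length > 0 && rounds < maxRounds) {
    hooks.repair(gateBlockers(issues));
    rounds += 1;
    issues = hooks.verifyPrePublish();
  }
  const remaining = gateBlockers(issues);
  if (remaining.length > 0) {
    // Unrecoverable within budget: stay draft and surface what is left.
    return { status: 'draft', rounds, remaining };
  }
  hooks.publish();
  return { status: 'published', rounds, warnings: issues };
}
```

The loop cannot run more than `maxRounds` repairs, and `publish` is unreachable while any gate blocker remains, so a failing build ends as a draft with its remaining issues rather than as a silent broken publish.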