Skip to content

refactor(profile): use agent-interface AgentProfile#264

Merged
drewstone merged 7 commits into
mainfrom
refactor/use-agent-interface-profile
Jun 21, 2026
Merged

refactor(profile): use agent-interface AgentProfile#264
drewstone merged 7 commits into
mainfrom
refactor/use-agent-interface-profile

Conversation

@drewstone

@drewstone drewstone commented Jun 21, 2026

Copy link
Copy Markdown
Contributor

Summary

  • Replace agent-eval local AgentProfile shapes with the canonical @tangle-network/agent-interface AgentProfile.
  • Keep eval-owned helpers for profile id, model id, hashing, and AgentProfileCell stamping.
  • Harden profile ids: agentProfileId() now returns a path-safe label-<hash16> id, so run directories and profile-matrix keys stay readable without collapsing same-label profiles.
  • Add defineAgentEval(): define scenarios, agent, judge, and baseline once, then call .evaluate() or .improve() with per-call overrides.
  • Make defineAgentEval.improve() merge nested budget, llm, and hostedTenant overrides field-by-field; hosted tenant config now fails loudly if the merge still lacks endpoint/apiKey/tenantId.
  • Make empty judge lists and zero reps fail loudly instead of producing unscored or zero-cell runs.
  • Remove the stale sandbox peer/dev dependency, old profile-versioning guidance, and obsolete strategy/spec docs; retarget surviving docs/examples to current concepts.
  • Make package builds CI-safe by using tsup for JS and tsc --emitDeclarationOnly for declarations, with declaration aliases for package export paths.
  • Address the follow-up review audit: migration notes, path-safe ids, resource-order hash tests, public HostedTenant export, run-output ignore, stronger hosted-tenant tests, and a persisted critical-audit record.

Verification

  • pnpm test src/agent-profile.test.ts tests/contract-define-agent-eval.test.ts tests/campaign/run-profile-matrix.test.ts (38 passed)
  • pnpm typecheck
  • pnpm lint (exit 0; existing Biome warnings remain in untouched files)
  • pnpm test (246 files, 2,500 passed, 2 skipped)
  • pnpm build
  • pnpm verify:package
  • git diff --check
  • git diff --check HEAD~1..HEAD
  • Stale-reference scan: 0 hits for the old marketing README pairing/Phase-B phrases in the example/docs scope
  • git merge-tree --write-tree origin/main HEADa6d8882d04654fd28ffe3f63ec09c977a2de4a84
  • GitHub Actions on 5468922c: success

Notes

  • This intentionally removes the old buildSandboxAgentProfileCell / SANDBOX_AGENT_PROFILE public names instead of keeping stale compatibility aliases.
  • Profile identity intentionally changes with the canonical AgentProfile migration: old persisted scorecard/profile cells may not join with new agentProfileHash values or sourceProfile.kind = agent-interface-profile rows. This package is greenfield for us, so this PR treats that as a clean migration rather than adding legacy shims.
  • agentProfileId() is now human-readable, path-safe, and collision-resistant for ordinary eval matrices. Consumers that used it as a stable display-only label should use profile.name; run/matrix keys should use the helper.
  • “Proposer” naming is kept; no proposer rename in this PR.

tangletools
tangletools previously approved these changes Jun 21, 2026

@tangletools tangletools left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Auto-approved PR — 2423088f

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-21T09:35:00Z

tangletools
tangletools previously approved these changes Jun 21, 2026

@tangletools tangletools left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Auto-approved PR — 28d74376

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-21T09:45:24Z

tangletools
tangletools previously approved these changes Jun 21, 2026

@tangletools tangletools left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Auto-approved PR — d78c9dcc

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-21T09:49:43Z

@tangletools tangletools left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Value Audit — sound-with-nits

Verdict sound-with-nits
Concerns 3 (2 low, 1 weak-concern)
Heuristic 0.0s
Duplication 0.0s
Interrogation 170.9s (2 bridge agents)
Total 170.9s

💰 Value — sound

Replaces agent-eval's local AgentProfile with the canonical @tangle-network/agent-interface shape, moves profile hashing/id/model helpers into eval where they belong, renames the internal prompt-builder type to PromptProfile, and prunes stale sandbox deps and obsolete strategy docs — a coherent subs

  • What it does: The change swaps agent-eval's hand-rolled AgentProfile interface ({ id, model, skills?, promptVersion?, tools?, metadata? }) for the canonical @tangle-network/agent-interface AgentProfile ({ name?, description?, version?, tags?, prompt?, model?, permissions?, tools?, mcp?, subagents?, resources?, hooks?, modes?, ... }). It rewrites eval-owned helpers (agentProfileHash, agentProfileId, `a
  • Goals it achieves: It unifies the profile contract across the Tangle stack so that eval consumers, runtime consumers, and sandbox consumers all speak the same AgentProfile language. It reduces maintenance burden by eliminating a duplicate local type, a stale peer dependency, and outdated guidance that no longer matches the architecture. It keeps eval as the owner of eval-specific behaviour-identity logic (hash/id/
  • Assessment: This is a good change on its merits. It is coherent with the codebase's stated layering (CLAUDE.md: "agent-eval has zero upward dependencies on agent-runtime or agent-knowledge" and "If a type is genuinely substrate, move it INTO agent-eval"). The implementation is careful: the hash is deterministic, order-insensitive where it matters (tags), excludes human-facing labels (name/description), fails
  • Better / existing approach: none — this is the right approach. agent-interface does not provide profile hashing/id/model utilities (verified by grep in node_modules/@tangle-network/agent-interface; only defineAgentProfile and mergeAgentProfiles exist), so eval correctly owns the behaviour-identity helpers. Keeping a local AgentProfile with a mapper would be extra types and conversion overhead; using agent-interface direc
  • Model: opencode/kimi-for-coding/k2p7
  • Bridge attempts: 1

🎯 Usefulness — sound-with-nits

Replaces agent-eval's bespoke AgentProfile with the canonical agent-interface type and adds three eval-owned helpers (id/model-id/hash); the new shape is consumed consistently across scorecard, run-profile-matrix, and the cell builder, with all old-shape accesses gone.

  • Integration: Strong. The re-exported AgentProfile and three new helpers (agentProfileId/agentProfileModelId/agentProfileHash) are wired through every internal caller: src/scorecard.ts:117-118, src/campaign/presets/run-profile-matrix.ts:228-229/336/350-351/382-383, src/agent-profile-cell.ts, src/index.ts:635-636. External consumer tax-agent has already updated to agent-eval ^0.94.0 alongside agent-interface ^0.
  • Fit with existing patterns: Excellent. The agent-interface package already lives in node_modules as the canonical cross-Tangle profile contract (re-exported by sandbox per its own header docstring), and agent-eval's CLAUDE.md substrate rule explicitly says substrate primitives should not duplicate consumer/runtime types — the old local AgentProfile was that smell. The replacement keeps eval-owned helpers where they belong (h
  • Real-world viability: Solid on the paths that matter. agentProfileModelId throws ValidationError on blank/missing model.default (fail-loud per repo doctrine; tested at src/agent-profile.test.ts:65-69). The compact() helper normalizes undefined vs missing fields before hashing so equivalent profiles don't split cells. Hash determinism is tested for tag order, name/description invariance, model/skill/version sensitivity
  • Model: opencode/zai-coding-plan/glm-5.2
  • Bridge attempts: 1

🔎 Heuristic Signals

🟡 Cruft: console debug added examples/marketing-agent-canonical/index.ts

  • console.log('Marketing agent self-improvement loop')

🟡 Cruft: magic number added examples/marketing-agent-canonical/index.ts

  • composite mean: ${formatScore(baselineScore)} (took ${((Date.now() - baselineStart) / 1000).toFixed(1)}s, $${baseline.aggregates.totalCostUsd.toFixed(4)}),

🎯 Usefulness Audit

🟡 Breaking public-API change without a version bump [integration] ``

package.json remains 0.94.0 despite removing exported symbols (SandboxAgentProfileLike, buildSandboxAgentProfileCell) and reshaping AgentProfile fundamentally (id+model-string -> name+model.default). External consumers pinned by range rather than tarball would break on install. Currently safe only because blueprint-agent pins 0.91.0 by URL and tax-agent coordinated its bump in lockstep. Out of scope for this audit (process/paperwork, not code usefulness), but worth flagging for the release condu


What this audit checks

It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.

Pass What it asks
Heuristic Vague title? Whitespace-only or cruft-bearing diff? (content signals only)
Duplication Do added function/class names already exist elsewhere in the repo?
Value Audit What does it do? What goal does it achieve? Is it good? Better architecture or already-exists?
Usefulness Audit Does it integrate and fit? Will it hold up in real use and actually get used?

Findings are concerns, not blocks — the human reviewer decides what to do with them.

value-audit · 20260621T095423Z

tangletools
tangletools previously approved these changes Jun 21, 2026

@tangletools tangletools left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Auto-approved PR — 3c16ed69

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-21T10:05:16Z

@tangletools tangletools left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Value Audit — sound-with-nits

Verdict sound-with-nits
Concerns 4 (2 low, 2 weak-concern)
Heuristic 0.0s
Duplication 0.0s
Interrogation 227.5s (2 bridge agents)
Total 227.5s

💰 Value — sound-with-nits

Aligns agent-eval's local flat AgentProfile with the canonical @tangle-network/agent-interface shape, adds a thin defineAgentEval() DX wrapper over existing runEval()/selfImprove() primitives, and prunes ~2.4k lines of stale strategy docs — coherent and in the codebase's stated grain.

  • What it does: Three concrete deltas: (1) src/agent-profile.ts drops the local {id, model: string, skills?, promptVersion?, tools?, metadata?} shape and re-exports AgentProfile from @tangle-network/agent-interface (nested model.default, prompt, resources, permissions, hooks, modes, etc.); src/agent-profile-cell.ts, src/multishot/{matrix,multishot}.ts, src/campaign/presets/run-profile-matrix.ts, src/scorecard.ts
  • Goals it achieves: (a) Make agent-eval use the SAME AgentProfile shape the rest of the Tangle ecosystem uses, so cross-product profile-cell joins (same canonical profile → same sourceProfile.hash across agent-runtime/sandbox/evals) actually work instead of silently disagreeing. This realizes the substrate-direction rule in CLAUDE.md ('types that belong in a shared package should be imported from there, not redefined
  • Assessment: Coherent and in-grain. The substrate rule in CLAUDE.md explicitly calls out AgentProfile-style types as cross-package primitives that should not be duplicated; the local flat shape was the outlier against agent-interface/agent-runtime, and the migration closes that gap. The new helpers (agentProfileId, agentProfileModelId) preserve eval-owned behaviour (hashing, RunRecord.model flattening, scoreca
  • Better / existing approach: Looked for an existing 'define-once' wrapper or factory pattern; none exists. selfImprove() (src/contract/self-improve.ts:370) and runEval() (src/campaign/presets/run-eval) are the primitives defineAgentEval wraps — no duplication. PromptProfile (src/profile/index.ts:53) is deliberately retained as a distinct layer (builds the system-prompt string that gets rendered into AgentProfile.prompt.system
  • Model: opencode/zai-coding-plan/glm-5.2
  • Bridge attempts: 2
  • Bridge warning: opencode/kimi-for-coding/k2p7: bridge stream ended without value-audit content

🎯 Usefulness — sound

Converges to the canonical @tangle-network/agent-interface AgentProfile type across all consumers, keeps eval-owned helpers, adds defineAgentEval() DX wrapper, and prunes 13 obsolete docs — a clean adoption of the upstream substrate type with no dead surface or competing pattern.

  • Integration: The canonical AgentProfile from @tangle-network/agent-interface is imported in 7 source files: agent-profile.ts (re-export + helpers), agent-profile-cell.ts (cell stamping), playback.ts (driver context), multishot.ts, multishot/matrix.ts, and campaign/presets/playback.ts. scorecard.ts and run-profile-matrix.ts consume through the re-export in ./agent-profile. All call sites use the new shape (prof
  • Fit with existing patterns: No competing pattern. The old local AgentProfile (flat id/model/skills/tools strings) is fully replaced; no backward-compat shims remain. The distinct PromptProfile / AgentProfileSection types in src/profile/index.ts are a separate concept (structured system prompt sections) and explicitly documented as such (line 15-18). The @tangle-network/agent-interface AgentProfile is the canonical provider-n
  • Real-world viability: Tests cover: determinism of hash for identical profiles, tag-order insensitivity, name/description exclusion from hash, model-change sensitivity, resource-change sensitivity, version-change sensitivity, and fail-loud on missing model.default. The cell infrastructure (buildAgentInterfaceProfileCell, verifyAgentProfileCell, assertRunAgentProfileCell) validates cellId integrity, model/promptHash cros
  • Model: opencode/deepseek/deepseek-v4-pro
  • Bridge attempts: 3
  • Bridge warning: opencode/zai-coding-plan/glm-5.2: bridge stream ended without value-audit content; opencode/kimi-for-coding/k2p7: bridge stream ended without value-audit content

🔎 Heuristic Signals

🟡 Cruft: console debug added examples/marketing-agent-canonical/index.ts

  • console.log('Marketing agent self-improvement loop')

🟡 Cruft: magic number added examples/marketing-agent-canonical/index.ts

  • composite mean: ${formatScore(baselineScore)} (took ${((Date.now() - baselineStart) / 1000).toFixed(1)}s, $${baseline.aggregates.totalCostUsd.toFixed(4)}),

💰 Value Audit

🟡 sourceProfile.kind rename is a silent break for any persisted cells [maintenance] ``

src/agent-profile-cell.ts now writes kind='agent-interface-profile' where it previously wrote 'sandbox-agent-profile'. Any downstream consumer that joins on the kind literal (scorecard groupings, hosted ingest filters, cross-product rollups) will treat old + new cells as different profile kinds even when the underlying profile hashes to the same value. The PR body says this is intentional for a greenfield package, but a one-line migration note in docs/ or a constant alias (AGENT_PROFILE_KINDS.LE

🟡 Biome formatter pass inflated the file count on examples that didn't need semantic edits [proportion] ``

A large share of the 46-file diff is trailing-comma + line-wrap churn in examples/marketing-agent-canonical/{agent,judge,scenarios,index}.ts (e.g. agent.ts +35/-8 semantic vs ~60 lines of reformatted call-site args). Not wrong — the repo runs Biome, and the files were touched anyway — but it makes the semantic delta harder to review and bloats the PR. Future: run the formatter on a separate commit or a repo-wide biome --write chore so semantic refactors stay readable.


What this audit checks

It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.

Pass What it asks
Heuristic Vague title? Whitespace-only or cruft-bearing diff? (content signals only)
Duplication Do added function/class names already exist elsewhere in the repo?
Value Audit What does it do? What goal does it achieve? Is it good? Better architecture or already-exists?
Usefulness Audit Does it integrate and fit? Will it hold up in real use and actually get used?

Findings are concerns, not blocks — the human reviewer decides what to do with them.

value-audit · 20260621T101053Z

@tangletools tangletools left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Auto-approved PR — a096a3f8

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-21T10:11:02Z

@tangletools tangletools left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 Value Audit — sound-with-nits

Verdict sound-with-nits
Concerns 3 (2 low, 1 weak-concern)
Heuristic 0.0s
Duplication 0.0s
Interrogation 259.3s (2 bridge agents)
Total 259.3s

💰 Value — sound

Aligns agent-eval with the shared @tangle-network/agent-interface AgentProfile, removes stale sandbox/docs baggage, and adds a clean defineAgentEval() DX wrapper over runEval/selfImprove; verified end-to-end.

  • What it does: Three concrete moves: (1) replaces agent-eval's local AgentProfile shape with the canonical @tangle-network/agent-interface AgentProfile, keeping eval-owned helpers for id/model/hashing/cell-stamping; (2) adds defineAgentEval() in src/contract/define-agent-eval.ts, a thin factory that lets callers define scenarios/agent/judge/baseline once and then call .evaluate() or .improve() with per-call over
  • Goals it achieves: Makes the eval substrate consistent with the stack's shared agent-interface contract (the repo's own layering rule says agent-eval must not depend upward on consumers), collapses the common eval+improve DX into one object, deletes misleading roadmap/Phase-B/spec docs so surviving docs point at current concepts, and makes declaration emission CI-safe.
  • Assessment: Good. The profile migration is exactly what the codebase's layering rule demands; the hash semantics are preserved and tested. defineAgentEval is a small, well-typed wrapper that does not hide the underlying runEval/selfImprove primitives. The doc cleanup and build change are pragmatic. All verification passes: pnpm typecheck, pnpm test (2486 passed), pnpm build, pnpm verify:package, and the quick
  • Better / existing approach: none — this is the right approach. The codebase had its own local AgentProfile; moving to the shared agent-interface shape removes duplication and follows the documented dependency direction. defineAgentEval is additive; the functional runEval/selfImprove entry points remain available for callers that need them. No existing helper already provides the define-once/evaluate-or-improve pattern.
  • Model: opencode/kimi-for-coding/k2p7
  • Bridge attempts: 1

🎯 Usefulness — sound-with-nits

Cleanly swaps the local AgentProfile for the canonical @tangle-network/agent-interface shape, retargets the public surface (cell builder, scorecard, profile matrix, multishot, playback) onto it, and adds a small, well-tested defineAgentEval() DX wrapper — all in the grain of the codebase.

  • Integration: Strong. defineAgentEval is exported from /contract (src/contract/index.ts:185), is documented as #1 of 'the five functions you'll call', is used in README.md:63, docs/concepts.md:14, docs/customer-journeys.md:61, examples/selfimprove-quickstart/index.ts:11, and is covered by 4 contract tests at tests/contract-define-agent-eval.test.ts. agentProfileId/agentProfileModelId have real internal callers
  • Fit with existing patterns: Strong. CLAUDE.md's layering rule forbids upward deps on agent-runtime/agent-knowledge only; @tangle-network/agent-interface is a substrate sibling (its package.json shows only a zod runtime dep), and README.md:167 explicitly blesses it as the shared contract. The rename of profile.AgentProfile -> profile.PromptProfile (src/profile/index.ts:55) dissolves the exact name clash the old comment block
  • Real-world viability: Reasonable on the happy paths and common edge cases. agentProfileModelId throws loudly on missing model.default (src/agent-profile.ts:22) — preserves the old fail-loud guarantee. agentProfileId falls back to a hash-derived id when name/version are absent (src/agent-profile.ts:12), so optional agent-interface fields don't crash. defineAgentEval's mem:// runDir default avoids the filesystem-pollutio
  • Model: opencode/zai-coding-plan/glm-5.2
  • Bridge attempts: 1

🔎 Heuristic Signals

🟡 Cruft: console debug added examples/marketing-agent-canonical/index.ts

  • console.log('Marketing agent self-improvement loop')

🟡 Cruft: magic number added examples/marketing-agent-canonical/index.ts

  • composite mean: ${formatScore(baselineScore)} (took ${((Date.now() - baselineStart) / 1000).toFixed(1)}s, $${baseline.aggregates.totalCostUsd.toFixed(4)}),

🎯 Usefulness Audit

🟡 resources.skills order now affects agentProfileHash (old shape sorted) [robustness] ``

The previous AgentProfile treated skills/tools as sets: src/agent-profile.ts used to do [...(profile.skills ?? [])].sort() and the same for tools, so reordering did not split a scorecard cell. The new hash (src/agent-profile.ts:46-54) sorts tags but does NOT sort array-of-object resource bundles like resources.skills, resources.tools, resources.agents. The test at src/agent-profile.test.ts:40 only asserts 'adding a skill changes the hash', not order-insensitivity. For the common case o


What this audit checks

It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.

Pass What it asks
Heuristic Vague title? Whitespace-only or cruft-bearing diff? (content signals only)
Duplication Do added function/class names already exist elsewhere in the repo?
Value Audit What does it do? What goal does it achieve? Is it good? Better architecture or already-exists?
Usefulness Audit Does it integrate and fit? Will it hold up in real use and actually get used?

Findings are concerns, not blocks — the human reviewer decides what to do with them.

value-audit · 20260621T101710Z

@tangletools

Copy link
Copy Markdown
Contributor

❌ Needs Work — a096a3f8

Readiness 35/100 · Confidence 95/100 · 21 findings (1 high, 4 medium, 16 low)

deepseek glm aggregate
Readiness 35 42 35
Confidence 95 95 95
Correctness 35 42 35
Security 35 42 35
Testing 35 42 35
Architecture 35 42 35

Full multi-shot audit completed 8/8 planned shots over 63 changed files. Global verifier still owns final merge decision. | Full multi-shot audit completed 8/8 planned shots over 63 changed files. Global verifier still owns final merge decision.

Blocking

🔴 HIGH New public API functions agentProfileId and agentProfileModelId have zero test coverage — src/agent-profile.ts

agentProfileId (line 12) and agentProfileModelId (line 22) are new public exports consumed by src/scorecard.ts:118, src/campaign/presets/run-profile-matrix.ts:228-229, and re-exported via src/index.ts:636. Neither function has a single direct test. The test file src/agent-profile.test.ts only imports and tests agentProfileHash. Run-profile-matrix.ts wraps agentProfileModelId errors with profileId (line 370-372) and scorecard.ts uses

Other

🟠 MEDIUM No unit tests for agentProfileId or agentProfileModelId — src/agent-profile.ts

src/agent-profile.test.ts imports only agentProfileHash (line 2). The two new exported helpers — agentProfileId (whitespace trim, name→version→hash fallback chain) and agentProfileModelId (the throw on missing/blank model.default) — have no direct unit tests. agentProfileModelId's throw is only indirectly covered via agentProfileHash's throw test at line 66. The whitespace, fallback-ordering, and version-collision behaviours of agentProfileId are unverified. Add a describe block covering: trim, name-wins-over-version, version-fallback, hash-fallba

🟠 MEDIUM agentProfileId can return non-unique labels across distinct profiles — src/agent-profile.ts

Lines 12-16: agentProfileId returns profile.name?.trim() || profile.version?.trim(), falling back to version when name is missing. version is behaviour-bearing (affects agentProfileHash), so two profiles with different skills/models but the same version: 'v3' and no name both get id 'v3'. In src/campaign/presets/run-profile-matrix.ts:336 profileIds = opts.profiles.map(agentProfileId) feeds both experimentId (line 339) and matrixId ([line 340](https://github.com/tangle-network/agent-eval/blob/a096a3f8927999a6281f9320dffdadb097e7cea1/src

🟠 MEDIUM agentProfileModelId error message lost profile-identifying information — src/agent-profile.ts

Line 25: throws ValidationError with bare string 'AgentProfile has no model.default — cannot record eval run'. The old code (base commit 9e6f0d1) included the profile id: AgentProfile "${profile.id}" has no model — cannot hash. Direct callers like scorecard.ts:118 get no clue which profile failed. The run-profile-matrix.ts caller (line 370-372) wraps the error with profileId, but other callers rely on the error message alone. Could include profile.name ?? 'unknown' or the hash prefix.

🟠 MEDIUM Shallow merge silently drops nested-object fields in improve() partial overrides — src/contract/define-agent-eval.ts

mergeDefined (lines 151-158) does key-level shallow merge. In improve() (lines 111-118), when a caller passes a partial override for object-typed fields like hostedTenant (which has required fields endpoint, apiKey, tenantId — see src/hosted/client.ts:29-42), the entire object is replaced. Example: defaults set hostedTenant: { endpoint: 'https://orig', apiKey: 'k1', tenantId: 't1' }, caller passes hostedTenant: { endpoint: 'https://new' }apiKey and tenantId are silently lost, causing runtime auth

🟡 LOW README retains Phase-B/pairing language that judge.ts and index.ts removed — examples/marketing-agent-canonical/README.md

The PR systematically strips Phase-B/pairing wording from index.ts (titles, report name, run dir) and from judge.ts header (removed 'STRAWMAN the founder modifies during the Phase B pairing'), but README.md missed two spots: line 13 ('Strawman — partner modifies during the pairing.') and line 63 ('co-author during the pairing's hour 3'). These now contradict judge.ts's updated header ('Treat these dimensions as the product quality bar for the demo'). Impact: doc inconsistency introduced by the same PR; no cod

🟡 LOW Stale 'partner/pairing' language remains in README — examples/marketing-agent-canonical/README.md

Two references to the old 'partner/pairing' framing were not updated: (1) Line 13 table row for judge.ts still reads 'Strawman — partner modifies during the pairing.' — while judge.ts itself was rebranded to 'Treat these dimensions as the product quality bar'. (2) Line 63 still says 'co-author during the pairing's hour 3' — the surrounding section was rebranded to 'Swap for a real product'. These are inconsistent with the thorough rebranding applied everywhere else in this shot. Fix: update both to match the

🟡 LOW Breaking change to agentProfileHash output with no scorecard migration path — src/agent-profile.ts

Old hash inputs: {model: string, skills: string[], promptVersion, tools: string[], metadata}. New hash inputs: full nested AgentProfile (model.default, resources.skills content, permissions, mcp, hooks, modes, etc.). Any pre-PR scorecard JSONL persisted under the old schema (src/scorecard.ts:209 keys profiles by profileHash) will have stale hashes that no longer match what agentProfileHash produces. loadScorecard happily reads them but diffScorecard will treat old and new as different cells. The schema swap is intentional/coordinated, but there is no version marker on scorecard log lines and no migration helper. Acceptable for a coordinated rewrite, but call it out in the PR description and bump the scorecard format tag if one exists.

🟡 LOW agentProfileId fallback derives ID from behavior hash — semantic shift for unnamed profiles — src/agent-profile.ts

Line 12-16: when profile.name and profile.version are both empty, agentProfileId falls back to profile-${agentProfileHash(profile).slice(0,12)}. This means the profileId (used as runId/candidateId/runDir components) changes when the profile's behavior changes. Previously profile.id was a stable caller-supplied label independent of behavior. The fallback is clearly documented and only activates for unnamed/unversioned profiles, but consumers relying on profileId stability across profile mutations would be affected. Not a bug — just a behavioral note for callers passing unnamed profiles.

🟡 LOW agentProfileId has unreachable fallback for model-less profiles — src/agent-profile.ts

Line 12-15: agentProfileId falls back to profile-<hash> when name and version are both empty. However agentProfileHash (line 47) calls agentProfileModelId which throws on missing model.default, meaning the hash fallback path is unreachable for model-less profiles — the function throws before reaching line 15. Either the fallback should handle these gracefully, or the docstring should note the throw-on-no-model behavior.

🟡 LOW agentProfileModelId error message no longer identifies which profile is broken — src/agent-profile.ts

Line 25: the new error 'AgentProfile has no model.default — cannot record eval run' drops the profile identity that the old code included (AgentProfile "${profile.id}" has no model). In runProfileMatrix's preflight loop (line 349-351), agentProfileHash/agentProfileModelId are called OUTSIDE the try/catch, so this error propagates uncaught — the caller sees a generic error with no indication of which profile in opts.profiles is the offender. The old error at least named the profile.id. Fix: either pass an optional label arg to agentProfileModelId

🟡 LOW compact helper keeps null but strips undefined — subtle hash difference — src/agent-profile.ts

Line 30-36: compact() uses value !== undefined as the filter, so null values survive into the model object serialized in agentProfileHash. If a caller passes profile.model.provider: null, it produces a different hash than omitting the field entirely. This is likely intentional (null = explicitly empty vs undefined = absent) and aligns with the 'no fallbacks' doctrine, but is undocumented, and JSON naturally treats null and absent differently. Not a bug, but worth a one-line comment.

🟡 LOW compact() return type is a lie — claims T but returns a subset — src/agent-profile.ts

Lines 30-36: function compact<T extends Record<string, unknown>>(input: T): T returns out as T after dropping undefined-valued keys. The runtime shape is correct (JSON.stringify drops undefined anyway), so the cast is harmless at runtime, but the signature overstates the invariant. Since this is a private helper and the only caller reconstructs the model object immediately, the practical risk is zero. Either rename to make the cast explicit (compact<T>(input: T): Partial<T>) or inline it once — not worth a dedicated helper for a single use site.

🟡 LOW resources.skills array order now affects hash — semantic change undocumented — src/agent-profile.ts

Old code sorted skills and tools before hashing (order-insensitive). New code only sorts tags (line 52); resources.skills, resources.tools, resources.agents, resources.commands, resources.files are hashed in array order. canonicalize() in src/pre-registration.ts preserves array order (maps, doesn't sort). This is defensible — skill mount order can change agent behaviour — but the docstring at lines 38-44 doesn't state it. If two consumers mount the same skills in different order, they get different cells. Add one line to the docstring:

🟡 LOW Inconsistent AgentProfile import paths across changed files — src/campaign/presets/playback.ts

Line 14: playback.ts imports AgentProfile from @tangle-network/agent-interface directly, while run-profile-matrix.ts (line 37-42) imports it from ../../agent-profile (which re-exports the same type). Both resolve to the identical type — the inconsistency is purely cosmetic. Convention would be: import from ../../agent-profile when you also need the helper functions (run-profile-matrix does), import the type-only from either path otherwise. No fix required.

🟡 LOW Inconsistent import path for AgentProfile type — src/campaign/presets/playback.ts

playback.ts imports AgentProfile directly from @tangle-network/agent-interface (line 14) while run-profile-matrix.ts imports from ../../agent-profile (line 38-42). Both resolve to the same type since agent-profile.ts re-exports from agent-interface, but the inconsistency could confuse future readers. Harmless — no behavioral impact.

🟡 LOW evaluate() treats judges: [] as an explicit empty judge list, silently disabling scoring — src/contract/define-agent-eval.ts

Line 106: judges: judges ?? [judge ?? defaults.judge]. The ?? operator only falls back on null/undefined, so a caller passing judges: [] (empty array) gets zero judges wired into RunCampaignOptions.judges. run-campaign.ts:121 (const judges = opts.judges ?? []) then runs every cell without scoring, producing a CampaignResult whose aggregates.byJudge is empty. This is consistent with the docstring ('Ignored when judges is set') and with runCampaign's own behavior, but it is a silent footgun — no error, just an unscored result. Impact: low (documented, recoverable, caller error). Optional fix: coerce empty arrays to the default, e.g. `judges: (judges && ju

🟡 LOW improve() shallow-merges non-budget nested options, surprising callers who override one nested field — src/contract/define-agent-eval.ts

Lines 111-118 + mergeDefined (151-158): only budget gets field-by-field merge. If a caller does improve({ llm: { model: 'x' } }), the entire defaults.llm object is replaced, dropping any baseUrl/apiKey set on the definition. The improve() docstring (line 65-67) documents shallow-merge for everything except budget, so this is by design — but it is the one place eager callers will trip. Impact: low (documented, recoverable by passing the full nested object). No fix required; consider mentioning llm/proposer by name in

🟡 LOW Budget-merge test asserts outcomes, not the merged budget itself — tests/contract-define-agent-eval.test.ts

Test 4 ('runs selfImprove and merges budget overrides without dropping default knobs') asserts generationsExplored === 1, winner.surface === 'base better', and lift > 0. These prove the loop ran end-to-end, but do not explicitly assert that holdoutFraction: 0.5 and generations: 1 survived the mergeBudget call when only populationSize was overridden — the test name's claim ('without dropping default knobs') is verified only indirectly (non-empty train/holdout split would also succeed at fraction 0.25 default). Consider asserting against an introspectable merged budget, or adding a parallel test that varies holdoutFraction to prove the default is preserved vs. reset.

🟡 LOW Negative fs-write assertion is brittle — tests/contract-define-agent-eval.test.ts

Test 2 ('keeps default evaluate runs in memory...') asserts expect(existsSync(join(dir, 'mem:'))).toBe(false) — this only verifies that no literal directory named mem: was created. It would not catch accidental fs writes to any other path (e.g. a stray fsCampaignStorage default under a different key). Stronger form: expect(readdirSync(dir)).toHaveLength(0) to prove the temp dir stays empty. Not a bug — the underlying implementation does the right thing via the mem:// storage selector at define-agent-eval.ts:94-98 — but the test under-claims the contract it claims to lock down.

🟡 LOW mem: canary check is narrower than intended — tests/contract-define-agent-eval.test.ts

Line 97: existsSync(join(dir, 'mem:')) checks for a literal mem: directory entry. The actual in-memory runDir is mem://defineAgentEval-<timestamp>. If filesystem leakage occurred through direct path operations, the created entry would be the full URI string, not just mem:, so the check would miss it. The in-memory storage backend (inMemoryCampaignStorage()) is the correct guard; this canary is supplementary. Consider checking existsSync(join(dir, 'mem://')) or using a glob for mem:* patterns for broader coverage.


tangletools · 2026-06-21T10:33:50Z · trace

@tangletools tangletools left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

❌ 1 Blocking Finding — a096a3f8

Full multi-shot audit completed 8/8 planned shots over 63 changed files. Global verifier still owns final merge decision. | Full multi-shot audit completed 8/8 planned shots over 63 changed files. Global verifier still owns final merge decision.

Full immutable report for this review: trace

Summary comment for this run: full summary


tangletools · 2026-06-21T10:33:50Z · immutable trace

tangletools
tangletools previously approved these changes Jun 21, 2026

@tangletools tangletools left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Auto-approved PR — 5a339d14

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-21T12:17:16Z

@tangletools tangletools left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 Value Audit — sound

Verdict sound
Concerns 2 (2 low)
Heuristic 0.0s
Duplication 0.0s
Interrogation 273.1s (2 bridge agents)
Total 273.1s

💰 Value — sound

Refactors agent-eval onto the canonical @tangle-network/agent-interface AgentProfile, adds a defineAgentEval() DX wrapper with fail-loud overrides, and fixes the declaration build — a coherent, worthwhile change.

  • What it does: Replaces agent-eval's local AgentProfile interface with the canonical type from @tangle-network/agent-interface (src/agent-profile.ts:1-6), keeps eval-specific hashing/id/model helpers (src/agent-profile.ts:13-31), introduces defineAgentEval() as a once-define / call-many wrapper over runEval() and selfImprove() (src/contract/define-agent-eval.ts:89-132), makes .evaluate({ judges: [] }) th
  • Goals it achieves: Aligns the eval substrate with the shared agent-interface profile contract so consumers don't translate between two profile shapes; gives product agents a single object they define once and call .evaluate() or .improve() on; eliminates silent unscored runs and silent partial-tenant misconfiguration; removes dead sandbox/docs surface; and makes declaration emission CI-reliable after prior heap
  • Assessment: Good. The change is internally consistent: it reuses existing primitives (canonicalize, runEval, selfImprove, runImprovementLoop) instead of reimplementing them, respects the repo's substrate-upward-dependency rule by deleting the sandbox peer/dev dep, and the new helpers are directly tested (src/agent-profile.test.ts, tests/contract-define-agent-eval.test.ts, tests/campaign/run-profile-ma
  • Better / existing approach: none — this is the right approach. I checked: @tangle-network/agent-interface exposes the AgentProfile type and mergeAgentProfiles, but no profile hashing/id/model-id helpers (node_modules/@tangle-network/agent-interface/dist/agent-profile.d.ts), so eval keeping its own hash/id helpers is correct. No existing shallow-merge utility for SelfImproveBudget/SelfImproveLlm/HostedTenant exists in
  • Model: opencode/kimi-for-coding/k2p7
  • Bridge attempts: 1

🎯 Usefulness — sound

A coherent refactor that consolidates on the canonical @tangle-network/agent-interface AgentProfile, drops the redundant @tangle-network/sandbox dep, and adds a thin defineAgentEval() DX wrapper — fits the substrate grain, fail-loud defaults hold up, all callers updated, typecheck + targeted tests p

  • Integration: Fully wired. @tangle-network/agent-interface@^0.10.0 is a real dep (package.json:209) and exports AgentProfile (node_modules/@tangle-network/agent-interface/dist/agent-profile.d.ts:206). The new helpers (agentProfileId, agentProfileModelId, agentProfileHash) are re-exported from the root barrel (src/index.ts:636) and consumed by scorecard (src/scorecard.ts:117-118) and runProfileMatrix (src/campai
  • Fit with existing patterns: Matches the substrate grain. CLAUDE.md's layering rule says agent-eval has zero upward deps on consumers; agent-interface is the canonical type home (per its package docstring: 'This is the canonical home; @tangle-network/sandbox re-exports these symbols'), so pulling the type DOWN into agent-eval is correct layering, not an inversion. defineAgentEval follows the existing DX-wrapper pattern (selfI
  • Real-world viability: Holds up. Fail-loud paths are wired correctly: agentProfileModelId throws ValidationError naming the broken profile when model.default is missing/blank (src/agent-profile.ts:24-29); agentProfileId propagates that throw (no silent fallback) — exercised by tests at src/agent-profile.test.ts:96-100. defineAgentEval.evaluate({ judges: [] }) throws 'judges must not be empty' instead of producing an uns
  • Model: opencode/zai-coding-plan/glm-5.2
  • Bridge attempts: 1

🔎 Heuristic Signals

🟡 Cruft: console debug added examples/marketing-agent-canonical/index.ts

  • console.log('Marketing agent self-improvement loop')

🟡 Cruft: magic number added examples/marketing-agent-canonical/index.ts

  • composite mean: ${formatScore(baselineScore)} (took ${((Date.now() - baselineStart) / 1000).toFixed(1)}s, $${baseline.aggregates.totalCostUsd.toFixed(4)}),

What this audit checks

It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.

Pass What it asks
Heuristic Vague title? Whitespace-only or cruft-bearing diff? (content signals only)
Duplication Do added function/class names already exist elsewhere in the repo?
Value Audit What does it do? What goal does it achieve? Is it good? Better architecture or already-exists?
Usefulness Audit Does it integrate and fit? Will it hold up in real use and actually get used?

Findings are concerns, not blocks — the human reviewer decides what to do with them.

value-audit · 20260621T122338Z

@tangletools

Copy link
Copy Markdown
Contributor

✅ No Blockers — 5a339d14

Readiness 65/100 · Confidence 95/100 · 15 findings (1 medium, 14 low)

deepseek glm aggregate
Readiness 73 65 65
Confidence 95 95 95
Correctness 73 65 65
Security 73 65 65
Testing 73 65 65
Architecture 73 65 65

Full multi-shot audit completed 8/8 planned shots over 63 changed files. Global verifier still owns final merge decision. | Full multi-shot audit completed 8/8 planned shots over 63 changed files. Global verifier still owns final merge decision.

🟠 MEDIUM Breaking hash algorithm change with no migration path visible — src/agent-profile.ts

The old agentProfileHash (lines removed) hashed only {model, skills, promptVersion, tools, metadata} with sorted arrays and null defaults for absent fields. The new hash (line 55-66) spreads the entire AgentProfile from @tangle-network/agent-interface minus name/description/tags-sorted. This is a structural incompatibility — old hashes and new hashes can never match, even for semantically identical configurations, because the old model field was a flat string while the new is {default: string, ...hints}. All existing scorecard JSONL files, matrix results, and any artifacts keyed by profileHash become orphaned. Impact: historical scorecard comparisons, diffScorecard bas

🟡 LOW Run output directory not in .gitignore — examples/marketing-agent-canonical/index.ts

index.ts:96 writes to .marketing-agent-runs/<timestamp>/. .gitignore:19 only excludes .runs/, not .marketing-agent-runs/. Pre-existing (old .phase-b-runs/ had the same gap), but the rename was the natural moment to add .marketing-agent-runs/ to .gitignore. Risk: a user running the demo then git add . accidentally commits run artifacts + generated reports. Fix: add .marketing-agent-runs/ to .gitignore. Not a regression — the rename preserves the pre-existing gap exactly.

🟡 LOW Documented 'resource array order is hash-bearing' invariant has no test — src/agent-profile.ts

Comment at agent-profile.ts:50-53 makes an explicit correctness claim: 'Resource array order is hash-bearing because mount order can change agent behaviour.' The implementation honour this (resources spread through without sorting), but src/agent-profile.test.ts has no test locking it in. A future contributor adding .sort() to resources would silently flip the invariant and corrupt every persisted cell key. Add expect(agentProfileHash({...base, resources: {files: [A,B]}})).not.toBe(agentProfileHash({...base, resources: {files: [B,A]}})) alongside the existing tag-order test at line 27-31.

🟡 LOW agentProfileId collision guarantee is probabilistic at 48 bits — src/agent-profile.ts

agentProfileId (line 13-16) truncates SHA-256 to 12 hex characters (48 bits) with a docstring claiming 'Collision-resistant' and 'two profiles must not collapse onto one row.' The birthday bound for 48 bits is ~2^24 ≈ 16M profiles for a 50% collision chance. With the label prefix (name or version), same-label collisions require both profiles to share the label AND collide on 48 bits, making it practically safe for typical eval matrices. However, the guarantee is probabilistic, not deterministic, and a full 64-char hash is available elsewhere (agentProfileHash for scorecard joins). Consider documenting the probabilistic bound so callers can assess risk for their scale.

🟡 LOW agentProfileId doc claims 'directory names' but does not sanitize path-unsafe label characters — src/agent-profile.ts

Doc at agent-profile.ts:9-12 says the id is for 'run ids, matrix keys, and directory names where two profiles must not collapse onto one row.' Implementation at line 14 uses profile.name?.trim() || profile.version?.trim() verbatim — a name like 'sonnet/legal' or 'sonnet:legal' would produce id 'sonnet/legal-', which is unsafe as a filesystem path or matrix-key substring. Today no caller uses it as a directory name (verified: run-profile-matrix.ts only feeds it into runId/candidateId/sha inputs, all opaque), so no production bug. Either narrow the doc to 'string identifiers' or sanitize /, :, whitespace in agentProfileLabel before composing the id.

🟡 LOW compact() strips only top-level undefined — insufficient if model hints grow nested fields — src/agent-profile.ts

The compact() helper (line 37-43) removes only top-level undefined values from the model hints object. If profile.model ever grows nested structures (e.g., model: {default: 'claude', hints: {timeoutMs: undefined, provider: 'anthropic'}}), the nested undefined (hints.timeoutMs) would pass through to the hash, making {hints: {timeoutMs: undefined, provider: 'x'}} differ from {hints: {provider: 'x'}} despite semantic equivalence. Currently safe because the @tangle-network/agent-interface model type appears flat, but this is a latent fragility for future evolution.

🟡 LOW Breaking change: byProfile/campaigns keys now hash-suffixed — src/campaign/presets/run-profile-matrix.ts

Lines 438 & 443: campaigns[profileId] and byProfile[profileId] now key on agentProfileId(profile) = ${label}-${hash.slice(0,12)} instead of the old user-supplied profile.id. Two profiles that previously shared a label (e.g. two 'sonnet-v3' iterations) no longer collide, which is the intended fix. Impact: consumers indexing result.byProfile['sonnet-v3'] silently get undefined after upgrade. TypeScript catches direct profile.id access at the call site (AgentProfile no longer has .id), but a consumer that stored the old key string and looks it up later would silently miss. Recommend a CHANGELOG note framing this as an intentional key-sha

🟡 LOW agentProfileHash recomputed multiple times per profile — src/campaign/presets/run-profile-matrix.ts

Lines 349-351, 381-383, and inside buildRunRecord (called per cell, line 228-229): agentProfileHash and agentProfileModelId are re-derived in three separate code regions for the same profile. sha256 over a small canonical object is microseconds, so this is not a hot-path concern — but for matrices with 10k+ cells, memoizing the per-profile {profileHash, profileId, model} triple once at the top of the profile loop would remove the redundancy. Cosmetic; not blocking.

🟡 LOW Shared defaults.runDir between evaluate() and improve() can interact with resumable cell caching — src/contract/define-agent-eval.ts

When a caller sets runDir on the definition (e.g. defineAgentEval({ runDir: '/tmp/run1', ... })), both evaluate() (define-agent-eval.ts:102, runDir ?? defaults.runDir ?? mem://...) and improve() (via mergeDefined → selfImprove, line 425 of self-improve.ts) will use the same /tmp/run1. runCampaign defaults resumable: true (run-campaign.ts:118), caching completed cells by (manifestHash, scenarioId, rep, generation). A baseline evaluate() run followed by improve() could therefore reuse cached baseline cells — which is arguably desirable (no recompute) but is undocumented at the defineAgentEval layer and could surprise callers who expect isola

🟡 LOW evaluateDefaults forwards reps=0 without guard, producing zero-cell campaign — src/contract/define-agent-eval.ts

Line 151: if (defaults.budget?.reps !== undefined) out.reps = defaults.budget.reps — when budget.reps === 0, this check passes and forwards 0 to runCampaign, which runs 0 cells (empty result, all means zero). selfImprove defaults reps to 1; exposing the zero case through the DX wrapper without validation is a footgun. Guard: if (defaults.budget?.reps !== undefined && defaults.budget.reps > 0) or document that reps=0 is valid but degenerate.

🟡 LOW mergeDefined uses unsound as T cast when copying Partial overrides into a full-type defaults object — src/contract/define-agent-eval.ts

At define-agent-eval.ts:183-190, mergeDefined<T>(defaults: T, overrides: object | undefined): T iterates Object.entries(overrides) and assigns each value into merged typed as T, then returns merged as T. In improve(), opts is AgentEvalImproveOptions = Omit<Partial<SelfImproveOptions>, 'budget'|'hostedTenant'|'llm'> & { budget?: Partial<...>; ... }. The Partial / Partial / Partial values from opts get assigned into merged (typed as if they were full HostedTenant / SelfImproveBudget / SelfImproveLlm). The subsequent spreads ...(budget ? { budget } : {}) etc. (lines 125-129) overwrite th

🟡 LOW AgentEvalImproveOptions references Partial but HostedTenant is not re-exported from /contract — src/contract/index.ts

Line 54: AgentEvalImproveOptions includes hostedTenant?: Partial<HostedTenant>. Consumers calling evalKit.improve({ hostedTenant: {...} }) get correct type inference, but if they want to explicitly type the hostedTenant value before passing it, HostedTenant is not available at /contract. The type is exported at src/hosted/client.ts but not re-exported here. Consider re-exporting HostedTenant in index.ts for DX completeness.

🟡 LOW Budget-merge test cannot distinguish merge correctness from gate-early-ship — tests/contract-define-agent-eval.test.ts

Lines 117-134: 'runs selfImprove and merges budget overrides' asserts result.generationsExplored === 1. Defaults set generations: 1; override is { populationSize: 1 }. But with a deterministic proposer returning a 0.9-scoring surface vs 0.3 baseline, the defaultProductionGate (deltaThreshold 0.05) ships on generation 1 regardless of the max — so generationsExplored would be 1 even if the merge had dropped generations entirely (it would fall back to selfImprove's default of 3 and still stop at 1). The merge logic at define-agent-eval.ts:155-167 is correct, but this test does not prove it. Same issue for populationSize: the proposer always r

🟡 LOW Hosted-tenant URL assertion depends on call ordering across 3 fetches — tests/contract-define-agent-eval.test.ts

Lines 152-191: the mock fetchImpl captures only the LAST request into a single request variable. The improve() path actually fires 3 POSTs in order: (1) emitLoopProvenance→ingestEvalRun (/v1/ingest/eval-runs), (2) emitLoopProvenance→ingestTraces (/v1/ingest/traces), (3) shipEvalRunToHosted→ingestEvalRuns (/v1/ingest/eval-runs). The assertion request?.url === 'https://new.example/v1/ingest/eval-runs' only passes because step (3) happens to be last. If shipEvalRunToHosted is ever removed or reordered before emitLoopProvenance, the URL would become /v1/ingest/traces and the test would break for the wrong reason. The header assertions (authorizati

🟡 LOW process.chdir mutates process-global state during in-memory test — tests/contract-define-agent-eval.test.ts

Lines 67-90: the 'keeps default evaluate runs in memory' test calls process.chdir(dir) and restores in finally. Vitest runs test files in parallel worker pool; if another file shares this worker and reads cwd during the await evalKit.evaluate() window, it sees the temp dir. Try/finally handles throws but not a process kill mid-test. Low risk in practice (the await is fast, in-memory), but a non-global approach (e.g. asserting runDir starts with 'mem://' directly, or stubbing fs) would be safer. Not blocking.


tangletools · 2026-06-21T12:43:35Z · trace

@tangletools tangletools left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

✅ Auto-approved PR — 5468922c

Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.

tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-21T13:01:29Z

@drewstone

Copy link
Copy Markdown
Contributor Author

Addressed the relevant points from the 5a339d14 audit in 5468922c:

  • Added a 0.94.0 migration note for the canonical AgentProfile hash/id and sourceProfile.kind changes.
  • Made agentProfileId() path-safe and widened the behavior-hash suffix to 16 hex chars.
  • Added a regression test that resource array order is behavior-bearing.
  • Re-exported HostedTenant from /contract.
  • Made default and per-call zero reps fail loudly instead of producing zero-cell runs.
  • Strengthened hosted-tenant override tests to assert every ingest request.
  • Ignored .marketing-agent-runs/ and persisted the critical-audit record under .evolve/critical-audit/2026-06-21T12-57-05Z/.

Reviewed the value-audit lows too: the example console.log output is intentional CLI output, and the inline duration/cost formatting is not a DX blocker. Local gates pass and GitHub Actions is green on 5468922c.

@tangletools tangletools left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟢 Value Audit — sound

Verdict sound
Concerns 2 (2 low)
Heuristic 0.0s
Duplication 0.0s
Interrogation 232.1s (2 bridge agents)
Total 232.1s

💰 Value — sound

Adopts the canonical @tangle-network/agent-interface AgentProfile in agent-eval, adds a thin defineAgentEval() DX wrapper, hardens profile ids, and fails loudly on invalid eval configs.

  • What it does: Replaces local AgentProfile handling with the canonical @tangle-network/agent-interface shape while keeping eval-owned helpers (agentProfileHash, agentProfileId, agentProfileModelId) and profile-cell stamping. Adds defineAgentEval() so callers define scenarios, agent, judge, and baseline once and then call .evaluate() or .improve() with per-call overrides. Hardens profile ids to a pa
  • Goals it achieves: Unify agent profile identity across the Tangle stack on a single canonical shape; reduce boilerplate for the common 'define once, score or improve' flow; prevent silent misconfiguration (zero reps, empty judges, incomplete hosted tenant, colliding profile directory names); and keep the public /contract surface stable and CI-safe.
  • Assessment: Good. The change follows the repo's own layering rule (agent-eval is substrate, depends on agent-interface, no upward deps) and lands in the right files. defineAgentEval is a thin wrapper that delegates to existing runEval and selfImprove rather than duplicating loop logic. The loud failures and migration notes are proportionate for a breaking 0.94.0 release.
  • Better / existing approach: none — this is the right approach. I searched src/ and found no existing 'define once, evaluate/improve' wrapper; the nearest primitives are lower-level (runEval, selfImprove, runImprovementLoop). The hash/id helpers are new but do not duplicate existing profile-identity code; agentProfileHash uses the existing canonicalize helper from pre-registration.ts.
  • Model: opencode/kimi-for-coding/k2p7
  • Bridge attempts: 1

🎯 Usefulness — sound

Replaces agent-eval's local flat AgentProfile with the canonical @tangle-network/agent-interface shape and adds a thin defineAgentEval() DX wrapper that delegates to runEval/selfImprove — substrate-correct, fully wired, all tests + build + verify:package pass.

  • Integration: defineAgentEval is exported from /contract (src/contract/index.ts:186), re-exported in README/docs, and exercised by 3 examples + 9 contract tests (tests/contract-define-agent-eval.test.ts). agentProfileId/Hash/ModelId are consumed by scorecard.ts, run-profile-matrix.ts, multishot/matrix.ts, and the root barrel (src/index.ts:636). No dead surface.
  • Fit with existing patterns: Aligns with the substrate layering rule (CLAUDE.md: agent-eval has zero upward deps; lean substrate). Old flat AgentProfile (id/model/skills/tools) was a duplicate of the canonical nested shape already used by agent-runtime and sandbox; this PR subtracts the duplication. defineAgentEval is documented inline as a DX wrapper over runEval + selfImprove (src/contract/define-agent-eval.ts:86) — no para
  • Real-world viability: Tested: typecheck clean, build clean, verify:package clean, 26 targeted tests pass. Hash semantics intentionally changed (CHANGELOG §0.94.0 calls out the clean greenfield migration). Fail-loud guards on zero reps (define-agent-eval.ts:224) and empty judges (define-agent-eval.ts:216) close the silent-zero-cell path. Hosted-tenant merge test (tests/contract-define-agent-eval.test.ts:164) verifies en
  • Model: opencode/zai-coding-plan/glm-5.2
  • Bridge attempts: 1

🔎 Heuristic Signals

🟡 Cruft: console debug added examples/marketing-agent-canonical/index.ts

  • console.log('Marketing agent self-improvement loop')

🟡 Cruft: magic number added examples/marketing-agent-canonical/index.ts

  • composite mean: ${formatScore(baselineScore)} (took ${((Date.now() - baselineStart) / 1000).toFixed(1)}s, $${baseline.aggregates.totalCostUsd.toFixed(4)}),

What this audit checks

It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.

Pass What it asks
Heuristic Vague title? Whitespace-only or cruft-bearing diff? (content signals only)
Duplication Do added function/class names already exist elsewhere in the repo?
Value Audit What does it do? What goal does it achieve? Is it good? Better architecture or already-exists?
Usefulness Audit Does it integrate and fit? Will it hold up in real use and actually get used?

Findings are concerns, not blocks — the human reviewer decides what to do with them.

value-audit · 20260621T130710Z

@drewstone drewstone merged commit 36fefac into main Jun 21, 2026
1 check passed
@drewstone drewstone mentioned this pull request Jun 21, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants