refactor(profile): use agent-interface AgentProfile#264
Conversation
tangletools
left a comment
There was a problem hiding this comment.
✅ Auto-approved PR — 2423088f
Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-21T09:35:00Z
tangletools
left a comment
There was a problem hiding this comment.
✅ Auto-approved PR — 28d74376
Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-21T09:45:24Z
tangletools
left a comment
There was a problem hiding this comment.
✅ Auto-approved PR — d78c9dcc
Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-21T09:49:43Z
tangletools
left a comment
There was a problem hiding this comment.
🟡 Value Audit — sound-with-nits
| Verdict | sound-with-nits |
| Concerns | 3 (2 low, 1 weak-concern) |
| Heuristic | 0.0s |
| Duplication | 0.0s |
| Interrogation | 170.9s (2 bridge agents) |
| Total | 170.9s |
💰 Value — sound
Replaces agent-eval's local AgentProfile with the canonical @tangle-network/agent-interface shape, moves profile hashing/id/model helpers into eval where they belong, renames the internal prompt-builder type to PromptProfile, and prunes stale sandbox deps and obsolete strategy docs — a coherent subs
- What it does: The change swaps agent-eval's hand-rolled AgentProfile interface (
{ id, model, skills?, promptVersion?, tools?, metadata? }) for the canonical@tangle-network/agent-interfaceAgentProfile ({ name?, description?, version?, tags?, prompt?, model?, permissions?, tools?, mcp?, subagents?, resources?, hooks?, modes?, ... }). It rewrites eval-owned helpers (agentProfileHash,agentProfileId, `a - Goals it achieves: It unifies the profile contract across the Tangle stack so that eval consumers, runtime consumers, and sandbox consumers all speak the same
AgentProfilelanguage. It reduces maintenance burden by eliminating a duplicate local type, a stale peer dependency, and outdated guidance that no longer matches the architecture. It keeps eval as the owner of eval-specific behaviour-identity logic (hash/id/ - Assessment: This is a good change on its merits. It is coherent with the codebase's stated layering (CLAUDE.md: "agent-eval has zero upward dependencies on agent-runtime or agent-knowledge" and "If a type is genuinely substrate, move it INTO agent-eval"). The implementation is careful: the hash is deterministic, order-insensitive where it matters (tags), excludes human-facing labels (name/description), fails
- Better / existing approach: none — this is the right approach. agent-interface does not provide profile hashing/id/model utilities (verified by grep in node_modules/@tangle-network/agent-interface; only
defineAgentProfileandmergeAgentProfilesexist), so eval correctly owns the behaviour-identity helpers. Keeping a local AgentProfile with a mapper would be extra types and conversion overhead; using agent-interface direc - Model: opencode/kimi-for-coding/k2p7
- Bridge attempts: 1
🎯 Usefulness — sound-with-nits
Replaces agent-eval's bespoke AgentProfile with the canonical agent-interface type and adds three eval-owned helpers (id/model-id/hash); the new shape is consumed consistently across scorecard, run-profile-matrix, and the cell builder, with all old-shape accesses gone.
- Integration: Strong. The re-exported AgentProfile and three new helpers (agentProfileId/agentProfileModelId/agentProfileHash) are wired through every internal caller: src/scorecard.ts:117-118, src/campaign/presets/run-profile-matrix.ts:228-229/336/350-351/382-383, src/agent-profile-cell.ts, src/index.ts:635-636. External consumer tax-agent has already updated to agent-eval ^0.94.0 alongside agent-interface ^0.
- Fit with existing patterns: Excellent. The agent-interface package already lives in node_modules as the canonical cross-Tangle profile contract (re-exported by sandbox per its own header docstring), and agent-eval's CLAUDE.md substrate rule explicitly says substrate primitives should not duplicate consumer/runtime types — the old local AgentProfile was that smell. The replacement keeps eval-owned helpers where they belong (h
- Real-world viability: Solid on the paths that matter. agentProfileModelId throws ValidationError on blank/missing model.default (fail-loud per repo doctrine; tested at src/agent-profile.test.ts:65-69). The compact() helper normalizes undefined vs missing fields before hashing so equivalent profiles don't split cells. Hash determinism is tested for tag order, name/description invariance, model/skill/version sensitivity
- Model: opencode/zai-coding-plan/glm-5.2
- Bridge attempts: 1
🔎 Heuristic Signals
🟡 Cruft: console debug added examples/marketing-agent-canonical/index.ts
- console.log('Marketing agent self-improvement loop')
🟡 Cruft: magic number added examples/marketing-agent-canonical/index.ts
composite mean: ${formatScore(baselineScore)} (took ${((Date.now() - baselineStart) / 1000).toFixed(1)}s, $${baseline.aggregates.totalCostUsd.toFixed(4)}),
🎯 Usefulness Audit
🟡 Breaking public-API change without a version bump [integration] ``
package.json remains 0.94.0 despite removing exported symbols (SandboxAgentProfileLike, buildSandboxAgentProfileCell) and reshaping AgentProfile fundamentally (id+model-string -> name+model.default). External consumers pinned by range rather than tarball would break on install. Currently safe only because blueprint-agent pins 0.91.0 by URL and tax-agent coordinated its bump in lockstep. Out of scope for this audit (process/paperwork, not code usefulness), but worth flagging for the release condu
What this audit checks
It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.
| Pass | What it asks |
|---|---|
| Heuristic | Vague title? Whitespace-only or cruft-bearing diff? (content signals only) |
| Duplication | Do added function/class names already exist elsewhere in the repo? |
| Value Audit | What does it do? What goal does it achieve? Is it good? Better architecture or already-exists? |
| Usefulness Audit | Does it integrate and fit? Will it hold up in real use and actually get used? |
Findings are concerns, not blocks — the human reviewer decides what to do with them.
tangletools
left a comment
There was a problem hiding this comment.
✅ Auto-approved PR — 3c16ed69
Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-21T10:05:16Z
tangletools
left a comment
There was a problem hiding this comment.
🟡 Value Audit — sound-with-nits
| Verdict | sound-with-nits |
| Concerns | 4 (2 low, 2 weak-concern) |
| Heuristic | 0.0s |
| Duplication | 0.0s |
| Interrogation | 227.5s (2 bridge agents) |
| Total | 227.5s |
💰 Value — sound-with-nits
Aligns agent-eval's local flat AgentProfile with the canonical @tangle-network/agent-interface shape, adds a thin defineAgentEval() DX wrapper over existing runEval()/selfImprove() primitives, and prunes ~2.4k lines of stale strategy docs — coherent and in the codebase's stated grain.
- What it does: Three concrete deltas: (1) src/agent-profile.ts drops the local {id, model: string, skills?, promptVersion?, tools?, metadata?} shape and re-exports AgentProfile from @tangle-network/agent-interface (nested model.default, prompt, resources, permissions, hooks, modes, etc.); src/agent-profile-cell.ts, src/multishot/{matrix,multishot}.ts, src/campaign/presets/run-profile-matrix.ts, src/scorecard.ts
- Goals it achieves: (a) Make agent-eval use the SAME AgentProfile shape the rest of the Tangle ecosystem uses, so cross-product profile-cell joins (same canonical profile → same sourceProfile.hash across agent-runtime/sandbox/evals) actually work instead of silently disagreeing. This realizes the substrate-direction rule in CLAUDE.md ('types that belong in a shared package should be imported from there, not redefined
- Assessment: Coherent and in-grain. The substrate rule in CLAUDE.md explicitly calls out AgentProfile-style types as cross-package primitives that should not be duplicated; the local flat shape was the outlier against agent-interface/agent-runtime, and the migration closes that gap. The new helpers (agentProfileId, agentProfileModelId) preserve eval-owned behaviour (hashing, RunRecord.model flattening, scoreca
- Better / existing approach: Looked for an existing 'define-once' wrapper or factory pattern; none exists. selfImprove() (src/contract/self-improve.ts:370) and runEval() (src/campaign/presets/run-eval) are the primitives defineAgentEval wraps — no duplication. PromptProfile (src/profile/index.ts:53) is deliberately retained as a distinct layer (builds the system-prompt string that gets rendered into AgentProfile.prompt.system
- Model: opencode/zai-coding-plan/glm-5.2
- Bridge attempts: 2
- Bridge warning: opencode/kimi-for-coding/k2p7: bridge stream ended without value-audit content
🎯 Usefulness — sound
Converges to the canonical @tangle-network/agent-interface AgentProfile type across all consumers, keeps eval-owned helpers, adds defineAgentEval() DX wrapper, and prunes 13 obsolete docs — a clean adoption of the upstream substrate type with no dead surface or competing pattern.
- Integration: The canonical AgentProfile from @tangle-network/agent-interface is imported in 7 source files: agent-profile.ts (re-export + helpers), agent-profile-cell.ts (cell stamping), playback.ts (driver context), multishot.ts, multishot/matrix.ts, and campaign/presets/playback.ts. scorecard.ts and run-profile-matrix.ts consume through the re-export in ./agent-profile. All call sites use the new shape (prof
- Fit with existing patterns: No competing pattern. The old local AgentProfile (flat id/model/skills/tools strings) is fully replaced; no backward-compat shims remain. The distinct PromptProfile / AgentProfileSection types in src/profile/index.ts are a separate concept (structured system prompt sections) and explicitly documented as such (line 15-18). The @tangle-network/agent-interface AgentProfile is the canonical provider-n
- Real-world viability: Tests cover: determinism of hash for identical profiles, tag-order insensitivity, name/description exclusion from hash, model-change sensitivity, resource-change sensitivity, version-change sensitivity, and fail-loud on missing model.default. The cell infrastructure (buildAgentInterfaceProfileCell, verifyAgentProfileCell, assertRunAgentProfileCell) validates cellId integrity, model/promptHash cros
- Model: opencode/deepseek/deepseek-v4-pro
- Bridge attempts: 3
- Bridge warning: opencode/zai-coding-plan/glm-5.2: bridge stream ended without value-audit content; opencode/kimi-for-coding/k2p7: bridge stream ended without value-audit content
🔎 Heuristic Signals
🟡 Cruft: console debug added examples/marketing-agent-canonical/index.ts
- console.log('Marketing agent self-improvement loop')
🟡 Cruft: magic number added examples/marketing-agent-canonical/index.ts
composite mean: ${formatScore(baselineScore)} (took ${((Date.now() - baselineStart) / 1000).toFixed(1)}s, $${baseline.aggregates.totalCostUsd.toFixed(4)}),
💰 Value Audit
🟡 sourceProfile.kind rename is a silent break for any persisted cells [maintenance] ``
src/agent-profile-cell.ts now writes kind='agent-interface-profile' where it previously wrote 'sandbox-agent-profile'. Any downstream consumer that joins on the kind literal (scorecard groupings, hosted ingest filters, cross-product rollups) will treat old + new cells as different profile kinds even when the underlying profile hashes to the same value. The PR body says this is intentional for a greenfield package, but a one-line migration note in docs/ or a constant alias (AGENT_PROFILE_KINDS.LE
🟡 Biome formatter pass inflated the file count on examples that didn't need semantic edits [proportion] ``
A large share of the 46-file diff is trailing-comma + line-wrap churn in examples/marketing-agent-canonical/{agent,judge,scenarios,index}.ts (e.g. agent.ts +35/-8 semantic vs ~60 lines of reformatted call-site args). Not wrong — the repo runs Biome, and the files were touched anyway — but it makes the semantic delta harder to review and bloats the PR. Future: run the formatter on a separate commit or a repo-wide biome --write chore so semantic refactors stay readable.
What this audit checks
It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.
| Pass | What it asks |
|---|---|
| Heuristic | Vague title? Whitespace-only or cruft-bearing diff? (content signals only) |
| Duplication | Do added function/class names already exist elsewhere in the repo? |
| Value Audit | What does it do? What goal does it achieve? Is it good? Better architecture or already-exists? |
| Usefulness Audit | Does it integrate and fit? Will it hold up in real use and actually get used? |
Findings are concerns, not blocks — the human reviewer decides what to do with them.
tangletools
left a comment
There was a problem hiding this comment.
✅ Auto-approved PR — a096a3f8
Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-21T10:11:02Z
tangletools
left a comment
There was a problem hiding this comment.
🟡 Value Audit — sound-with-nits
| Verdict | sound-with-nits |
| Concerns | 3 (2 low, 1 weak-concern) |
| Heuristic | 0.0s |
| Duplication | 0.0s |
| Interrogation | 259.3s (2 bridge agents) |
| Total | 259.3s |
💰 Value — sound
Aligns agent-eval with the shared @tangle-network/agent-interface AgentProfile, removes stale sandbox/docs baggage, and adds a clean defineAgentEval() DX wrapper over runEval/selfImprove; verified end-to-end.
- What it does: Three concrete moves: (1) replaces agent-eval's local AgentProfile shape with the canonical @tangle-network/agent-interface AgentProfile, keeping eval-owned helpers for id/model/hashing/cell-stamping; (2) adds defineAgentEval() in src/contract/define-agent-eval.ts, a thin factory that lets callers define scenarios/agent/judge/baseline once and then call .evaluate() or .improve() with per-call over
- Goals it achieves: Makes the eval substrate consistent with the stack's shared agent-interface contract (the repo's own layering rule says agent-eval must not depend upward on consumers), collapses the common eval+improve DX into one object, deletes misleading roadmap/Phase-B/spec docs so surviving docs point at current concepts, and makes declaration emission CI-safe.
- Assessment: Good. The profile migration is exactly what the codebase's layering rule demands; the hash semantics are preserved and tested. defineAgentEval is a small, well-typed wrapper that does not hide the underlying runEval/selfImprove primitives. The doc cleanup and build change are pragmatic. All verification passes: pnpm typecheck, pnpm test (2486 passed), pnpm build, pnpm verify:package, and the quick
- Better / existing approach: none — this is the right approach. The codebase had its own local AgentProfile; moving to the shared agent-interface shape removes duplication and follows the documented dependency direction. defineAgentEval is additive; the functional runEval/selfImprove entry points remain available for callers that need them. No existing helper already provides the define-once/evaluate-or-improve pattern.
- Model: opencode/kimi-for-coding/k2p7
- Bridge attempts: 1
🎯 Usefulness — sound-with-nits
Cleanly swaps the local AgentProfile for the canonical @tangle-network/agent-interface shape, retargets the public surface (cell builder, scorecard, profile matrix, multishot, playback) onto it, and adds a small, well-tested defineAgentEval() DX wrapper — all in the grain of the codebase.
- Integration: Strong. defineAgentEval is exported from /contract (src/contract/index.ts:185), is documented as #1 of 'the five functions you'll call', is used in README.md:63, docs/concepts.md:14, docs/customer-journeys.md:61, examples/selfimprove-quickstart/index.ts:11, and is covered by 4 contract tests at tests/contract-define-agent-eval.test.ts. agentProfileId/agentProfileModelId have real internal callers
- Fit with existing patterns: Strong. CLAUDE.md's layering rule forbids upward deps on agent-runtime/agent-knowledge only; @tangle-network/agent-interface is a substrate sibling (its package.json shows only a zod runtime dep), and README.md:167 explicitly blesses it as the shared contract. The rename of profile.AgentProfile -> profile.PromptProfile (src/profile/index.ts:55) dissolves the exact name clash the old comment block
- Real-world viability: Reasonable on the happy paths and common edge cases. agentProfileModelId throws loudly on missing model.default (src/agent-profile.ts:22) — preserves the old fail-loud guarantee. agentProfileId falls back to a hash-derived id when name/version are absent (src/agent-profile.ts:12), so optional agent-interface fields don't crash. defineAgentEval's mem:// runDir default avoids the filesystem-pollutio
- Model: opencode/zai-coding-plan/glm-5.2
- Bridge attempts: 1
🔎 Heuristic Signals
🟡 Cruft: console debug added examples/marketing-agent-canonical/index.ts
- console.log('Marketing agent self-improvement loop')
🟡 Cruft: magic number added examples/marketing-agent-canonical/index.ts
composite mean: ${formatScore(baselineScore)} (took ${((Date.now() - baselineStart) / 1000).toFixed(1)}s, $${baseline.aggregates.totalCostUsd.toFixed(4)}),
🎯 Usefulness Audit
🟡 resources.skills order now affects agentProfileHash (old shape sorted) [robustness] ``
The previous AgentProfile treated skills/tools as sets: src/agent-profile.ts used to do
[...(profile.skills ?? [])].sort()and the same for tools, so reordering did not split a scorecard cell. The new hash (src/agent-profile.ts:46-54) sortstagsbut does NOT sort array-of-object resource bundles likeresources.skills,resources.tools,resources.agents. The test at src/agent-profile.test.ts:40 only asserts 'adding a skill changes the hash', not order-insensitivity. For the common case o
What this audit checks
It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.
| Pass | What it asks |
|---|---|
| Heuristic | Vague title? Whitespace-only or cruft-bearing diff? (content signals only) |
| Duplication | Do added function/class names already exist elsewhere in the repo? |
| Value Audit | What does it do? What goal does it achieve? Is it good? Better architecture or already-exists? |
| Usefulness Audit | Does it integrate and fit? Will it hold up in real use and actually get used? |
Findings are concerns, not blocks — the human reviewer decides what to do with them.
❌ Needs Work —
|
| deepseek | glm | aggregate | |
|---|---|---|---|
| Readiness | 35 | 42 | 35 |
| Confidence | 95 | 95 | 95 |
| Correctness | 35 | 42 | 35 |
| Security | 35 | 42 | 35 |
| Testing | 35 | 42 | 35 |
| Architecture | 35 | 42 | 35 |
Full multi-shot audit completed 8/8 planned shots over 63 changed files. Global verifier still owns final merge decision. | Full multi-shot audit completed 8/8 planned shots over 63 changed files. Global verifier still owns final merge decision.
Blocking
🔴 HIGH New public API functions agentProfileId and agentProfileModelId have zero test coverage — src/agent-profile.ts
agentProfileId (line 12) and agentProfileModelId (line 22) are new public exports consumed by src/scorecard.ts:118, src/campaign/presets/run-profile-matrix.ts:228-229, and re-exported via src/index.ts:636. Neither function has a single direct test. The test file src/agent-profile.test.ts only imports and tests agentProfileHash. Run-profile-matrix.ts wraps agentProfileModelId errors with profileId (line 370-372) and scorecard.ts uses
Other
🟠 MEDIUM No unit tests for agentProfileId or agentProfileModelId — src/agent-profile.ts
src/agent-profile.test.ts imports only
agentProfileHash(line 2). The two new exported helpers —agentProfileId(whitespace trim, name→version→hash fallback chain) andagentProfileModelId(the throw on missing/blankmodel.default) — have no direct unit tests. agentProfileModelId's throw is only indirectly covered via agentProfileHash's throw test at line 66. The whitespace, fallback-ordering, andversion-collision behaviours of agentProfileId are unverified. Add a describe block covering: trim, name-wins-over-version, version-fallback, hash-fallba
🟠 MEDIUM agentProfileId can return non-unique labels across distinct profiles — src/agent-profile.ts
Lines 12-16:
agentProfileIdreturnsprofile.name?.trim() || profile.version?.trim(), falling back to version when name is missing. version is behaviour-bearing (affects agentProfileHash), so two profiles with different skills/models but the sameversion: 'v3'and nonameboth get id'v3'. In src/campaign/presets/run-profile-matrix.ts:336profileIds = opts.profiles.map(agentProfileId)feeds bothexperimentId(line 339) andmatrixId([line 340](https://github.com/tangle-network/agent-eval/blob/a096a3f8927999a6281f9320dffdadb097e7cea1/src
🟠 MEDIUM agentProfileModelId error message lost profile-identifying information — src/agent-profile.ts
Line 25: throws ValidationError with bare string 'AgentProfile has no model.default — cannot record eval run'. The old code (base commit 9e6f0d1) included the profile id:
AgentProfile "${profile.id}" has no model — cannot hash. Direct callers like scorecard.ts:118 get no clue which profile failed. The run-profile-matrix.ts caller (line 370-372) wraps the error with profileId, but other callers rely on the error message alone. Could include profile.name ?? 'unknown' or the hash prefix.
🟠 MEDIUM Shallow merge silently drops nested-object fields in improve() partial overrides — src/contract/define-agent-eval.ts
mergeDefined(lines 151-158) does key-level shallow merge. Inimprove()(lines 111-118), when a caller passes a partial override for object-typed fields likehostedTenant(which has required fieldsendpoint,apiKey,tenantId— see src/hosted/client.ts:29-42), the entire object is replaced. Example: defaults sethostedTenant: { endpoint: 'https://orig', apiKey: 'k1', tenantId: 't1' }, caller passeshostedTenant: { endpoint: 'https://new' }—apiKeyandtenantIdare silently lost, causing runtime auth
🟡 LOW README retains Phase-B/pairing language that judge.ts and index.ts removed — examples/marketing-agent-canonical/README.md
The PR systematically strips Phase-B/pairing wording from index.ts (titles, report name, run dir) and from judge.ts header (removed 'STRAWMAN the founder modifies during the Phase B pairing'), but README.md missed two spots: line 13 ('Strawman — partner modifies during the pairing.') and line 63 ('co-author during the pairing's hour 3'). These now contradict judge.ts's updated header ('Treat these dimensions as the product quality bar for the demo'). Impact: doc inconsistency introduced by the same PR; no cod
🟡 LOW Stale 'partner/pairing' language remains in README — examples/marketing-agent-canonical/README.md
Two references to the old 'partner/pairing' framing were not updated: (1) Line 13 table row for judge.ts still reads 'Strawman — partner modifies during the pairing.' — while judge.ts itself was rebranded to 'Treat these dimensions as the product quality bar'. (2) Line 63 still says 'co-author during the pairing's hour 3' — the surrounding section was rebranded to 'Swap for a real product'. These are inconsistent with the thorough rebranding applied everywhere else in this shot. Fix: update both to match the
🟡 LOW Breaking change to agentProfileHash output with no scorecard migration path — src/agent-profile.ts
Old hash inputs:
{model: string, skills: string[], promptVersion, tools: string[], metadata}. New hash inputs: full nested AgentProfile (model.default, resources.skills content, permissions, mcp, hooks, modes, etc.). Any pre-PR scorecard JSONL persisted under the old schema (src/scorecard.ts:209 keysprofilesbyprofileHash) will have stale hashes that no longer match what agentProfileHash produces. loadScorecard happily reads them but diffScorecard will treat old and new as different cells. The schema swap is intentional/coordinated, but there is no version marker on scorecard log lines and no migration helper. Acceptable for a coordinated rewrite, but call it out in the PR description and bump the scorecard format tag if one exists.
🟡 LOW agentProfileId fallback derives ID from behavior hash — semantic shift for unnamed profiles — src/agent-profile.ts
Line 12-16: when profile.name and profile.version are both empty, agentProfileId falls back to
profile-${agentProfileHash(profile).slice(0,12)}. This means the profileId (used as runId/candidateId/runDir components) changes when the profile's behavior changes. Previously profile.id was a stable caller-supplied label independent of behavior. The fallback is clearly documented and only activates for unnamed/unversioned profiles, but consumers relying on profileId stability across profile mutations would be affected. Not a bug — just a behavioral note for callers passing unnamed profiles.
🟡 LOW agentProfileId has unreachable fallback for model-less profiles — src/agent-profile.ts
Line 12-15: agentProfileId falls back to
profile-<hash>when name and version are both empty. However agentProfileHash (line 47) calls agentProfileModelId which throws on missing model.default, meaning the hash fallback path is unreachable for model-less profiles — the function throws before reaching line 15. Either the fallback should handle these gracefully, or the docstring should note the throw-on-no-model behavior.
🟡 LOW agentProfileModelId error message no longer identifies which profile is broken — src/agent-profile.ts
Line 25: the new error
'AgentProfile has no model.default — cannot record eval run'drops the profile identity that the old code included (AgentProfile "${profile.id}" has no model). In runProfileMatrix's preflight loop (line 349-351), agentProfileHash/agentProfileModelId are called OUTSIDE the try/catch, so this error propagates uncaught — the caller sees a generic error with no indication of which profile in opts.profiles is the offender. The old error at least named the profile.id. Fix: either pass an optional label arg to agentProfileModelId
🟡 LOW compact helper keeps null but strips undefined — subtle hash difference — src/agent-profile.ts
Line 30-36: compact() uses
value !== undefinedas the filter, so null values survive into the model object serialized in agentProfileHash. If a caller passesprofile.model.provider: null, it produces a different hash than omitting the field entirely. This is likely intentional (null = explicitly empty vs undefined = absent) and aligns with the 'no fallbacks' doctrine, but is undocumented, and JSON naturally treats null and absent differently. Not a bug, but worth a one-line comment.
🟡 LOW compact() return type is a lie — claims T but returns a subset — src/agent-profile.ts
Lines 30-36:
function compact<T extends Record<string, unknown>>(input: T): Treturnsout as Tafter dropping undefined-valued keys. The runtime shape is correct (JSON.stringify drops undefined anyway), so the cast is harmless at runtime, but the signature overstates the invariant. Since this is a private helper and the only caller reconstructs the model object immediately, the practical risk is zero. Either rename to make the cast explicit (compact<T>(input: T): Partial<T>) or inline it once — not worth a dedicated helper for a single use site.
🟡 LOW resources.skills array order now affects hash — semantic change undocumented — src/agent-profile.ts
Old code sorted
skillsandtoolsbefore hashing (order-insensitive). New code only sortstags(line 52);resources.skills,resources.tools,resources.agents,resources.commands,resources.filesare hashed in array order. canonicalize() in src/pre-registration.ts preserves array order (maps, doesn't sort). This is defensible — skill mount order can change agent behaviour — but the docstring at lines 38-44 doesn't state it. If two consumers mount the same skills in different order, they get different cells. Add one line to the docstring:
🟡 LOW Inconsistent AgentProfile import paths across changed files — src/campaign/presets/playback.ts
Line 14: playback.ts imports
AgentProfilefrom@tangle-network/agent-interfacedirectly, while run-profile-matrix.ts (line 37-42) imports it from../../agent-profile(which re-exports the same type). Both resolve to the identical type — the inconsistency is purely cosmetic. Convention would be: import from../../agent-profilewhen you also need the helper functions (run-profile-matrix does), import the type-only from either path otherwise. No fix required.
🟡 LOW Inconsistent import path for AgentProfile type — src/campaign/presets/playback.ts
playback.ts imports
AgentProfiledirectly from@tangle-network/agent-interface(line 14) while run-profile-matrix.ts imports from../../agent-profile(line 38-42). Both resolve to the same type since agent-profile.ts re-exports from agent-interface, but the inconsistency could confuse future readers. Harmless — no behavioral impact.
🟡 LOW evaluate() treats judges: [] as an explicit empty judge list, silently disabling scoring — src/contract/define-agent-eval.ts
Line 106:
judges: judges ?? [judge ?? defaults.judge]. The ?? operator only falls back on null/undefined, so a caller passingjudges: [](empty array) gets zero judges wired into RunCampaignOptions.judges. run-campaign.ts:121 (const judges = opts.judges ?? []) then runs every cell without scoring, producing a CampaignResult whose aggregates.byJudge is empty. This is consistent with the docstring ('Ignored when judges is set') and with runCampaign's own behavior, but it is a silent footgun — no error, just an unscored result. Impact: low (documented, recoverable, caller error). Optional fix: coerce empty arrays to the default, e.g. `judges: (judges && ju
🟡 LOW improve() shallow-merges non-budget nested options, surprising callers who override one nested field — src/contract/define-agent-eval.ts
Lines 111-118 + mergeDefined (151-158): only
budgetgets field-by-field merge. If a caller doesimprove({ llm: { model: 'x' } }), the entire defaults.llm object is replaced, dropping any baseUrl/apiKey set on the definition. The improve() docstring (line 65-67) documents shallow-merge for everything except budget, so this is by design — but it is the one place eager callers will trip. Impact: low (documented, recoverable by passing the full nested object). No fix required; consider mentioning llm/proposer by name in
🟡 LOW Budget-merge test asserts outcomes, not the merged budget itself — tests/contract-define-agent-eval.test.ts
Test 4 ('runs selfImprove and merges budget overrides without dropping default knobs') asserts
generationsExplored === 1,winner.surface === 'base better', andlift > 0. These prove the loop ran end-to-end, but do not explicitly assert thatholdoutFraction: 0.5andgenerations: 1survived themergeBudgetcall when onlypopulationSizewas overridden — the test name's claim ('without dropping default knobs') is verified only indirectly (non-empty train/holdout split would also succeed at fraction 0.25 default). Consider asserting against an introspectable merged budget, or adding a parallel test that varies holdoutFraction to prove the default is preserved vs. reset.
🟡 LOW Negative fs-write assertion is brittle — tests/contract-define-agent-eval.test.ts
Test 2 ('keeps default evaluate runs in memory...') asserts
expect(existsSync(join(dir, 'mem:'))).toBe(false)— this only verifies that no literal directory namedmem:was created. It would not catch accidental fs writes to any other path (e.g. a stray fsCampaignStorage default under a different key). Stronger form:expect(readdirSync(dir)).toHaveLength(0)to prove the temp dir stays empty. Not a bug — the underlying implementation does the right thing via themem://storage selector at define-agent-eval.ts:94-98 — but the test under-claims the contract it claims to lock down.
🟡 LOW mem: canary check is narrower than intended — tests/contract-define-agent-eval.test.ts
Line 97:
existsSync(join(dir, 'mem:'))checks for a literalmem:directory entry. The actual in-memory runDir ismem://defineAgentEval-<timestamp>. If filesystem leakage occurred through direct path operations, the created entry would be the full URI string, not justmem:, so the check would miss it. The in-memory storage backend (inMemoryCampaignStorage()) is the correct guard; this canary is supplementary. Consider checkingexistsSync(join(dir, 'mem://'))or using a glob formem:*patterns for broader coverage.
tangletools · 2026-06-21T10:33:50Z · trace
tangletools
left a comment
There was a problem hiding this comment.
❌ 1 Blocking Finding — a096a3f8
Full multi-shot audit completed 8/8 planned shots over 63 changed files. Global verifier still owns final merge decision. | Full multi-shot audit completed 8/8 planned shots over 63 changed files. Global verifier still owns final merge decision.
Full immutable report for this review: trace
Summary comment for this run: full summary
tangletools · 2026-06-21T10:33:50Z · immutable trace
tangletools
left a comment
There was a problem hiding this comment.
✅ Auto-approved PR — 5a339d14
Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-21T12:17:16Z
tangletools
left a comment
There was a problem hiding this comment.
🟢 Value Audit — sound
| Verdict | sound |
| Concerns | 2 (2 low) |
| Heuristic | 0.0s |
| Duplication | 0.0s |
| Interrogation | 273.1s (2 bridge agents) |
| Total | 273.1s |
💰 Value — sound
Refactors agent-eval onto the canonical @tangle-network/agent-interface AgentProfile, adds a defineAgentEval() DX wrapper with fail-loud overrides, and fixes the declaration build — a coherent, worthwhile change.
- What it does: Replaces agent-eval's local AgentProfile interface with the canonical type from @tangle-network/agent-interface (src/agent-profile.ts:1-6), keeps eval-specific hashing/id/model helpers (src/agent-profile.ts:13-31), introduces
defineAgentEval()as a once-define / call-many wrapper overrunEval()andselfImprove()(src/contract/define-agent-eval.ts:89-132), makes.evaluate({ judges: [] })th - Goals it achieves: Aligns the eval substrate with the shared agent-interface profile contract so consumers don't translate between two profile shapes; gives product agents a single object they define once and call
.evaluate()or.improve()on; eliminates silent unscored runs and silent partial-tenant misconfiguration; removes dead sandbox/docs surface; and makes declaration emission CI-reliable after prior heap - Assessment: Good. The change is internally consistent: it reuses existing primitives (
canonicalize,runEval,selfImprove,runImprovementLoop) instead of reimplementing them, respects the repo's substrate-upward-dependency rule by deleting the sandbox peer/dev dep, and the new helpers are directly tested (src/agent-profile.test.ts, tests/contract-define-agent-eval.test.ts, tests/campaign/run-profile-ma - Better / existing approach: none — this is the right approach. I checked: @tangle-network/agent-interface exposes the AgentProfile type and
mergeAgentProfiles, but no profile hashing/id/model-id helpers (node_modules/@tangle-network/agent-interface/dist/agent-profile.d.ts), so eval keeping its own hash/id helpers is correct. No existing shallow-merge utility forSelfImproveBudget/SelfImproveLlm/HostedTenantexists in - Model: opencode/kimi-for-coding/k2p7
- Bridge attempts: 1
🎯 Usefulness — sound
A coherent refactor that consolidates on the canonical @tangle-network/agent-interface AgentProfile, drops the redundant @tangle-network/sandbox dep, and adds a thin defineAgentEval() DX wrapper — fits the substrate grain, fail-loud defaults hold up, all callers updated, typecheck + targeted tests p
- Integration: Fully wired. @tangle-network/agent-interface@^0.10.0 is a real dep (package.json:209) and exports AgentProfile (node_modules/@tangle-network/agent-interface/dist/agent-profile.d.ts:206). The new helpers (agentProfileId, agentProfileModelId, agentProfileHash) are re-exported from the root barrel (src/index.ts:636) and consumed by scorecard (src/scorecard.ts:117-118) and runProfileMatrix (src/campai
- Fit with existing patterns: Matches the substrate grain. CLAUDE.md's layering rule says agent-eval has zero upward deps on consumers; agent-interface is the canonical type home (per its package docstring: 'This is the canonical home; @tangle-network/sandbox re-exports these symbols'), so pulling the type DOWN into agent-eval is correct layering, not an inversion. defineAgentEval follows the existing DX-wrapper pattern (selfI
- Real-world viability: Holds up. Fail-loud paths are wired correctly: agentProfileModelId throws ValidationError naming the broken profile when model.default is missing/blank (src/agent-profile.ts:24-29); agentProfileId propagates that throw (no silent fallback) — exercised by tests at src/agent-profile.test.ts:96-100. defineAgentEval.evaluate({ judges: [] }) throws 'judges must not be empty' instead of producing an uns
- Model: opencode/zai-coding-plan/glm-5.2
- Bridge attempts: 1
🔎 Heuristic Signals
🟡 Cruft: console debug added examples/marketing-agent-canonical/index.ts
- console.log('Marketing agent self-improvement loop')
🟡 Cruft: magic number added examples/marketing-agent-canonical/index.ts
composite mean: ${formatScore(baselineScore)} (took ${((Date.now() - baselineStart) / 1000).toFixed(1)}s, $${baseline.aggregates.totalCostUsd.toFixed(4)}),
What this audit checks
It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.
| Pass | What it asks |
|---|---|
| Heuristic | Vague title? Whitespace-only or cruft-bearing diff? (content signals only) |
| Duplication | Do added function/class names already exist elsewhere in the repo? |
| Value Audit | What does it do? What goal does it achieve? Is it good? Better architecture or already-exists? |
| Usefulness Audit | Does it integrate and fit? Will it hold up in real use and actually get used? |
Findings are concerns, not blocks — the human reviewer decides what to do with them.
✅ No Blockers —
|
| deepseek | glm | aggregate | |
|---|---|---|---|
| Readiness | 73 | 65 | 65 |
| Confidence | 95 | 95 | 95 |
| Correctness | 73 | 65 | 65 |
| Security | 73 | 65 | 65 |
| Testing | 73 | 65 | 65 |
| Architecture | 73 | 65 | 65 |
Full multi-shot audit completed 8/8 planned shots over 63 changed files. Global verifier still owns final merge decision. | Full multi-shot audit completed 8/8 planned shots over 63 changed files. Global verifier still owns final merge decision.
🟠 MEDIUM Breaking hash algorithm change with no migration path visible — src/agent-profile.ts
The old agentProfileHash (lines removed) hashed only {model, skills, promptVersion, tools, metadata} with sorted arrays and null defaults for absent fields. The new hash (line 55-66) spreads the entire AgentProfile from @tangle-network/agent-interface minus name/description/tags-sorted. This is a structural incompatibility — old hashes and new hashes can never match, even for semantically identical configurations, because the old model field was a flat string while the new is {default: string, ...hints}. All existing scorecard JSONL files, matrix results, and any artifacts keyed by profileHash become orphaned. Impact: historical scorecard comparisons, diffScorecard bas
🟡 LOW Run output directory not in .gitignore — examples/marketing-agent-canonical/index.ts
index.ts:96 writes to
.marketing-agent-runs/<timestamp>/..gitignore:19only excludes.runs/, not.marketing-agent-runs/. Pre-existing (old.phase-b-runs/had the same gap), but the rename was the natural moment to add.marketing-agent-runs/to .gitignore. Risk: a user running the demo thengit add .accidentally commits run artifacts + generated reports. Fix: add.marketing-agent-runs/to .gitignore. Not a regression — the rename preserves the pre-existing gap exactly.
🟡 LOW Documented 'resource array order is hash-bearing' invariant has no test — src/agent-profile.ts
Comment at agent-profile.ts:50-53 makes an explicit correctness claim: 'Resource array order is hash-bearing because mount order can change agent behaviour.' The implementation honour this (resources spread through without sorting), but src/agent-profile.test.ts has no test locking it in. A future contributor adding
.sort()to resources would silently flip the invariant and corrupt every persisted cell key. Addexpect(agentProfileHash({...base, resources: {files: [A,B]}})).not.toBe(agentProfileHash({...base, resources: {files: [B,A]}}))alongside the existing tag-order test at line 27-31.
🟡 LOW agentProfileId collision guarantee is probabilistic at 48 bits — src/agent-profile.ts
agentProfileId (line 13-16) truncates SHA-256 to 12 hex characters (48 bits) with a docstring claiming 'Collision-resistant' and 'two profiles must not collapse onto one row.' The birthday bound for 48 bits is ~2^24 ≈ 16M profiles for a 50% collision chance. With the label prefix (name or version), same-label collisions require both profiles to share the label AND collide on 48 bits, making it practically safe for typical eval matrices. However, the guarantee is probabilistic, not deterministic, and a full 64-char hash is available elsewhere (agentProfileHash for scorecard joins). Consider documenting the probabilistic bound so callers can assess risk for their scale.
🟡 LOW agentProfileId doc claims 'directory names' but does not sanitize path-unsafe label characters — src/agent-profile.ts
Doc at agent-profile.ts:9-12 says the id is for 'run ids, matrix keys, and directory names where two profiles must not collapse onto one row.' Implementation at line 14 uses
profile.name?.trim() || profile.version?.trim()verbatim — a name like 'sonnet/legal' or 'sonnet:legal' would produce id 'sonnet/legal-', which is unsafe as a filesystem path or matrix-key substring. Today no caller uses it as a directory name (verified: run-profile-matrix.ts only feeds it into runId/candidateId/sha inputs, all opaque), so no production bug. Either narrow the doc to 'string identifiers' or sanitize/,:, whitespace in agentProfileLabel before composing the id.
🟡 LOW compact() strips only top-level undefined — insufficient if model hints grow nested fields — src/agent-profile.ts
The compact() helper (line 37-43) removes only top-level undefined values from the model hints object. If profile.model ever grows nested structures (e.g., model: {default: 'claude', hints: {timeoutMs: undefined, provider: 'anthropic'}}), the nested undefined (hints.timeoutMs) would pass through to the hash, making {hints: {timeoutMs: undefined, provider: 'x'}} differ from {hints: {provider: 'x'}} despite semantic equivalence. Currently safe because the @tangle-network/agent-interface model type appears flat, but this is a latent fragility for future evolution.
🟡 LOW Breaking change: byProfile/campaigns keys now hash-suffixed — src/campaign/presets/run-profile-matrix.ts
Lines 438 & 443:
campaigns[profileId]andbyProfile[profileId]now key onagentProfileId(profile)=${label}-${hash.slice(0,12)}instead of the old user-suppliedprofile.id. Two profiles that previously shared a label (e.g. two 'sonnet-v3' iterations) no longer collide, which is the intended fix. Impact: consumers indexingresult.byProfile['sonnet-v3']silently get undefined after upgrade. TypeScript catches directprofile.idaccess at the call site (AgentProfile no longer has.id), but a consumer that stored the old key string and looks it up later would silently miss. Recommend a CHANGELOG note framing this as an intentional key-sha
🟡 LOW agentProfileHash recomputed multiple times per profile — src/campaign/presets/run-profile-matrix.ts
Lines 349-351, 381-383, and inside
buildRunRecord(called per cell, line 228-229):agentProfileHashandagentProfileModelIdare re-derived in three separate code regions for the same profile. sha256 over a small canonical object is microseconds, so this is not a hot-path concern — but for matrices with 10k+ cells, memoizing the per-profile{profileHash, profileId, model}triple once at the top of the profile loop would remove the redundancy. Cosmetic; not blocking.
🟡 LOW Shared defaults.runDir between evaluate() and improve() can interact with resumable cell caching — src/contract/define-agent-eval.ts
When a caller sets
runDiron the definition (e.g.defineAgentEval({ runDir: '/tmp/run1', ... })), bothevaluate()(define-agent-eval.ts:102,runDir ?? defaults.runDir ?? mem://...) andimprove()(via mergeDefined → selfImprove, line 425 of self-improve.ts) will use the same/tmp/run1. runCampaign defaultsresumable: true(run-campaign.ts:118), caching completed cells by(manifestHash, scenarioId, rep, generation). A baselineevaluate()run followed byimprove()could therefore reuse cached baseline cells — which is arguably desirable (no recompute) but is undocumented at the defineAgentEval layer and could surprise callers who expect isola
🟡 LOW evaluateDefaults forwards reps=0 without guard, producing zero-cell campaign — src/contract/define-agent-eval.ts
Line 151:
if (defaults.budget?.reps !== undefined) out.reps = defaults.budget.reps— whenbudget.reps === 0, this check passes and forwards 0 torunCampaign, which runs 0 cells (empty result, all means zero).selfImprovedefaultsrepsto 1; exposing the zero case through the DX wrapper without validation is a footgun. Guard:if (defaults.budget?.reps !== undefined && defaults.budget.reps > 0)or document that reps=0 is valid but degenerate.
🟡 LOW mergeDefined uses unsound as T cast when copying Partial overrides into a full-type defaults object — src/contract/define-agent-eval.ts
At define-agent-eval.ts:183-190,
mergeDefined<T>(defaults: T, overrides: object | undefined): TiteratesObject.entries(overrides)and assigns each value intomergedtyped asT, then returnsmerged as T. Inimprove(),optsisAgentEvalImproveOptions=Omit<Partial<SelfImproveOptions>, 'budget'|'hostedTenant'|'llm'> & { budget?: Partial<...>; ... }. The Partial / Partial / Partial values from opts get assigned intomerged(typed as if they were full HostedTenant / SelfImproveBudget / SelfImproveLlm). The subsequent spreads...(budget ? { budget } : {})etc. (lines 125-129) overwrite th
🟡 LOW AgentEvalImproveOptions references Partial but HostedTenant is not re-exported from /contract — src/contract/index.ts
Line 54:
AgentEvalImproveOptionsincludeshostedTenant?: Partial<HostedTenant>. Consumers callingevalKit.improve({ hostedTenant: {...} })get correct type inference, but if they want to explicitly type the hostedTenant value before passing it,HostedTenantis not available at/contract. The type is exported atsrc/hosted/client.tsbut not re-exported here. Consider re-exportingHostedTenantin index.ts for DX completeness.
🟡 LOW Budget-merge test cannot distinguish merge correctness from gate-early-ship — tests/contract-define-agent-eval.test.ts
Lines 117-134: 'runs selfImprove and merges budget overrides' asserts
result.generationsExplored === 1. Defaults setgenerations: 1; override is{ populationSize: 1 }. But with a deterministic proposer returning a 0.9-scoring surface vs 0.3 baseline, the defaultProductionGate (deltaThreshold 0.05) ships on generation 1 regardless of the max — so generationsExplored would be 1 even if the merge had droppedgenerationsentirely (it would fall back to selfImprove's default of 3 and still stop at 1). The merge logic at define-agent-eval.ts:155-167 is correct, but this test does not prove it. Same issue for populationSize: the proposer always r
🟡 LOW Hosted-tenant URL assertion depends on call ordering across 3 fetches — tests/contract-define-agent-eval.test.ts
Lines 152-191: the mock fetchImpl captures only the LAST request into a single
requestvariable. The improve() path actually fires 3 POSTs in order: (1) emitLoopProvenance→ingestEvalRun (/v1/ingest/eval-runs), (2) emitLoopProvenance→ingestTraces (/v1/ingest/traces), (3) shipEvalRunToHosted→ingestEvalRuns (/v1/ingest/eval-runs). The assertionrequest?.url === 'https://new.example/v1/ingest/eval-runs'only passes because step (3) happens to be last. If shipEvalRunToHosted is ever removed or reordered before emitLoopProvenance, the URL would become /v1/ingest/traces and the test would break for the wrong reason. The header assertions (authorizati
🟡 LOW process.chdir mutates process-global state during in-memory test — tests/contract-define-agent-eval.test.ts
Lines 67-90: the 'keeps default evaluate runs in memory' test calls process.chdir(dir) and restores in finally. Vitest runs test files in parallel worker pool; if another file shares this worker and reads cwd during the
await evalKit.evaluate()window, it sees the temp dir. Try/finally handles throws but not a process kill mid-test. Low risk in practice (the await is fast, in-memory), but a non-global approach (e.g. asserting runDir starts with 'mem://' directly, or stubbing fs) would be safer. Not blocking.
tangletools · 2026-06-21T12:43:35Z · trace
tangletools
left a comment
There was a problem hiding this comment.
✅ Auto-approved PR — 5468922c
Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-21T13:01:29Z
|
Addressed the relevant points from the
Reviewed the value-audit lows too: the example |
tangletools
left a comment
There was a problem hiding this comment.
🟢 Value Audit — sound
| Verdict | sound |
| Concerns | 2 (2 low) |
| Heuristic | 0.0s |
| Duplication | 0.0s |
| Interrogation | 232.1s (2 bridge agents) |
| Total | 232.1s |
💰 Value — sound
Adopts the canonical @tangle-network/agent-interface AgentProfile in agent-eval, adds a thin defineAgentEval() DX wrapper, hardens profile ids, and fails loudly on invalid eval configs.
- What it does: Replaces local AgentProfile handling with the canonical
@tangle-network/agent-interfaceshape while keeping eval-owned helpers (agentProfileHash,agentProfileId,agentProfileModelId) and profile-cell stamping. AddsdefineAgentEval()so callers define scenarios, agent, judge, and baseline once and then call.evaluate()or.improve()with per-call overrides. Hardens profile ids to a pa - Goals it achieves: Unify agent profile identity across the Tangle stack on a single canonical shape; reduce boilerplate for the common 'define once, score or improve' flow; prevent silent misconfiguration (zero reps, empty judges, incomplete hosted tenant, colliding profile directory names); and keep the public
/contractsurface stable and CI-safe. - Assessment: Good. The change follows the repo's own layering rule (agent-eval is substrate, depends on agent-interface, no upward deps) and lands in the right files.
defineAgentEvalis a thin wrapper that delegates to existingrunEvalandselfImproverather than duplicating loop logic. The loud failures and migration notes are proportionate for a breaking 0.94.0 release. - Better / existing approach: none — this is the right approach. I searched
src/and found no existing 'define once, evaluate/improve' wrapper; the nearest primitives are lower-level (runEval,selfImprove,runImprovementLoop). The hash/id helpers are new but do not duplicate existing profile-identity code;agentProfileHashuses the existingcanonicalizehelper frompre-registration.ts. - Model: opencode/kimi-for-coding/k2p7
- Bridge attempts: 1
🎯 Usefulness — sound
Replaces agent-eval's local flat AgentProfile with the canonical @tangle-network/agent-interface shape and adds a thin defineAgentEval() DX wrapper that delegates to runEval/selfImprove — substrate-correct, fully wired, all tests + build + verify:package pass.
- Integration: defineAgentEval is exported from /contract (src/contract/index.ts:186), re-exported in README/docs, and exercised by 3 examples + 9 contract tests (tests/contract-define-agent-eval.test.ts). agentProfileId/Hash/ModelId are consumed by scorecard.ts, run-profile-matrix.ts, multishot/matrix.ts, and the root barrel (src/index.ts:636). No dead surface.
- Fit with existing patterns: Aligns with the substrate layering rule (CLAUDE.md: agent-eval has zero upward deps; lean substrate). Old flat AgentProfile (id/model/skills/tools) was a duplicate of the canonical nested shape already used by agent-runtime and sandbox; this PR subtracts the duplication. defineAgentEval is documented inline as a DX wrapper over runEval + selfImprove (src/contract/define-agent-eval.ts:86) — no para
- Real-world viability: Tested: typecheck clean, build clean, verify:package clean, 26 targeted tests pass. Hash semantics intentionally changed (CHANGELOG §0.94.0 calls out the clean greenfield migration). Fail-loud guards on zero reps (define-agent-eval.ts:224) and empty judges (define-agent-eval.ts:216) close the silent-zero-cell path. Hosted-tenant merge test (tests/contract-define-agent-eval.test.ts:164) verifies en
- Model: opencode/zai-coding-plan/glm-5.2
- Bridge attempts: 1
🔎 Heuristic Signals
🟡 Cruft: console debug added examples/marketing-agent-canonical/index.ts
- console.log('Marketing agent self-improvement loop')
🟡 Cruft: magic number added examples/marketing-agent-canonical/index.ts
composite mean: ${formatScore(baselineScore)} (took ${((Date.now() - baselineStart) / 1000).toFixed(1)}s, $${baseline.aggregates.totalCostUsd.toFixed(4)}),
What this audit checks
It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.
| Pass | What it asks |
|---|---|
| Heuristic | Vague title? Whitespace-only or cruft-bearing diff? (content signals only) |
| Duplication | Do added function/class names already exist elsewhere in the repo? |
| Value Audit | What does it do? What goal does it achieve? Is it good? Better architecture or already-exists? |
| Usefulness Audit | Does it integrate and fit? Will it hold up in real use and actually get used? |
Findings are concerns, not blocks — the human reviewer decides what to do with them.
Summary
@tangle-network/agent-interfaceAgentProfile.AgentProfileCellstamping.agentProfileId()now returns a path-safelabel-<hash16>id, so run directories and profile-matrix keys stay readable without collapsing same-label profiles.defineAgentEval(): define scenarios, agent, judge, and baseline once, then call.evaluate()or.improve()with per-call overrides.defineAgentEval.improve()merge nestedbudget,llm, andhostedTenantoverrides field-by-field; hosted tenant config now fails loudly if the merge still lacks endpoint/apiKey/tenantId.repsfail loudly instead of producing unscored or zero-cell runs.tsc --emitDeclarationOnlyfor declarations, with declaration aliases for package export paths.HostedTenantexport, run-output ignore, stronger hosted-tenant tests, and a persisted critical-audit record.Verification
pnpm test src/agent-profile.test.ts tests/contract-define-agent-eval.test.ts tests/campaign/run-profile-matrix.test.ts(38 passed)pnpm typecheckpnpm lint(exit 0; existing Biome warnings remain in untouched files)pnpm test(246 files, 2,500 passed, 2 skipped)pnpm buildpnpm verify:packagegit diff --checkgit diff --check HEAD~1..HEADgit merge-tree --write-tree origin/main HEAD→a6d8882d04654fd28ffe3f63ec09c977a2de4a845468922c: successNotes
buildSandboxAgentProfileCell/SANDBOX_AGENT_PROFILEpublic names instead of keeping stale compatibility aliases.AgentProfilemigration: old persisted scorecard/profile cells may not join with newagentProfileHashvalues orsourceProfile.kind = agent-interface-profilerows. This package is greenfield for us, so this PR treats that as a clean migration rather than adding legacy shims.agentProfileId()is now human-readable, path-safe, and collision-resistant for ordinary eval matrices. Consumers that used it as a stable display-only label should useprofile.name; run/matrix keys should use the helper.