fix(trace-analyst): count read/inspect tools as self-verification#271
Conversation
VERIFY_RE only matched verb-named tools (verif/eval/inspect/check/...), but real harnesses verify with Read/Grep/Glob (Claude Code), read_file/ls/cat (codex), and git status/diff — none of which match. So no-self-verification fired on ~any real session that reads or greps state, a false positive. Add the read/search/list/diff/status/test/lint families; leave bare shell (Bash/exec_command) unmatched since the name can't distinguish a test run from a mutation.
tangletools
left a comment
There was a problem hiding this comment.
✅ Auto-approved PR — 2f42430a
Blanket team auto-approval is enabled for this reviewer service.
The full PR reviewer audit still runs separately and will publish findings if it detects issues.
tangletools · auto-approval · reason: blanket_auto_approve · 2026-06-21T16:36:46Z
✅ No Blockers —
|
| opencode-kimi | glm | deepseek | aggregate | |
|---|---|---|---|---|
| Readiness | 85 | 86 | 82 | 82 |
| Confidence | 65 | 65 | 65 | 65 |
| Correctness | 85 | 86 | 82 | 82 |
| Security | 85 | 86 | 82 | 82 |
| Testing | 85 | 86 | 82 | 82 |
| Architecture | 85 | 86 | 82 | 82 |
Full multi-shot audit completed 1/1 planned shots over 2 changed files. Global verifier still owns final merge decision. | Full multi-shot audit completed 1/1 planned shots over 2 changed files. Global verifier still owns final merge decision. | Full multi-shot audit completed 1/1 planned shots over 2 changed files. Global verifier still owns final merge decision.
🟠 MEDIUM Unbounded read substring in VERIFY_RE creates false-positive self-verification — src/trace-analyst/behavioral-metrics.ts
Line 58-59 adds
readwithout word boundaries to VERIFY_RE. Because the regex is case-insensitive, a tool namedthread,ready,readme,pread,readline, etc. will match and be counted as self-verification, incorrectly suppressing theno-self-verificationsignal. Concrete:VERIFY_RE.test('thread')is true. This undermines eval validity for any harness whose tool names contain that substring. Fix: scopereadto real read-tool shapes, e.g.\bRead\b|\bread_file\b|\bread_dir\b|\breadFile\b|\breadDirectory\bor\b(read|Read)(?:_file|_dir|File|Directory|Text)?\b.
🟠 MEDIUM VERIFY_RE: read/view/glob/diff lack word boundaries → false-positive self-verification matches — src/trace-analyst/behavioral-metrics.ts
Line 59. Tokens list, ls, cat, find use \b word boundaries, but read, view, glob, diff, status do not. Confirmed edge cases: 'thread', 'spreadsheet', 'spread_operator', 'already_setup', 'treadmill' match via read substring; 'global' matches via glob; 'preview', 'overview' match via view; 'differential' matches via diff. A tool named 'thread_pool' or 'spreadsheet_export' would be counted as self-verification, silently suppressing the no-self-verification signal — the agent looks better than it is. Impact: eval validity (false negative on behavioral signal). Fix: add \b to read, view, glob, diff, status (e.g. \bread\b|\bview\b|\bglob\b|\bdiff\b|\bstatus\b)
🟡 LOW New test is positive-only; no negative cases anchor the design intent — src/trace-analyst/behavioral-metrics.test.ts
Lines 120-128: the new test only verifies that Read/Grep/read_file/git.status DO match. It does not lock in the documented exclusion of shell wrappers (Bash/exec_command) nor the boundary behavior (\blist\b matching
listbut notlsremote). The existing fixture test at line 60-67 indirectly coversworld.executenot matching, but the new regex adds boundary-anchored patterns whose correctness is currently untested. Suggested addition: `for (const tool of ['Bash','exec_command','lsremote','catapu
🟡 LOW No negative-test coverage for VERIFY_RE boundary — src/trace-analyst/behavioral-metrics.test.ts
Lines 120-128. The new test only verifies that expected tools (Read, Grep, read_file, git.status) ARE counted as self-verification. No test verifies that non-verification tools (Bash, exec_command, Write, Edit, Task, thread, spreadsheet) are NOT counted. A regex that matched everything would pass the current tests. Add a negative case: toolSpan on 'Bash' or 'exec_command' should not set hasSelfVerification.
🟡 LOW Substring matches on unbounded alternatives create false-positive verification classification — src/trace-analyst/behavioral-metrics.ts
Line 59: alternatives
read|view|find|diff|status|searchare unbounded. Probe confirms matches onspreadsheet(read),thread(read),research_agent(search),diff_apply(diff),status_report_writer(status),viewer_metrics_push(view). A tool literally named any of these would be counted as self-verification, suppressing the no-self-verification signal. Realistic impact is low for current major harnesses (Claude Code, codex, Cursor) but a custom harness with e.g. athreadtool would silently flip the signal. Mitigation already applied to the shortest common words (\blist\b, \bls\b, \bcat\b) — consider extending word-boundary anchoring to `
🟡 LOW pytest/vitest don't match \btest despite comment claiming 'test' coverage — src/trace-analyst/behavioral-metrics.ts
Line 59:
\btestonly has a LEFT word boundary. Inpytest/vitest,testis preceded by a word char (p/i), so there is no boundary and the regex does NOT match. Verified by probe: pytest→false, vitest→false. The comment at line 53-57 advertises 'test/lint' as covered. Practical impact is low because test runners are usually invoked through Bash/exec_command (which is intentionally excluded — its name can't discriminate), but the comment overstates coverage. Fix: add|pytest|vitest|jest|mochaexplicitly OR dr
tangletools · 2026-06-21T16:42:11Z · trace
tangletools
left a comment
There was a problem hiding this comment.
🟡 Value Audit — sound-with-nits
| Verdict | sound-with-nits |
| Concerns | 3 (3 weak-concern) |
| Heuristic | 0.0s |
| Duplication | 0.0s |
| Interrogation | 246.2s (2 bridge agents) |
| Total | 246.2s |
💰 Value — sound-with-nits
Extends the existing VERIFY_RE so real coding-agent tool names (Read/Grep/read_file/git status) count as self-verification, fixing a real false-positive; ship.
- What it does: Expands the single VERIFY_RE regex in src/trace-analyst/behavioral-metrics.ts:58-59 to also match the read/search/list/diff/status/test/lint tool-name families. Span tool names like Read, Grep, read_file, git.status now set hasSelfVerification=true (line 120) and suppress the 'no-self-verification' signal (line 177). Bare shell (Bash/exec_command) stays intentionally unmatched. Adds a 4-case regre
- Goals it achieves: Fixes a false positive in the deterministic 'no-self-verification' behavioral signal. Real coding-agent traces (Claude Code, codex) verify by reading/grepping state, but their tools are named Read/Grep/read_file/git status — none matched the old verb-only matcher (verif|eval|inspect|check|assert|validat|review|confirm), so virtually every real session was reported as 'never verifies'. After merge,
- Assessment: Good change on its merits. It is small (one regex expansion + comment), targeted (one false positive), and squarely in the grain of the file's existing design — VERIFY_RE was already a single-regex tool categorizer (line 120), and this extends it. The comment at lines 53-57 honestly documents the boundary (shell excluded because the name can't disambiguate pytest from rm). The 14,541-session claim
- Better / existing approach: Searched src/ for tool-categorization regexes. Found scattered equivalents with DIFFERENT intent: EDIT_TOOLS at src/storyboard/code-edit.ts:20, duplicated read/edit matchers at src/storyboard/index.ts:159,166,175,284,286, and edit/test regexes at src/run-critic.ts:88,101. None of them is a true duplicate of 'is this verification vs. mutation' — they classify for rendering or critic scoring, with d
- Model: opencode/zai-coding-plan/glm-5.2
- Bridge attempts: 2
- Bridge warning: opencode/kimi-for-coding/k2p7: bridge stream ended without value-audit content
🎯 Usefulness — sound-with-nits
Expands the VERIFY_RE regex with read/search/diff/test/lint tool families to fix a false positive where coding agents that read/grep state were incorrectly flagged as 'never verifies' — the change is minimal, well-targeted, and flows through the existing deterministic-signal pipeline automatically.
- Integration: Fully reachable. The VERIFY_RE regex at behavioral-metrics.ts:58-59 feeds hasSelfVerification at :120, which controls the no-self-verification signal at :177. computeTraceMetrics is called by behavioralAnalyst().analyze() at behavioral-analyst.ts:86, which is always registered in buildDefaultAnalystRegistry() at default-registry.ts:36, and also exported for direct use from src/index.ts:376. Every
- Fit with existing patterns: Natural extension of the existing pattern. The codebase has exactly one self-verification classifier — this regex — and no competing mechanism. The comment at :53-57 explains why each tool family is included and why shell tools are deliberately excluded. Same deterministic, model-agnostic approach as the module's other three behavioral signals.
- Real-world viability: Covers the real tool names used by agent harnesses: Claude Code (Read/Grep/Glob), Codex (read_file/ls/cat), git (status/diff), test/lint runners. Tests at behavioral-metrics.test.ts:120-128 verify Read, Grep, read_file, and git.status all suppress the false signal. Existing 10 tests stay green. The intentional exclusion of bare shell (Bash/exec_command) avoids the opposite false signal. One weak n
- Model: opencode/deepseek/deepseek-v4-pro
- Bridge attempts: 3
- Bridge warning: opencode/zai-coding-plan/glm-5.2: bridge stream ended without value-audit content; opencode/kimi-for-coding/k2p7: bridge stream ended without value-audit content
💰 Value Audit
🟡 Substring alternation read lacks word boundaries the rest of the regex uses [better-architecture] ``
VERIFY_RE mixes anchored alternates (\blist\b, \bls\b, \bcat\b, \bfind\b at line 59) with bare substrings like
read,view,search,eval.readmatches Thread/Already/Treadmill as well as Read/read_file. Real tool names won't trip this today, but the inconsistency is the kind of thing that bites when a new harness lands. Either anchor the substring alternates consistently or add a quick test for known non-matches (Bash, exec_command, thread). Note for the reviewer; does not gate shipping
🟡 N+1 scattered tool-name regex; consolidation overdue [duplication] ``
The codebase already hand-rolls tool-categorization regexes in at least three places (src/storyboard/code-edit.ts:20 EDIT_TOOLS; src/storyboard/index.ts:159,166,175,284,286 with two duplicated read matchers; src/run-critic.ts:88,101). This PR adds a fourth at behavioral-metrics.ts:58. Each has different intent so none is a true duplicate, but the pattern is clearly straining. A future PR could lift a shared tool-categories.ts (EDIT/READ/VERIFY/TEST families) and have each site compose from it. N
🎯 Usefulness Audit
🟡 read without word boundaries could match unintended tool names in edge cases [robustness] ``
The regex uses
read(not\bread\b) at behavioral-metrics.ts:59. This deliberately allows matchingread_fileand similar, but in the highly unlikely scenario of a tool namedthread,bread, orspread, it would count as verification. In practice, OTLP tool names from agent harnesses are sparse identifiers (Read, Bash, Write, read_file, grep, ls) — not prose. This is a theoretical edge, not a practical one. Adding word boundaries would breakread_filematching. No action needed unless
What this audit checks
It judges the change on its merits — not whether it was tasked out in an issue. Unticketed, fast-moving work is fine; the question is whether the change is good and whether a better or existing approach should be used instead.
| Pass | What it asks |
|---|---|
| Heuristic | Vague title? Whitespace-only or cruft-bearing diff? (content signals only) |
| Duplication | Do added function/class names already exist elsewhere in the repo? |
| Value Audit | What does it do? What goal does it achieve? Is it good? Better architecture or already-exists? |
| Usefulness Audit | Does it integrate and fit? Will it hold up in real use and actually get used? |
Findings are concerns, not blocks — the human reviewer decides what to do with them.
The
no-self-verificationbehavioral signal fires a false positive on real coding-agent traces.VERIFY_REonly matched verb-named tools (verif|eval|inspect|check|assert|validat|review|confirm), but the tools agents actually verify with are namedRead/Grep/Glob(Claude Code),read_file/ls/cat(codex), andgit status/diff— none of which match. Result: any session that reads or greps state (≈ all of them) was reported as "never verifies."Fix: add the read / search / list / diff / status / test / lint families to the matcher. Bare shell (
Bash/exec_command) stays unmatched on purpose — its name can't tell apytestfrom anrm, so matching it would just create the opposite false signal.Measured against a 14,541-session corpus: 99% of edit-sessions also Read/Bash, yet the metric reported 0 verification — the bug this fixes.
Tests: existing 10 stay green (synthetic
world.execute/world.evalnames unaffected); added a regression test thatRead/Grep/read_file/git.statuscount as verification. Full suite 2501 pass, typecheck clean.