php-transformer: add content round-trip hallucination check #416

Merged
chubes4 merged 1 commit into trunk from feature/php-content-round-trip on Jun 30, 2026

@borkweb borkweb commented Jun 30, 2026

Member

Summary

Adds a forward-direction content round-trip check: it flags generated block text that does not appear in the source content — invented, mangled, or merged copy. Ported from the JS blocks-engine output-verify.ts verifyComposedOutput() and adapted for raw-HTML input.

The check complements what already exists:

  • structural validators (BlockValidityValidator, CanonicalSaveShapeValidator) check that output is well-formed;
  • SemanticParityReporter checks landmarks and navigation;
  • neither verified that the visible copy survived conversion. This closes that gap.

How it works

ContentRoundTripReporter strips block-delimiter comments, splits output into text nodes, and flags any node (≥3 alphanumeric chars, normalized) that is not a substring of the normalized source plaintext as content_not_in_source. It is wired into HtmlTransformer::transform() as an additive source_reports.content_round_trip report plus findings-gated diagnostics, so clean conversions remain byte-identical (the 187 parity fixtures are unaffected).
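
The steps described above can be sketched roughly as follows. This is a minimal illustration, not the actual `ContentRoundTripReporter`: the function names are hypothetical, the regex-based tag handling is an assumption (the real reporter may extract text differently), and the WordPress-style delimiter in the comment is only an example of what a block-delimiter comment might look like. It assumes PHP 8 for `str_contains`.

```php
<?php
// Hypothetical helper names; only the overall technique comes from the PR description.

/** Fold text to a comparable form: decode entities, collapse whitespace, lowercase. */
function crt_normalize(string $text): string
{
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    $text = trim(preg_replace('/\s+/u', ' ', $text) ?? $text);
    // Prefer multibyte-aware lowercasing when the mbstring extension is present.
    return function_exists('mb_strtolower') ? mb_strtolower($text, 'UTF-8') : strtolower($text);
}

/** Return output text nodes that do not occur in the normalized source plaintext. */
function crt_find_unsourced_text(string $sourceHtml, string $blockOutput): array
{
    // Assumption: tags become spaces so adjacent source elements do not fuse.
    $source = crt_normalize(preg_replace('/<[^>]*>/', ' ', $sourceHtml) ?? '');

    // Drop block-delimiter comments (e.g. WordPress-style <!-- wp:paragraph -->).
    $markup = preg_replace('/<!--.*?-->/s', '', $blockOutput) ?? '';

    $findings = [];
    // Splitting on tags leaves the text nodes between them.
    foreach (preg_split('/<[^>]*>/', $markup) ?: [] as $node) {
        $text = crt_normalize($node);
        // Skip fragments with fewer than 3 alphanumeric characters.
        if (preg_match_all('/[\p{L}\p{N}]/u', $text) < 3) {
            continue;
        }
        if (!str_contains($source, $text)) {
            $findings[] = ['code' => 'content_not_in_source', 'text' => $text];
        }
    }
    return $findings;
}
```

With a source of `<p>Fresh bread daily</p>`, an output paragraph reading "Fresh bread hourly" would come back as one `content_not_in_source` finding, while a faithful copy returns an empty list.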

Test corpus and findings

The two refinements in this PR were driven by running the check against a real-world corpus — 116 hand-built static HTML pages across 23 small sites (nonprofit, restaurant, SaaS, portfolio, docs, WooCommerce-style catalog, editorial magazine, local service business, event/conference, membership, etc.). These exercise entity usage, form markup, logo lockups, numbered lists, and stat counters the unit fixtures don't.

A dedicated harness was added to make this repeatable:

composer content-round-trip -- <dir-of-html> [--verbose] [--output report.json]

It walks *.html recursively, transforms each page, and prints a ranked worklist of flagged text.
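
A harness of this shape could be sketched as below. The function names are hypothetical and this is not the contents of the real tool. In particular, ranking by the number of pages on which a string was flagged is an assumption; the description above only says the worklist is ranked.

```php
<?php
/** Collect every *.html file under $dir, recursively, in sorted order. */
function crt_find_html_files(string $dir): array
{
    $files = [];
    $iterator = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($dir, FilesystemIterator::SKIP_DOTS)
    );
    foreach ($iterator as $file) {
        if ($file->isFile() && strtolower($file->getExtension()) === 'html') {
            $files[] = $file->getPathname();
        }
    }
    sort($files);
    return $files;
}

/**
 * Deduplicate flagged text across pages and rank it by how many pages it hit.
 * $findingsByFile maps a file path to the list of flagged strings for that page.
 * Assumption: "ranked" means most widespread first.
 */
function crt_rank_findings(array $findingsByFile): array
{
    $pages = [];
    foreach ($findingsByFile as $texts) {
        foreach (array_unique($texts) as $text) {
            $pages[$text] = ($pages[$text] ?? 0) + 1;
        }
    }
    arsort($pages); // highest page count first
    return $pages;
}
```

Counting each string once per page keeps a footer that repeats on every page from being inflated by repeats within a single page.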

The numbers as the check was hardened:

| Stage | Files flagged | Findings |
|---|---|---|
| Initial faithful port | 84.5% (98/116) | 624 |
| + full HTML-entity decode | 62.9% (73/116) | 301 |
| + form-control echo suppression | 20.7% (24/116) | 72 |
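
The percentages in the table, and the overall drop in findings, can be re-derived from the raw counts. The snippet below is only a worked check of that arithmetic; the helper name and constant are made up for the illustration.

```php
<?php
// Figures copied from the table; 116 is the corpus size.
const CRT_TOTAL_FILES = 116;

/** Percentage of $part in $whole, rounded to one decimal place. */
function crt_percent(int $part, int $whole): float
{
    return round(100 * $part / $whole, 1);
}

$stages = [
    'Initial faithful port'           => ['files' => 98, 'findings' => 624],
    '+ full HTML-entity decode'       => ['files' => 73, 'findings' => 301],
    '+ form-control echo suppression' => ['files' => 24, 'findings' => 72],
];

foreach ($stages as $name => $row) {
    printf(
        "%-32s %5.1f%% of files, %3d findings\n",
        $name,
        crt_percent($row['files'], CRT_TOTAL_FILES),
        $row['findings']
    );
}

// Overall drop in findings from the first stage to the last: (624 - 72) / 624.
printf("Findings reduced by %.1f%%\n", crt_percent(624 - 72, 624)); // 88.5%
```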

That is an ~88% reduction in findings (624 down to 72), and it came from fixing two real issues the corpus exposed:

  1. Entity-decoding asymmetry (false positive in the check). Sources encode ©/→/—/’ as &copy;/&rarr;/&mdash;/&rsquo;; the transformer parses the DOM, so its output carries the literal glyph. The faithful port decoded only 7 named entities, so © never matched &copy;. Fixed by decoding the full entity set (html_entity_decode, ENT_QUOTES | ENT_HTML5) plus Unicode-space folding on both sides.

  2. Form-control echoes (expected transformer behavior, not a defect). The transformer flattens form fields into readable paragraphs whose text it synthesizes from label + value/placeholder/required/option state (Email address: your@email.com (required), … (selected)) — text legitimately absent from the source's visible content. The transformer (the only code that knows the text was synthesized) now declares those strings and the reporter excludes them. No serialized output changes; no brittle pattern/class-name matching.
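
The entity asymmetry in item 1 is easy to reproduce in isolation. In the sketch below, `crt_fold_text` is a hypothetical name, the sample strings are invented, and `htmlspecialchars_decode` merely stands in for "a decoder that knows only a handful of entities"; it is not the 7-entity list the original port used.

```php
<?php
/** Decode the full HTML5 entity set and fold Unicode spaces to a plain space. */
function crt_fold_text(string $text): string
{
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // \p{Z} covers NBSP, thin space and other Unicode separators; \s covers ASCII whitespace.
    $text = preg_replace('/[\p{Z}\s]+/u', ' ', $text) ?? $text;
    return trim($text);
}

// Raw source HTML still holds entities...
$raw = '&copy;&nbsp;2026 Example Co &mdash; menu &rarr;';
// ...while DOM-derived output carries the literal glyphs (written here as escapes).
$dom = "\u{00A9}\u{00A0}2026 Example Co \u{2014} menu \u{2192}";

// A narrow decoder leaves &copy; and friends untouched, so the two sides never match.
var_dump(htmlspecialchars_decode($raw) === $dom);      // bool(false)
// Folding both sides identically makes them compare equal.
var_dump(crt_fold_text($raw) === crt_fold_text($dom)); // bool(true)
```

Applying the same function to both sides matters more than the exact normalization chosen: any step run on only one side reintroduces the mismatch.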

The residual 72 findings are genuine signal — overwhelmingly one real transformer defect class: lost inter-element whitespace when adjacent inline elements merge. Examples surfaced by the corpus:

  • NORTHLINEPlumbing & Heating (logo: <span>NORTHLINE</span><span>Plumbing & Heating</span>)
  • Civic SignalResource Library (brand + sub-label lockup)
  • 01Crushed San Marzano tomatoes (number badge fused to list item text)
  • Error response shapejson (label + code badge)
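
One way to triage findings like these is to test whether a flagged string reappears in the source once all whitespace is removed from both sides. A match suggests fused text instead of invented copy. This heuristic is not part of the PR; the function names are hypothetical, and it assumes the source plaintext keeps a separator between adjacent elements. It needs PHP 8 for `str_contains`.

```php
<?php
/** Remove every whitespace character and lowercase, for whitespace-insensitive matching. */
function crt_squash(string $text): string
{
    return strtolower(preg_replace('/\s+/u', '', $text) ?? $text);
}

/** True when the flagged text matches the source once whitespace is ignored. */
function crt_looks_like_merge(string $flaggedText, string $sourcePlaintext): bool
{
    $needle = crt_squash($flaggedText);
    return $needle !== '' && str_contains(crt_squash($sourcePlaintext), $needle);
}

// Assumed source plaintext for the logo lockup, with a space between the two spans.
$source = 'NORTHLINE Plumbing & Heating';

var_dump(crt_looks_like_merge('NORTHLINEPlumbing & Heating', $source)); // bool(true)
var_dump(crt_looks_like_merge('Open 24 hours', $source));               // bool(false)
```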

These are worth filing as a separate transformer fix; this check now provides a deduplicated, reproducible list of them.

Testing

  • composer test — 187 parity fixtures + contract/unit/packaging, all green; no regressions.
  • php tests/unit/content-round-trip-reporter.php — 21 assertions (faithful output, hallucination detection, short-fragment skipping, full entity/whitespace/case normalization, producer-declared ignore set).
  • php tests/unit/content-round-trip-form-echo.php — 5 assertions, end-to-end through the real transformer, proving form-echo suppression is load-bearing (the same output flags without the ignore set).
  • composer content-round-trip -- ~/path/to/html — ad-hoc corpus runs.

## Summary
Add a forward-direction content round-trip check that flags generated block text which does not appear in the source content (invented, mangled, or merged copy). Ported from the JS blocks-engine `output-verify.ts verifyComposedOutput()` and adapted for raw-HTML input. Complements the existing structural validators (which check output well-formedness) and the SemanticParityReporter (landmarks and navigation only); neither previously verified that visible copy survived conversion.

## Why
The transformer hand-rolls block markup, so silent content loss or duplication has no guardrail. A cheap forward check — every visible output text node must appear in the normalized source plaintext — catches it.

## How
New `ContentRoundTripReporter` strips block-delimiter comments, splits output into text nodes, and flags any node (>=3 alphanumeric chars, normalized) absent from the normalized source as `content_not_in_source`. Wired into `HtmlTransformer::transform()` as an additive `source_reports.content_round_trip` report plus findings-gated diagnostics, so clean conversions stay byte-identical.

Two refinements were driven by running the check against a real-world corpus (see PR description):

1. Decode the full HTML entity set (`html_entity_decode` with `ENT_QUOTES | ENT_HTML5`) plus Unicode-space folding on both sides, rather than a 7-entity subset. The transformer parses the source DOM so its output carries literal glyphs while raw HTML still holds entities; decoding both sides identically removes a flood of false positives on `&copy;`/`&rarr;`/`&mdash;`/`&rsquo;`.

2. Suppress form-control echoes. The transformer flattens form fields into readable paragraphs whose text it synthesizes from label + value/placeholder/required/option state — text legitimately absent from the source's visible content. The transformer (the only code that knows the text is synthesized) declares those strings; the reporter excludes them. This avoids brittle pattern or class-name matching and changes no serialized output.
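
The producer-declared ignore set in item 2 might look something like the sketch below. All names are hypothetical and the echo format is inferred from the example string in the PR description (`Email address: your@email.com (required)`); the real transformer's API and synthesis rules may differ.

```php
<?php
/** Normalize a string for set membership: collapse whitespace, trim, lowercase. */
function crt_key(string $text): string
{
    return strtolower(trim(preg_replace('/\s+/', ' ', $text) ?? $text));
}

/**
 * Hypothetical producer side: build the paragraph text for a form field and
 * record it in $declared, since only the producer knows it was synthesized.
 */
function crt_flatten_field(string $label, string $placeholder, bool $required, array &$declared): string
{
    $text = $label . ': ' . $placeholder . ($required ? ' (required)' : '');
    $declared[] = $text;
    return $text;
}

/** Hypothetical reporter side: drop flagged strings that the producer declared. */
function crt_filter_declared(array $flaggedTexts, array $declared): array
{
    $ignore = array_flip(array_map('crt_key', $declared));
    return array_values(array_filter(
        $flaggedTexts,
        fn (string $text): bool => !isset($ignore[crt_key($text)])
    ));
}

$declared = [];
$echo = crt_flatten_field('Email address', 'your@email.com', true, $declared);

// The echo is suppressed; genuinely unsourced copy still surfaces.
print_r(crt_filter_declared([$echo, 'Invented headline'], $declared));
```

Because the set is built where the text is synthesized, the reporter needs no knowledge of form markup, which is the point the commit message makes about avoiding pattern or class-name matching.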

Also adds `tools/content-round-trip/run.php` (composer `content-round-trip`) to run the check over a directory of real HTML pages and report a ranked findings worklist.

## Testing
- [ ] composer test (187 parity fixtures + contract/unit/packaging, all green)
- [ ] php tests/unit/content-round-trip-reporter.php (21 assertions)
- [ ] php tests/unit/content-round-trip-form-echo.php (5 assertions, end-to-end)
- [ ] composer content-round-trip -- <dir-of-html> for ad-hoc corpus runs
@borkweb borkweb requested a review from chubes4 June 30, 2026 17:53

chubes4 commented Jun 30, 2026

Contributor

Nice! Thanks. Diagnostics are definitely the best way to improve this tooling fast

@chubes4 chubes4 merged commit 936122d into trunk Jun 30, 2026
1 check passed
@chubes4 chubes4 deleted the feature/php-content-round-trip branch June 30, 2026 17:58