php-transformer: add content round-trip hallucination check #416
Merged
## Summary

Add a forward-direction content round-trip check that flags generated block text which does not appear in the source content (invented, mangled, or merged copy). Ported from the JS blocks-engine `output-verify.ts verifyComposedOutput()` and adapted for raw-HTML input.

Complements the existing structural validators (which check output well-formedness) and the `SemanticParityReporter` (landmarks and navigation only); neither previously verified that visible copy survived conversion.

## Why

The transformer hand-rolls block markup, so silent content loss or duplication has no guardrail. A cheap forward check (every visible output text node must appear in the normalized source plaintext) catches it.

## How

New `ContentRoundTripReporter` strips block-delimiter comments, splits output into text nodes, and flags any node (>=3 alphanumeric chars, normalized) absent from the normalized source as `content_not_in_source`. Wired into `HtmlTransformer::transform()` as an additive `source_reports.content_round_trip` report plus findings-gated diagnostics, so clean conversions stay byte-identical.

Two refinements were driven by running the check against a real-world corpus (see PR description):

1. Decode the full HTML entity set (`html_entity_decode` with `ENT_QUOTES | ENT_HTML5`) plus Unicode-space folding on both sides, rather than a 7-entity subset. The transformer parses the source DOM, so its output carries literal glyphs while raw HTML still holds entities; decoding both sides identically removes a flood of false positives on `©`/`→`/`—`/`’`.
2. Suppress form-control echoes. The transformer flattens form fields into readable paragraphs whose text it synthesizes from label + value/placeholder/required/option state; that text is legitimately absent from the source's visible content. The transformer (the only code that knows the text is synthesized) declares those strings; the reporter excludes them. This avoids brittle pattern or class-name matching and changes no serialized output.

Also adds `tools/content-round-trip/run.php` (composer `content-round-trip`) to run the check over a directory of real HTML pages and report a ranked findings worklist.

## Testing

- [ ] `composer test` (187 parity fixtures + contract/unit/packaging, all green)
- [ ] `php tests/unit/content-round-trip-reporter.php` (21 assertions)
- [ ] `php tests/unit/content-round-trip-form-echo.php` (5 assertions, end-to-end)
- [ ] `composer content-round-trip -- <dir-of-html>` for ad-hoc corpus runs
Contributor
Nice! Thanks. Diagnostics are definitely the best way to improve this tooling fast.
## Summary

Adds a forward-direction content round-trip check: it flags generated block text that does not appear in the source content, whether invented, mangled, or merged copy. The technique is ported from the JS blocks-engine package (`output-verify.ts verifyComposedOutput()`) and adapted for raw-HTML input.

It complements what already exists:

- the structural validators (`BlockValidityValidator`, `CanonicalSaveShapeValidator`) check that output is well-formed;
- `SemanticParityReporter` checks landmarks and navigation.

Neither previously verified that visible copy survived conversion.

## How it works

`ContentRoundTripReporter` strips block-delimiter comments, splits output into text nodes, and flags any node (≥3 alphanumeric chars, normalized) that is not a substring of the normalized source plaintext as `content_not_in_source`. It is wired into `HtmlTransformer::transform()` as an additive `source_reports.content_round_trip` report plus findings-gated diagnostics, so clean conversions remain byte-identical (the 187 parity fixtures are unaffected).

## Test corpus and findings
The two refinements in this PR were driven by running the check against a real-world corpus: 116 hand-built static HTML pages across 23 small sites (nonprofit, restaurant, SaaS, portfolio, docs, WooCommerce-style catalog, editorial magazine, local service business, event/conference, membership, etc.). These exercise entity usage, form markup, logo lockups, numbered lists, and stat counters that the unit fixtures don't.

A dedicated harness, `tools/content-round-trip/run.php` (composer script `content-round-trip`), was added to make this repeatable. It walks `*.html` recursively, transforms each page, and prints a ranked worklist of flagged text.

Hardening the check cut the noise by ~88%, and that reduction came from fixing two real issues the corpus exposed:
1. Entity-decoding asymmetry (false positive in the check). Sources encode `©`/`→`/`—`/`’` as `&copy;`/`&rarr;`/`&mdash;`/`&rsquo;`; the transformer parses the DOM, so its output carries the literal glyph. The faithful port decoded only 7 named entities, so `&copy;` never matched `©`. Fixed by decoding the full entity set (`html_entity_decode`, `ENT_QUOTES | ENT_HTML5`) plus Unicode-space folding on both sides.
2. Form-control echoes (expected transformer behavior, not a defect). The transformer flattens form fields into readable paragraphs whose text it synthesizes from label + value/placeholder/required/option state (`Email address: your@email.com (required)`, `… (selected)`); that text is legitimately absent from the source's visible content. The transformer (the only code that knows the text was synthesized) now declares those strings and the reporter excludes them. No serialized output changes; no brittle pattern/class-name matching.

The residual 72 findings are genuine signal, overwhelmingly one real transformer defect class: lost inter-element whitespace when adjacent inline elements merge. Examples surfaced by the corpus:

- `NORTHLINEPlumbing & Heating` (logo: `<span>NORTHLINE</span><span>Plumbing & Heating</span>`)
- `Civic SignalResource Library` (brand + sub-label lockup)
- `01Crushed San Marzano tomatoes` (number badge fused to list item text)
- `Error response shapejson` (label + code badge)

These are worth filing as a separate transformer fix; this check now provides a deduplicated, reproducible list of them.
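To make the mechanics concrete, here is a minimal sketch of the forward check with both refinements applied. It is not the PR's `ContentRoundTripReporter`: the function names, the `wp:` delimiter shape, and the way source plaintext is derived (tags replaced by spaces) are assumptions made for illustration.

```php
<?php
// Illustrative sketch only. Function names and the "wp:" delimiter shape are
// assumptions for this example, not the PR's ContentRoundTripReporter API.

/** Normalize text identically on the source and output sides. */
function sketch_normalize(string $text): string
{
    // Decode the full entity set so "&copy;" in raw HTML equals a literal glyph.
    $text = html_entity_decode($text, ENT_QUOTES | ENT_HTML5, 'UTF-8');
    // Fold Unicode space separators (such as U+00A0) and ASCII whitespace runs.
    $text = preg_replace('/[\p{Z}\s]+/u', ' ', $text) ?? $text;
    $text = trim($text);
    return function_exists('mb_strtolower') ? mb_strtolower($text, 'UTF-8') : strtolower($text);
}

/**
 * @param string[] $declared strings the producer reports as synthesized (form echoes)
 * @return string[] normalized output text nodes that are absent from the source
 */
function sketch_round_trip(string $sourceHtml, string $outputMarkup, array $declared = []): array
{
    // Source plaintext: tags become spaces so adjacent elements stay separate words.
    $source = sketch_normalize(preg_replace('/<[^>]*>/', ' ', $sourceHtml) ?? $sourceHtml);
    $ignore = array_flip(array_map('sketch_normalize', $declared));

    // Drop block-delimiter comments, then split on tags to get text nodes.
    $markup = preg_replace('/<!--\s*\/?wp:.*?-->/s', '', $outputMarkup) ?? $outputMarkup;
    $nodes = preg_split('/<[^>]*>/', $markup) ?: [];

    $findings = [];
    foreach ($nodes as $node) {
        $needle = sketch_normalize($node);
        if (preg_match_all('/[\p{L}\p{N}]/u', $needle) < 3) {
            continue; // fewer than 3 alphanumeric characters: skip
        }
        if (isset($ignore[$needle]) || strpos($source, $needle) !== false) {
            continue; // declared as synthesized, or present in the source
        }
        $findings[] = $needle;
    }
    return $findings;
}
```

Under these assumptions, a source of `<span>NORTHLINE</span><span>Plumbing &amp; Heating</span>` checked against an output of `<p>NORTHLINEPlumbing &amp; Heating</p>` yields one finding, which is the lost-whitespace class described above, while `&copy;` in the source matches a literal `©` in the output.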
## Testing

- `composer test`: 187 parity fixtures + contract/unit/packaging, all green; no regressions.
- `php tests/unit/content-round-trip-reporter.php`: 21 assertions (faithful output, hallucination detection, short-fragment skipping, full entity/whitespace/case normalization, producer-declared ignore set).
- `php tests/unit/content-round-trip-form-echo.php`: 5 assertions, end-to-end through the real transformer, proving form-echo suppression is load-bearing (the same output flags without the ignore set).
- `composer content-round-trip -- ~/path/to/html`: ad-hoc corpus runs.
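For readers who want a feel for what an ad-hoc corpus run involves, the following is a rough sketch of a harness loop in the spirit of the one described under "Test corpus and findings". It is not the contents of `tools/content-round-trip/run.php`: the function name is hypothetical, `$check` stands in for "transform the page and return its flagged texts", and ranking by how often a text is flagged is an assumption about what "ranked" means.

```php
<?php
// Illustrative sketch of a corpus harness; not the PR's run.php.
// $check is a hypothetical callable: fn(string $html): string[] of flagged texts.

/** @return array<string,int> flagged text => number of times flagged, most frequent first */
function sketch_ranked_worklist(string $dir, callable $check): array
{
    $counts = [];
    $files = new RecursiveIteratorIterator(
        new RecursiveDirectoryIterator($dir, FilesystemIterator::SKIP_DOTS)
    );
    foreach ($files as $file) {
        // Only regular files with an .html extension are part of the corpus.
        if (!$file->isFile() || strtolower($file->getExtension()) !== 'html') {
            continue;
        }
        $html = file_get_contents($file->getPathname());
        if ($html === false) {
            continue; // unreadable file: skip rather than abort the run
        }
        foreach ($check($html) as $text) {
            // Deduplicate by text while counting occurrences across pages.
            $counts[$text] = ($counts[$text] ?? 0) + 1;
        }
    }
    arsort($counts); // sort by count, descending, keeping text keys
    return $counts;
}
```

Keeping the check behind a callable means the same loop can be pointed at any directory of pages without knowing how the transformer is constructed.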