staging: #272 OC concept overlay + 202606 data (for RY/Eric inspection) #8
Merged
isamplesorg#252) Flavor A: a described-by=<concept-uri> deep link filters every explorer surface via the shared search_pids semi-join. Hidden (URL-only, no UI surface) — additive, no change to existing flows. Dual-Codex-approved, deep-link + mutual-exclusivity + A1-regression tested, smoke gate green. Closes isamplesorg#248 (Flavor A). Cursor + UI follow-ups tracked separately.
…ace fix (isamplesorg#261) Closes isamplesorg#253, isamplesorg#254, isamplesorg#255. Consolidated runtime on top of described-by (isamplesorg#252).
… derived parquet
(a) DATA_PROVENANCE.md — end-to-end build chain (export → base PQG → sidecar
merge → frontend derived → R2/Worker), per-stage script/command + the key
constraint (the iSamples export is frozen — Central API offline since Aug 2025;
new per-source data must come via the pid sidecar merge, not re-export). Folds
the sidecar pattern (previously only in the Obsidian vault) into the repo.
(c) scripts/build_frontend_derived.py — reproduces the 6 derived files that had
no checked-in build (only ad-hoc notebook SQL): sample_facets_v2, samples_map_lite,
wide_h3, h3_summary_res{4,6,8}, facet_summaries, facet_cross_filter — from one
`wide` input (DuckDB + h3 + spatial). Has --validate-against to diff schema+counts
vs published.
Validated vs the published isamples_202601 files (built from 202604 wide):
EXACT reproduction of sample_facets_v2 (5,980,282), samples_map_lite, and
h3_summary_res4/6/8; all schemas match. facet_summaries (+3) and
facet_cross_filter (+86) are schema-correct, with small deltas from the
202604-vs-202601 version gap + the original cross-filter pruning self-pairs
(this build is an exhaustive superset) — can be reconciled if exact parity is needed.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…sts (isamplesorg#273)

Rebuilds the Stage-4 derived-parquet pipeline as a real, tested, human-runnable system (no AI in the loop to trust). Closes defects found by EXECUTION that document/AI review missed.

build_frontend_derived.py (rewrite):
- geometry-agnostic (WKB BLOB *or* DuckDB GEOMETRY): fixes the silent BinderException on 202601/Zenodo wides
- decorrelated concept resolution (unnest+arg_min + joins): fixes the MAP-cross-join perf blowup (>16 min -> 5.4 s on the 20M-row wide)
- material = first NON-ROOT concept (isamplesorg#265/isamplesorg#271); deterministic COPY ORDER BY + tie-broken dominant_source + rounded centroids
- strict CLI (unknown --only/--skip fails; --tag required)
- emits {tag}_manifest.json: input/output sha256, argv, git SHA, DuckDB + extension versions (machine-checkable build identity)

validate_frontend_derived.py (new, algebraic gate):
- asserts the derived-file ALGEBRA, not spot checks: summaries == GROUP BY facets; cross_filter == conditional GROUP BY; facets.pid == map_lite.pid; pid uniqueness; H3 counts sum to map_lite; schema. Non-zero exit on failure.

tests/test_frontend_derived.py (new): fixture unit tests over tiny synthetic wides (BLOB + GEOMETRY), material/concept/place_name/CLI cases. 6 tests.

Makefile (wide/derived/validate/test/all), scripts/requirements.txt (duckdb pinned), .github/workflows/pipeline-tests.yml (CI fixture gate).

DATA_PROVENANCE.md + SERIALIZATIONS.md reconciled with reality: Stage-4 now scripted; geometry contract; non-reproducibility of deployed 202601 facets (346,768 vs 528,983); version skew; h3 UBIGINT; cross_filter shape; first-non-root vs leaf.

Scope hardened by adversarial Codex audit (epic isamplesorg#273). Supersedes isamplesorg#271.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
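To make the "algebra, not spot checks" idea in this commit concrete, here is a minimal sketch of one such check (summaries == GROUP BY facets, plus pid uniqueness). It is not the code in validate_frontend_derived.py: it uses the standard-library `sqlite3` module as a stand-in for DuckDB, and the table layout, the column names (`facet_value`, `n`, `material`) and the helper names are all assumptions made for illustration.

```python
import sqlite3

# Hypothetical miniature tables; the real files are parquet and the real
# column names may differ.
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE facets (pid TEXT, material TEXT);
    INSERT INTO facets VALUES ('p1', 'rock'), ('p2', 'rock'), ('p3', 'soil');
    CREATE TABLE summaries (facet_value TEXT, n INTEGER);
    INSERT INTO summaries VALUES ('rock', 2), ('soil', 1);
""")

# Symmetric difference between the stored summary and a fresh aggregation.
DIFF_SQL = """
    SELECT * FROM (
        SELECT facet_value, n FROM summaries
        EXCEPT
        SELECT material, COUNT(*) FROM facets GROUP BY material
    )
    UNION ALL
    SELECT * FROM (
        SELECT material, COUNT(*) FROM facets GROUP BY material
        EXCEPT
        SELECT facet_value, n FROM summaries
    )
"""


def summary_mismatches(con):
    """Rows present on only one side of summaries vs. GROUP BY facets."""
    return con.execute(DIFF_SQL).fetchall()


def duplicate_pids(con):
    """Pids that occur more than once in facets (should be empty)."""
    return con.execute(
        "SELECT pid FROM facets GROUP BY pid HAVING COUNT(*) > 1"
    ).fetchall()


clean = summary_mismatches(con)      # the algebra holds on the clean tables

# Simulate a stale or tampered summary and re-run the same check.
con.execute("UPDATE summaries SET n = 5 WHERE facet_value = 'soil'")
tampered = summary_mismatches(con)   # both the stored and the expected row
```

A gate built this way would exit non-zero whenever either list is non-empty, so the whole file is covered by the identity instead of a few sampled rows.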
…terminism (Codex round 2)

Codex PROVED the validator passed a wrecked rebuild (corrupted material/coords/H3 with self-consistent summaries -> exit 0). Fixes:
- validate_frontend_derived.py --wide: re-derive from the source wide and EXCEPT-diff the written facets/map_lite/h3; catches corruption/stale/wrong-version that internal consistency cannot. Proven by a new test that corrupts coords (passes internal checks, FAILS the --wide gate). Passes on the real 202604 rebuild.
- builder HARD-fails on duplicate pids / duplicate concept row_ids (was a warning)
- --threads option; determinism claim made honest: facets/map_lite/summaries/cross_filter are byte-identical run-to-run (verified); float h3 centroids are display-only (compared on discrete cols only).
- tests: semantic-gate-catches-corruption, dup-pid-hard-fail, manifest, wide_h3 (10 total)
- docs: SERIALIZATIONS deployed-file caveat (202601 still has root rows) vs builder contract; DATA_PROVENANCE wide_h3 coverage precise.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
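The failure mode described in this commit (a corrupted output that is still internally consistent) can be illustrated with a small sketch. It again uses `sqlite3` in place of DuckDB, and the `wide`/`map_lite` schemas and function names are hypothetical; the point is only that a re-derive-and-EXCEPT comparison sees what a uniqueness check cannot.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Hypothetical source 'wide' and a derived map_lite written from it.
    CREATE TABLE wide (pid TEXT, lat REAL, lng REAL);
    INSERT INTO wide VALUES ('p1', 10.0, 20.0), ('p2', -5.5, 30.25);
    CREATE TABLE map_lite AS SELECT pid, lat, lng FROM wide;
""")


def internal_ok(con):
    """Internal consistency only: every pid in map_lite is unique."""
    total, distinct = con.execute(
        "SELECT COUNT(*), COUNT(DISTINCT pid) FROM map_lite"
    ).fetchone()
    return total == distinct


def wide_diff(con):
    """Written rows that a fresh derivation from the wide would not produce.

    A fuller check would also diff in the other direction.
    """
    return con.execute("""
        SELECT pid, lat, lng FROM map_lite
        EXCEPT
        SELECT pid, lat, lng FROM wide
    """).fetchall()


# Corrupt one coordinate in the written file; pids stay unique.
con.execute("UPDATE map_lite SET lat = lat + 1.0 WHERE pid = 'p1'")

passes_internal = internal_ok(con)   # the corruption is invisible here
caught = wide_diff(con)              # but the source comparison reports it
```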
…wide_h3 correctness (Codex round 3)

- validator --wide now also diffs h3 resolution (exact) + center_lat/lng (tolerant 1e-4: catches gross corruption, ignores float/thread last-ULP jitter)
- facet_summaries.scheme contract checked (must be NULL)
- wide_h3 cell correctness test (cross-checked vs map_lite)
- tests prove h3 center/resolution corruption + scheme corruption are caught (12 total)

Verified: 12 passed; real --wide gate exits 0 on the 202604 rebuild with the new checks; h3 center delta 1e-6 (well within 1e-4).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
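A tolerant centroid comparison of the kind this commit describes might look like the sketch below. The dict-of-tuples representation, the cell ids and the function name `centers_match` are assumptions for illustration; only the 1e-4 tolerance and the 1e-6 observed delta come from the commit message.

```python
def centers_match(expected, actual, tol=1e-4):
    """Compare (lat, lng) centers per cell id within an absolute tolerance."""
    if expected.keys() != actual.keys():
        return False  # a missing or extra cell is always a failure
    return all(
        abs(expected[cell][0] - actual[cell][0]) <= tol
        and abs(expected[cell][1] - actual[cell][1]) <= tol
        for cell in expected
    )


expected = {"cell_a": (10.0, 20.0)}          # hypothetical cell id
jitter = {"cell_a": (10.0 + 1e-6, 20.0)}     # float/thread-level noise
corrupt = {"cell_a": (10.5, 20.0)}           # gross corruption

jitter_ok = centers_match(expected, jitter)
corrupt_ok = centers_match(expected, corrupt)
```

The tolerance sets a floor on what the gate can detect: any shift smaller than `tol` passes, which is the trade-off between ignoring float jitter and bounding undetected error.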
…ance + verify manifest (Codex/workflow round 4)

A proof workflow (independent re-exec + adversarial attack) found two real misses:
1. H3 centroids: shifting every cell center ~9m (8e-5 deg) passed the loose 1e-4 tolerance. Tightened to 1e-5 (~1m); residual undetected error now bounded at ~1m on display-only centroids. Re-running the exact attack now FAILS the gate.
2. manifest.json was never validated: corrupting its sha256 attestations passed. Validator now verifies every output file's sha256 (and the input's, with --wide) against the manifest. (Self-attesting, not signed; documented.)

Both attacks re-run against the fixed gate now exit 1. Clean real rebuild still exits 0. 14 fixture tests (added regressions for both misses).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
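The manifest check in item 2 amounts to recomputing each file's digest and comparing it with the recorded one. A minimal sketch follows; the manifest layout (an `"outputs"` mapping of file name to hex digest), the file names and the helper names are assumptions, not the actual {tag}_manifest.json schema.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_of(path):
    """Stream a file through SHA-256 and return the hex digest."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def verify_manifest(manifest_path):
    """Return names of files whose digest differs from the manifest entry."""
    manifest_path = Path(manifest_path)
    manifest = json.loads(manifest_path.read_text())
    bad = []
    for name, expected in manifest["outputs"].items():  # hypothetical layout
        if sha256_of(manifest_path.parent / name) != expected:
            bad.append(name)
    return bad


with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "demo_facets.parquet"      # hypothetical output name
    out.write_bytes(b"stand-in bytes, not real parquet")
    manifest = Path(tmp) / "demo_manifest.json"
    manifest.write_text(json.dumps({"outputs": {out.name: sha256_of(out)}}))
    clean_result = verify_manifest(manifest)

    # Corrupt the attestation itself, as in the attack described above.
    manifest.write_text(json.dumps({"outputs": {out.name: "0" * 64}}))
    tampered_result = verify_manifest(manifest)
```

As the commit notes, this is self-attesting: it detects a mismatch between files and manifest, but not a coordinated rewrite of both.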
…esorg#272, fixes isamplesorg#260)

Overlays Eric Kansa's OC PQG concept mappings onto the unified wide: p__has_material_category / p__has_sample_object_type are REPLACED for OC pids; OC wins unconditionally (RY decision 2026-06-10, isamplesorg#272). Mints IdentifiedConcept rows for URIs the frozen export never had (e.g. otheranthropogenicmaterial, the correct isamplesorg#260 value, absent entirely).
- scripts/enrich_wide_with_oc_concepts.py: deterministic single-pass DuckDB overlay; ordered-list preservation; hard-fails on dup pids/row_ids and unresolved OC concept refs; emits .manifest.json (input shas, counts).
- scripts/validate_oc_concept_enrichment.py: INDEPENDENT trust gate; re-derives expected URI lists from (src, oc) with its own SQL; non-overlay rows must be byte-identical; minted set must be exactly the missing URIs; isamplesorg#260 sentinel.
- tests/test_oc_concept_enrichment.py: 13 fixture tests incl. unconditional-win, order preservation, determinism (bit-identical), validator tamper-detection, hard-failure modes.
- validate_frontend_derived.py: isamplesorg#260 sentinel parameterized by data vintage (--sentinel-material); default now the post-isamplesorg#272 corrected value.
- Makefile: oc-wide / enrich / validate-enrich / all-272 chain; CI runs both fixture suites.
- DATA_PROVENANCE.md: Stage 3 split into 3a (thumbnails) + 3b (OC concepts).

Scope (documented): overlay only; ~75K new OC records not ingested; p__has_context_category untouched. Both follow-ups tracked in isamplesorg#272.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
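The "OC wins unconditionally" rule can be sketched in a few lines of plain Python. This is a simplification, not the DuckDB overlay in enrich_wide_with_oc_concepts.py: it stores concept values directly as ordered lists of placeholder strings (the real wide references IdentifiedConcept rows by row_id), and the row dicts, the `label` column and the function name `overlay` are hypothetical.

```python
REPLACED = ("p__has_material_category", "p__has_sample_object_type")


def overlay(src_rows, oc_rows, replaced=REPLACED):
    """Replace the listed columns for pids present in oc_rows; copy the rest."""
    oc_by_pid = {row["pid"]: row for row in oc_rows}
    if len(oc_by_pid) != len(oc_rows):
        raise ValueError("duplicate OC pid")  # hard failure, not a warning
    out = []
    for row in src_rows:
        new = dict(row)
        oc = oc_by_pid.get(row["pid"])
        if oc is not None:
            for col in replaced:
                new[col] = list(oc[col])  # unconditional; list order preserved
        out.append(new)
    return out


# Placeholder values, not real concept URIs.
src = [
    {"pid": "oc:1", "p__has_material_category": ["old_material"],
     "p__has_sample_object_type": ["old_type"], "label": "kept"},
    {"pid": "other:2", "p__has_material_category": ["rock"],
     "p__has_sample_object_type": ["specimen"], "label": "untouched"},
]
oc = [
    {"pid": "oc:1", "p__has_material_category": ["otheranthropogenicmaterial"],
     "p__has_sample_object_type": ["type_b", "type_a"]},
]

enriched = overlay(src, oc)
```

Rows whose pid is absent from the OC input come out equal to the input, which is the property the independent validator checks byte-for-byte on the real data.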
…acy sentinel

- validator: ALL src rows now compared on ALL non-replaced columns (keyed row_id hash join; scales to 20.7M, no full-table EXCEPT) [BLOCKER 1 + perf 4]
- validator: overlay pid SET equality + distinctness; sentinel absence is a FAILURE when present in inputs (was a silent N/A) [BLOCKER 2]
- validator: minted rows must carry OC label/scheme metadata
- Makefile: legacy chain passes --sentinel-material (pre-isamplesorg#272 value); all-272 clears it to use the enriched default [MAJOR 3]
- enrich: document []->NULL normalization (pqg #8 convention) [MINOR 5]
- tests: +3: both Codex attacks reproduced (verified to fool the OLD validator, caught by the fixed one) + empty-array normalization pin

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…n, make -j safety

- validator: minted rows now compared FULL-ROW against a re-derived expectation (deterministic ids max(src)+rank(uri); NULLs everywhere except pid/otype/label/scheme); shifted-id and smuggled-column outputs now fail
- enricher + validator: hard-reject duplicate OC IdentifiedConcept row_ids (one reference must never fan out into several URIs)
- Makefile: $(ENRICHED) is a real file target; validate-enrich depends on it; all/all-272 use ordered sub-makes; safe under make -j
- tests: +3 regression (both round-2 attacks verified to fool the previous gate; dup-concept-row_id input rejected by both scripts)

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
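The "deterministic ids max(src)+rank(uri)" rule lets a validator recompute the exact ids the enricher should have minted. A minimal sketch, assuming the rank is 1-based over the lexicographically sorted, de-duplicated missing URIs (the commit does not spell out the ordering, so that part is a guess), with hypothetical function and variable names:

```python
def mint_row_ids(src_row_ids, missing_uris):
    """Assign each missing URI the id max(src) + rank(uri).

    Assumption: rank is 1-based over the sorted, de-duplicated URIs.
    """
    base = max(src_row_ids)
    ordered = sorted(set(missing_uris))
    return {uri: base + rank for rank, uri in enumerate(ordered, start=1)}


# Placeholder URIs; input order must not affect the result.
first = mint_row_ids([7, 3, 12], ["uri:z", "uri:a"])
second = mint_row_ids([12, 7, 3], ["uri:a", "uri:z", "uri:a"])
```

Because the mapping depends only on the inputs, an output whose minted ids are shifted by even one fails a full-row comparison against this expectation.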
- validator: reject unresolved OC concept refs (inner joins were silently dropping them from the expectation; standalone runs could pass a wrong output) and duplicate OC MSR pids (grain parity with the enricher)
- validator: minted expectation uses NOT EXISTS (NULL-pid src concept made NOT IN evaluate UNKNOWN -> false failure)
- tests: +3 regression for all three findings

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
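The NOT IN versus NOT EXISTS difference in the second bullet is standard SQL three-valued logic and is easy to reproduce. The sketch below uses `sqlite3` as a stand-in for DuckDB, with hypothetical table names and placeholder URIs: a single NULL in the subquery makes `x NOT IN (...)` evaluate to UNKNOWN for every non-matching `x`, so the genuinely missing URI disappears from the result.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    -- Hypothetical tables: URIs the OC data uses, and concepts already in src.
    CREATE TABLE oc_uris (uri TEXT);
    INSERT INTO oc_uris VALUES ('uri:a'), ('uri:b');
    CREATE TABLE src_concepts (pid TEXT);
    INSERT INTO src_concepts VALUES ('uri:a'), (NULL);
""")

# 'uri:b' NOT IN ('uri:a', NULL) is UNKNOWN, so the row is filtered out.
not_in = con.execute("""
    SELECT uri FROM oc_uris
    WHERE uri NOT IN (SELECT pid FROM src_concepts)
""").fetchall()

# NOT EXISTS only asks whether a matching row exists; NULL never matches.
not_exists = con.execute("""
    SELECT uri FROM oc_uris AS o
    WHERE NOT EXISTS (SELECT 1 FROM src_concepts AS s WHERE s.pid = o.uri)
""").fetchall()
```

With NOT IN the expected minted set comes back empty, which is how a correct output could be reported as a failure.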
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…#272)

All tagged data URLs 202601->202606; wide_url pinned to the explicit versioned file (popups read corrected OC material/object-type from it). current/wide.parquet alias stays on the previous wide until the production cutover decision.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
This was referenced Jun 10, 2026
rdhyee added a commit that referenced this pull request Jun 11, 2026 (same message as the "…acy sentinel" commit above)
Staging deploy PR: merging this puts the OC-concept-enriched explorer on rdhyee.github.io for inspection by RY + Eric. Production (isamples.org + `current/` aliases) is untouched.

Data files are already live on R2 under new names (`isamples_202606_*`); this merge just points the staging explorer at them.

Full review happens on the upstream PR (see isamplesorg#TBD). Build chain: `make all-272`. Both trust gates pass on the real data; dual AI sign-off (Claude + Codex, 4 adversarial rounds).

🤖 Generated with Claude Code