
staging: #272 OC concept overlay + 202606 data (for RY/Eric inspection) #8

Merged
rdhyee merged 13 commits into main from feat/oc-concept-reintegration-272 on Jun 10, 2026


rdhyee commented Jun 10, 2026


Staging deploy PR — merging this puts the OC-concept-enriched explorer on rdhyee.github.io for inspection by RY + Eric. Production (isamples.org + current/ aliases) is untouched.

Data files are already live on R2 under new names (isamples_202606_*); this merge just points the staging explorer at them.

Full review happens on the upstream PR (see isamplesorg#TBD). Build chain: make all-272 — both trust gates pass on the real data; dual AI sign-off (Claude + Codex, 4 adversarial rounds).

🤖 Generated with Claude Code

rdhyee and others added 13 commits June 1, 2026 12:19
isamplesorg#252)

Flavor A: a described-by=<concept-uri> deep link filters every explorer surface via the shared search_pids semi-join. Hidden (URL-only, no UI surface) — additive, no change to existing flows. Dual-Codex-approved, deep-link + mutual-exclusivity + A1-regression tested, smoke gate green. Closes isamplesorg#248 (Flavor A). Cursor + UI follow-ups tracked separately.
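
The semi-join described above can be sketched as follows. This is a minimal illustration, not the explorer's actual query: it uses the standard-library sqlite3 module as a stand-in for DuckDB, and the `samples` table, its columns, and the sample rows are hypothetical. Only the name `search_pids` comes from the commit message.

```python
import sqlite3


def filter_by_search_pids(con: sqlite3.Connection) -> list:
    """Semi-join: keep a sample (at most once) if its pid appears in search_pids."""
    return con.execute(
        """
        SELECT s.pid, s.label
        FROM samples AS s
        WHERE EXISTS (SELECT 1 FROM search_pids AS sp WHERE sp.pid = s.pid)
        ORDER BY s.pid
        """
    ).fetchall()


con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE samples (pid TEXT PRIMARY KEY, label TEXT);  -- hypothetical surface table
    CREATE TABLE search_pids (pid TEXT);  -- pids matched by the described-by concept
    INSERT INTO samples VALUES ('p1', 'basalt'), ('p2', 'shell'), ('p3', 'pottery');
    -- 'p3' is listed twice on purpose: a semi-join must not duplicate it
    INSERT INTO search_pids VALUES ('p1'), ('p3'), ('p3');
    """
)
matched = filter_by_search_pids(con)
```

Unlike a plain inner join, the `EXISTS` form filters without multiplying rows when `search_pids` contains repeats, which is presumably why one shared filter can be applied to every surface.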
… derived parquet

(a) DATA_PROVENANCE.md — end-to-end build chain (export → base PQG → sidecar
merge → frontend derived → R2/Worker), per-stage script/command + the key
constraint (the iSamples export is frozen — Central API offline since Aug 2025;
new per-source data must come via the pid sidecar merge, not re-export). Folds
the sidecar pattern (previously only in the Obsidian vault) into the repo.

(c) scripts/build_frontend_derived.py — reproduces the 6 derived files that had
no checked-in build (only ad-hoc notebook SQL): sample_facets_v2, samples_map_lite,
wide_h3, h3_summary_res{4,6,8}, facet_summaries, facet_cross_filter — from one
`wide` input (DuckDB + h3 + spatial). Has --validate-against to diff schema+counts
vs published.

Validated vs the published isamples_202601 files (built from 202604 wide):
EXACT reproduction of sample_facets_v2 (5,980,282), samples_map_lite, and
h3_summary_res4/6/8; all schemas match. facet_summaries (+3) and
facet_cross_filter (+86) are schema-correct, with small deltas from the
202604-vs-202601 version gap + the original cross-filter pruning self-pairs
(this build is an exhaustive superset) — can be reconciled if exact parity is needed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…sts (isamplesorg#273)

Rebuilds the Stage-4 derived-parquet pipeline as a real, tested, human-runnable
system (no AI in the loop to trust). Closes defects found by EXECUTION that
document/AI review missed.

build_frontend_derived.py (rewrite):
- geometry-agnostic (WKB BLOB *or* DuckDB GEOMETRY) — fixes the silent
  BinderException on 202601/Zenodo wides
- decorrelated concept resolution (unnest+arg_min + joins) — fixes the
  MAP-cross-join perf blowup (>16 min -> 5.4 s on the 20M-row wide)
- material = first NON-ROOT concept (isamplesorg#265/isamplesorg#271); deterministic COPY ORDER BY +
  tie-broken dominant_source + rounded centroids
- strict CLI (unknown --only/--skip fails; --tag required)
- emits {tag}_manifest.json: input/output sha256, argv, git SHA, DuckDB +
  extension versions (machine-checkable build identity)
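
A rough sketch of what a machine-checkable manifest of this kind could look like. The field names, file names, and helper functions here are assumptions for illustration; the real `{tag}_manifest.json` also records the git SHA and DuckDB/extension versions, which are omitted.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Stream a file through SHA-256 so large parquet files need not fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def build_manifest(tag, input_path, output_paths, argv):
    # Hypothetical layout: one input digest, one digest per output file.
    return {
        "tag": tag,
        "argv": list(argv),
        "input": {"name": input_path.name, "sha256": sha256_of(input_path)},
        "outputs": {p.name: sha256_of(p) for p in output_paths},
    }


def mismatched_outputs(manifest, out_dir: Path):
    """Return the output names whose current digest differs from the manifest."""
    return sorted(
        name
        for name, digest in manifest["outputs"].items()
        if sha256_of(out_dir / name) != digest
    )


with tempfile.TemporaryDirectory() as tmp:
    out_dir = Path(tmp)
    wide = out_dir / "wide.parquet"
    wide.write_bytes(b"placeholder wide bytes")
    facets = out_dir / "demo_sample_facets_v2.parquet"
    facets.write_bytes(b"placeholder facets bytes")

    manifest = build_manifest("demo", wide, [facets], ["build", "--tag", "demo"])
    (out_dir / "demo_manifest.json").write_text(json.dumps(manifest, indent=2))

    clean = mismatched_outputs(manifest, out_dir)      # nothing changed yet
    facets.write_bytes(b"tampered")
    tampered = mismatched_outputs(manifest, out_dir)   # digest no longer matches
```

A manifest like this is self-attesting rather than signed: it detects accidental drift between files and their recorded digests, not a party who rewrites both.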

validate_frontend_derived.py (new, algebraic gate):
- asserts the derived-file ALGEBRA, not spot checks: summaries == GROUP BY
  facets; cross_filter == conditional GROUP BY; facets.pid == map_lite.pid;
  pid uniqueness; H3 counts sum to map_lite; schema. Non-zero exit on failure.
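
To make the "algebra, not spot checks" idea concrete, here is a minimal sketch of three of the listed identities. It uses sqlite3 as a stand-in for DuckDB, and the table shapes (`facets`, `summaries`, `map_lite` with the columns shown) are simplified assumptions, not the real schemas.

```python
import sqlite3


def symmetric_diff_count(con, left_sql, right_sql):
    """Rows present in exactly one of the two result sets (0 means equal as sets)."""
    return con.execute(
        f"""
        SELECT COUNT(*) FROM (
            SELECT * FROM ({left_sql} EXCEPT {right_sql})
            UNION ALL
            SELECT * FROM ({right_sql} EXCEPT {left_sql})
        )
        """
    ).fetchone()[0]


def algebraic_gate(con):
    """Return the list of violated identities; an empty list means the gate passes."""
    failures = []
    if symmetric_diff_count(
        con,
        "SELECT source, COUNT(*) AS n FROM facets GROUP BY source",
        "SELECT source, n FROM summaries",
    ):
        failures.append("summaries != GROUP BY facets")
    if symmetric_diff_count(con, "SELECT pid FROM facets", "SELECT pid FROM map_lite"):
        failures.append("facets.pid != map_lite.pid")
    total, distinct = con.execute(
        "SELECT COUNT(*), COUNT(DISTINCT pid) FROM facets"
    ).fetchone()
    if total != distinct:
        failures.append("duplicate pids in facets")
    return failures


con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE facets (pid TEXT, source TEXT);
    CREATE TABLE map_lite (pid TEXT);
    CREATE TABLE summaries (source TEXT, n INTEGER);
    INSERT INTO facets VALUES ('p1', 'A'), ('p2', 'A'), ('p3', 'B');
    INSERT INTO map_lite VALUES ('p1'), ('p2'), ('p3');
    INSERT INTO summaries VALUES ('A', 2), ('B', 1);
    """
)
clean_failures = algebraic_gate(con)

con.execute("UPDATE summaries SET n = 3 WHERE source = 'A'")  # simulate a wrong summary
broken_failures = algebraic_gate(con)
```

A real gate would end with something like `sys.exit(1 if failures else 0)` to produce the non-zero exit on failure.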

tests/test_frontend_derived.py (new): fixture unit tests over tiny synthetic
wides (BLOB + GEOMETRY), material/concept/place_name/CLI cases. 6 tests.

Makefile (wide/derived/validate/test/all), scripts/requirements.txt (duckdb
pinned), .github/workflows/pipeline-tests.yml (CI fixture gate).

DATA_PROVENANCE.md + SERIALIZATIONS.md reconciled with reality: Stage-4 now
scripted; geometry contract; non-reproducibility of deployed 202601 facets
(346,768 vs 528,983); version skew; h3 UBIGINT; cross_filter shape;
first-non-root vs leaf.

Scope hardened by adversarial Codex audit (epic isamplesorg#273). Supersedes isamplesorg#271.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…terminism (Codex round 2)

Codex PROVED the validator passed a wrecked rebuild (corrupted material/coords/H3
with self-consistent summaries -> exit 0). Fixes:
- validate_frontend_derived.py --wide: re-derive from the source wide and
  EXCEPT-diff the written facets/map_lite/h3 — catches corruption/stale/
  wrong-version that internal consistency cannot. Proven by a new test that
  corrupts coords (passes internal checks, FAILS the --wide gate). Passes on the
  real 202604 rebuild.
- builder HARD-fails on duplicate pids / duplicate concept row_ids (was a warning)
- --threads option; determinism claim made honest: facets/map_lite/summaries/
  cross_filter are byte-identical run-to-run (verified); float h3 centroids are
  display-only (compared on discrete cols only).
- tests: semantic-gate-catches-corruption, dup-pid-hard-fail, manifest, wide_h3 (10 total)
- docs: SERIALIZATIONS deployed-file caveat (202601 still has root rows) vs
  builder contract; DATA_PROVENANCE wide_h3 coverage precise.
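
The gap Codex found can be shown with a small sketch: corrupted coordinates remain internally consistent, and only a diff against a re-derivation from the source catches them. This uses sqlite3 in place of DuckDB, and the `wide` and `map_lite` shapes are simplified, hypothetical stand-ins.

```python
import sqlite3


def internal_checks_pass(con) -> bool:
    """Internal consistency only: pids in the written file are unique."""
    total, distinct = con.execute(
        "SELECT COUNT(*), COUNT(DISTINCT pid) FROM map_lite"
    ).fetchone()
    return total == distinct


def rederive_diff_count(con) -> int:
    """Re-derive map_lite from the source wide and EXCEPT-diff it in both directions."""
    rederived = "SELECT pid, lat, lon FROM wide"
    written = "SELECT pid, lat, lon FROM map_lite"
    return con.execute(
        f"""
        SELECT COUNT(*) FROM (
            SELECT * FROM ({rederived} EXCEPT {written})
            UNION ALL
            SELECT * FROM ({written} EXCEPT {rederived})
        )
        """
    ).fetchone()[0]


con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE wide (pid TEXT, lat REAL, lon REAL);
    CREATE TABLE map_lite (pid TEXT, lat REAL, lon REAL);
    INSERT INTO wide VALUES ('p1', 10.5, 20.5), ('p2', -3.25, 40.0);
    INSERT INTO map_lite SELECT pid, lat, lon FROM wide;
    """
)
clean_diff = rederive_diff_count(con)

con.execute("UPDATE map_lite SET lat = lat + 1.0")  # corrupt every coordinate
still_internally_ok = internal_checks_pass(con)     # corruption is self-consistent
corrupt_diff = rederive_diff_count(con)             # but no longer matches the source
```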

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…wide_h3 correctness (Codex round 3)

- validator --wide now also diffs h3 resolution (exact) + center_lat/lng
  (tolerant 1e-4: catches gross corruption, ignores float/thread last-ULP jitter)
- facet_summaries.scheme contract checked (must be NULL)
- wide_h3 cell correctness test (cross-checked vs map_lite)
- tests prove h3 center/resolution corruption + scheme corruption are caught (12 total)

Verified: 12 passed; real --wide gate exits 0 on the 202604 rebuild with the new
checks; h3 center delta 1e-6 (well within 1e-4).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ance + verify manifest (Codex/workflow round 4)

A proof workflow (independent re-exec + adversarial attack) found two real misses:
1. H3 centroids: shifting every cell center ~9m (8e-5 deg) passed the loose 1e-4
   tolerance. Tightened to 1e-5 (~1m); residual undetected error now bounded at
   ~1m on display-only centroids. Re-running the exact attack now FAILS the gate.
2. manifest.json was never validated — corrupting its sha256 attestations passed.
   Validator now verifies every output file's sha256 (and the input's, with
   --wide) against the manifest. (Self-attesting, not signed — documented.)

Both attacks re-run against the fixed gate now exit 1. Clean real rebuild still
exits 0. 14 fixture tests (added regressions for both misses).
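
The arithmetic behind the tolerance change, as a small sketch. The conversion factor of roughly 111.32 km per degree of latitude is an approximation, and the coordinates and function name are made up for illustration.

```python
METERS_PER_DEGREE_LAT = 111_320.0  # approximate length of one degree of latitude


def centers_match(expected, actual, tol):
    """Tolerant comparison of (lat, lng) cell centers, in degrees."""
    return (
        abs(expected[0] - actual[0]) <= tol
        and abs(expected[1] - actual[1]) <= tol
    )


SHIFT_DEG = 8e-5                                  # the attack: shift every center
shift_m = SHIFT_DEG * METERS_PER_DEGREE_LAT       # roughly 9 m on the ground

expected = (37.0, -122.0)                         # hypothetical cell center
attacked = (expected[0] + SHIFT_DEG, expected[1] + SHIFT_DEG)
jitter = (expected[0] + 1e-6, expected[1] - 1e-6)  # size of the observed real delta

passes_loose = centers_match(expected, attacked, 1e-4)  # old gate: attack slips through
passes_tight = centers_match(expected, attacked, 1e-5)  # new gate: attack is rejected
jitter_ok = centers_match(expected, jitter, 1e-5)       # real float jitter still passes
```

The tighter bound still leaves a full order of magnitude of headroom over the 1e-6 delta seen on the real rebuild.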

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…esorg#272, fixes isamplesorg#260)

Overlays Eric Kansa's OC PQG concept mappings onto the unified wide:
p__has_material_category / p__has_sample_object_type are REPLACED for OC
pids — OC wins unconditionally (RY decision 2026-06-10, isamplesorg#272). Mints
IdentifiedConcept rows for URIs the frozen export never had (e.g.
otheranthropogenicmaterial — the correct isamplesorg#260 value, absent entirely).
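
In plain Python, the "OC wins unconditionally" rule and the minting step might look like the sketch below. The row shape, the shortened column names, and the `mat:`/`obj:` URI prefixes are hypothetical; the real enricher is a single-pass DuckDB overlay.

```python
def overlay(src_rows, oc_by_pid):
    """Replace material/object-type lists for pids present in the OC mapping."""
    out = []
    for row in src_rows:
        merged = dict(row)
        oc = oc_by_pid.get(row["pid"])
        if oc is not None:
            # OC wins unconditionally: replace, never merge; list order is preserved.
            merged["material"] = list(oc["material"])
            merged["object_type"] = list(oc["object_type"])
        out.append(merged)
    return out


def uris_to_mint(oc_by_pid, known_uris):
    """URIs referenced by OC that the source has no concept row for, in sorted order."""
    used = {
        uri
        for oc in oc_by_pid.values()
        for key in ("material", "object_type")
        for uri in oc[key]
    }
    return sorted(used - set(known_uris))


src_rows = [
    {"pid": "oc-1", "material": ["mat:rock"], "object_type": ["obj:artifact"]},
    {"pid": "other-1", "material": ["mat:rock"], "object_type": ["obj:specimen"]},
]
oc_by_pid = {
    "oc-1": {
        "material": ["mat:otheranthropogenicmaterial"],
        "object_type": ["obj:artifact"],
    }
}
known_uris = {"mat:rock", "obj:artifact", "obj:specimen"}

enriched = overlay(src_rows, oc_by_pid)
minted = uris_to_mint(oc_by_pid, known_uris)
```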

- scripts/enrich_wide_with_oc_concepts.py: deterministic single-pass DuckDB
  overlay; ordered-list preservation; hard-fails on dup pids/row_ids and
  unresolved OC concept refs; emits .manifest.json (input shas, counts).
- scripts/validate_oc_concept_enrichment.py: INDEPENDENT trust gate — re-derives
  expected URI lists from (src, oc) with its own SQL; non-overlay rows must be
  byte-identical; minted set must be exactly the missing URIs; isamplesorg#260 sentinel.
- tests/test_oc_concept_enrichment.py: 13 fixture tests incl. unconditional-win,
  order preservation, determinism (bit-identical), validator tamper-detection,
  hard-failure modes.
- validate_frontend_derived.py: isamplesorg#260 sentinel parameterized by data vintage
  (--sentinel-material); default now the post-isamplesorg#272 corrected value.
- Makefile: oc-wide / enrich / validate-enrich / all-272 chain; CI runs both
  fixture suites.
- DATA_PROVENANCE.md: Stage 3 split into 3a (thumbnails) + 3b (OC concepts).

Scope (documented): overlay only — ~75K new OC records not ingested;
p__has_context_category untouched. Both follow-ups tracked in isamplesorg#272.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…acy sentinel

- validator: ALL src rows now compared on ALL non-replaced columns (keyed
  row_id hash join — scales to 20.7M, no full-table EXCEPT) [BLOCKER 1 + perf 4]
- validator: overlay pid SET equality + distinctness; sentinel absence is a
  FAILURE when present in inputs (was a silent N/A) [BLOCKER 2]
- validator: minted rows must carry OC label/scheme metadata
- Makefile: legacy chain passes --sentinel-material (pre-isamplesorg#272 value); all-272
  clears it to use the enriched default [MAJOR 3]
- enrich: document []->NULL normalization (pqg #8 convention) [MINOR 5]
- tests: +3 — both Codex attacks reproduced (verified to fool the OLD
  validator, caught by the fixed one) + empty-array normalization pin

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…n, make -j safety

- validator: minted rows now compared FULL-ROW against a re-derived
  expectation (deterministic ids max(src)+rank(uri); NULLs everywhere except
  pid/otype/label/scheme) — shifted-id and smuggled-column outputs now fail
- enricher + validator: hard-reject duplicate OC IdentifiedConcept row_ids
  (one reference must never fan out into several URIs)
- Makefile: $(ENRICHED) is a real file target; validate-enrich depends on it;
  all/all-272 use ordered sub-makes — safe under make -j
- tests: +3 regression (both round-2 attacks verified to fool the previous
  gate; dup-concept-row_id input rejected by both scripts)
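
A sketch of how a `max(src)+rank(uri)` id scheme yields a re-derivable expectation. The function names, the 1-based rank, and the column set are assumptions; the point is only that ids depend on nothing but the source maximum and the sorted URI set, so a validator can recompute them independently.

```python
def mint_ids(max_src_row_id, missing_uris):
    """Assign row_id = max(src) + rank(uri), with rank taken over sorted unique URIs."""
    ordered = sorted(set(missing_uris))
    return {uri: max_src_row_id + rank for rank, uri in enumerate(ordered, start=1)}


def expected_minted_rows(max_src_row_id, labels_by_uri, columns):
    """Full-row expectation: NULL (None) everywhere except the populated columns."""
    rows = []
    for uri, row_id in mint_ids(max_src_row_id, labels_by_uri).items():
        row = {col: None for col in columns}
        row.update(
            row_id=row_id, pid=uri, otype="IdentifiedConcept", label=labels_by_uri[uri]
        )
        rows.append(row)
    return rows


COLUMNS = ["row_id", "pid", "otype", "label", "description"]  # hypothetical subset
labels = {"uri:b": "B", "uri:a": "A"}

expected = expected_minted_rows(100, labels, COLUMNS)

shifted = [dict(r, row_id=r["row_id"] + 1) for r in expected]   # shifted-id attack
smuggled = [dict(r, description="extra") for r in expected]     # smuggled-column attack
```

Comparing full rows means `shifted != expected` and `smuggled != expected` both hold, whereas a check limited to the URI set would accept either output.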

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- validator: reject unresolved OC concept refs (inner joins were silently
  dropping them from the expectation — standalone runs could pass a wrong
  output) and duplicate OC MSR pids (grain parity with the enricher)
- validator: minted expectation uses NOT EXISTS (NULL-pid src concept made
  NOT IN evaluate UNKNOWN -> false failure)
- tests: +3 regression for all three findings
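
The `NOT IN` pitfall is standard SQL three-valued logic and can be reproduced in a few lines. sqlite3 stands in for DuckDB here, and the table names and URIs are hypothetical.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript(
    """
    CREATE TABLE src_concepts (pid TEXT);
    CREATE TABLE oc_concepts (pid TEXT);
    INSERT INTO src_concepts VALUES ('uri:a'), (NULL);  -- one source concept has a NULL pid
    INSERT INTO oc_concepts VALUES ('uri:a'), ('uri:b');
    """
)

# 'uri:b' NOT IN ('uri:a', NULL) evaluates to UNKNOWN, so the row is dropped.
via_not_in = con.execute(
    """
    SELECT pid FROM oc_concepts
    WHERE pid NOT IN (SELECT pid FROM src_concepts)
    ORDER BY pid
    """
).fetchall()

# NOT EXISTS only asks whether a matching row exists, so the NULL pid is harmless.
via_not_exists = con.execute(
    """
    SELECT o.pid FROM oc_concepts AS o
    WHERE NOT EXISTS (SELECT 1 FROM src_concepts AS s WHERE s.pid = o.pid)
    ORDER BY o.pid
    """
).fetchall()
```

With `NOT IN`, the expected minted set comes back empty even though `uri:b` is genuinely missing from the source, which is how a correct output could be reported as a failure.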

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
…#272)

All tagged data URLs 202601->202606; wide_url pinned to the explicit
versioned file (popups read corrected OC material/object-type from it).
current/wide.parquet alias stays on the previous wide until the production
cutover decision.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@rdhyee rdhyee merged commit be59404 into main Jun 10, 2026
1 check passed
rdhyee added a commit that referenced this pull request Jun 11, 2026
…acy sentinel
