Skip to content

Fix #172 for real: subquery filter-after-ANN (planner now uses the HNSW index at scale)#174

Merged
mkreyman merged 2 commits into
masterfrom
fix/172-suggested-links-subquery-ann
Jun 24, 2026
Merged

Fix #172 for real: subquery filter-after-ANN (planner now uses the HNSW index at scale)#174
mkreyman merged 2 commits into
masterfrom
fix/172-suggested-links-subquery-ann

Conversation

@mkreyman

Copy link
Copy Markdown
Owner

Summary

The merged #172 fix (089d592) was deployed and still 500'd in production — I verified this against the live system rather than trusting the green CI:

Root cause (EXPLAIN-verified on prod): pgvector's HNSW index only accelerates a pure ORDER BY embedding <=> $const LIMIT k. The already-linked anti-join (a JOIN) and the > threshold distance filter, in the same query, make the planner abandon the index and full-scan. (search_semantic has neither — that's why it's fast.) Confirmed by isolating each: anti-join alone → Seq Scan (cost 105k); threshold alone → Seq Scan; pure ANN → Index Scan using articles_embedding_hnsw_idx (cost 742).

Fix

Split into a subquery (the standard pgvector filter-after-ANN pattern, already used by do_distant_pairs):

  • INNER: pure top-pool nearest by cosine — only index-safe filters (tenant / status / not-null / not-self / visibility, each EXPLAIN-verified on prod to keep the index).
  • OUTER: the anti-join + threshold + final limit, over just pool rows.

Verified on the real prod corpus, including the agent visibility filter (the actual failing path): EXPLAIN ANALYZE = 73 ms, Index Scan using articles_embedding_hnsw_idx, no Seq Scan on articles.

pool = max(limit×5, 100) capped 500, floored at limit.

Enhanced review (team round 1 already in branch history; focused adversarial round 2 here)

Security + engineer (Opus), every claim verified against code/prod:

  • Sound: tenant isolation, visibility (filter forced onto the inner subquery), bidirectional anti-join, ordering, DoS bounds — all disproven as attacks.
  • Recall change (documented): exclusions now apply to the candidate pool, not the whole corpus, so a densely-linked hub may return fewer than limit. The alternative (anti-join inside the index scan) is prod-disproven (Seq Scan, cost 105k) — the ceiling is intrinsic to the indexed path. Documented on suggest_links/3 + the OpenAPI description.
  • Pool floored at limit (operator-misconfig hardening); EXPLAIN guard strengthened to require an actual Index Scan using the HNSW index.

Test that would have caught this

The EXPLAIN guard now asserts the corpus is reached via an Index Scan using articles_embedding(_hnsw)?_idx with no Seq Scan on articles (an outer sort over the small pool is expected) — the invariant that both #170 and the first #172 attempt violated. (Caveat: small-DB CI still can't reproduce the planner's cost-based choice at 76k rows — see the architectural note below.)

Notes for the operator

  • Migration drift: prod's index is articles_embedding_hnsw_idx, but migration 20260410022906 creates articles_embedding_idx. Both are HNSW cosine; the query uses whichever exists. Worth reconciling.
  • Verification gap: this class of bug (planner choice at scale) is invisible to the test suite. The durable fix is a scale-representative test/staging env + surfacing the real SQLSTATE instead of a generic 500 — proposed as a follow-up architectural track.

Gate: compile --warnings-as-errors, format, credo --strict, dialyzer, 2532 tests — all green.

Fixes #172

mkreyman added 2 commits June 24, 2026 09:45
…HNSW index

The previously-merged #172 fix (089d592) was deployed and STILL 500'd in prod —
verified live in prod logs: ERROR 57014 (query_canceled) "canceling statement" on
suggest_links/3 (knowledge.ex:2844), well after the deploy. EXPLAIN against the real
~76k-row corpus showed a Seq Scan on articles + Sort (cost ~57k): the HNSW index was
NOT chosen. The prior fix proved index ELIGIBILITY (a forced EXPLAIN with sort
disabled) but never that the planner CHOOSES the index at scale — the skeptical-BA
flagged exactly this gap and I closed the issue without prod verification. Owned.

Root cause (EXPLAIN-verified on prod): pgvector's HNSW index only accelerates a PURE
`ORDER BY embedding <=> $const LIMIT k`. The already-linked anti-join (a JOIN) and the
`GREATEST(0,1-(embedding<=>$t)) > threshold` distance filter, in the SAME query, make
the planner abandon the index and Seq-Scan the whole corpus. (search_semantic has
neither, which is why it stays fast.)

Fix: split into a subquery. The INNER query is the pure top-`pool` nearest by cosine
(only index-safe filters: tenant/status/not-null/not-self/visibility — each verified on
prod to keep the index). The OUTER query applies the anti-join + threshold + final
limit over just `pool` rows. Verified on the prod corpus: inner uses
`articles_embedding_hnsw_idx`, total EXPLAIN ANALYZE 73ms (was a statement timeout),
no Seq Scan on articles — confirmed WITH the agent visibility filter (the actual
failing path).

`pool = max(limit*5, 100)` capped at 500; effective recall additionally bounded by
hnsw.ef_search (~40, observed) — consistent with search_semantic; ample for the
default limit 5. Documented.

Test: the EXPLAIN guard now asserts the corpus is reached via the HNSW index with NO
`Seq Scan on articles` (an outer sort over the small pool is expected) — the assertion
that would have caught both #170 and the first #172 attempt. Structural + behavioral
tests unchanged and green.

Note: prod's index is named `articles_embedding_hnsw_idx`, but the migration
(20260410022906) creates `articles_embedding_idx` — migration drift (prod index
created out-of-band). Both are HNSW cosine; the query uses whichever exists. Flagged
for the operator; not addressed here.

Gate: compile --warnings-as-errors, format, credo --strict, dialyzer, 2532 tests — all green.
…IN strengthen)

Round-2 focused adversarial review (security + engineer, Opus). Reshape confirmed
sound on tenant isolation, visibility, anti-join, ordering, DoS (all disproven).
Findings fixed:

- A (security, MEDIUM): the anti-join/threshold now run AFTER the candidate-pool cut,
  so a densely-linked "hub" can return fewer than `limit` (or none) vs master's
  full-corpus exclusion. The reviewer's proposed alternative — move the anti-join back
  inside the index scan — is PROD-DISPROVEN: EXPLAIN of (anti-join, no threshold) on the
  real corpus is a Parallel Seq Scan + Sort (cost 105k), i.e. it reintroduces the 500.
  So the recall ceiling is intrinsic to the indexed path. Documented the exact mechanism
  and consequence on `suggest_links/3` and the controller OpenAPI description.

- engineer: `suggestion_candidate_pool/1` could cap the pool below `limit` if an operator
  misconfigures `:max_suggestion_candidate_pool`. Floored at `limit` so the pool can
  never truncate below the requested count before exclusions run.

- B (LOW): strengthened the EXPLAIN guard to require an actual `Index Scan using
  articles_embedding(_hnsw)?_idx` (not just the index name appearing), tolerant of the
  test-vs-prod index-name drift.

- C (LOW): no audit log on the BYPASSRLS read path — pre-existing, identical on master,
  and a module-wide concern (every AdminRepo read); out of scope for this hotfix.

Gate: compile --warnings-as-errors, format, credo --strict, dialyzer, 2532 tests — all green.
@mkreyman mkreyman merged commit 8a93029 into master Jun 24, 2026
6 checks passed
@mkreyman mkreyman deleted the fix/172-suggested-links-subquery-ann branch June 24, 2026 15:56
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Verify follow-up (#168/#170): suggested_links STILL 500s in prod — unbounded full-corpus scan (scale), not param interpolation

1 participant