Fix #172 for real: subquery filter-after-ANN (planner now uses the HNSW index at scale)#174
Merged
Merged
Conversation
…HNSW index The previously-merged #172 fix (089d592) was deployed and STILL 500'd in prod — verified live in prod logs: ERROR 57014 (query_canceled) "canceling statement" on suggest_links/3 (knowledge.ex:2844), well after the deploy. EXPLAIN against the real ~76k-row corpus showed a Seq Scan on articles + Sort (cost ~57k): the HNSW index was NOT chosen. The prior fix proved index ELIGIBILITY (a forced EXPLAIN with sort disabled) but never that the planner CHOOSES the index at scale — the skeptical-BA flagged exactly this gap and I closed the issue without prod verification. Owned. Root cause (EXPLAIN-verified on prod): pgvector's HNSW index only accelerates a PURE `ORDER BY embedding <=> $const LIMIT k`. The already-linked anti-join (a JOIN) and the `GREATEST(0,1-(embedding<=>$t)) > threshold` distance filter, in the SAME query, make the planner abandon the index and Seq-Scan the whole corpus. (search_semantic has neither, which is why it stays fast.) Fix: split into a subquery. The INNER query is the pure top-`pool` nearest by cosine (only index-safe filters: tenant/status/not-null/not-self/visibility — each verified on prod to keep the index). The OUTER query applies the anti-join + threshold + final limit over just `pool` rows. Verified on the prod corpus: inner uses `articles_embedding_hnsw_idx`, total EXPLAIN ANALYZE 73ms (was a statement timeout), no Seq Scan on articles — confirmed WITH the agent visibility filter (the actual failing path). `pool = max(limit*5, 100)` capped at 500; effective recall additionally bounded by hnsw.ef_search (~40, observed) — consistent with search_semantic; ample for the default limit 5. Documented. Test: the EXPLAIN guard now asserts the corpus is reached via the HNSW index with NO `Seq Scan on articles` (an outer sort over the small pool is expected) — the assertion that would have caught both #170 and the first #172 attempt. Structural + behavioral tests unchanged and green. Note: prod's index is named `articles_embedding_hnsw_idx`, but the migration (20260410022906) creates `articles_embedding_idx` — migration drift (prod index created out-of-band). Both are HNSW cosine; the query uses whichever exists. Flagged for the operator; not addressed here. Gate: compile --warnings-as-errors, format, credo --strict, dialyzer, 2532 tests — all green.
…IN strengthen) Round-2 focused adversarial review (security + engineer, Opus). Reshape confirmed sound on tenant isolation, visibility, anti-join, ordering, DoS (all disproven). Findings fixed: - A (security, MEDIUM): the anti-join/threshold now run AFTER the candidate-pool cut, so a densely-linked "hub" can return fewer than `limit` (or none) vs master's full-corpus exclusion. The reviewer's proposed alternative — move the anti-join back inside the index scan — is PROD-DISPROVEN: EXPLAIN of (anti-join, no threshold) on the real corpus is a Parallel Seq Scan + Sort (cost 105k), i.e. it reintroduces the 500. So the recall ceiling is intrinsic to the indexed path. Documented the exact mechanism and consequence on `suggest_links/3` and the controller OpenAPI description. - engineer: `suggestion_candidate_pool/1` could cap the pool below `limit` if an operator misconfigures `:max_suggestion_candidate_pool`. Floored at `limit` so the pool can never truncate below the requested count before exclusions run. - B (LOW): strengthened the EXPLAIN guard to require an actual `Index Scan using articles_embedding(_hnsw)?_idx` (not just the index name appearing), tolerant of the test-vs-prod index-name drift. - C (LOW): no audit log on the BYPASSRLS read path — pre-existing, identical on master, and a module-wide concern (every AdminRepo read); out of scope for this hotfix. Gate: compile --warnings-as-errors, format, credo --strict, dialyzer, 2532 tests — all green.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
The merged #172 fix (
089d592) was deployed and still 500'd in production — I verified this against the live system rather than trusting the green CI:fly logs:ERROR 57014 (query_canceled) "canceling statement"insuggest_links/3, firing after the deploy.EXPLAINon the real ~76k-row corpus:Seq Scan on articles+ Sort, cost ~57,000 — the HNSW index was never chosen. The prior EXPLAIN test only proved index eligibility (a forced plan with sort disabled), never that the planner picks the index at scale. That gap is exactly what the skeptical-BA reviewer flagged and I closed Verify follow-up (#168/#170): suggested_links STILL 500s in prod — unbounded full-corpus scan (scale), not param interpolation #172 without prod verification. Owned.Root cause (EXPLAIN-verified on prod): pgvector's HNSW index only accelerates a pure
ORDER BY embedding <=> $const LIMIT k. The already-linked anti-join (a JOIN) and the> thresholddistance filter, in the same query, make the planner abandon the index and full-scan. (search_semantichas neither — that's why it's fast.) Confirmed by isolating each: anti-join alone → Seq Scan (cost 105k); threshold alone → Seq Scan; pure ANN →Index Scan using articles_embedding_hnsw_idx(cost 742).Fix
Split into a subquery (the standard pgvector filter-after-ANN pattern, already used by
do_distant_pairs):poolnearest by cosine — only index-safe filters (tenant / status / not-null / not-self / visibility, each EXPLAIN-verified on prod to keep the index).limit, over justpoolrows.Verified on the real prod corpus, including the agent visibility filter (the actual failing path):
EXPLAIN ANALYZE= 73 ms,Index Scan using articles_embedding_hnsw_idx, no Seq Scan on articles.pool = max(limit×5, 100)capped 500, floored atlimit.Enhanced review (team round 1 already in branch history; focused adversarial round 2 here)
Security + engineer (Opus), every claim verified against code/prod:
limit. The alternative (anti-join inside the index scan) is prod-disproven (Seq Scan, cost 105k) — the ceiling is intrinsic to the indexed path. Documented onsuggest_links/3+ the OpenAPI description.limit(operator-misconfig hardening); EXPLAIN guard strengthened to require an actualIndex Scan usingthe HNSW index.Test that would have caught this
The EXPLAIN guard now asserts the corpus is reached via an
Index Scan using articles_embedding(_hnsw)?_idxwith noSeq Scan on articles(an outer sort over the small pool is expected) — the invariant that both #170 and the first #172 attempt violated. (Caveat: small-DB CI still can't reproduce the planner's cost-based choice at 76k rows — see the architectural note below.)Notes for the operator
articles_embedding_hnsw_idx, but migration20260410022906createsarticles_embedding_idx. Both are HNSW cosine; the query uses whichever exists. Worth reconciling.Gate: compile --warnings-as-errors, format, credo --strict, dialyzer, 2532 tests — all green.
Fixes #172