Fix #172 for real: subquery filter-after-ANN (planner now uses the HNSW index at scale) by mkreyman · Pull Request #174 · mkreyman/loopctl

mkreyman · 2026-06-24T15:53:18Z

Summary

The merged #172 fix (089d592) was deployed and still 500'd in production — I verified this against the live system rather than trusting the green CI:

fly logs: ERROR 57014 (query_canceled) "canceling statement" in suggest_links/3, firing after the deploy.
EXPLAIN on the real ~76k-row corpus: Seq Scan on articles + Sort, cost ~57,000 — the HNSW index was never chosen. The prior EXPLAIN test only proved index eligibility (a forced plan with sort disabled), never that the planner picks the index at scale. That gap is exactly what the skeptical-BA reviewer flagged and I closed Verify follow-up (#168/#170): suggested_links STILL 500s in prod — unbounded full-corpus scan (scale), not param interpolation #172 without prod verification. Owned.

Root cause (EXPLAIN-verified on prod): pgvector's HNSW index only accelerates a pure ORDER BY embedding <=> $const LIMIT k. The already-linked anti-join (a JOIN) and the > threshold distance filter, in the same query, make the planner abandon the index and full-scan. (search_semantic has neither — that's why it's fast.) Confirmed by isolating each: anti-join alone → Seq Scan (cost 105k); threshold alone → Seq Scan; pure ANN → Index Scan using articles_embedding_hnsw_idx (cost 742).

Fix

Split into a subquery (the standard pgvector filter-after-ANN pattern, already used by do_distant_pairs):

INNER: pure top-pool nearest by cosine — only index-safe filters (tenant / status / not-null / not-self / visibility, each EXPLAIN-verified on prod to keep the index).
OUTER: the anti-join + threshold + final limit, over just pool rows.

Verified on the real prod corpus, including the agent visibility filter (the actual failing path): EXPLAIN ANALYZE = 73 ms, Index Scan using articles_embedding_hnsw_idx, no Seq Scan on articles.

pool = max(limit×5, 100) capped 500, floored at limit.

Enhanced review (team round 1 already in branch history; focused adversarial round 2 here)

Security + engineer (Opus), every claim verified against code/prod:

Sound: tenant isolation, visibility (filter forced onto the inner subquery), bidirectional anti-join, ordering, DoS bounds — all disproven as attacks.
Recall change (documented): exclusions now apply to the candidate pool, not the whole corpus, so a densely-linked hub may return fewer than limit. The alternative (anti-join inside the index scan) is prod-disproven (Seq Scan, cost 105k) — the ceiling is intrinsic to the indexed path. Documented on suggest_links/3 + the OpenAPI description.
Pool floored at limit (operator-misconfig hardening); EXPLAIN guard strengthened to require an actual Index Scan using the HNSW index.

Test that would have caught this

The EXPLAIN guard now asserts the corpus is reached via an Index Scan using articles_embedding(_hnsw)?_idx with no Seq Scan on articles (an outer sort over the small pool is expected) — the invariant that both #170 and the first #172 attempt violated. (Caveat: small-DB CI still can't reproduce the planner's cost-based choice at 76k rows — see the architectural note below.)

Notes for the operator

Migration drift: prod's index is articles_embedding_hnsw_idx, but migration 20260410022906 creates articles_embedding_idx. Both are HNSW cosine; the query uses whichever exists. Worth reconciling.
Verification gap: this class of bug (planner choice at scale) is invisible to the test suite. The durable fix is a scale-representative test/staging env + surfacing the real SQLSTATE instead of a generic 500 — proposed as a follow-up architectural track.

Gate: compile --warnings-as-errors, format, credo --strict, dialyzer, 2532 tests — all green.

Fixes #172

…HNSW index The previously-merged #172 fix (089d592) was deployed and STILL 500'd in prod — verified live in prod logs: ERROR 57014 (query_canceled) "canceling statement" on suggest_links/3 (knowledge.ex:2844), well after the deploy. EXPLAIN against the real ~76k-row corpus showed a Seq Scan on articles + Sort (cost ~57k): the HNSW index was NOT chosen. The prior fix proved index ELIGIBILITY (a forced EXPLAIN with sort disabled) but never that the planner CHOOSES the index at scale — the skeptical-BA flagged exactly this gap and I closed the issue without prod verification. Owned. Root cause (EXPLAIN-verified on prod): pgvector's HNSW index only accelerates a PURE `ORDER BY embedding <=> $const LIMIT k`. The already-linked anti-join (a JOIN) and the `GREATEST(0,1-(embedding<=>$t)) > threshold` distance filter, in the SAME query, make the planner abandon the index and Seq-Scan the whole corpus. (search_semantic has neither, which is why it stays fast.) Fix: split into a subquery. The INNER query is the pure top-`pool` nearest by cosine (only index-safe filters: tenant/status/not-null/not-self/visibility — each verified on prod to keep the index). The OUTER query applies the anti-join + threshold + final limit over just `pool` rows. Verified on the prod corpus: inner uses `articles_embedding_hnsw_idx`, total EXPLAIN ANALYZE 73ms (was a statement timeout), no Seq Scan on articles — confirmed WITH the agent visibility filter (the actual failing path). `pool = max(limit*5, 100)` capped at 500; effective recall additionally bounded by hnsw.ef_search (~40, observed) — consistent with search_semantic; ample for the default limit 5. Documented. Test: the EXPLAIN guard now asserts the corpus is reached via the HNSW index with NO `Seq Scan on articles` (an outer sort over the small pool is expected) — the assertion that would have caught both #170 and the first #172 attempt. Structural + behavioral tests unchanged and green. Note: prod's index is named `articles_embedding_hnsw_idx`, but the migration (20260410022906) creates `articles_embedding_idx` — migration drift (prod index created out-of-band). Both are HNSW cosine; the query uses whichever exists. Flagged for the operator; not addressed here. Gate: compile --warnings-as-errors, format, credo --strict, dialyzer, 2532 tests — all green.

…IN strengthen) Round-2 focused adversarial review (security + engineer, Opus). Reshape confirmed sound on tenant isolation, visibility, anti-join, ordering, DoS (all disproven). Findings fixed: - A (security, MEDIUM): the anti-join/threshold now run AFTER the candidate-pool cut, so a densely-linked "hub" can return fewer than `limit` (or none) vs master's full-corpus exclusion. The reviewer's proposed alternative — move the anti-join back inside the index scan — is PROD-DISPROVEN: EXPLAIN of (anti-join, no threshold) on the real corpus is a Parallel Seq Scan + Sort (cost 105k), i.e. it reintroduces the 500. So the recall ceiling is intrinsic to the indexed path. Documented the exact mechanism and consequence on `suggest_links/3` and the controller OpenAPI description. - engineer: `suggestion_candidate_pool/1` could cap the pool below `limit` if an operator misconfigures `:max_suggestion_candidate_pool`. Floored at `limit` so the pool can never truncate below the requested count before exclusions run. - B (LOW): strengthened the EXPLAIN guard to require an actual `Index Scan using articles_embedding(_hnsw)?_idx` (not just the index name appearing), tolerant of the test-vs-prod index-name drift. - C (LOW): no audit log on the BYPASSRLS read path — pre-existing, identical on master, and a module-wide concern (every AdminRepo read); out of scope for this hotfix. Gate: compile --warnings-as-errors, format, credo --strict, dialyzer, 2532 tests — all green.

mkreyman added 2 commits June 24, 2026 09:45

mkreyman merged commit 8a93029 into master Jun 24, 2026
6 checks passed

mkreyman deleted the fix/172-suggested-links-subquery-ann branch June 24, 2026 15:56

mkreyman mentioned this pull request Jun 24, 2026

Verify follow-up (#168/#170): suggested_links STILL 500s in prod — unbounded full-corpus scan (scale), not param interpolation #172

Closed

4 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix #172 for real: subquery filter-after-ANN (planner now uses the HNSW index at scale)#174

Fix #172 for real: subquery filter-after-ANN (planner now uses the HNSW index at scale)#174
mkreyman merged 2 commits into
masterfrom
fix/172-suggested-links-subquery-ann

mkreyman commented Jun 24, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

mkreyman commented Jun 24, 2026

Summary

Fix

Enhanced review (team round 1 already in branch history; focused adversarial round 2 here)

Test that would have caught this

Notes for the operator

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant