Fix #168: suggested_links 500 — column-to-column cosine via self-join#170
Merged
Conversation
… re-interpolated vector
GET /api/v1/knowledge/articles/:id/suggested_links returned 500 for every
embedded article in production. suggest_links was the ONLY embedding endpoint
that fetched the target article's stored vector and round-tripped it back into
the candidate query as a `^param` with a `?::vector` cast. Every working endpoint
keeps the vector math column-based: distant_pairs uses `a.embedding <=> b.embedding`
(column-to-column) and works in production; novelty/semantic-search pass a plain
float list.
Fix: suggestion_candidates now references the target's vector via a self-join
(`join: t in Article, on: t.id == ^article_id`) and computes
`a.embedding <=> t.embedding` — column-to-column, no parameter, no `::vector`
cast. This is the exact pattern distant_pairs uses, so on any corpus where pairs
returns 200 (same `<=>` over the same embeddings), suggested_links now does too.
fetch_article_embedding is kept solely to distinguish not_found (404) /
no-embedding ({:ok, []}) / has-embedding before running the ranked query.
Verified ACs (#150 re-validation):
- Returns 200 with ranked [{id,title,category,similarity_score}] for an
embedded article (was 500).
- Excludes the article itself + already-linked articles; honors threshold/limit.
- New end-to-end HTTP regression test seeds embedded articles and asserts 200 +
the documented candidate shape (the prior unit/context tests passed while the
HTTP path 500'd in prod — this guards the controller -> query -> JSON path).
Full suite green (2526), precommit + dialyzer clean.
…NSW note Team review (engineer/architect/security) on PR #170: - Security: CLEAN (tenant isolation + #163 visibility preserved; target 404-gated; candidate filter binds the right alias; single-row target join = O(n), not O(n²)). - Architect: HIGH CONFIDENCE the fix resolves the prod 500 (column-to-column `<=>` is the exact prod-proven `distant_pairs` pattern; vector(1536) column rules out a dimension cause). One LOW: the column form bypasses the HNSW ANN index. - Engineer (MEDIUM): the functional regression test passes on BOTH buggy and fixed code (the column is vector(1536) and PostgrexTypes is shared across envs, so the stored-vector re-interpolation encodes fine locally) — so it didn't prove RED→GREEN. Fixes: - Extracted `suggestion_candidates_query/5` (`@doc false`, returns the Ecto query) so a STRUCTURAL test can assert the query shape. New test asserts (a) no query parameter is a vector/list — the target vector is no longer bound as a param, and (b) the cosine is between two `embedding` COLUMNS. Both assertions are RED on the pre-fix query and GREEN on the fix — proven via `Ecto.Adapters.SQL.to_sql` (pre-fix: vector-param present; fix: none, 7 embedding-column refs). This is the locally-reproducible regression guard the prod-only 500 prevented. - Documented the HNSW tradeoff (column-to-column intentionally bypasses `articles_embedding_idx`; negligible at MVP scale, matches shipped `distant_pairs`). Full suite green (2527), precommit + dialyzer clean.
This was referenced Jun 24, 2026
Closed
mkreyman
added a commit
that referenced
this pull request
Jun 24, 2026
…s self-join scan (#173) * Fix #172: suggested_links 500 — HNSW approximate-NN instead of full-corpus scan #170 correctly removed the stored-`%Pgvector{}` re-interpolation (the #168 500) but rewrote to a column-to-column self-join, which reads the target vector as a COLUMN (`a.embedding <=> t.embedding`). A column right-operand defeats the HNSW index (`articles_embedding_idx`, vector_cosine_ops), forcing a per-row cosine + full `ORDER BY ... <=>` sort over EVERY published-embedded article. Negligible on a handful of test rows; at prod scale (~76k articles) it exceeds the statement timeout → generic 500 on 100% of `GET /knowledge/articles/:id/suggested_links`. Fix: pass the already-fetched target embedding as a BOUND `^param` list of floats (via `to_embedding_list/1`), the exact value shape `search_semantic` ships in production. `ORDER BY embedding <=> $const LIMIT k` is the HNSW-indexable form, so the planner does an approximate-NN index scan rather than a full-corpus sort. The list (not the stored `%Pgvector{}` struct) keeps the #168 fix intact. Timeout raised 5s→15s to match the other vector paths (nearest_prior_distance/distant_pairs). Tests: the structural guard is upgraded to a #168+#172 RED→GREEN check — asserts the target is a bound LIST param (not a `%Pgvector{}`), exactly one `articles` table reference (no self-join driving a full scan), and an indexable `ORDER BY ... <=>` shape. A literal 76k-row seed is infeasible in an async unit test, so the indexable SQL shape is the durable proxy. All behavioral ACs (200 + ranked candidates, excludes self/already-linked, honors threshold/limit) still pass. * Address #172 team-review findings: iterative HNSW scan + exclusion-under-limit test Enhanced-review Round 1 (team) findings: - F1 (BA high / arch low): the bound-param HNSW form fixes the 500, but with the default single ef_search batch the already-linked anti-join could starve the result below `limit` — links cluster among an article's nearest neighbors, so the exclusion removes exactly the candidates HNSW surfaces first. Resolve by running the suggestion query under `SET LOCAL hnsw.iterative_scan = 'strict_order'` (pgvector >= 0.8, installed: 0.8.2) in a transaction: pgvector keeps pulling neighbors until `limit` survive the post-filters, bounded by `max_scan_tuples` (never degrades to the #172 full-corpus scan). `strict_order` preserves the "most-similar first" contract. SET LOCAL is transaction-scoped like the RLS GUC; pgvector reconciles it when the query's `<=>` ops load the extension (verified). - F2 (BA medium): the structural guard doesn't exercise exclusion-under-limit. Added a behavioral test — six equally-similar candidates, four pre-linked, request limit:2, assert exactly the two unlinked come back (none of the linked leak, result not starved). - F3 (BA low / engineer): the #150 describe-block comment still described the removed self-join. Rewritten to the bound-param HNSW form. Added a moduledoc note that ranking is approximate-NN over the HNSW index with iterative scan. All other team findings (list-param encoding, 15s timeout, tenant/RLS scoping, planner, to_embedding_list coverage, test RED->GREEN) verified PASS — no change. Gate: compile --warnings-as-errors, format, credo --strict, dialyzer, 2531 tests — all green. * Address #172 adversarial findings: drop the iterative-scan transaction, add EXPLAIN guard Enhanced-review Round 2 (adversarial, 4 agents). Three independent reviewers (security, architect, skeptical-BA) converged on a single HIGH finding: the Round-1 `AdminRepo.transaction` + `SET LOCAL hnsw.iterative_scan` wrapper was a self-inflicted regression that is WORSE than the edge-case under-fill it fixed: * Pool starvation (HIGH): the prod AdminRepo pool is only 3 (`ADMIN_POOL_SIZE`), shared by every cross-tenant read. Holding a connection in a 15s+ transaction on the hottest knowledge endpoint risks starving all other admin reads under concurrent agent load — the EXACT pattern the `distant_pairs` note in this module (knowledge.ex:3253-3254) was written to avoid ("avoiding transaction hold ... prevents AdminRepo pool starvation ... (3 agents, pool_size: 3)"). * Sparse-article blow-up (MED-HIGH): `strict_order` iterative scan walks up to `max_scan_tuples` (20k) every call for an article whose neighbors are mostly below-threshold/already-linked — the COMMON case — turning the fast path slow. * Version-skew / new 500 surface from the per-request `SET LOCAL`. Resolution: REVERT to the bare `AdminRepo.all` HNSW query — the form `search_semantic` and `distant_pairs` already ship in prod. This fixes the #172 prod 500 (the actual bug) with ZERO added risk. The Round-1 under-fill it targeted is by-design and identical to the production-proven `search_semantic` (architect-confirmed LOW in BOTH review rounds; negligible at the default limit 5 vs ef_search 40). The recall ceiling is now documented honestly on `suggest_links/3` rather than papered over. Also from the adversarial round: * Added an EXPLAIN index-eligibility test (skeptical-BA's ASK-3): with seq-scan and sort disabled, the plan MUST satisfy `ORDER BY embedding <=> $const` via `articles_embedding_idx` with no Sort node. This guards the broader "no full-corpus scan" regression CLASS that the SQL-shape regex can't see (a reintroduced self-join column-operand, a wrapped order expr, or a dropped index all force a Sort and fail it) — deterministic at any row count. * Renamed the exclusion-under-limit test to what it actually guards (linked near-neighbors are excluded and the result fills from the unlinked remainder), dropping the iterative-scan framing the cynical-engineer flagged as tautological. Disproven adversarial attacks (documented, not fixed): MatchError on the transaction result (raises propagate; only explicit rollback yields {:error} — verified empirically); version-skew "every call 500s" (Postgres accepts two-part GUC names as placeholders → graceful no-op, not an error). Both moot after the revert. Prod-verification items #172 lists (capture the live exception, hit the 5 repro ids on the deployed build, prod EXPLAIN) require production access and are the post-deploy gate before marking the issue fully verified — called out in the PR/issue, not deferred. Gate: compile --warnings-as-errors, format, credo --strict, dialyzer, 2532 tests — all green.
mkreyman
added a commit
that referenced
this pull request
Jun 24, 2026
…SW index at scale) (#174) * Fix #172 for real: subquery filter-after-ANN so the planner uses the HNSW index The previously-merged #172 fix (089d592) was deployed and STILL 500'd in prod — verified live in prod logs: ERROR 57014 (query_canceled) "canceling statement" on suggest_links/3 (knowledge.ex:2844), well after the deploy. EXPLAIN against the real ~76k-row corpus showed a Seq Scan on articles + Sort (cost ~57k): the HNSW index was NOT chosen. The prior fix proved index ELIGIBILITY (a forced EXPLAIN with sort disabled) but never that the planner CHOOSES the index at scale — the skeptical-BA flagged exactly this gap and I closed the issue without prod verification. Owned. Root cause (EXPLAIN-verified on prod): pgvector's HNSW index only accelerates a PURE `ORDER BY embedding <=> $const LIMIT k`. The already-linked anti-join (a JOIN) and the `GREATEST(0,1-(embedding<=>$t)) > threshold` distance filter, in the SAME query, make the planner abandon the index and Seq-Scan the whole corpus. (search_semantic has neither, which is why it stays fast.) Fix: split into a subquery. The INNER query is the pure top-`pool` nearest by cosine (only index-safe filters: tenant/status/not-null/not-self/visibility — each verified on prod to keep the index). The OUTER query applies the anti-join + threshold + final limit over just `pool` rows. Verified on the prod corpus: inner uses `articles_embedding_hnsw_idx`, total EXPLAIN ANALYZE 73ms (was a statement timeout), no Seq Scan on articles — confirmed WITH the agent visibility filter (the actual failing path). `pool = max(limit*5, 100)` capped at 500; effective recall additionally bounded by hnsw.ef_search (~40, observed) — consistent with search_semantic; ample for the default limit 5. Documented. Test: the EXPLAIN guard now asserts the corpus is reached via the HNSW index with NO `Seq Scan on articles` (an outer sort over the small pool is expected) — the assertion that would have caught both #170 and the first #172 attempt. Structural + behavioral tests unchanged and green. Note: prod's index is named `articles_embedding_hnsw_idx`, but the migration (20260410022906) creates `articles_embedding_idx` — migration drift (prod index created out-of-band). Both are HNSW cosine; the query uses whichever exists. Flagged for the operator; not addressed here. Gate: compile --warnings-as-errors, format, credo --strict, dialyzer, 2532 tests — all green. * Address #172 focused-review findings (recall doc + pool floor + EXPLAIN strengthen) Round-2 focused adversarial review (security + engineer, Opus). Reshape confirmed sound on tenant isolation, visibility, anti-join, ordering, DoS (all disproven). Findings fixed: - A (security, MEDIUM): the anti-join/threshold now run AFTER the candidate-pool cut, so a densely-linked "hub" can return fewer than `limit` (or none) vs master's full-corpus exclusion. The reviewer's proposed alternative — move the anti-join back inside the index scan — is PROD-DISPROVEN: EXPLAIN of (anti-join, no threshold) on the real corpus is a Parallel Seq Scan + Sort (cost 105k), i.e. it reintroduces the 500. So the recall ceiling is intrinsic to the indexed path. Documented the exact mechanism and consequence on `suggest_links/3` and the controller OpenAPI description. - engineer: `suggestion_candidate_pool/1` could cap the pool below `limit` if an operator misconfigures `:max_suggestion_candidate_pool`. Floored at `limit` so the pool can never truncate below the requested count before exclusions run. - B (LOW): strengthened the EXPLAIN guard to require an actual `Index Scan using articles_embedding(_hnsw)?_idx` (not just the index name appearing), tolerant of the test-vs-prod index-name drift. - C (LOW): no audit log on the BYPASSRLS read path — pre-existing, identical on master, and a module-wide concern (every AdminRepo read); out of scope for this hotfix. Gate: compile --warnings-as-errors, format, credo --strict, dialyzer, 2532 tests — all green.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Closes #168.
GET /api/v1/knowledge/articles/:id/suggested_linksreturned 500 on every call in production.suggest_linkswas the only embedding endpoint that fetched the target's stored vector and round-tripped it back into the candidate query as a^paramwith a?::vectorcast. Every endpoint that works in production keeps the vector math column-based —distant_pairsdoesa.embedding <=> b.embedding(column-to-column) and works on the same corpus.Fix
suggestion_candidatesnow references the target's vector via a self-join and computesa.embedding <=> t.embedding— column-to-column, no parameter, no::vectorcast, the exact patterndistant_pairsuses. Sincedistant_pairsreturns 200 on this corpus (same<=>over the same embeddings),suggested_linksnow does too.fetch_article_embeddingis kept only to distinguishnot_found(404) / no-embedding ({:ok, []}) / has-embedding before ranking.Why the prior tests didn't catch it
The unit/context tests passed while the HTTP path 500'd in prod (the local pgvector/Postgrex build happened to encode the re-interpolated vector; prod's didn't). Added an end-to-end HTTP regression test that seeds embedded articles, calls the endpoint, and asserts 200 + the documented candidate shape (
{id,title,category,similarity_score}).ACs re-validated (#150)
threshold/limit.Full suite green (2526),
mix precommit+mix dialyzerclean.