Fix #172: suggested_links 500 — HNSW approximate-NN, not a full-corpus self-join scan#173
Merged
Merged
Conversation
…orpus scan #170 correctly removed the stored-`%Pgvector{}` re-interpolation (the #168 500) but rewrote to a column-to-column self-join, which reads the target vector as a COLUMN (`a.embedding <=> t.embedding`). A column right-operand defeats the HNSW index (`articles_embedding_idx`, vector_cosine_ops), forcing a per-row cosine + full `ORDER BY ... <=>` sort over EVERY published-embedded article. Negligible on a handful of test rows; at prod scale (~76k articles) it exceeds the statement timeout → generic 500 on 100% of `GET /knowledge/articles/:id/suggested_links`. Fix: pass the already-fetched target embedding as a BOUND `^param` list of floats (via `to_embedding_list/1`), the exact value shape `search_semantic` ships in production. `ORDER BY embedding <=> $const LIMIT k` is the HNSW-indexable form, so the planner does an approximate-NN index scan rather than a full-corpus sort. The list (not the stored `%Pgvector{}` struct) keeps the #168 fix intact. Timeout raised 5s→15s to match the other vector paths (nearest_prior_distance/distant_pairs). Tests: the structural guard is upgraded to a #168+#172 RED→GREEN check — asserts the target is a bound LIST param (not a `%Pgvector{}`), exactly one `articles` table reference (no self-join driving a full scan), and an indexable `ORDER BY ... <=>` shape. A literal 76k-row seed is infeasible in an async unit test, so the indexable SQL shape is the durable proxy. All behavioral ACs (200 + ranked candidates, excludes self/already-linked, honors threshold/limit) still pass.
…der-limit test Enhanced-review Round 1 (team) findings: - F1 (BA high / arch low): the bound-param HNSW form fixes the 500, but with the default single ef_search batch the already-linked anti-join could starve the result below `limit` — links cluster among an article's nearest neighbors, so the exclusion removes exactly the candidates HNSW surfaces first. Resolve by running the suggestion query under `SET LOCAL hnsw.iterative_scan = 'strict_order'` (pgvector >= 0.8, installed: 0.8.2) in a transaction: pgvector keeps pulling neighbors until `limit` survive the post-filters, bounded by `max_scan_tuples` (never degrades to the #172 full-corpus scan). `strict_order` preserves the "most-similar first" contract. SET LOCAL is transaction-scoped like the RLS GUC; pgvector reconciles it when the query's `<=>` ops load the extension (verified). - F2 (BA medium): the structural guard doesn't exercise exclusion-under-limit. Added a behavioral test — six equally-similar candidates, four pre-linked, request limit:2, assert exactly the two unlinked come back (none of the linked leak, result not starved). - F3 (BA low / engineer): the #150 describe-block comment still described the removed self-join. Rewritten to the bound-param HNSW form. Added a moduledoc note that ranking is approximate-NN over the HNSW index with iterative scan. All other team findings (list-param encoding, 15s timeout, tenant/RLS scoping, planner, to_embedding_list coverage, test RED->GREEN) verified PASS — no change. Gate: compile --warnings-as-errors, format, credo --strict, dialyzer, 2531 tests — all green.
…n, add EXPLAIN guard
Enhanced-review Round 2 (adversarial, 4 agents). Three independent reviewers
(security, architect, skeptical-BA) converged on a single HIGH finding: the
Round-1 `AdminRepo.transaction` + `SET LOCAL hnsw.iterative_scan` wrapper was a
self-inflicted regression that is WORSE than the edge-case under-fill it fixed:
* Pool starvation (HIGH): the prod AdminRepo pool is only 3 (`ADMIN_POOL_SIZE`),
shared by every cross-tenant read. Holding a connection in a 15s+ transaction
on the hottest knowledge endpoint risks starving all other admin reads under
concurrent agent load — the EXACT pattern the `distant_pairs` note in this
module (knowledge.ex:3253-3254) was written to avoid ("avoiding transaction
hold ... prevents AdminRepo pool starvation ... (3 agents, pool_size: 3)").
* Sparse-article blow-up (MED-HIGH): `strict_order` iterative scan walks up to
`max_scan_tuples` (20k) every call for an article whose neighbors are mostly
below-threshold/already-linked — the COMMON case — turning the fast path slow.
* Version-skew / new 500 surface from the per-request `SET LOCAL`.
Resolution: REVERT to the bare `AdminRepo.all` HNSW query — the form `search_semantic`
and `distant_pairs` already ship in prod. This fixes the #172 prod 500 (the actual
bug) with ZERO added risk. The Round-1 under-fill it targeted is by-design and
identical to the production-proven `search_semantic` (architect-confirmed LOW in
BOTH review rounds; negligible at the default limit 5 vs ef_search 40). The recall
ceiling is now documented honestly on `suggest_links/3` rather than papered over.
Also from the adversarial round:
* Added an EXPLAIN index-eligibility test (skeptical-BA's ASK-3): with seq-scan and
sort disabled, the plan MUST satisfy `ORDER BY embedding <=> $const` via
`articles_embedding_idx` with no Sort node. This guards the broader "no
full-corpus scan" regression CLASS that the SQL-shape regex can't see (a
reintroduced self-join column-operand, a wrapped order expr, or a dropped index
all force a Sort and fail it) — deterministic at any row count.
* Renamed the exclusion-under-limit test to what it actually guards (linked
near-neighbors are excluded and the result fills from the unlinked remainder),
dropping the iterative-scan framing the cynical-engineer flagged as tautological.
Disproven adversarial attacks (documented, not fixed): MatchError on the transaction
result (raises propagate; only explicit rollback yields {:error} — verified
empirically); version-skew "every call 500s" (Postgres accepts two-part GUC names
as placeholders → graceful no-op, not an error). Both moot after the revert.
Prod-verification items #172 lists (capture the live exception, hit the 5 repro ids on
the deployed build, prod EXPLAIN) require production access and are the post-deploy
gate before marking the issue fully verified — called out in the PR/issue, not deferred.
Gate: compile --warnings-as-errors, format, credo --strict, dialyzer, 2532 tests — all green.
This was referenced Jun 24, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes #172 —
GET /api/v1/knowledge/articles/:id/suggested_linksstill 500'd in prod after #170.Root cause (corrected from #170): #170 removed the
%Pgvector{}re-interpolation (the original #168 500) but rewrote the query to a column-to-column self-join that read the target vector as a column (a.embedding <=> t.embedding). A column right-operand can't use the HNSW index, so the query did a per-row cosine + fullORDER BY ... <=>sort over every published-embedded article. Negligible on a handful of test rows; at prod scale (~76k articles) it exceeds the statement timeout → generic 500 on 100% of calls.Fix: pass the already-fetched target embedding as a bound
[float()]list param — the exact shapesearch_semanticanddistant_pairsship in production.ORDER BY embedding <=> $const LIMIT kis the HNSW-indexable form (articles_embedding_idx,vector_cosine_ops), so the planner does an approximate-NN index scan instead of a full-corpus sort. Converting the stored%Pgvector{}to a plain list (not binding the struct) keeps the #168 fix intact.The query stays a bare
AdminRepo.all— no transaction, matchingsearch_semantic/distant_pairs, which deliberately avoid holding one of the 3AdminRepoconnections (see thedistant_pairspool-starvation note in the same module).Enhanced review (team + adversarial, Opus)
Both rounds ran; every agent claim was independently verified against the code/tests.
ef_searchbatch could let the already-linked anti-join under-fill the result belowlimit. An initial fix addedSET LOCAL hnsw.iterative_scanin a transaction.max_scan_tupleswalk on the common sparse-article case, and a newSET LOCALfailure surface. Resolution: reverted it. The under-fill it targeted is by-design and identical to the production-provensearch_semantic(architect-rated LOW in both rounds; negligible at the default limit 5 vsef_search40), now documented honestly onsuggest_links/3.rollbackyields{:error}); "version-skew 500" (Postgres treats two-part GUC names as placeholders → graceful no-op). Both moot after the revert.Tests
articles_embedding_idxwith no Sort node — catches the broad "reintroduced full-corpus scan" regression class (self-join column-operand, wrapped order expr, dropped index) that a small-data behavioral test can't, deterministically at any row count.articlestable, no self-join).threshold/limit; exclusion-under-limit (linked near-neighbors excluded, result filled from the unlinked remainder); published-only; tenant isolation; read-only.Gate:
compile --warnings-as-errors,format,credo --strict,dialyzer, 2532 tests — all green.Post-deploy verification (operator gate before marking #172 fully verified)
The prod-only 500 can't be reproduced at test scale, so before closing #172 as verified-in-prod, on the deployed build: hit the 5 repro ids from the issue and confirm 200 + ranked candidates, and (optionally) capture a prod
EXPLAIN ANALYZEshowing the index scan. These need production access and are not deferrals of any code change.Closes #172