Fix #168: suggested_links 500 — column-to-column cosine via self-join by mkreyman · Pull Request #170 · mkreyman/loopctl

mkreyman · 2026-06-24T05:30:18Z

Summary

Closes #168. GET /api/v1/knowledge/articles/:id/suggested_links returned 500 on every call in production. suggest_links was the only embedding endpoint that fetched the target's stored vector and round-tripped it back into the candidate query as a ^param with a ?::vector cast. Every endpoint that works in production keeps the vector math column-based — distant_pairs does a.embedding <=> b.embedding (column-to-column) and works on the same corpus.

Fix

suggestion_candidates now references the target's vector via a self-join and computes a.embedding <=> t.embedding — column-to-column, no parameter, no ::vector cast, the exact pattern distant_pairs uses. Since distant_pairs returns 200 on this corpus (same <=> over the same embeddings), suggested_links now does too. fetch_article_embedding is kept only to distinguish not_found (404) / no-embedding ({:ok, []}) / has-embedding before ranking.

Why the prior tests didn't catch it

The unit/context tests passed while the HTTP path 500'd in prod (the local pgvector/Postgrex build happened to encode the re-interpolated vector; prod's didn't). Added an end-to-end HTTP regression test that seeds embedded articles, calls the endpoint, and asserts 200 + the documented candidate shape ({id,title,category,similarity_score}).

ACs re-validated (#150)

Returns 200 with ranked candidates for an embedded article (was 500).
Excludes the article itself + already-linked articles; honors threshold/limit.
End-to-end integration test (controller → query → JSON) asserts 200 + shape.

Full suite green (2526), mix precommit + mix dialyzer clean.

… re-interpolated vector GET /api/v1/knowledge/articles/:id/suggested_links returned 500 for every embedded article in production. suggest_links was the ONLY embedding endpoint that fetched the target article's stored vector and round-tripped it back into the candidate query as a `^param` with a `?::vector` cast. Every working endpoint keeps the vector math column-based: distant_pairs uses `a.embedding <=> b.embedding` (column-to-column) and works in production; novelty/semantic-search pass a plain float list. Fix: suggestion_candidates now references the target's vector via a self-join (`join: t in Article, on: t.id == ^article_id`) and computes `a.embedding <=> t.embedding` — column-to-column, no parameter, no `::vector` cast. This is the exact pattern distant_pairs uses, so on any corpus where pairs returns 200 (same `<=>` over the same embeddings), suggested_links now does too. fetch_article_embedding is kept solely to distinguish not_found (404) / no-embedding ({:ok, []}) / has-embedding before running the ranked query. Verified ACs (#150 re-validation): - Returns 200 with ranked [{id,title,category,similarity_score}] for an embedded article (was 500). - Excludes the article itself + already-linked articles; honors threshold/limit. - New end-to-end HTTP regression test seeds embedded articles and asserts 200 + the documented candidate shape (the prior unit/context tests passed while the HTTP path 500'd in prod — this guards the controller -> query -> JSON path). Full suite green (2526), precommit + dialyzer clean.

…NSW note Team review (engineer/architect/security) on PR #170: - Security: CLEAN (tenant isolation + #163 visibility preserved; target 404-gated; candidate filter binds the right alias; single-row target join = O(n), not O(n²)). - Architect: HIGH CONFIDENCE the fix resolves the prod 500 (column-to-column `<=>` is the exact prod-proven `distant_pairs` pattern; vector(1536) column rules out a dimension cause). One LOW: the column form bypasses the HNSW ANN index. - Engineer (MEDIUM): the functional regression test passes on BOTH buggy and fixed code (the column is vector(1536) and PostgrexTypes is shared across envs, so the stored-vector re-interpolation encodes fine locally) — so it didn't prove RED→GREEN. Fixes: - Extracted `suggestion_candidates_query/5` (`@doc false`, returns the Ecto query) so a STRUCTURAL test can assert the query shape. New test asserts (a) no query parameter is a vector/list — the target vector is no longer bound as a param, and (b) the cosine is between two `embedding` COLUMNS. Both assertions are RED on the pre-fix query and GREEN on the fix — proven via `Ecto.Adapters.SQL.to_sql` (pre-fix: vector-param present; fix: none, 7 embedding-column refs). This is the locally-reproducible regression guard the prod-only 500 prevented. - Documented the HNSW tradeoff (column-to-column intentionally bypasses `articles_embedding_idx`; negligible at MVP scale, matches shipped `distant_pairs`). Full suite green (2527), precommit + dialyzer clean.

…s self-join scan (#173) * Fix #172: suggested_links 500 — HNSW approximate-NN instead of full-corpus scan #170 correctly removed the stored-`%Pgvector{}` re-interpolation (the #168 500) but rewrote to a column-to-column self-join, which reads the target vector as a COLUMN (`a.embedding <=> t.embedding`). A column right-operand defeats the HNSW index (`articles_embedding_idx`, vector_cosine_ops), forcing a per-row cosine + full `ORDER BY ... <=>` sort over EVERY published-embedded article. Negligible on a handful of test rows; at prod scale (~76k articles) it exceeds the statement timeout → generic 500 on 100% of `GET /knowledge/articles/:id/suggested_links`. Fix: pass the already-fetched target embedding as a BOUND `^param` list of floats (via `to_embedding_list/1`), the exact value shape `search_semantic` ships in production. `ORDER BY embedding <=> $const LIMIT k` is the HNSW-indexable form, so the planner does an approximate-NN index scan rather than a full-corpus sort. The list (not the stored `%Pgvector{}` struct) keeps the #168 fix intact. Timeout raised 5s→15s to match the other vector paths (nearest_prior_distance/distant_pairs). Tests: the structural guard is upgraded to a #168+#172 RED→GREEN check — asserts the target is a bound LIST param (not a `%Pgvector{}`), exactly one `articles` table reference (no self-join driving a full scan), and an indexable `ORDER BY ... <=>` shape. A literal 76k-row seed is infeasible in an async unit test, so the indexable SQL shape is the durable proxy. All behavioral ACs (200 + ranked candidates, excludes self/already-linked, honors threshold/limit) still pass. * Address #172 team-review findings: iterative HNSW scan + exclusion-under-limit test Enhanced-review Round 1 (team) findings: - F1 (BA high / arch low): the bound-param HNSW form fixes the 500, but with the default single ef_search batch the already-linked anti-join could starve the result below `limit` — links cluster among an article's nearest neighbors, so the exclusion removes exactly the candidates HNSW surfaces first. Resolve by running the suggestion query under `SET LOCAL hnsw.iterative_scan = 'strict_order'` (pgvector >= 0.8, installed: 0.8.2) in a transaction: pgvector keeps pulling neighbors until `limit` survive the post-filters, bounded by `max_scan_tuples` (never degrades to the #172 full-corpus scan). `strict_order` preserves the "most-similar first" contract. SET LOCAL is transaction-scoped like the RLS GUC; pgvector reconciles it when the query's `<=>` ops load the extension (verified). - F2 (BA medium): the structural guard doesn't exercise exclusion-under-limit. Added a behavioral test — six equally-similar candidates, four pre-linked, request limit:2, assert exactly the two unlinked come back (none of the linked leak, result not starved). - F3 (BA low / engineer): the #150 describe-block comment still described the removed self-join. Rewritten to the bound-param HNSW form. Added a moduledoc note that ranking is approximate-NN over the HNSW index with iterative scan. All other team findings (list-param encoding, 15s timeout, tenant/RLS scoping, planner, to_embedding_list coverage, test RED->GREEN) verified PASS — no change. Gate: compile --warnings-as-errors, format, credo --strict, dialyzer, 2531 tests — all green. * Address #172 adversarial findings: drop the iterative-scan transaction, add EXPLAIN guard Enhanced-review Round 2 (adversarial, 4 agents). Three independent reviewers (security, architect, skeptical-BA) converged on a single HIGH finding: the Round-1 `AdminRepo.transaction` + `SET LOCAL hnsw.iterative_scan` wrapper was a self-inflicted regression that is WORSE than the edge-case under-fill it fixed: * Pool starvation (HIGH): the prod AdminRepo pool is only 3 (`ADMIN_POOL_SIZE`), shared by every cross-tenant read. Holding a connection in a 15s+ transaction on the hottest knowledge endpoint risks starving all other admin reads under concurrent agent load — the EXACT pattern the `distant_pairs` note in this module (knowledge.ex:3253-3254) was written to avoid ("avoiding transaction hold ... prevents AdminRepo pool starvation ... (3 agents, pool_size: 3)"). * Sparse-article blow-up (MED-HIGH): `strict_order` iterative scan walks up to `max_scan_tuples` (20k) every call for an article whose neighbors are mostly below-threshold/already-linked — the COMMON case — turning the fast path slow. * Version-skew / new 500 surface from the per-request `SET LOCAL`. Resolution: REVERT to the bare `AdminRepo.all` HNSW query — the form `search_semantic` and `distant_pairs` already ship in prod. This fixes the #172 prod 500 (the actual bug) with ZERO added risk. The Round-1 under-fill it targeted is by-design and identical to the production-proven `search_semantic` (architect-confirmed LOW in BOTH review rounds; negligible at the default limit 5 vs ef_search 40). The recall ceiling is now documented honestly on `suggest_links/3` rather than papered over. Also from the adversarial round: * Added an EXPLAIN index-eligibility test (skeptical-BA's ASK-3): with seq-scan and sort disabled, the plan MUST satisfy `ORDER BY embedding <=> $const` via `articles_embedding_idx` with no Sort node. This guards the broader "no full-corpus scan" regression CLASS that the SQL-shape regex can't see (a reintroduced self-join column-operand, a wrapped order expr, or a dropped index all force a Sort and fail it) — deterministic at any row count. * Renamed the exclusion-under-limit test to what it actually guards (linked near-neighbors are excluded and the result fills from the unlinked remainder), dropping the iterative-scan framing the cynical-engineer flagged as tautological. Disproven adversarial attacks (documented, not fixed): MatchError on the transaction result (raises propagate; only explicit rollback yields {:error} — verified empirically); version-skew "every call 500s" (Postgres accepts two-part GUC names as placeholders → graceful no-op, not an error). Both moot after the revert. Prod-verification items #172 lists (capture the live exception, hit the 5 repro ids on the deployed build, prod EXPLAIN) require production access and are the post-deploy gate before marking the issue fully verified — called out in the PR/issue, not deferred. Gate: compile --warnings-as-errors, format, credo --strict, dialyzer, 2532 tests — all green.

…SW index at scale) (#174) * Fix #172 for real: subquery filter-after-ANN so the planner uses the HNSW index The previously-merged #172 fix (089d592) was deployed and STILL 500'd in prod — verified live in prod logs: ERROR 57014 (query_canceled) "canceling statement" on suggest_links/3 (knowledge.ex:2844), well after the deploy. EXPLAIN against the real ~76k-row corpus showed a Seq Scan on articles + Sort (cost ~57k): the HNSW index was NOT chosen. The prior fix proved index ELIGIBILITY (a forced EXPLAIN with sort disabled) but never that the planner CHOOSES the index at scale — the skeptical-BA flagged exactly this gap and I closed the issue without prod verification. Owned. Root cause (EXPLAIN-verified on prod): pgvector's HNSW index only accelerates a PURE `ORDER BY embedding <=> $const LIMIT k`. The already-linked anti-join (a JOIN) and the `GREATEST(0,1-(embedding<=>$t)) > threshold` distance filter, in the SAME query, make the planner abandon the index and Seq-Scan the whole corpus. (search_semantic has neither, which is why it stays fast.) Fix: split into a subquery. The INNER query is the pure top-`pool` nearest by cosine (only index-safe filters: tenant/status/not-null/not-self/visibility — each verified on prod to keep the index). The OUTER query applies the anti-join + threshold + final limit over just `pool` rows. Verified on the prod corpus: inner uses `articles_embedding_hnsw_idx`, total EXPLAIN ANALYZE 73ms (was a statement timeout), no Seq Scan on articles — confirmed WITH the agent visibility filter (the actual failing path). `pool = max(limit*5, 100)` capped at 500; effective recall additionally bounded by hnsw.ef_search (~40, observed) — consistent with search_semantic; ample for the default limit 5. Documented. Test: the EXPLAIN guard now asserts the corpus is reached via the HNSW index with NO `Seq Scan on articles` (an outer sort over the small pool is expected) — the assertion that would have caught both #170 and the first #172 attempt. Structural + behavioral tests unchanged and green. Note: prod's index is named `articles_embedding_hnsw_idx`, but the migration (20260410022906) creates `articles_embedding_idx` — migration drift (prod index created out-of-band). Both are HNSW cosine; the query uses whichever exists. Flagged for the operator; not addressed here. Gate: compile --warnings-as-errors, format, credo --strict, dialyzer, 2532 tests — all green. * Address #172 focused-review findings (recall doc + pool floor + EXPLAIN strengthen) Round-2 focused adversarial review (security + engineer, Opus). Reshape confirmed sound on tenant isolation, visibility, anti-join, ordering, DoS (all disproven). Findings fixed: - A (security, MEDIUM): the anti-join/threshold now run AFTER the candidate-pool cut, so a densely-linked "hub" can return fewer than `limit` (or none) vs master's full-corpus exclusion. The reviewer's proposed alternative — move the anti-join back inside the index scan — is PROD-DISPROVEN: EXPLAIN of (anti-join, no threshold) on the real corpus is a Parallel Seq Scan + Sort (cost 105k), i.e. it reintroduces the 500. So the recall ceiling is intrinsic to the indexed path. Documented the exact mechanism and consequence on `suggest_links/3` and the controller OpenAPI description. - engineer: `suggestion_candidate_pool/1` could cap the pool below `limit` if an operator misconfigures `:max_suggestion_candidate_pool`. Floored at `limit` so the pool can never truncate below the requested count before exclusions run. - B (LOW): strengthened the EXPLAIN guard to require an actual `Index Scan using articles_embedding(_hnsw)?_idx` (not just the index name appearing), tolerant of the test-vs-prod index-name drift. - C (LOW): no audit log on the BYPASSRLS read path — pre-existing, identical on master, and a module-wide concern (every AdminRepo read); out of scope for this hotfix. Gate: compile --warnings-as-errors, format, credo --strict, dialyzer, 2532 tests — all green.

mkreyman added 3 commits June 23, 2026 23:29

Merge branch 'master' into fix/168-suggest-links-500

4e0761c

mkreyman merged commit 6529edc into master Jun 24, 2026
6 checks passed

mkreyman deleted the fix/168-suggest-links-500 branch June 24, 2026 05:58

mkreyman mentioned this pull request Jun 24, 2026

Fix #172 for real: subquery filter-after-ANN (planner now uses the HNSW index at scale) #174

Merged

mkreyman mentioned this pull request Jun 24, 2026

Epic: performance & scale hardening for the knowledge subsystem (reproducible · observable · index-correct at prod scale) #175

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix #168: suggested_links 500 — column-to-column cosine via self-join#170

Fix #168: suggested_links 500 — column-to-column cosine via self-join#170
mkreyman merged 3 commits into
masterfrom
fix/168-suggest-links-500

mkreyman commented Jun 24, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

mkreyman commented Jun 24, 2026

Summary

Fix

Why the prior tests didn't catch it

ACs re-validated (#150)

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant