Fix #172: suggested_links 500 — HNSW approximate-NN, not a full-corpus self-join scan by mkreyman · Pull Request #173 · mkreyman/loopctl

mkreyman · 2026-06-24T15:11:22Z

Summary

Fixes #172 — GET /api/v1/knowledge/articles/:id/suggested_links still 500'd in prod after #170.

Root cause (corrected from #170): #170 removed the %Pgvector{} re-interpolation (the original #168 500) but rewrote the query to a column-to-column self-join that read the target vector as a column (a.embedding <=> t.embedding). A column right-operand can't use the HNSW index, so the query did a per-row cosine + full ORDER BY ... <=> sort over every published-embedded article. Negligible on a handful of test rows; at prod scale (~76k articles) it exceeds the statement timeout → generic 500 on 100% of calls.

Fix: pass the already-fetched target embedding as a bound [float()] list param — the exact shape search_semantic and distant_pairs ship in production. ORDER BY embedding <=> $const LIMIT k is the HNSW-indexable form (articles_embedding_idx, vector_cosine_ops), so the planner does an approximate-NN index scan instead of a full-corpus sort. Converting the stored %Pgvector{} to a plain list (not binding the struct) keeps the #168 fix intact.

The query stays a bare AdminRepo.all — no transaction, matching search_semantic/distant_pairs, which deliberately avoid holding one of the 3 AdminRepo connections (see the distant_pairs pool-starvation note in the same module).

Enhanced review (team + adversarial, Opus)

Both rounds ran; every agent claim was independently verified against the code/tests.

Round 1 (team): flagged that a single HNSW ef_search batch could let the already-linked anti-join under-fill the result below limit. An initial fix added SET LOCAL hnsw.iterative_scan in a transaction.
Round 2 (adversarial): three independent reviewers (security, architect, skeptical-BA) converged that the transaction wrapper was worse than the bug it fixed — pool starvation on the 3-connection admin pool (the hottest knowledge endpoint), a 20k-tuple max_scan_tuples walk on the common sparse-article case, and a new SET LOCAL failure surface. Resolution: reverted it. The under-fill it targeted is by-design and identical to the production-proven search_semantic (architect-rated LOW in both rounds; negligible at the default limit 5 vs ef_search 40), now documented honestly on suggest_links/3.
Disproven adversarial attacks (verified empirically, not fixed): MatchError on the transaction result (raises propagate; only rollback yields {:error}); "version-skew 500" (Postgres treats two-part GUC names as placeholders → graceful no-op). Both moot after the revert.

Tests

EXPLAIN index-eligibility guard (the issue's "test that exhibits scale" ask): with seq-scan and sort disabled, the plan must satisfy the ORDER BY via articles_embedding_idx with no Sort node — catches the broad "reintroduced full-corpus scan" regression class (self-join column-operand, wrapped order expr, dropped index) that a small-data behavioral test can't, deterministically at any row count.
Structural SQL-shape guard (bound list param, single articles table, no self-join).
Behavioral ACs: 200 + ranked candidates; excludes self + already-linked (either direction); honors threshold/limit; exclusion-under-limit (linked near-neighbors excluded, result filled from the unlinked remainder); published-only; tenant isolation; read-only.

Gate: compile --warnings-as-errors, format, credo --strict, dialyzer, 2532 tests — all green.

Post-deploy verification (operator gate before marking #172 fully verified)

The prod-only 500 can't be reproduced at test scale, so before closing #172 as verified-in-prod, on the deployed build: hit the 5 repro ids from the issue and confirm 200 + ranked candidates, and (optionally) capture a prod EXPLAIN ANALYZE showing the index scan. These need production access and are not deferrals of any code change.

Closes #172

…orpus scan #170 correctly removed the stored-`%Pgvector{}` re-interpolation (the #168 500) but rewrote to a column-to-column self-join, which reads the target vector as a COLUMN (`a.embedding <=> t.embedding`). A column right-operand defeats the HNSW index (`articles_embedding_idx`, vector_cosine_ops), forcing a per-row cosine + full `ORDER BY ... <=>` sort over EVERY published-embedded article. Negligible on a handful of test rows; at prod scale (~76k articles) it exceeds the statement timeout → generic 500 on 100% of `GET /knowledge/articles/:id/suggested_links`. Fix: pass the already-fetched target embedding as a BOUND `^param` list of floats (via `to_embedding_list/1`), the exact value shape `search_semantic` ships in production. `ORDER BY embedding <=> $const LIMIT k` is the HNSW-indexable form, so the planner does an approximate-NN index scan rather than a full-corpus sort. The list (not the stored `%Pgvector{}` struct) keeps the #168 fix intact. Timeout raised 5s→15s to match the other vector paths (nearest_prior_distance/distant_pairs). Tests: the structural guard is upgraded to a #168+#172 RED→GREEN check — asserts the target is a bound LIST param (not a `%Pgvector{}`), exactly one `articles` table reference (no self-join driving a full scan), and an indexable `ORDER BY ... <=>` shape. A literal 76k-row seed is infeasible in an async unit test, so the indexable SQL shape is the durable proxy. All behavioral ACs (200 + ranked candidates, excludes self/already-linked, honors threshold/limit) still pass.

…der-limit test Enhanced-review Round 1 (team) findings: - F1 (BA high / arch low): the bound-param HNSW form fixes the 500, but with the default single ef_search batch the already-linked anti-join could starve the result below `limit` — links cluster among an article's nearest neighbors, so the exclusion removes exactly the candidates HNSW surfaces first. Resolve by running the suggestion query under `SET LOCAL hnsw.iterative_scan = 'strict_order'` (pgvector >= 0.8, installed: 0.8.2) in a transaction: pgvector keeps pulling neighbors until `limit` survive the post-filters, bounded by `max_scan_tuples` (never degrades to the #172 full-corpus scan). `strict_order` preserves the "most-similar first" contract. SET LOCAL is transaction-scoped like the RLS GUC; pgvector reconciles it when the query's `<=>` ops load the extension (verified). - F2 (BA medium): the structural guard doesn't exercise exclusion-under-limit. Added a behavioral test — six equally-similar candidates, four pre-linked, request limit:2, assert exactly the two unlinked come back (none of the linked leak, result not starved). - F3 (BA low / engineer): the #150 describe-block comment still described the removed self-join. Rewritten to the bound-param HNSW form. Added a moduledoc note that ranking is approximate-NN over the HNSW index with iterative scan. All other team findings (list-param encoding, 15s timeout, tenant/RLS scoping, planner, to_embedding_list coverage, test RED->GREEN) verified PASS — no change. Gate: compile --warnings-as-errors, format, credo --strict, dialyzer, 2531 tests — all green.

…n, add EXPLAIN guard Enhanced-review Round 2 (adversarial, 4 agents). Three independent reviewers (security, architect, skeptical-BA) converged on a single HIGH finding: the Round-1 `AdminRepo.transaction` + `SET LOCAL hnsw.iterative_scan` wrapper was a self-inflicted regression that is WORSE than the edge-case under-fill it fixed: * Pool starvation (HIGH): the prod AdminRepo pool is only 3 (`ADMIN_POOL_SIZE`), shared by every cross-tenant read. Holding a connection in a 15s+ transaction on the hottest knowledge endpoint risks starving all other admin reads under concurrent agent load — the EXACT pattern the `distant_pairs` note in this module (knowledge.ex:3253-3254) was written to avoid ("avoiding transaction hold ... prevents AdminRepo pool starvation ... (3 agents, pool_size: 3)"). * Sparse-article blow-up (MED-HIGH): `strict_order` iterative scan walks up to `max_scan_tuples` (20k) every call for an article whose neighbors are mostly below-threshold/already-linked — the COMMON case — turning the fast path slow. * Version-skew / new 500 surface from the per-request `SET LOCAL`. Resolution: REVERT to the bare `AdminRepo.all` HNSW query — the form `search_semantic` and `distant_pairs` already ship in prod. This fixes the #172 prod 500 (the actual bug) with ZERO added risk. The Round-1 under-fill it targeted is by-design and identical to the production-proven `search_semantic` (architect-confirmed LOW in BOTH review rounds; negligible at the default limit 5 vs ef_search 40). The recall ceiling is now documented honestly on `suggest_links/3` rather than papered over. Also from the adversarial round: * Added an EXPLAIN index-eligibility test (skeptical-BA's ASK-3): with seq-scan and sort disabled, the plan MUST satisfy `ORDER BY embedding <=> $const` via `articles_embedding_idx` with no Sort node. This guards the broader "no full-corpus scan" regression CLASS that the SQL-shape regex can't see (a reintroduced self-join column-operand, a wrapped order expr, or a dropped index all force a Sort and fail it) — deterministic at any row count. * Renamed the exclusion-under-limit test to what it actually guards (linked near-neighbors are excluded and the result fills from the unlinked remainder), dropping the iterative-scan framing the cynical-engineer flagged as tautological. Disproven adversarial attacks (documented, not fixed): MatchError on the transaction result (raises propagate; only explicit rollback yields {:error} — verified empirically); version-skew "every call 500s" (Postgres accepts two-part GUC names as placeholders → graceful no-op, not an error). Both moot after the revert. Prod-verification items #172 lists (capture the live exception, hit the 5 repro ids on the deployed build, prod EXPLAIN) require production access and are the post-deploy gate before marking the issue fully verified — called out in the PR/issue, not deferred. Gate: compile --warnings-as-errors, format, credo --strict, dialyzer, 2532 tests — all green.

mkreyman added 3 commits June 24, 2026 08:46

mkreyman merged commit 089d592 into master Jun 24, 2026
6 checks passed

mkreyman deleted the fix/172-suggested-links-hnsw-scale branch June 24, 2026 15:14

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix #172: suggested_links 500 — HNSW approximate-NN, not a full-corpus self-join scan#173

Fix #172: suggested_links 500 — HNSW approximate-NN, not a full-corpus self-join scan#173
mkreyman merged 3 commits into
masterfrom
fix/172-suggested-links-hnsw-scale

mkreyman commented Jun 24, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

mkreyman commented Jun 24, 2026

Summary

Enhanced review (team + adversarial, Opus)

Tests

Post-deploy verification (operator gate before marking #172 fully verified)

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant