Skip to content

Fix #172: suggested_links 500 — HNSW approximate-NN, not a full-corpus self-join scan#173

Merged
mkreyman merged 3 commits into
masterfrom
fix/172-suggested-links-hnsw-scale
Jun 24, 2026
Merged

Fix #172: suggested_links 500 — HNSW approximate-NN, not a full-corpus self-join scan#173
mkreyman merged 3 commits into
masterfrom
fix/172-suggested-links-hnsw-scale

Conversation

@mkreyman

Copy link
Copy Markdown
Owner

Summary

Fixes #172GET /api/v1/knowledge/articles/:id/suggested_links still 500'd in prod after #170.

Root cause (corrected from #170): #170 removed the %Pgvector{} re-interpolation (the original #168 500) but rewrote the query to a column-to-column self-join that read the target vector as a column (a.embedding <=> t.embedding). A column right-operand can't use the HNSW index, so the query did a per-row cosine + full ORDER BY ... <=> sort over every published-embedded article. Negligible on a handful of test rows; at prod scale (~76k articles) it exceeds the statement timeout → generic 500 on 100% of calls.

Fix: pass the already-fetched target embedding as a bound [float()] list param — the exact shape search_semantic and distant_pairs ship in production. ORDER BY embedding <=> $const LIMIT k is the HNSW-indexable form (articles_embedding_idx, vector_cosine_ops), so the planner does an approximate-NN index scan instead of a full-corpus sort. Converting the stored %Pgvector{} to a plain list (not binding the struct) keeps the #168 fix intact.

The query stays a bare AdminRepo.all — no transaction, matching search_semantic/distant_pairs, which deliberately avoid holding one of the 3 AdminRepo connections (see the distant_pairs pool-starvation note in the same module).

Enhanced review (team + adversarial, Opus)

Both rounds ran; every agent claim was independently verified against the code/tests.

  • Round 1 (team): flagged that a single HNSW ef_search batch could let the already-linked anti-join under-fill the result below limit. An initial fix added SET LOCAL hnsw.iterative_scan in a transaction.
  • Round 2 (adversarial): three independent reviewers (security, architect, skeptical-BA) converged that the transaction wrapper was worse than the bug it fixed — pool starvation on the 3-connection admin pool (the hottest knowledge endpoint), a 20k-tuple max_scan_tuples walk on the common sparse-article case, and a new SET LOCAL failure surface. Resolution: reverted it. The under-fill it targeted is by-design and identical to the production-proven search_semantic (architect-rated LOW in both rounds; negligible at the default limit 5 vs ef_search 40), now documented honestly on suggest_links/3.
  • Disproven adversarial attacks (verified empirically, not fixed): MatchError on the transaction result (raises propagate; only rollback yields {:error}); "version-skew 500" (Postgres treats two-part GUC names as placeholders → graceful no-op). Both moot after the revert.

Tests

  • EXPLAIN index-eligibility guard (the issue's "test that exhibits scale" ask): with seq-scan and sort disabled, the plan must satisfy the ORDER BY via articles_embedding_idx with no Sort node — catches the broad "reintroduced full-corpus scan" regression class (self-join column-operand, wrapped order expr, dropped index) that a small-data behavioral test can't, deterministically at any row count.
  • Structural SQL-shape guard (bound list param, single articles table, no self-join).
  • Behavioral ACs: 200 + ranked candidates; excludes self + already-linked (either direction); honors threshold/limit; exclusion-under-limit (linked near-neighbors excluded, result filled from the unlinked remainder); published-only; tenant isolation; read-only.

Gate: compile --warnings-as-errors, format, credo --strict, dialyzer, 2532 tests — all green.

Post-deploy verification (operator gate before marking #172 fully verified)

The prod-only 500 can't be reproduced at test scale, so before closing #172 as verified-in-prod, on the deployed build: hit the 5 repro ids from the issue and confirm 200 + ranked candidates, and (optionally) capture a prod EXPLAIN ANALYZE showing the index scan. These need production access and are not deferrals of any code change.

Closes #172

mkreyman added 3 commits June 24, 2026 08:46
…orpus scan

#170 correctly removed the stored-`%Pgvector{}` re-interpolation (the #168 500)
but rewrote to a column-to-column self-join, which reads the target vector as a
COLUMN (`a.embedding <=> t.embedding`). A column right-operand defeats the HNSW
index (`articles_embedding_idx`, vector_cosine_ops), forcing a per-row cosine +
full `ORDER BY ... <=>` sort over EVERY published-embedded article. Negligible on
a handful of test rows; at prod scale (~76k articles) it exceeds the statement
timeout → generic 500 on 100% of `GET /knowledge/articles/:id/suggested_links`.

Fix: pass the already-fetched target embedding as a BOUND `^param` list of floats
(via `to_embedding_list/1`), the exact value shape `search_semantic` ships in
production. `ORDER BY embedding <=> $const LIMIT k` is the HNSW-indexable form, so
the planner does an approximate-NN index scan rather than a full-corpus sort. The
list (not the stored `%Pgvector{}` struct) keeps the #168 fix intact. Timeout
raised 5s→15s to match the other vector paths (nearest_prior_distance/distant_pairs).

Tests: the structural guard is upgraded to a #168+#172 RED→GREEN check — asserts
the target is a bound LIST param (not a `%Pgvector{}`), exactly one `articles`
table reference (no self-join driving a full scan), and an indexable `ORDER BY
... <=>` shape. A literal 76k-row seed is infeasible in an async unit test, so the
indexable SQL shape is the durable proxy. All behavioral ACs (200 + ranked
candidates, excludes self/already-linked, honors threshold/limit) still pass.
…der-limit test

Enhanced-review Round 1 (team) findings:

- F1 (BA high / arch low): the bound-param HNSW form fixes the 500, but with the
  default single ef_search batch the already-linked anti-join could starve the
  result below `limit` — links cluster among an article's nearest neighbors, so the
  exclusion removes exactly the candidates HNSW surfaces first. Resolve by running
  the suggestion query under `SET LOCAL hnsw.iterative_scan = 'strict_order'`
  (pgvector >= 0.8, installed: 0.8.2) in a transaction: pgvector keeps pulling
  neighbors until `limit` survive the post-filters, bounded by `max_scan_tuples`
  (never degrades to the #172 full-corpus scan). `strict_order` preserves the
  "most-similar first" contract. SET LOCAL is transaction-scoped like the RLS GUC;
  pgvector reconciles it when the query's `<=>` ops load the extension (verified).

- F2 (BA medium): the structural guard doesn't exercise exclusion-under-limit. Added
  a behavioral test — six equally-similar candidates, four pre-linked, request
  limit:2, assert exactly the two unlinked come back (none of the linked leak,
  result not starved).

- F3 (BA low / engineer): the #150 describe-block comment still described the removed
  self-join. Rewritten to the bound-param HNSW form. Added a moduledoc note that
  ranking is approximate-NN over the HNSW index with iterative scan.

All other team findings (list-param encoding, 15s timeout, tenant/RLS scoping,
planner, to_embedding_list coverage, test RED->GREEN) verified PASS — no change.

Gate: compile --warnings-as-errors, format, credo --strict, dialyzer, 2531 tests — all green.
…n, add EXPLAIN guard

Enhanced-review Round 2 (adversarial, 4 agents). Three independent reviewers
(security, architect, skeptical-BA) converged on a single HIGH finding: the
Round-1 `AdminRepo.transaction` + `SET LOCAL hnsw.iterative_scan` wrapper was a
self-inflicted regression that is WORSE than the edge-case under-fill it fixed:

  * Pool starvation (HIGH): the prod AdminRepo pool is only 3 (`ADMIN_POOL_SIZE`),
    shared by every cross-tenant read. Holding a connection in a 15s+ transaction
    on the hottest knowledge endpoint risks starving all other admin reads under
    concurrent agent load — the EXACT pattern the `distant_pairs` note in this
    module (knowledge.ex:3253-3254) was written to avoid ("avoiding transaction
    hold ... prevents AdminRepo pool starvation ... (3 agents, pool_size: 3)").
  * Sparse-article blow-up (MED-HIGH): `strict_order` iterative scan walks up to
    `max_scan_tuples` (20k) every call for an article whose neighbors are mostly
    below-threshold/already-linked — the COMMON case — turning the fast path slow.
  * Version-skew / new 500 surface from the per-request `SET LOCAL`.

Resolution: REVERT to the bare `AdminRepo.all` HNSW query — the form `search_semantic`
and `distant_pairs` already ship in prod. This fixes the #172 prod 500 (the actual
bug) with ZERO added risk. The Round-1 under-fill it targeted is by-design and
identical to the production-proven `search_semantic` (architect-confirmed LOW in
BOTH review rounds; negligible at the default limit 5 vs ef_search 40). The recall
ceiling is now documented honestly on `suggest_links/3` rather than papered over.

Also from the adversarial round:
  * Added an EXPLAIN index-eligibility test (skeptical-BA's ASK-3): with seq-scan and
    sort disabled, the plan MUST satisfy `ORDER BY embedding <=> $const` via
    `articles_embedding_idx` with no Sort node. This guards the broader "no
    full-corpus scan" regression CLASS that the SQL-shape regex can't see (a
    reintroduced self-join column-operand, a wrapped order expr, or a dropped index
    all force a Sort and fail it) — deterministic at any row count.
  * Renamed the exclusion-under-limit test to what it actually guards (linked
    near-neighbors are excluded and the result fills from the unlinked remainder),
    dropping the iterative-scan framing the cynical-engineer flagged as tautological.

Disproven adversarial attacks (documented, not fixed): MatchError on the transaction
result (raises propagate; only explicit rollback yields {:error} — verified
empirically); version-skew "every call 500s" (Postgres accepts two-part GUC names
as placeholders → graceful no-op, not an error). Both moot after the revert.

Prod-verification items #172 lists (capture the live exception, hit the 5 repro ids on
the deployed build, prod EXPLAIN) require production access and are the post-deploy
gate before marking the issue fully verified — called out in the PR/issue, not deferred.

Gate: compile --warnings-as-errors, format, credo --strict, dialyzer, 2532 tests — all green.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Verify follow-up (#168/#170): suggested_links STILL 500s in prod — unbounded full-corpus scan (scale), not param interpolation

1 participant