Skip to content

Fix #168: suggested_links 500 — column-to-column cosine via self-join#170

Merged
mkreyman merged 3 commits into
masterfrom
fix/168-suggest-links-500
Jun 24, 2026
Merged

Fix #168: suggested_links 500 — column-to-column cosine via self-join#170
mkreyman merged 3 commits into
masterfrom
fix/168-suggest-links-500

Conversation

@mkreyman

Copy link
Copy Markdown
Owner

Summary

Closes #168. GET /api/v1/knowledge/articles/:id/suggested_links returned 500 on every call in production. suggest_links was the only embedding endpoint that fetched the target's stored vector and round-tripped it back into the candidate query as a ^param with a ?::vector cast. Every endpoint that works in production keeps the vector math column-based — distant_pairs does a.embedding <=> b.embedding (column-to-column) and works on the same corpus.

Fix

suggestion_candidates now references the target's vector via a self-join and computes a.embedding <=> t.embedding — column-to-column, no parameter, no ::vector cast, the exact pattern distant_pairs uses. Since distant_pairs returns 200 on this corpus (same <=> over the same embeddings), suggested_links now does too. fetch_article_embedding is kept only to distinguish not_found (404) / no-embedding ({:ok, []}) / has-embedding before ranking.

Why the prior tests didn't catch it

The unit/context tests passed while the HTTP path 500'd in prod (the local pgvector/Postgrex build happened to encode the re-interpolated vector; prod's didn't). Added an end-to-end HTTP regression test that seeds embedded articles, calls the endpoint, and asserts 200 + the documented candidate shape ({id,title,category,similarity_score}).

ACs re-validated (#150)

  • Returns 200 with ranked candidates for an embedded article (was 500).
  • Excludes the article itself + already-linked articles; honors threshold/limit.
  • End-to-end integration test (controller → query → JSON) asserts 200 + shape.

Full suite green (2526), mix precommit + mix dialyzer clean.

mkreyman added 3 commits June 23, 2026 23:29
… re-interpolated vector

GET /api/v1/knowledge/articles/:id/suggested_links returned 500 for every
embedded article in production. suggest_links was the ONLY embedding endpoint
that fetched the target article's stored vector and round-tripped it back into
the candidate query as a `^param` with a `?::vector` cast. Every working endpoint
keeps the vector math column-based: distant_pairs uses `a.embedding <=> b.embedding`
(column-to-column) and works in production; novelty/semantic-search pass a plain
float list.

Fix: suggestion_candidates now references the target's vector via a self-join
(`join: t in Article, on: t.id == ^article_id`) and computes
`a.embedding <=> t.embedding` — column-to-column, no parameter, no `::vector`
cast. This is the exact pattern distant_pairs uses, so on any corpus where pairs
returns 200 (same `<=>` over the same embeddings), suggested_links now does too.
fetch_article_embedding is kept solely to distinguish not_found (404) /
no-embedding ({:ok, []}) / has-embedding before running the ranked query.

Verified ACs (#150 re-validation):
- Returns 200 with ranked [{id,title,category,similarity_score}] for an
  embedded article (was 500).
- Excludes the article itself + already-linked articles; honors threshold/limit.
- New end-to-end HTTP regression test seeds embedded articles and asserts 200 +
  the documented candidate shape (the prior unit/context tests passed while the
  HTTP path 500'd in prod — this guards the controller -> query -> JSON path).

Full suite green (2526), precommit + dialyzer clean.
…NSW note

Team review (engineer/architect/security) on PR #170:
- Security: CLEAN (tenant isolation + #163 visibility preserved; target 404-gated;
  candidate filter binds the right alias; single-row target join = O(n), not O(n²)).
- Architect: HIGH CONFIDENCE the fix resolves the prod 500 (column-to-column `<=>`
  is the exact prod-proven `distant_pairs` pattern; vector(1536) column rules out a
  dimension cause). One LOW: the column form bypasses the HNSW ANN index.
- Engineer (MEDIUM): the functional regression test passes on BOTH buggy and fixed
  code (the column is vector(1536) and PostgrexTypes is shared across envs, so the
  stored-vector re-interpolation encodes fine locally) — so it didn't prove RED→GREEN.

Fixes:
- Extracted `suggestion_candidates_query/5` (`@doc false`, returns the Ecto query) so
  a STRUCTURAL test can assert the query shape. New test asserts (a) no query
  parameter is a vector/list — the target vector is no longer bound as a param, and
  (b) the cosine is between two `embedding` COLUMNS. Both assertions are RED on the
  pre-fix query and GREEN on the fix — proven via `Ecto.Adapters.SQL.to_sql`
  (pre-fix: vector-param present; fix: none, 7 embedding-column refs). This is the
  locally-reproducible regression guard the prod-only 500 prevented.
- Documented the HNSW tradeoff (column-to-column intentionally bypasses
  `articles_embedding_idx`; negligible at MVP scale, matches shipped `distant_pairs`).

Full suite green (2527), precommit + dialyzer clean.
@mkreyman mkreyman merged commit 6529edc into master Jun 24, 2026
6 checks passed
@mkreyman mkreyman deleted the fix/168-suggest-links-500 branch June 24, 2026 05:58
mkreyman added a commit that referenced this pull request Jun 24, 2026
…s self-join scan (#173)

* Fix #172: suggested_links 500 — HNSW approximate-NN instead of full-corpus scan

#170 correctly removed the stored-`%Pgvector{}` re-interpolation (the #168 500)
but rewrote to a column-to-column self-join, which reads the target vector as a
COLUMN (`a.embedding <=> t.embedding`). A column right-operand defeats the HNSW
index (`articles_embedding_idx`, vector_cosine_ops), forcing a per-row cosine +
full `ORDER BY ... <=>` sort over EVERY published-embedded article. Negligible on
a handful of test rows; at prod scale (~76k articles) it exceeds the statement
timeout → generic 500 on 100% of `GET /knowledge/articles/:id/suggested_links`.

Fix: pass the already-fetched target embedding as a BOUND `^param` list of floats
(via `to_embedding_list/1`), the exact value shape `search_semantic` ships in
production. `ORDER BY embedding <=> $const LIMIT k` is the HNSW-indexable form, so
the planner does an approximate-NN index scan rather than a full-corpus sort. The
list (not the stored `%Pgvector{}` struct) keeps the #168 fix intact. Timeout
raised 5s→15s to match the other vector paths (nearest_prior_distance/distant_pairs).

Tests: the structural guard is upgraded to a #168+#172 RED→GREEN check — asserts
the target is a bound LIST param (not a `%Pgvector{}`), exactly one `articles`
table reference (no self-join driving a full scan), and an indexable `ORDER BY
... <=>` shape. A literal 76k-row seed is infeasible in an async unit test, so the
indexable SQL shape is the durable proxy. All behavioral ACs (200 + ranked
candidates, excludes self/already-linked, honors threshold/limit) still pass.

* Address #172 team-review findings: iterative HNSW scan + exclusion-under-limit test

Enhanced-review Round 1 (team) findings:

- F1 (BA high / arch low): the bound-param HNSW form fixes the 500, but with the
  default single ef_search batch the already-linked anti-join could starve the
  result below `limit` — links cluster among an article's nearest neighbors, so the
  exclusion removes exactly the candidates HNSW surfaces first. Resolve by running
  the suggestion query under `SET LOCAL hnsw.iterative_scan = 'strict_order'`
  (pgvector >= 0.8, installed: 0.8.2) in a transaction: pgvector keeps pulling
  neighbors until `limit` survive the post-filters, bounded by `max_scan_tuples`
  (never degrades to the #172 full-corpus scan). `strict_order` preserves the
  "most-similar first" contract. SET LOCAL is transaction-scoped like the RLS GUC;
  pgvector reconciles it when the query's `<=>` ops load the extension (verified).

- F2 (BA medium): the structural guard doesn't exercise exclusion-under-limit. Added
  a behavioral test — six equally-similar candidates, four pre-linked, request
  limit:2, assert exactly the two unlinked come back (none of the linked leak,
  result not starved).

- F3 (BA low / engineer): the #150 describe-block comment still described the removed
  self-join. Rewritten to the bound-param HNSW form. Added a moduledoc note that
  ranking is approximate-NN over the HNSW index with iterative scan.

All other team findings (list-param encoding, 15s timeout, tenant/RLS scoping,
planner, to_embedding_list coverage, test RED->GREEN) verified PASS — no change.

Gate: compile --warnings-as-errors, format, credo --strict, dialyzer, 2531 tests — all green.

* Address #172 adversarial findings: drop the iterative-scan transaction, add EXPLAIN guard

Enhanced-review Round 2 (adversarial, 4 agents). Three independent reviewers
(security, architect, skeptical-BA) converged on a single HIGH finding: the
Round-1 `AdminRepo.transaction` + `SET LOCAL hnsw.iterative_scan` wrapper was a
self-inflicted regression that is WORSE than the edge-case under-fill it fixed:

  * Pool starvation (HIGH): the prod AdminRepo pool is only 3 (`ADMIN_POOL_SIZE`),
    shared by every cross-tenant read. Holding a connection in a 15s+ transaction
    on the hottest knowledge endpoint risks starving all other admin reads under
    concurrent agent load — the EXACT pattern the `distant_pairs` note in this
    module (knowledge.ex:3253-3254) was written to avoid ("avoiding transaction
    hold ... prevents AdminRepo pool starvation ... (3 agents, pool_size: 3)").
  * Sparse-article blow-up (MED-HIGH): `strict_order` iterative scan walks up to
    `max_scan_tuples` (20k) every call for an article whose neighbors are mostly
    below-threshold/already-linked — the COMMON case — turning the fast path slow.
  * Version-skew / new 500 surface from the per-request `SET LOCAL`.

Resolution: REVERT to the bare `AdminRepo.all` HNSW query — the form `search_semantic`
and `distant_pairs` already ship in prod. This fixes the #172 prod 500 (the actual
bug) with ZERO added risk. The Round-1 under-fill it targeted is by-design and
identical to the production-proven `search_semantic` (architect-confirmed LOW in
BOTH review rounds; negligible at the default limit 5 vs ef_search 40). The recall
ceiling is now documented honestly on `suggest_links/3` rather than papered over.

Also from the adversarial round:
  * Added an EXPLAIN index-eligibility test (skeptical-BA's ASK-3): with seq-scan and
    sort disabled, the plan MUST satisfy `ORDER BY embedding <=> $const` via
    `articles_embedding_idx` with no Sort node. This guards the broader "no
    full-corpus scan" regression CLASS that the SQL-shape regex can't see (a
    reintroduced self-join column-operand, a wrapped order expr, or a dropped index
    all force a Sort and fail it) — deterministic at any row count.
  * Renamed the exclusion-under-limit test to what it actually guards (linked
    near-neighbors are excluded and the result fills from the unlinked remainder),
    dropping the iterative-scan framing the cynical-engineer flagged as tautological.

Disproven adversarial attacks (documented, not fixed): MatchError on the transaction
result (raises propagate; only explicit rollback yields {:error} — verified
empirically); version-skew "every call 500s" (Postgres accepts two-part GUC names
as placeholders → graceful no-op, not an error). Both moot after the revert.

Prod-verification items #172 lists (capture the live exception, hit the 5 repro ids on
the deployed build, prod EXPLAIN) require production access and are the post-deploy
gate before marking the issue fully verified — called out in the PR/issue, not deferred.

Gate: compile --warnings-as-errors, format, credo --strict, dialyzer, 2532 tests — all green.
mkreyman added a commit that referenced this pull request Jun 24, 2026
…SW index at scale) (#174)

* Fix #172 for real: subquery filter-after-ANN so the planner uses the HNSW index

The previously-merged #172 fix (089d592) was deployed and STILL 500'd in prod —
verified live in prod logs: ERROR 57014 (query_canceled) "canceling statement" on
suggest_links/3 (knowledge.ex:2844), well after the deploy. EXPLAIN against the real
~76k-row corpus showed a Seq Scan on articles + Sort (cost ~57k): the HNSW index was
NOT chosen. The prior fix proved index ELIGIBILITY (a forced EXPLAIN with sort
disabled) but never that the planner CHOOSES the index at scale — the skeptical-BA
flagged exactly this gap and I closed the issue without prod verification. Owned.

Root cause (EXPLAIN-verified on prod): pgvector's HNSW index only accelerates a PURE
`ORDER BY embedding <=> $const LIMIT k`. The already-linked anti-join (a JOIN) and the
`GREATEST(0,1-(embedding<=>$t)) > threshold` distance filter, in the SAME query, make
the planner abandon the index and Seq-Scan the whole corpus. (search_semantic has
neither, which is why it stays fast.)

Fix: split into a subquery. The INNER query is the pure top-`pool` nearest by cosine
(only index-safe filters: tenant/status/not-null/not-self/visibility — each verified on
prod to keep the index). The OUTER query applies the anti-join + threshold + final
limit over just `pool` rows. Verified on the prod corpus: inner uses
`articles_embedding_hnsw_idx`, total EXPLAIN ANALYZE 73ms (was a statement timeout),
no Seq Scan on articles — confirmed WITH the agent visibility filter (the actual
failing path).

`pool = max(limit*5, 100)` capped at 500; effective recall additionally bounded by
hnsw.ef_search (~40, observed) — consistent with search_semantic; ample for the
default limit 5. Documented.

Test: the EXPLAIN guard now asserts the corpus is reached via the HNSW index with NO
`Seq Scan on articles` (an outer sort over the small pool is expected) — the assertion
that would have caught both #170 and the first #172 attempt. Structural + behavioral
tests unchanged and green.

Note: prod's index is named `articles_embedding_hnsw_idx`, but the migration
(20260410022906) creates `articles_embedding_idx` — migration drift (prod index
created out-of-band). Both are HNSW cosine; the query uses whichever exists. Flagged
for the operator; not addressed here.

Gate: compile --warnings-as-errors, format, credo --strict, dialyzer, 2532 tests — all green.

* Address #172 focused-review findings (recall doc + pool floor + EXPLAIN strengthen)

Round-2 focused adversarial review (security + engineer, Opus). Reshape confirmed
sound on tenant isolation, visibility, anti-join, ordering, DoS (all disproven).
Findings fixed:

- A (security, MEDIUM): the anti-join/threshold now run AFTER the candidate-pool cut,
  so a densely-linked "hub" can return fewer than `limit` (or none) vs master's
  full-corpus exclusion. The reviewer's proposed alternative — move the anti-join back
  inside the index scan — is PROD-DISPROVEN: EXPLAIN of (anti-join, no threshold) on the
  real corpus is a Parallel Seq Scan + Sort (cost 105k), i.e. it reintroduces the 500.
  So the recall ceiling is intrinsic to the indexed path. Documented the exact mechanism
  and consequence on `suggest_links/3` and the controller OpenAPI description.

- engineer: `suggestion_candidate_pool/1` could cap the pool below `limit` if an operator
  misconfigures `:max_suggestion_candidate_pool`. Floored at `limit` so the pool can
  never truncate below the requested count before exclusions run.

- B (LOW): strengthened the EXPLAIN guard to require an actual `Index Scan using
  articles_embedding(_hnsw)?_idx` (not just the index name appearing), tolerant of the
  test-vs-prod index-name drift.

- C (LOW): no audit log on the BYPASSRLS read path — pre-existing, identical on master,
  and a module-wide concern (every AdminRepo read); out of scope for this hotfix.

Gate: compile --warnings-as-errors, format, credo --strict, dialyzer, 2532 tests — all green.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Verify follow-up (#150): suggested_links returns 500 on every call — endpoint non-functional

1 participant