Why this epic exists
suggested_links (#168 → #170 → #172 → #173) took four attempts to fix. Three of those attempts passed the enhanced review (Opus team + adversarial), were CI-green, and were reported "resolved end-to-end", yet kept failing in production: intermittent 15–30s cosine scans ran into the 30s statement timeout and surfaced as 500s. It is now genuinely fixed (verified live: 10/10 calls, ~0.2s).
The repeated misses are not a review-quality problem; the reviews were thorough. They are a system problem: loopctl cannot reproduce or observe its own behavior at production scale, so fixes are validated against a non-representative environment and ship "green" but broken. This epic addresses the class of issue, not the next instance. Each theme cites incidents from a real multi-day session against the live KB (~76k published articles).
Theme 1 — Scale-representative testing + observability ⟵ do this FIRST
Problem. Tests run against tiny DBs that can't reproduce prod-scale planner decisions or timeouts; DB errors surface as generic 500s that hide the real SQLSTATE; and prod logs/Sentry weren't reachable from the fixing session, so fixes were made blind.
Evidence. suggest_links took four attempts; the EXPLAIN guard run with enable_seqscan=off proved index eligibility on a join-free shape, not that the actual (joined) query uses the index; "couldn't reproduce in the vector(1536) test DB"; "can't reach prod Sentry, so the literal 57014 trace isn't attached."
Direction.
- A seeded large-corpus fixture/staging gate (~50–100k articles with embeddings + links) so heavy endpoints are load- and plan-testable on representative data.
- Plan assertions on the real query shape (joins/filters included): fail if the plan seq-scans or sorts the corpus, without forcing enable_seqscan=off.
- Structured DB-error surfacing: map SQLSTATE (57014 timeout, 22000, …) to a safe, logged error (not a blanket 500); slow-query logging (query + duration); a documented path to the real prod exception.
- Explicit statement_timeout per heavy endpoint with a clear fast-fail.
Acceptance. Representative-scale corpus available to CI/staging; vector + enumeration endpoints have a test that fails on a full-scan regression at that scale; DB-error 500s carry a logged structured cause; a runbook exists for retrieving the real prod error.
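The plan-assertion idea above can be sketched as a small check over the output of PostgreSQL's EXPLAIN (FORMAT JSON), run against the real query shape with default planner settings. This is a minimal sketch, not loopctl's implementation: the function name plan_violations, the table name articles, the index name, and both sample plans are hypothetical and only illustrate the two plan shapes described in the evidence.

```python
# Sketch: flag plans that sort, or that sequentially scan the corpus table.
# Input is the parsed result of EXPLAIN (FORMAT JSON) <real query>, which is a
# one-element list whose "Plan" key holds the root node of the plan tree.


def walk_plan(node):
    """Yield every node of an EXPLAIN (FORMAT JSON) plan tree, parents first."""
    yield node
    for child in node.get("Plans", []):
        yield from walk_plan(child)


def plan_violations(explain_json, corpus_table):
    """Return a list of reasons the plan is not index-backed (empty if fine)."""
    violations = []
    for node in walk_plan(explain_json[0]["Plan"]):
        node_type = node["Node Type"]
        if node_type == "Sort":
            violations.append("Sort")
        elif node_type == "Seq Scan" and node.get("Relation Name") == corpus_table:
            violations.append(f"Seq Scan on {corpus_table}")
    return violations


# Hypothetical plan for a join-free, index-ordered kNN query.
GOOD_PLAN = [{"Plan": {"Node Type": "Limit", "Plans": [
    {"Node Type": "Index Scan", "Relation Name": "articles",
     "Index Name": "articles_embedding_hnsw_idx"},
]}}]

# Hypothetical plan for a joined query that falls back to scan + sort.
BAD_PLAN = [{"Plan": {"Node Type": "Limit", "Plans": [
    {"Node Type": "Sort", "Plans": [
        {"Node Type": "Hash Join", "Join Type": "Left", "Plans": [
            {"Node Type": "Seq Scan", "Relation Name": "articles"},
            {"Node Type": "Hash", "Plans": [
                {"Node Type": "Seq Scan", "Relation Name": "article_links"},
            ]},
        ]},
    ]},
]}}]
```

A test at representative scale would run EXPLAIN on the endpoint's actual query, parse the JSON, and fail when plan_violations returns anything. Because the check inspects the plan the planner really chose, it does not need enable_seqscan=off.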
Theme 2 — Index-correct vector-query layer
Problem. Every embedding endpoint hand-rolls its cosine query; some defeat the HNSW index, some bound the scan, some don't. There is no correct-by-construction path.
Evidence. The LEFT JOIN article_links in suggest_links defeated the HNSW index, causing a 15–30s full scan; distant_pairs bounds its scan via max_pair_candidates() plus a timeout; search_semantic is the proven index shape; the eventual fix was "match search_semantic + filter already-linked in app, not in the index-ordered query."
Direction. One shared kNN helper: an index-backed ORDER BY embedding <=> $vec LIMIT k over published, embedded articles, with no joins in the index-ordered query; post-filters (self/linked/tag) are applied after the index fetch (over-fetch + app filter). Route search, suggest_links, pairs, novelty, and the auto-link worker through it. Guard: any cosine ORDER BY must be index-eligible on its real shape.
Acceptance. All vector endpoints return in <2s at prod scale, each with a per-endpoint index-usage test; new vector endpoints reuse the helper (no new hand-rolled scans).
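The over-fetch + app-filter shape described in the Direction could look roughly like the sketch below. Everything named here is an assumption for illustration: the articles table, the status and embedding columns, the psycopg-style named parameters, the run_query callable, and the overfetch factor of 4. The only point it demonstrates is that the index-ordered query stays join-free and the self/linked/tag filtering happens afterwards in application code.

```python
# Sketch of a shared kNN helper: join-free index-ordered fetch, then app filter.

# Hypothetical schema and parameter style. No joins appear here, so the
# ORDER BY ... LIMIT can be served directly by the HNSW index.
KNN_SQL = """
SELECT id, embedding <=> %(vec)s::vector AS distance
FROM articles
WHERE status = 'published' AND embedding IS NOT NULL
ORDER BY embedding <=> %(vec)s::vector
LIMIT %(limit)s
"""


def knn(run_query, vec, k, keep=lambda row: True, overfetch=4):
    """Return up to k nearest rows that pass the keep() post-filter.

    run_query(sql, params) is assumed to return (id, distance) rows already
    ordered by distance. The helper asks the index for k * overfetch rows and
    filters in the application, so callers never add joins to the ordered query.
    """
    rows = run_query(KNN_SQL, {"vec": vec, "limit": k * overfetch})
    return [row for row in rows if keep(row)][:k]


def exclude_ids(ids):
    """Build a post-filter that drops the source article and already-linked ids."""
    blocked = set(ids)
    return lambda row: row[0] not in blocked
```

A caller in the suggest_links style would pass keep=exclude_ids({source_id, *linked_ids}). If heavy filtering leaves fewer than k rows, the result is simply shorter, so callers should not assume exactly k results; raising the overfetch factor trades latency for fill rate.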
Theme 3 — Cursor/keyset pagination for enumeration
Problem. Offset pagination drifts under concurrent writes, and the list endpoint historically clamped/truncated results; full-body responses broke streaming.
Evidence. #148's silent clamp of limit to 100; the same query returned 9,881 then 4,981 rows mid-write; ChunkedEncodingError on large full-body pages; the body-less default (#166) helped, but offset drift remains.
Direction. Keyset/cursor pagination (next_cursor) for list/enumeration, stable under writes; keep the body-less default plus a bounded opt-in include_body.
Acceptance. Enumerating a tag to exhaustion via cursor under concurrent writes returns a stable, complete set; no offset drift; meta documents the cursor contract.
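To make the keyset idea concrete, here is a minimal in-memory sketch. It assumes rows are ordered by a hypothetical (inserted_at, id) key and that next_cursor is an opaque base64-encoded JSON copy of the last key served; neither detail comes from the loopctl codebase. The equivalent SQL predicate would be WHERE (inserted_at, id) > ($1, $2) ORDER BY inserted_at, id LIMIT $3.

```python
import base64
import json


def encode_cursor(key):
    """Serialize the last-seen sort key into an opaque, URL-safe token."""
    return base64.urlsafe_b64encode(json.dumps(list(key)).encode()).decode()


def decode_cursor(cursor):
    """Recover the sort key tuple from a token produced by encode_cursor."""
    return tuple(json.loads(base64.urlsafe_b64decode(cursor.encode())))


def list_page(rows, limit, cursor=None):
    """Return (page, next_cursor) for rows strictly after the cursor's key.

    rows is any iterable of dicts with 'inserted_at' and 'id'. The position is
    a key value rather than an offset, so deletes and inserts that happen
    between calls cannot shift rows across page boundaries.
    """
    def key(row):
        return (row["inserted_at"], row["id"])

    ordered = sorted(rows, key=key)
    if cursor is not None:
        after = decode_cursor(cursor)
        ordered = [row for row in ordered if key(row) > after]
    page = ordered[:limit]
    # A full page may have more rows behind it; a short page ends the listing.
    next_cursor = encode_cursor(key(page[-1])) if len(page) == limit else None
    return page, next_cursor
```

A client loops until next_cursor is None. If a row served on an earlier page is deleted mid-enumeration, offset pagination would skip one row on the next page, whereas the cursor resumes from the last key actually seen.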
Theme 4 — Connection-pool sizing + set-based bulk mutations
Problem. A 3-connection admin pool starves under heavy/bulk reads; cascade mutations are O(n) round-trips.
Evidence. The transaction-based first fix for #172 caused pool starvation on the 3-conn admin pool (caught in adversarial review; it is the anti-pattern an existing distant_pairs comment warns about); a KB cleanup this session issued ~4,000 individual DELETEs because there's no set-based bulk-by-tag.
Direction. Size the pool / add a dedicated heavy-read pool; add set-based bulk archive/delete/unpublish by ids and by tag (one statement), including cascade-by-source.
Acceptance. Archiving a whole source/tag is one bounded statement (not N round-trips); heavy analytical reads don't contend on a tiny pool.
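The difference between N round-trips and one set-based statement can be sketched with Python's built-in sqlite3 module. The schema (articles, article_tags), the status values, and the helper names are hypothetical, and SQLite stands in for the production database purely so the example runs anywhere; the UPDATE ... WHERE id IN (subquery) shape carries over to PostgreSQL.

```python
import sqlite3


def make_demo_db(stale=4000, keep=10):
    """Build an in-memory DB with `stale` + `keep` published, tagged articles."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE articles (id INTEGER PRIMARY KEY, status TEXT NOT NULL);
        CREATE TABLE article_tags (article_id INTEGER NOT NULL, tag TEXT NOT NULL);
        """
    )
    total = stale + keep
    conn.executemany(
        "INSERT INTO articles (id, status) VALUES (?, 'published')",
        [(i,) for i in range(1, total + 1)],
    )
    conn.executemany(
        "INSERT INTO article_tags (article_id, tag) VALUES (?, ?)",
        [(i, "stale" if i <= stale else "keep") for i in range(1, total + 1)],
    )
    conn.commit()
    return conn


def archive_by_tag(conn, tag):
    """Archive every published article carrying `tag` in a single statement.

    Returns the number of rows changed. One statement means one round-trip and
    one short hold on a pooled connection, instead of one per article.
    """
    cur = conn.execute(
        """
        UPDATE articles
        SET status = 'archived'
        WHERE status = 'published'
          AND id IN (SELECT article_id FROM article_tags WHERE tag = ?)
        """,
        (tag,),
    )
    conn.commit()
    return cur.rowcount
```

Calling archive_by_tag(make_demo_db(), "stale") changes all 4,000 tagged rows in one statement. A production version would presumably also bound the statement, for example with a statement timeout or a cap on affected rows, which this sketch leaves out.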
Sequencing
Workflow
Each theme ships as its own branch/PR off master, with full enhanced review (Opus team + adversarial, every claim independently verified), all findings fixed in-PR (no deferrals), green CI, and, per Theme 1, verification against representative scale plus confirmation on the live deployed build, not just the test DB. Exhibit A (#172) is closed; tie any residual vector-endpoint work to Theme 2.