
Epic: performance & scale hardening for the knowledge subsystem (reproducible · observable · index-correct at prod scale) #175

Description

@mkreyman

Why this epic exists

suggested_links (#168, #170, #172, #173) took four attempts to fix. The first three fixes each went through an Opus team + adversarial enhanced review, passed CI, and were reported "resolved end-to-end", yet kept failing in production (intermittent 15–30s cosine scans → 30s statement-timeout → 500). It is now genuinely fixed (verified live: 10/10 calls, ~0.2s).

The repeated misses are not a review-quality problem — the reviews were thorough. They're a system problem: loopctl cannot reproduce or observe its own behavior at production scale, so fixes are validated against a non-representative environment and ship "green" but broken. This epic addresses the class of issue, not the next instance. Each theme cites incidents from a real multi-day session against the live KB (~76k published articles).


Theme 1 — Scale-representative testing + observability ⟵ do this FIRST

Problem. Tests run against tiny DBs that can't reproduce prod-scale planner decisions or timeouts; DB errors surface as generic 500s that hide the real SQLSTATE; prod logs/Sentry weren't reachable from the fixing session → fixes made blind.
Evidence. suggest_links 4 attempts; the EXPLAIN enable_seqscan=off guard proved index eligibility on a join-free shape, not that the actual (joined) query uses the index; "couldn't reproduce in the vector(1536) test DB"; "can't reach prod Sentry, so the literal 57014 trace isn't attached."
Direction.

  • A seeded large-corpus fixture/staging gate (~50–100k articles w/ embeddings + links) so heavy endpoints are load- and plan-testable on representative data.
  • Plan assertions on the real query shape (joins/filters included) — fail if it Seqscans/Sorts the corpus, without forcing enable_seqscan=off.
  • Structured DB-error surfacing: map SQLSTATE (57014 timeout, 22000, …) to a safe, logged error (not a blanket 500); slow-query logging (query + duration); a documented path to the real prod exception.
  • Explicit statement_timeout per heavy endpoint with a clear fast-fail.
Acceptance. Representative-scale corpus available to CI/staging; vector + enumeration endpoints have a test that fails on a full-scan regression at that scale; DB-error 500s carry a logged structured cause; a runbook exists for retrieving the real prod error.
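
As a rough illustration of the structured DB-error surfacing item, the sketch below maps a SQLSTATE to a safe client-facing error and logs the real cause with query and duration. Only the codes come from this issue (57014 is PostgreSQL's query_canceled, which a statement_timeout raises; 22000 is the data_exception class). The HTTP statuses, the `SafeError` type, and `map_db_error` are hypothetical names chosen for the example, not loopctl code.

```python
import logging
from dataclasses import dataclass
from typing import Optional

logger = logging.getLogger("db_errors")


@dataclass(frozen=True)
class SafeError:
    http_status: int
    code: str      # stable machine-readable code returned to the client
    message: str   # safe text: no SQL, no parameters


# Exact SQLSTATE matches. 57014 = query_canceled (raised on statement_timeout).
EXACT = {
    "57014": SafeError(504, "query_timeout", "The query exceeded its time budget."),
}
# Class-level matches on the first two characters. Class 22 = data_exception.
# Treating it as a 400 is an assumption made for this sketch.
BY_CLASS = {
    "22": SafeError(400, "invalid_input", "The request contained invalid data."),
}
FALLBACK = SafeError(500, "internal_error", "Unexpected database error.")


def map_db_error(sqlstate: Optional[str], query: str, duration_ms: float) -> SafeError:
    """Log the real cause, return only a safe error for the HTTP layer."""
    state = sqlstate or ""
    safe = EXACT.get(state) or BY_CLASS.get(state[:2]) or FALLBACK
    logger.error(
        "db_error sqlstate=%s code=%s duration_ms=%.0f query=%s",
        state or "unknown", safe.code, duration_ms, query,
    )
    return safe
```

With this shape a statement timeout surfaces as a distinct `query_timeout` error while the log line keeps the literal SQLSTATE, so the trace no longer depends on reaching prod Sentry.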

Theme 2 — Index-correct vector-query layer

Problem. Every embedding endpoint hand-rolls its cosine query; some defeat the HNSW index, some bound the scan, some don't — no correct-by-construction path.
Evidence. suggest_links' LEFT JOIN article_links defeated the HNSW index → 15–30s full scan; distant_pairs bounds via max_pair_candidates() + timeout; search_semantic is the proven index shape; the eventual fix was "match search_semantic + filter already-linked in app, not in the index-ordered query."
Direction. One shared kNN helper: index-backed ORDER BY embedding <=> $vec LIMIT k over published-embedded, no joins in the index-ordered query; post-filters (self/linked/tag) applied after the index fetch (over-fetch + app filter). Route search, suggest_links, pairs, novelty, and the auto-link worker through it. Guard: any cosine ORDER BY must be index-eligible on its real shape.
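
A minimal sketch of the over-fetch + app-filter shape described above, under stated assumptions: the table and column names (`articles`, `status`, `embedding`), the `%(name)s` placeholder style, the `fetch` callable, and the `overfetch` factor are all hypothetical. The point it shows is that the index-ordered query carries no joins and exclusions happen only after the fetch.

```python
from typing import Callable, Collection, List, Sequence, Tuple

Row = Tuple[int, float]  # (article id, cosine distance)

# Join-free, index-ordered query. Names are illustrative only.
KNN_SQL = """
SELECT id, embedding <=> %(vec)s AS distance
FROM articles
WHERE status = 'published' AND embedding IS NOT NULL
ORDER BY embedding <=> %(vec)s
LIMIT %(k)s
"""


def knn(
    fetch: Callable[[str, dict], Sequence[Row]],
    vec: Sequence[float],
    k: int,
    exclude_ids: Collection[int] = frozenset(),
    overfetch: int = 3,
) -> List[Row]:
    """Fetch k * overfetch neighbours via the index, then filter in the app.

    `fetch` runs the SQL and returns rows already ordered by distance.
    Self, already-linked, or tag exclusions are passed as `exclude_ids`.
    """
    rows = fetch(KNN_SQL, {"vec": list(vec), "k": k * overfetch})
    kept = [row for row in rows if row[0] not in exclude_ids]
    return kept[:k]
```

If many candidates are excluded the helper can return fewer than `k` rows; a caller that needs exactly `k` could retry with a larger `overfetch` rather than push the filter back into the ordered query.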
Acceptance. All vector endpoints return <2s at prod scale with a per-endpoint index-usage test; new vector endpoints reuse the helper (no new hand-rolled scans).
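
One way the per-endpoint index-usage test could look, as a sketch: run `EXPLAIN (FORMAT JSON)` on the real query shape against the representative corpus, parse the result, and walk the plan tree for a Seq Scan on a corpus table or a large Sort. The node keys (`Node Type`, `Relation Name`, `Plan Rows`, `Plans`) are PostgreSQL's JSON plan fields; the function names, the `corpus_tables` argument, and the `max_sort_rows` threshold are assumptions for illustration.

```python
from typing import Iterator, List, Set


def plan_nodes(node: dict) -> Iterator[dict]:
    """Yield a plan node and all of its descendants, parents first."""
    yield node
    for child in node.get("Plans", []):
        yield from plan_nodes(child)


def full_scan_violations(
    explain_json: list,
    corpus_tables: Set[str],
    max_sort_rows: int = 1000,
) -> List[str]:
    """Return human-readable reasons the plan looks like a full-corpus scan.

    `explain_json` is the parsed output of EXPLAIN (FORMAT JSON): a list
    whose first element has a top-level "Plan" key.
    """
    violations = []
    for node in plan_nodes(explain_json[0]["Plan"]):
        kind = node.get("Node Type")
        if kind == "Seq Scan" and node.get("Relation Name") in corpus_tables:
            violations.append(f"Seq Scan on {node['Relation Name']}")
        elif kind == "Sort" and node.get("Plan Rows", 0) > max_sort_rows:
            violations.append(f"Sort over ~{node['Plan Rows']} rows")
    return violations
```

A test would assert that the list is empty for each vector endpoint's actual query, joins and filters included, without setting enable_seqscan=off, so the planner's real choice at scale is what gets checked.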

Theme 3 — Cursor/keyset pagination for enumeration

Problem. Offset pagination drifts under concurrent writes and the list endpoint historically clamped/truncated; full-body responses broke streaming.
Evidence. #148 silent limit→100 clamp; same query returned 9,881 then 4,981 rows mid-write; ChunkedEncodingError on large full-body pages; body-less default (#166) helped but offset drift remains.
Direction. Keyset/cursor pagination (next_cursor) for list/enumeration, stable under writes; keep body-less default + bounded opt-in include_body.
Acceptance. Enumerate a tag to exhaustion under concurrent writes via cursor returns a stable, complete set; no offset drift; meta documents the cursor contract.
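
To make the cursor contract concrete, here is a small keyset-pagination sketch using an in-memory SQLite table in place of the real store. The schema, the `list_page` function, and the base64-JSON cursor encoding are assumptions for the example; only `next_cursor` and the body-less default come from this issue. The page boundary is "id greater than the last id seen", so rows deleted or inserted elsewhere do not shift later pages the way an offset does.

```python
import base64
import json
import sqlite3
from typing import Optional


def encode_cursor(last_id: int) -> str:
    """Opaque cursor: clients pass it back verbatim."""
    raw = json.dumps({"after_id": last_id}).encode()
    return base64.urlsafe_b64encode(raw).decode()


def decode_cursor(cursor: str) -> int:
    return json.loads(base64.urlsafe_b64decode(cursor.encode()))["after_id"]


def list_page(conn: sqlite3.Connection, tag: str, limit: int,
              cursor: Optional[str] = None) -> dict:
    """Return one body-less page ordered by id, plus a next_cursor if more remain."""
    after_id = decode_cursor(cursor) if cursor else 0
    rows = conn.execute(
        "SELECT id, title FROM articles WHERE tag = ? AND id > ? "
        "ORDER BY id LIMIT ?",
        (tag, after_id, limit + 1),  # fetch one extra row to detect a next page
    ).fetchall()
    page = rows[:limit]
    has_more = len(rows) > limit
    next_cursor = encode_cursor(page[-1][0]) if has_more else None
    return {"data": page, "meta": {"next_cursor": next_cursor}}
```

A client loops until `meta.next_cursor` is null. If a row from an earlier page is deleted mid-enumeration, the next page still starts strictly after the last id returned, so nothing is skipped or repeated.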

Theme 4 — Connection-pool sizing + set-based bulk mutations

Problem. A 3-connection admin pool starves under heavy/bulk reads; cascade mutations are O(n) round-trips.
Evidence. The transaction-based first fix for #172 caused pool starvation on the 3-conn admin pool (caught in adversarial review — the anti-pattern an existing distant_pairs comment warns about); a KB cleanup this session issued ~4,000 individual DELETEs because there's no set-based bulk-by-tag.
Direction. Size the pool / add a dedicated heavy-read pool; add set-based bulk archive/delete/unpublish by ids and by tag (one statement), incl. cascade-by-source.
Acceptance. Archiving a whole source/tag is one bounded statement (not N round-trips); heavy analytical reads don't contend on a tiny pool.
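
As a sketch of what "one bounded statement" could mean for bulk-by-tag, the example below archives every article with a given tag in a single UPDATE capped by a row limit, again using in-memory SQLite as a stand-in. The schema (`articles.tag`, `articles.status`), the `archive_by_tag` name, and the `max_rows` cap are hypothetical. The same `UPDATE ... WHERE id IN (SELECT ... LIMIT n)` shape is also valid in PostgreSQL.

```python
import sqlite3


def archive_by_tag(conn: sqlite3.Connection, tag: str, max_rows: int = 10_000) -> int:
    """Archive up to max_rows articles carrying `tag` in one statement.

    Returns the number of rows changed, so a caller can repeat the call
    while the result equals max_rows.
    """
    cur = conn.execute(
        """
        UPDATE articles
        SET status = 'archived'
        WHERE id IN (
            SELECT id FROM articles
            WHERE tag = ? AND status <> 'archived'
            ORDER BY id
            LIMIT ?
        )
        """,
        (tag, max_rows),
    )
    conn.commit()
    return cur.rowcount
```

Compared with the ~4,000 individual DELETEs cited above, this is one round-trip on one connection, which also keeps pressure off a small pool.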


Sequencing

  1. Theme 1 first: without reproduce-at-scale + see-the-real-error, every perf fix is a blind guess (that's how #172 missed three times).
  2. Theme 2 — the biggest perf class; fixes all vector endpoints at once.
  3. Themes 3 & 4 — enumeration scale + write/pool efficiency.

Workflow

Each theme as its own branch/PR off master, full enhanced review (Opus team + adversarial, every claim independently verified), all findings fixed in-PR (no deferrals), green CI, and — per Theme 1 — verified against representative scale + confirmed on the live deployed build, not just the test DB. Exhibit A (#172) is closed; tie any residual vector-endpoint work to Theme 2.

Metadata

Assignees

No one assigned

    Labels

    enhancement (New feature or request), epic (Multi-PR architectural track)

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests
