Epic: performance & scale hardening for the knowledge subsystem (reproducible · observable · index-correct at prod scale) #175

## Why this epic exists

`suggested_links` (#168 → #170 → #172 → #173) took **four** attempts to fix. Three of those attempts went through enhanced review (Opus team + adversarial), passed CI, and were reported "resolved end-to-end", yet each kept failing in production: intermittent 15–30s cosine scans hit the 30s statement timeout and returned a 500. It is now genuinely fixed (verified live: 10/10 calls, ~0.2s).

The repeated misses are **not** a review-quality problem — the reviews were thorough. They're a **system** problem: loopctl cannot reproduce or observe its own behavior at production scale, so fixes are validated against a non-representative environment and ship "green" but broken. This epic addresses the **class** of issue, not the next instance. Each theme cites incidents from a real multi-day session against the live KB (~76k published articles).

---

## Theme 1 — Scale-representative testing + observability  ⟵ do this FIRST
**Problem.** Tests run against tiny DBs that can't reproduce prod-scale planner decisions or timeouts; DB errors surface as generic 500s that hide the real SQLSTATE; prod logs/Sentry weren't reachable from the fixing session → fixes made blind.
**Evidence.** suggest_links 4 attempts; the `EXPLAIN enable_seqscan=off` guard proved index *eligibility* on a join-free shape, not that the *actual* (joined) query uses the index; "couldn't reproduce in the vector(1536) test DB"; "can't reach prod Sentry, so the literal 57014 trace isn't attached."
**Direction.**
- A seeded **large-corpus** fixture/staging gate (~50–100k articles w/ embeddings + links) so heavy endpoints are load- and plan-testable on representative data.
- Plan assertions on the **real query shape** (joins/filters included): fail if the plan sequentially scans or sorts the whole corpus, *without* forcing `enable_seqscan=off`.
- **Structured DB-error surfacing**: map SQLSTATE (57014 timeout, 22000, …) to a safe, logged error (not a blanket 500); slow-query logging (query + duration); a documented path to the real prod exception.
- Explicit `statement_timeout` per heavy endpoint with a clear fast-fail.
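
The SQLSTATE mapping above could look roughly like the sketch below. It is a minimal illustration, not loopctl's actual error layer: `SafeDbError`, `classify_db_error`, the logger name, and the HTTP statuses chosen are all assumptions. The only facts relied on are that PostgreSQL reports a `statement_timeout` cancellation as SQLSTATE `57014` and that class `22` covers data exceptions.

```python
import logging
from dataclasses import dataclass

logger = logging.getLogger("kb.db_errors")  # hypothetical logger name


@dataclass(frozen=True)
class SafeDbError:
    """Client-safe view of a database failure (hypothetical type)."""
    http_status: int
    code: str      # stable identifier that is safe to return to clients
    sqlstate: str  # the real five-character SQLSTATE, always logged


def classify_db_error(sqlstate: str, endpoint: str, duration_ms: float) -> SafeDbError:
    """Map a SQLSTATE to a structured error and log the real cause."""
    if sqlstate == "57014":
        # query_canceled: what PostgreSQL raises when statement_timeout fires.
        err = SafeDbError(504, "query_timeout", sqlstate)
    elif sqlstate.startswith("22"):
        # Class 22 (data exception); treating it as a 400 is an assumption.
        err = SafeDbError(400, "invalid_data", sqlstate)
    else:
        # Unknown causes stay a 500, but the SQLSTATE is still recorded.
        err = SafeDbError(500, "database_error", sqlstate)
    logger.error(
        "db_error endpoint=%s sqlstate=%s code=%s duration_ms=%.0f",
        endpoint, sqlstate, err.code, duration_ms,
    )
    return err
```

With something like this in place, a timeout is distinguishable from a generic failure both in the response and in the logs, which is what the runbook would then point at.
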
**Acceptance.** Representative-scale corpus available to CI/staging; vector + enumeration endpoints have a test that fails on a full-scan regression at that scale; DB-error 500s carry a logged structured cause; a runbook exists for retrieving the real prod error.
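
One way to express the full-scan regression test is to run `EXPLAIN (FORMAT JSON)` on the real query against the large corpus and walk the resulting plan tree. The sketch below covers only the tree-walking half; `full_scan_violations`, `corpus_tables`, and the `max_sort_rows` threshold are hypothetical names and values, while `Node Type`, `Relation Name`, `Plan Rows`, and `Plans` are the keys PostgreSQL emits in its JSON plan output.

```python
from typing import Any, Iterator


def walk_plan(node: dict[str, Any]) -> Iterator[dict[str, Any]]:
    """Yield a plan node and all of its descendants, depth-first."""
    yield node
    for child in node.get("Plans", []):
        yield from walk_plan(child)


def full_scan_violations(
    explain_json: list[dict[str, Any]],
    corpus_tables: set[str],
    max_sort_rows: int = 1000,  # assumed threshold for "sorting the corpus"
) -> list[str]:
    """Return a description of every corpus-wide scan or sort in the plan."""
    root = explain_json[0]["Plan"]
    violations: list[str] = []
    for node in walk_plan(root):
        node_type = node.get("Node Type")
        relation = node.get("Relation Name")
        if node_type == "Seq Scan" and relation in corpus_tables:
            violations.append(f"Seq Scan on {relation}")
        elif node_type == "Sort" and node.get("Plan Rows", 0) > max_sort_rows:
            violations.append(f"Sort over ~{node['Plan Rows']} rows")
    return violations
```

A test would then assert that the list is empty for each heavy endpoint's query. Because the check reads the planner's own choice, it only means something on a representative-scale corpus, which is why it depends on the fixture described above.
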

## Theme 2 — Index-correct vector-query layer
**Problem.** Every embedding endpoint hand-rolls its cosine query; some defeat the HNSW index, some bound the scan, some don't — no correct-by-construction path.
**Evidence.** suggest_links' `LEFT JOIN article_links` defeated the HNSW index → 15–30s full scan; `distant_pairs` bounds via `max_pair_candidates()` + `timeout`; `search_semantic` is the proven index shape; the eventual fix was "match search_semantic + filter already-linked in app, not in the index-ordered query."
**Direction.** One shared kNN helper: index-backed `ORDER BY embedding <=> $vec LIMIT k` over published-embedded, **no joins in the index-ordered query**; post-filters (self/linked/tag) applied after the index fetch (over-fetch + app filter). Route search, suggest_links, pairs, novelty, and the auto-link worker through it. Guard: any cosine `ORDER BY` must be index-eligible on its real shape.
**Acceptance.** All vector endpoints return <2s at prod scale with a per-endpoint index-usage test; new vector endpoints reuse the helper (no new hand-rolled scans).
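
The over-fetch + app-filter idea can be sketched independently of the database driver. In the sketch below, `fetch_nearest` stands in for whatever runs the join-free, index-ordered query; it, `keep`, `overfetch`, `max_fetch`, and the table and column names in `KNN_SQL` are all assumptions for illustration, not the existing `search_semantic` code.

```python
from typing import Callable

Row = tuple[int, float]  # (article_id, cosine_distance)

# Assumed shape of the index-ordered query: no joins, ORDER BY the pgvector
# cosine-distance operator, LIMIT bound. Table/column names are hypothetical.
KNN_SQL = """
SELECT id, embedding <=> %(vec)s AS distance
FROM articles
WHERE status = 'published' AND embedding IS NOT NULL
ORDER BY embedding <=> %(vec)s
LIMIT %(limit)s
"""


def knn_with_post_filter(
    fetch_nearest: Callable[[int], list[Row]],  # runs KNN_SQL with the given limit
    k: int,
    keep: Callable[[Row], bool],                # self/linked/tag filters, in app
    overfetch: int = 4,
    max_fetch: int = 1000,
) -> list[Row]:
    """Return up to k nearest rows that pass `keep`, widening the fetch if needed."""
    limit = min(k * overfetch, max_fetch)
    while True:
        rows = fetch_nearest(limit)
        kept = [row for row in rows if keep(row)]
        # Stop when we have enough, the corpus is exhausted, or the cap is hit.
        if len(kept) >= k or len(rows) < limit or limit >= max_fetch:
            return kept[:k]
        limit = min(limit * 2, max_fetch)
```

For suggest_links, `keep` would drop the source article and anything already linked, with the linked ids loaded by a separate cheap query. The `max_fetch` cap keeps the worst case bounded even when most neighbours are filtered out.
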

## Theme 3 — Cursor/keyset pagination for enumeration
**Problem.** Offset pagination drifts under concurrent writes; the list endpoint historically clamped the requested limit silently, truncating results; and full-body responses broke streaming.
**Evidence.** #148 silent limit→100 clamp; same query returned 9,881 then 4,981 rows mid-write; `ChunkedEncodingError` on large full-body pages; body-less default (#166) helped but offset drift remains.
**Direction.** Keyset/cursor pagination (`next_cursor`) for list/enumeration, stable under writes; keep body-less default + bounded opt-in `include_body`.
**Acceptance.** Enumerating a tag to exhaustion via cursor under concurrent writes returns a stable, complete set; no offset drift; `meta` documents the cursor contract.
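
A minimal keyset sketch, using an in-memory SQLite table so it runs anywhere. The schema, the opaque base64 cursor format, and the `list_page` / `demo_articles_db` names are assumptions; it also assumes a monotonically increasing integer `id`, so a schema keyed differently would need a compound key such as `(inserted_at, id)`.

```python
import base64
import json
import sqlite3
from typing import Optional


def demo_articles_db() -> sqlite3.Connection:
    """Hypothetical schema: ten articles sharing one tag."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE articles (id INTEGER PRIMARY KEY, tag TEXT, title TEXT, body TEXT)"
    )
    conn.executemany(
        "INSERT INTO articles (id, tag, title, body) VALUES (?, ?, ?, ?)",
        [(i, "ops", f"article {i}", "...") for i in range(1, 11)],
    )
    return conn


def encode_cursor(last_id: int) -> str:
    return base64.urlsafe_b64encode(json.dumps({"after_id": last_id}).encode()).decode()


def decode_cursor(cursor: str) -> int:
    return json.loads(base64.urlsafe_b64decode(cursor.encode()))["after_id"]


def list_page(conn: sqlite3.Connection, tag: str, limit: int,
              cursor: Optional[str] = None):
    """Return (rows, next_cursor); body-less by default, ordered by id."""
    after_id = decode_cursor(cursor) if cursor else 0
    rows = conn.execute(
        "SELECT id, title FROM articles WHERE tag = ? AND id > ? ORDER BY id LIMIT ?",
        (tag, after_id, limit + 1),  # fetch one extra row to detect a next page
    ).fetchall()
    page = rows[:limit]
    next_cursor = encode_cursor(page[-1][0]) if len(rows) > limit else None
    return page, next_cursor
```

Because each page starts strictly after the last id seen, deleting an already-returned row or inserting a new one between requests cannot shift later pages the way an `OFFSET` does.
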

## Theme 4 — Connection-pool sizing + set-based bulk mutations
**Problem.** A 3-connection admin pool starves under heavy/bulk reads; cascade mutations are O(n) round-trips.
**Evidence.** The transaction-based first fix for #172 caused pool starvation on the 3-conn admin pool (caught in adversarial review — the anti-pattern an existing `distant_pairs` comment warns about); a KB cleanup this session issued ~4,000 individual DELETEs because there's no set-based bulk-by-tag.
**Direction.** Size the pool / add a dedicated heavy-read pool; add set-based bulk archive/delete/unpublish by ids **and by tag** (one statement), incl. cascade-by-source.
**Acceptance.** Archiving a whole source/tag is one bounded statement (not N round-trips); heavy analytical reads don't contend on a tiny pool.
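
The set-based shape can be illustrated with a single `UPDATE` driven by a subquery. The sketch below uses in-memory SQLite and a hypothetical `articles` / `article_tags` schema; `archive_by_tag` and `demo_tagged_db` are illustrative names, and the real tables, statuses, and cascade rules may differ.

```python
import sqlite3


def demo_tagged_db() -> sqlite3.Connection:
    """Hypothetical schema: five published articles, three tagged 'stale'."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE articles (id INTEGER PRIMARY KEY, status TEXT)")
    conn.execute("CREATE TABLE article_tags (article_id INTEGER, tag TEXT)")
    conn.executemany(
        "INSERT INTO articles (id, status) VALUES (?, 'published')",
        [(i,) for i in range(1, 6)],
    )
    conn.executemany(
        "INSERT INTO article_tags (article_id, tag) VALUES (?, ?)",
        [(1, "stale"), (2, "stale"), (3, "stale"), (3, "ops"), (4, "ops")],
    )
    return conn


def archive_by_tag(conn: sqlite3.Connection, tag: str) -> int:
    """Archive every article carrying `tag` in one statement; return rows changed."""
    cur = conn.execute(
        """
        UPDATE articles
        SET status = 'archived'
        WHERE status <> 'archived'
          AND id IN (SELECT article_id FROM article_tags WHERE tag = ?)
        """,
        (tag,),
    )
    conn.commit()
    return cur.rowcount
```

One statement means one round-trip and one connection checkout regardless of how many articles carry the tag, which is the property that matters on a small admin pool.
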

---

## Sequencing
1. **Theme 1** first — without reproduce-at-scale + see-the-real-error, every perf fix is a blind guess (that's how #172 missed three times).
2. **Theme 2** — the biggest perf class; fixes all vector endpoints at once.
3. **Themes 3 & 4** — enumeration scale + write/pool efficiency.

## Workflow
Each theme ships as its own branch/PR off master with full enhanced review (Opus team + adversarial, every claim independently verified), all findings fixed in-PR (no deferrals), and green CI. Per Theme 1, each is also **verified against representative scale and confirmed on the live deployed build**, not just the test DB. Exhibit A (#172) is closed; tie any residual vector-endpoint work to Theme 2.

