codellm-devkit · rahlk · Jul 2, 2026
diff --git a/skills/cldk-sdk-frontend/SKILL.md b/skills/cldk-sdk-frontend/SKILL.md
@@ -41,6 +41,35 @@ of the facade's *query surface* (the SDK-side mirror of the backend's schema des
 that one approved surface into each target SDK. One facade vocabulary feeds every SDK encoding, so
 the SDKs stay in lockstep.
 
+## Client analyses (slicing, taint) are the SDK's job, not the analyzer's
+
+The `codeanalyzer-<lang>` backend is a **pure graph provider**: at `-a 3` it emits the dependence
+graph substrate — `program_graphs` (CFG/PDG/SDG) with transitive `SUMMARY` edges — and nothing
+more. It deliberately does **not** emit a `taint_flows` section or run a slice
+(`codeanalyzer-backend`'s `dataflow-graphs.md § provider/client boundary`). **Slicing, taint, and
+reachability queries live here, in the SDK**, as part of the facade's query surface:
+
+- **Backward/forward slice** and **taint** are reachability walks over the emitted graph —
+  `CDG ∪ DDG ∪ PARAM_* ∪ SUMMARY` — computed in-SDK. The `SUMMARY` edges the analyzer ships are
+  what make these **context-sensitive** (the two-phase HRB up-then-down traversal) without the SDK
+  re-descending into callees.
+- **Sources/sinks/sanitizers/library models are data, not code** — a JSON spec validated against a
+  JSON Schema, precedence *built-in pack < config file < caller-supplied* — and they live with the
+  SDK because they're a *policy* that changes far faster than the graph. This is why they aren't in
+  the analyzer: a policy edit re-runs a cheap in-SDK traversal instead of forcing a graph re-emit.
+- The **`TaintFlow` / slice-result models** (`{ source, sink, rule, sanitized, path }`, paths as
+  `(signature, node_id)` lists with model ids) are SDK models — the shared graph models
+  (`ProgramGraphs` / `GraphNode` / `GraphEdge` / `SDGEdge`) come from the backend contract; the
+  client-result models are added here.
+- Surface these as facade methods in the query-surface design loop (e.g. `get_backward_slice(...)`,
+  `get_taint_flows(spec=...)`), and gate them with the **Slice** and **Taint** frontend gates from
+  `sdk-testing.md` (exact expected node set for a slice; one source→sink flow found and the same
+  flow reported `sanitized` with a sanitizer interposed) over the analyzer's fixture graph.
+
+Over-approximations inherited from the graph (e.g. ENTRY-anchored `PARAM_IN` collapsing argument
+arity, missing `SUMMARY` edges until the analyzer ships them, heap flows only under the analyzer's
+heap-dependence mode) must be **surfaced in the SDK's results**, not silently absorbed.
+
 ## Precondition & inputs (what the backend skill hands you)
 
 Do not start until a **working, schema-conformant `codeanalyzer-<lang>`** exists. You need, from the
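
Note on the slice bullet in the hunk above: the two-phase traversal it describes could look roughly like the following. This is a minimal sketch under stated assumptions, not the SDK's implementation. The tuple shapes, the `EDGES` fixture, and the `backward_slice` helper are hypothetical; only the edge-type names (`DDG`, `PARAM_IN`, `PARAM_OUT`, `SUMMARY`, `CALL`) come from the diff.

```python
from collections import deque

# Hypothetical shapes: a node is (signature, node_id); an edge is (source, target, type).
# Fixture: callee `id` is invoked from call site A (main 1 -> 2) and call site B (main 4 -> 5).
EDGES = [
    (("main", 0), ("main", 1), "DDG"),        # value flows into the argument at site A
    (("main", 1), ("id", 0), "PARAM_IN"),     # actual-in -> formal-in
    (("id", 0), ("id", 1), "DDG"),            # formal-in -> formal-out inside the callee
    (("id", 1), ("main", 2), "PARAM_OUT"),    # formal-out -> actual-out at site A
    (("main", 1), ("main", 2), "SUMMARY"),    # transitive flow across site A
    (("main", 3), ("main", 4), "DDG"),        # value flows into the argument at site B
    (("main", 4), ("id", 0), "PARAM_IN"),
    (("id", 1), ("main", 5), "PARAM_OUT"),
    (("main", 4), ("main", 5), "SUMMARY"),    # transitive flow across site B
]


def _reverse_reach(preds, seeds, skip):
    """Reverse reachability from `seeds`, ignoring edges whose type is in `skip`."""
    seen = set(seeds)
    work = deque(seen)
    while work:
        node = work.popleft()
        for src, kind in preds.get(node, ()):
            if kind not in skip and src not in seen:
                seen.add(src)
                work.append(src)
    return seen


def backward_slice(edges, criterion):
    """Two-phase backward slice: first up (no descent), then down (no ascent)."""
    preds = {}
    for src, dst, kind in edges:
        preds.setdefault(dst, []).append((src, kind))
    # Phase 1: stay in the current procedure and its callers; SUMMARY edges
    # stand in for the callee, so PARAM_OUT edges are not followed.
    up = _reverse_reach(preds, {criterion}, skip={"PARAM_OUT"})
    # Phase 2: descend into callees from everything found so far, but never
    # climb back out through PARAM_IN or CALL edges.
    return _reverse_reach(preds, up, skip={"PARAM_IN", "CALL"})
```

On this fixture, slicing `("main", 5)` reaches call site B and the callee `id`, but not call site A. A plain reverse walk over every edge would also pull in `("main", 0)` and `("main", 1)` through the shared `PARAM_IN` target, which is the unrealizable path the `SUMMARY` edges are there to avoid.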

diff --git a/skills/cldk-sdk-frontend/references/sdk-testing.md b/skills/cldk-sdk-frontend/references/sdk-testing.md
@@ -120,6 +120,26 @@ def _analysis(tmp_path, level=AnalysisLevel.symbol_table):
 
 ---
 
+## 3b. Client-analysis gates (slicing & taint — SDK-side, only when the language has level 3)
+
+Slicing and taint run in the SDK over the analyzer's `program_graphs`, not in the analyzer (see
+`SKILL.md § Client analyses`). When the wired language exposes level-3 graphs, the query surface
+gets these gates, over the analyzer's own dataflow fixture:
+
+- **Slice gate:** a backward slice of a named `(signature, node_id)` criterion equals the
+  hand-computed expected node set: **exact**, not "non-empty". This catches missing control
+  dependences or def-use edges in the consumed graph, as well as a broken traversal in the SDK.
+- **Taint gate:** with a small sources/sinks/sanitizers spec, one known source→sink flow is found;
+  the **same** flow with a sanitizer interposed is reported `sanitized` (not dropped). Assert the
+  witness `path` is a contiguous `(signature, node_id)` chain and carries the matching model id.
+- **Context-sensitivity check (if `SUMMARY` edges are present):** a flow that enters a callee from
+  call site A does **not** exit at an unrelated call site B (no unrealizable path). If the analyzer
+  hasn't shipped `SUMMARY` edges yet, record this as a known over-approximation in the result
+  rather than asserting it away.
+
+These are the frontend counterparts of the backend's CFG/PDG/SDG gates — the backend proves the
+graph is correct; these prove the SDK's queries over it are correct.
+
 ## 4. Definition of done (SDK surface)
 
 - [ ] Mocked SDK tests pass under `pytest` (backend patched).
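
Note on the taint gate in the hunk above: one way to get a flow reported `sanitized` rather than dropped is to carry a sanitized flag along the walk instead of pruning at the sanitizer. The sketch below is illustrative only. `find_taint_flows`, the node-set spec shape, and the `demo-rule` id are hypothetical; a real spec would match APIs through model ids, which this sketch omits. It is also a plain forward walk with no context-sensitivity.

```python
from collections import deque


def find_taint_flows(edges, spec):
    """Forward reachability from each source; remembers whether a sanitizer was crossed."""
    succs = {}
    for src, dst, _kind in edges:
        succs.setdefault(src, []).append(dst)
    sinks = set(spec["sinks"])
    sanitizers = set(spec.get("sanitizers", ()))
    flows = []
    for source in spec["sources"]:
        start = (source, False)              # state = (node, sanitized_so_far)
        parent = {start: None}
        work = deque([start])
        while work:
            state = work.popleft()
            node, sanitized = state
            if node in sinks:
                # Rebuild the witness path by walking parent links back to the source.
                path, cur = [], state
                while cur is not None:
                    path.append(cur[0])
                    cur = parent[cur]
                flows.append({"source": source, "sink": node, "rule": spec["rule"],
                              "sanitized": sanitized, "path": path[::-1]})
                continue
            for nxt in succs.get(node, ()):
                nxt_state = (nxt, sanitized or nxt in sanitizers)
                if nxt_state not in parent:
                    parent[nxt_state] = state
                    work.append(nxt_state)
    return flows


# Hypothetical three-node fixture: source -> middle -> sink.
EDGES = [(("f", 0), ("f", 1), "DDG"), (("f", 1), ("f", 2), "DDG")]
SPEC = {"rule": "demo-rule", "sources": [("f", 0)], "sinks": [("f", 2)]}
SANITIZED_SPEC = {**SPEC, "sanitizers": [("f", 1)]}
```

A gate in this style would assert that `find_taint_flows(EDGES, SPEC)` yields one unsanitized flow, that `SANITIZED_SPEC` yields the same path with `sanitized` set, and that every consecutive pair in `path` is an edge of the fixture.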

diff --git a/skills/codeanalyzer-backend/SKILL.md b/skills/codeanalyzer-backend/SKILL.md
@@ -316,11 +316,8 @@ schema` for the static `schema.neo4j.json` contract. Build it as a modular `neo4
 **optional/lazy** dependency, and hold the graph schema in lockstep with the JSON schema (same
 `SCHEMA_DECISIONS.md` node kinds → node labels; identity-only call edges → `CALLS`). The SDK's
 Neo4j backend (frontend skill) reconstructs the canonical model from this graph, so the node
-families and `--app-name` anchor must match. **The graph is always full-depth:** analysis levels
-gate the JSON path only — `--emit neo4j` runs at maximum implemented depth (once level 3 exists,
-the complete SDG/CPG, unconditionally), and combining `-a`/`--graphs` with it is an explicit
-error (`neo4j-projection.md § Depth rule`). Leave the projection out only if the user explicitly
-scopes to JSON-only; otherwise it's a standard deliverable of the CLI/packaging stage.
+families and `--app-name` anchor must match. Leave it out only if the user explicitly scopes to
+JSON-only; otherwise it's a standard deliverable of the CLI/packaging stage.
 
 ### (Optional) Level 2: framework-based analysis
 Gated on the depth choice from *Orient & choose the backend tooling*. The heavy tier — a dedicated analysis engine
@@ -340,8 +337,14 @@ in the README's *Architecture & Tooling*), and build stage by stage per
 `references/dataflow-construction.md` against the contract in `references/dataflow-graphs.md`.
 The rules that bind: everything is **native and in-process**; graphs are keyed by
 `(signature, node_id)` on the same `signatureOf()`; each stage's gate passes before the next
-stage starts; `-a 1`/`-a 2` stay untouched; the **SDG is the core artifact** (clients query it),
-and the CPG is only its Neo4j projection — skip the CPG if the Neo4j surface isn't in scope.
+stage starts; `-a 1`/`-a 2` stay untouched; the **SDG (with its transitive `SUMMARY` edges) is
+the core artifact**, and the CPG is only its Neo4j projection — skip the CPG if the Neo4j surface
+isn't in scope. **Provider/client boundary:** this skill builds the *graph* only. Slicing and
+taint are reachability *queries* over that graph and belong to the **frontend SDK**
+(`cldk-sdk-frontend`) — do not build a slicer/taint engine here and do not emit a `taint_flows`
+section. `SUMMARY` edges stay on the analyzer side of that boundary because they are
+policy-agnostic data-dependence substrate (not tied to any sources/sinks config); they are
+exactly what make the frontend's queries context-sensitive.
 
 ### Write the analyzer README (last build step)
 The analyzer's `codeanalyzer-<lang>/README.md` already holds the **Architecture & Tooling**

diff --git a/skills/codeanalyzer-backend/references/dataflow-construction.md b/skills/codeanalyzer-backend/references/dataflow-construction.md
@@ -152,18 +152,23 @@ Stitch the PDGs with the interprocedural edges (Horwitz–Reps–Binkley):
 `(signature, node_id)` endpoints, arity match, at least one SUMMARY edge for a known transitive
 flow, and the whole `program_graphs` section validates against the SDK models.
 
-## Stage 8 — Clients and the CPG
+## Stage 8 — The CPG (the analyzer's last stage)
 
-- **Slicing and taint** as SDG queries (`dataflow-graphs.md § Client analyses`) — the two-phase
-  HRB traversal for context-sensitive slices; labeled reachability with sanitizer blocking for
-  taint; witness paths reconstructed lazily over reverse edges.
 - **CPG:** project the new node/edge families through the existing `neo4j/` subpackage —
   new labels in the schema catalog, same `RowBuilder`/writer machinery, additive
   `schema.neo4j.json` version bump. The deferred-edge gate already enforces no-dangling.
 
-**Client gate:** the slice and taint assertions from `dataflow-graphs.md § Verification gates`,
-plus: the Cypher snapshot with graphs enabled loads clean into an empty Neo4j and a
-`MATCH (:CFGNode)` count equals the JSON node count.
+**Slicing and taint are NOT an analyzer stage.** They are reachability queries over the emitted
+SDG and live in the **frontend SDK** (`cldk-sdk-frontend`), per the provider/client boundary in
+`dataflow-graphs.md`. The analyzer's dataflow work ends when the SDG (with its `SUMMARY` edges)
+and the CPG are emitted; do not build a slicer or a taint engine here, and do not emit a
+`taint_flows` section. The two-phase HRB slice traversal, labeled taint reachability with
+sanitizer blocking, lazy witness reconstruction, and the sources/sinks/sanitizers model packs are
+all the SDK's job.
+
+**CPG gate:** the Cypher snapshot with graphs enabled loads clean into an empty Neo4j and a
+`MATCH (:CFGNode)` count equals the JSON node count. (The slice/taint gates are frontend gates —
+`cldk-sdk-frontend`.)
 
 ---
 

diff --git a/skills/codeanalyzer-backend/references/dataflow-graphs.md b/skills/codeanalyzer-backend/references/dataflow-graphs.md
@@ -19,15 +19,22 @@ runs in-process in the analyzer's own language (Jelly, WALA, `go/ssa`) counts as
 | --- | --- | --- | --- |
 | 1 | Symbol table + resolver call graph | Cheap | `-a 1` / `-a 2` (default 1) |
 | 2 | Framework-based call-graph enrichment (Joern/WALA) | Heavy, external | own toggle, off by default |
-| **3** | **Native CFG/DFG/PDG/SDG + client queries (slicing, taint)** | Heavy, in-process | `-a 3` (+ `--graphs` selector) |
+| **3** | **Native CFG/DFG/PDG/SDG — the graph substrate** | Heavy, in-process | `-a 3` (+ `--graphs` selector) |
 
 `-a 3` implies `-a 2`'s resolver call graph (the SDG needs it). The framework toggle stays
 orthogonal — its edges merge into the call graph with provenance, exactly as at level 2. The
 cheap path stays cheap: **nothing at level 3 may run unless requested.**
 
-The levels gate the **JSON path only**. When the output target is the graph (`--emit neo4j`),
-levels don't apply: the analyzer runs at maximum implemented depth and projects the **full SDG**
-unconditionally (`neo4j-projection.md § Depth rule`).
+**Provider/client boundary (non-negotiable).** The analyzer is a **pure graph provider**: level 3
+emits the universal dependence graph — CFG, PDG, SDG, and the transitive `SUMMARY` edges — and
+*stops there*. **Client analyses (taint, slicing, reachability) are NOT analyzer concerns** — they
+are reachability queries run in the **frontend SDK** over the emitted graph (`cldk-sdk-frontend`).
+The analyzer never emits a `taint_flows` section, never ingests a sources/sinks/sanitizers policy,
+and never runs a slice. Rationale: a taint result is keyed on a *policy* (which APIs are
+sources/sinks) that evolves at SDK speed; baking it into the graph would couple the universal
+artifact to one policy and force a re-emit on every model-pack edit. What *does* stay analyzer-side is
+**policy-agnostic substrate** — `SUMMARY` edges are keyed on data dependence, not on any taint
+config, so they belong in the graph and are what make the frontend's queries context-sensitive.
 
 ## The graph ladder (definitions and edge vocabulary)
 
@@ -47,7 +54,7 @@ whole-program.
 4. **SDG (system dependence graph)** — the whole-program graph: all PDGs stitched together at
    call sites via `CALL`, `PARAM_IN`, `PARAM_OUT`, and transitive `SUMMARY` edges
    (Horwitz–Reps–Binkley). Global/module state is modeled as extra parameters. This is the graph
-   client analyses (slicing, taint) query.
+   the **frontend SDK's** client analyses (slicing, taint) query — the analyzer only produces it.
 
 ## Node identity (the invariant that makes everything joinable)
 
@@ -91,8 +98,9 @@ facade invariant that `analysis.json` is the single facade-visible output:
         "target": { "signature": "...", "node": 0 },
         "type": "PARAM_IN", "var": "arg0" }
     ]
-  },
-  "taint_flows": [ ... ]                 // optional client-analysis output, see below
+  }
+  // NO `taint_flows` — client analyses (taint, slicing) are a FRONTEND SDK concern, not an
+  // analyzer output. The analyzer emits only the graph above. See § Client analyses below.
 }
 ```
 
@@ -101,26 +109,34 @@ facade invariant that `analysis.json` is the single facade-visible output:
   without `pdg` emits a PDG with only `DDG` edges.
 - Unrecognized `--graphs` values follow the **flag-validation rule** (`cli-contract.md`): explicit
   non-zero error, never silent fallback.
-- The CPG is **Neo4j-only**, and the graph surface is **level-agnostic**: `--emit neo4j` always
-  runs at maximum implemented depth and projects the full SDG — `-a` and `--graphs` gate only
-  the JSON path, and combining them with `--emit neo4j` is an explicit error
-  (`neo4j-projection.md § Depth rule`). `--emit schema` includes the CFG/PDG/SDG labels in
-  `schema.neo4j.json`.
+- The CPG is **Neo4j-only**: `--emit neo4j` at `-a 3` adds the CFG/PDG/SDG labels and edge types
+  to the projection (see § CPG below). `--emit schema` includes them in `schema.neo4j.json`.
 - `program_graphs.schema_version` is versioned independently of the top-level schema and bumps
   additively, like `schema.neo4j.json`.
 
-## Client analyses are queries, not engines
+## Client analyses live in the frontend, not the analyzer
 
-Slicing and taint are **reachability queries over the SDG**, not separate analyses:
+Slicing and taint are **reachability queries over the SDG**, not separate analyses — and they run
+in the **frontend SDK** (`cldk-sdk-frontend`), never in the analyzer. The analyzer's job ends at
+emitting the graph substrate; the SDK loads it and answers queries against it:
 
 - **Backward slice** of `(signature, node)`: reverse reachability over `CDG ∪ DDG ∪ PARAM_* ∪
   SUMMARY` (context-sensitive via the two-phase HRB traversal — up then down).
 - **Taint**: seed at *sources*, propagate labeled reachability along dependence edges, block at
   *sanitizers* on the path, report when a source label reaches a matching *sink*. Sources, sinks,
   sanitizers, and library models are **data, not code** — a JSON spec validated against a JSON
-  Schema, with precedence *built-in pack < config file < inline flags*. Output is the
-  `taint_flows` section: `{ source, sink, rule, sanitized, path }`, each path a list of
-  `(signature, node_id)` pairs, with the matching model id for explainability.
+  Schema, with precedence *built-in pack < config file < caller-supplied*. The result is a
+  `taint_flows` structure — `{ source, sink, rule, sanitized, path }`, each path a list of
+  `(signature, node_id)` pairs with the matching model id for explainability — **produced and
+  owned by the SDK**, not written into the analyzer's `analysis.json`.
+
+**Why this split.** The `SUMMARY` edges the analyzer emits are keyed on data dependence and are
+reusable across *every* taint policy, so they are graph substrate and belong in the analyzer.
+Sources/sinks/sanitizers are a *policy* that changes far faster than the graph; keeping the query
+(and its `taint_flows` output) in the SDK means a policy edit re-runs a cheap traversal instead of
+re-emitting the whole universal graph. This is exactly Joern's factoring: the CPG stores the
+dependence substrate; `reachableBy` is evaluated at query time — Joern does **not** materialize
+all-pairs taint edges, and neither should a `codeanalyzer-<lang>`.
 
 ## Cross-language parity clause
 
@@ -152,6 +168,12 @@ Each rung has a gate; do not build the next rung until the current one passes:
 | Dominance | Post-dominator tree well-formed (unique root = EXIT; infinite loops handled via synthetic edge) |
 | PDG | CDG edges match hand-computed control dependence on the fixture; every DDG edge connects a real def to a real use of the same access path |
 | SDG | No dangling `(signature, node_id)` endpoints; PARAM_IN/OUT arity matches the callable's parameters; SUMMARY edges exist for at least one transitive flow in the fixture |
+
+The **Slice** and **Taint** gates are **frontend gates** — they exercise the SDK's queries over
+the emitted graph, not the analyzer, and live in `cldk-sdk-frontend`'s `sdk-testing.md` (§ 3b):
+
+| Frontend gate | Core assertion |
+| --- | --- |
 | Slice | Backward slice of a named fixture variable equals the hand-computed expected node set — **exact**, not "non-empty" |
 | Taint | One known source→sink flow found; the same flow with a sanitizer on the path is reported `sanitized` |
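
Note on the precedence rule stated in this file's taint bullet (built-in pack, then config file, then the highest-priority layer): the diff fixes the ordering but not the merge semantics. The sketch below assumes one plausible reading, in which a later layer overrides an earlier model with the same id. The `merge_specs` helper, the `id` and `api` keys, and the example model names are all hypothetical.

```python
KINDS = ("sources", "sinks", "sanitizers")


def merge_specs(builtin, config=None, caller=None):
    """Merge spec layers in precedence order; later layers win per model id (assumed)."""
    merged = {kind: {} for kind in KINDS}
    for layer in (builtin, config or {}, caller or {}):
        for kind in KINDS:
            for model in layer.get(kind, []):
                # Same id in a later layer replaces the earlier definition.
                merged[kind][model["id"]] = model
    return {kind: list(models.values()) for kind, models in merged.items()}


# Hypothetical layers, lowest precedence first.
BUILTIN = {"sources": [{"id": "http.param", "api": "req.query"}]}
CONFIG = {"sources": [{"id": "http.param", "api": "req.body"}]}
CALLER = {"sinks": [{"id": "sql.exec", "api": "db.execute"}]}

MERGED = merge_specs(BUILTIN, CONFIG, CALLER)
```

Because the merge is pure data handling, a policy edit only changes the merged spec and the traversal that consumes it; the emitted graph is untouched.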