diff --git a/skills/cldk-sdk-frontend/SKILL.md b/skills/cldk-sdk-frontend/SKILL.md
index 1003b79..1bc285e 100644
--- a/skills/cldk-sdk-frontend/SKILL.md
+++ b/skills/cldk-sdk-frontend/SKILL.md
@@ -41,6 +41,35 @@ of the facade's *query surface* (the SDK-side mirror of the backend's schema des
 that one approved surface into each target SDK. One facade vocabulary feeds every SDK encoding, so
 the SDKs stay in lockstep.
 
+## Client analyses (slicing, taint) are the SDK's job, not the analyzer's
+
+The `codeanalyzer-` backend is a **pure graph provider**: at `-a 3` it emits the dependence
+graph substrate — `program_graphs` (CFG/PDG/SDG) with transitive `SUMMARY` edges — and nothing
+more. It deliberately does **not** emit a `taint_flows` section or run a slice
+(`codeanalyzer-backend`'s `dataflow-graphs.md § provider/client boundary`). **Slicing, taint, and
+reachability queries live here, in the SDK**, as part of the facade's query surface:
+
+- **Backward/forward slice** and **taint** are reachability walks over the emitted graph —
+  `CDG ∪ DDG ∪ PARAM_* ∪ SUMMARY` — computed in-SDK. The `SUMMARY` edges the analyzer ships are
+  what make these **context-sensitive** (the two-phase HRB up-then-down traversal) without the SDK
+  re-descending into callees.
+- **Sources/sinks/sanitizers/library models are data, not code** — a JSON spec validated against a
+  JSON Schema, precedence *built-in pack < config file < caller-supplied* — and they live with the
+  SDK because they're a *policy* that changes far faster than the graph. This is why they aren't in
+  the analyzer: a policy edit re-runs a cheap in-SDK traversal instead of forcing a graph re-emit.
+- The **`TaintFlow` / slice-result models** (`{ source, sink, rule, sanitized, path }`, paths as
+  `(signature, node_id)` lists with model ids) are SDK models — the shared graph models
+  (`ProgramGraphs` / `GraphNode` / `GraphEdge` / `SDGEdge`) come from the backend contract; the
+  client-result models are added here.
+- Surface these as facade methods in the query-surface design loop (e.g. `get_backward_slice(...)`,
+  `get_taint_flows(spec=...)`), and gate them with the **Slice** and **Taint** frontend gates from
+  `sdk-testing.md` (exact expected node set for a slice; one source→sink flow found and the same
+  flow reported `sanitized` with a sanitizer interposed) over the analyzer's fixture graph.
+
+Over-approximations inherited from the graph (e.g. ENTRY-anchored `PARAM_IN` collapsing argument
+arity, missing `SUMMARY` edges before that analyzer PR lands, heap flows only under the analyzer's
+heap-dependence mode) must be **surfaced in the SDK's results**, not silently absorbed.
+
 ## Precondition & inputs (what the backend skill hands you)
 
 Do not start until a **working, schema-conformant `codeanalyzer-`** exists. You need, from the
diff --git a/skills/cldk-sdk-frontend/references/sdk-testing.md b/skills/cldk-sdk-frontend/references/sdk-testing.md
index 10ea4f2..93f727a 100644
--- a/skills/cldk-sdk-frontend/references/sdk-testing.md
+++ b/skills/cldk-sdk-frontend/references/sdk-testing.md
@@ -120,6 +120,26 @@ def _analysis(tmp_path, level=AnalysisLevel.symbol_table):
 
 ---
 
+## 3b. Client-analysis gates (slicing & taint — SDK-side, only when the language has level 3)
+
+Slicing and taint run in the SDK over the analyzer's `program_graphs`, not in the analyzer (see
+`SKILL.md § Client analyses`). When the wired language exposes level-3 graphs, the query surface
+gets these gates, over the analyzer's own dataflow fixture:
+
+- **Slice gate:** a backward slice of a named `(signature, node_id)` criterion equals the
+  hand-computed expected node set — **exact**, not "non-empty". This catches both missing control
+  dependences and missing def-use edges in the consumed graph, and a broken traversal in the SDK.
+- **Taint gate:** with a small sources/sinks/sanitizers spec, one known source→sink flow is found;
+  the **same** flow with a sanitizer interposed is reported `sanitized` (not dropped). Assert the
+  witness `path` is a contiguous `(signature, node_id)` chain and carries the matching model id.
+- **Context-sensitivity check (if `SUMMARY` edges are present):** a flow that enters a callee from
+  call site A does **not** exit at an unrelated call site B (no unrealizable path). If the analyzer
+  hasn't shipped `SUMMARY` edges yet, record this as a known over-approximation in the result
+  rather than asserting it away.
+
+These are the frontend counterparts of the backend's CFG/PDG/SDG gates — the backend proves the
+graph is correct; these prove the SDK's queries over it are correct.
+
 ## 4. Definition of done (SDK surface)
 
 - [ ] Mocked SDK tests pass under `pytest` (backend patched).
diff --git a/skills/codeanalyzer-backend/SKILL.md b/skills/codeanalyzer-backend/SKILL.md
index 8bb05b4..51dae82 100644
--- a/skills/codeanalyzer-backend/SKILL.md
+++ b/skills/codeanalyzer-backend/SKILL.md
@@ -316,11 +316,8 @@ schema` for the static `schema.neo4j.json` contract. Build it as a modular `neo4
 **optional/lazy** dependency, and hold the graph schema in lockstep with the JSON schema (same
 `SCHEMA_DECISIONS.md` node kinds → node labels; identity-only call edges → `CALLS`). The SDK's
 Neo4j backend (frontend skill) reconstructs the canonical model from this graph, so the node
-families and `--app-name` anchor must match. **The graph is always full-depth:** analysis levels
-gate the JSON path only — `--emit neo4j` runs at maximum implemented depth (once level 3 exists,
-the complete SDG/CPG, unconditionally), and combining `-a`/`--graphs` with it is an explicit
-error (`neo4j-projection.md § Depth rule`). Leave the projection out only if the user explicitly
-scopes to JSON-only; otherwise it's a standard deliverable of the CLI/packaging stage.
+families and `--app-name` anchor must match. Leave it out only if the user explicitly scopes to
+JSON-only; otherwise it's a standard deliverable of the CLI/packaging stage.
 
 ### (Optional) Level 2: framework-based analysis
 Gated on the depth choice from *Orient & choose the backend tooling*. The heavy tier — a dedicated analysis engine
@@ -340,8 +337,14 @@ in the README's *Architecture & Tooling*), and build stage by stage per
 `references/dataflow-construction.md` against the contract in `references/dataflow-graphs.md`.
 The rules that bind: everything is **native and in-process**; graphs are keyed by
 `(signature, node_id)` on the same `signatureOf()`; each stage's gate passes before the next
-stage starts; `-a 1`/`-a 2` stay untouched; the **SDG is the core artifact** (clients query it),
-and the CPG is only its Neo4j projection — skip the CPG if the Neo4j surface isn't in scope.
+stage starts; `-a 1`/`-a 2` stay untouched; the **SDG (with its transitive `SUMMARY` edges) is
+the core artifact**, and the CPG is only its Neo4j projection — skip the CPG if the Neo4j surface
+isn't in scope. **Provider/client boundary:** this skill builds the *graph* only. Slicing and
+taint are reachability *queries* over that graph and belong to the **frontend SDK**
+(`cldk-sdk-frontend`) — do not build a slicer/taint engine here and do not emit a `taint_flows`
+section. `SUMMARY` edges are the exception that proves the rule: they're policy-agnostic data-
+dependence substrate (not tied to any sources/sinks config), so they stay in the analyzer and are
+exactly what make the frontend's queries context-sensitive.
 
 ### Write the analyzer README (last build step)
 The analyzer's `codeanalyzer-/README.md` already holds the **Architecture & Tooling**
diff --git a/skills/codeanalyzer-backend/references/dataflow-construction.md b/skills/codeanalyzer-backend/references/dataflow-construction.md
index 6f2ae19..ae52dc4 100644
--- a/skills/codeanalyzer-backend/references/dataflow-construction.md
+++ b/skills/codeanalyzer-backend/references/dataflow-construction.md
@@ -152,18 +152,23 @@ Stitch the PDGs with the interprocedural edges (Horwitz–Reps–Binkley):
   `(signature, node_id)` endpoints, arity match, at least one SUMMARY edge for a known transitive
   flow, and the whole `program_graphs` section validates against the SDK models.
 
-## Stage 8 — Clients and the CPG
+## Stage 8 — The CPG (the analyzer's last stage)
 
-- **Slicing and taint** as SDG queries (`dataflow-graphs.md § Client analyses`) — the two-phase
-  HRB traversal for context-sensitive slices; labeled reachability with sanitizer blocking for
-  taint; witness paths reconstructed lazily over reverse edges.
 - **CPG:** project the new node/edge families through the existing `neo4j/` subpackage — new
   labels in the schema catalog, same `RowBuilder`/writer machinery, additive `schema.neo4j.json`
   version bump. The deferred-edge gate already enforces no-dangling.
 
-**Client gate:** the slice and taint assertions from `dataflow-graphs.md § Verification gates`,
-plus: the Cypher snapshot with graphs enabled loads clean into an empty Neo4j and a
-`MATCH (:CFGNode)` count equals the JSON node count.
+**Slicing and taint are NOT an analyzer stage.** They are reachability queries over the emitted
+SDG and live in the **frontend SDK** (`cldk-sdk-frontend`), per the provider/client boundary in
+`dataflow-graphs.md`. The analyzer's dataflow work ends when the SDG (with its `SUMMARY` edges)
+and the CPG are emitted; do not build a slicer or a taint engine here, and do not emit a
+`taint_flows` section. The two-phase HRB slice traversal, labeled taint reachability with
+sanitizer blocking, lazy witness reconstruction, and the sources/sinks/sanitizers model packs are
+all the SDK's job.
+
+**CPG gate:** the Cypher snapshot with graphs enabled loads clean into an empty Neo4j and a
+`MATCH (:CFGNode)` count equals the JSON node count. (The slice/taint gates are frontend gates —
+`cldk-sdk-frontend`.)
 
 ---
 
diff --git a/skills/codeanalyzer-backend/references/dataflow-graphs.md b/skills/codeanalyzer-backend/references/dataflow-graphs.md
index fad5103..02f6f52 100644
--- a/skills/codeanalyzer-backend/references/dataflow-graphs.md
+++ b/skills/codeanalyzer-backend/references/dataflow-graphs.md
@@ -19,15 +19,22 @@ runs in-process in the analyzer's own language (Jelly, WALA, `go/ssa`) counts as
 | --- | --- | --- | --- |
 | 1 | Symbol table + resolver call graph | Cheap | `-a 1` / `-a 2` (default 1) |
 | 2 | Framework-based call-graph enrichment (Joern/WALA) | Heavy, external | own toggle, off by default |
-| **3** | **Native CFG/DFG/PDG/SDG + client queries (slicing, taint)** | Heavy, in-process | `-a 3` (+ `--graphs` selector) |
+| **3** | **Native CFG/DFG/PDG/SDG — the graph substrate** | Heavy, in-process | `-a 3` (+ `--graphs` selector) |
 
 `-a 3` implies `-a 2`'s resolver call graph (the SDG needs it). The framework toggle stays
 orthogonal — its edges merge into the call graph with provenance, exactly as at level 2. The
 cheap path stays cheap: **nothing at level 3 may run unless requested.**
 
-The levels gate the **JSON path only**. When the output target is the graph (`--emit neo4j`),
-levels don't apply: the analyzer runs at maximum implemented depth and projects the **full SDG**
-unconditionally (`neo4j-projection.md § Depth rule`).
+**Provider/client boundary (non-negotiable).** The analyzer is a **pure graph provider**: level 3
+emits the universal dependence graph — CFG, PDG, SDG, and the transitive `SUMMARY` edges — and
+*stops there*. **Client analyses (taint, slicing, reachability) are NOT analyzer concerns** — they
+are reachability queries run in the **frontend SDK** over the emitted graph (`cldk-sdk-frontend`).
+The analyzer never emits a `taint_flows` section, never ingests a sources/sinks/sanitizers policy,
+and never runs a slice. Rationale: a taint result is keyed on a *policy* (which APIs are sources/
+sinks) that evolves at SDK speed; baking it into the graph would couple the universal artifact to
+one policy and force a re-emit on every model-pack edit. What *does* stay analyzer-side is
+**policy-agnostic substrate** — `SUMMARY` edges are keyed on data dependence, not on any taint
+config, so they belong in the graph and are what make the frontend's queries context-sensitive.
 
 ## The graph ladder (definitions and edge vocabulary)
 
@@ -47,7 +54,7 @@ whole-program.
 4. **SDG (system dependence graph)** — the whole-program graph: all PDGs stitched together at
    call sites via `CALL`, `PARAM_IN`, `PARAM_OUT`, and transitive `SUMMARY` edges
    (Horwitz–Reps–Binkley). Global/module state is modeled as extra parameters. This is the graph
-   client analyses (slicing, taint) query.
+   the **frontend SDK's** client analyses (slicing, taint) query — the analyzer only produces it.
 
 ## Node identity (the invariant that makes everything joinable)
 
@@ -91,8 +98,9 @@ facade invariant that `analysis.json` is the single facade-visible output:
         "target": { "signature": "...", "node": 0 },
         "type": "PARAM_IN", "var": "arg0" }
     ]
-  },
-  "taint_flows": [ ... ] // optional client-analysis output, see below
+  }
+  // NO `taint_flows` — client analyses (taint, slicing) are a FRONTEND SDK concern, not an
+  // analyzer output. The analyzer emits only the graph above. See § Client analyses below.
 }
 ```
 
@@ -101,26 +109,34 @@ facade invariant that `analysis.json` is the single facade-visible output:
   without `pdg` emits a PDG with only `DDG` edges.
 - Unrecognized `--graphs` values follow the **flag-validation rule** (`cli-contract.md`):
   explicit non-zero error, never silent fallback.
-- The CPG is **Neo4j-only**, and the graph surface is **level-agnostic**: `--emit neo4j` always
-  runs at maximum implemented depth and projects the full SDG — `-a` and `--graphs` gate only
-  the JSON path, and combining them with `--emit neo4j` is an explicit error
-  (`neo4j-projection.md § Depth rule`). `--emit schema` includes the CFG/PDG/SDG labels in
-  `schema.neo4j.json`.
+- The CPG is **Neo4j-only**: `--emit neo4j` at `-a 3` adds the CFG/PDG/SDG labels and edge types
+  to the projection (see § CPG below). `--emit schema` includes them in `schema.neo4j.json`.
 - `program_graphs.schema_version` is versioned independently of the top-level schema and bumps
   additively, like `schema.neo4j.json`.
 
-## Client analyses are queries, not engines
+## Client analyses live in the frontend, not the analyzer
 
-Slicing and taint are **reachability queries over the SDG**, not separate analyses:
+Slicing and taint are **reachability queries over the SDG**, not separate analyses — and they run
+in the **frontend SDK** (`cldk-sdk-frontend`), never in the analyzer. The analyzer's job ends at
+emitting the graph substrate; the SDK loads it and answers queries against it:
 
 - **Backward slice** of `(signature, node)`: reverse reachability over `CDG ∪ DDG ∪ PARAM_* ∪
   SUMMARY` (context-sensitive via the two-phase HRB traversal — up then down).
 - **Taint**: seed at *sources*, propagate labeled reachability along dependence edges, block at
   *sanitizers* on the path, report when a source label reaches a matching *sink*. Sources, sinks,
   sanitizers, and library models are **data, not code** — a JSON spec validated against a JSON
-  Schema, with precedence *built-in pack < config file < inline flags*. Output is the
-  `taint_flows` section: `{ source, sink, rule, sanitized, path }`, each path a list of
-  `(signature, node_id)` pairs, with the matching model id for explainability.
+  Schema, with precedence *built-in pack < config file < inline flags*. The result is a
+  `taint_flows` structure — `{ source, sink, rule, sanitized, path }`, each path a list of
+  `(signature, node_id)` pairs with the matching model id for explainability — **produced and
+  owned by the SDK**, not written into the analyzer's `analysis.json`.
+
+**Why this split.** The `SUMMARY` edges the analyzer emits are keyed on data dependence and are
+reusable across *every* taint policy, so they are graph substrate and belong in the analyzer.
+Sources/sinks/sanitizers are a *policy* that changes far faster than the graph; keeping the query
+(and its `taint_flows` output) in the SDK means a policy edit re-runs a cheap traversal instead of
+re-emitting the whole universal graph. This is exactly Joern's factoring: the CPG stores the
+dependence substrate; `reachableBy` is evaluated at query time — Joern does **not** materialize
+all-pairs taint edges, and neither should a `codeanalyzer-`.
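
Reviewer sketch (not part of the patch): the two-phase backward slice described in the section above could look roughly like the following on the SDK side. This is a minimal illustration under stated assumptions, not the SDK's implementation. The `(source, target, kind)` edge triples, the function names `backward_slice` and `_reverse_reach`, and the toy graph are all hypothetical. The split of `PARAM_*` into `PARAM_IN` / `PARAM_OUT` and the choice of which edge kinds each phase skips follow the usual Horwitz-Reps-Binkley formulation, which is assumed here to match the "up then down" wording.

```python
from collections import deque

# Edge kinds taken from the contract's vocabulary.
INTRA = {"CDG", "DDG", "SUMMARY"}
PHASE1 = INTRA | {"CALL", "PARAM_IN"}  # "up": ascend to callers, never descend
PHASE2 = INTRA | {"PARAM_OUT"}         # "down": descend into callees, never re-ascend


def _reverse_reach(edges, seeds, allowed):
    """Reverse reachability from `seeds`, following only edge kinds in `allowed`."""
    preds = {}
    for src, dst, kind in edges:
        if kind in allowed:
            preds.setdefault(dst, []).append(src)
    seen = set(seeds)
    queue = deque(seen)
    while queue:
        node = queue.popleft()
        for pred in preds.get(node, ()):
            if pred not in seen:
                seen.add(pred)
                queue.append(pred)
    return seen


def backward_slice(edges, criterion):
    """Two-phase slice: phase 2 is seeded with every node phase 1 reached."""
    phase1 = _reverse_reach(edges, [criterion], PHASE1)
    return _reverse_reach(edges, phase1, PHASE2)


# Hypothetical fixture: main() calls id(x) at call sites A and B.
M, F = "main()", "id(x)"
EDGES = [
    ((M, 0), (M, 1), "DDG"),        # src_a -> actual-in at A
    ((M, 3), (M, 4), "DDG"),        # src_b -> actual-in at B
    ((M, 1), (F, 0), "PARAM_IN"),   # A's actual-in -> formal-in
    ((M, 4), (F, 0), "PARAM_IN"),   # B's actual-in -> formal-in
    ((F, 0), (F, 1), "DDG"),        # formal-in -> formal-out inside the callee
    ((F, 1), (M, 2), "PARAM_OUT"),  # formal-out -> A's actual-out
    ((F, 1), (M, 5), "PARAM_OUT"),  # formal-out -> B's actual-out
    ((M, 1), (M, 2), "SUMMARY"),    # transitive flow across call site A
    ((M, 4), (M, 5), "SUMMARY"),    # transitive flow across call site B
]

SLICE_A = backward_slice(EDGES, (M, 2))
NAIVE_A = _reverse_reach(EDGES, [(M, 2)], PHASE1 | PHASE2)
```

On this toy graph, `SLICE_A` contains call site A's nodes plus the callee's two nodes and nothing from call site B, while the single-pass `NAIVE_A` also pulls in B's source through the shared formal-in. That unrealizable path is the one the context-sensitivity check is meant to rule out.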
 
 ## Cross-language parity clause
 
@@ -152,6 +168,12 @@ Each rung has a gate; do not build the next rung until the current one passes:
 | Dominance | Post-dominator tree well-formed (unique root = EXIT; infinite loops handled via synthetic edge) |
 | PDG | CDG edges match hand-computed control dependence on the fixture; every DDG edge connects a real def to a real use of the same access path |
 | SDG | No dangling `(signature, node_id)` endpoints; PARAM_IN/OUT arity matches the callable's parameters; SUMMARY edges exist for at least one transitive flow in the fixture |
+
+The **Slice** and **Taint** gates are **frontend gates** — they exercise the SDK's queries over
+the emitted graph, not the analyzer, and live in `cldk-sdk-frontend`'s testing reference:
+
+| Frontend gate | Core assertion |
+| --- | --- |
 | Slice | Backward slice of a named fixture variable equals the hand-computed expected node set — **exact**, not "non-empty" |
 | Taint | One known source→sink flow found; the same flow with a sanitizer on the path is reported `sanitized` |
 
diff --git a/skills/codeanalyzer-backend/references/dataflow-issue-template.md b/skills/codeanalyzer-backend/references/dataflow-issue-template.md
index 511b3fa..8d33c62 100644
--- a/skills/codeanalyzer-backend/references/dataflow-issue-template.md
+++ b/skills/codeanalyzer-backend/references/dataflow-issue-template.md
@@ -15,16 +15,23 @@ staged PRs reference it.
 ---
 
 ```markdown
-Title: Level-3: native dataflow graphs (CFG/DFG/PDG/SDG/CPG) and taint analysis for
+Title: Level-3: native dataflow graphs (CFG/DFG/PDG/SDG/CPG) for
 
 PROBLEM
 
 codeanalyzer- today emits the level-1 symbol table and resolver call graph<, plus
 level-2 framework enrichment via if applicable>. It has
-no dataflow: no CFG, no dependence graphs, no way to answer "what does this value
-affect" or "does user input reach this sink". This issue adds level 3 — native,
-whole-program dependence graphs built from 's own AST, per the skillset's
-dataflow-graphs.md contract — and exposes slicing and taint as queries over them.
+no dataflow: no CFG, no dependence graphs, nothing for a client to answer "what
+does this value affect" or "does user input reach this sink" against. This issue
+adds level 3 — native, whole-program dependence graphs built from 's own
+AST, per the skillset's dataflow-graphs.md contract — as the GRAPH SUBSTRATE
+those questions are answered over.
+
+Scope boundary: this analyzer is a pure graph provider. It emits the graph
+(CFG/PDG/SDG with SUMMARY edges, and the CPG projection) and stops. Slicing and
+taint are reachability QUERIES over the graph and live in the frontend SDK
+(cldk-sdk-frontend) — they are explicitly out of scope here, and this analyzer
+emits no `taint_flows` section.
 
 Native is the constraint: everything runs in-process in the analyzer's own
 ecosystem. No external analysis engines, no subprocess to a foreign toolchain.
@@ -35,13 +42,11 @@ GOALS (the contract, in one list)
    (`program_graphs`, schema_version'd, keyed by canonical (signature, node_id)),
    gated by `-a 3` / `--graphs`.
 2. Project the CPG (AST+CFG+PDG overlay) through the existing Neo4j emitter as
-   new node labels / edge types; additive schema.neo4j.json bump. The graph
-   surface is level-agnostic: --emit neo4j always projects the full SDG;
-   -a/--graphs gate the JSON path only, and combining them with --emit neo4j
-   is an explicit error.
-3. Expose backward slicing and taint as SDG queries; sources/sinks/sanitizers/
-   library models supplied as data (JSON spec + JSON Schema validation), emitted
-   as a `taint_flows` section.
+   new node labels / edge types; additive schema.neo4j.json bump.
+3. Emit transitive `SUMMARY` edges (HRB) into the SDG — the context-sensitivity
+   substrate. Backward slicing and taint are NOT built here: they are frontend
+   SDK queries over this graph (cldk-sdk-frontend). This analyzer emits no
+   `taint_flows` section and ingests no sources/sinks/sanitizers policy.
 4. Hold the cross-language parity clause: shared node kinds / edge types / JSON
    shapes; -specific additions are additive and recorded in
    SCHEMA_DECISIONS.md.
@@ -102,18 +107,18 @@ PART 2 — INTERPROCEDURAL (stages 5–7)
 
 PART 3 — EMISSION AND CLIENTS (stage 8)
 
- 10. `program_graphs` section in analysis.json per the contract; `--graphs`
-     selector with strict flag validation; co-evolve the shared SDK Pydantic
-     models (ProgramGraphs / GraphNode / GraphEdge / SDGEdge / TaintFlow) in the
-     same change.
+ 10. `program_graphs` section in analysis.json per the contract, including
+     `SUMMARY` edges; `--graphs` selector with strict flag validation; co-evolve
+     the shared SDK Pydantic models (ProgramGraphs / GraphNode / GraphEdge /
+     SDGEdge) in the same change. (TaintFlow models live with the client, in
+     cldk-sdk-frontend, not here.)
  11. CPG projection: CFGNode label + CFG_NEXT/CDG/DDG/PARAM_IN/PARAM_OUT/SUMMARY/
      HAS_CFG_NODE in the neo4j/ subpackage; schema.neo4j.json bump; conformance
      test extended.
- 12. Backward slicing (two-phase context-sensitive traversal) and taint
-     (labeled reachability, sanitizer blocking, lazy witness reconstruction)
-     as SDG queries; sources/sinks/sanitizers configurable as data (built-in
-     pack < config file < inline flags); `taint_flows` output with model ids
-     for explainability.
+
+ Backward slicing and taint are OUT OF SCOPE for this issue — they are frontend
+ SDK queries over the graph emitted above (cldk-sdk-frontend), not analyzer
+ features. File them against the SDK, not here.
 
 CAVEATS AND KNOWN RISKS
 
@@ -150,25 +155,30 @@ STAGED PRs
         cfg/pdg, the slice gate green on the fixture; then per-callable parallel
         fan-out (-j), differential-tested against --jobs 1.
   PR D  Summaries: hammock regions, SCC fixpoint with k-limiting; SDG assembly;
-        sdg_edges emission; MVP taint over the call graph; then the ready-queue
-        wavefront over the SCC DAG, differential-tested against --jobs 1.
-  PR E  Models-as-data: JSON spec + Schema, default pack, precedence; taint_flows
-        output + lazy witness paths; SDK models co-evolved.
-  PR F  Points-to-backed (alias-aware) propagation via ; replace the
+        sdg_edges emission INCLUDING transitive SUMMARY edges; then the
+        ready-queue wavefront over the SCC DAG, differential-tested against
+        --jobs 1. (No taint here — SUMMARY edges are the substrate a frontend
+        taint query consumes.)
+  PR E  Points-to-backed (alias-aware) propagation via ; replace the
         type-based MVP stub.
-  PR G  (optional) CPG Neo4j projection + conformance test + schema bump — skip
+  PR F  (optional) CPG Neo4j projection + conformance test + schema bump — skip
         if the Neo4j surface is not in scope; the SDG is the core artifact and no
         client analysis depends on the CPG.
-  PR H  (later) Incremental re-analysis over the recorded dependency edges.
+  PR G  (later) Incremental re-analysis over the recorded dependency edges.
+
+  (Slicing + taint as SDG queries, sources/sinks/sanitizers model packs, and the
+  `taint_flows` output are a SEPARATE ladder in cldk-sdk-frontend — not PRs on
+  this analyzer.)
 
 VERIFICATION / DEFINITION OF DONE
 
-  - Every gate in dataflow-construction.md passes on the fixture (CFG,
-    dominance, DFG, PDG-slice, summary, SDG, client gates) — exact expected
-    sets, not "non-empty".
+  - Every analyzer gate in dataflow-construction.md passes on the fixture (CFG,
+    dominance, DFG, PDG-slice, summary, SDG) — exact expected sets, not
+    "non-empty". (The slice/taint gates are frontend gates — verified in
+    cldk-sdk-frontend, not here.)
   - Fixture covers the full stage-1 lowering checklist for plus the
-    shared fixture minimums (aliasing, SCC recursion, multi-file flow,
-    sanitized + unsanitized taint pair).
+    shared fixture minimums (aliasing, SCC recursion, multi-file flow, and a
+    source→sink data-flow pair so the FRONTEND can later assert taint over it).
   - analysis.json with -a 3 validates against the shared SDK ProgramGraphs
     models; parity clause holds (no renamed/repurposed shared vocabulary).
   - Cypher snapshot with graphs loads clean into empty Neo4j; CFGNode count