29 changes: 29 additions & 0 deletions skills/cldk-sdk-frontend/SKILL.md
Expand Up @@ -41,6 +41,35 @@ of the facade's *query surface* (the SDK-side mirror of the backend's schema des
that one approved surface into each target SDK. One facade vocabulary feeds every SDK encoding, so
the SDKs stay in lockstep.

## Client analyses (slicing, taint) are the SDK's job, not the analyzer's

The `codeanalyzer-<lang>` backend is a **pure graph provider**: at `-a 3` it emits the dependence
graph substrate — `program_graphs` (CFG/PDG/SDG) with transitive `SUMMARY` edges — and nothing
more. It deliberately does **not** emit a `taint_flows` section or run a slice
(`codeanalyzer-backend`'s `dataflow-graphs.md § provider/client boundary`). **Slicing, taint, and
reachability queries live here, in the SDK**, as part of the facade's query surface:

- **Backward/forward slice** and **taint** are reachability walks over the emitted graph —
`CDG ∪ DDG ∪ PARAM_* ∪ SUMMARY` — computed in-SDK. The `SUMMARY` edges the analyzer ships are
what make these **context-sensitive** (the two-phase HRB up-then-down traversal) without the SDK
re-descending into callees.
- **Sources/sinks/sanitizers/library models are data, not code** — a JSON spec validated against a
JSON Schema, precedence *built-in pack < config file < caller-supplied* — and they live with the
SDK because they're a *policy* that changes far faster than the graph. This is why they aren't in
the analyzer: a policy edit re-runs a cheap in-SDK traversal instead of forcing a graph re-emit.
- The **`TaintFlow` / slice-result models** (`{ source, sink, rule, sanitized, path }`, paths as
`(signature, node_id)` lists with model ids) are SDK models — the shared graph models
(`ProgramGraphs` / `GraphNode` / `GraphEdge` / `SDGEdge`) come from the backend contract; the
client-result models are added here.
- Surface these as facade methods in the query-surface design loop (e.g. `get_backward_slice(...)`,
`get_taint_flows(spec=...)`), and gate them with the **Slice** and **Taint** frontend gates from
`sdk-testing.md` (exact expected node set for a slice; one source→sink flow found and the same
flow reported `sanitized` with a sanitizer interposed) over the analyzer's fixture graph.

Over-approximations inherited from the graph (e.g. ENTRY-anchored `PARAM_IN` collapsing argument
arity, `SUMMARY` edges missing until the analyzer ships them, heap flows present only under the
analyzer's heap-dependence mode) must be **surfaced in the SDK's results**, not silently absorbed.
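
The precedence rule for sources/sinks/sanitizers (*built-in pack < config file < caller-supplied*) can be sketched as a layered merge. This is a minimal illustration, not the SDK's actual loader: the function name `merge_taint_specs`, the three category keys, and the idea of keying each entry by a model id are all assumptions made for the example, and JSON Schema validation is left out.

```python
def merge_taint_specs(builtin_pack, config_file=None, caller_supplied=None):
    """Merge spec layers; a later layer overrides an earlier one per model id.

    Hypothetical shape: each layer maps a category name to {model_id: entry}.
    """
    merged = {"sources": {}, "sinks": {}, "sanitizers": {}}
    # Lowest precedence first, so dict.update lets higher layers win.
    for layer in (builtin_pack, config_file or {}, caller_supplied or {}):
        for category in merged:
            merged[category].update(layer.get(category, {}))
    return merged


# Illustrative layers (all ids and API names are made up).
BUILTIN = {
    "sources": {"http.param": {"api": "request.args"}},
    "sinks": {"sql.exec": {"api": "db.execute"}},
}
CONFIG = {"sinks": {"sql.exec": {"api": "db.run"}}}           # overrides built-in
CALLER = {"sanitizers": {"sql.escape": {"api": "escape"}}}    # adds a new model

SPEC = merge_taint_specs(BUILTIN, CONFIG, CALLER)
```

Because the merge touches only the spec and never the graph, changing any layer means re-running the in-SDK traversal, not re-emitting `program_graphs`.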

## Precondition & inputs (what the backend skill hands you)

Do not start until a **working, schema-conformant `codeanalyzer-<lang>`** exists. You need, from the
20 changes: 20 additions & 0 deletions skills/cldk-sdk-frontend/references/sdk-testing.md
Expand Up @@ -120,6 +120,26 @@ def _analysis(tmp_path, level=AnalysisLevel.symbol_table):

---

## 3b. Client-analysis gates (slicing & taint — SDK-side, only when the language has level 3)

Slicing and taint run in the SDK over the analyzer's `program_graphs`, not in the analyzer (see
`SKILL.md § Client analyses`). When the wired language exposes level-3 graphs, the query surface
gets these gates, over the analyzer's own dataflow fixture:

- **Slice gate:** a backward slice of a named `(signature, node_id)` criterion equals the
hand-computed expected node set (**exact**, not "non-empty"). This catches missing control
dependences and missing def-use edges in the consumed graph, as well as a broken traversal in the SDK.
- **Taint gate:** with a small sources/sinks/sanitizers spec, one known source→sink flow is found;
the **same** flow with a sanitizer interposed is reported `sanitized` (not dropped). Assert the
witness `path` is a contiguous `(signature, node_id)` chain and carries the matching model id.
- **Context-sensitivity check (if `SUMMARY` edges are present):** a flow that enters a callee from
call site A does **not** exit at an unrelated call site B (no unrealizable path). If the analyzer
hasn't shipped `SUMMARY` edges yet, record this as a known over-approximation in the result
rather than asserting it away.

These are the frontend counterparts of the backend's CFG/PDG/SDG gates — the backend proves the
graph is correct; these prove the SDK's queries over it are correct.
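
As a rough illustration of the Taint gate, the sketch below runs a plain forward reachability walk over a three-node fixture and checks both halves of the gate. Everything named here is hypothetical: `find_taint_flows`, the spec keys, and the fixture are stand-ins, not the SDK's facade. The walk is deliberately simplified: it ignores context sensitivity and keeps a single breadth-first witness per source/sink pair, so it decides `sanitized` from that one witness only.

```python
from collections import deque


def find_taint_flows(edges, spec):
    """Forward reachability from each source; one BFS witness per sink."""
    succs = {}
    for src, dst, _kind in edges:
        succs.setdefault(src, []).append(dst)
    sanitizers = set(spec["sanitizers"])
    flows = []
    for source in spec["sources"]:
        parent = {source: None}          # BFS tree, used to rebuild the path
        queue = deque([source])
        while queue:
            node = queue.popleft()
            for nxt in succs.get(node, []):
                if nxt not in parent:
                    parent[nxt] = node
                    queue.append(nxt)
        for sink in spec["sinks"]:
            if sink not in parent:
                continue
            path, cur = [], sink
            while cur is not None:       # walk parents back to the source
                path.append(cur)
                cur = parent[cur]
            path.reverse()
            flows.append({
                "source": source, "sink": sink, "rule": spec["rule"],
                # A sanitizer on the path flags the flow; it is not dropped.
                "sanitized": any(n in sanitizers for n in path),
                "path": path,
            })
    return flows


# Hypothetical fixture: source -> intermediate -> sink inside one callable.
EDGES = [(("h", 0), ("h", 1), "DDG"), (("h", 1), ("h", 2), "DDG")]
BASE = {"rule": "demo.sqli", "sources": [("h", 0)], "sinks": [("h", 2)], "sanitizers": []}


def test_taint_gate():
    (flow,) = find_taint_flows(EDGES, BASE)
    assert flow["sanitized"] is False
    # Witness path must be a contiguous chain of real graph edges.
    pairs = {(s, d) for s, d, _ in EDGES}
    assert all(step in pairs for step in zip(flow["path"], flow["path"][1:]))
    # Same flow, sanitizer interposed: still reported, now flagged.
    (clean,) = find_taint_flows(EDGES, {**BASE, "sanitizers": [("h", 1)]})
    assert clean["sanitized"] is True and clean["path"] == flow["path"]
```

A real gate would run against the analyzer's own dataflow fixture and also assert the model id on the witness, which this sketch reduces to the `rule` field.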

## 4. Definition of done (SDK surface)

- [ ] Mocked SDK tests pass under `pytest` (backend patched).
17 changes: 10 additions & 7 deletions skills/codeanalyzer-backend/SKILL.md
Expand Up @@ -316,11 +316,8 @@ schema` for the static `schema.neo4j.json` contract. Build it as a modular `neo4
**optional/lazy** dependency, and hold the graph schema in lockstep with the JSON schema (same
`SCHEMA_DECISIONS.md` node kinds → node labels; identity-only call edges → `CALLS`). The SDK's
Neo4j backend (frontend skill) reconstructs the canonical model from this graph, so the node
families and `--app-name` anchor must match. **The graph is always full-depth:** analysis levels
gate the JSON path only — `--emit neo4j` runs at maximum implemented depth (once level 3 exists,
the complete SDG/CPG, unconditionally), and combining `-a`/`--graphs` with it is an explicit
error (`neo4j-projection.md § Depth rule`). Leave the projection out only if the user explicitly
scopes to JSON-only; otherwise it's a standard deliverable of the CLI/packaging stage.
families and `--app-name` anchor must match. Leave it out only if the user explicitly scopes to
JSON-only; otherwise it's a standard deliverable of the CLI/packaging stage.

### (Optional) Level 2: framework-based analysis
Gated on the depth choice from *Orient & choose the backend tooling*. The heavy tier — a dedicated analysis engine
Expand All @@ -340,8 +337,14 @@ in the README's *Architecture & Tooling*), and build stage by stage per
`references/dataflow-construction.md` against the contract in `references/dataflow-graphs.md`.
The rules that bind: everything is **native and in-process**; graphs are keyed by
`(signature, node_id)` on the same `signatureOf()`; each stage's gate passes before the next
stage starts; `-a 1`/`-a 2` stay untouched; the **SDG is the core artifact** (clients query it),
and the CPG is only its Neo4j projection — skip the CPG if the Neo4j surface isn't in scope.
stage starts; `-a 1`/`-a 2` stay untouched; the **SDG (with its transitive `SUMMARY` edges) is
the core artifact**, and the CPG is only its Neo4j projection — skip the CPG if the Neo4j surface
isn't in scope. **Provider/client boundary:** this skill builds the *graph* only. Slicing and
taint are reachability *queries* over that graph and belong to the **frontend SDK**
(`cldk-sdk-frontend`) — do not build a slicer/taint engine here and do not emit a `taint_flows`
section. `SUMMARY` edges are the exception that proves the rule: they're policy-agnostic data-
dependence substrate (not tied to any sources/sinks config), so they stay in the analyzer and are
exactly what make the frontend's queries context-sensitive.

### Write the analyzer README (last build step)
The analyzer's `codeanalyzer-<lang>/README.md` already holds the **Architecture & Tooling**
19 changes: 12 additions & 7 deletions skills/codeanalyzer-backend/references/dataflow-construction.md
Expand Up @@ -152,18 +152,23 @@ Stitch the PDGs with the interprocedural edges (Horwitz–Reps–Binkley):
`(signature, node_id)` endpoints, arity match, at least one SUMMARY edge for a known transitive
flow, and the whole `program_graphs` section validates against the SDK models.

## Stage 8 — Clients and the CPG
## Stage 8 — The CPG (the analyzer's last stage)

- **Slicing and taint** as SDG queries (`dataflow-graphs.md § Client analyses`) — the two-phase
HRB traversal for context-sensitive slices; labeled reachability with sanitizer blocking for
taint; witness paths reconstructed lazily over reverse edges.
- **CPG:** project the new node/edge families through the existing `neo4j/` subpackage —
new labels in the schema catalog, same `RowBuilder`/writer machinery, additive
`schema.neo4j.json` version bump. The deferred-edge gate already enforces no-dangling.

**Client gate:** the slice and taint assertions from `dataflow-graphs.md § Verification gates`,
plus: the Cypher snapshot with graphs enabled loads clean into an empty Neo4j and a
`MATCH (:CFGNode)` count equals the JSON node count.
**Slicing and taint are NOT an analyzer stage.** They are reachability queries over the emitted
SDG and live in the **frontend SDK** (`cldk-sdk-frontend`), per the provider/client boundary in
`dataflow-graphs.md`. The analyzer's dataflow work ends when the SDG (with its `SUMMARY` edges)
and the CPG are emitted; do not build a slicer or a taint engine here, and do not emit a
`taint_flows` section. The two-phase HRB slice traversal, labeled taint reachability with
sanitizer blocking, lazy witness reconstruction, and the sources/sinks/sanitizers model packs are
all the SDK's job.

**CPG gate:** the Cypher snapshot with graphs enabled loads clean into an empty Neo4j and a
`MATCH (:CFGNode)` count equals the JSON node count. (The slice/taint gates are frontend gates —
`cldk-sdk-frontend`.)

---

56 changes: 39 additions & 17 deletions skills/codeanalyzer-backend/references/dataflow-graphs.md
Expand Up @@ -19,15 +19,22 @@ runs in-process in the analyzer's own language (Jelly, WALA, `go/ssa`) counts as
| --- | --- | --- | --- |
| 1 | Symbol table + resolver call graph | Cheap | `-a 1` / `-a 2` (default 1) |
| 2 | Framework-based call-graph enrichment (Joern/WALA) | Heavy, external | own toggle, off by default |
| **3** | **Native CFG/DFG/PDG/SDG + client queries (slicing, taint)** | Heavy, in-process | `-a 3` (+ `--graphs` selector) |
| **3** | **Native CFG/DFG/PDG/SDG — the graph substrate** | Heavy, in-process | `-a 3` (+ `--graphs` selector) |

`-a 3` implies `-a 2`'s resolver call graph (the SDG needs it). The framework toggle stays
orthogonal — its edges merge into the call graph with provenance, exactly as at level 2. The
cheap path stays cheap: **nothing at level 3 may run unless requested.**

The levels gate the **JSON path only**. When the output target is the graph (`--emit neo4j`),
levels don't apply: the analyzer runs at maximum implemented depth and projects the **full SDG**
unconditionally (`neo4j-projection.md § Depth rule`).
**Provider/client boundary (non-negotiable).** The analyzer is a **pure graph provider**: level 3
emits the universal dependence graph — CFG, PDG, SDG, and the transitive `SUMMARY` edges — and
*stops there*. **Client analyses (taint, slicing, reachability) are NOT analyzer concerns** — they
are reachability queries run in the **frontend SDK** over the emitted graph (`cldk-sdk-frontend`).
The analyzer never emits a `taint_flows` section, never ingests a sources/sinks/sanitizers policy,
and never runs a slice. Rationale: a taint result is keyed on a *policy* (which APIs are sources/
sinks) that evolves at SDK speed; baking it into the graph would couple the universal artifact to
one policy and force a re-emit on every model-pack edit. What *does* stay analyzer-side is
**policy-agnostic substrate** — `SUMMARY` edges are keyed on data dependence, not on any taint
config, so they belong in the graph and are what make the frontend's queries context-sensitive.

## The graph ladder (definitions and edge vocabulary)

Expand All @@ -47,7 +54,7 @@ whole-program.
4. **SDG (system dependence graph)** — the whole-program graph: all PDGs stitched together at
call sites via `CALL`, `PARAM_IN`, `PARAM_OUT`, and transitive `SUMMARY` edges
(Horwitz–Reps–Binkley). Global/module state is modeled as extra parameters. This is the graph
client analyses (slicing, taint) query.
the **frontend SDK's** client analyses (slicing, taint) query — the analyzer only produces it.

## Node identity (the invariant that makes everything joinable)

Expand Down Expand Up @@ -91,8 +98,9 @@ facade invariant that `analysis.json` is the single facade-visible output:
"target": { "signature": "...", "node": 0 },
"type": "PARAM_IN", "var": "arg0" }
]
},
"taint_flows": [ ... ] // optional client-analysis output, see below
}
// NO `taint_flows` — client analyses (taint, slicing) are a FRONTEND SDK concern, not an
// analyzer output. The analyzer emits only the graph above. See § Client analyses below.
}
```

Expand All @@ -101,26 +109,34 @@ facade invariant that `analysis.json` is the single facade-visible output:
without `pdg` emits a PDG with only `DDG` edges.
- Unrecognized `--graphs` values follow the **flag-validation rule** (`cli-contract.md`): explicit
non-zero error, never silent fallback.
- The CPG is **Neo4j-only**, and the graph surface is **level-agnostic**: `--emit neo4j` always
runs at maximum implemented depth and projects the full SDG — `-a` and `--graphs` gate only
the JSON path, and combining them with `--emit neo4j` is an explicit error
(`neo4j-projection.md § Depth rule`). `--emit schema` includes the CFG/PDG/SDG labels in
`schema.neo4j.json`.
- The CPG is **Neo4j-only**: `--emit neo4j` at `-a 3` adds the CFG/PDG/SDG labels and edge types
to the projection (see § CPG below). `--emit schema` includes them in `schema.neo4j.json`.
- `program_graphs.schema_version` is versioned independently of the top-level schema and bumps
additively, like `schema.neo4j.json`.

## Client analyses are queries, not engines
## Client analyses live in the frontend, not the analyzer

Slicing and taint are **reachability queries over the SDG**, not separate analyses:
Slicing and taint are **reachability queries over the SDG**, not separate analyses — and they run
in the **frontend SDK** (`cldk-sdk-frontend`), never in the analyzer. The analyzer's job ends at
emitting the graph substrate; the SDK loads it and answers queries against it:

- **Backward slice** of `(signature, node)`: reverse reachability over `CDG ∪ DDG ∪ PARAM_* ∪
SUMMARY` (context-sensitive via the two-phase HRB traversal — up then down).
- **Taint**: seed at *sources*, propagate labeled reachability along dependence edges, block at
*sanitizers* on the path, report when a source label reaches a matching *sink*. Sources, sinks,
sanitizers, and library models are **data, not code** — a JSON spec validated against a JSON
Schema, with precedence *built-in pack < config file < inline flags*. Output is the
`taint_flows` section: `{ source, sink, rule, sanitized, path }`, each path a list of
`(signature, node_id)` pairs, with the matching model id for explainability.
Schema, with precedence *built-in pack < config file < caller-supplied*. The result is a
`taint_flows` structure — `{ source, sink, rule, sanitized, path }`, each path a list of
`(signature, node_id)` pairs with the matching model id for explainability — **produced and
owned by the SDK**, not written into the analyzer's `analysis.json`.

**Why this split.** The `SUMMARY` edges the analyzer emits are keyed on data dependence and are
reusable across *every* taint policy, so they are graph substrate and belong in the analyzer.
Sources/sinks/sanitizers are a *policy* that changes far faster than the graph; keeping the query
(and its `taint_flows` output) in the SDK means a policy edit re-runs a cheap traversal instead of
re-emitting the whole universal graph. This is exactly Joern's factoring: the CPG stores the
dependence substrate; `reachableBy` is evaluated at query time — Joern does **not** materialize
all-pairs taint edges, and neither should a `codeanalyzer-<lang>`.
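
The two-phase backward slice described above can be sketched in a few lines of SDK-side Python. This is an illustrative sketch under stated assumptions, not the SDK's implementation: the names `backward_slice` and `_reverse_reach` are hypothetical, edges are plain `(source, target, type)` triples instead of the `GraphEdge` models, and the fixture is a made-up identity function `id` called from two sites in `main`. Phase 1 walks backward without descending into callees (it skips `PARAM_OUT` and relies on `SUMMARY` to cross calls); phase 2 continues from everything phase 1 reached, without ascending to callers (it skips `PARAM_IN` and `CALL`).

```python
from collections import defaultdict


def _reverse_reach(preds, seeds, skip):
    """Backward reachability from seeds, ignoring the edge types in skip."""
    seen = set(seeds)
    stack = list(seeds)
    while stack:
        node = stack.pop()
        for src, kind in preds[node]:
            if kind in skip or src in seen:
                continue
            seen.add(src)
            stack.append(src)
    return seen


def backward_slice(edges, criterion):
    preds = defaultdict(list)
    for src, dst, kind in edges:
        preds[dst].append((src, kind))
    # Phase 1 ("up"): stay at this level or ascend; SUMMARY bypasses callees.
    phase1 = _reverse_reach(preds, {criterion}, skip={"PARAM_OUT"})
    # Phase 2 ("down"): descend into callees, never re-ascend to other callers.
    return _reverse_reach(preds, phase1, skip={"PARAM_IN", "CALL"})


# Hypothetical fixture: id(x) is called at site A (main 2 -> 3) and site B (main 5 -> 6).
EDGES = [
    (("main", 1), ("main", 2), "DDG"),
    (("main", 4), ("main", 5), "DDG"),
    (("main", 2), ("id", 0), "PARAM_IN"),
    (("main", 5), ("id", 0), "PARAM_IN"),
    (("id", 0), ("id", 1), "DDG"),
    (("id", 1), ("main", 3), "PARAM_OUT"),
    (("id", 1), ("main", 6), "PARAM_OUT"),
    (("main", 2), ("main", 3), "SUMMARY"),
    (("main", 5), ("main", 6), "SUMMARY"),
]

SLICE_A = backward_slice(EDGES, ("main", 3))
```

On this fixture the slice of site A's result contains `main` nodes 1 to 3 and both `id` nodes, but none of site B's nodes. A single-pass reverse walk over all edge types would also pull in `("main", 4)` and `("main", 5)` through the shared formal parameter, which is the unrealizable path the `SUMMARY` edges exist to avoid.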

## Cross-language parity clause

Expand Down Expand Up @@ -152,6 +168,12 @@ Each rung has a gate; do not build the next rung until the current one passes:
| Dominance | Post-dominator tree well-formed (unique root = EXIT; infinite loops handled via synthetic edge) |
| PDG | CDG edges match hand-computed control dependence on the fixture; every DDG edge connects a real def to a real use of the same access path |
| SDG | No dangling `(signature, node_id)` endpoints; PARAM_IN/OUT arity matches the callable's parameters; SUMMARY edges exist for at least one transitive flow in the fixture |

The **Slice** and **Taint** gates are **frontend gates** — they exercise the SDK's queries over
the emitted graph, not the analyzer, and live in `cldk-sdk-frontend`'s testing reference:

| Frontend gate | Core assertion |
| --- | --- |
| Slice | Backward slice of a named fixture variable equals the hand-computed expected node set — **exact**, not "non-empty" |
| Taint | One known source→sink flow found; the same flow with a sanitizer on the path is reported `sanitized` |
