⚠️ Scope amendment (2026-07-02): analyzer is a pure graph provider — slicing & taint move to the SDK
This analyzer emits the graph substrate only; client analyses are frontend SDK queries. Established across the family in cldk-forge PR #7 (dataflow-graphs.md § provider/client boundary); reference instantiation codellm-devkit/codeanalyzer-java#171.
Superseded here → filed as codellm-devkit/python-sdk#229 (the shared client-analysis engine): GOAL 2 (slicing + taint as SDG queries, taint_flows output), PART 3 item 11 (backward slicing + taint), PR D's "MVP taint over the call graph", PR E (models-as-data + taint_flows), and the "client gates" / "sanitized + unsanitized taint pair" in the DoD (now frontend gates). TaintFlow drops from the SDK models co-evolved here.
Stays in this analyzer: the full graph substrate — GOAL 1 (program_graphs: CFG/PDG/SDG), the transitive SUMMARY edges (PART 2 item 9 SDG assembly; PR D "SDG assembly … SUMMARY edges"), and the CPG projection (PR G). The DoD fixture keeps a source→sink data-flow pair so the SDK can assert taint over it. Retitled: dropped "and taint analysis".
Instantiates the skillset's level-3 dataflow plan (cldk-forge: skills/codeanalyzer-backend/references/dataflow-{graphs,construction,substrate-menu,issue-template}.md). Sibling instantiation: codellm-devkit/codeanalyzer-typescript#2.
PROBLEM
codeanalyzer-go's level-1 analysis (symbol table + go/types resolver call graph, on feat/initial-implementation) has no dataflow: no CFG, no dependence graphs, no way to answer "what does this value affect" or "does user input reach this sink". This issue adds level 3 — native, whole-program dependence graphs built from Go's own AST — and exposes slicing and taint as queries over them.
Native is the constraint: everything runs in-process in the analyzer's own ecosystem (pure Go; the binary stays self-contained). Go is uniquely suited to a hand-built implementation because the stdlib is an answer key: each hand-built rung is differentially tested against the canonical library implementation (golang.org/x/tools/go/cfg, go/ssa), which no other language in the fleet offers.
Prerequisite: land feat/initial-implementation (level 1) first.
GOALS
- Emit CFG, PDG (CDG+DDG), and SDG as first-class sections of analysis.json (
program_graphs, schema-versioned, keyed by canonical (signature, node_id)), gated by -a 3 / --graphs.
- Expose backward slicing and taint as SDG queries; sources/sinks/sanitizers/library models supplied as data (JSON spec + JSON Schema validation), emitted as a
taint_flows section.
- Hold the cross-language parity clause: shared node kinds / edge types / JSON shapes; Go-specific additions (e.g.
defer_resume edge kind) are additive and recorded in SCHEMA_DECISIONS.md.
- Keep
-a 1/-a 2 timings unaffected; content-hash and cache summaries with recorded dependency edges (incrementality later; hooks now).
- Deterministic parallelism (
-j/--jobs, goroutines + errgroup): intraprocedural stages fan out per callable; the whole-program solve runs concurrently with them; summary composition is a ready-queue wavefront over the SCC condensation DAG. --jobs N output is byte-identical to --jobs 1.
- CPG (Neo4j projection) is optional and deferred — this analyzer does not yet ship
--emit neo4j; see PR G.
SUBSTRATE DECISIONS (locked — from dataflow-substrate-menu.md, Go row)
- CFG source: hand-built from
go/ast, statement-level, with exceptional
edges. golang.org/x/tools/go/cfg is the differential-test
oracle, NOT the production source — it is deliberately
minimal (no edge kinds, no defer/panic modeling), which is
exactly what we must add.
- Def-use source: hand-built reaching definitions over our CFG with k-limited
access paths; go/ssa is the differential oracle for
def-use on locals.
- Points-to oracle: VTA (
x/tools/go/callgraph/vta) as the dispatch oracle,
merged into call_graph with provenance "vta". May-alias MVP
is type-based (unusually strong in Go: static types,
explicit pointers). PR F upgrades to a native
Andersen-style analysis over go/ssa.
x/tools/go/pointer is deprecated — not adopted.
- Identity mapping: token.Pos / go/ssa value identities are mapped onto our
canonical signature + node_id keys — the same keys as
symbol_table / call_graph. On the critical path for every
later part.
- Precision posture: sound-leaning, over-approximate; flow-sensitive on locals
and value flow, heap flow-insensitive; k-limited access
paths (--graph-field-depth, default 3).
PART 1 — INTRAPROCEDURAL GRAPHS (construction stages 1–4)
- Exceptional CFG per callable: statement-level, synthetic ENTRY/EXIT,
multi-exit normalized; explicit Go lowering rules:
- defer: deferred calls run at every exit — splice them on each
exit-bound path (LIFO), with a defer_resume edge kind;
- panic/recover:
exception edges to the nearest recovering defer,
else to EXIT;
- go statement: the spawned call is a CALL-site fact, not a CFG
successor;
- select: multi-way branch (one arm per case + default);
- range loops, labeled break/continue, type switches,
for {}
(synthetic edge to EXIT for post-dominance).
- Dominators + post-dominators (CHK iterative); control dependence via the
post-dominance frontier walk.
- Access-path model (locals, params, receivers, package-level vars,
closure captures; k-limited); reaching definitions -> DDG edges.
- PDG assembly; the intraprocedural backward-slice gate on the fixture
(hand-computed expected node set, exact match).
Differential tests throughout: block structure vs x/tools/go/cfg on
non-panicking paths; local def-use vs go/ssa.
PART 2 — INTERPROCEDURAL (stages 5–7)
- go/ssa whole-program build behind the -a 3 flag (concurrent with Part 1
stages); VTA call-graph merge with provenance "vta"; identity mapping.
- Record package-level state (exported vars, singletons) read/write sites
as summary inputs/outputs.
- SCC condensation (Tarjan); hammock-region decomposition; bottom-up
relational region summaries indexed by exit; function-summary composition
scheduled as a ready-queue wavefront over the condensation DAG (one SCC
per worker; internal fixpoint sequential); monotone fixpoint within SCCs;
k-limiting mandatory.
- Channels: conservative value flow send -> receive through the channel
object (over its points-to set); no happens-before analysis — documented.
- External/stdlib behavior as model summaries; unmodeled externals default
to conservative pass-through. SDG assembly: actual/formal in/out, CALL /
PARAM_IN / PARAM_OUT / SUMMARY edges; package-level state as extra
formals.
PART 3 — EMISSION AND CLIENTS (stage 8)
program_graphs section per the contract; --graphs with strict flag
validation; co-evolve the shared SDK Pydantic models (ProgramGraphs /
GraphNode / GraphEdge / SDGEdge / TaintFlow) in the same change.
- Backward slicing (two-phase context-sensitive traversal) and taint
(labeled reachability, sanitizer blocking, lazy witness reconstruction)
as SDG queries; models configurable as data (built-in Go pack: net/http
request data and os.Args/env as sources; os/exec, database/sql,
html/template bypasses, os file writes as sinks; sanitizer examples);
taint_flows output with model ids.
CAVEATS AND KNOWN RISKS
- VTA gives dispatch precision, not full may-alias; the type-based MVP
over-approximates until PR F. x/tools/go/pointer is deprecated and not an
option.
- Concurrency is not modeled: no happens-before; channel flow is a
conservative object-flow approximation; data races are out of scope.
- Inherited unsoundness: reflection, unsafe, cgo, plugin loading —
documented in the README, not silently absorbed.
- Termination: k-limiting mandatory; label sets are bounded bitsets.
- Precision: intentionally over-approximate; do not trade soundness for a
lower false-positive rate — ranking/pruning is downstream's job.
- Parallel determinism: never assign ids or emit during parallel execution —
collect, then sort by (signature, node_id). Implement sequentially first,
pass every gate at --jobs 1, then parallelize with the sequential output
as the differential oracle. Memory (N workers x ASTs/CFGs) is the real
ceiling — release per-callable structures once the PDG is emitted.
- Cost: whole-program go/ssa build + summaries is heavy; -a 1/2 must be
unaffected (CI-checked).
- Incrementality: aspirational; dependency edges and content-hashes recorded
from the start.
STAGED PRs
PR A Prep: land/rebase on feat/initial-implementation; replace the
internal/semantic_analysis/codeql/ stub with the engine-generic
framework seam (CodeQL is dropped org-wide for licensing).
PR B Substrate: go/ssa build + VTA behind -a 3; identity mapping
(token.Pos -> signature+node_id); call-graph merge with provenance
"vta"; CI proves the in-process solve runs behind the flag.
PR C Intraprocedural: hand-built CFG + dominance + PDG; program_graphs
cfg/pdg emission; slice gate green; differential tests vs go/cfg and
go/ssa; then per-callable parallel fan-out (-j), diffed vs --jobs 1.
PR D Summaries: hammock regions, SCC fixpoint with k-limiting; SDG
assembly; sdg_edges emission; MVP taint over the call graph; then the
ready-queue wavefront, diffed vs --jobs 1.
PR E Models-as-data: JSON spec + Schema, default Go model pack; taint_flows
+ lazy witnesses; SDK models co-evolved.
PR F Alias-aware propagation: native Andersen-style points-to over go/ssa
replacing the type-based stub.
PR G (optional) CPG Neo4j projection — requires first standing up the
--emit neo4j surface for this analyzer (neo4j-projection.md); skip if
that surface stays out of scope. The SDG is the core artifact; no
client analysis depends on the CPG.
PR H (later) Incremental re-analysis over the recorded dependency edges.
VERIFICATION / DEFINITION OF DONE
- Every gate in dataflow-construction.md passes on the fixture (CFG,
dominance, DFG, PDG-slice, summary, SDG, client gates) — exact expected
sets, not "non-empty".
- Differential gates: block structure matches x/tools/go/cfg on
non-panicking paths; local def-use matches go/ssa.
- Fixture covers the Go lowering checklist (defer, panic/recover, goroutine
spawn, select, range, labeled break/continue, type switch, closures) plus
the shared minimums (aliasing, mutual recursion/SCC, package-level state,
multi-package flow, sanitized + unsanitized taint pair).
- analysis.json with -a 3 validates against the shared SDK ProgramGraphs
models; parity clause holds.
- --jobs N output byte-identical to --jobs 1 (determinism gate), for both
the per-callable fan-out and the SCC wavefront.
- -a 1 / -a 2 wall-clock unchanged within noise on the benchmark fixture.
Instantiates the skillset's level-3 dataflow plan (cldk-forge:
skills/codeanalyzer-backend/references/dataflow-{graphs,construction,substrate-menu,issue-template}.md). Sibling instantiation: codellm-devkit/codeanalyzer-typescript#2.PROBLEM
codeanalyzer-go's level-1 analysis (symbol table + go/types resolver call graph, on
feat/initial-implementation) has no dataflow: no CFG, no dependence graphs, no way to answer "what does this value affect" or "does user input reach this sink". This issue adds level 3 — native, whole-program dependence graphs built from Go's own AST — and exposes slicing and taint as queries over them.Native is the constraint: everything runs in-process in the analyzer's own ecosystem (pure Go; the binary stays self-contained). Go is uniquely suited to a hand-built implementation because the stdlib is an answer key: each hand-built rung is differentially tested against the canonical library implementation (
golang.org/x/tools/go/cfg,go/ssa), which no other language in the fleet offers.Prerequisite: land
feat/initial-implementation(level 1) first.GOALS
program_graphs, schema-versioned, keyed by canonical (signature, node_id)), gated by-a 3/--graphs.taint_flowssection.defer_resumeedge kind) are additive and recorded in SCHEMA_DECISIONS.md.-a 1/-a 2timings unaffected; content-hash and cache summaries with recorded dependency edges (incrementality later; hooks now).-j/--jobs, goroutines + errgroup): intraprocedural stages fan out per callable; the whole-program solve runs concurrently with them; summary composition is a ready-queue wavefront over the SCC condensation DAG.--jobs Noutput is byte-identical to--jobs 1.--emit neo4j; see PR G.SUBSTRATE DECISIONS (locked — from dataflow-substrate-menu.md, Go row)
go/ast, statement-level, with exceptionaledges.
golang.org/x/tools/go/cfgis the differential-testoracle, NOT the production source — it is deliberately
minimal (no edge kinds, no defer/panic modeling), which is
exactly what we must add.
access paths;
go/ssais the differential oracle fordef-use on locals.
x/tools/go/callgraph/vta) as the dispatch oracle,merged into call_graph with provenance "vta". May-alias MVP
is type-based (unusually strong in Go: static types,
explicit pointers). PR F upgrades to a native
Andersen-style analysis over go/ssa.
x/tools/go/pointeris deprecated — not adopted.canonical signature + node_id keys — the same keys as
symbol_table / call_graph. On the critical path for every
later part.
and value flow, heap flow-insensitive; k-limited access
paths (--graph-field-depth, default 3).
PART 1 — INTRAPROCEDURAL GRAPHS (construction stages 1–4)
multi-exit normalized; explicit Go lowering rules:
exit-bound path (LIFO), with a
defer_resumeedge kind;exceptionedges to the nearest recovering defer,else to EXIT;
successor;
for {}(synthetic edge to EXIT for post-dominance).
post-dominance frontier walk.
closure captures; k-limited); reaching definitions -> DDG edges.
(hand-computed expected node set, exact match).
Differential tests throughout: block structure vs x/tools/go/cfg on
non-panicking paths; local def-use vs go/ssa.
PART 2 — INTERPROCEDURAL (stages 5–7)
stages); VTA call-graph merge with provenance "vta"; identity mapping.
as summary inputs/outputs.
relational region summaries indexed by exit; function-summary composition
scheduled as a ready-queue wavefront over the condensation DAG (one SCC
per worker; internal fixpoint sequential); monotone fixpoint within SCCs;
k-limiting mandatory.
object (over its points-to set); no happens-before analysis — documented.
to conservative pass-through. SDG assembly: actual/formal in/out, CALL /
PARAM_IN / PARAM_OUT / SUMMARY edges; package-level state as extra
formals.
PART 3 — EMISSION AND CLIENTS (stage 8)
program_graphssection per the contract;--graphswith strict flagvalidation; co-evolve the shared SDK Pydantic models (ProgramGraphs /
GraphNode / GraphEdge / SDGEdge / TaintFlow) in the same change.
(labeled reachability, sanitizer blocking, lazy witness reconstruction)
as SDG queries; models configurable as data (built-in Go pack: net/http
request data and os.Args/env as sources; os/exec, database/sql,
html/template bypasses, os file writes as sinks; sanitizer examples);
taint_flowsoutput with model ids.CAVEATS AND KNOWN RISKS
over-approximates until PR F. x/tools/go/pointer is deprecated and not an
option.
conservative object-flow approximation; data races are out of scope.
documented in the README, not silently absorbed.
lower false-positive rate — ranking/pruning is downstream's job.
collect, then sort by (signature, node_id). Implement sequentially first,
pass every gate at --jobs 1, then parallelize with the sequential output
as the differential oracle. Memory (N workers x ASTs/CFGs) is the real
ceiling — release per-callable structures once the PDG is emitted.
unaffected (CI-checked).
from the start.
STAGED PRs
PR A Prep: land/rebase on feat/initial-implementation; replace the
internal/semantic_analysis/codeql/ stub with the engine-generic
framework seam (CodeQL is dropped org-wide for licensing).
PR B Substrate: go/ssa build + VTA behind -a 3; identity mapping
(token.Pos -> signature+node_id); call-graph merge with provenance
"vta"; CI proves the in-process solve runs behind the flag.
PR C Intraprocedural: hand-built CFG + dominance + PDG; program_graphs
cfg/pdg emission; slice gate green; differential tests vs go/cfg and
go/ssa; then per-callable parallel fan-out (-j), diffed vs --jobs 1.
PR D Summaries: hammock regions, SCC fixpoint with k-limiting; SDG
assembly; sdg_edges emission; MVP taint over the call graph; then the
ready-queue wavefront, diffed vs --jobs 1.
PR E Models-as-data: JSON spec + Schema, default Go model pack; taint_flows
+ lazy witnesses; SDK models co-evolved.
PR F Alias-aware propagation: native Andersen-style points-to over go/ssa
replacing the type-based stub.
PR G (optional) CPG Neo4j projection — requires first standing up the
--emit neo4j surface for this analyzer (neo4j-projection.md); skip if
that surface stays out of scope. The SDG is the core artifact; no
client analysis depends on the CPG.
PR H (later) Incremental re-analysis over the recorded dependency edges.
VERIFICATION / DEFINITION OF DONE
dominance, DFG, PDG-slice, summary, SDG, client gates) — exact expected
sets, not "non-empty".
non-panicking paths; local def-use matches go/ssa.
spawn, select, range, labeled break/continue, type switch, closures) plus
the shared minimums (aliasing, mutual recursion/SCC, package-level state,
multi-package flow, sanitized + unsanitized taint pair).
models; parity clause holds.
the per-callable fan-out and the SCC wavefront.