Skip to content

Level-3: native dataflow graphs (CFG/DFG/PDG/SDG) for Go #3

Description

@rahlk

⚠️ Scope amendment (2026-07-02): analyzer is a pure graph provider — slicing & taint move to the SDK

This analyzer emits the graph substrate only; client analyses are frontend SDK queries. Established across the family in cldk-forge PR #7 (dataflow-graphs.md § provider/client boundary); reference instantiation codellm-devkit/codeanalyzer-java#171.

Superseded here → filed as codellm-devkit/python-sdk#229 (the shared client-analysis engine): GOAL 2 (slicing + taint as SDG queries, taint_flows output), PART 3 item 11 (backward slicing + taint), PR D's "MVP taint over the call graph", PR E (models-as-data + taint_flows), and the "client gates" / "sanitized + unsanitized taint pair" in the DoD (now frontend gates). TaintFlow drops from the SDK models co-evolved here.

Stays in this analyzer: the full graph substrate — GOAL 1 (program_graphs: CFG/PDG/SDG), the transitive SUMMARY edges (PART 2 item 9 SDG assembly; PR D "SDG assembly … SUMMARY edges"), and the CPG projection (PR G). The DoD fixture keeps a source→sink data-flow pair so the SDK can assert taint over it. Retitled: dropped "and taint analysis".


Instantiates the skillset's level-3 dataflow plan (cldk-forge: skills/codeanalyzer-backend/references/dataflow-{graphs,construction,substrate-menu,issue-template}.md). Sibling instantiation: codellm-devkit/codeanalyzer-typescript#2.

PROBLEM

codeanalyzer-go's level-1 analysis (symbol table + go/types resolver call graph, on feat/initial-implementation) has no dataflow: no CFG, no dependence graphs, no way to answer "what does this value affect" or "does user input reach this sink". This issue adds level 3 — native, whole-program dependence graphs built from Go's own AST — and exposes slicing and taint as queries over them.

Native is the constraint: everything runs in-process in the analyzer's own ecosystem (pure Go; the binary stays self-contained). Go is uniquely suited to a hand-built implementation because the stdlib is an answer key: each hand-built rung is differentially tested against the canonical library implementation (golang.org/x/tools/go/cfg, go/ssa), which no other language in the fleet offers.

Prerequisite: land feat/initial-implementation (level 1) first.

GOALS

  1. Emit CFG, PDG (CDG+DDG), and SDG as first-class sections of analysis.json (program_graphs, schema-versioned, keyed by canonical (signature, node_id)), gated by -a 3 / --graphs.
  2. Expose backward slicing and taint as SDG queries; sources/sinks/sanitizers/library models supplied as data (JSON spec + JSON Schema validation), emitted as a taint_flows section.
  3. Hold the cross-language parity clause: shared node kinds / edge types / JSON shapes; Go-specific additions (e.g. defer_resume edge kind) are additive and recorded in SCHEMA_DECISIONS.md.
  4. Keep -a 1/-a 2 timings unaffected; content-hash and cache summaries with recorded dependency edges (incrementality later; hooks now).
  5. Deterministic parallelism (-j/--jobs, goroutines + errgroup): intraprocedural stages fan out per callable; the whole-program solve runs concurrently with them; summary composition is a ready-queue wavefront over the SCC condensation DAG. --jobs N output is byte-identical to --jobs 1.
  6. CPG (Neo4j projection) is optional and deferred — this analyzer does not yet ship --emit neo4j; see PR G.

SUBSTRATE DECISIONS (locked — from dataflow-substrate-menu.md, Go row)

  • CFG source: hand-built from go/ast, statement-level, with exceptional
    edges. golang.org/x/tools/go/cfg is the differential-test
    oracle, NOT the production source — it is deliberately
    minimal (no edge kinds, no defer/panic modeling), which is
    exactly what we must add.
  • Def-use source: hand-built reaching definitions over our CFG with k-limited
    access paths; go/ssa is the differential oracle for
    def-use on locals.
  • Points-to oracle: VTA (x/tools/go/callgraph/vta) as the dispatch oracle,
    merged into call_graph with provenance "vta". May-alias MVP
    is type-based (unusually strong in Go: static types,
    explicit pointers). PR F upgrades to a native
    Andersen-style analysis over go/ssa.
    x/tools/go/pointer is deprecated — not adopted.
  • Identity mapping: token.Pos / go/ssa value identities are mapped onto our
    canonical signature + node_id keys — the same keys as
    symbol_table / call_graph. On the critical path for every
    later part.
  • Precision posture: sound-leaning, over-approximate; flow-sensitive on locals
    and value flow, heap flow-insensitive; k-limited access
    paths (--graph-field-depth, default 3).

PART 1 — INTRAPROCEDURAL GRAPHS (construction stages 1–4)

  1. Exceptional CFG per callable: statement-level, synthetic ENTRY/EXIT,
    multi-exit normalized; explicit Go lowering rules:
    • defer: deferred calls run at every exit — splice them on each
      exit-bound path (LIFO), with a defer_resume edge kind;
    • panic/recover: exception edges to the nearest recovering defer,
      else to EXIT;
    • go statement: the spawned call is a CALL-site fact, not a CFG
      successor;
    • select: multi-way branch (one arm per case + default);
    • range loops, labeled break/continue, type switches, for {}
      (synthetic edge to EXIT for post-dominance).
  2. Dominators + post-dominators (CHK iterative); control dependence via the
    post-dominance frontier walk.
  3. Access-path model (locals, params, receivers, package-level vars,
    closure captures; k-limited); reaching definitions -> DDG edges.
  4. PDG assembly; the intraprocedural backward-slice gate on the fixture
    (hand-computed expected node set, exact match).
    Differential tests throughout: block structure vs x/tools/go/cfg on
    non-panicking paths; local def-use vs go/ssa.

PART 2 — INTERPROCEDURAL (stages 5–7)

  1. go/ssa whole-program build behind the -a 3 flag (concurrent with Part 1
    stages); VTA call-graph merge with provenance "vta"; identity mapping.
  2. Record package-level state (exported vars, singletons) read/write sites
    as summary inputs/outputs.
  3. SCC condensation (Tarjan); hammock-region decomposition; bottom-up
    relational region summaries indexed by exit; function-summary composition
    scheduled as a ready-queue wavefront over the condensation DAG (one SCC
    per worker; internal fixpoint sequential); monotone fixpoint within SCCs;
    k-limiting mandatory.
  4. Channels: conservative value flow send -> receive through the channel
    object (over its points-to set); no happens-before analysis — documented.
  5. External/stdlib behavior as model summaries; unmodeled externals default
    to conservative pass-through. SDG assembly: actual/formal in/out, CALL /
    PARAM_IN / PARAM_OUT / SUMMARY edges; package-level state as extra
    formals.

PART 3 — EMISSION AND CLIENTS (stage 8)

  1. program_graphs section per the contract; --graphs with strict flag
    validation; co-evolve the shared SDK Pydantic models (ProgramGraphs /
    GraphNode / GraphEdge / SDGEdge / TaintFlow) in the same change.
  2. Backward slicing (two-phase context-sensitive traversal) and taint
    (labeled reachability, sanitizer blocking, lazy witness reconstruction)
    as SDG queries; models configurable as data (built-in Go pack: net/http
    request data and os.Args/env as sources; os/exec, database/sql,
    html/template bypasses, os file writes as sinks; sanitizer examples);
    taint_flows output with model ids.

CAVEATS AND KNOWN RISKS

  • VTA gives dispatch precision, not full may-alias; the type-based MVP
    over-approximates until PR F. x/tools/go/pointer is deprecated and not an
    option.
  • Concurrency is not modeled: no happens-before; channel flow is a
    conservative object-flow approximation; data races are out of scope.
  • Inherited unsoundness: reflection, unsafe, cgo, plugin loading —
    documented in the README, not silently absorbed.
  • Termination: k-limiting mandatory; label sets are bounded bitsets.
  • Precision: intentionally over-approximate; do not trade soundness for a
    lower false-positive rate — ranking/pruning is downstream's job.
  • Parallel determinism: never assign ids or emit during parallel execution —
    collect, then sort by (signature, node_id). Implement sequentially first,
    pass every gate at --jobs 1, then parallelize with the sequential output
    as the differential oracle. Memory (N workers x ASTs/CFGs) is the real
    ceiling — release per-callable structures once the PDG is emitted.
  • Cost: whole-program go/ssa build + summaries is heavy; -a 1/2 must be
    unaffected (CI-checked).
  • Incrementality: aspirational; dependency edges and content-hashes recorded
    from the start.

STAGED PRs

PR A Prep: land/rebase on feat/initial-implementation; replace the
internal/semantic_analysis/codeql/ stub with the engine-generic
framework seam (CodeQL is dropped org-wide for licensing).
PR B Substrate: go/ssa build + VTA behind -a 3; identity mapping
(token.Pos -> signature+node_id); call-graph merge with provenance
"vta"; CI proves the in-process solve runs behind the flag.
PR C Intraprocedural: hand-built CFG + dominance + PDG; program_graphs
cfg/pdg emission; slice gate green; differential tests vs go/cfg and
go/ssa; then per-callable parallel fan-out (-j), diffed vs --jobs 1.
PR D Summaries: hammock regions, SCC fixpoint with k-limiting; SDG
assembly; sdg_edges emission; MVP taint over the call graph; then the
ready-queue wavefront, diffed vs --jobs 1.
PR E Models-as-data: JSON spec + Schema, default Go model pack; taint_flows
+ lazy witnesses; SDK models co-evolved.
PR F Alias-aware propagation: native Andersen-style points-to over go/ssa
replacing the type-based stub.
PR G (optional) CPG Neo4j projection — requires first standing up the
--emit neo4j surface for this analyzer (neo4j-projection.md); skip if
that surface stays out of scope. The SDG is the core artifact; no
client analysis depends on the CPG.
PR H (later) Incremental re-analysis over the recorded dependency edges.

VERIFICATION / DEFINITION OF DONE

  • Every gate in dataflow-construction.md passes on the fixture (CFG,
    dominance, DFG, PDG-slice, summary, SDG, client gates) — exact expected
    sets, not "non-empty".
  • Differential gates: block structure matches x/tools/go/cfg on
    non-panicking paths; local def-use matches go/ssa.
  • Fixture covers the Go lowering checklist (defer, panic/recover, goroutine
    spawn, select, range, labeled break/continue, type switch, closures) plus
    the shared minimums (aliasing, mutual recursion/SCC, package-level state,
    multi-package flow, sanitized + unsanitized taint pair).
  • analysis.json with -a 3 validates against the shared SDK ProgramGraphs
    models; parity clause holds.
  • --jobs N output byte-identical to --jobs 1 (determinism gate), for both
    the per-callable fan-out and the SCC wavefront.
  • -a 1 / -a 2 wall-clock unchanged within noise on the benchmark fixture.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Fields

    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions