feat: level-3 program graphs — native CFG/PDG/SDG and backward slicing (-a 3)#25

Open: rahlk wants to merge 2 commits into main from feat/issue-2-program-graphs
Conversation

rahlk (Contributor) commented Jul 2, 2026

Part of #2 (PRs C′+D′ of the amended staging) — the full SDG: native, whole-program dependence graphs built in-process from the ts-morph AST, per the CLDK level-3 dataflow contract. No external engines; Jelly stays a frozen call-graph oracle.

What ships

A new pipeline stage (src/dataflow/), fully flag-gated behind -a 3 (levels 1/2 pay nothing):

| Stage | Module | What it does |
| --- | --- | --- |
| 1 | cfg.ts | Exceptional statement-level CFG per callable: synthetic ENTRY/EXIT, param nodes, ids in source-span order; true/false, loop_back, switch_case, exception, break/continue/return, and TS-native await_resume/yield edge kinds; region-spliced try/catch/finally; while (true) keeps its (dead) loop-exit edge so EXIT stays the unique post-dominance root |
| 2 | dominance.ts | Cooper–Harper–Kennedy post-dominators + Ferrante–Ottenstein–Warren control dependence |
| 3 | defuse.ts | k-limited access paths (--graph-field-depth, default 3) over declaration-keyed bases (local/param/this/captured/module); copy-alias union-find (the MVP substrate — Jelly points-to is staged PR F); forward reaching defs → labeled DDG; capture-at-declaration for closures; EXIT doubles as the HRB formal-out |
| 5–6 | summaries.ts | Tarjan SCC condensation of the provenance-merged call graph; bottom-up relational summaries (argN→return, transitive global reads/writes, globals→return) co-defined to a monotone fixpoint within SCCs; persisted with dependency edges to graphs_summaries.json (write-only today; PR H consumes it) |
| 4+7 | index.ts, sdg.ts | PDG assembly (CDG ∪ DDG) and HRB stitching: CALL, PARAM_IN (positional args + module globals as extra params), PARAM_OUT, SUMMARY — all keyed by (signature, node_id) with the call-graph no-dangling rule extended to graphs |
| 8 (query) | slice.ts | Two-phase context-sensitive backward slicing over the emitted SDG (the gate client; the configurable taint pack is PR E) |
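For readers unfamiliar with stage 2, here is a minimal sketch of how control dependence falls out of post-dominance. It is not the code in dominance.ts: it swaps the Cooper–Harper–Kennedy algorithm for the simpler set-based iteration, and the `Cfg` shape and function names are hypothetical. It assumes every node reaches EXIT, which is the invariant stage 1 maintains.

```typescript
// Hypothetical minimal CFG shape; the PR's real node and edge types are not shown.
type NodeId = number;
interface Cfg {
  nodes: NodeId[];
  succ: Map<NodeId, NodeId[]>;
  exit: NodeId;
}

// Set-based iterative post-dominators (a simpler stand-in for CHK):
// pdom(EXIT) = {EXIT}; pdom(n) = {n} ∪ ⋂ pdom(s) over successors s.
function postDominators(cfg: Cfg): Map<NodeId, Set<NodeId>> {
  const pdom = new Map<NodeId, Set<NodeId>>();
  for (const n of cfg.nodes) {
    pdom.set(n, n === cfg.exit ? new Set([n]) : new Set(cfg.nodes));
  }
  let changed = true;
  while (changed) {
    changed = false;
    for (const n of cfg.nodes) {
      if (n === cfg.exit) continue;
      const succs = cfg.succ.get(n) ?? [];
      const next = new Set<NodeId>([n]);
      if (succs.length > 0) {
        // Keep only candidates that post-dominate every successor.
        for (const c of pdom.get(succs[0])!) {
          if (succs.every((s) => pdom.get(s)!.has(c))) next.add(c);
        }
      }
      // Sets only ever shrink, so a size change means a real change.
      if (next.size !== pdom.get(n)!.size) {
        pdom.set(n, next);
        changed = true;
      }
    }
  }
  return pdom;
}

// Ferrante–Ottenstein–Warren: y is control dependent on a when y post-dominates
// some successor of a but does not strictly post-dominate a itself.
function controlDependence(cfg: Cfg): Map<NodeId, NodeId[]> {
  const pdom = postDominators(cfg);
  const deps = new Map<NodeId, Set<NodeId>>();
  for (const a of cfg.nodes) {
    for (const s of cfg.succ.get(a) ?? []) {
      for (const y of pdom.get(s)!) {
        if (y === a || !pdom.get(a)!.has(y)) {
          if (!deps.has(y)) deps.set(y, new Set());
          deps.get(y)!.add(a);
        }
      }
    }
  }
  // Emit in node order with sorted controllers so the result is deterministic.
  const out = new Map<NodeId, NodeId[]>();
  for (const n of cfg.nodes) {
    const d = deps.get(n);
    if (d) out.set(n, Array.from(d).sort((x, y) => x - y));
  }
  return out;
}

// ENTRY(0) -> if(1) -> then(2) | else(3) -> join(4) -> EXIT(5)
const diamond: Cfg = {
  nodes: [0, 1, 2, 3, 4, 5],
  succ: new Map([[0, [1]], [1, [2, 3]], [2, [4]], [3, [4]], [4, [5]], [5, []]]),
  exit: 5,
};
console.log(controlDependence(diamond)); // nodes 2 and 3 each depend on node 1
```

The dead loop-exit edge that stage 1 keeps for while (true) matters here: without a path to EXIT, the intersection above would never shrink for the loop body and the result would be meaningless.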

Emission is a schema-versioned program_graphs section of analysis.json, scoped by --graphs cfg,dfg,pdg,sdg with strict flag validation (unknown values exit non-zero, never a silent fallback). Node identity uses the same signatureOf() canonicalizer as symbol_table/call_graph, so everything joins.
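As an illustration of the strict validation described above, a parser for the --graphs value might look like the sketch below. The function name, error text, and canonical ordering are assumptions for illustration; only the four accepted values and the fail-loudly behaviour come from this PR.

```typescript
// The four graph kinds named in this PR; everything else here is hypothetical.
const GRAPH_KINDS = ["cfg", "dfg", "pdg", "sdg"] as const;
type GraphKind = (typeof GRAPH_KINDS)[number];

// Parses a comma-separated --graphs value. Unknown or empty entries throw
// instead of being dropped, so a typo can never silently narrow the output.
function parseGraphsFlag(value: string): GraphKind[] {
  const requested = value.split(",").map((s) => s.trim());
  const unknown = requested.filter(
    (s) => !(GRAPH_KINDS as readonly string[]).includes(s),
  );
  if (unknown.length > 0) {
    throw new Error(
      `--graphs: unknown value(s) "${unknown.join('", "')}"; expected a subset of ${GRAPH_KINDS.join(",")}`,
    );
  }
  // Return in canonical order, de-duplicated, so emission order is stable.
  return GRAPH_KINDS.filter((k) => requested.includes(k));
}

console.log(parseGraphsFlag("sdg,cfg")); // [ "cfg", "sdg" ]
try {
  parseGraphsFlag("cfg,dgf");
} catch (err) {
  console.error((err as Error).message); // the CLI would exit non-zero here
}
```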

Verification (every contract gate, exact sets)

test/fixtures/dataflow-app + test/dataflow.test.ts (36 tests, 1.5k assertions):

  • CFG gate — single ENTRY/EXIT, contiguous span-ordered ids, real spans, every node reachable from ENTRY and reaching EXIT, and each fixture construct produces its documented edges (branches, loop back edges, exception→catch, finally re-raise, switch dispatch+breaks, await/yield, infinite-loop exit edge, short-circuit staying intra-statement).
  • Dominance gate — hand-computed CDG sets for classify and early, exact.
  • DFG gate — loop-carried acc→acc, shadowed scopes don't leak, write-through-alias reaches read-through-original, closure capture edges.
  • PDG slice gate — backward slice of classify's return equals the hand-computed node set (correctly excluding the strongly-killed initializer).
  • Summary/SDG gates — no dangling (signature, node_id) endpoints; CALL targets ENTRY; positional PARAM_IN targets param nodes; the composed SUMMARY arg0 for the a→b→c chain; cross-file edges; a global written by one callee and read by the next materializing as caller-local DDG; mutual recursion reaching fixpoint.
  • Interprocedural slice — main's return slices to exactly {main, chain.a, chain.b, chain.c, state.bump, state.readCounter} with exact per-function node sets, context-sensitively.
  • Determinism — two runs emit byte-identical program_graphs; -a 1 output contains no trace of the section.
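To make the slice gates above concrete, here is a minimal sketch of two-phase (Horwitz–Reps–Binkley style) backward slicing over a toy SDG. The node names, the `Edge` shape, and the lowercase edge kinds are hypothetical stand-ins for the emitted (signature, node_id) endpoints; this is not the code in slice.ts.

```typescript
type EdgeKind = "control" | "data" | "call" | "param_in" | "param_out" | "summary";
interface Edge { from: string; to: string; kind: EdgeKind }

// Backward reachability from the seeds, ignoring the given edge kinds.
function reach(edges: Edge[], seeds: Iterable<string>, skip: Set<EdgeKind>): Set<string> {
  const incoming = new Map<string, Edge[]>();
  for (const e of edges) {
    if (!incoming.has(e.to)) incoming.set(e.to, []);
    incoming.get(e.to)!.push(e);
  }
  const seen = new Set<string>(seeds);
  const work = Array.from(seen);
  while (work.length > 0) {
    const n = work.pop()!;
    for (const e of incoming.get(n) ?? []) {
      if (skip.has(e.kind) || seen.has(e.from)) continue;
      seen.add(e.from);
      work.push(e.from);
    }
  }
  return seen;
}

// Phase 1 stays in the criterion's procedure and its callers: it skips
// PARAM_OUT and relies on SUMMARY edges to step across call sites.
// Phase 2 descends into callees and skips CALL/PARAM_IN, so it never
// climbs back out into an unrelated caller.
function backwardSlice(edges: Edge[], criterion: string): string[] {
  const phase1 = reach(edges, [criterion], new Set<EdgeKind>(["param_out"]));
  const phase2 = reach(edges, phase1, new Set<EdgeKind>(["call", "param_in"]));
  return Array.from(phase2).sort();
}

// main: x = 1; r = f(x); return r      other: f(y)   (unrelated caller)
const sdg: Edge[] = [
  { from: "main.x", to: "main.call.in0", kind: "data" },
  { from: "main.call", to: "main.call.in0", kind: "control" },
  { from: "main.call", to: "main.call.out", kind: "control" },
  { from: "main.call.in0", to: "main.call.out", kind: "summary" },
  { from: "main.call.out", to: "main.return", kind: "data" },
  { from: "main.call", to: "f.entry", kind: "call" },
  { from: "main.call.in0", to: "f.param0", kind: "param_in" },
  { from: "f.entry", to: "f.body", kind: "control" },
  { from: "f.param0", to: "f.body", kind: "data" },
  { from: "f.body", to: "f.out", kind: "data" },
  { from: "f.out", to: "main.call.out", kind: "param_out" },
  { from: "other.call", to: "f.entry", kind: "call" },
  { from: "other.call.in0", to: "f.param0", kind: "param_in" },
];
console.log(backwardSlice(sdg, "main.return"));
```

The slice of main.return pulls in all of f but none of other, which is the context sensitivity the interprocedural gate checks: a single-pass traversal over every edge kind would also include other.call and other.call.in0.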

bun test: 47 pass / 0 fail (5 Docker-gated skips) · tsc --noEmit clean · bun run build compiles the standalone binary with the new stage bundled.

Parallel execution model (second commit)

The contract's parallel model, built on the sequential run as the differential oracle:

  • Stage split first: AST-bound fact extraction is hoisted out of the summary fixpoint; the reaching-defs solve is pure data (CallableGraphData is the serializable per-callable projection), so fixpoints re-run without re-walking the AST — a sequential win on its own.
  • Stages 1–4 fan out per callable over a Bun worker pool (partitioned by file; each worker materializes its own whole-program ts-morph project, since ASTs can't cross the clone boundary).
  • Call-graph overlap: at -a 3 the extraction is posted to the pool before the provider (tsc resolver + Jelly subprocess) runs on the main thread — the "points-to solve concurrent with stages 1–4" slot — joined before summaries.
  • Stages 6–7 run as a Kahn-style ready-queue wavefront over the SCC condensation DAG (per-SCC dependency counters; the SCC's internal fixpoint is the atomic unit, one worker each).
  • Determinism gate: --jobs N is byte-identical to --jobs 1 (span-ordered ids, collect-then-sort, pure sccFixpoint), enforced by a differential test; --jobs 1 remains the debug mode.
  • Failure discipline: dying workers are retired and never strand the queue (a stranded queue could previously let the process exit 0 without output); extraction failure closes the pool so the wavefront degrades sequentially too. Warnings, never wrong/missing output.
  • Compiled binary: the worker ships as a second bun build --compile entrypoint (embedded as /$bunfs/root/dataflow/worker.js); verified dist/cants -a 3 -j 2 runs workers with byte-identical output.
  • Honest default: sequential. Measured on self-analysis (36 files, 211 callables): -j 14 is 2.5× slower wall-clock because per-worker project load dominates the parallelizable graph math; -j N is the explicit opt-in for large codebases.
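The wavefront over the SCC condensation DAG can be sketched as follows. This is a simplified, single-threaded illustration with hypothetical names: it groups SCCs into barrier-separated waves for readability, whereas a real ready queue would dispatch each SCC as soon as its counter reaches zero. It assumes every dependency id also appears as a key and that dependency lists contain no duplicates.

```typescript
// deps maps an SCC id to the SCC ids it calls into, which must be
// summarized first. Returns the SCC ids grouped into waves; all SCCs in one
// wave are independent of each other and could run on separate workers.
function wavefronts(deps: Map<number, number[]>): number[][] {
  const pending = new Map<number, number>(); // per-SCC dependency counter
  const dependents = new Map<number, number[]>(); // reverse edges
  for (const [scc, callees] of deps) {
    pending.set(scc, callees.length);
    for (const c of callees) {
      if (!dependents.has(c)) dependents.set(c, []);
      dependents.get(c)!.push(scc);
    }
  }
  const byId = (a: number, b: number) => a - b;
  let ready: number[] = [];
  for (const [scc, count] of pending) if (count === 0) ready.push(scc);
  ready.sort(byId);

  const waves: number[][] = [];
  let done = 0;
  while (ready.length > 0) {
    waves.push(ready);
    done += ready.length;
    const next: number[] = [];
    for (const finished of ready) {
      for (const d of dependents.get(finished) ?? []) {
        const left = pending.get(d)! - 1;
        pending.set(d, left);
        if (left === 0) next.push(d);
      }
    }
    // Sorting each wave keeps the schedule independent of completion order.
    ready = next.sort(byId);
  }
  // A condensation is acyclic by construction; leftovers mean bad input.
  if (done !== deps.size) throw new Error("dependency cycle between SCCs");
  return waves;
}

// SCC 3 calls 1 and 2, which both call the leaf SCC 0.
console.log(wavefronts(new Map([[0, []], [1, [0]], [2, [0]], [3, [1, 2]]])));
// [ [ 0 ], [ 1, 2 ], [ 3 ] ]
```

Sorting by id inside each wave is one way to get the collect-then-sort discipline the determinism gate relies on: results are gathered in a fixed order regardless of which worker finishes first.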

Also fixes a latent bug: TSCallable.path is absolute, so the summary cache's content-hash lookup (symbol_table[c.path]) always missed; now keyed by the project-relative file key.
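The keying mismatch is easy to reproduce in isolation. The sketch below uses hypothetical shapes (`contentHash`, `toProjectKey`, a POSIX-only prefix strip) purely to show why an absolute path never hits a table keyed by project-relative paths; a real implementation would more likely go through a path library.

```typescript
// Hypothetical stand-in for symbol_table: keyed by project-relative file key.
const symbolTable = new Map<string, { contentHash: string }>([
  ["src/state.ts", { contentHash: "abc123" }],
]);

// Simplified, POSIX-only conversion from an absolute path to the file key.
function toProjectKey(projectRoot: string, absolutePath: string): string {
  const root = projectRoot.endsWith("/") ? projectRoot : projectRoot + "/";
  if (!absolutePath.startsWith(root)) {
    throw new Error(`${absolutePath} is outside ${projectRoot}`);
  }
  return absolutePath.slice(root.length);
}

const callablePath = "/repo/src/state.ts"; // absolute, like TSCallable.path

// Before the fix: the absolute path is not a key, so every lookup misses.
console.log(symbolTable.get(callablePath)); // undefined

// After the fix: look up by the project-relative key.
console.log(symbolTable.get(toProjectKey("/repo", callablePath))); // { contentHash: "abc123" }
```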

Deliberate scope cuts (staged in #2)

  • PR E — taint models-as-data + taint_flows (the fixture already carries the source/sink/sanitizer pair).
  • PR F — Jelly points-to-backed alias-aware propagation (replaces the copy-alias MVP).
  • PR G — CPG projection through the Neo4j emitter + schema.neo4j.json bump.
  • PR H — incremental re-analysis over the recorded summary dependency edges.
  • Summaries are node-granular (statement-level) — sound-leaning and over-approximate per the contract's precision posture; known unsoundness (dynamic eval, reflection/monkey-patching, npm-internal effects, cross-call this flow) is recorded in .claude/SCHEMA_DECISIONS.md and the README.

rahlk added 2 commits July 1, 2026 21:45
…icing (-a 3)

Implements the dataflow half of #2: whole-program dependence graphs built
in-process from the ts-morph AST, emitted as a schema-versioned
program_graphs section of analysis.json, gated by -a 3 / --graphs.

- src/dataflow/cfg.ts: exceptional statement-level CFG per callable
  (ENTRY/param/statement/EXIT nodes in source-span order; true/false,
  loop_back, switch_case, exception, await_resume, yield edge kinds;
  region-spliced try/catch/finally; synthetic loop-exit edge keeps EXIT
  the post-dominance root).
- src/dataflow/dominance.ts: CHK iterative post-dominators +
  Ferrante–Ottenstein–Warren control dependence (CDG).
- src/dataflow/defuse.ts: k-limited access paths (declaration-keyed bases:
  local/param/this/captured/module), copy-alias union-find MVP, forward
  reaching definitions, DDG extraction, capture-at-declaration for
  closures, EXIT-as-formal-out routing.
- src/dataflow/summaries.ts: Tarjan SCC condensation of the
  provenance-merged call graph; bottom-up relational summaries
  (param→return, transitive global reads/writes) co-defined to a monotone
  fixpoint inside SCCs; persisted with dependency edges to
  graphs_summaries.json for later incrementality.
- src/dataflow/sdg.ts: HRB stitching — CALL, PARAM_IN (args by position,
  globals as extra params), PARAM_OUT, and SUMMARY edges, all keyed by
  (signature, node_id) with no dangling endpoints.
- src/dataflow/slice.ts: two-phase context-sensitive backward slicing as
  an SDG query.
- CLI: -a 3, --graphs cfg,dfg,pdg,sdg, --graph-field-depth (strictly
  validated); -a 1/-a 2 are untouched (level 3 is fully flag-gated).
- test/fixtures/dataflow-app + test/dataflow.test.ts: every contract gate
  with exact hand-computed expected sets (CFG reachability, CDG sets,
  loop-carried/shadowing/aliasing DDG, intraprocedural and
  interprocedural slices, SUMMARY for the a→b→c chain, global flow,
  mutual-recursion fixpoint, byte-identical determinism).

Follow-ups staged in #2: taint models-as-data (PR E), Jelly points-to
aliasing (PR F), CPG Neo4j projection (PR G), incremental re-analysis
(PR H).

Implements the contract's parallel model on top of the sequential oracle:

- Stage split: fact extraction (AST-bound, once per callable) is hoisted
  out of the summary fixpoint; the reaching-defs solve (defuse.solveDefUse)
  is now pure data and re-runs without touching the AST. CallableGraphData
  is the serializable per-callable projection that crosses the worker
  boundary.
- Stages 1-4 (CFG, dominance, def-use facts, PDG) fan out per callable
  over a Bun worker pool, partitioned by file; each worker materializes
  its own whole-program ts-morph project (ASTs cannot be structured-cloned)
  and returns plain data.
- The call-graph solve overlaps extraction: at -a 3, core.ts posts the
  extraction to the pool BEFORE the provider (tsc resolver + Jelly
  subprocess) runs on the main thread, and joins before summaries.
- Stages 6-7 run as a Kahn-style ready-queue wavefront over the Tarjan SCC
  condensation DAG (per-SCC dependency counters; the SCC and its internal
  fixpoint are the atomic unit, one worker each).
- Determinism: --jobs N output is byte-identical to --jobs 1 (span-ordered
  ids, collect-then-sort emission, sccFixpoint pure), enforced by a
  differential test. --jobs 1 is the sequential debug mode.
- Failure discipline: a dying worker is retired and its queue never
  strands (a stranded queue previously let the process exit 0 without
  emitting output); extraction failure closes the pool so the wavefront
  degrades sequentially too — a warning, never wrong or missing output.
- Compiled binary: the worker is a second bun build --compile entrypoint,
  embedded as /$bunfs/root/dataflow/worker.js; pool resolves the URL per
  runtime. Verified: dist/cants -a 3 -j 2 runs workers with byte-identical
  output.
- Default is sequential: measurement (self-analysis: 36 files, 211
  callables) shows per-worker project load dominates the parallelizable
  graph math at small/mid scale (2.5x slower at -j 14), so -j N is an
  explicit opt-in for large codebases.

Also fixes a latent bug: TSCallable.path is absolute, so the summary
cache's symbol_table[c.path] content-hash lookup always missed; now keyed
by the project-relative file key.