codellm-devkit · rahlk · Jul 2, 2026
diff --git a/.claude/SCHEMA_DECISIONS.md b/.claude/SCHEMA_DECISIONS.md
@@ -0,0 +1,20 @@
+# Schema decisions (codeanalyzer-java)
+
+Auditable record of schema-affecting design decisions, in the style of the sibling analyzers'
+`.claude/SCHEMA_DECISIONS.md`. Every entry was decided with the maintainer.
+
+## Level-3 full SDG (issue #171, 2026-07-01)
+
+| # | Concept | Options considered | **Decision** | Rationale |
+|---|---|---|---|---|
+| 1 | Level mapping | (a) `-a 2` emits SDG too (pre-Mar-2025 behavior); (b) new `-a 3` | **new `-a 3`**; `-a 2` stays call-graph-only, byte-identical | level-2 perf/output untouched; matches the CLDK level ladder. Follow-up on python-sdk: default to 3, dial down on request |
+| 2 | Slicer dependence options | no-heap only; no-heap + knob; full always | **`--sdg-data-deps <no-heap\|full>`, default `no-heap`** (`NO_HEAP_NO_EXCEPTIONS` + `NO_EXCEPTIONAL_EDGES`; `full` = `DataDependenceOptions.FULL` + `ControlDependenceOptions.FULL`) | old fast settings by default; heap dependence is opt-in because it is an order of magnitude slower |
+| 3 | Call-graph builder feeding the SDG | RTA; 0-1-CFA conditional; 0-1-CFA always | **RTA** (`Util.makeRTABuilder`), unchanged | fast, proven on fixtures; 0-1-CFA was tried (979b298) and abandoned; adequate for no-heap deps |
+| 4 | SDG edge shape | method-level `JGraphEdges` only; statement-level `program_graphs` | **both**: method-level `system_dependency_graph` (zero SDK model changes) **and** statement-level `program_graphs` per the level-3 contract | the SDK's existing `JGraphEdges` field validates today; `program_graphs` is the forward contract the SDK/SCIP indexing adapts to |
+| 5 | Node identity in `program_graphs` | AST-node source-span order (contract wording) | **SSA instruction order**: `node_id` 0 = synthetic `ENTRY`, then SSA instructions by `iindex`, last = `EXIT`; source lines from ECJ/CAst positions, `-1` sentinel when unavailable | WALA nodes are SSA instructions, not AST nodes; instruction order is deterministic across runs on identical content — the property the contract actually needs |
+| 6 | CFG edge kinds | full shared vocabulary | shared vocabulary with documented approximations: `true`/`false` by conditional-branch successor order, `loop_back` when target iindex < source iindex, `exception` from WALA exceptional successors, else `fallthrough`/`return` | WALA's SSACFG doesn't label edges; these derivations are deterministic and recorded here rather than invented ad hoc |
+| 7 | Cross-function `sdg_edges` | full HRB vocabulary | `CALL`, `PARAM_IN`, `PARAM_OUT` now; **no `SUMMARY` edges yet** (follow-up) | WALA computes HRB summaries lazily inside `Slicer`; exposing them is real extra work and not needed for the graph itself |
+| 8 | Precision posture | — | sound-leaning, over-approximate; `ReflectionOptions.NONE` (unchanged); application classes only (`GraphSlicer.prune`) | matches level 2; documented unsoundness, not silently absorbed |
+| 9 | Neo4j projection of the SDG | one type + `type` property; per-kind types | **per-kind relationship types `J_CONTROL_DEP`/`J_DATA_DEP`/`J_HEAP_DATA_DEP`** (`JCallable`→`JCallable`, props `weight`/`source_kind`/`destination_kind`, same resolved-gating as `J_CALLS`); schema 1.0.0 → 1.1.0 (additive) | the writers MERGE one relationship per (type, src, dst), so a pair with both control and data dependence would lose one with a single type; WALA's `Dependency` enum is closed (exactly these three), so the vocabulary is total. Statement-level CPG (`CFGNode` etc.) stays a follow-up |
+| 10 | Neo4j default analysis level | keep 1; default 3 | **`--emit neo4j` defaults to `-a 3`** (an explicit `-a` still wins) | the graph is the consumer that wants the full SDG; python-sdk mirrors the same contract (python-sdk#228) |
+| 11 | Client analyses (taint, slicing) | analyzer-side `taint_flows` section; frontend-side queries | **frontend-side only**: the analyzer emits the full universal graph (`program_graphs` + `system_dependency_graph`) and never runs client analyses; taint/slicing are reachability queries over that graph in the SDK (python-sdk#228) | keeps the analyzer a pure graph provider; model packs (sources/sinks/sanitizers) evolve at SDK speed without analyzer releases. Graph *substrate* additions (per-argument PARAM nodes, `SUMMARY` edges) remain analyzer-side — they are part of the graph, not clients |
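
Review note on decision 6: the edge-kind derivation reads as a small decision procedure. Below is a minimal Python sketch of it, not the analyzer's Java code. Only the rules come from the table. The `Succ` record and its field names, the precedence between the rules, the choice that the first conditional successor means `true`, and treating an edge into `EXIT` as `return` are all assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Succ:
    """One CFG successor edge (hypothetical shape, not the analyzer's model)."""

    source_iindex: int
    target_iindex: int
    exceptional: bool = False               # taken from the exceptional-successor set
    branch_position: Optional[int] = None   # 0 or 1 when the source is a conditional branch
    target_is_exit: bool = False            # target is the synthetic EXIT node


def edge_kind(s: Succ) -> str:
    """Classify an edge using the rules of decision 6.

    Assumed precedence: exception, then true/false, then loop_back,
    then return, then fallthrough. The table does not state an order.
    """
    if s.exceptional:
        return "exception"
    if s.branch_position is not None:
        # Assumption: the first successor is the taken ("true") branch.
        return "true" if s.branch_position == 0 else "false"
    if s.target_iindex < s.source_iindex:
        return "loop_back"
    if s.target_is_exit:
        return "return"
    return "fallthrough"
```

Under these assumptions, `edge_kind(Succ(9, 2))` yields `loop_back` and `edge_kind(Succ(3, 4))` yields `fallthrough`. If the real precedence differs (for example a conditional back edge labeled `loop_back`), only the order of the `if` statements changes.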
diff --git a/.gitignore b/.gitignore
@@ -196,3 +196,10 @@ gradle-app.setting
 bin/
 etc/
 /src/test/resources/sample_apps/daytrader8/output/
+
+# Un-ignore the agent guide + schema decision record past a global gitignore that excludes these
+!CLAUDE.md
+!AGENTS.md
+!.claude/
+.claude/*
+!.claude/SCHEMA_DECISIONS.md
diff --git a/AGENTS.md b/AGENTS.md
@@ -0,0 +1 @@
+CLAUDE.md
diff --git a/CLAUDE.md b/CLAUDE.md
@@ -0,0 +1,123 @@
+# codeanalyzer-java
+
+The CLDK Java analyzer. Parses an enterprise Java project with
+[JavaParser](https://javaparser.org/) (symbol table) and [WALA](https://github.com/wala/WALA)
+(call graph / system dependency graph) and emits the **canonical CLDK `analysis.json`** — a
+symbol table plus a dependency graph — so the [CLDK Python SDK](../python-sdk) can consume it
+via `CLDK(language="java").analysis(...)`. It can alternatively project the same IR into a
+**Neo4j property graph** (`--emit neo4j`).
+
+It is the Java sibling of `codeanalyzer-python` and `codeanalyzer-typescript`.
+
+## Requirements
+
+- Java 11+ to run the jar; Java 17+ (Semeru or similar) to build. GraalVM 21+ only for
+  `nativeCompile`. Install via [SDKMan!](https://sdkman.io).
+- Gradle via the checked-in wrapper (`./gradlew`) — never a system Gradle.
+
+## Build / test / run
+
+```bash
+./gradlew fatJar          # → build/libs/codeanalyzer-<version>.jar (the deliverable)
+./gradlew test            # JUnit 5; Testcontainers suites need RUN_CONTAINER_TESTS=1 + Docker/Podman
+./gradlew spotlessApply   # formatting (runs automatically before compileJava)
+./gradlew nativeCompile -PbinDir=$HOME/.local/bin   # optional GraalVM native binary
+
+java -jar build/libs/codeanalyzer-*.jar -i <project> -a 2 -o <outdir>
+```
+
+Version lives in `gradle.properties` (bump with `./gradlew bumpVersion -PbumpType=patch|minor|major`).
+Releases are tag-triggered via GitHub Actions (`.github/workflows/`); a lockstep job releases the
+thin PyPI distribution (`packaging/pypi/`).
+
+## CLI
+
+```
+codeanalyzer -i <project> [options]
+
+  -i, --input <path>           project root to analyze
+  -s, --source-analysis <str>  analyze a single string of Java source instead of a project
+  -o, --output <dir>           write <dir>/analysis.json (omit ⇒ JSON to stdout)
+  -a, --analysis-level <1|2|3> 1 = symbol table (default); 2 = + RTA call graph;
+                               3 = + full system dependency graph (WALA slicer)
+      --graphs <cfg,pdg,sdg>   program_graphs sections to emit at -a 3 (default all)
+      --sdg-data-deps <d>      no-heap (default) | full — slicer data-dependence depth at -a 3
+  -b, --build-cmd <cmd>        custom build command (default: auto-detect mvn/gradle)
+      --no-build               skip building the target app (use if already built)
+  -t, --target-files <f>...    restrict analysis to specific files (incremental)
+      --emit <json|neo4j|schema>  output target (default json)
+      --app-name / --neo4j-*   Neo4j anchor name and Bolt connection (see README §5)
+  -v, --verbose                logs to console
+```
+
+stdout is a clean JSON channel when `-o` is omitted; diagnostics go through `utils/Log`.
+
+## Architecture (`src/main/java/com/ibm/cldk/`)
+
+- `CodeAnalyzer.java` — picocli entrypoint; orchestrates symbol table → graph → emitter.
+- `SymbolTable.java` — JavaParser + symbol solver; builds `Map<path, JavaCompilationUnit>`.
+- `SystemDependencyGraph.java` — WALA-based graph construction: `ScopeUtils` builds the
+  analysis scope (ECJ/CAst source-level front end), `AnalysisUtils.getEntryPoints` seeds
+  entrypoints, then an RTA call-graph build; edges are serialized from a JGraphT graph.
+- `entities/` — the Lombok data model that **is** the `analysis.json` schema
+  (`JavaCompilationUnit`, `Type`, `Callable`, `CallSite`, `CallEdge`, `SystemDepEdge`, …).
+  Schema changes here must stay in lockstep with the Python SDK's `cldk.models.java` models.
+- `javaee/` — Jakarta/Java-EE entrypoint detection helpers.
+- `neo4j/` — the property-graph projection: `GraphProjector` (IR → rows), `CypherWriter`
+  (snapshot), `BoltWriter` (live incremental push; loaded reflectively via `BoltSink` so the
+  native image prunes the driver), `SchemaCatalog` (`schema.neo4j.json` contract).
+- `utils/` — scope/build helpers (`BuildProject` auto-builds the target app), logging.
+
+## Output contract
+
+```jsonc
+{
+  "symbol_table":     { "<abs/or/rel path .java>": JavaCompilationUnit, ... },
+  "call_graph":       [ { "source": {...}, "target": {...}, "type": "CALL_DEP", "weight": ... } ],  // -a 2+
+  "system_dependency_graph": [ ... ],  // -a 3: method-level CONTROL_DEP/DATA_DEP edges (JGraphEdges shape)
+  "program_graphs":   { "schema_version": ..., "functions": { "<fqsig>": { "cfg": ..., "pdg": ... } },
+                        "sdg_edges": [ ... ] },  // -a 3: statement-level, keyed by (signature, node_id)
+  "version": "<analyzer version>"
+}
+```
+
+- Callable signatures are Java method signatures (`<init>` for constructors); call-graph edge
+  endpoints must always resolve to a real symbol-table `Callable` — no dangling edges. Callables
+  discovered only by WALA (e.g. compiler-generated) are back-filled into the symbol table via
+  `createAndPutNewCallableInSymbolTable`.
+- Neo4j labels and relationship types are `J`-prefixed (`:JType`, `J_CALLS`, `J_DATA_DEP`) so Java/Python/TS graphs
+  can share a database. The contract is `schema.neo4j.json`; `Neo4jSchemaConformanceTest` keeps
+  the projector and the contract in sync — regenerate it when the model changes.
+  `--emit neo4j` defaults to the full SDG analysis (`-a 3`); an explicit `-a` dials down.
+
+## Analysis levels
+
+- **Level 1** — symbol table only (JavaParser; no WALA, no build of the target app needed
+  beyond dependency resolution).
+- **Level 2** — plus the WALA graph: entrypoint-seeded RTA call graph over application
+  classes, and cyclomatic complexity stamped onto symbol-table callables.
+- **Level 3** — plus the full system dependency graph from WALA's slicer: method-level
+  `system_dependency_graph` edges and statement-level `program_graphs` (CFG + PDG per
+  callable, cross-function CALL/PARAM_IN/PARAM_OUT edges). Data dependence defaults to
+  no-heap; `--sdg-data-deps=full` widens it. Schema decisions: `.claude/SCHEMA_DECISIONS.md`.
+  Levels 1/2 output and timings must never be affected by level-3 code.
+  The analyzer is a **pure graph provider**: client analyses (taint, slicing) live in the
+  frontend SDKs as reachability queries over this graph — never add them here. Graph
+  substrate (per-argument PARAM nodes, SUMMARY edges) does belong here.
+
+## Tests
+
+- Unit/integration tests in `src/test/java`; fixture apps in
+  `src/test/resources/test-applications/` (daytrader8 is the big end-to-end fixture; the
+  small ones each pin a regression — build-tool quirks, records, init blocks, generics
+  signature collisions, …).
+- Container-backed tests (`Neo4jBoltWriterTest`) are opt-in: `RUN_CONTAINER_TESTS=1 ./gradlew test`.
+
+## Conventions
+
+- Work is issue-driven: GitHub issue → branch named `minor/issue-<N>-<slug>` (or
+  `major|patch/` by semver impact) → PR to `main`.
+- Spotless formatting is enforced at compile time; Lombok for entities; logs via `utils/Log`,
+  never `System.out` (stdout is the JSON channel).
+- The `dist/`, `node_modules/`, `.astro/` dirs at the repo root are packaging/website
+  artifacts — not part of the analyzer build.
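
Review note on the "pure graph provider" rule: a client analysis over the method-level graph can be as small as a breadth-first reachability query, which is why it fits in the frontend. Below is a minimal Python sketch, not SDK code. It assumes the `system_dependency_graph` entries have already been flattened to `(source_signature, target_signature, type)` triples; that flattening and the function name `reachable` are hypothetical, since the `source`/`target` object fields are not shown in this change.

```python
from collections import defaultdict, deque
from typing import Iterable, Set, Tuple

Edge = Tuple[str, str, str]  # (source signature, target signature, edge type): assumed shape


def reachable(
    edges: Iterable[Edge],
    start: str,
    kinds: frozenset = frozenset({"CONTROL_DEP", "DATA_DEP"}),
) -> Set[str]:
    """Return every callable reachable from `start`, including `start` itself,
    following only edges whose type is in `kinds`."""
    adjacency = defaultdict(set)
    for src, dst, kind in edges:
        if kind in kinds:
            adjacency[src].add(dst)

    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in adjacency.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


edges = [
    ("A.f()", "B.g()", "DATA_DEP"),
    ("B.g()", "C.h()", "CONTROL_DEP"),
    ("X.y()", "A.f()", "DATA_DEP"),
]
forward = reachable(edges, "A.f()")
data_only = reachable(edges, "A.f()", kinds=frozenset({"DATA_DEP"}))
```

Restricting `kinds` narrows the query to one dependence type. A taint check in this style would be a reachability test from a source callable to a sink callable, with the source and sink lists supplied by the SDK.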
diff --git a/README.md b/README.md
@@ -5,7 +5,7 @@ Native WALA implementation of source code analysis tool for Enterprise Java Appl
 `codeanalyzer` extracts a comprehensive **symbol table** and **call graph** from Java applications
 and emits them either as the canonical `analysis.json`, or as a **Neo4j property graph**
 (`--emit neo4j`) — a `graph.cypher` snapshot or a live, incremental push over Bolt. See
-[§4. Neo4j graph output](#4-neo4j-graph-output).
+[§5. Neo4j graph output](#5-neo4j-graph-output).
 
 ## Quick install
 
@@ -104,8 +104,14 @@ Analyze java application.
                                default, the analysis JSON is printed to the console.
   -b, --build-cmd=<build>    Custom build command. Defaults to auto build.
       --no-build             Do not build your application (use if already built).
-  -a, --analysis-level=<n>   Level of analysis: 1 (symbol table) or 2 (call graph).
-                               Default: 1. Level 2 adds J_CALLS edges to the graph.
+  -a, --analysis-level=<n>   Level of analysis: 1 (symbol table); 2 (call graph);
+                               3 (full system dependency graph). Default: 1.
+                               Level 2 adds J_CALLS edges to the graph.
+      --graphs=<sections>    Comma-separated program_graphs sections to emit at
+                               analysis level 3: cfg, pdg, sdg. Default: all.
+      --sdg-data-deps=<d>    Depth of the slicer's data dependence at analysis
+                               level 3: no-heap (fast, default) | full
+                               (heap-carried dependence; much slower).
   -t, --target-files=<f>...  Restrict analysis to specific files (incremental).
       --emit=<emit>          Output target: json (analysis.json, default) |
                                neo4j (graph.cypher or live Bolt push) |
@@ -187,13 +193,47 @@ There is a sample application in `src/test/resources/sample_apps/daytrader8/bina
 
 This will print the SDG to the console. Explore the other flags to save the output as JSON.
 
-## 4. Neo4j graph output
+## 4. Full system dependency graph (`-a 3`)
+
+At analysis level 3, `codeanalyzer` builds the **full system dependency graph** — control *and*
+data dependence — from WALA's slicer on top of the level-2 RTA call graph, and emits two extra
+sections in `analysis.json`:
+
+- **`system_dependency_graph`** — method-level dependence edges in the same shape as
+  `call_graph` (`source`/`target` callable, `type` = `CONTROL_DEP`/`DATA_DEP`,
+  `source_kind`/`destination_kind` = the WALA statement kinds, `weight`). This is the field the
+  CLDK Python SDK's `JApplication.system_dependency_graph` models.
+- **`program_graphs`** — statement-level graphs keyed by `(signature, node_id)`: for each
+  application callable a **CFG** (nodes = SSA instructions with source lines, synthetic
+  `ENTRY`=0/`EXIT`=last; edges labeled `fallthrough`/`true`/`false`/`switch_case`/`loop_back`/
+  `exception`/`return`) and a **PDG** (`CDG` + `DDG` edges over the same nodes), plus
+  cross-function **`sdg_edges`** (`CALL`, `PARAM_IN`, `PARAM_OUT`). Scope the sections with
+  `--graphs cfg,pdg,sdg`.
+
+```sh
+codeanalyzer -i /path/to/project -a 3 -o ./out                     # full SDG, no-heap data deps
+codeanalyzer -i /path/to/project -a 3 --sdg-data-deps=full -o ./out # + heap-carried dependence
+```
+
+By default data dependence runs with WALA's `NO_HEAP_NO_EXCEPTIONS`/`NO_EXCEPTIONAL_EDGES`
+options (fast); `--sdg-data-deps=full` opts into heap-carried data dependence, which is
+substantially slower and only as precise as the RTA builder's type-based pointer analysis.
+
+**Known unsoundness** (documented, unchanged from level 2): reflection is not modeled
+(`ReflectionOptions.NONE`), dynamic class loading and JNI are invisible, and dispatch precision
+is RTA. `SUMMARY` edges (transitive callee summaries) are not yet emitted. Levels 1 and 2 are
+completely unaffected by any of this — nothing SDG-related runs below `-a 3`.
+
+## 5. Neo4j graph output
 
 `codeanalyzer` can project the analysis IR into a [Neo4j](https://neo4j.com/) property graph instead
 of `analysis.json`. The graph is a **lossless** projection of the IR: compilation units, types,
 callables, fields, parameters, call sites, variables, enum constants, record components,
 initialization blocks, CRUD operations/queries, comments, annotations and packages are all
-first-class nodes and relationships, and (at `-a 2`) it adds `J_CALLS` edges from the call graph.
+first-class nodes and relationships, and it adds `J_CALLS` edges from the call graph (`-a 2`+)
+plus `J_CONTROL_DEP`/`J_DATA_DEP`/`J_HEAP_DATA_DEP` edges from the system dependency graph
+(`-a 3`). **`--emit neo4j` defaults to the full SDG analysis (`-a 3`)** — pass an explicit
+`-a 1`/`-a 2` to dial down.
 Every field of the Lombok entity model is represented (scalars as node properties — maps such as a
 field's per-variable initializers are kept as a `*_json` property since Neo4j has no map type;
 comments are `:JComment` nodes in addition to the convenience `docstring` property).
@@ -205,7 +245,7 @@ types `J_`-prefixed (e.g. `:JType`, `:JCallable`, `J_CALLS`) so a Java graph can
 database with the Python (`Py*`/`PY_*`) and TypeScript (`TS*`/`TS_*`) backends without colliding.
 `SCHEMA_VERSION` is stamped onto the `:JApplication` node of every emitted graph.
 
-### 4.1. Cypher snapshot (no database required)
+### 5.1. Cypher snapshot (no database required)
 
 ```sh
 codeanalyzer -i /path/to/project -a 2 --emit neo4j -o ./out
@@ -216,7 +256,7 @@ cypher-shell -u neo4j -p <password> < ./out/graph.cypher
 The snapshot is **not** incremental: it creates constraints, scope-wipes this application's prior subgraph,
 then `UNWIND … MERGE`-loads the full truth.
 
-### 4.2. Live incremental push over Bolt
+### 5.2. Live incremental push over Bolt
 
 ```sh
 codeanalyzer -i /path/to/project -a 2 --emit neo4j \
@@ -228,14 +268,14 @@ compilation unit's `content_hash`, replaces just the changed units' subgraphs (i
 `MERGE` upserts), and — on a full run — prunes units whose source file vanished. Combine with
 `--target-files` for a targeted, partial re-push (orphan pruning is then skipped).
 
-### 4.3. Schema contract
+### 5.3. Schema contract
 
 ```sh
 codeanalyzer --emit schema -o ./out   # → ./out/schema.neo4j.json (no project analysis needed)
 codeanalyzer --emit schema            # → prints the contract to stdout
 ```
 
-### 4.4. Verifying the writers
+### 5.4. Verifying the writers
 
 A no-container conformance test (`Neo4jSchemaConformanceTest`) asserts the projector never emits a
 label/relationship/property the catalog doesn't declare, and that `schema.neo4j.json` is current. A
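
Review note on README §4: a quick way to confirm which sections a run produced is to count them. Below is a minimal Python sketch that uses only the top-level keys named in the README and in the output contract (`symbol_table`, `call_graph`, `system_dependency_graph`, `program_graphs.functions`, `program_graphs.sdg_edges`) and the `type` field of dependence edges. The function name `summarize` is hypothetical. Whether a section is absent or null below `-a 3` is not stated here, so the sketch tolerates both.

```python
import json
from collections import Counter


def summarize(analysis: dict) -> dict:
    """Count the sections of a parsed analysis.json document."""
    sdg = analysis.get("system_dependency_graph") or []
    program_graphs = analysis.get("program_graphs") or {}
    return {
        "compilation_units": len(analysis.get("symbol_table") or {}),
        "call_edges": len(analysis.get("call_graph") or []),
        "sdg_edges_by_type": dict(Counter(edge.get("type") for edge in sdg)),
        "functions_with_graphs": len(program_graphs.get("functions") or {}),
        "cross_function_edges": len(program_graphs.get("sdg_edges") or []),
    }


# In-memory stand-in for a level-3 result; the values are illustrative only.
sample = json.loads(
    """
    {
      "symbol_table": {"A.java": {}},
      "call_graph": [{"type": "CALL_DEP"}],
      "system_dependency_graph": [
        {"type": "DATA_DEP"}, {"type": "DATA_DEP"}, {"type": "CONTROL_DEP"}
      ],
      "program_graphs": {"functions": {"f()": {}}, "sdg_edges": [{}, {}]}
    }
    """
)
summary = summarize(sample)
```

For a real run, replace `sample` with the parsed contents of `<outdir>/analysis.json`.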

diff --git a/build.gradle b/build.gradle
@@ -82,7 +82,7 @@ dependencies {
 
     implementation 'org.apache.logging.log4j:log4j-api:2.18.0'
     implementation 'org.apache.logging.log4j:log4j-core:2.18.0'
-    def walaVersion = '1.6.7'
+    def walaVersion = '1.6.10'
 
     compileOnly 'org.projectlombok:lombok:1.18.30'
     annotationProcessor 'org.projectlombok:lombok:1.18.30'

diff --git a/schema.neo4j.json b/schema.neo4j.json
@@ -1,5 +1,5 @@
 {
-  "schema_version": "1.0.0",
+  "schema_version": "1.1.0",
   "generator": "codeanalyzer-java",
   "marker_labels": [
     "JEntrypoint"
@@ -451,6 +451,48 @@
         "destination_kind": "string"
       }
     },
+    {
+      "type": "J_CONTROL_DEP",
+      "from": [
+        "JCallable"
+      ],
+      "to": [
+        "JCallable"
+      ],
+      "properties": {
+        "weight": "integer",
+        "source_kind": "string",
+        "destination_kind": "string"
+      }
+    },
+    {
+      "type": "J_DATA_DEP",
+      "from": [
+        "JCallable"
+      ],
+      "to": [
+        "JCallable"
+      ],
+      "properties": {
+        "weight": "integer",
+        "source_kind": "string",
+        "destination_kind": "string"
+      }
+    },
+    {
+      "type": "J_HEAP_DATA_DEP",
+      "from": [
+        "JCallable"
+      ],
+      "to": [
+        "JCallable"
+      ],
+      "properties": {
+        "weight": "integer",
+        "source_kind": "string",
+        "destination_kind": "string"
+      }
+    },
     {
       "type": "J_HAS_CRUD_OPERATION",
       "from": [
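
Review note on the three new relationship types: decision 9 in `.claude/SCHEMA_DECISIONS.md` justifies them by the writers merging one relationship per (type, src, dst). Below is a minimal Python sketch of that keying, using a dict in place of Neo4j. The single-type name `J_DEP` is hypothetical, and the last-write-wins overwrite is an assumption about how the lost edge would disappear.

```python
from typing import Dict, Iterable, Tuple

Dep = Tuple[str, str, str]  # (source callable, target callable, WALA dependency kind)


def merge_edges(deps: Iterable[Dep], per_kind_types: bool) -> Dict[tuple, dict]:
    """Emulate MERGE keyed by (relationship type, src, dst)."""
    store: Dict[tuple, dict] = {}
    for src, dst, kind in deps:
        # Per-kind types: J_CONTROL_DEP / J_DATA_DEP / J_HEAP_DATA_DEP.
        # Single type: one hypothetical J_DEP with the kind as a property.
        rel_type = f"J_{kind}" if per_kind_types else "J_DEP"
        store[(rel_type, src, dst)] = {"kind": kind}  # a later write replaces an earlier one
    return store


deps = [("f()", "g()", "CONTROL_DEP"), ("f()", "g()", "DATA_DEP")]
single = merge_edges(deps, per_kind_types=False)
per_kind = merge_edges(deps, per_kind_types=True)
```

With a single type the pair `f()` to `g()` keeps one relationship and the control dependence is lost. With per-kind types both survive under distinct keys.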