20 changes: 20 additions & 0 deletions .claude/SCHEMA_DECISIONS.md
@@ -0,0 +1,20 @@
# Schema decisions (codeanalyzer-java)

Auditable record of schema-affecting design decisions, in the style of the sibling analyzers'
`.claude/SCHEMA_DECISIONS.md`. Every entry was decided with the maintainer.

## Level-3 full SDG (issue #171, 2026-07-01)

| # | Concept | Options considered | **Decision** | Rationale |
|---|---|---|---|---|
| 1 | Level mapping | (a) `-a 2` emits SDG too (pre-Mar-2025 behavior); (b) new `-a 3` | **new `-a 3`**; `-a 2` stays call-graph-only, byte-identical | level-2 perf/output untouched; matches the CLDK level ladder. Follow-up on python-sdk: default to 3, dial down on request |
| 2 | Slicer dependence options | no-heap only; no-heap + knob; full always | **`--sdg-data-deps <no-heap\|full>`, default `no-heap`** (`NO_HEAP_NO_EXCEPTIONS` + `NO_EXCEPTIONAL_EDGES`; `full` = `DataDependenceOptions.FULL` + `ControlDependenceOptions.FULL`) | old fast settings by default; heap dependence is opt-in because it is an order of magnitude slower |
| 3 | Call-graph builder feeding the SDG | RTA; 0-1-CFA conditional; 0-1-CFA always | **RTA** (`Util.makeRTABuilder`), unchanged | fast, proven on fixtures; 0-1-CFA was tried (979b298) and abandoned; adequate for no-heap deps |
| 4 | SDG edge shape | method-level `JGraphEdges` only; statement-level `program_graphs` | **both**: method-level `system_dependency_graph` (zero SDK model changes) **and** statement-level `program_graphs` per the level-3 contract | the SDK's existing `JGraphEdges` field validates today; `program_graphs` is the forward contract the SDK/SCIP indexing adapts to |
| 5 | Node identity in `program_graphs` | AST-node source-span order (contract wording) | **SSA instruction order**: `node_id` 0 = synthetic `ENTRY`, then SSA instructions by `iindex`, last = `EXIT`; source lines from ECJ/CAst positions, `-1` sentinel when unavailable | WALA nodes are SSA instructions, not AST nodes; instruction order is deterministic across runs on identical content — the property the contract actually needs |
| 6 | CFG edge kinds | full shared vocabulary | shared vocabulary with documented approximations: `true`/`false` by conditional-branch successor order, `loop_back` when target iindex < source iindex, `exception` from WALA exceptional successors, else `fallthrough`/`return` | WALA's SSACFG doesn't label edges; these derivations are deterministic and recorded here rather than invented ad hoc |
| 7 | Cross-function `sdg_edges` | full HRB vocabulary | `CALL`, `PARAM_IN`, `PARAM_OUT` now; **no `SUMMARY` edges yet** (follow-up) | WALA computes HRB summaries lazily inside `Slicer`; exposing them is real extra work and not needed for the graph itself |
| 8 | Precision posture | — | sound-leaning, over-approximate; `ReflectionOptions.NONE` (unchanged); application classes only (`GraphSlicer.prune`) | matches level 2; documented unsoundness, not silently absorbed |
| 9 | Neo4j projection of the SDG | one type + `type` property; per-kind types | **per-kind relationship types `J_CONTROL_DEP`/`J_DATA_DEP`/`J_HEAP_DATA_DEP`** (`JCallable`→`JCallable`, props `weight`/`source_kind`/`destination_kind`, same resolved-gating as `J_CALLS`); schema 1.0.0 → 1.1.0 (additive) | the writers MERGE one relationship per (type, src, dst), so a pair with both control and data dependence would lose one with a single type; WALA's `Dependency` enum is closed (exactly these three), so the vocabulary is total. Statement-level CPG (`CFGNode` etc.) stays a follow-up |
| 10 | Neo4j default analysis level | keep 1; default 3 | **`--emit neo4j` defaults to `-a 3`** (an explicit `-a` still wins) | the graph is the consumer that wants the full SDG; python-sdk mirrors the same contract (python-sdk#228) |
| 11 | Client analyses (taint, slicing) | analyzer-side `taint_flows` section; frontend-side queries | **frontend-side only**: the analyzer emits the full universal graph (`program_graphs` + `system_dependency_graph`) and never runs client analyses; taint/slicing are reachability queries over that graph in the SDK (python-sdk#228) | keeps the analyzer a pure graph provider; model packs (sources/sinks/sanitizers) evolve at SDK speed without analyzer releases. Graph *substrate* additions (per-argument PARAM nodes, `SUMMARY` edges) remain analyzer-side — they are part of the graph, not clients |
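Decision 6 derives CFG edge kinds from properties WALA does expose, so the derivation can be sketched as a small classifier. This is a minimal Python sketch, not the analyzer's Java code: the `CfgEdge` record, its field names, the precedence between rules, the mapping of the first branch successor to `true`, and the use of `return` for edges into `EXIT` are all assumptions made for illustration. The table only fixes the rules themselves.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class CfgEdge:
    """Hypothetical edge record; field names are illustrative only."""
    src_iindex: int
    dst_iindex: int
    exceptional: bool = False          # edge came from the exceptional-successor set
    branch_slot: Optional[int] = None  # position among a conditional branch's successors
    dst_is_exit: bool = False          # target is the synthetic EXIT node


def edge_kind(edge: CfgEdge) -> str:
    """Label an edge using the approximations recorded in decision 6."""
    if edge.exceptional:
        return "exception"
    if edge.branch_slot is not None:
        # Assumption: the first successor is the `true` arm.
        return "true" if edge.branch_slot == 0 else "false"
    if edge.dst_is_exit:
        # Assumption: edges into EXIT are the `return` edges.
        return "return"
    if edge.dst_iindex < edge.src_iindex:
        return "loop_back"
    return "fallthrough"
```

Because every rule depends only on instruction indices and successor order, the same input always produces the same labels, which is the determinism the rationale column asks for.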
7 changes: 7 additions & 0 deletions .gitignore
@@ -196,3 +196,10 @@ gradle-app.setting
bin/
etc/
/src/test/resources/sample_apps/daytrader8/output/

# Un-ignore the agent guide + schema decision record past a global gitignore that excludes these
!CLAUDE.md
!AGENTS.md
!.claude/
.claude/*
!.claude/SCHEMA_DECISIONS.md
1 change: 1 addition & 0 deletions AGENTS.md
123 changes: 123 additions & 0 deletions CLAUDE.md
@@ -0,0 +1,123 @@
# codeanalyzer-java

The CLDK Java analyzer. Parses an enterprise Java project with
[JavaParser](https://javaparser.org/) (symbol table) and [WALA](https://github.com/wala/WALA)
(call graph / system dependency graph) and emits the **canonical CLDK `analysis.json`** — a
symbol table plus a dependency graph — so the [CLDK Python SDK](../python-sdk) can consume it
via `CLDK(language="java").analysis(...)`. It can alternatively project the same IR into a
**Neo4j property graph** (`--emit neo4j`).

It is the Java sibling of `codeanalyzer-python` and `codeanalyzer-typescript`.

## Requirements

- Java 11+ to run the jar; Java 17+ (Semeru or similar) to build. GraalVM 21+ only for
`nativeCompile`. Install via [SDKMan!](https://sdkman.io).
- Gradle via the checked-in wrapper (`./gradlew`) — never a system Gradle.

## Build / test / run

```bash
./gradlew fatJar # → build/libs/codeanalyzer-<version>.jar (the deliverable)
./gradlew test # JUnit 5; Testcontainers suites need RUN_CONTAINER_TESTS=1 + Docker/Podman
./gradlew spotlessApply # formatting (runs automatically before compileJava)
./gradlew nativeCompile -PbinDir=$HOME/.local/bin # optional GraalVM native binary

java -jar build/libs/codeanalyzer-*.jar -i <project> -a 2 -o <outdir>
```

Version lives in `gradle.properties` (bump with `./gradlew bumpVersion -PbumpType=patch|minor|major`).
Releases are tag-triggered via GitHub Actions (`.github/workflows/`); a lockstep job releases the
thin PyPI distribution (`packaging/pypi/`).

## CLI

```
codeanalyzer -i <project> [options]

-i, --input <path> project root to analyze
-s, --source-analysis <str> analyze a single string of Java source instead of a project
-o, --output <dir> write <dir>/analysis.json (omit ⇒ JSON to stdout)
-a, --analysis-level <1|2|3> 1 = symbol table (default); 2 = + RTA call graph;
3 = + full system dependency graph (WALA slicer)
--graphs <cfg,pdg,sdg> program_graphs sections to emit at -a 3 (default all)
--sdg-data-deps <d> no-heap (default) | full — slicer data-dependence depth at -a 3
-b, --build-cmd <cmd> custom build command (default: auto-detect mvn/gradle)
--no-build skip building the target app (use if already built)
-t, --target-files <f>... restrict analysis to specific files (incremental)
--emit <json|neo4j|schema> output target (default json)
--app-name / --neo4j-* Neo4j anchor name and Bolt connection (see README §5)
-v, --verbose logs to console
```

stdout is a clean JSON channel when `-o` is omitted; diagnostics go through `utils/Log`.
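Because stdout carries only JSON when `-o` is omitted, a caller can parse it directly. A minimal Python sketch, assuming a `codeanalyzer` executable is on `PATH` (the synopsis above uses that name, but how it gets installed is not shown here); `build_command` and `run_analysis` are hypothetical helper names, not part of any SDK:

```python
import json
import subprocess

VALID_LEVELS = (1, 2, 3)


def build_command(project, level=1, executable="codeanalyzer"):
    """Build the argv for a run that prints analysis.json to stdout (no -o)."""
    if level not in VALID_LEVELS:
        raise ValueError(f"analysis level must be one of {VALID_LEVELS}, got {level!r}")
    return [executable, "-i", str(project), "-a", str(level)]


def run_analysis(project, level=1, executable="codeanalyzer"):
    """Run the analyzer and parse its stdout as JSON.

    Assumes the executable exists; raises CalledProcessError on a non-zero exit.
    """
    proc = subprocess.run(
        build_command(project, level, executable),
        capture_output=True,
        text=True,
        check=True,
    )
    return json.loads(proc.stdout)
```

This only works because nothing but the JSON document is written to stdout, which is why the conventions forbid `System.out` in analyzer code.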

## Architecture (`src/main/java/com/ibm/cldk/`)

- `CodeAnalyzer.java` — picocli entrypoint; orchestrates symbol table → graph → emitter.
- `SymbolTable.java` — JavaParser + symbol solver; builds `Map<path, JavaCompilationUnit>`.
- `SystemDependencyGraph.java` — WALA-based graph construction: `ScopeUtils` builds the
analysis scope (ECJ/CAst source-level front end), `AnalysisUtils.getEntryPoints` seeds
entrypoints, then an RTA call-graph build; edges are serialized from a JGraphT graph.
- `entities/` — the Lombok data model that **is** the `analysis.json` schema
(`JavaCompilationUnit`, `Type`, `Callable`, `CallSite`, `CallEdge`, `SystemDepEdge`, …).
Schema changes here must stay in lockstep with the Python SDK's `cldk.models.java` models.
- `javaee/` — Jakarta/Java-EE entrypoint detection helpers.
- `neo4j/` — the property-graph projection: `GraphProjector` (IR → rows), `CypherWriter`
(snapshot), `BoltWriter` (live incremental push; loaded reflectively via `BoltSink` so the
native image prunes the driver), `SchemaCatalog` (`schema.neo4j.json` contract).
- `utils/` — scope/build helpers (`BuildProject` auto-builds the target app), logging.
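The README describes `BoltWriter`'s incremental push as a diff on each compilation unit's `content_hash`: changed units are replaced, and on a full run units whose source file vanished are pruned, while a `--target-files` run skips pruning. The planning step might look roughly like the Python sketch below. The function name, the dict-of-hashes inputs and the returned shape are assumptions for illustration, not the writer's actual Java API.

```python
def plan_push(stored, current, targeted=False):
    """Decide which compilation units to replace and which to prune.

    stored:   {path: content_hash} already in the database (hypothetical shape)
    current:  {path: content_hash} from this analysis run
    targeted: True for a --target-files run, where orphan pruning is skipped
    """
    # New or modified units: no stored hash, or a different one.
    replace = sorted(p for p, h in current.items() if stored.get(p) != h)
    # Units that disappeared from the source tree; only trusted on a full run.
    prune = [] if targeted else sorted(p for p in stored if p not in current)
    return {"replace": replace, "prune": prune}
```

Skipping the prune step on targeted runs matters because a partial run sees only a subset of files, so every other unit would otherwise look vanished.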

## Output contract

```jsonc
{
"symbol_table": { "<abs/or/rel path .java>": JavaCompilationUnit, ... },
"call_graph": [ { "source": {...}, "target": {...}, "type": "CALL_DEP", "weight": ... } ], // -a 2+
"system_dependency_graph": [ ... ], // -a 3: method-level CONTROL_DEP/DATA_DEP edges (JGraphEdges shape)
"program_graphs": { "schema_version": ..., "functions": { "<fqsig>": { "cfg": ..., "pdg": ... } },
"sdg_edges": [ ... ] }, // -a 3: statement-level, keyed by (signature, node_id)
"version": "<analyzer version>"
}
```

- Callable signatures are Java method signatures (`<init>` for constructors); call-graph edge
endpoints must always resolve to a real symbol-table `Callable` — no dangling edges. Callables
discovered only by WALA (e.g. compiler-generated) are back-filled into the symbol table via
`createAndPutNewCallableInSymbolTable`.
- Neo4j labels are `J`-prefixed (`:JType`, `J_CALLS`, `J_DATA_DEP`) so Java/Python/TS graphs
can share a database. The contract is `schema.neo4j.json`; `Neo4jSchemaConformanceTest` keeps
the projector and the contract in sync — regenerate it when the model changes.
`--emit neo4j` defaults to the full SDG analysis (`-a 3`); an explicit `-a` dials down.
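The `-a 2+` and `-a 3` annotations in the contract suggest a quick consumer-side check of which level produced a given `analysis.json`. A hedged Python sketch: it assumes sections that were not computed are absent or empty, which the contract above does not state explicitly, and the helper names are hypothetical.

```python
GRAPH_SECTIONS = ("call_graph", "system_dependency_graph", "program_graphs")


def sections_present(analysis):
    """Names of the graph sections that are present and non-empty."""
    return [name for name in GRAPH_SECTIONS if analysis.get(name)]


def inferred_level(analysis):
    """Best-effort guess of the analysis level from the sections present.

    Heuristic only: a level-2 run with an empty call graph looks like level 1.
    """
    present = set(sections_present(analysis))
    if present & {"system_dependency_graph", "program_graphs"}:
        return 3
    if "call_graph" in present:
        return 2
    return 1
```

For example, `inferred_level({"symbol_table": {}, "call_graph": [{"type": "CALL_DEP"}]})` evaluates to `2`.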

## Analysis levels

- **Level 1** — symbol table only (JavaParser; no WALA, no build of the target app needed
beyond dependency resolution).
- **Level 2** — plus the WALA graph: entrypoint-seeded RTA call graph over application
classes, and cyclomatic complexity stamped onto symbol-table callables.
- **Level 3** — plus the full system dependency graph from WALA's slicer: method-level
`system_dependency_graph` edges and statement-level `program_graphs` (CFG + PDG per
callable, cross-function CALL/PARAM_IN/PARAM_OUT edges). Data dependence defaults to
no-heap; `--sdg-data-deps=full` widens it. Schema decisions: `.claude/SCHEMA_DECISIONS.md`.
Levels 1/2 output and timings must never be affected by level-3 code.
The analyzer is a **pure graph provider**: client analyses (taint, slicing) live in the
frontend SDKs as reachability queries over this graph — never add them here. Graph
substrate (per-argument PARAM nodes, SUMMARY edges) does belong here.
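To make "reachability queries over this graph" concrete, here is a minimal Python sketch of a forward reachability query of the kind a frontend SDK could run. It is not the SDK's implementation: edges are modeled as plain `(source, target)` pairs of `(signature, node_id)` keys, and the `blocked` set standing in for sanitizers is an assumption made for illustration.

```python
from collections import defaultdict, deque


def reachable(edges, sources, blocked=frozenset()):
    """Nodes reachable from `sources` along dependence edges.

    edges:   iterable of (src, dst) pairs; nodes are (signature, node_id) tuples
    blocked: nodes that stop propagation (e.g. sanitizers in a taint query)
    """
    successors = defaultdict(set)
    for src, dst in edges:
        successors[src].add(dst)

    seen = {s for s in sources if s not in blocked}
    queue = deque(seen)
    while queue:
        node = queue.popleft()
        for nxt in successors[node]:
            if nxt not in seen and nxt not in blocked:
                seen.add(nxt)
                queue.append(nxt)
    return seen
```

A taint query is then the intersection of `reachable(edges, sources, sanitizers)` with the sink set, and a forward slice is the reachable set itself. Both need only the emitted graph, which is why they can live in the SDK.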

## Tests

- Unit/integration tests in `src/test/java`; fixture apps in
`src/test/resources/test-applications/` (daytrader8 is the big end-to-end fixture; the
small ones each pin a regression — build-tool quirks, records, init blocks, generics
signature collisions, …).
- Container-backed tests (`Neo4jBoltWriterTest`) are opt-in: `RUN_CONTAINER_TESTS=1 ./gradlew test`.

## Conventions

- Work is issue-driven: GitHub issue → branch named `minor/issue-<N>-<slug>` (or
`major|patch/` by semver impact) → PR to `main`.
- Spotless formatting is enforced at compile time; Lombok for entities; logs via `utils/Log`,
never `System.out` (stdout is the JSON channel).
- The `dist/`, `node_modules/`, `.astro/` dirs at the repo root are packaging/website
artifacts — not part of the analyzer build.
58 changes: 49 additions & 9 deletions README.md
@@ -5,7 +5,7 @@ Native WALA implementation of source code analysis tool for Enterprise Java Appl
`codeanalyzer` extracts a comprehensive **symbol table** and **call graph** from Java applications
and emits them either as the canonical `analysis.json`, or as a **Neo4j property graph**
(`--emit neo4j`) — a `graph.cypher` snapshot or a live, incremental push over Bolt. See
[4. Neo4j graph output](#4-neo4j-graph-output).
[5. Neo4j graph output](#5-neo4j-graph-output).

## Quick install

@@ -104,8 +104,14 @@ Analyze java application.
default, the analysis JSON is printed to the console.
-b, --build-cmd=<build> Custom build command. Defaults to auto build.
--no-build Do not build your application (use if already built).
-a, --analysis-level=<n> Level of analysis: 1 (symbol table) or 2 (call graph).
Default: 1. Level 2 adds J_CALLS edges to the graph.
-a, --analysis-level=<n> Level of analysis: 1 (symbol table); 2 (call graph);
3 (full system dependency graph). Default: 1.
Level 2 adds J_CALLS edges to the graph.
--graphs=<sections> Comma-separated program_graphs sections to emit at
analysis level 3: cfg, pdg, sdg. Default: all.
--sdg-data-deps=<d> Depth of the slicer's data dependence at analysis
level 3: no-heap (fast, default) | full
(heap-carried dependence; much slower).
-t, --target-files=<f>... Restrict analysis to specific files (incremental).
--emit=<emit> Output target: json (analysis.json, default) |
neo4j (graph.cypher or live Bolt push) |
@@ -187,13 +193,47 @@ There is a sample application in `src/test/resources/sample_apps/daytrader8/bina

This will print the SDG on the console. Explore other flags to save the output to a JSON file.

## 4. Neo4j graph output
## 4. Full system dependency graph (`-a 3`)

At analysis level 3, `codeanalyzer` builds the **full system dependency graph** — control *and*
data dependence — from WALA's slicer on top of the level-2 RTA call graph, and emits two extra
sections in `analysis.json`:

- **`system_dependency_graph`** — method-level dependence edges in the same shape as
`call_graph` (`source`/`target` callable, `type` = `CONTROL_DEP`/`DATA_DEP`,
`source_kind`/`destination_kind` = the WALA statement kinds, `weight`). This is the field the
CLDK Python SDK's `JApplication.system_dependency_graph` models.
- **`program_graphs`** — statement-level graphs keyed by `(signature, node_id)`: for each
application callable a **CFG** (nodes = SSA instructions with source lines, synthetic
`ENTRY`=0/`EXIT`=last; edges labeled `fallthrough`/`true`/`false`/`switch_case`/`loop_back`/
`exception`/`return`) and a **PDG** (`CDG` + `DDG` edges over the same nodes), plus
cross-function **`sdg_edges`** (`CALL`, `PARAM_IN`, `PARAM_OUT`). Scope the sections with
`--graphs cfg,pdg,sdg`.
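The `(signature, node_id)` keying relies on a fixed numbering inside each callable: `ENTRY` is 0, SSA instructions follow in instruction-index order, and `EXIT` comes last. A small Python sketch of that layout follows. The dict field names are hypothetical, and the `-1` line sentinel comes from the schema decision record rather than from this section.

```python
def number_nodes(instructions):
    """Assign node ids: ENTRY = 0, instructions by iindex, EXIT = last.

    instructions: iterable of (iindex, source_line_or_None) pairs.
    Field names in the returned dicts are illustrative only.
    """
    ordered = sorted(instructions, key=lambda pair: pair[0])
    nodes = [{"node_id": 0, "kind": "ENTRY", "iindex": None, "line": -1}]
    for node_id, (iindex, line) in enumerate(ordered, start=1):
        nodes.append({
            "node_id": node_id,
            "kind": "INSTRUCTION",
            "iindex": iindex,
            "line": -1 if line is None else line,  # -1 when no source position is known
        })
    nodes.append({"node_id": len(nodes), "kind": "EXIT", "iindex": None, "line": -1})
    return nodes
```

Since the ids depend only on instruction order, re-running the analysis on identical content yields the same ids, so edges keyed by `(signature, node_id)` stay comparable across runs.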

```sh
codeanalyzer -i /path/to/project -a 3 -o ./out # full SDG, no-heap data deps
codeanalyzer -i /path/to/project -a 3 --sdg-data-deps=full -o ./out # + heap-carried dependence
```

By default data dependence runs with WALA's `NO_HEAP_NO_EXCEPTIONS`/`NO_EXCEPTIONAL_EDGES`
options (fast); `--sdg-data-deps=full` opts into heap-carried data dependence, which is
substantially slower and only as precise as the RTA builder's type-based pointer analysis.

**Known unsoundness** (documented, unchanged from level 2): reflection is not modeled
(`ReflectionOptions.NONE`), dynamic class loading and JNI are invisible, and dispatch precision
is RTA. `SUMMARY` edges (transitive callee summaries) are not yet emitted. Levels 1 and 2 are
completely unaffected by any of this — nothing SDG-related runs below `-a 3`.

## 5. Neo4j graph output

`codeanalyzer` can project the analysis IR into a [Neo4j](https://neo4j.com/) property graph instead
of `analysis.json`. The graph is a **lossless** projection of the IR: compilation units, types,
callables, fields, parameters, call sites, variables, enum constants, record components,
initialization blocks, CRUD operations/queries, comments, annotations and packages are all
first-class nodes and relationships, and (at `-a 2`) it adds `J_CALLS` edges from the call graph.
first-class nodes and relationships, and it adds `J_CALLS` edges from the call graph (`-a 2`+)
plus `J_CONTROL_DEP`/`J_DATA_DEP`/`J_HEAP_DATA_DEP` edges from the system dependency graph
(`-a 3`). **`--emit neo4j` defaults to the full SDG analysis (`-a 3`)** — pass an explicit
`-a 1`/`-a 2` to dial down.
Every field of the Lombok entity model is represented (scalars as node properties — maps such as a
field's per-variable initializers are kept as a `*_json` property since Neo4j has no map type;
comments are `:JComment` nodes in addition to the convenience `docstring` property).
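The dependence edges use one relationship type per kind because, per the schema decision record, the writers MERGE one relationship per (type, source, destination). The Python sketch below models that keying with a dict to show what a single shared type would lose. `J_DEP` is a hypothetical name for the rejected single-type design, and last-write-wins is an assumed model of the collision, not a statement about the writers' exact Cypher.

```python
def merge_relationships(rows, per_kind=True):
    """Simulate MERGE keyed on (relationship type, src, dst).

    rows: iterable of (src, dst, kind), kind in CONTROL_DEP / DATA_DEP / HEAP_DATA_DEP.
    """
    store = {}
    for src, dst, kind in rows:
        rel_type = f"J_{kind}" if per_kind else "J_DEP"  # J_DEP: hypothetical single type
        # Same key means same relationship: the later row overwrites the earlier one.
        store[(rel_type, src, dst)] = {"kind": kind}
    return store
```

With rows for both a `CONTROL_DEP` and a `DATA_DEP` edge between the same pair of callables, `per_kind=True` keeps two relationships while `per_kind=False` keeps only one.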
@@ -205,7 +245,7 @@ types `J_`-prefixed (e.g. `:JType`, `:JCallable`, `J_CALLS`) so a Java graph can
database with the Python (`Py*`/`PY_*`) and TypeScript (`TS*`/`TS_*`) backends without colliding.
`SCHEMA_VERSION` is stamped onto the `:JApplication` node of every emitted graph.

### 4.1. Cypher snapshot (no database required)
### 5.1. Cypher snapshot (no database required)

```sh
codeanalyzer -i /path/to/project -a 2 --emit neo4j -o ./out
@@ -216,7 +256,7 @@ cypher-shell -u neo4j -p <password> < ./out/graph.cypher
The snapshot is **not** incremental: it creates the constraints, wipes this application's prior
subgraph (scoped to this application only), then `UNWIND … MERGE`-loads the full graph.

### 4.2. Live incremental push over Bolt
### 5.2. Live incremental push over Bolt

```sh
codeanalyzer -i /path/to/project -a 2 --emit neo4j \
@@ -228,14 +268,14 @@ compilation unit's `content_hash`, replaces just the changed units' subgraphs (i
`MERGE` upserts), and — on a full run — prunes units whose source file vanished. Combine with
`--target-files` for a targeted, partial re-push (orphan pruning is then skipped).

### 4.3. Schema contract
### 5.3. Schema contract

```sh
codeanalyzer --emit schema -o ./out # → ./out/schema.neo4j.json (no project analysis needed)
codeanalyzer --emit schema # → prints the contract to stdout
```

### 4.4. Verifying the writers
### 5.4. Verifying the writers

A no-container conformance test (`Neo4jSchemaConformanceTest`) asserts the projector never emits a
label/relationship/property the catalog doesn't declare, and that `schema.neo4j.json` is current. A
2 changes: 1 addition & 1 deletion build.gradle
@@ -82,7 +82,7 @@ dependencies {

implementation 'org.apache.logging.log4j:log4j-api:2.18.0'
implementation 'org.apache.logging.log4j:log4j-core:2.18.0'
def walaVersion = '1.6.7'
def walaVersion = '1.6.10'

compileOnly 'org.projectlombok:lombok:1.18.30'
annotationProcessor 'org.projectlombok:lombok:1.18.30'
44 changes: 43 additions & 1 deletion schema.neo4j.json
@@ -1,5 +1,5 @@
{
"schema_version": "1.0.0",
"schema_version": "1.1.0",
"generator": "codeanalyzer-java",
"marker_labels": [
"JEntrypoint"
@@ -451,6 +451,48 @@
"destination_kind": "string"
}
},
{
"type": "J_CONTROL_DEP",
"from": [
"JCallable"
],
"to": [
"JCallable"
],
"properties": {
"weight": "integer",
"source_kind": "string",
"destination_kind": "string"
}
},
{
"type": "J_DATA_DEP",
"from": [
"JCallable"
],
"to": [
"JCallable"
],
"properties": {
"weight": "integer",
"source_kind": "string",
"destination_kind": "string"
}
},
{
"type": "J_HEAP_DATA_DEP",
"from": [
"JCallable"
],
"to": [
"JCallable"
],
"properties": {
"weight": "integer",
"source_kind": "string",
"destination_kind": "string"
}
},
{
"type": "J_HAS_CRUD_OPERATION",
"from": [