diff --git a/.claude/SCHEMA_DECISIONS.md b/.claude/SCHEMA_DECISIONS.md new file mode 100644 index 0000000..5d8b7dd --- /dev/null +++ b/.claude/SCHEMA_DECISIONS.md @@ -0,0 +1,20 @@ +# Schema decisions (codeanalyzer-java) + +Auditable record of schema-affecting design decisions, in the style of the sibling analyzers' +`.claude/SCHEMA_DECISIONS.md`. Every entry was decided with the maintainer. + +## Level-3 full SDG (issue #171, 2026-07-01) + +| # | Concept | Options considered | **Decision** | Rationale | +|---|---|---|---|---| +| 1 | Level mapping | (a) `-a 2` emits SDG too (pre-Mar-2025 behavior); (b) new `-a 3` | **new `-a 3`**; `-a 2` stays call-graph-only, byte-identical | level-2 perf/output untouched; matches the CLDK level ladder. Follow-up on python-sdk: default to 3, dial down on request | +| 2 | Slicer dependence options | no-heap only; no-heap + knob; full always | **`--sdg-data-deps `, default `no-heap`** (`NO_HEAP_NO_EXCEPTIONS` + `NO_EXCEPTIONAL_EDGES`; `full` = `DataDependenceOptions.FULL` + `ControlDependenceOptions.FULL`) | old fast settings by default; heap dependence is opt-in because it is an order of magnitude slower | +| 3 | Call-graph builder feeding the SDG | RTA; 0-1-CFA conditional; 0-1-CFA always | **RTA** (`Util.makeRTABuilder`), unchanged | fast, proven on fixtures; 0-1-CFA was tried (979b298) and abandoned; adequate for no-heap deps | +| 4 | SDG edge shape | method-level `JGraphEdges` only; statement-level `program_graphs` | **both**: method-level `system_dependency_graph` (zero SDK model changes) **and** statement-level `program_graphs` per the level-3 contract | the SDK's existing `JGraphEdges` field validates today; `program_graphs` is the forward contract the SDK/SCIP indexing adapts to | +| 5 | Node identity in `program_graphs` | AST-node source-span order (contract wording) | **SSA instruction order**: `node_id` 0 = synthetic `ENTRY`, then SSA instructions by `iindex`, last = `EXIT`; source lines from ECJ/CAst positions, `-1` 
sentinel when unavailable | WALA nodes are SSA instructions, not AST nodes; instruction order is deterministic across runs on identical content — the property the contract actually needs | +| 6 | CFG edge kinds | full shared vocabulary | shared vocabulary with documented approximations: `true`/`false` by conditional-branch successor order, `loop_back` when target iindex < source iindex, `exception` from WALA exceptional successors, else `fallthrough`/`return` | WALA's SSACFG doesn't label edges; these derivations are deterministic and recorded here rather than invented ad hoc | +| 7 | Cross-function `sdg_edges` | full HRB vocabulary | `CALL`, `PARAM_IN`, `PARAM_OUT` now; **no `SUMMARY` edges yet** (follow-up) | WALA computes HRB summaries lazily inside `Slicer`; exposing them is real extra work and not needed for the graph itself | +| 8 | Precision posture | — | sound-leaning, over-approximate; `ReflectionOptions.NONE` (unchanged); application classes only (`GraphSlicer.prune`) | matches level 2; documented unsoundness, not silently absorbed | +| 9 | Neo4j projection of the SDG | one type + `type` property; per-kind types | **per-kind relationship types `J_CONTROL_DEP`/`J_DATA_DEP`/`J_HEAP_DATA_DEP`** (`JCallable`→`JCallable`, props `weight`/`source_kind`/`destination_kind`, same resolved-gating as `J_CALLS`); schema 1.0.0 → 1.1.0 (additive) | the writers MERGE one relationship per (type, src, dst), so a pair with both control and data dependence would lose one with a single type; WALA's `Dependency` enum is closed (exactly these three), so the vocabulary is total. Statement-level CPG (`CFGNode` etc.) 
stays a follow-up | +| 10 | Neo4j default analysis level | keep 1; default 3 | **`--emit neo4j` defaults to `-a 3`** (an explicit `-a` still wins) | the graph is the consumer that wants the full SDG; python-sdk mirrors the same contract (python-sdk#228) | +| 11 | Client analyses (taint, slicing) | analyzer-side `taint_flows` section; frontend-side queries | **frontend-side only**: the analyzer emits the full universal graph (`program_graphs` + `system_dependency_graph`) and never runs client analyses; taint/slicing are reachability queries over that graph in the SDK (python-sdk#228) | keeps the analyzer a pure graph provider; model packs (sources/sinks/sanitizers) evolve at SDK speed without analyzer releases. Graph *substrate* additions (per-argument PARAM nodes, `SUMMARY` edges) remain analyzer-side — they are part of the graph, not clients | diff --git a/.gitignore b/.gitignore index c4e7b66..311103e 100644 --- a/.gitignore +++ b/.gitignore @@ -196,3 +196,10 @@ gradle-app.setting bin/ etc/ /src/test/resources/sample_apps/daytrader8/output/ + +# Un-ignore the agent guide + schema decision record past a global gitignore that excludes these +!CLAUDE.md +!AGENTS.md +!.claude/ +.claude/* +!.claude/SCHEMA_DECISIONS.md diff --git a/AGENTS.md b/AGENTS.md new file mode 120000 index 0000000..681311e --- /dev/null +++ b/AGENTS.md @@ -0,0 +1 @@ +CLAUDE.md \ No newline at end of file diff --git a/CLAUDE.md b/CLAUDE.md new file mode 100644 index 0000000..b94af8e --- /dev/null +++ b/CLAUDE.md @@ -0,0 +1,123 @@ +# codeanalyzer-java + +The CLDK Java analyzer. Parses an enterprise Java project with +[JavaParser](https://javaparser.org/) (symbol table) and [WALA](https://github.com/wala/WALA) +(call graph / system dependency graph) and emits the **canonical CLDK `analysis.json`** — a +symbol table plus a dependency graph — so the [CLDK Python SDK](../python-sdk) can consume it +via `CLDK(language="java").analysis(...)`. 
It can alternatively project the same IR into a +**Neo4j property graph** (`--emit neo4j`). + +It is the Java sibling of `codeanalyzer-python` and `codeanalyzer-typescript`. + +## Requirements + +- Java 11+ to run the jar; Java 17+ (Semeru or similar) to build. GraalVM 21+ only for + `nativeCompile`. Install via [SDKMan!](https://sdkman.io). +- Gradle via the checked-in wrapper (`./gradlew`) — never a system Gradle. + +## Build / test / run + +```bash +./gradlew fatJar # → build/libs/codeanalyzer-.jar (the deliverable) +./gradlew test # JUnit 5; Testcontainers suites need RUN_CONTAINER_TESTS=1 + Docker/Podman +./gradlew spotlessApply # formatting (runs automatically before compileJava) +./gradlew nativeCompile -PbinDir=$HOME/.local/bin # optional GraalVM native binary + +java -jar build/libs/codeanalyzer-*.jar -i -a 2 -o +``` + +Version lives in `gradle.properties` (bump with `./gradlew bumpVersion -PbumpType=patch|minor|major`). +Releases are tag-triggered via GitHub Actions (`.github/workflows/`); a lockstep job releases the +thin PyPI distribution (`packaging/pypi/`). + +## CLI + +``` +codeanalyzer -i [options] + + -i, --input project root to analyze + -s, --source-analysis analyze a single string of Java source instead of a project + -o, --output write /analysis.json (omit ⇒ JSON to stdout) + -a, --analysis-level <1|2|3> 1 = symbol table (default); 2 = + RTA call graph; + 3 = + full system dependency graph (WALA slicer) + --graphs program_graphs sections to emit at -a 3 (default all) + --sdg-data-deps no-heap (default) | full — slicer data-dependence depth at -a 3 + -b, --build-cmd custom build command (default: auto-detect mvn/gradle) + --no-build skip building the target app (use if already built) + -t, --target-files ... 
restrict analysis to specific files (incremental) + --emit output target (default json) + --app-name / --neo4j-* Neo4j anchor name and Bolt connection (see README §5) + -v, --verbose logs to console +``` + +stdout is a clean JSON channel when `-o` is omitted; diagnostics go through `utils/Log`. + +## Architecture (`src/main/java/com/ibm/cldk/`) + +- `CodeAnalyzer.java` — picocli entrypoint; orchestrates symbol table → graph → emitter. +- `SymbolTable.java` — JavaParser + symbol solver; builds `Map`. +- `SystemDependencyGraph.java` — WALA-based graph construction: `ScopeUtils` builds the + analysis scope (ECJ/CAst source-level front end), `AnalysisUtils.getEntryPoints` seeds + entrypoints, then an RTA call-graph build; edges are serialized from a JGraphT graph. +- `entities/` — the Lombok data model that **is** the `analysis.json` schema + (`JavaCompilationUnit`, `Type`, `Callable`, `CallSite`, `CallEdge`, `SystemDepEdge`, …). + Schema changes here must stay in lockstep with the Python SDK's `cldk.models.java` models. +- `javaee/` — Jakarta/Java-EE entrypoint detection helpers. +- `neo4j/` — the property-graph projection: `GraphProjector` (IR → rows), `CypherWriter` + (snapshot), `BoltWriter` (live incremental push; loaded reflectively via `BoltSink` so the + native image prunes the driver), `SchemaCatalog` (`schema.neo4j.json` contract). +- `utils/` — scope/build helpers (`BuildProject` auto-builds the target app), logging. + +## Output contract + +```jsonc +{ + "symbol_table": { "": JavaCompilationUnit, ... }, + "call_graph": [ { "source": {...}, "target": {...}, "type": "CALL_DEP", "weight": ... } ], // -a 2+ + "system_dependency_graph": [ ... ], // -a 3: method-level CONTROL_DEP/DATA_DEP edges (JGraphEdges shape) + "program_graphs": { "schema_version": ..., "functions": { "": { "cfg": ..., "pdg": ... } }, + "sdg_edges": [ ... 
] }, // -a 3: statement-level, keyed by (signature, node_id) + "version": "" +} +``` + +- Callable signatures are Java method signatures (`` for constructors); call-graph edge + endpoints must always resolve to a real symbol-table `Callable` — no dangling edges. Callables + discovered only by WALA (e.g. compiler-generated) are back-filled into the symbol table via + `createAndPutNewCallableInSymbolTable`. +- Neo4j labels are `J`-prefixed (`:JType`, `J_CALLS`, `J_DATA_DEP`) so Java/Python/TS graphs + can share a database. The contract is `schema.neo4j.json`; `Neo4jSchemaConformanceTest` keeps + the projector and the contract in sync — regenerate it when the model changes. + `--emit neo4j` defaults to the full SDG analysis (`-a 3`); an explicit `-a` dials down. + +## Analysis levels + +- **Level 1** — symbol table only (JavaParser; no WALA, no build of the target app needed + beyond dependency resolution). +- **Level 2** — plus the WALA graph: entrypoint-seeded RTA call graph over application + classes, and cyclomatic complexity stamped onto symbol-table callables. +- **Level 3** — plus the full system dependency graph from WALA's slicer: method-level + `system_dependency_graph` edges and statement-level `program_graphs` (CFG + PDG per + callable, cross-function CALL/PARAM_IN/PARAM_OUT edges). Data dependence defaults to + no-heap; `--sdg-data-deps=full` widens it. Schema decisions: `.claude/SCHEMA_DECISIONS.md`. + Levels 1/2 output and timings must never be affected by level-3 code. + The analyzer is a **pure graph provider**: client analyses (taint, slicing) live in the + frontend SDKs as reachability queries over this graph — never add them here. Graph + substrate (per-argument PARAM nodes, SUMMARY edges) does belong here. 
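As an illustration of what "reachability queries over this graph" means for a frontend client, here is a minimal sketch. It is not analyzer code and not the SDK's implementation: the class name `ReachabilitySketch`, the method `reachableFrom`, and the reduction of each method-level edge to a plain `(source, target)` signature pair are all assumptions made for the example.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical frontend-side helper; edges are simplified to {source, target} signature pairs.
public class ReachabilitySketch {

    /** Returns every signature reachable from start by following one or more edges. */
    static Set<String> reachableFrom(String start, List<String[]> edges) {
        // Build an adjacency list: source signature -> target signatures.
        Map<String, List<String>> successors = new LinkedHashMap<>();
        for (String[] edge : edges) {
            successors.computeIfAbsent(edge[0], k -> new ArrayList<>()).add(edge[1]);
        }
        // Breadth-first traversal; the seen set also guards against cycles.
        Set<String> seen = new LinkedHashSet<>();
        Deque<String> work = new ArrayDeque<>();
        work.add(start);
        while (!work.isEmpty()) {
            String current = work.poll();
            for (String next : successors.getOrDefault(current, List.of())) {
                if (seen.add(next)) {
                    work.add(next);
                }
            }
        }
        return seen;
    }

    public static void main(String[] args) {
        // Made-up signatures standing in for system_dependency_graph endpoints.
        List<String[]> edges = List.of(
                new String[] {"A.source()", "B.transform(String)"},
                new String[] {"B.transform(String)", "C.sink(String)"},
                new String[] {"D.unrelated()", "C.sink(String)"});
        System.out.println(reachableFrom("A.source()", edges));
    }
}
```

A taint-style client would presumably seed the traversal from source callables and check whether any sink appears in the result. That logic belongs in the SDK, as the paragraph above states.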
+ +## Tests + +- Unit/integration tests in `src/test/java`; fixture apps in + `src/test/resources/test-applications/` (daytrader8 is the big end-to-end fixture; the + small ones each pin a regression — build-tool quirks, records, init blocks, generics + signature collisions, …). +- Container-backed tests (`Neo4jBoltWriterTest`) are opt-in: `RUN_CONTAINER_TESTS=1 ./gradlew test`. + +## Conventions + +- Work is issue-driven: GitHub issue → branch named `minor/issue--` (or + `major|patch/` by semver impact) → PR to `main`. +- Spotless formatting is enforced at compile time; Lombok for entities; logs via `utils/Log`, + never `System.out` (stdout is the JSON channel). +- The `dist/`, `node_modules/`, `.astro/` dirs at the repo root are packaging/website + artifacts — not part of the analyzer build. diff --git a/README.md b/README.md index e51c7fd..413e63c 100644 --- a/README.md +++ b/README.md @@ -5,7 +5,7 @@ Native WALA implementation of source code analysis tool for Enterprise Java Appl `codeanalyzer` extracts a comprehensive **symbol table** and **call graph** from Java applications and emits them either as the canonical `analysis.json`, or as a **Neo4j property graph** (`--emit neo4j`) — a `graph.cypher` snapshot or a live, incremental push over Bolt. See -[§4. Neo4j graph output](#4-neo4j-graph-output). +[§5. Neo4j graph output](#5-neo4j-graph-output). ## Quick install @@ -104,8 +104,14 @@ Analyze java application. default, the analysis JSON is printed to the console. -b, --build-cmd= Custom build command. Defaults to auto build. --no-build Do not build your application (use if already built). - -a, --analysis-level= Level of analysis: 1 (symbol table) or 2 (call graph). - Default: 1. Level 2 adds J_CALLS edges to the graph. + -a, --analysis-level= Level of analysis: 1 (symbol table); 2 (call graph); + 3 (full system dependency graph). Default: 1. + Level 2 adds J_CALLS edges to the graph. 
+ --graphs= Comma-separated program_graphs sections to emit at + analysis level 3: cfg, pdg, sdg. Default: all. + --sdg-data-deps= Depth of the slicer's data dependence at analysis + level 3: no-heap (fast, default) | full + (heap-carried dependence; much slower). -t, --target-files=... Restrict analysis to specific files (incremental). --emit= Output target: json (analysis.json, default) | neo4j (graph.cypher or live Bolt push) | @@ -187,13 +193,47 @@ There is a sample application in `src/test/resources/sample_apps/daytrader8/bina This will produce print the SDG on the console. Explore other flags to save the output to a JSON. -## 4. Neo4j graph output +## 4. Full system dependency graph (`-a 3`) + +At analysis level 3, `codeanalyzer` builds the **full system dependency graph** — control *and* +data dependence — from WALA's slicer on top of the level-2 RTA call graph, and emits two extra +sections in `analysis.json`: + +- **`system_dependency_graph`** — method-level dependence edges in the same shape as + `call_graph` (`source`/`target` callable, `type` = `CONTROL_DEP`/`DATA_DEP`, + `source_kind`/`destination_kind` = the WALA statement kinds, `weight`). This is the field the + CLDK Python SDK's `JApplication.system_dependency_graph` models. +- **`program_graphs`** — statement-level graphs keyed by `(signature, node_id)`: for each + application callable a **CFG** (nodes = SSA instructions with source lines, synthetic + `ENTRY`=0/`EXIT`=last; edges labeled `fallthrough`/`true`/`false`/`switch_case`/`loop_back`/ + `exception`/`return`) and a **PDG** (`CDG` + `DDG` edges over the same nodes), plus + cross-function **`sdg_edges`** (`CALL`, `PARAM_IN`, `PARAM_OUT`). Scope the sections with + `--graphs cfg,pdg,sdg`. 
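The node numbering described in the `program_graphs` bullet can be sketched as follows. This is a simplified, self-contained illustration of the scheme (`ENTRY` = 0, instructions in index order, `EXIT` = last), assuming a plain `Object[]` in place of WALA's instruction array, where empty slots are `null`. The class and method names are hypothetical.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in: a plain Object[] plays the role of the SSA instruction array.
public class NodeIdSketch {

    /** Maps each non-null instruction index to a node id, starting at 1 (0 is reserved for ENTRY). */
    static Map<Integer, Integer> assignNodeIds(Object[] instructions) {
        Map<Integer, Integer> iindexToNode = new LinkedHashMap<>();
        int nextId = 1;
        for (int iindex = 0; iindex < instructions.length; iindex++) {
            if (instructions[iindex] != null) {
                iindexToNode.put(iindex, nextId++);
            }
        }
        return iindexToNode;
    }

    /** EXIT takes the id immediately after the last instruction node. */
    static int exitNodeId(Map<Integer, Integer> iindexToNode) {
        return iindexToNode.size() + 1;
    }

    public static void main(String[] args) {
        // Index 1 is an empty slot and receives no node id.
        Object[] instructions = {"invoke", null, "conditional_branch", "return"};
        Map<Integer, Integer> ids = assignNodeIds(instructions);
        System.out.println(ids + " exit=" + exitNodeId(ids)); // prints {0=1, 2=2, 3=3} exit=4
    }
}
```

Because ids depend only on instruction order, two runs over identical content should yield the same `(signature, node_id)` keys.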
+ +```sh +codeanalyzer -i /path/to/project -a 3 -o ./out # full SDG, no-heap data deps +codeanalyzer -i /path/to/project -a 3 --sdg-data-deps=full -o ./out # + heap-carried dependence +``` + +By default data dependence runs with WALA's `NO_HEAP_NO_EXCEPTIONS`/`NO_EXCEPTIONAL_EDGES` +options (fast); `--sdg-data-deps=full` opts into heap-carried data dependence, which is +substantially slower and only as precise as the RTA builder's type-based pointer analysis. + +**Known unsoundness** (documented, unchanged from level 2): reflection is not modeled +(`ReflectionOptions.NONE`), dynamic class loading and JNI are invisible, and dispatch precision +is RTA. `SUMMARY` edges (transitive callee summaries) are not yet emitted. Levels 1 and 2 are +completely unaffected by any of this — nothing SDG-related runs below `-a 3`. + +## 5. Neo4j graph output `codeanalyzer` can project the analysis IR into a [Neo4j](https://neo4j.com/) property graph instead of `analysis.json`. The graph is a **lossless** projection of the IR: compilation units, types, callables, fields, parameters, call sites, variables, enum constants, record components, initialization blocks, CRUD operations/queries, comments, annotations and packages are all -first-class nodes and relationships, and (at `-a 2`) it adds `J_CALLS` edges from the call graph. +first-class nodes and relationships, and it adds `J_CALLS` edges from the call graph (`-a 2`+) +plus `J_CONTROL_DEP`/`J_DATA_DEP`/`J_HEAP_DATA_DEP` edges from the system dependency graph +(`-a 3`). **`--emit neo4j` defaults to the full SDG analysis (`-a 3`)** — pass an explicit +`-a 1`/`-a 2` to dial down. Every field of the Lombok entity model is represented (scalars as node properties — maps such as a field's per-variable initializers are kept as a `*_json` property since Neo4j has no map type; comments are `:JComment` nodes in addition to the convenience `docstring` property). @@ -205,7 +245,7 @@ types `J_`-prefixed (e.g. 
`:JType`, `:JCallable`, `J_CALLS`) so a Java graph can database with the Python (`Py*`/`PY_*`) and TypeScript (`TS*`/`TS_*`) backends without colliding. `SCHEMA_VERSION` is stamped onto the `:JApplication` node of every emitted graph. -### 4.1. Cypher snapshot (no database required) +### 5.1. Cypher snapshot (no database required) ```sh codeanalyzer -i /path/to/project -a 2 --emit neo4j -o ./out @@ -216,7 +256,7 @@ cypher-shell -u neo4j -p < ./out/graph.cypher The snapshot is **not** incremental: it constraints, scopes-wipes this application's prior subgraph, then `UNWIND … MERGE`-loads the full truth. -### 4.2. Live incremental push over Bolt +### 5.2. Live incremental push over Bolt ```sh codeanalyzer -i /path/to/project -a 2 --emit neo4j \ @@ -228,14 +268,14 @@ compilation unit's `content_hash`, replaces just the changed units' subgraphs (i `MERGE` upserts), and — on a full run — prunes units whose source file vanished. Combine with `--target-files` for a targeted, partial re-push (orphan pruning is then skipped). -### 4.3. Schema contract +### 5.3. Schema contract ```sh codeanalyzer --emit schema -o ./out # → ./out/schema.neo4j.json (no project analysis needed) codeanalyzer --emit schema # → prints the contract to stdout ``` -### 4.4. Verifying the writers +### 5.4. Verifying the writers A no-container conformance test (`Neo4jSchemaConformanceTest`) asserts the projector never emits a label/relationship/property the catalog doesn't declare, and that `schema.neo4j.json` is current. 
A diff --git a/build.gradle b/build.gradle index 2189bfe..e842798 100644 --- a/build.gradle +++ b/build.gradle @@ -82,7 +82,7 @@ dependencies { implementation 'org.apache.logging.log4j:log4j-api:2.18.0' implementation 'org.apache.logging.log4j:log4j-core:2.18.0' - def walaVersion = '1.6.7' + def walaVersion = '1.6.10' compileOnly 'org.projectlombok:lombok:1.18.30' annotationProcessor 'org.projectlombok:lombok:1.18.30' diff --git a/schema.neo4j.json b/schema.neo4j.json index 09fff6b..b56ef64 100644 --- a/schema.neo4j.json +++ b/schema.neo4j.json @@ -1,5 +1,5 @@ { - "schema_version": "1.0.0", + "schema_version": "1.1.0", "generator": "codeanalyzer-java", "marker_labels": [ "JEntrypoint" @@ -451,6 +451,48 @@ "destination_kind": "string" } }, + { + "type": "J_CONTROL_DEP", + "from": [ + "JCallable" + ], + "to": [ + "JCallable" + ], + "properties": { + "weight": "integer", + "source_kind": "string", + "destination_kind": "string" + } + }, + { + "type": "J_DATA_DEP", + "from": [ + "JCallable" + ], + "to": [ + "JCallable" + ], + "properties": { + "weight": "integer", + "source_kind": "string", + "destination_kind": "string" + } + }, + { + "type": "J_HEAP_DATA_DEP", + "from": [ + "JCallable" + ], + "to": [ + "JCallable" + ], + "properties": { + "weight": "integer", + "source_kind": "string", + "destination_kind": "string" + } + }, { "type": "J_HAS_CRUD_OPERATION", "from": [ diff --git a/src/main/java/com/ibm/cldk/CodeAnalyzer.java b/src/main/java/com/ibm/cldk/CodeAnalyzer.java index e475003..c37c714 100644 --- a/src/main/java/com/ibm/cldk/CodeAnalyzer.java +++ b/src/main/java/com/ibm/cldk/CodeAnalyzer.java @@ -34,8 +34,11 @@ import java.nio.file.Files; import java.nio.file.Path; import java.nio.file.Paths; +import java.util.Arrays; +import java.util.LinkedHashSet; import java.util.List; import java.util.Map; +import java.util.Set; import java.util.stream.Collectors; import org.apache.commons.lang3.tuple.Pair; import picocli.CommandLine; @@ -86,9 +89,20 @@ public class 
CodeAnalyzer implements Runnable { public static String projectRootPom; @Option(names = { "-a", - "--analysis-level" }, description = "Level of analysis to perform. Options: 1 (for just symbol table); 2 (for call graph). Default: 1") + "--analysis-level" }, description = "Level of analysis to perform. Options: 1 (for just symbol table); 2 (for call graph); 3 (for the full system dependency graph). Default: 1, or 3 when --emit neo4j.") + private static Integer analysisLevelOption; + + /** Resolved analysis level: an explicit -a wins; --emit neo4j defaults to the full SDG (3); else 1. */ public static int analysisLevel = 1; + @Option(names = { + "--graphs" }, description = "Comma-separated program_graphs sections to emit at analysis level 3: cfg, pdg, sdg. Default: all.") + private static String graphs; + + @Option(names = { + "--sdg-data-deps" }, description = "Depth of the slicer's data dependence at analysis level 3: no-heap (fast, default) | full (heap-carried dependence; much slower).") + private static String sdgDataDeps = "no-heap"; + @Option(names = { "--include-test-classes" }, hidden = true, description = "Print logs to console.") public static boolean includeTestClasses = false; @@ -156,8 +170,44 @@ public void run() { } } + /** Fails flag validation: the message must reach the user regardless of verbosity. */ + private static void invalidFlag(String message) { + System.err.println("ERROR: " + message); + System.exit(2); + } + + /** Validates flag values per the CLI contract: unknown values exit non-zero, never fall back. */ + private static Set validateFlags() { + if (analysisLevel < 1 || analysisLevel > 3) { + invalidFlag("Invalid --analysis-level: " + analysisLevel + ". Valid values: 1, 2, 3."); + } + if (!"no-heap".equals(sdgDataDeps) && !"full".equals(sdgDataDeps)) { + invalidFlag("Invalid --sdg-data-deps: " + sdgDataDeps + ". 
Valid values: no-heap, full."); + } + Set validSections = new LinkedHashSet<>(Arrays.asList("cfg", "pdg", "sdg")); + if (graphs == null) { + return validSections; + } + if (analysisLevel < 3) { + invalidFlag("--graphs requires --analysis-level 3."); + } + Set requested = new LinkedHashSet<>(); + for (String section : graphs.split(",")) { + String trimmed = section.trim().toLowerCase(); + if (!validSections.contains(trimmed)) { + invalidFlag("Invalid --graphs section: '" + trimmed + "'. Valid values: cfg, pdg, sdg."); + } + requested.add(trimmed); + } + return requested; + } + private static void analyze() throws Exception { + analysisLevel = analysisLevelOption != null ? analysisLevelOption + : ("neo4j".equalsIgnoreCase(emit) ? 3 : 1); + Set graphSections = validateFlags(); + // The Neo4j schema contract is a static artifact — no project analysis required. if ("schema".equalsIgnoreCase(emit)) { Neo4jEmitter.emitSchema(output); @@ -245,8 +295,13 @@ private static void analyze() throws Exception { build = build == null ? "auto" : build; // Is noBuild is true, we will not build the project build = noBuild ? null : build; - List sdgEdges = SystemDependencyGraph.construct(input, dependencies, build); - combinedJsonObject.add("call_graph", gson.toJsonTree(sdgEdges)); + SystemDependencyGraph.Result walaResult = SystemDependencyGraph.construct( + input, dependencies, build, analysisLevel >= 3, sdgDataDeps, graphSections); + combinedJsonObject.add("call_graph", gson.toJsonTree(walaResult.getCallEdges())); + if (analysisLevel >= 3) { + combinedJsonObject.add("system_dependency_graph", gson.toJsonTree(walaResult.getSdgEdges())); + combinedJsonObject.add("program_graphs", gson.toJsonTree(walaResult.getProgramGraphs())); + } } } // Cleanup library dependencies directory @@ -258,6 +313,9 @@ private static void analyze() throws Exception { JsonArray callGraph = combinedJsonObject.has("call_graph") ? 
combinedJsonObject.getAsJsonArray("call_graph") : null; + JsonArray systemDependencyGraph = combinedJsonObject.has("system_dependency_graph") + ? combinedJsonObject.getAsJsonArray("system_dependency_graph") + : null; // Connection options resolve with precedence: CLI flag > NEO4J_* env var > default. String uri = firstNonEmpty(neo4jUri, System.getenv("NEO4J_URI")); BoltConfig bolt = uri == null @@ -266,7 +324,8 @@ private static void analyze() throws Exception { firstNonEmpty(neo4jUser, System.getenv("NEO4J_USERNAME"), "neo4j"), firstNonEmpty(neo4jPassword, System.getenv("NEO4J_PASSWORD"), "neo4j"), firstNonEmpty(neo4jDatabase, System.getenv("NEO4J_DATABASE"))); - Neo4jEmitter.emit(symbolTable, callGraph, appName, input, output, targetFiles != null, bolt); + Neo4jEmitter.emit(symbolTable, callGraph, systemDependencyGraph, appName, input, output, + targetFiles != null, bolt); return; } diff --git a/src/main/java/com/ibm/cldk/SystemDependencyGraph.java b/src/main/java/com/ibm/cldk/SystemDependencyGraph.java index 066f46a..d084048 100644 --- a/src/main/java/com/ibm/cldk/SystemDependencyGraph.java +++ b/src/main/java/com/ibm/cldk/SystemDependencyGraph.java @@ -13,7 +13,6 @@ package com.ibm.cldk; -import static com.ibm.cldk.CodeAnalyzer.analysisLevel; import static com.ibm.cldk.utils.AnalysisUtils.*; import com.ibm.cldk.entities.*; @@ -23,8 +22,7 @@ import com.ibm.wala.cast.ir.ssa.AstIRFactory; import com.ibm.wala.cast.java.translator.jdt.ecj.ECJClassLoaderFactory; import com.ibm.wala.classLoader.CallSiteReference; -import com.ibm.wala.classLoader.JavaLanguage; -import com.ibm.wala.classLoader.Language; +import com.ibm.wala.classLoader.IMethod; import com.ibm.wala.ipa.callgraph.*; import com.ibm.wala.ipa.callgraph.AnalysisOptions.ReflectionOptions; import com.ibm.wala.ipa.callgraph.impl.Util; @@ -33,20 +31,27 @@ import com.ibm.wala.ipa.cha.ClassHierarchyFactory; import com.ibm.wala.ipa.cha.IClassHierarchy; import com.ibm.wala.ipa.modref.ModRef; -import 
com.ibm.wala.ipa.slicer.MethodEntryStatement; +import com.ibm.wala.ipa.slicer.HeapStatement; import com.ibm.wala.ipa.slicer.SDG; import com.ibm.wala.ipa.slicer.Slicer; import com.ibm.wala.ipa.slicer.Statement; +import com.ibm.wala.ipa.slicer.StatementWithInstructionIndex; +import com.ibm.wala.ssa.IR; +import com.ibm.wala.ssa.ISSABasicBlock; +import com.ibm.wala.ssa.SSAAbstractInvokeInstruction; +import com.ibm.wala.ssa.SSACFG; +import com.ibm.wala.ssa.SSAConditionalBranchInstruction; +import com.ibm.wala.ssa.SSAInstruction; +import com.ibm.wala.ssa.SSANewInstruction; +import com.ibm.wala.ssa.SSAReturnInstruction; +import com.ibm.wala.ssa.SSASwitchInstruction; +import com.ibm.wala.ssa.SSAThrowInstruction; import com.ibm.wala.types.ClassLoaderReference; -import com.ibm.wala.util.collections.HashMapFactory; import com.ibm.wala.util.graph.Graph; import com.ibm.wala.util.graph.GraphSlicer; -import com.ibm.wala.util.graph.traverse.DFS; import java.io.IOException; import java.io.PrintStream; import java.util.*; -import java.util.function.BiFunction; -import java.util.function.Supplier; import java.util.stream.Collectors; import lombok.Data; import lombok.EqualsAndHashCode; @@ -94,10 +99,46 @@ public CallDependency(CallableVertex source, CallableVertex target, AbstractGrap } /** - * The type Sdg 2 json. + * Builds the WALA-based dependency graphs: the RTA call graph (analysis level 2) and, at analysis + * level 3, the full system dependency graph from WALA's slicer — a method-level + * {@code system_dependency_graph} edge list plus the statement-level {@code program_graphs} + * section (per-callable CFG + PDG keyed by (signature, node_id), and cross-function SDG edges). */ public class SystemDependencyGraph { + /** The result of a WALA analysis run; sdgEdges/programGraphs are null below level 3. 
*/ + @Data + public static class Result { + private final List callEdges; + private final List sdgEdges; + private final ProgramGraphs programGraphs; + } + + /** + * Per-callable bookkeeping that maps WALA SSA instruction indexes onto the stable + * (signature, node_id) identity space: ENTRY = 0, instructions in iindex order, EXIT = last. + */ + private static class MethodNodeIndex { + final String fqSignature; + final Map callable; // symbol-table vertex map (filePath/typeDeclaration/...) + final Map iindexToNode = new LinkedHashMap<>(); + final int exitNode; + + MethodNodeIndex(IMethod method, IR ir) { + this.callable = Optional.ofNullable(getCallableFromSymbolTable(method)) + .orElseGet(() -> createAndPutNewCallableInSymbolTable(method)); + this.fqSignature = callable.get("typeDeclaration") + "." + callable.get("signature"); + SSAInstruction[] instructions = ir.getInstructions(); + int nextId = 1; + for (int i = 0; i < instructions.length; i++) { + if (instructions[i] != null) { + iindexToNode.put(i, nextId++); + } + } + this.exitNode = nextId; + } + } + /** * Get a JGraphT graph exporter to save graph as JSON. * @@ -157,20 +198,24 @@ private static org.jgrapht.Graph buildOnlyCal } /** - * Construct a System Dependency Graph from a given input. + * Construct the WALA dependency graphs for a given input. 
* * @param input the input * @param dependencies the dependencies * @param build The build options - * @return A List of triples containing the source, destination, and edge type + * @param buildFullSdg also build the full slicer SDG (analysis level 3) + * @param sdgDataDeps "no-heap" or "full" — how deep slicer data dependence goes + * @param graphs which program_graphs sections to emit (cfg/pdg/sdg) + * @return the call graph edges plus, at level 3, the SDG edge list and program graphs * @throws IOException the io exception * @throws ClassHierarchyException the class hierarchy exception * @throws IllegalArgumentException the illegal argument exception * @throws CallGraphBuilderCancelException the call graph builder cancel * exception */ - public static List construct( - String input, String dependencies, String build) + public static Result construct( + String input, String dependencies, String build, boolean buildFullSdg, String sdgDataDeps, + Set graphs) throws IOException, ClassHierarchyException, IllegalArgumentException, CallGraphBuilderCancelException { // Initialize scope @@ -223,7 +268,7 @@ public static List construct( graph = buildOnlyCallGraph(callGraph); - List edges = graph.edgeSet().stream() + List callEdges = graph.edgeSet().stream() .map(abstractGraphEdge -> { CallableVertex source = graph.getEdgeSource(abstractGraphEdge); CallableVertex target = graph.getEdgeTarget(abstractGraphEdge); @@ -235,6 +280,398 @@ public static List construct( }) .collect(Collectors.toList()); - return edges; + if (!buildFullSdg) { + return new Result(callEdges, null, null); + } + + // ------------------------------------------------------------------ + // Analysis level 3: the full system dependency graph via WALA's slicer. + // ------------------------------------------------------------------ + Slicer.DataDependenceOptions dataOptions = "full".equals(sdgDataDeps) + ? 
Slicer.DataDependenceOptions.FULL + : Slicer.DataDependenceOptions.NO_HEAP_NO_EXCEPTIONS; + Slicer.ControlDependenceOptions controlOptions = "full".equals(sdgDataDeps) + ? Slicer.ControlDependenceOptions.FULL + : Slicer.ControlDependenceOptions.NO_EXCEPTIONAL_EDGES; + + Log.info("Building system dependency graph (data dependence: " + sdgDataDeps + ")."); + start_time = System.currentTimeMillis(); + SDG sdg = new SDG<>( + callGraph, + builder.getPointerAnalysis(), + new ModRef<>(), + dataOptions, + controlOptions); + + // Keep only statements of application classes, matching the call graph's scope. + Graph prunedGraph = GraphSlicer.prune(sdg, + statement -> statement.getNode() + .getMethod() + .getDeclaringClass() + .getClassLoader() + .getReference() + .equals(ClassLoaderReference.Application)); + + // (signature, node_id) index per method; the fake root and methods without IR are skipped. + Map methodIndex = new LinkedHashMap<>(); + ProgramGraphs programGraphs = new ProgramGraphs(); + programGraphs.setDataDependence(sdgDataDeps); + callGraph.forEach(cgNode -> { + IMethod method = cgNode.getMethod(); + IR ir = cgNode.getIR(); + if (ir == null || !AnalysisUtils.isApplicationClass(method.getDeclaringClass()) + || methodIndex.containsKey(method)) { + return; + } + MethodNodeIndex index = new MethodNodeIndex(method, ir); + methodIndex.put(method, index); + ProgramGraphs.FunctionProgramGraph fg = new ProgramGraphs.FunctionProgramGraph(); + fg.setSignature(index.callable.get("signature")); + fg.setTypeDeclaration(index.callable.get("typeDeclaration")); + fg.setFilePath(index.callable.get("filePath")); + if (graphs.contains("cfg")) { + fg.setCfg(buildCfg(method, ir, index)); + } + if (graphs.contains("pdg")) { + fg.setPdg(new ProgramGraphs.Pdg()); + } + programGraphs.getFunctions().put(index.fqSignature, fg); + }); + + // Walk the pruned SDG once: same-method edges become PDG edges (CDG/DDG); cross-method + // edges become method-level SystemDepEdges plus statement-level 
sdg_edges. + Map methodLevelEdges = new LinkedHashMap<>(); + Map methodLevelDeps = new LinkedHashMap<>(); + Set seenPdgEdges = new HashSet<>(); + Set seenSdgEdges = new HashSet<>(); + prunedGraph.forEach(p -> prunedGraph.getSuccNodes(p).forEachRemaining(s -> { + for (String label : edgeLabels(sdg, p, s)) { + if (p.getNode().equals(s.getNode())) { + if (graphs.contains("pdg")) { + addPdgEdge(programGraphs, methodIndex, p, s, label, seenPdgEdges); + } + } else { + collapseToMethodEdge(methodLevelEdges, methodLevelDeps, p, s, label); + if (graphs.contains("sdg")) { + addSdgEdge(programGraphs, methodIndex, p, s, label, seenSdgEdges); + } + } + } + })); + + List sdgEdges = methodLevelDeps.entrySet().stream() + .sorted(Map.Entry.comparingByKey()) + .map(e -> { + // Re-wrap so the serialized weight reflects the final increment count. + SDGDependency dep = (SDGDependency) e.getValue(); + return (Dependency) new SDGDependency(dep.getSource(), dep.getTarget(), + methodLevelEdges.get(e.getKey())); + }) + .collect(Collectors.toList()); + + sortProgramGraphs(programGraphs, graphs); + + Log.done("Finished construction of system dependency graph. Took " + + Math.ceil((double) (System.currentTimeMillis() - start_time) / 1000) + " seconds. " + + programGraphs.getFunctions().size() + " functions, " + + sdgEdges.size() + " method-level edges, " + + programGraphs.getSdgEdges().size() + " cross-function statement edges."); + + return new Result(callEdges, sdgEdges, programGraphs); + } + + /** + * All dependence labels on the edge (a statement pair can carry both CONTROL_DEP and + * DATA_DEP), sorted for determinism; ["UNKNOWN"] when WALA offers none. + */ + private static List edgeLabels(SDG sdg, Statement p, Statement s) { + try { + List labels = sdg.getEdgeLabels(p, s).stream() + .map(String::valueOf) + .sorted() + .collect(Collectors.toList()); + return labels.isEmpty() ? 
Collections.singletonList("UNKNOWN") : labels; + } catch (RuntimeException e) { + return Collections.singletonList("UNKNOWN"); + } + } + + /** + * The CFG over the shared node ids: within-block instructions chain as fallthrough; block + * successors are labeled true/false (conditional branches, fallthrough block = false), + * switch_case, exception (exceptional successors), return (to EXIT), or loop_back (back + * edges). Empty basic blocks are contracted onto their first real successor. + */ + private static ProgramGraphs.Cfg buildCfg(IMethod method, IR ir, MethodNodeIndex index) { + ProgramGraphs.Cfg cfg = new ProgramGraphs.Cfg(); + SSAInstruction[] instructions = ir.getInstructions(); + + ProgramGraphs.Node entry = new ProgramGraphs.Node(); + entry.setId(0); + entry.setKind("entry"); + cfg.getNodes().add(entry); + index.iindexToNode.forEach((iindex, nodeId) -> { + ProgramGraphs.Node node = new ProgramGraphs.Node(); + node.setId(nodeId); + node.setKind(kindOf(instructions[iindex])); + try { + IMethod.SourcePosition pos = method.getSourcePosition(iindex); + if (pos != null) { + node.setStartLine(pos.getFirstLine()); + node.setEndLine(pos.getLastLine()); + } + } catch (Exception e) { + // no debug info for this instruction; keep the -1 sentinel + } + cfg.getNodes().add(node); + }); + ProgramGraphs.Node exit = new ProgramGraphs.Node(); + exit.setId(index.exitNode); + exit.setKind("exit"); + cfg.getNodes().add(exit); + + SSACFG ssacfg = ir.getControlFlowGraph(); + Set seen = new HashSet<>(); + for (ISSABasicBlock block : ssacfg) { + List blockNodes = blockNodeIds(block, instructions, index); + + // Chain the instructions within the block. + for (int i = 0; i + 1 < blockNodes.size(); i++) { + addCfgEdge(cfg, seen, blockNodes.get(i), blockNodes.get(i + 1), "fallthrough"); + } + + int sourceNode = block.equals(ssacfg.entry()) ? 0 + : blockNodes.isEmpty() ? 
-1 : blockNodes.get(blockNodes.size() - 1); + if (sourceNode < 0) { + continue; // empty block: contracted below via resolveFirstNodes of its predecessors + } + SSAInstruction last = blockNodes.isEmpty() ? null + : instructions[lastInstructionIndex(block, instructions)]; + + for (ISSABasicBlock succ : ssacfg.getNormalSuccessors(block)) { + for (int targetNode : resolveFirstNodes(ssacfg, succ, instructions, index, new HashSet<>())) { + addCfgEdge(cfg, seen, sourceNode, targetNode, + normalEdgeKind(block, succ, last, sourceNode, targetNode, index)); + } + } + for (ISSABasicBlock succ : ssacfg.getExceptionalSuccessors(block)) { + for (int targetNode : resolveFirstNodes(ssacfg, succ, instructions, index, new HashSet<>())) { + addCfgEdge(cfg, seen, sourceNode, targetNode, "exception"); + } + } + } + cfg.getEdges().sort(Comparator.comparingInt(ProgramGraphs.CfgEdge::getSource) + .thenComparingInt(ProgramGraphs.CfgEdge::getTarget) + .thenComparing(ProgramGraphs.CfgEdge::getKind)); + return cfg; + } + + private static void addCfgEdge(ProgramGraphs.Cfg cfg, Set seen, int source, int target, String kind) { + if (seen.add(source + ">" + target + ">" + kind)) { + cfg.getEdges().add(new ProgramGraphs.CfgEdge(source, target, kind)); + } + } + + private static String normalEdgeKind(ISSABasicBlock block, ISSABasicBlock succ, SSAInstruction last, + int sourceNode, int targetNode, MethodNodeIndex index) { + if (targetNode == index.exitNode) { + return "return"; + } + if (last instanceof SSAConditionalBranchInstruction) { + // WALA's fallthrough (not-taken) block is the next block in layout order. + return succ.getNumber() == block.getNumber() + 1 ? 
"false" : "true"; + } + if (last instanceof SSASwitchInstruction) { + return "switch_case"; + } + if (targetNode <= sourceNode && sourceNode != 0) { + return "loop_back"; + } + return "fallthrough"; + } + + private static int lastInstructionIndex(ISSABasicBlock block, SSAInstruction[] instructions) { + for (int i = block.getLastInstructionIndex(); i >= block.getFirstInstructionIndex(); i--) { + if (i >= 0 && i < instructions.length && instructions[i] != null) { + return i; + } + } + return -1; + } + + private static List blockNodeIds(ISSABasicBlock block, SSAInstruction[] instructions, + MethodNodeIndex index) { + List nodes = new ArrayList<>(); + for (int i = Math.max(block.getFirstInstructionIndex(), 0); i <= block.getLastInstructionIndex() + && i < instructions.length; i++) { + if (instructions[i] != null && index.iindexToNode.containsKey(i)) { + nodes.add(index.iindexToNode.get(i)); + } + } + return nodes; + } + + /** First real node(s) reachable from a block, contracting empty blocks; EXIT for the exit block. 
*/ + private static List resolveFirstNodes(SSACFG cfg, ISSABasicBlock block, + SSAInstruction[] instructions, MethodNodeIndex index, Set visited) { + if (!visited.add(block.getNumber())) { + return Collections.emptyList(); + } + if (block.equals(cfg.exit())) { + return Collections.singletonList(index.exitNode); + } + List nodes = blockNodeIds(block, instructions, index); + if (!nodes.isEmpty()) { + return Collections.singletonList(nodes.get(0)); + } + List resolved = new ArrayList<>(); + for (ISSABasicBlock succ : cfg.getNormalSuccessors(block)) { + resolved.addAll(resolveFirstNodes(cfg, succ, instructions, index, visited)); + } + return resolved; + } + + private static String kindOf(SSAInstruction instruction) { + if (instruction instanceof SSAAbstractInvokeInstruction) { + return "call"; + } else if (instruction instanceof SSAConditionalBranchInstruction) { + return "branch"; + } else if (instruction instanceof SSASwitchInstruction) { + return "switch"; + } else if (instruction instanceof SSAReturnInstruction) { + return "return"; + } else if (instruction instanceof SSAThrowInstruction) { + return "throw"; + } else if (instruction instanceof SSANewInstruction) { + return "new"; + } + return "instruction"; + } + + /** Maps a slicer statement onto this method's node ids; null when it has no stable anchor. */ + private static Integer statementNode(Statement statement, MethodNodeIndex index) { + switch (statement.getKind()) { + case NORMAL: + case PARAM_CALLER: + case NORMAL_RET_CALLER: + case EXC_RET_CALLER: + case CATCH: + return index.iindexToNode.get(((StatementWithInstructionIndex) statement).getInstructionIndex()); + case METHOD_ENTRY: + case PARAM_CALLEE: + return 0; + case METHOD_EXIT: + case NORMAL_RET_CALLEE: + case EXC_RET_CALLEE: + return index.exitNode; + case HEAP_PARAM_CALLER: + case HEAP_RET_CALLER: + return index.iindexToNode.get(((HeapStatement.HeapParamCaller.class.isInstance(statement)) + ? 
((HeapStatement.HeapParamCaller) statement).getCall() + : ((HeapStatement.HeapReturnCaller) statement).getCall()).iIndex()); + case HEAP_PARAM_CALLEE: + return 0; + case HEAP_RET_CALLEE: + return index.exitNode; + default: // PHI, PI — no stable source anchor + return null; + } + } + + private static void addPdgEdge(ProgramGraphs programGraphs, Map methodIndex, + Statement p, Statement s, String label, Set seen) { + MethodNodeIndex index = methodIndex.get(p.getNode().getMethod()); + if (index == null) { + return; + } + Integer source = statementNode(p, index); + Integer target = statementNode(s, index); + if (source == null || target == null || source.equals(target)) { + return; + } + if (seen.add(index.fqSignature + "#" + source + ">" + target + ">" + label)) { + String type = label.contains("CONTROL") ? "CDG" : "DDG"; + programGraphs.getFunctions().get(index.fqSignature).getPdg().getEdges() + .add(new ProgramGraphs.PdgEdge(source, target, type, label)); + } + } + + private static void addSdgEdge(ProgramGraphs programGraphs, Map methodIndex, + Statement p, Statement s, String label, Set seen) { + MethodNodeIndex sourceIndex = methodIndex.get(p.getNode().getMethod()); + MethodNodeIndex targetIndex = methodIndex.get(s.getNode().getMethod()); + if (sourceIndex == null || targetIndex == null) { + return; + } + Integer source = statementNode(p, sourceIndex); + Integer target = statementNode(s, targetIndex); + if (source == null || target == null) { + return; + } + String type = crossEdgeType(p, s, label); + String key = sourceIndex.fqSignature + "#" + source + ">" + targetIndex.fqSignature + "#" + target + ">" + type; + if (seen.add(key)) { + programGraphs.getSdgEdges().add(new ProgramGraphs.SdgEdge( + new ProgramGraphs.SdgEndpoint(sourceIndex.fqSignature, source), + new ProgramGraphs.SdgEndpoint(targetIndex.fqSignature, target), + type, label)); + } + } + + /** HRB edge vocabulary from WALA's statement kinds: CALL, PARAM_IN, PARAM_OUT. 
*/ + private static String crossEdgeType(Statement p, Statement s, String label) { + switch (s.getKind()) { + case METHOD_ENTRY: + return "CALL"; + case PARAM_CALLEE: + case HEAP_PARAM_CALLEE: + return "PARAM_IN"; + default: + break; + } + switch (p.getKind()) { + case NORMAL_RET_CALLEE: + case EXC_RET_CALLEE: + case METHOD_EXIT: + case HEAP_RET_CALLEE: + return "PARAM_OUT"; + default: + return label.contains("CONTROL") ? "CALL" : "PARAM_IN"; + } + } + + private static void collapseToMethodEdge(Map methodLevelEdges, + Map methodLevelDeps, Statement p, Statement s, String label) { + Map source = Optional.ofNullable(getCallableFromSymbolTable(p.getNode().getMethod())) + .orElseGet(() -> createAndPutNewCallableInSymbolTable(p.getNode().getMethod())); + Map target = Optional.ofNullable(getCallableFromSymbolTable(s.getNode().getMethod())) + .orElseGet(() -> createAndPutNewCallableInSymbolTable(s.getNode().getMethod())); + + String key = source.get("typeDeclaration") + "." + source.get("signature") + ">" + + target.get("typeDeclaration") + "." 
+ target.get("signature") + ">" + label; + SystemDepEdge existing = methodLevelEdges.get(key); + if (existing != null) { + existing.incrementWeight(); + } else { + SystemDepEdge edge = new SystemDepEdge(p, s, label); + methodLevelEdges.put(key, edge); + methodLevelDeps.put(key, + new SDGDependency(new CallableVertex(source), new CallableVertex(target), edge)); + } + } + + private static void sortProgramGraphs(ProgramGraphs programGraphs, Set graphs) { + if (graphs.contains("pdg")) { + programGraphs.getFunctions().values().forEach(fg -> fg.getPdg().getEdges() + .sort(Comparator.comparingInt(ProgramGraphs.PdgEdge::getSource) + .thenComparingInt(ProgramGraphs.PdgEdge::getTarget) + .thenComparing(ProgramGraphs.PdgEdge::getType))); + } + programGraphs.getSdgEdges().sort(Comparator + .comparing((ProgramGraphs.SdgEdge e) -> e.getSource().getSignature()) + .thenComparingInt(e -> e.getSource().getNode()) + .thenComparing(e -> e.getTarget().getSignature()) + .thenComparingInt(e -> e.getTarget().getNode()) + .thenComparing(ProgramGraphs.SdgEdge::getType)); } } diff --git a/src/main/java/com/ibm/cldk/entities/ProgramGraphs.java b/src/main/java/com/ibm/cldk/entities/ProgramGraphs.java new file mode 100644 index 0000000..85d8ddf --- /dev/null +++ b/src/main/java/com/ibm/cldk/entities/ProgramGraphs.java @@ -0,0 +1,129 @@ +/* +Copyright IBM Corporation 2023, 2024 + +Licensed under the Apache Public License 2.0, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. 
+*/ + +package com.ibm.cldk.entities; + +import java.util.ArrayList; +import java.util.List; +import java.util.Map; +import java.util.TreeMap; +import lombok.Data; + +/** + * The statement-level program graphs emitted at analysis level 3 (the CLDK level-3 dataflow + * contract). Functions are keyed by the fully qualified signature + * {@code <typeDeclaration>.<signature>}; every node inside a function is identified by a + * small integer id that is stable across runs on identical content: 0 is the synthetic ENTRY, + * SSA instructions follow in instruction-index order, and the last id is the synthetic EXIT. + * Cross-function edges reference both endpoints by (signature, node). + */ +@Data +public class ProgramGraphs { + private String schemaVersion = "1.0.0"; + /** Which data-dependence options the WALA slicer ran with: "no-heap" or "full". */ + private String dataDependence; + /** Sorted so output is deterministic across runs. */ + private Map<String, FunctionProgramGraph> functions = new TreeMap<>(); + private List<SdgEdge> sdgEdges = new ArrayList<>(); + + /** Per-callable graphs: the CFG over (signature, node_id) nodes and the PDG edges over them. */ + @Data + public static class FunctionProgramGraph { + private String signature; + private String typeDeclaration; + private String filePath; + private Cfg cfg; + private Pdg pdg; + } + + @Data + public static class Cfg { + private List<Node> nodes = new ArrayList<>(); + private List<CfgEdge> edges = new ArrayList<>(); + } + + @Data + public static class Pdg { + private List<PdgEdge> edges = new ArrayList<>(); + } + + @Data + public static class Node { + private int id; + /** entry | exit | call | branch | switch | return | throw | new | instruction */ + private String kind; + /** -1 when no source mapping is available (synthetic nodes, missing debug info). 
*/ + private int startLine = -1; + private int endLine = -1; + } + + @Data + public static class CfgEdge { + private int source; + private int target; + /** fallthrough | true | false | switch_case | loop_back | exception | return */ + private String kind; + + public CfgEdge(int source, int target, String kind) { + this.source = source; + this.target = target; + this.kind = kind; + } + } + + @Data + public static class PdgEdge { + private int source; + private int target; + /** CDG | DDG */ + private String type; + /** The raw WALA dependence label (e.g. CONTROL_DEP, DATA_DEP) for provenance. */ + private String label; + + public PdgEdge(int source, int target, String type, String label) { + this.source = source; + this.target = target; + this.type = type; + this.label = label; + } + } + + /** A cross-function dependence edge; endpoints are (signature, node). */ + @Data + public static class SdgEdge { + private SdgEndpoint source; + private SdgEndpoint target; + /** CALL | PARAM_IN | PARAM_OUT */ + private String type; + /** The raw WALA dependence label for provenance. */ + private String label; + + public SdgEdge(SdgEndpoint source, SdgEndpoint target, String type, String label) { + this.source = source; + this.target = target; + this.type = type; + this.label = label; + } + } + + @Data + public static class SdgEndpoint { + private String signature; + private int node; + + public SdgEndpoint(String signature, int node) { + this.signature = signature; + this.node = node; + } + } +} diff --git a/src/main/java/com/ibm/cldk/neo4j/GraphProjector.java b/src/main/java/com/ibm/cldk/neo4j/GraphProjector.java index a5fae55..9d11551 100644 --- a/src/main/java/com/ibm/cldk/neo4j/GraphProjector.java +++ b/src/main/java/com/ibm/cldk/neo4j/GraphProjector.java @@ -75,6 +75,17 @@ private GraphProjector() {} * @param appName logical application name for the {@code :JApplication} anchor. 
*/ public static GraphRows project(Map symbolTable, JsonArray callGraph, String appName) { + return project(symbolTable, callGraph, null, appName); + } + + /** + * @param symbolTable file path → {@link JavaCompilationUnit} (the {@code symbol_table} map). + * @param callGraph the {@code call_graph} array (level 2), or {@code null} at level 1. + * @param systemDependencyGraph the {@code system_dependency_graph} array (level 3), or {@code null}. + * @param appName logical application name for the {@code :JApplication} anchor. + */ + public static GraphRows project(Map symbolTable, JsonArray callGraph, + JsonArray systemDependencyGraph, String appName) { RowBuilder b = new RowBuilder(); NodeRef app = b.node(Collections.singletonList("JApplication"), "name", appName, @@ -92,6 +103,9 @@ public static GraphRows project(Map symbolTable, Js if (callGraph != null) { projectCallGraph(b, callGraph); } + if (systemDependencyGraph != null) { + projectSystemDependencyGraph(b, systemDependencyGraph); + } return b.finish(); } @@ -445,6 +459,48 @@ private static void projectCallGraph(RowBuilder b, JsonArray callGraph) { } } + // ---------------------------------------------------------------------------------------------- + // System dependency graph (level 3) + // ---------------------------------------------------------------------------------------------- + + /** + * Method-level SDG edges (level 3) project between the same {@code :JCallable} nodes the call + * graph resolves to. The dependence kind rides in the relationship type — WALA's + * closed {@code Dependency} vocabulary maps to {@code J_CONTROL_DEP} / {@code J_DATA_DEP} / + * {@code J_HEAP_DATA_DEP} — because the writers MERGE one relationship per (type, source, + * target): a pair with both a control and a data dependence must keep both edges. 
+ */ + private static void projectSystemDependencyGraph(RowBuilder b, JsonArray systemDependencyGraph) { + for (JsonElement el : systemDependencyGraph) { + if (!el.isJsonObject()) { + continue; + } + JsonObject edge = el.getAsJsonObject(); + String relType = sdgRelType(str(edge, "type")); + String from = vertexId(edge.getAsJsonObject("source")); + String to = vertexId(edge.getAsJsonObject("target")); + if (relType == null || from == null || to == null) { + continue; + } + Map props = RowBuilder.prune(map( + "weight", asLong(parseIntOrNull(str(edge, "weight"))), + "source_kind", str(edge, "source_kind"), + "destination_kind", str(edge, "destination_kind"))); + // Same resolved-gating as J_CALLS: kept only if both callables were emitted as nodes. + b.edgeIfBothResolved(relType, + new NodeRef("JSymbol", "id", from), new NodeRef("JSymbol", "id", to), props); + } + } + + /** WALA's closed dependence vocabulary → relationship type; null (skip) for anything else. */ + private static String sdgRelType(String dependenceType) { + if ("CONTROL_DEP".equals(dependenceType) || "DATA_DEP".equals(dependenceType) + || "HEAP_DATA_DEP".equals(dependenceType)) { + return "J_" + dependenceType; + } + return null; + } + private static String vertexId(JsonObject vertex) { if (vertex == null) { return null; diff --git a/src/main/java/com/ibm/cldk/neo4j/Neo4jEmitter.java b/src/main/java/com/ibm/cldk/neo4j/Neo4jEmitter.java index c022baf..283f00f 100644 --- a/src/main/java/com/ibm/cldk/neo4j/Neo4jEmitter.java +++ b/src/main/java/com/ibm/cldk/neo4j/Neo4jEmitter.java @@ -55,18 +55,20 @@ public static void emitSchema(String output) throws IOException { /** * Project + emit the Neo4j graph. * - * @param symbolTable the {@code symbol_table} map. - * @param callGraph the {@code call_graph} array (level 2), or {@code null}. - * @param appName logical application name (null ⇒ derived from input dir). - * @param input the analyzed project root (used to derive appName + the cypher output dir). 
- * @param output output directory (null ⇒ cwd for the snapshot). - * @param targetFiles non-null when an incremental/targeted run was requested. - * @param bolt non-null ⇒ push to a live DB over Bolt; null ⇒ write graph.cypher. + * @param symbolTable the {@code symbol_table} map. + * @param callGraph the {@code call_graph} array (level 2), or {@code null}. + * @param systemDependencyGraph the {@code system_dependency_graph} array (level 3), or {@code null}. + * @param appName logical application name (null ⇒ derived from input dir). + * @param input the analyzed project root (used to derive appName + the cypher output dir). + * @param output output directory (null ⇒ cwd for the snapshot). + * @param targetedRun true when an incremental/targeted run was requested. + * @param bolt non-null ⇒ push to a live DB over Bolt; null ⇒ write graph.cypher. */ - public static void emit(Map symbolTable, JsonArray callGraph, String appName, + public static void emit(Map symbolTable, JsonArray callGraph, + JsonArray systemDependencyGraph, String appName, String input, String output, boolean targetedRun, BoltConfig bolt) throws IOException { String name = appName != null ? appName : deriveAppName(input); - GraphRows rows = GraphProjector.project(symbolTable, callGraph, name); + GraphRows rows = GraphProjector.project(symbolTable, callGraph, systemDependencyGraph, name); if (bolt != null) { BoltSink sink = loadBoltSink(); diff --git a/src/main/java/com/ibm/cldk/neo4j/SchemaCatalog.java b/src/main/java/com/ibm/cldk/neo4j/SchemaCatalog.java index ba20dda..6a0ba3d 100644 --- a/src/main/java/com/ibm/cldk/neo4j/SchemaCatalog.java +++ b/src/main/java/com/ibm/cldk/neo4j/SchemaCatalog.java @@ -35,7 +35,7 @@ public final class SchemaCatalog { private SchemaCatalog() {} - public static final String SCHEMA_VERSION = "1.0.0"; + public static final String SCHEMA_VERSION = "1.1.0"; /** Labels layered onto a node in addition to its primary/specific label. 
*/ public static final List MARKER_LABELS = Arrays.asList("JEntrypoint"); @@ -221,6 +221,11 @@ private static List buildRelTypes() { r.add(new RelType("J_CALLS", Arrays.asList("JCallable"), Arrays.asList("JCallable"), new P().put("type", "string").put("weight", "integer") .put("source_kind", "string").put("destination_kind", "string").done())); + for (String dep : Arrays.asList("J_CONTROL_DEP", "J_DATA_DEP", "J_HEAP_DATA_DEP")) { + r.add(new RelType(dep, Arrays.asList("JCallable"), Arrays.asList("JCallable"), + new P().put("weight", "integer") + .put("source_kind", "string").put("destination_kind", "string").done())); + } r.add(new RelType("J_HAS_CRUD_OPERATION", Arrays.asList("JCallable", "JCallSite"), Arrays.asList("JCrudOperation"), none)); r.add(new RelType("J_HAS_CRUD_QUERY", Arrays.asList("JCallable", "JCallSite"), diff --git a/src/test/java/com/ibm/cldk/CodeAnalyzerIntegrationTest.java b/src/test/java/com/ibm/cldk/CodeAnalyzerIntegrationTest.java index cd933cc..8d3cd4c 100644 --- a/src/test/java/com/ibm/cldk/CodeAnalyzerIntegrationTest.java +++ b/src/test/java/com/ibm/cldk/CodeAnalyzerIntegrationTest.java @@ -154,6 +154,140 @@ void callGraphShouldHaveKnownEdges() throws Exception { ), "Expected edge not found in the system dependency graph"); } + @Test + void analysisLevelTwoShouldNotEmitSdgSections() throws Exception { + var runCodeAnalyzer = container.execInContainer( + "bash", "-c", + String.format( + "export JAVA_HOME=%s && java -jar /opt/jars/codeanalyzer-%s.jar --input=/test-applications/call-graph-test --analysis-level=2", + javaHomePath, codeanalyzerVersion + ) + ); + Assertions.assertEquals(0, runCodeAnalyzer.getExitCode(), "Command should execute successfully"); + JsonObject jsonObject = new Gson().fromJson(runCodeAnalyzer.getStdout(), JsonObject.class); + Assertions.assertTrue(jsonObject.has("call_graph"), "Level 2 must emit the call graph"); + Assertions.assertFalse(jsonObject.has("system_dependency_graph"), + "Level 2 must not emit the system 
dependency graph"); + Assertions.assertFalse(jsonObject.has("program_graphs"), "Level 2 must not emit program graphs"); + } + + @Test + void fullSystemDependencyGraphShouldBeEmittedAtAnalysisLevelThree() throws Exception { + var runCodeAnalyzer = container.execInContainer( + "bash", "-c", + String.format( + "export JAVA_HOME=%s && java -jar /opt/jars/codeanalyzer-%s.jar --input=/test-applications/call-graph-test --analysis-level=3", + javaHomePath, codeanalyzerVersion + ) + ); + Assertions.assertEquals(0, runCodeAnalyzer.getExitCode(), "Command should execute successfully"); + JsonObject jsonObject = new Gson().fromJson(runCodeAnalyzer.getStdout(), JsonObject.class); + + // --- Method-level SDG edges (the JGraphEdges shape the SDK models) --- + JsonArray systemDependencyGraph = jsonObject.getAsJsonArray("system_dependency_graph"); + Assertions.assertNotNull(systemDependencyGraph, "Level 3 must emit the system dependency graph"); + Assertions.assertTrue(StreamSupport.stream(systemDependencyGraph.spliterator(), false) + .map(JsonElement::getAsJsonObject) + .anyMatch(entry -> + "CONTROL_DEP".equals(entry.get("type").getAsString()) && + entry.getAsJsonObject("source").get("signature").getAsString().equals("log()") && + entry.getAsJsonObject("target").get("signature").getAsString().equals("loglog()") + ), "Expected control dependence log() -> loglog() not found"); + Assertions.assertTrue(StreamSupport.stream(systemDependencyGraph.spliterator(), false) + .map(JsonElement::getAsJsonObject) + .anyMatch(entry -> + "DATA_DEP".equals(entry.get("type").getAsString()) && + "PARAM_CALLER".equals(entry.get("source_kind").getAsString()) && + "PARAM_CALLEE".equals(entry.get("destination_kind").getAsString()) && + entry.getAsJsonObject("source").get("signature").getAsString().equals("helloString()") && + entry.getAsJsonObject("target").get("signature").getAsString().equals("getName()") + ), "Expected data dependence helloString() -> getName() not found"); + + // --- Statement-level 
program graphs --- + JsonObject programGraphs = jsonObject.getAsJsonObject("program_graphs"); + Assertions.assertNotNull(programGraphs, "Level 3 must emit program graphs"); + JsonObject functions = programGraphs.getAsJsonObject("functions"); + JsonObject helloString = functions.getAsJsonObject("org.example.User.helloString()"); + Assertions.assertNotNull(helloString, "helloString() must have program graphs"); + + // CFG gate: exactly one synthetic ENTRY (id 0) and one synthetic EXIT (the max id). + JsonArray cfgNodes = helloString.getAsJsonObject("cfg").getAsJsonArray("nodes"); + long entryCount = countNodesOfKind(cfgNodes, "entry"); + long exitCount = countNodesOfKind(cfgNodes, "exit"); + Assertions.assertEquals(1, entryCount, "CFG must have exactly one ENTRY node"); + Assertions.assertEquals(1, exitCount, "CFG must have exactly one EXIT node"); + + // PDG gate: ENTRY-anchored control dependence and at least one def-use edge. + JsonArray pdgEdges = helloString.getAsJsonObject("pdg").getAsJsonArray("edges"); + Assertions.assertTrue(StreamSupport.stream(pdgEdges.spliterator(), false) + .map(JsonElement::getAsJsonObject) + .anyMatch(edge -> "CDG".equals(edge.get("type").getAsString()) + && edge.get("source").getAsInt() == 0), + "Expected a CDG edge from the ENTRY node"); + Assertions.assertTrue(StreamSupport.stream(pdgEdges.spliterator(), false) + .map(JsonElement::getAsJsonObject) + .anyMatch(edge -> "DDG".equals(edge.get("type").getAsString())), + "Expected at least one DDG edge"); + + // SDG gate: known cross-function edges, anchored at the callee's ENTRY node. 
+ JsonArray sdgEdges = programGraphs.getAsJsonArray("sdg_edges"); + Assertions.assertTrue(StreamSupport.stream(sdgEdges.spliterator(), false) + .map(JsonElement::getAsJsonObject) + .anyMatch(edge -> "CALL".equals(edge.get("type").getAsString()) && + edge.getAsJsonObject("source").get("signature").getAsString() + .equals("org.example.User.helloString()") && + edge.getAsJsonObject("target").get("signature").getAsString() + .equals("org.example.User.log()") && + edge.getAsJsonObject("target").get("node").getAsInt() == 0), + "Expected CALL edge helloString() -> log()#0"); + Assertions.assertTrue(StreamSupport.stream(sdgEdges.spliterator(), false) + .map(JsonElement::getAsJsonObject) + .anyMatch(edge -> "PARAM_OUT".equals(edge.get("type").getAsString()) && + edge.getAsJsonObject("source").get("signature").getAsString() + .equals("org.example.User.getName()") && + edge.getAsJsonObject("target").get("signature").getAsString() + .equals("org.example.User.helloString()")), + "Expected PARAM_OUT edge getName() -> helloString()"); + + // No-dangling gate: every cross-function endpoint resolves to an emitted function graph + // and a node id within that function's CFG node range. 
+ for (JsonElement element : sdgEdges) { + for (String end : new String[] { "source", "target" }) { + JsonObject endpoint = element.getAsJsonObject().getAsJsonObject(end); + String signature = endpoint.get("signature").getAsString(); + int node = endpoint.get("node").getAsInt(); + JsonObject function = functions.getAsJsonObject(signature); + Assertions.assertNotNull(function, "Dangling endpoint signature: " + signature); + int maxNodeId = StreamSupport + .stream(function.getAsJsonObject("cfg").getAsJsonArray("nodes").spliterator(), false) + .mapToInt(n -> n.getAsJsonObject().get("id").getAsInt()).max().orElse(-1); + Assertions.assertTrue(node >= 0 && node <= maxNodeId, + "Dangling endpoint node " + signature + "#" + node); + } + } + } + + @Test + void invalidGraphSelectorShouldFailFast() throws Exception { + var runCodeAnalyzer = container.execInContainer( + "bash", "-c", + String.format( + "export JAVA_HOME=%s && java -jar /opt/jars/codeanalyzer-%s.jar --input=/test-applications/call-graph-test --analysis-level=3 --graphs=bogus", + javaHomePath, codeanalyzerVersion + ) + ); + Assertions.assertNotEquals(0, runCodeAnalyzer.getExitCode(), "Unknown --graphs value must exit non-zero"); + Assertions.assertTrue(runCodeAnalyzer.getStderr().contains("Invalid --graphs"), + "Unknown --graphs value must print a clear error"); + } + + private static long countNodesOfKind(JsonArray nodes, String kind) { + return StreamSupport.stream(nodes.spliterator(), false) + .map(JsonElement::getAsJsonObject) + .filter(node -> kind.equals(node.get("kind").getAsString())) + .count(); + } + @Test void corruptMavenShouldNotBuildWithWrapper() throws IOException, InterruptedException { // Make executable diff --git a/src/test/java/com/ibm/cldk/neo4j/GraphProjectorSystemDepTest.java b/src/test/java/com/ibm/cldk/neo4j/GraphProjectorSystemDepTest.java new file mode 100644 index 0000000..79e4113 --- /dev/null +++ b/src/test/java/com/ibm/cldk/neo4j/GraphProjectorSystemDepTest.java @@ -0,0 +1,128 @@ 
+/* +Copyright IBM Corporation 2023, 2024 + +Licensed under the Apache Public License 2.0, Version 2.0 (the "License"); +you may not use this file except in compliance with the License. + +Unless required by applicable law or agreed to in writing, software +distributed under the License is distributed on an "AS IS" BASIS, +WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. +See the License for the specific language governing permissions and +limitations under the License. +*/ +package com.ibm.cldk.neo4j; + +import static org.junit.jupiter.api.Assertions.assertEquals; +import static org.junit.jupiter.api.Assertions.assertFalse; + +import com.google.gson.JsonArray; +import com.google.gson.JsonObject; +import com.ibm.cldk.entities.Callable; +import com.ibm.cldk.entities.JavaCompilationUnit; +import com.ibm.cldk.entities.Type; +import com.ibm.cldk.neo4j.GraphRows.EdgeRow; +import java.util.HashMap; +import java.util.Map; +import org.junit.jupiter.api.Test; + +/** + * Unit-level guard for the level-3 system-dependency-graph projection: method-level SDG edges + * become {@code J_CONTROL_DEP}/{@code J_DATA_DEP}/{@code J_HEAP_DATA_DEP} relationships between + * the same {@code :JCallable} nodes the call graph resolves to. The dependence kind must ride in + * the relationship type (the writers MERGE one relationship per (type, source, target), so a + * pair with both a control and a data dependence would otherwise lose one), and edges are gated + * out when either endpoint has no node — the same rules as {@code J_CALLS}. 
+ */ +public class GraphProjectorSystemDepTest { + + private static final String FQN = "com.x.Foo"; + + private static Map<String, JavaCompilationUnit> symbolTable() { + Type type = new Type(); + Map<String, Callable> callables = new HashMap<>(); + callables.put("bar()", callable("bar()")); + callables.put("baz()", callable("baz()")); + type.setCallableDeclarations(callables); + + Map<String, Type> types = new HashMap<>(); + types.put(FQN, type); + + JavaCompilationUnit cu = new JavaCompilationUnit(); + cu.setFilePath("Foo.java"); + cu.setTypeDeclarations(types); + + Map<String, JavaCompilationUnit> st = new HashMap<>(); + st.put("Foo.java", cu); + return st; + } + + private static Callable callable(String signature) { + Callable c = new Callable(); + c.setSignature(signature); + return c; + } + + private static JsonObject vertex(String typeDecl, String signature) { + JsonObject v = new JsonObject(); + v.addProperty("type_declaration", typeDecl); + v.addProperty("signature", signature); + v.addProperty("callable_declaration", signature); + return v; + } + + private static JsonObject sdgEdge(JsonObject source, JsonObject target, String type) { + JsonObject e = new JsonObject(); + e.add("source", source); + e.add("target", target); + e.addProperty("type", type); + e.addProperty("source_kind", "NORMAL"); + e.addProperty("destination_kind", "METHOD_ENTRY"); + e.addProperty("weight", "2"); + return e; + } + + @Test + public void bothDependenceKindsSurviveBetweenTheSamePair() { + JsonArray sdg = new JsonArray(); + sdg.add(sdgEdge(vertex(FQN, "bar()"), vertex(FQN, "baz()"), "CONTROL_DEP")); + sdg.add(sdgEdge(vertex(FQN, "bar()"), vertex(FQN, "baz()"), "DATA_DEP")); + + GraphRows rows = GraphProjector.project(symbolTable(), null, sdg, "app"); + + for (String relType : new String[] { "J_CONTROL_DEP", "J_DATA_DEP" }) { + assertEquals(1, rows.edges.stream().filter(e -> e.type.equals(relType)).count(), + "exactly one " + relType + " relationship expected"); + } + for (EdgeRow er : rows.edges) { + if (!er.type.startsWith("J_CONTROL") &&
!er.type.startsWith("J_DATA")) { + continue; + } + assertEquals(FQN + "#bar()", er.from.value); + assertEquals(FQN + "#baz()", er.to.value); + assertEquals(2L, er.props.get("weight")); + } + } + + @Test + public void unknownDependenceKindIsSkipped() { + JsonArray sdg = new JsonArray(); + sdg.add(sdgEdge(vertex(FQN, "bar()"), vertex(FQN, "baz()"), "SOMETHING_ELSE")); + + GraphRows rows = GraphProjector.project(symbolTable(), null, sdg, "app"); + + assertFalse(rows.edges.stream().anyMatch(e -> e.type.endsWith("_DEP")), + "an edge outside WALA's closed dependence vocabulary should be skipped"); + } + + @Test + public void unresolvableEndpointIsGatedOut() { + JsonArray sdg = new JsonArray(); + // Target callable has no JCallable node (e.g. a WALA-synthesized method) — no edge. + sdg.add(sdgEdge(vertex(FQN, "bar()"), vertex(FQN, "access$000()"), "DATA_DEP")); + + GraphRows rows = GraphProjector.project(symbolTable(), null, sdg, "app"); + + assertFalse(rows.edges.stream().anyMatch(e -> e.type.equals("J_DATA_DEP")), + "edge to a callable with no node should be gated out"); + } +}