Client analyses (slicing + taint) as SDG queries over program_graphs — shared engine + per-language model packs #229

Description

@rahlk

Establishes the SDK side of the analyzer/SDK boundary now standard across the codeanalyzer-* family (cldk-forge PR #7; reference instantiation codellm-devkit/codeanalyzer-java#171). The analyzers are pure graph providers — at -a 3 they emit program_graphs (CFG/PDG/SDG with transitive HRB SUMMARY edges) and nothing more. Client analyses live here, in the SDK.

This issue is the shared, cross-language client-analysis engine — built once over the shared ProgramGraphs models, reused by every backend that emits level-3 graphs (Java, Python, C/C++, Go, Rust as they land). It is the destination for the slicing/taint work removed from the analyzer epics: codeanalyzer-python#67, codeanalyzer-clang#2, codeanalyzer-go#3, codeanalyzer-rust#25.

Scope

  1. Shared graph models (if not already present): ProgramGraphs / FunctionGraphs / GraphNode / GraphEdge / SDGEdge, validating analysis.json's program_graphs section. Modeled once, not per-language (the parity clause holds across analyzers).
  2. Backward/forward slicing as a reachability query: a backward slice is reverse reachability over CDG ∪ DDG ∪ PARAM_* ∪ SUMMARY from a (signature, node_id) criterion (a forward slice walks the same edges in the forward direction), context-sensitive via the two-phase HRB traversal (ascend PARAM_IN/CALL, then descend PARAM_OUT; SUMMARY edges carry across calls without re-descending). Exact expected-set gate on a fixture.
  3. Taint as a labeled reachability query: seed at sources, propagate along dependence edges, block/flag at sanitizers on the path, report when a source label reaches a matching sink. Witness paths reconstructed lazily as (signature, node_id) chains with the model id per hop.
  4. Sources/sinks/sanitizers/library models as data — a JSON spec validated against a JSON Schema, precedence built-in pack < config file < caller-supplied. Per-language model packs (e.g. Python flask/os/subprocess; C libc; Go net/http/os/exec) ship as data alongside the shared engine.
  5. Client-result models: TaintFlow / slice-result ({ source, sink, rule, sanitized, path }). These are SDK models, not analyzer output.
  6. Facade methods on the query surface: get_backward_slice(...), get_taint_flows(spec=...), etc.
  7. Surface graph over-approximations in results rather than absorbing them: ENTRY-anchored PARAM_IN (argument arity collapsed until the analyzer ships per-argument PARAM nodes, e.g. codeanalyzer-java#173), missing SUMMARY edges before that analyzer PR lands, heap flows only under the analyzer's heap-dependence mode.
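
The two-phase traversal in item 2 can be sketched as follows. This is a minimal illustration, not the SDK implementation: `Node`, `Edge`, `backward_slice`, the flat edge list, and the edge-kind strings are hypothetical stand-ins for the shared ProgramGraphs models, and the fixture is a hand-built toy SDG in which `main` calls a one-line function `id` twice.

```python
from collections import defaultdict, deque
from typing import NamedTuple


class Node(NamedTuple):
    signature: str
    node_id: int


class Edge(NamedTuple):
    src: Node
    dst: Node
    kind: str  # assumed kinds: CDG, DDG, CALL, PARAM_IN, PARAM_OUT, SUMMARY


def _reverse_reach(preds, seeds, skip):
    """Reverse reachability from seeds, ignoring edge kinds listed in skip."""
    seen = set(seeds)
    work = deque(seeds)
    while work:
        node = work.popleft()
        for src, kind in preds[node]:
            if kind not in skip and src not in seen:
                seen.add(src)
                work.append(src)
    return seen


def backward_slice(edges, criterion):
    preds = defaultdict(list)
    for e in edges:
        preds[e.dst].append((e.src, e.kind))
    # Phase 1: stay in the criterion's procedure and ascend to callers.
    # PARAM_OUT is skipped; SUMMARY edges step over each call site instead.
    phase1 = _reverse_reach(preds, {criterion}, skip={"PARAM_OUT"})
    # Phase 2: descend into callees from everything phase 1 found.
    # CALL and PARAM_IN are skipped, so the walk never re-ascends
    # into a call site that phase 1 did not already reach.
    return _reverse_reach(preds, phase1, skip={"CALL", "PARAM_IN"})


def m(i):
    return Node("main", i)


def f(i):
    return Node("id", i)


# Toy SDG: main computes r1 = id(a) and r2 = id(b), then uses r1.
# main: 1 a=..., 2 b=..., 3 actual-in(a), 4 actual-out(r1),
#       5 actual-in(b), 6 actual-out(r2), 7 use(r1)
# id:   1 formal-in(x), 2 formal-out(return x)
EDGES = [
    Edge(m(1), m(3), "DDG"),
    Edge(m(2), m(5), "DDG"),
    Edge(m(3), f(1), "PARAM_IN"),
    Edge(m(5), f(1), "PARAM_IN"),
    Edge(f(1), f(2), "DDG"),
    Edge(f(2), m(4), "PARAM_OUT"),
    Edge(f(2), m(6), "PARAM_OUT"),
    Edge(m(3), m(4), "SUMMARY"),
    Edge(m(5), m(6), "SUMMARY"),
    Edge(m(4), m(7), "DDG"),
]

slice_nodes = backward_slice(EDGES, m(7))
```

On this fixture the slice from `main` node 7 contains `main` nodes 1, 3, 4, 7 and both `id` nodes, but not the second call site (`main` nodes 2, 5, 6). A plain reverse walk over every edge kind would reach those through `id`'s formal-in, which is the kind of leak an exact expected-set gate is meant to catch.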

Gates (frontend)

  • Slice: backward slice of a named fixture variable equals the hand-computed node set — exact, not "non-empty"; and does not contain callee-internal nodes a naive phase-1 walk would leak (proves SUMMARY edges are used).
  • Taint: one known source→sink flow found; the same flow with a sanitizer interposed reported sanitized; witness path names every hop.
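
A rough sketch of the taint gate as a labeled reachability query, under stated assumptions: `taint_flows`, the dict-based graph and spec shapes, and the rule name `command-injection` are all hypothetical, and the node roles in the comments are illustrative only. It flags sanitized paths rather than blocking them, builds the witness path eagerly instead of lazily, and omits the per-hop model id; the result dict follows the `{ source, sink, rule, sanitized, path }` shape from the scope list.

```python
from collections import deque
from typing import NamedTuple


class Node(NamedTuple):
    signature: str
    node_id: int


def _path(parent, state):
    """Rebuild the witness path by following parent pointers back to the source."""
    hops = []
    while state is not None:
        hops.append(state[0])
        state = parent[state]
    return hops[::-1]


def taint_flows(succs, sources, sinks, sanitizers):
    """succs: Node -> list of Nodes; sources and sinks: Node -> label;
    sanitizers: Node -> set of labels that node neutralizes."""
    flows = []
    for src, label in sources.items():
        start = (src, False)  # search state is (node, sanitized-so-far)
        parent = {start: None}
        work = deque([start])
        while work:
            state = work.popleft()
            node, sanitized = state
            if node != src and sinks.get(node) == label:
                flows.append({
                    "source": src,
                    "sink": node,
                    "rule": label,
                    "sanitized": sanitized,
                    "path": _path(parent, state),
                })
                continue
            for nxt in succs.get(node, ()):
                nxt_state = (nxt, sanitized or label in sanitizers.get(nxt, ()))
                if nxt_state not in parent:
                    parent[nxt_state] = state
                    work.append(nxt_state)
    return flows


# Hypothetical fixture: one handler with a source, a plain copy,
# a sanitizer call, and a sink.
N1, N2, N3, N4 = (Node("handler", i) for i in (1, 2, 3, 4))
SOURCES = {N1: "command-injection"}
SINKS = {N4: "command-injection"}
SANITIZERS = {N3: {"command-injection"}}

RAW = {N1: [N2], N2: [N4]}    # source -> copy -> sink
CLEAN = {N1: [N3], N3: [N4]}  # source -> sanitizer -> sink

raw_flows = taint_flows(RAW, SOURCES, SINKS, SANITIZERS)
clean_flows = taint_flows(CLEAN, SOURCES, SINKS, SANITIZERS)
```

Here `raw_flows` holds one flow with `sanitized` false and the path `N1, N2, N4`, while `clean_flows` holds the same source and sink with `sanitized` true and the path `N1, N3, N4`, which mirrors the two halves of the gate.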

Contract references: cldk-forge cldk-sdk-frontend (SKILL.md § Client analyses, sdk-testing.md § 3b). Related: #228 (Java backend adoption of -a 3, whose ask #6 this generalizes).
