Rewrite the backend in native C++ (Clang LibTooling), replacing the Python/libclang implementation #1

@rahlk

Why

CLDK backends are written in the language they analyze — the analyzer lives in that language's
own ecosystem to reach its best tooling (Java→JVM+WALA, Python→Python+Jedi, TS→Node+ts-morph, Go→Go).
The initial codeanalyzer-clang (v0.1.0) was bootstrapped in Python via the libclang bindings
(clang.cindex) as a one-time exception to move fast; the correct long-term backend for C/C++ is a
native C++ binary built on Clang LibTooling / the full Clang AST.

This issue tracks that rewrite. (Being taken on as a hands-on exercise by @rahlk.)

Target architecture

  • Runtime: native clang toolchain; a self-contained binary (the standard packaging rule —
    SDK users need no runtime installed).
  • Structural + resolution (Tier 1): full Clang AST via LibTooling / RecursiveASTVisitor /
    ASTMatchers + Sema — stronger than libclang's cursor API for template instantiation and
    overload/virtual resolution.
  • Framework backend (Tier 2, optional): LLVM-IR + SVF/Phasar points-to (already scaffolded as
    a stub in the Python version under semantic_analysis/svf/).
  • Build/deps: compile_commands.json compilation database (as today).
  • Packaging: per-target native build matrix (LLVM does not cross-compile cleanly) → GitHub Release
    binaries + a Homebrew formula reusing those assets. The SDK invokes the binary as a subprocess
    rather than in-process, which moves the SDK facade from the Python-only in-process path to the
    subprocess path.
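As a rough illustration of the subprocess path on the SDK side, the sketch below shells out to the native binary and loads the resulting analysis.json. The binary name, the --input/--output flags, and the helper names are hypothetical placeholders, not the actual CLI; only the analysis.json file name comes from this issue.

```python
import json
import subprocess
from pathlib import Path


def build_command(binary: str, project_dir: str, output_dir: str) -> list[str]:
    """Assemble the argv for the native analyzer.

    NOTE: the flag names --input and --output are assumptions made for this
    sketch; the real CLI surface is not defined in this issue.
    """
    return [str(binary), "--input", str(project_dir), "--output", str(output_dir)]


def run_analyzer(binary: str, project_dir: str, output_dir: str, timeout: int = 600) -> dict:
    """Run the analyzer as a child process and return the parsed analysis.json."""
    cmd = build_command(binary, project_dir, output_dir)
    # check=True raises CalledProcessError on a non-zero exit code, so a failed
    # analysis never yields a half-written result to the caller.
    subprocess.run(cmd, check=True, capture_output=True, text=True, timeout=timeout)
    with open(Path(output_dir) / "analysis.json", encoding="utf-8") as fh:
        return json.load(fh)
```

Keeping argv construction separate from process execution lets the SDK facade be unit-tested without a built binary on the test machine.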

Contract to preserve (must not drift)

The rewrite must emit the same analysis.json contract, byte-identical to the one the Python version
already produces and validates against:

  • Root ClangApplication { symbol_table: {path: ClangModule}, call_graph: [ClangCallEdge] },
    identity-only edges, snake_case keys.
  • One signatureOf() canonicalizer producing the same human-readable, overload-disambiguated,
    fully-qualified ids (e.g. app::Point::add(int)), from cursor.canonical on both sides.
  • Every field and node kind in .claude/SCHEMA_DECISIONS.md (record_kind, enums, typedefs, macros,
    virtual/pure_virtual/const/static/variadic flags, access specifier, templates, namespaces, USR tag,
    callsite flags).
  • The same Neo4j schema contract (neo4j/schema.py, --emit schema), enforced by a conformance test.
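The id format implied by the example app::Point::add(int) can be sketched as follows. This is only an illustration of the shape of the id: the function name, the ", " separator for multiple parameters, and the spelling of qualified types are assumptions here. The authoritative rules are whatever the Python implementation already emits.

```python
def signature_of(qualified_name: str, param_types: list[str]) -> str:
    """Build a human-readable, overload-disambiguated id.

    Assumption: parameter type spellings are joined with ", ". The real
    canonicalizer may normalize types differently.
    """
    return f"{qualified_name}({', '.join(param_types)})"


# Two overloads of the same method get distinct ids because the
# parameter types are part of the id.
add_int = signature_of("app::Point::add", ["int"])
add_point = signature_of("app::Point::add", ["const app::Point &"])
```

Because call-graph edges are identity-only, any divergence in this string between the two backends shows up as a missing or dangling edge, which is why a single canonicalizer is required.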

Definition of done

  • Native C++ binary passes the equivalent of the current pytest gates on testdata/fixture/
    (symbol-table, call-graph, caching, flag-validation, Neo4j conformance).
  • Output validates against the SDK ClangApplication model — no dangling edges.
  • Level-1 parity with the Python version on the fixture (same signatures, same named edges).
  • Cross-platform release binaries (built natively per target) + Homebrew formula build via a
    tag-triggered release.yml.
  • README Architecture & Tooling updated to the native backend; SDK pin/invocation switched
    from in-process to subprocess (coordinate with cldk-sdk-frontend).
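The "no dangling edges" and Level-1 parity gates could be checked along these lines. This is a minimal sketch: the helper names are hypothetical, and extracting the id set and edge pairs from ClangModule / ClangCallEdge depends on field names in the schema, which are deliberately not assumed here.

```python
from typing import Iterable

Edge = tuple[str, str]  # (caller id, callee id), i.e. an identity-only edge


def dangling_edges(known_ids: set[str], edges: Iterable[Edge]) -> list[Edge]:
    """Return every edge with an endpoint that is not in the symbol table."""
    return [(src, dst) for src, dst in edges
            if src not in known_ids or dst not in known_ids]


def parity_diff(reference: set, candidate: set) -> dict[str, list]:
    """Compare the Python backend's output (reference) with the native one."""
    return {
        "missing": sorted(reference - candidate),  # in Python output only
        "extra": sorted(candidate - reference),    # in native output only
    }


# Tiny illustrative data, not taken from the real fixture.
ids = {"app::main()", "app::Point::add(int)"}
edges = [("app::main()", "app::Point::add(int)"),
         ("app::main()", "app::Point::scale(double)")]
```

Running parity_diff once over signature sets and once over edge sets gives a compact report for each of the two things Level-1 parity asks for.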

Reference

  • The Python implementation is the behavioral spec + fixture to match (commit b83d3f5).
  • Skill guidance: cldk-forge:codeanalyzer-backendtooling-menu.md § "C++ (the clang/libclang
    path)" and § "Packaging".
