[FEATURE] Memory & knowledge system — Serena-style markdown memory + 3-tier registry + wiki ingestion #60

@Wolfvin

Summary

Add a persistent memory system so that AI agents can record and recall project context across sessions. It has three complementary layers: Serena-style markdown memory (mem: references), a RepoAudit-style 3-tier registry (syntactic / semantic / report), and UnderstandAnything-style wiki/knowledge-base ingestion.

Worker consensus (4 reports — complementary, not competing)

| Worker | Source | Contribution |
| --- | --- | --- |
| Serena | update!/CodeLens_vs_Serena_Upgrade_Analysis.md S5 | Markdown memory at .codelens/memories/ (project) + ~/.codelens/memories/global/ (global). mem:NAME reference convention with referential integrity check + auto-fix. 6 memory commands. Read-only pattern protects global. |
| Serena | same file, S6 | Onboarding process: auto-generate project_overview.md, project_structure.md, key_modules.md, conventions.md, suggested_commands.md on first run. |
| RepoAudit | update!/CodeLens_Upgrade_Issues_from_RepoAudit.md CL-041 | 3-tier memory: SyntacticMemory (AST-derived immutable, persisted to .codelens/memory/syntactic.pkl), SemanticMemory (per-agent ephemeral, thread-safe), ReportMemory (immutable append-only findings, persisted to .codelens/memory/report.json). Refactor registry.py to delegate. |
| UnderstandAnything | update!/CodeLens_vs_UnderstandAnything_Upgrade_Analysis.md U8 | Knowledge base analysis (Karpathy-pattern wiki) with a 5-phase pipeline: DETECT → SCAN → ANALYZE (LLM) → MERGE → SAVE. Dispatches article-analyzer subagent per batch. Output knowledge-graph.json with kind: "knowledge". |
| OpenTaint | update!/CodeLens_vs_OpenTaint_Upgrade_Analysis.md E2 | Documentation pattern: update references/agent-integration.md with multi-skill orchestrator pattern, state tracking, resource limits. (Not a feature, just docs.) |

Proposed phased scope

Phase 1 — Serena-style markdown memory (P2, 2-3 weeks)

  • Memory at .codelens/memories/ (project) + ~/.codelens/memories/global/ (global)
  • Markdown files (human-readable, versionable)
  • A / in a memory name denotes a topic and maps to a subdirectory
  • mem:NAME reference convention with referential integrity check (codelens memory check)
  • Auto-fix (codelens memory check --fix)
  • Reference rename propagation
  • 6 commands: codelens memory write/read/edit/delete/rename/list, plus codelens memory check for reference integrity
  • Read-only pattern (read_only_memory_patterns regex) protects global memory
  • Seed memory_maintenance.md at first run describing convention
  • 6 memory commands exposed as MCP tools
  • New files: scripts/memories/memory_manager.py, scripts/memories/memory_reference_analysis.py, scripts/commands/memory.py
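
The name-to-path mapping and the referential integrity check listed above could be sketched roughly as follows. This is a minimal illustration, not Serena's code and not the final memory_manager.py: the regular expression for mem: references, the allowed characters in names, and the function names memory_path, list_memories, and find_broken_references are all assumptions made for the example.

```python
import re
from pathlib import Path

# Assumed reference syntax: "mem:" followed by a name built from letters,
# digits, "_" and "-", with "/" separating topic levels. The real convention
# may allow other characters.
MEM_REF = re.compile(r"mem:([A-Za-z0-9_\-]+(?:/[A-Za-z0-9_\-]+)*)")


def memory_path(root: Path, name: str) -> Path:
    """Map a memory name to a markdown file; "/" in the name becomes a subdirectory."""
    *topics, leaf = name.split("/")
    return root.joinpath(*topics, leaf + ".md")


def list_memories(root: Path) -> set[str]:
    """Return every memory name found under root (hypothetical helper)."""
    return {p.relative_to(root).with_suffix("").as_posix() for p in root.rglob("*.md")}


def find_broken_references(root: Path) -> dict[str, list[str]]:
    """Map each memory name to the mem: references in it that resolve to no file."""
    existing = list_memories(root)
    broken: dict[str, list[str]] = {}
    for name in sorted(existing):
        text = memory_path(root, name).read_text(encoding="utf-8")
        missing = sorted({ref for ref in MEM_REF.findall(text) if ref not in existing})
        if missing:
            broken[name] = missing
    return broken
```

Under these assumptions, codelens memory check would report the result of find_broken_references on .codelens/memories/, and --fix would need an extra policy (for example, removing or retargeting the dangling reference) that the issue does not yet specify.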

Phase 2 — Onboarding process (P2, 1-2 weeks, depends on Phase 1)

  • Detect first-time run via empty .codelens/memories/
  • Auto-generate: project_overview.md (reuse detect), project_structure.md (reuse outline), key_modules.md (reuse entrypoints + api-map), conventions.md (reuse smell + complexity), suggested_commands.md
  • --skip-onboarding and --re-onboard flags
  • codelens init auto-triggers onboarding
  • New files: scripts/onboarding_engine.py, scripts/commands/onboarding.py
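
First-run detection and the two flags might interact as in the sketch below. The function name needs_onboarding, its parameters, and the rule that --skip-onboarding wins over --re-onboard are assumptions for illustration; only the five file names and the "empty .codelens/memories/" trigger come from this issue.

```python
from pathlib import Path

# The five onboarding memories named in this issue.
ONBOARDING_FILES = (
    "project_overview.md",
    "project_structure.md",
    "key_modules.md",
    "conventions.md",
    "suggested_commands.md",
)


def needs_onboarding(memories_dir: Path, skip: bool = False, force: bool = False) -> bool:
    """Decide whether onboarding should run (hypothetical helper).

    skip  models --skip-onboarding and is assumed to take precedence.
    force models --re-onboard and regenerates even if memories exist.
    """
    if skip:
        return False
    if force:
        return True
    # "Empty" is interpreted here as: directory missing, or no markdown file in it.
    return not memories_dir.is_dir() or not any(memories_dir.rglob("*.md"))
```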

Phase 3 — 3-tier registry refactor (P3, 2-3 weeks, optional — large refactor)

  • Split monolithic registry.py (440 LOC) into 3 tiers:
    • SyntacticMemory — AST-derived immutable facts, persisted .codelens/memory/syntactic.pkl
    • SemanticMemory — per-agent ephemeral, thread-safe with threading.RLock() per-field
    • ReportMemory — immutable append-only findings, persisted .codelens/memory/report.json
  • Refactor registry.py to delegate to SyntacticMemory (backward-compat API)
  • Refactor ast_taint_engine.py, dataflow_engine.py, crossfile_taint_engine.py to use SemanticMemory
  • Refactor output formatters to consume ReportMemory
  • codelens migrate-memory script converts old .codelens/registry.json to 3-tier
  • New files: scripts/memory/{syntactic,semantic,report}.py, scripts/commands/migrate_memory.py
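
To make the split concrete, here is a rough sketch of the three tiers. It is a from-scratch illustration, not RepoAudit code: the method names (get, update, append, all) are hypothetical, pickle persistence for SyntacticMemory is omitted, and rewriting the whole report.json on every append is a simplification.

```python
import json
import threading
from pathlib import Path
from types import MappingProxyType


class SyntacticMemory:
    """AST-derived facts, frozen after construction (read-only view)."""

    def __init__(self, facts: dict):
        self._facts = MappingProxyType(dict(facts))

    def get(self, key, default=None):
        return self._facts.get(key, default)


class SemanticMemory:
    """Per-agent ephemeral state with one RLock per field."""

    def __init__(self):
        self._fields = {}
        self._locks = {}
        self._guard = threading.Lock()  # protects creation of per-field locks

    def _lock_for(self, name):
        with self._guard:
            return self._locks.setdefault(name, threading.RLock())

    def update(self, name, fn, default=None):
        """Atomically replace a field with fn(current value)."""
        with self._lock_for(name):
            self._fields[name] = fn(self._fields.get(name, default))
            return self._fields[name]

    def get(self, name, default=None):
        with self._lock_for(name):
            return self._fields.get(name, default)


class ReportMemory:
    """Append-only findings, persisted as JSON after each append."""

    def __init__(self, path: Path):
        self._path = path
        self._findings = []
        self._lock = threading.Lock()

    def append(self, finding: dict) -> None:
        with self._lock:
            self._findings.append(dict(finding))  # store a copy
            self._path.parent.mkdir(parents=True, exist_ok=True)
            self._path.write_text(json.dumps(self._findings, indent=2), encoding="utf-8")

    def all(self) -> tuple:
        """Return copies so callers cannot mutate stored findings."""
        with self._lock:
            return tuple(dict(f) for f in self._findings)
```

In this arrangement the existing registry.py API would forward reads to a SyntacticMemory instance, each taint or dataflow engine would own a SemanticMemory, and formatters would read only from ReportMemory.all().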

Phase 4 — Knowledge base wiki ingestion (P3, 3-4 weeks, depends on LLM integration issue)

  • codelens knowledge [wiki-directory] command
  • Detect index.md + multiple .md files with [[wikilink]] syntax
  • 5-phase pipeline: DETECT (format) → SCAN (article/source/topic node + wikilink/category edge) → ANALYZE (LLM: entity/claim/source node + cites/contradicts/builds_on edge) → MERGE (dedupe + normalize + layer/tour) → SAVE (validate + meta.json)
  • Dispatch article-analyzer subagent per batch (10-15 articles, up to 3 concurrent)
  • Output knowledge-graph.json with kind: "knowledge" → dashboard uses force-directed layout
  • New files: scripts/knowledge_base_parser.py, scripts/knowledge_graph_merger.py, scripts/agents/article_analyzer.md, scripts/commands/knowledge.py
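
The SCAN phase and the batched dispatch could look roughly like the sketch below. Only kind: "knowledge", the [[wikilink]] syntax, the 10-15 batch size, and the limit of 3 concurrent batches come from this issue; the node and edge field names, the function names, and the use of a thread pool in place of real subagent dispatch are assumptions.

```python
import re
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

# Matches [[Target]], [[Target|label]] and [[Target#section]]; captures Target.
WIKILINK = re.compile(r"\[\[([^\]|#]+)(?:#[^\]|]*)?(?:\|[^\]]*)?\]\]")


def scan_wiki(wiki_dir: Path) -> dict:
    """SCAN phase: one article node per .md file, one edge per wikilink."""
    nodes, edges = [], []
    for path in sorted(wiki_dir.rglob("*.md")):
        source = path.relative_to(wiki_dir).with_suffix("").as_posix()
        nodes.append({"id": source, "type": "article"})
        text = path.read_text(encoding="utf-8")
        for target in WIKILINK.findall(text):
            edges.append({"source": source, "target": target.strip(), "type": "wikilink"})
    # Hypothetical graph shape; only the "kind" value is taken from the issue.
    return {"kind": "knowledge", "nodes": nodes, "edges": edges}


def batches(items: list, size: int = 12) -> list:
    """Split items into consecutive batches (12 sits inside the 10-15 range)."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def analyze_all(articles: list, analyze_batch, size: int = 12, max_concurrent: int = 3) -> list:
    """ANALYZE phase driver: run analyze_batch on each batch, at most 3 at a time.

    analyze_batch stands in for the article-analyzer subagent call.
    Results come back in batch order.
    """
    with ThreadPoolExecutor(max_workers=max_concurrent) as pool:
        return list(pool.map(analyze_batch, batches(articles, size)))
```

With these defaults, the 50-article wiki from the acceptance criteria would be analyzed in five batches (four of 12 and one of 2).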

Acceptance criteria

  • Phase 1: codelens memory write/read/list works; mem:NAME references resolve
  • Phase 1: codelens memory check detects broken references; --fix repairs them
  • Phase 2: first-time codelens init auto-generates 5 onboarding memory files
  • Phase 3: 3-tier registry passes existing test suite (no regression)
  • Phase 4: codelens knowledge ingests a 50-article wiki and produces valid knowledge-graph.json

Relationship to #16

#16 (ADR via manage_adr MCP tool) is a narrower use case of this broader memory system. Once Phase 1 ships, ADR management can be implemented as a memory subdirectory (.codelens/memories/adr/) without a separate MCP tool.

License note

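Serena is MIT-licensed, so the memory_manager.py logic can be ported directly. RepoAudit is under the Purdue Non-Commercial license, so it may only influence the design; the code must be reimplemented from scratch. UnderstandAnything (UA) has no specified license, so only its design is used.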