
Unify code generation pipeline across oold, osw-python, and osw-python-package-generator #83

@simontaurus

Problem

The Python code generation pipeline is spread across three repositories with overlapping responsibilities:

  1. oold-python (src/oold/generator.py, src/oold/utils/codegen.py)

    • Schema preprocessing: range to $ref conversion
    • OOLDJsonSchemaParser: custom keyword preservation, _deep_merge override
    • Calls datamodel-code-generator with settings (use_title_as_name, reuse_model, allof_class_hierarchy)
  2. osw-python (src/osw/core.py lines 540-900)

    • _fetch_schema(): recursive schema resolution, downloads from wiki/offline pages
    • Writes resolved schemas to temp directory for datamodel-code-generator
    • Does NOT run oold preprocessing on dependency schemas
  3. osw-python-package-generator (src/osw_python_package_generator/main.py)

    • Downloads schema packages from GitHub
    • Calls osw.fetch_schema() then replace_duplicated_classes_with_imports()
    • Post-processing: class deduplication (name-based, UUID-based, numbered variants, pass-only cleanup)
    • Hotfix replacements for raw OSW ID type annotations
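The "range to $ref conversion" step listed under oold-python could look roughly like the sketch below. This is not oold's actual preprocess() implementation: the function name range_to_ref, the make_ref callback, and the ".json" reference format are all hypothetical and only illustrate the idea of dropping the placeholder "type": "string" once a range is turned into a reference.

```python
from typing import Any, Callable


def range_to_ref(node: Any, make_ref: Callable[[str], str]) -> Any:
    """Return a copy of `node` where every string-valued `range` gains a `$ref`.

    Hypothetical helper: the real oold preprocessing may differ in detail.
    """
    if isinstance(node, list):
        return [range_to_ref(item, make_ref) for item in node]
    if not isinstance(node, dict):
        return node

    # Rebuild the dict so the input schema is never mutated.
    out = {key: range_to_ref(value, make_ref) for key, value in node.items()}

    target = out.get("range")
    if isinstance(target, str) and "$ref" not in out:
        # Drop the placeholder string type so the code generator follows
        # the reference instead of emitting a plain `str` field.
        out.pop("type", None)
        out["$ref"] = make_ref(target)
    return out


# Example schema; the category ID and the ref format are made up.
schema = {
    "title": "Device",
    "properties": {
        "risk_assessment": {"type": "string", "range": "Category:OSW123"},
        "name": {"type": "string"},
    },
}
converted = range_to_ref(schema, lambda target: f"{target}.json")
```

The range keyword itself is kept in the output, on the assumption that custom keywords should survive into the generated model metadata.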

Current issues caused by this split

  • Dependency schemas not preprocessed: _fetch_schema() in osw-python fetches dependency schemas but does not run oold's preprocess() on them. As a result, range fields in dependency schemas (e.g., Device.risk_assessment) keep their "type": "string" alongside the unresolved range, and datamodel-code-generator uses raw OSW IDs as class names instead of schema titles.

  • Post-processing workarounds: The package generator has growing hotfix logic (raw OSW ID replacement, sentinel object cleanup, lambda : formatting) that patches symptoms of upstream issues.

  • Duplicate merge logic: The _deep_merge override in oold and merge_deep in json_tools.py both exist to fix array deduplication. The fix had to be applied in oold because that is where datamodel-code-generator is monkey-patched, but the merge utility lives separately, so the same logic is maintained in two places.

  • reuse_model failures: datamodel-code-generator's reuse_model=True doesn't deduplicate schemas resolved via different $ref paths (e.g., Tool referenced from both Process and ProcessType). This is worked around by post-processing in the package generator.

  • Ad-hoc schema generation rebuilds full chain: When a user wants to generate Python code for a single custom schema via osw-python, _fetch_schema() resolves the entire dependency chain up to Entity. Only the user's schema should be built; dependency classes should be imported from existing packages.
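For the duplicate merge logic item above, a single shared merge helper with array deduplication might look like the following minimal sketch. It is an illustration only: merge_unique is a hypothetical name, and the behavior shown (append only unseen list items, recurse into dicts, otherwise let the update win) is an assumption about what "array deduplication" means here, not a copy of _deep_merge or merge_deep.

```python
from typing import Any


def merge_unique(base: Any, update: Any) -> Any:
    """Recursively merge `update` into a copy of `base`.

    Hypothetical helper. Lists are concatenated without duplicates,
    dicts are merged key by key, and any other value is overwritten.
    Unchanged nested values are shared with the inputs, not deep-copied.
    """
    if isinstance(base, dict) and isinstance(update, dict):
        merged = dict(base)
        for key, value in update.items():
            merged[key] = merge_unique(base[key], value) if key in base else value
        return merged

    if isinstance(base, list) and isinstance(update, list):
        merged = list(base)
        for item in update:
            # Equality-based check keeps this usable for unhashable items
            # such as nested dicts, at the cost of quadratic time.
            if item not in merged:
                merged.append(item)
        return merged

    return update


# Two allOf branches that both declare "name" as required.
parent = {"required": ["name"], "properties": {"name": {"type": "string"}}}
child = {"required": ["name", "uuid"], "properties": {"uuid": {"type": "string"}}}
combined = merge_unique(parent, child)
```

Keeping one helper like this in a single module would let the monkey-patched parser and any other caller share the same deduplication rule.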

Proposed solution

Consolidate the generation pipeline in oold-python:

  1. Move schema resolution (_fetch_schema logic) from osw-python into oold's Generator
  2. Run preprocess() on every schema after resolution, not just top-level schemas
  3. Move class deduplication logic from the package generator into oold's Generator as a post-processing step
  4. Support dependency resolution from installed packages: when resolving a $ref to a schema that exists in an installed dependency package, import the class instead of regenerating it
  5. osw-python and the package generator become thin wrappers that provide schema sources (wiki API, GitHub zips, local files) and call oold.Generator.generate()

This would make oold-python the single source of truth for: schema preprocessing, code generation settings, post-processing, and deduplication.
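Steps 1 and 4 of the proposal can be sketched as a small planning function that walks the dependency graph and stops at anything already provided by an installed package. Everything here is hypothetical: plan_generation, the fetch callback, the installed mapping, and the example_pkg module paths are illustrative names, and the sketch assumes each $ref value is a whole-schema identifier (local references such as "#/definitions/..." are not handled).

```python
from typing import Any, Callable, Dict, Iterator, Mapping, Tuple


def iter_refs(node: Any) -> Iterator[str]:
    """Yield every string `$ref` value found anywhere inside `node`."""
    if isinstance(node, dict):
        for key, value in node.items():
            if key == "$ref" and isinstance(value, str):
                yield value
            else:
                yield from iter_refs(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_refs(item)


def plan_generation(
    root: str,
    fetch: Callable[[str], dict],
    installed: Mapping[str, str],
) -> Tuple[Dict[str, dict], Dict[str, str]]:
    """Split the dependency closure of `root` into schemas to generate
    and classes to import from already installed packages."""
    to_generate: Dict[str, dict] = {}
    to_import: Dict[str, str] = {}
    pending = [root]
    while pending:
        schema_id = pending.pop()
        if schema_id in to_generate or schema_id in to_import:
            continue
        if schema_id in installed:
            # Already packaged: import it and do not descend further.
            to_import[schema_id] = installed[schema_id]
            continue
        schema = fetch(schema_id)
        to_generate[schema_id] = schema
        pending.extend(iter_refs(schema))
    return to_generate, to_import


# Made-up schema source and installed-package index.
schemas = {
    "MyTool": {
        "title": "MyTool",
        "allOf": [{"$ref": "Tool"}],
        "properties": {"owner": {"$ref": "Person"}},
    },
    "Person": {"title": "Person", "allOf": [{"$ref": "Entity"}]},
}
installed = {"Tool": "example_pkg.tool.Tool", "Entity": "example_pkg.entity.Entity"}
generate, imports = plan_generation("MyTool", schemas.__getitem__, installed)
```

In this arrangement the fetch callback is the only part that osw-python or the package generator would need to supply (wiki API, GitHub zips, or local files), while the planning logic stays in one place.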

Affected repositories

  • oold-python
  • osw-python
  • osw-python-package-generator
