Benchmarking json schemas by JordanMaples · Pull Request #1154 · microsoft/DiskANN

JordanMaples · 2026-06-12T18:42:06Z

To make the available options for our JSON inputs clearer, Mark and I discussed walking the ASTs of the JSON body using schemars and rendering a breakdown of the possible options. This PR adds JSON Schema documentation for benchmark inputs via the --schema and --field options.

Summary

Adds schemars-based JSON Schema generation and a custom tree-style terminal renderer so users can discover benchmark input fields without reading source code.
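As a rough illustration of the tree-style rendering described above, the sketch below walks a nested structure and emits one line per field with box-drawing connectors. The `Node` type and `render` function are hypothetical stand-ins written for this example; the PR's renderer works on the generated JSON Schema and also handles colors, descriptions, and enum variants.

```rust
/// Hypothetical, simplified stand-in for a schema node (not the schemars type).
enum Node {
    /// A leaf such as `number` or `string`.
    Leaf(&'static str),
    /// An object with named fields, each of which is another node.
    Object(Vec<(&'static str, Node)>),
}

/// Append one line per field of `node` to `out`, indenting with `prefix`.
fn render(node: &Node, prefix: &str, out: &mut String) {
    // Leaves have no fields to list.
    let Node::Object(fields) = node else { return };
    for (i, (name, child)) in fields.iter().enumerate() {
        let last = i + 1 == fields.len();
        let branch = if last { "└── " } else { "├── " };
        let label = match child {
            Node::Leaf(ty) => *ty,
            Node::Object(_) => "object",
        };
        out.push_str(&format!("{prefix}{branch}{name}: {label}\n"));
        // Children of a non-last field keep a vertical rail on the left.
        let child_prefix = format!("{prefix}{}", if last { "    " } else { "│   " });
        render(child, &child_prefix, out);
    }
}
```

Calling `render(&schema, "", &mut out)` on an object with a nested `build` object and a trailing `seed` field yields a `├── build: object` line, the indented fields of `build`, and a final `└── seed: integer` line.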

Usage

Full schema for an input type

cargo run --release -p diskann-benchmark -- inputs graph-index-build --schema

Drill into a specific field

cargo run --release -p diskann-benchmark -- inputs graph-index-build --field source.start_point_strategy
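Drilling into a field amounts to following a dotted path through nested objects. A minimal sketch, assuming a simplified `Node` type and a hypothetical `resolve` helper (neither is taken from the PR, whose path resolver operates on the generated schema):

```rust
/// Hypothetical, simplified schema node (not the schemars type).
#[derive(Debug, PartialEq)]
enum Node {
    Leaf(&'static str),
    Object(Vec<(&'static str, Node)>),
}

/// Follow a dotted path such as `build.start_point_strategy`.
/// On failure, returns the first path segment that could not be resolved.
fn resolve<'a>(root: &'a Node, path: &'a str) -> Result<&'a Node, &'a str> {
    let mut current = root;
    for segment in path.split('.') {
        // Only objects have named fields to descend into.
        let Node::Object(fields) = current else {
            return Err(segment);
        };
        current = fields
            .iter()
            .find(|(name, _)| *name == segment)
            .map(|(_, child)| child)
            .ok_or(segment)?;
    }
    Ok(current)
}
```

Returning the offending segment lets a caller report which part of the path was wrong, for example a misspelled field name or an attempt to descend into a leaf.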

What's included

schemars 1.2 added to workspace; JsonSchema derived on all input types
Custom renderer (diskann-benchmark-runner/src/schema.rs) with:
  - Colored terminal output (bold field names, cyan types, yellow enum variants)
  - Multi-line description alignment
  - Handles internally/externally-tagged enums and $ref newtypes
  - MAX_DEPTH guard against recursive schemas
Manual JsonSchema impls for custom-serde types:
  - NonNegativeFinite (number with minimum)
  - StartPointStrategyRef (externally-tagged enum, with drift test)
  - QuantizationTypeSchema proxy (keeps schemars out of diskann-disk)
JsonSchema bound added to Input::Raw trait
README documentation for the new --schema and --field options
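To show why a depth guard matters when a schema can refer back to itself through `$ref`, here is a small sketch. Everything in it is an assumption made for illustration: the `Node` type, the `render` signature, the marker text, and the value of `MAX_DEPTH` (the PR's actual limit is not shown here).

```rust
use std::collections::HashMap;

/// Hypothetical limit chosen for this example.
const MAX_DEPTH: usize = 4;

/// Simplified stand-in for a schema node (not the schemars type).
enum Node {
    Leaf(&'static str),
    /// A `$ref` to a named definition.
    Ref(&'static str),
    Object(Vec<(&'static str, Node)>),
}

/// Render `node` as indented lines, giving up once `depth` exceeds `MAX_DEPTH`.
fn render(
    node: &Node,
    defs: &HashMap<&'static str, Node>,
    depth: usize,
    out: &mut Vec<String>,
) {
    let indent = "  ".repeat(depth);
    if depth > MAX_DEPTH {
        out.push(format!("{indent}(max depth reached)"));
        return;
    }
    match node {
        Node::Leaf(ty) => out.push(format!("{indent}{ty}")),
        // Following a reference counts as a level, so a definition that
        // refers to itself cannot recurse forever.
        Node::Ref(name) => match defs.get(name) {
            Some(target) => render(target, defs, depth + 1, out),
            None => out.push(format!("{indent}(unresolved $ref {name})")),
        },
        Node::Object(fields) => {
            for (name, child) in fields {
                out.push(format!("{indent}{name}:"));
                render(child, defs, depth + 1, out);
            }
        }
    }
}
```

With a definition `Tree` whose only field refers back to `Tree`, rendering stops after a few levels and ends with the marker line instead of overflowing the stack.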

Sample output

Full schema

cargo run -p diskann-benchmark --all-features -- inputs --schema graph-index-build-bftree-spherical-quantization

Schema for "graph-index-build-bftree-spherical-quantization":

├── build: object
    ├── alpha: number
    │
    ├── backedge_ratio: number
    │
    ├── data: string
    │   # A file that is used as an input to for a benchmark.
    │
    ├── data_type: one of ["float64", "float32", "float16", "uint8", "uint16", "uint32", "uint64", "int8", "int16", "int32", "int64", "bool"]
    │   # An enum representation for common DiskANN data types.
    │
    ├── distance: one of ["squared_l2", "inner_product", "cosine", "cosine_normalized"]
    │
    ├── insert_retry (optional): (any of)
        ├── num_insert_attempts: integer (≥1)
        │
        ├── retry_threshold: number
        │
        └── saturate_inserts: boolean
    │
    ├── l_build: integer (≥0)
    │
    ├── max_degree: integer (≥0)
    │
    ├── multi_insert (optional): (any of)
        ├── batch_parallelism: integer (≥1)
        │
        ├── batch_size: integer (≥1)
        │
        └── intra_batch_candidates: (one of)
            # A one-to-one correspondence with [`diskann::index::config::IntraBatchCandidates`].
            ├─ "none" — No intra-batch candidates will be considered.
            ├─ "max"
            └─ "all" — Consider all elements in the batch for intra-batch candidates.
    │
    ├── num_threads: integer (≥0)
    │
    ├── save_path: any (optional)
    │
    └── start_point_strategy: (one of)
        # Strategy for selecting graph start points.
        ├─ "medoid" — Use the medoid as the starting point.
        ├─ "first_vector" — Use the first vector in the dataset.
        ├─ "random_vectors" — Randomly select vector(s) with given norm.
        │  ├── norm: number
        │  ├── nsamples: integer (≥1)
        │  └── seed: integer
        ├─ "random_samples" — Sample data from the dataset.
        │  ├── nsamples: integer (≥1)
        │  └── seed: integer
        └─ "latin_hyper_cube" — Use Latin Hypercube sampling.
           ├── nsamples: integer (≥1)
           └── seed: integer
│
├── neighbor_store_config (optional): (any of)
    ├── cache_only: any (optional)
    │   # If true, only use the in-memory circular buffer (no disk pages).
    │
    ├── cb_copy_on_access_ratio: any (optional)
    │   # Ratio of buffer used before copy-on-access kicks in.
    │
    ├── cb_max_record_size: any (optional)
    │   # Maximum record size that can be stored in the circular buffer.
    │
    ├── cb_min_record_size: any (optional)
    │   # Minimum record size for the circular buffer.
    │
    ├── cb_size_byte: integer (≥0)
    │   # Size of the circular buffer (in-memory write cache) in bytes.
    │
    ├── leaf_page_size: integer (≥0)
    │   # Size of leaf pages in bytes.
    │
    ├── read_promotion_rate: any (optional)
    │   # Probability (0-100) of promoting a read record to the front of the buffer.
    │
    ├── read_record_cache: any (optional)
    │   # Whether to cache full pages on read.
    │
    └── scan_promotion_rate: any (optional)
        # Probability (0-100) of promoting a scanned record to the front of the buffer.
│
├── num_bits: integer (≥1)
│
├── pre_scale (optional): (any of)
    ├─ one of ["none", "reciprocal_mean_norm"]
    └─ "some"
│
├── quant_store_config (optional): (any of)
    ├── cache_only: any (optional)
    │   # If true, only use the in-memory circular buffer (no disk pages).
    │
    ├── cb_copy_on_access_ratio: any (optional)
    │   # Ratio of buffer used before copy-on-access kicks in.
    │
    ├── cb_max_record_size: any (optional)
    │   # Maximum record size that can be stored in the circular buffer.
    │
    ├── cb_min_record_size: any (optional)
    │   # Minimum record size for the circular buffer.
    │
    ├── cb_size_byte: integer (≥0)
    │   # Size of the circular buffer (in-memory write cache) in bytes.
    │
    ├── leaf_page_size: integer (≥0)
    │   # Size of leaf pages in bytes.
    │
    ├── read_promotion_rate: any (optional)
    │   # Probability (0-100) of promoting a read record to the front of the buffer.
    │
    ├── read_record_cache: any (optional)
    │   # Whether to cache full pages on read.
    │
    └── scan_promotion_rate: any (optional)
        # Probability (0-100) of promoting a scanned record to the front of the buffer.
│
├── search_phase: (one of)
    ├─ "topk"
    │  ├── groundtruth: string
    │  ├── num_threads: array of integer (≥1)
    │  ├── queries: string
    │  ├── reps: integer (≥1)
    │  └── runs: array of any
    ├─ "range"
    │  ├── groundtruth: string
    │  ├── num_threads: array of integer (≥1)
    │  ├── queries: string
    │  ├── reps: integer (≥1)
    │  └── runs: array of any
    ├─ "topk-beta-filter"
    │  ├── beta: number
    │  ├── data_labels: string
    │  ├── groundtruth: string
    │  ├── num_threads: array of integer (≥1)
    │  ├── queries: string
    │  ├── query_predicates: string
    │  ├── reps: integer (≥1)
    │  └── runs: array of any
    ├─ "topk-multihop-filter"
    │  ├── data_labels: string
    │  ├── groundtruth: string
    │  ├── num_threads: array of integer (≥1)
    │  ├── queries: string
    │  ├── query_predicates: string
    │  ├── reps: integer (≥1)
    │  └── runs: array of any
    └─ "topk-inline-filter"
       ├── adaptive_l: (union) (optional)
       ├── data_labels: string
       ├── groundtruth: string
       ├── num_threads: array of integer (≥1)
       ├── queries: string
       ├── query_predicates: string
       ├── reps: integer (≥1)
       └── runs: array of any
│
├── seed: integer (≥0)
│
├── transform_kind: (one of)
    ├─ one of ["null"]
    ├─ "padding_hadamard"
    ├─ "random_rotation"
    └─ "double_hadamard"
│
└── vector_store_config (optional): (any of)
    ├── cache_only: any (optional)
    │   # If true, only use the in-memory circular buffer (no disk pages).
    │
    ├── cb_copy_on_access_ratio: any (optional)
    │   # Ratio of buffer used before copy-on-access kicks in.
    │
    ├── cb_max_record_size: any (optional)
    │   # Maximum record size that can be stored in the circular buffer.
    │
    ├── cb_min_record_size: any (optional)
    │   # Minimum record size for the circular buffer.
    │
    ├── cb_size_byte: integer (≥0)
    │   # Size of the circular buffer (in-memory write cache) in bytes.
    │
    ├── leaf_page_size: integer (≥0)
    │   # Size of leaf pages in bytes.
    │
    ├── read_promotion_rate: any (optional)
    │   # Probability (0-100) of promoting a read record to the front of the buffer.
    │
    ├── read_record_cache: any (optional)
    │   # Whether to cache full pages on read.
    │
    └── scan_promotion_rate: any (optional)
        # Probability (0-100) of promoting a scanned record to the front of the buffer.

Example:

{
  "content": {
    "build": {
      "alpha": 1.2000000476837158,
      "backedge_ratio": 1.0,
      "data": "path/to/data",
      "data_type": "float32",
      "distance": "squared_l2",
      "insert_retry": null,
      "l_build": 50,
      "max_degree": 32,
      "multi_insert": {
        "batch_parallelism": 32,
        "batch_size": 128,
        "intra_batch_candidates": "none"
      },
      "num_threads": 1,
      "save_path": null,
      "start_point_strategy": "medoid"
    },
    "neighbor_store_config": null,
    "num_bits": 1,
    "pre_scale": null,
    "quant_store_config": null,
    "search_phase": {
      "groundtruth": "path/to/groundtruth",
      "num_threads": [
        1,
        2,
        4,
        8
      ],
      "queries": "path/to/queries",
      "reps": 5,
      "runs": [
        {
          "recall_k": 10,
          "search_l": [
            10,
            20,
            30,
            40
          ],
          "search_n": 10
        }
      ],
      "search-type": "topk"
    },
    "seed": 42,
    "transform_kind": "null",
    "vector_store_config": null
  },
  "type": "graph-index-build-bftree-spherical-quantization"
}
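The example above sets start_point_strategy to the bare string "medoid", while the schema also lists variants that carry fields. Under serde's externally tagged convention, a unit variant serializes as a string and a variant with fields as a single-key object. The sketch below spells that out by hand; the `StartPoint` enum, its field types, and `to_json` are assumptions made for this illustration and are not the PR's `StartPointStrategyRef`.

```rust
/// Hypothetical mirror of the variants listed in the schema output.
/// Field types are assumptions.
#[allow(dead_code)]
enum StartPoint {
    Medoid,
    FirstVector,
    RandomVectors { norm: f64, nsamples: u64, seed: u64 },
    RandomSamples { nsamples: u64, seed: u64 },
    LatinHyperCube { nsamples: u64, seed: u64 },
}

/// Hand-written externally tagged JSON: unit variants become strings,
/// variants with fields become `{"<variant>": {...}}`.
fn to_json(strategy: &StartPoint) -> String {
    match strategy {
        StartPoint::Medoid => "\"medoid\"".to_string(),
        StartPoint::FirstVector => "\"first_vector\"".to_string(),
        StartPoint::RandomVectors { norm, nsamples, seed } => format!(
            "{{\"random_vectors\":{{\"norm\":{norm},\"nsamples\":{nsamples},\"seed\":{seed}}}}}"
        ),
        StartPoint::RandomSamples { nsamples, seed } => {
            format!("{{\"random_samples\":{{\"nsamples\":{nsamples},\"seed\":{seed}}}}}")
        }
        StartPoint::LatinHyperCube { nsamples, seed } => {
            format!("{{\"latin_hyper_cube\":{{\"nsamples\":{nsamples},\"seed\":{seed}}}}}")
        }
    }
}
```

For instance, `to_json(&StartPoint::RandomSamples { nsamples: 4, seed: 7 })` produces `{"random_samples":{"nsamples":4,"seed":7}}`, whereas `StartPoint::Medoid` produces `"medoid"`.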

Single field

cargo run -p diskann-benchmark --all-features -- inputs --schema graph-index-build-bftree-spherical-quantization --field build.start_point_strategy


Schema for "graph-index-build-bftree-spherical-quantization".build.start_point_strategy:

├─ "medoid" — Use the medoid as the starting point.
├─ "first_vector" — Use the first vector in the dataset.
├─ "random_vectors" — Randomly select vector(s) with given norm.
│  ├── norm: number
│  ├── nsamples: integer (≥1)
│  └── seed: integer
├─ "random_samples" — Sample data from the dataset.
│  ├── nsamples: integer (≥1)
│  └── seed: integer
└─ "latin_hyper_cube" — Use Latin Hypercube sampling.
   ├── nsamples: integer (≥1)
   └── seed: integer

Example:

"medoid"

Copilot

Pull request overview

This PR adds JSON Schema generation (via schemars) and a tree-style terminal renderer so diskann-benchmark users can discover benchmark input fields/variants using --schema and --field, rather than reading source.

Changes:

Add schemars::JsonSchema coverage across benchmark input/tolerance DTOs and generate per-input JSON Schemas from Input::Raw.
Introduce diskann-benchmark-runner::schema to render schemas (and drill into sub-fields) as human-readable CLI documentation.
Add schema/serialization drift tests and wire new CLI flags + README documentation.
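A drift test of the kind mentioned above can be sketched as follows. The idea is to pair an exhaustive `match` (which stops compiling when a variant is added without a name) with a runtime check that a hand-written schema advertises the same names. The `Distance` enum reuses the distance names shown in the sample output; the type itself, `ALL`, `schema_variants`, and `schema_matches_variants` are hypothetical and do not reproduce the PR's tests.

```rust
/// Hypothetical enum using the distance names from the sample output.
#[derive(Clone, Copy)]
enum Distance {
    SquaredL2,
    InnerProduct,
    Cosine,
    CosineNormalized,
}

impl Distance {
    /// Must be kept in sync by hand; only `name` is compiler-checked.
    const ALL: [Distance; 4] = [
        Distance::SquaredL2,
        Distance::InnerProduct,
        Distance::Cosine,
        Distance::CosineNormalized,
    ];

    /// Serialized name. The match is exhaustive, so adding a variant
    /// without giving it a name is a compile error.
    fn name(self) -> &'static str {
        match self {
            Distance::SquaredL2 => "squared_l2",
            Distance::InnerProduct => "inner_product",
            Distance::Cosine => "cosine",
            Distance::CosineNormalized => "cosine_normalized",
        }
    }
}

/// Stand-in for a manually written schema: the variant names it advertises.
fn schema_variants() -> Vec<&'static str> {
    vec!["squared_l2", "inner_product", "cosine", "cosine_normalized"]
}

/// True when the schema lists exactly the names the enum serializes to.
fn schema_matches_variants() -> bool {
    let advertised = schema_variants();
    advertised.len() == Distance::ALL.len()
        && Distance::ALL.iter().all(|d| advertised.contains(&d.name()))
}
```

In a real test the runtime check would compare against the generated schema rather than a hard-coded list, but the split between a compile-time guard and a runtime comparison is the same.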

Reviewed changes

Copilot reviewed 26 out of 27 changed files in this pull request and generated 6 comments.

File	Description
diskann-disk/src/build/configuration/quantization_types.rs	Adds a test intended to guard drift across `QuantizationType` variants/serialization.
diskann-disk/Cargo.toml	Adds `serde_json` as a dev-dependency for new tests.
diskann-benchmark/src/utils/mod.rs	Derives `JsonSchema` for `SimilarityMeasure` used in inputs.
diskann-benchmark/src/inputs/multi_vector.rs	Derives `JsonSchema` for multi-vector input types.
diskann-benchmark/src/inputs/graph_index.rs	Derives `JsonSchema` broadly; adds manual `JsonSchema` for `StartPointStrategyRef` + drift test; annotates schema override for the remote-serde field.
diskann-benchmark/src/inputs/filters.rs	Derives `JsonSchema` for filter-related inputs.
diskann-benchmark/src/inputs/exhaustive.rs	Derives `JsonSchema` for exhaustive-benchmark inputs.
diskann-benchmark/src/inputs/disk.rs	Adds schema proxy for `QuantizationType` (to avoid schemars dependency in `diskann-disk`) and derives `JsonSchema` for disk-index inputs.
diskann-benchmark/src/inputs/bftree.rs	Derives `JsonSchema` for bf_tree inputs.
diskann-benchmark/src/backend/multi_vector/driver.rs	Derives `JsonSchema` for multi-vector tolerance input.
diskann-benchmark/src/backend/disk_index/benchmarks.rs	Derives `JsonSchema` for disk-index tolerance input.
diskann-benchmark/README.md	Documents `--schema` and `--field` usage.
diskann-benchmark/Cargo.toml	Adds `schemars` dependency for input schema generation.
diskann-benchmark-simd/src/lib.rs	Derives `JsonSchema` for SIMD input/tolerance types.
diskann-benchmark-simd/Cargo.toml	Adds `schemars` dependency.
diskann-benchmark-runner/src/utils/num.rs	Implements `JsonSchema` for `NonNegativeFinite`.
diskann-benchmark-runner/src/utils/datatype.rs	Derives `JsonSchema` for `DataType`.
diskann-benchmark-runner/src/test/typed.rs	Updates test inputs/tolerances to derive `JsonSchema`.
diskann-benchmark-runner/src/test/dim.rs	Updates test inputs/tolerances to derive `JsonSchema`.
diskann-benchmark-runner/src/schema.rs	Adds schema renderer + path resolver + unit tests.
diskann-benchmark-runner/src/lib.rs	Exposes the new `schema` module.
diskann-benchmark-runner/src/input.rs	Requires `Input::Raw: JsonSchema` and adds `Registered::schema()` plumbing.
diskann-benchmark-runner/src/files.rs	Derives `JsonSchema` for `InputFile`.
diskann-benchmark-runner/src/app.rs	Adds `--schema`/`--field` CLI flags and wiring to render schema docs + example.
diskann-benchmark-runner/Cargo.toml	Adds `colored` + `schemars` dependencies.
Cargo.toml	Adds `schemars` to workspace dependencies.
Cargo.lock	Locks new `schemars`/`colored` (and transitive) dependencies.

+    /// Ensures the manual `JsonSchema` impl stays in sync with actual variants.
+    /// If a variant is added to `QuantizationType`, this match will fail to compile.
+    #[test]
+    fn schema_covers_all_quantization_variants() {


+            }
+            s
+        }
+        Some("number") => "number".to_string(),


+            let generator =
+                schemars::generate::SchemaSettings::default().into_generator();
+            let schema = generator.into_root_schema_for::<T::Raw>();
+            serde_json::to_value(schema).unwrap_or_default()


codecov-commenter · 2026-06-12T18:58:53Z

Codecov Report

❌ Patch coverage is 65.70122% with 225 lines in your changes missing coverage. Please review.
✅ Project coverage is 89.30%. Comparing base (d9ce362) to head (f764227).

Files with missing lines	Patch %	Lines
diskann-benchmark-runner/src/schema.rs	74.74%	124 Missing ⚠️
diskann-benchmark/src/inputs/graph_index.rs	41.66%	49 Missing ⚠️
diskann-benchmark-runner/src/app.rs	20.93%	34 Missing ⚠️
diskann-benchmark-runner/src/utils/num.rs	0.00%	9 Missing ⚠️
diskann-benchmark-runner/src/input.rs	0.00%	8 Missing ⚠️
...disk/src/build/configuration/quantization_types.rs	95.23%	1 Missing ⚠️

❌ Your patch status has failed because the patch coverage (65.70%) is below the target coverage (90.00%). You can increase the patch coverage or adjust the target coverage.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1154      +/-   ##
==========================================
- Coverage   89.47%   89.30%   -0.17%     
==========================================
  Files         486      487       +1     
  Lines       92161    92810     +649     
==========================================
+ Hits        82458    82883     +425     
- Misses       9703     9927     +224
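The percentages in the diff follow directly from the hit and line counts. A short sketch of that arithmetic, with hypothetical helper names (`coverage_percent` and `summarize` are not part of Codecov or the PR):

```rust
/// Line coverage as a percentage.
fn coverage_percent(hits: u64, lines: u64) -> f64 {
    hits as f64 / lines as f64 * 100.0
}

/// Summarize base vs. head coverage from `(hits, lines)` pairs.
fn summarize(base: (u64, u64), head: (u64, u64)) -> String {
    let before = coverage_percent(base.0, base.1);
    let after = coverage_percent(head.0, head.1);
    // The delta is computed from the unrounded values, then rounded.
    format!("{before:.2}% -> {after:.2}% ({:+.2}%)", after - before)
}
```

With the counts from the diff, `summarize((82458, 92161), (82883, 92810))` gives `89.47% -> 89.30% (-0.17%)`, matching the reported project coverage.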

Flag	Coverage Δ
miri	`89.30% <65.70%> (-0.17%)`	⬇️
unittests	`88.95% <65.70%> (-0.17%)`	⬇️

Files with missing lines	Coverage Δ
diskann-benchmark-runner/src/files.rs	`100.00% <ø> (ø)`
diskann-benchmark-runner/src/test/dim.rs	`89.79% <ø> (ø)`
diskann-benchmark-runner/src/test/typed.rs	`97.10% <ø> (ø)`
diskann-benchmark-runner/src/utils/datatype.rs	`100.00% <ø> (ø)`
diskann-benchmark-simd/src/lib.rs	`83.03% <ø> (ø)`
diskann-benchmark/src/inputs/disk.rs	`1.56% <ø> (ø)`
diskann-benchmark/src/inputs/exhaustive.rs	`26.83% <ø> (ø)`
diskann-benchmark/src/inputs/filters.rs	`67.74% <ø> (ø)`
diskann-benchmark/src/inputs/multi_vector.rs	`19.67% <ø> (ø)`
diskann-benchmark/src/utils/mod.rs	`83.33% <ø> (ø)`
... and 6 more

Adds schemars-based JSON Schema generation and a custom tree-style terminal renderer for benchmark input types. Users can run `inputs <name> --schema` to see field documentation with types, optionality, enum variants, and descriptions, followed by the example JSON.

Implementation:
- Add schemars 1.2 to workspace; derive JsonSchema on all input types
- Custom renderer in diskann-benchmark-runner/src/schema.rs with:
  - Colored output (field names bold, types cyan, variants yellow)
  - Multi-line description alignment
  - Handles internally/externally-tagged enums, newtypes with $ref
  - MAX_DEPTH guard against recursive schemas
- Manual JsonSchema impls for custom-serde types:
  - NonNegativeFinite (number with minimum)
  - StartPointStrategyRef (externally-tagged enum, with drift test)
  - QuantizationTypeSchema proxy (keeps schemars out of diskann-disk)
- JsonSchema bound added to Input::Raw trait

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

JordanMaples requested review from a team and Copilot June 12, 2026 18:42

Copilot started reviewing on behalf of JordanMaples June 12, 2026 18:42

Copilot AI reviewed Jun 12, 2026

JordanMaples and others added 4 commits June 15, 2026 13:35

Document --schema and --field CLI options in benchmark README

c3734a4

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

remove lifetime

d59a1c8

fmt

f764227

JordanMaples force-pushed the jordanmaples/benchmark_schema branch from 83b159e to f764227 June 15, 2026 20:38

JordanMaples wants to merge 4 commits into main from jordanmaples/benchmark_schema


3 participants
