WIP feat: swe bench scorer by tianmu-li · Pull Request #342 · mlcommons/endpoints

tianmu-li · 2026-06-05T23:06:54Z

What does this PR do?

Type of change

Bug fix
New feature
Documentation update
Refactor/cleanup

Related issues

Testing

Tests added/updated
All tests pass locally
Manual testing completed

Checklist

Code follows project style
Pre-commit hooks pass
Documentation updated (if needed)

github-actions · 2026-06-05T23:07:03Z

MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

gemini-code-assist

Code Review

This pull request introduces support for SWE-bench accuracy evaluation by adding a new accuracy-only SWEBench dataset, a SWEBenchScorer that runs evaluations using mini-swe-agent in an isolated environment, and associated configuration templates, tests, and runbooks. Feedback on the changes focuses on improving the robustness of the SWEBenchScorer implementation, specifically by safely handling missing or null values when parsing configuration templates, benchmark configurations, and evaluation results, as well as gracefully handling cases where the Docker binary is missing from the system's PATH during preflight checks.

gemini-code-assist · 2026-06-05T23:08:45Z

+        docker_result = subprocess.run(
+            ["docker", "version"],
+            stdout=subprocess.DEVNULL,
+            stderr=subprocess.PIPE,
+            timeout=10,
+        )
+        if docker_result.returncode != 0:
+            raise SetupError(
+                "Docker daemon is not running or docker is not on PATH. "
+                "Start Docker and retry."
+            )


If docker is not installed or not present on the system's PATH, calling subprocess.run(["docker", ...]) will raise a FileNotFoundError rather than returning a non-zero exit code. This will cause the preflight check to crash with an unhandled traceback. Checking if docker is on PATH using shutil.which and wrapping the execution in a try-except block ensures a clean SetupError is raised.

Suggested change

-        docker_result = subprocess.run(
-            ["docker", "version"],
-            stdout=subprocess.DEVNULL,
-            stderr=subprocess.PIPE,
-            timeout=10,
-        )
-        if docker_result.returncode != 0:
-            raise SetupError(
-                "Docker daemon is not running or docker is not on PATH. "
-                "Start Docker and retry."
-            )
+        if shutil.which("docker") is None:
+            raise SetupError(
+                "docker is not on PATH. Install Docker and retry."
+            )
+        try:
+            docker_result = subprocess.run(
+                ["docker", "version"],
+                stdout=subprocess.DEVNULL,
+                stderr=subprocess.PIPE,
+                timeout=10,
+            )
+        except Exception as e:
+            raise SetupError(f"Failed to execute docker command: {e}")
+        if docker_result.returncode != 0:
+            raise SetupError(
+                "Docker daemon is not running. Start Docker and retry."
+            )
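
A self-contained sketch of the check described in this comment, for readers who want to try it outside the PR. `SetupError` here is a stand-in class, `check_docker` is a hypothetical helper name, and the `which`/`run` parameters exist only so the logic can be exercised without Docker installed. It narrows the caught exceptions to `OSError` (which includes `FileNotFoundError`) and `subprocess.TimeoutExpired`, which is a design choice of this sketch rather than what the PR does.

```python
import shutil
import subprocess


class SetupError(RuntimeError):
    """Stand-in for the project's SetupError (assumed to be an exception type)."""


def check_docker(which=shutil.which, run=subprocess.run) -> None:
    """Raise SetupError if docker is missing, cannot be executed, or its daemon is down."""
    if which("docker") is None:
        raise SetupError("docker is not on PATH. Install Docker and retry.")
    try:
        result = run(
            ["docker", "version"],
            stdout=subprocess.DEVNULL,
            stderr=subprocess.PIPE,
            timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired) as exc:
        # FileNotFoundError is a subclass of OSError; a hung daemon hits the timeout.
        raise SetupError(f"Failed to execute docker command: {exc}") from exc
    if result.returncode != 0:
        raise SetupError("Docker daemon is not running. Start Docker and retry.")
```

Calling `check_docker()` with no arguments uses the real `shutil.which` and `subprocess.run`.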

gemini-code-assist · 2026-06-05T23:08:45Z

+        with self.swebench_config_template.open() as _f:
+            _tmpl = yaml.safe_load(_f)
+        if not isinstance((_tmpl or {}).get("model", {}).get("model_kwargs"), dict):
+            raise ValueError(
+                f"swebench template {self.swebench_config_template} must have a "
+                "'model.model_kwargs' dict; check the template structure."
+            )


The validation of the swebench_config_template structure is fragile: it raises an AttributeError if "model" is present but is not a dictionary (e.g., a string or None). The `_tmpl or {}` guard already covers an empty file that parses as None, but the chained `.get("model", {})` only falls back to `{}` when the key is absent. Validating each level with isinstance prevents these runtime crashes.

Suggested change

-        with self.swebench_config_template.open() as _f:
-            _tmpl = yaml.safe_load(_f)
-        if not isinstance((_tmpl or {}).get("model", {}).get("model_kwargs"), dict):
-            raise ValueError(
-                f"swebench template {self.swebench_config_template} must have a "
-                "'model.model_kwargs' dict; check the template structure."
-            )
+        with self.swebench_config_template.open() as _f:
+            _tmpl = yaml.safe_load(_f) or {}
+        model_cfg = _tmpl.get("model")
+        if not isinstance(model_cfg, dict) or not isinstance(model_cfg.get("model_kwargs"), dict):
+            raise ValueError(
+                f"swebench template {self.swebench_config_template} must have a "
+                "'model.model_kwargs' dict; check the template structure."
+            )
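
The level-by-level check can be tried on plain Python objects without a YAML file. The sketch below uses a hypothetical helper name, `has_model_kwargs`, and additionally rejects a top-level value that is not a mapping (for example a YAML list), which goes slightly beyond the suggestion above.

```python
def has_model_kwargs(template: object) -> bool:
    """Return True only if template["model"]["model_kwargs"] exists and is a dict."""
    if not isinstance(template, dict):
        # yaml.safe_load returns None for an empty file and a list for a YAML sequence.
        return False
    model_cfg = template.get("model")
    return isinstance(model_cfg, dict) and isinstance(model_cfg.get("model_kwargs"), dict)
```

A caller would raise the ValueError shown above whenever this returns False.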

gemini-code-assist · 2026-06-05T23:08:45Z

+        model_params = benchmark_config_dict.get("model_params", {})
+        endpoints = benchmark_config_dict.get("endpoint_config", {}).get(
+            "endpoints", []
+        )


If endpoint_config or model_params is defined as empty or null in the YAML configuration, benchmark_config_dict.get(...) can return None. Calling .get() on None will raise an AttributeError. Using the or {} pattern ensures safe dictionary access.

        model_params = benchmark_config_dict.get("model_params") or {}
        endpoint_cfg = benchmark_config_dict.get("endpoint_config") or {}
        endpoints = endpoint_cfg.get("endpoints", [])
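
The underlying behavior is that `dict.get(key, default)` only uses the default when the key is absent, not when its value is None. A small sketch, with `extract_endpoints` as a hypothetical helper name and a hand-built dict standing in for the parsed YAML:

```python
def extract_endpoints(benchmark_config_dict: dict) -> tuple:
    """Return (model_params, endpoints), treating null YAML values as empty."""
    model_params = benchmark_config_dict.get("model_params") or {}
    endpoint_cfg = benchmark_config_dict.get("endpoint_config") or {}
    # `or []` also covers an explicit `endpoints: null`.
    endpoints = endpoint_cfg.get("endpoints") or []
    return model_params, endpoints


# A YAML key with no value parses to None, so the default passed to .get() is ignored.
parsed = {"model_params": None, "endpoint_config": None}
assert parsed.get("endpoint_config", {}) is None
assert extract_endpoints(parsed) == ({}, [])
```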

gemini-code-assist · 2026-06-05T23:08:45Z

+        submitted = result.get("submitted_instances", 0)
+        resolved = result.get("resolved_instances", 0)


If the SWE-bench evaluation results JSON contains null values for submitted_instances or resolved_instances, .get(..., 0) will return None instead of 0. This will subsequently cause a TypeError when calculating the resolved rate. Using or 0 ensures that any None values are safely defaulted to 0.

Suggested change

-        submitted = result.get("submitted_instances", 0)
-        resolved = result.get("resolved_instances", 0)
+        submitted = result.get("submitted_instances") or 0
+        resolved = result.get("resolved_instances") or 0
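
Combined with the zero-submission guard mentioned in the commit messages (a `submitted_instances=0` result yields a None score), the score computation might look like the sketch below. `resolved_rate` is a hypothetical function name, and the result dict is assumed to have only the two keys shown in the diff.

```python
from typing import Optional


def resolved_rate(result: dict) -> Optional[float]:
    """Return resolved / submitted, or None when nothing was submitted."""
    submitted = result.get("submitted_instances") or 0
    resolved = result.get("resolved_instances") or 0
    if submitted == 0:
        # Avoid ZeroDivisionError; there is no meaningful score without submissions.
        return None
    return resolved / submitted
```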

Adds SWEBenchScorer (scorer_id="swe_bench_scorer") and a predefined SWEBench dataset that downloads from princeton-nlp/SWE-bench_Verified or SWE-bench_Lite on HuggingFace. The scorer patches a committed swebench_template.yaml with model name, endpoint URL, and all sampling parameters from model_params at runtime (no duplication into extras), then runs mini-swe-agent and grades predictions with the swebench evaluation harness, reporting resolved_instances / submitted_instances as the accuracy score. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

swebench.harness.run_evaluation writes its result JSON to its CWD. The eval subprocess was running with cwd=self.mini_swe_agent_dir, scattering files in the user's external venv directory. Change CWD to output_dir (report_dir/swe_bench_output/) so all outputs are self-contained under the report directory. Update the result file lookup accordingly and fix the test fixture to match. Add examples/10_SWEBench_Example/accuracy/RUNBOOK.md with venv setup and smoke-test instructions, mirroring the VBench runbook pattern. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Replace the external mini-swe-agent venv model with an in-repo uv subproject at examples/10_SWEBench_Example/accuracy/ (pyproject.toml + uv.lock), mirroring how VBenchScorer isolates vbench dependencies.
- Add accuracy/pyproject.toml pinning mini-swe-agent==2.3.0 and swebench==4.1.0 with package=false; generate uv.lock
- Rename _DEFAULT_MINI_SWE_AGENT_DIR/_MINI_SWE_AGENT_DIR_ENV to _DEFAULT_SWE_BENCH_PROJECT_PATH/_SWE_BENCH_PROJECT_PATH_ENV; default now points into the repo (examples/10_SWEBench_Example/accuracy)
- _run_subprocess now wraps commands with `uv run --project <path>` instead of manually activating a .venv; drop manual env patching
- Init check now validates pyproject.toml presence, not .venv/bin/python
- Update RUNBOOK to reflect `uv sync` setup; remove test-dev paths
- Update swe_bench_accuracy.yaml: remove hardcoded /home/user path, rename key to swe_bench_project_path
- Update tests: rename fixture mini_swe_dir → swe_bench_project, parameter mini_swe_agent_dir → swe_bench_project_path
- Exclude accuracy subproject uv.lock files from the large-file hook
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Minimal end-to-end smoke config (5 perf samples, 1 accuracy instance) for validating the SWEBenchScorer pipeline without running a full 600-second benchmark. Uses a 1-row JSONL for the accuracy phase to avoid issuing 500 predefined-dataset requests to the endpoint before the scorer runs. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Adds a multi-turn + SWE-bench accuracy smoke config (swe_bench_multiturn_smoke.yaml) and its minimal 2-conversation perf dataset. Confirmed end-to-end: multi-turn perf phase completes, then SWEBenchScorer runs mini-swe-agent + harness and returns a score without errors (exit 0). Also adds a comment to swe_bench_accuracy.yaml noting that the perf dataset can be replaced with a multi-turn dataset without other changes. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…phase Adds Dataset.ACCURACY_ONLY class variable (default False). SWEBench sets it to True — problem statements sent directly to the model without an agent framework don't reflect real SWE-bench usage, so using the predefined dataset as a performance dataset is now rejected with InputValidationError. The check in _load_datasets() fires before create_loader() via a PREDEFINED lookup, so it gives a clear error rather than a confusing downstream failure. Updates swe_bench_accuracy.yaml, swe_bench_accuracy_smoke.yaml, and swe_bench_multiturn_smoke.yaml to use an explicit JSONL for the perf phase. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
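
A minimal sketch of the class-variable gate this commit describes. Only the names `Dataset`, `SWEBench`, `ACCURACY_ONLY`, `PREDEFINED`, and `InputValidationError` come from the commit message; the registry shape, the exception's base class, and the helper `reject_accuracy_only_perf_dataset` are assumptions made for illustration.

```python
class InputValidationError(ValueError):
    """Stand-in for the project's InputValidationError (base class is an assumption)."""


class Dataset:
    ACCURACY_ONLY = False  # default: usable in the performance phase


class SWEBench(Dataset):
    ACCURACY_ONLY = True  # only meaningful when driven by an agent framework


# Hypothetical registry; the real PREDEFINED lookup may be shaped differently.
PREDEFINED = {"swe_bench": SWEBench, "generic": Dataset}


def reject_accuracy_only_perf_dataset(name: str) -> None:
    """Fail early if an accuracy-only dataset is configured for the perf phase."""
    dataset_cls = PREDEFINED.get(name)
    if dataset_cls is not None and dataset_cls.ACCURACY_ONLY:
        raise InputValidationError(
            f"Dataset {name!r} is accuracy-only and cannot be used as a performance dataset."
        )
```

Running the check before any loader is created is what turns a confusing downstream failure into a clear configuration error.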

- Replace two-branch subset ternary with _SWE_BENCH_HF_MAP dict lookup; validate subset in __init__ so unknown values (e.g. "full") raise ValueError at construction time rather than silently scoring against the wrong dataset.
- Hoist `import yaml` to module top level; declare pyyaml==6.0.3 in pyproject.toml (was already a de-facto transitive dep, now explicit).
- Add msgspec.json.decode(..., type=dict) on harness result file so a non-dict JSON response raises DecodeError instead of AttributeError.
- Validate swebench template schema at construction: raise ValueError if model.model_kwargs is missing, surfacing bad templates early.
- Add unit tests for all four fixes.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
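
The first item can be illustrated with a small lookup. The two HuggingFace repository names are the ones named in the first commit message; the lowercase keys `"verified"` and `"lite"` and the helper name `resolve_hf_dataset` are assumptions of this sketch.

```python
_SWE_BENCH_HF_MAP = {
    "verified": "princeton-nlp/SWE-bench_Verified",
    "lite": "princeton-nlp/SWE-bench_Lite",
}


def resolve_hf_dataset(subset: str) -> str:
    """Map a subset name to its HuggingFace repo, rejecting unknown subsets."""
    try:
        return _SWE_BENCH_HF_MAP[subset]
    except KeyError:
        raise ValueError(
            f"Unknown subset {subset!r}; expected one of {sorted(_SWE_BENCH_HF_MAP)}"
        ) from None
```

With a two-branch ternary, any value other than the first branch falls through to the second, so `"full"` would silently select a dataset; the dict lookup makes that an error.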

Scorer.preflight() class method hook (no-op by default) is called in _load_datasets for every accuracy scorer before the perf phase begins. SWEBenchScorer overrides it to verify:
1. uv is on PATH
2. mini-extra is runnable in the accuracy subproject (uv sync was run)
3. swebench is importable in the subproject
4. Docker daemon is running
All four raise SetupError with actionable messages so a misconfigured environment is caught upfront rather than after a potentially long performance run.
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
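
The hook pattern itself is small. The sketch below shows only the first of the four checks (a tool on PATH); `SetupError` is a stand-in class, and `ToolCheckingScorer` and `REQUIRED_TOOLS` are hypothetical names that do not appear in the PR.

```python
import shutil


class SetupError(RuntimeError):
    """Stand-in for the project's SetupError."""


class Scorer:
    @classmethod
    def preflight(cls) -> None:
        """No-op by default; subclasses override to verify their environment."""


class ToolCheckingScorer(Scorer):
    # Hypothetical attribute listing executables that must be on PATH.
    REQUIRED_TOOLS = ("uv",)

    @classmethod
    def preflight(cls) -> None:
        missing = [tool for tool in cls.REQUIRED_TOOLS if shutil.which(tool) is None]
        if missing:
            raise SetupError(
                f"Missing required tools on PATH: {', '.join(missing)}. Install them and retry."
            )
```

Because `preflight` is a class method, the benchmark can call it on each scorer class before constructing anything or starting the performance run.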

Signed-off-by: Li, Tianmu <tianmu.li@intel.com>

num_instances=100, workers=10, max_eval_workers=10, subset="verified" reflect the intended accuracy evaluation target out of the box. Removes the None fallback for num_instances and cleans up redundant extras from YAML examples. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Add four tests to match VBenchScorer coverage depth:
- subprocess non-zero exit code raises RuntimeError
- subprocess timeout raises RuntimeError
- result file found via glob fallback when exact name absent
- submitted_instances=0 guard returns None score
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

VBench does not have smoke variants; these files used dummy perf data and a 1-instance accuracy run that users can achieve via --extras num_instances=1 on the CLI. No other files reference them. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Scorer.preflight: collapse to a one-liner (matches available_scorers style)
- SWEBenchScorer class docstring: drop numbered step walkthrough and extras parameter table; keep the what and the uv-run isolation rationale
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Remove test_score_result_non_dict_raises_decode_error: it asserts on msgspec behavior, not SWEBenchScorer logic
- Merge subset pair into test_subset_maps_to_correct_hf_dataset_name with parametrize
- Merge slice pair into test_num_instances_produces_correct_slice with parametrize
20 test functions, 22 pytest cases (was 23 functions, 23 cases)
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…dir to absolute SWEBenchScorer delegates entirely to mini-extra and never reads endpoint responses, so the accuracy endpoint phase was sending 500 samples through the endpoint for nothing. Add SKIP_ENDPOINT_PHASE class variable to Scorer (default False, True on SWEBenchScorer) and guard _build_phases() and the total_samples count with it. Also add external_sample_count() classmethod so scorers that skip the endpoint phase can still surface their sample count in the setup log. SWEBenchScorer returns num_instances from extras. Fix a second bug where a relative report_dir (e.g. logs/foo launched from repo root) caused mini-extra to fail with FileNotFoundError on the patched config YAML because all derived paths were relative and mini-extra runs with cwd=output_dir. Resolving self.report_dir to absolute in SWEBenchScorer.__init__ fixes all derived paths at once. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
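
The relative-path bug can be sketched with `pathlib`. The `swe_bench_output` directory name comes from an earlier commit message; placing `config.yaml` directly inside it and the helper name `derive_output_paths` are assumptions of this sketch.

```python
from pathlib import Path


def derive_output_paths(report_dir) -> tuple:
    """Resolve report_dir once so every derived path stays valid under any cwd."""
    root = Path(report_dir).resolve()  # absolute, anchored at the launch directory
    output_dir = root / "swe_bench_output"
    patched_config = output_dir / "config.yaml"  # assumed location of the patched config
    return output_dir, patched_config
```

If `report_dir` stayed relative (for example `logs/foo`), a subprocess started with `cwd=output_dir` would interpret the config path relative to `output_dir` itself and fail to find it.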

SWEBenchScorer never uses sample_index_map (it scores via mini-extra subprocess), but Scorer.__init__ unconditionally called _load_sample_index_map, which KeyErrors when the accuracy phase was skipped and 'swe_bench' was therefore absent from sample_idx_map.json. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

- Double /v1 in _patch_config: strip trailing /v1 from endpoint URL before appending it, so http://host/v1 doesn't become /v1/v1
- Falsy chat_template_kwargs: use `is not None` guard to match the pattern used for every other model field in the same method
- slice_str capped: min(num_instances, actual dataset rows) so mini-swe-agent isn't asked for more instances than exist
- Corrupt parquet cache: wrap read_parquet in try/except with an actionable message pointing to force=True
- Duplicate HF repo map: remove _SWE_BENCH_HF_MAP from scoring.py and import _REPO_MAP from the dataset module as the single source
- Duplicate subprocess helper: extract _run_subprocess_with_log and have both VBenchScorer and SWEBenchScorer delegate to it
- Bare assert on dataset.data: replace with conditional len() that handles None without crashing
- None guard in base Scorer.score(): assert sample_index_map is not None with a clear message for future SKIP_ENDPOINT_PHASE scorers
- Collapse two adjacent SKIP_ENDPOINT_PHASE loops in setup_benchmark into one; change strict=False to strict=True
- config.yaml guard in SWEBenchScorer.score(): raise FileNotFoundError with scorer context instead of bare open() failure
- Test isolation: clean up _test_skip_endpoint_phase from Scorer.PREDEFINED in a finally block so it doesn't leak into test_scorer_enum_matches_registry
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
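
The double-`/v1` fix amounts to normalizing the URL before appending the suffix. A sketch under the assumption that the endpoint is a plain base URL with no query string; `normalize_base_url` is a hypothetical name, not the PR's `_patch_config`.

```python
def normalize_base_url(endpoint_url: str) -> str:
    """Return the endpoint URL ending in exactly one /v1."""
    base = endpoint_url.rstrip("/")
    if base.endswith("/v1"):
        base = base[: -len("/v1")]  # drop the existing suffix before re-adding it
    return base + "/v1"
```

Both `http://host` and `http://host/v1` then map to the same value, so users can configure either form.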

- Call scorer_cls.preflight() in _load_datasets for each accuracy scorer so environment checks (uv, mini-extra, Docker) run before the perf phase; preflight is a no-op on all existing scorers
- Fix unreachable None guard on num_samples in finalize_benchmark: collapsed the len() call and the dict-literal conditional into a single guarded assignment so data=None is handled safely
- Move import yaml to the third-party import group in scoring.py (ruff/isort)
- Remove duplicate _run_benchmark_async import in test_benchmark.py
- Lift inline imports in TestAccuracyOnlyDataset to module level
- Add test verifying a failing preflight() propagates SetupError before run
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

- Restore corrupted Apache 2.0 license header in scoring.py
- Move import yaml into sorted third-party block (ruff I001)
- Add pass body and clarify noqa comment on base Scorer.preflight
- Add type annotation and assert guards for sample_index_map (fixes mypy)
- Change zip(strict=False) to strict=True for accuracy_datasets/eval_configs
- Move _SelfContainedScorer from inside test method to module level; remove lazy imports
- Filter _-prefixed test-only scorers from TestScorerMethodSync registry check
- Add test_external_sample_count covering all branches of SWEBenchScorer.external_sample_count
- Collapse two SKIP_ENDPOINT_PHASE loops in setup_benchmark into one sum comprehension
Cherry-picked from c57bf66 (endpoints_2)

SWE-bench is the accuracy evaluation for the multi-turn agentic workload, so its config, agent template, accuracy subproject, and runbook belong in 09_MultiTurn/ rather than a separate 10_SWEBench_Example/ folder.
- Rename examples/10_SWEBench_Example/ → examples/09_MultiTurn/ (4 files)
- Update _DEFAULT_SWE_BENCH_PROJECT_PATH and _DEFAULT_SWE_BENCH_TEMPLATE constants in scoring.py to point at the new location
- Update all self-referential paths in swe_bench_accuracy.yaml, RUNBOOK.md, and pyproject.toml
- Add SWE-bench Accuracy section to examples/09_MultiTurn/README.md
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

…accuracy.yaml dummy_1k.jsonl uses the column name text_input; the OpenAI adapter's ColumnFilter requires prompt. Without the parser remap the benchmark fails at dataset load with a KeyError on the prompt column. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Signed-off-by: Li, Tianmu <tianmu.li@intel.com>

The branch introduced blank lines that violated ruff I001 in both files. Aligning with the main-branch import layout so pre-commit passes. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

- Add missing scorer_cls.preflight() call in the perf-dataset accuracy branch of _load_datasets(); the accuracy-dataset path already called it but the perf-dataset path did not
- Remove redundant bare assert self.sample_index_map is not None in Scorer.score(); the same condition is already checked with a detailed message four lines earlier
- Lift execute_mod module import to module level in test_benchmark.py; remove three inline imports from test_preflight_error_propagates (Scorer, SetupError already at module level)
- Remove inline import random from TestAggregatorArgs._make_ctx (random already at module level)
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

SWE-bench is the accuracy evaluation for the agentic inference workload, so its config, template, accuracy subproject, and runbook belong in 10_Agentic_Inference/ alongside the kimi benchmark files.
- Rename examples/09_MultiTurn/{swe_bench_accuracy.yaml,swebench_template.yaml, accuracy/} → examples/10_Agentic_Inference/
- Remove examples/09_MultiTurn/README.md (SWE-bench section merged into examples/10_Agentic_Inference/README.md)
- Update _DEFAULT_SWE_BENCH_PROJECT_PATH and _DEFAULT_SWE_BENCH_TEMPLATE constants in scoring.py
- Update all self-referential paths in RUNBOOK.md and pyproject.toml
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

- Move preflight() call before dataset load to fail fast on missing deps
- Inject target_concurrency as workers into swe_bench_scorer extras via schema validator; explicit extras.workers is never overridden
- Fix score() base-class assert message to accurately describe the failure
- Replace assert on dataset.dataframe with explicit RuntimeError
- Standardise log/docstring terminology: mini-extra swebench vs mini-swe-agent
- Move _FailingPreflightScorer to module scope; remove try/finally cleanup
- Remove redundant _write_sample_idx_map calls in scorer unit tests
- Add HF cold-start download comment to swe_bench_accuracy.yaml
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
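
The workers injection in the second item follows a "default unless set explicitly" rule. The sketch below shows that rule on a plain dict; the PR does this inside a schema validator, and `inject_workers` is a hypothetical name used only here.

```python
from typing import Optional


def inject_workers(extras: dict, target_concurrency: Optional[int]) -> dict:
    """Copy extras, filling in workers from target_concurrency only if it is unset."""
    merged = dict(extras)  # avoid mutating the caller's config
    if target_concurrency is not None:
        # setdefault leaves an explicit extras["workers"] untouched.
        merged.setdefault("workers", target_concurrency)
    return merged
```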

- Fix zip(strict=True) crash when perf dataset carries accuracy_config: iterate eval_configs directly instead of zipping with accuracy_datasets
- Guard model_params["name"] in _patch_config with a clear ValueError
- Remove max_new_tokens from swe_bench_accuracy.yaml (let model run unbounded)
- Switch example YAML to concurrency load pattern so workers auto-injection fires
- Fix num_samples=0 in report for SKIP_ENDPOINT_PHASE datasets by falling back to dataframe length when dataset.data is None
- Log warning when num_instances exceeds dataset size in SWEBenchScorer.score()
- Deduplicate _resolve_project_path by extracting _resolve_subproject_path helper
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

- Extract _make_fake_run() helper in test_swe_bench_scorer.py to eliminate repeated mini-extra preamble across four tests
- Merge TestPatchConfigMissingName into TestSWEBenchScorer (one-method class with no organizational benefit)
- Remove test_column_names from test_swe_bench_dataset.py (duplicate of column assertions already in test_downloads_and_caches/test_lite_subset)
- Move two concurrency-injection tests from TestAccuracyOnlyDataset into TestCLIConfigModels where config-construction validation tests belong
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

gemini-code-assist Bot reviewed Jun 5, 2026

tianmu-li force-pushed the feat/swe_bench_scorer branch from 89af1d1 to 6016c01 Compare June 8, 2026 22:52

tianmu-li and others added 18 commits June 15, 2026 04:20

Remove unintended changes

57c30f9

Signed-off-by: Li, Tianmu <tianmu.li@intel.com>

tianmu-li force-pushed the feat/swe_bench_scorer branch from aa7f559 to bb9b307 Compare June 15, 2026 05:16

tianmu-li and others added 8 commits June 15, 2026 05:40

Trim example doc and yaml

e7d1f65

Signed-off-by: Li, Tianmu <tianmu.li@intel.com>

fix: restore import order in dataset.py and shopify __init__.py

4dcabe7

tianmu-li and others added 2 commits June 15, 2026 06:58

WIP feat: swe bench scorer #342
tianmu-li wants to merge 28 commits into mlcommons:main from tianmu-li:feat/swe_bench_scorer
