UniPat-AI/EvoCodeBench

EvoCode-Bench: Evaluating Coding Agents in Multi-Turn Iterative Interactions

News

2026-06-20 — Results re-release (v2). The leaderboard, per-task results, and trajectories published before this date are superseded. We re-ran the entire benchmark on a patched harness after fixing an evaluation-integrity leak in Harbor's shared multi-step verifier mode (agents could read the grader during their own turn; reported upstream as #1960 / #1961), one contaminated task, and 11 task/test defects. See Known issues & responsible disclosure and CHANGELOG.md. Treat any EvoCode-Bench number dated before 2026-06-20 as outdated.

June 2026. EvoCode-Bench runs on the Harbor official multi-step task format. Each task is a sequence of [[steps]] run in one persistent container, with a per-step verifier after each step and trial-level reward aggregation.

EvoCode-Bench tests whether coding agents can keep a project working as user requests change. It contains 26 stateful coding tasks and 227 evaluated rounds (Harbor steps). Each task keeps the same workspace and agent session for 5-15 rounds, while cumulative executable tests check new requirements and still-active prior requirements.

The original paper evaluation used a different runner (harbor_multiturn) and the MT@4 / SR / Comp metrics. That framework, the legacy task layout, and the paper results are kept under legacy/ for reproducibility — not needed for normal use.

Overview

Most coding benchmarks evaluate one specification followed by one final assessment. EvoCode-Bench instead evaluates an interactive coding session. Later rounds inherit earlier implementation decisions, dependencies, file layouts, API choices, and test behavior. Each round (Harbor step) is scored by a cumulative verifier, and the trial reward is the mean of the per-step rewards.

The benchmark is organized along two axes from the paper:

| Engineering activity | Explorative | Contractual | Document-driven | Total |
|---|---|---|---|---|
| Construction | 9 / 80 | 3 / 37 | 1 / 7 | 13 / 124 |
| Spec Evolution | 1 / 8 | 1 / 7 | 1 / 7 | 3 / 22 |
| Review | 3 / 21 | 1 / 7 | 1 / 9 | 5 / 37 |
| Migration | 3 / 29 | 1 / 7 | 1 / 8 | 5 / 44 |
| Total | 16 / 138 | 6 / 58 | 4 / 31 | 26 / 227 |

Each cell reports tasks / rounds. A round maps one-to-one to a Harbor step.

Task Format

EvoCode-Bench tasks use the Harbor official multi-step layout — one sub-directory per step under steps/, executed in the order declared by the [[steps]] array in task.toml:

task/
├── task.toml                       # metadata + [[steps]] list + reward strategy
├── environment/
│   └── Dockerfile                  # single container shared across all steps
└── steps/
    ├── round-1/
    │   ├── instruction.md          # this round's user request (WHAT, not HOW)
    │   ├── solution/solve.sh        # reference delta for this round
    │   └── tests/test.sh           # cumulative tests through this round
    ├── round-2/
    │   ├── instruction.md
    │   ├── solution/solve.sh
    │   └── tests/test.sh
    └── round-N/ ...

task.toml follows the official schema (schema_version = "1.2"):

schema_version = "1.2"
multi_step_reward_strategy = "mean"      # trial reward = mean of per-step rewards

[metadata]
name = "service-mesh-health-router"
difficulty = "hard"
category = "systems-networking"

[metadata.requirement_chain]
num_steps = 8

[[metadata.requirement_chain.steps]]
step = "round-1"
change_types = ["extension"]
# ... one entry per step (extension / correction / conflict)

[agent]
timeout_sec = 1800.0                      # global default; override per step via [steps.agent]

[verifier]
timeout_sec = 1800.0                      # global default; override per step via [steps.verifier]

[environment]
build_timeout_sec = 600.0
cpus = 1
memory_mb = 4096
storage_mb = 10240

[[steps]]
name = "round-1"                          # matches steps/round-1/

[[steps]]
name = "round-2"
# ... one [[steps]] entry per step, in execution order

The task format is built around three constraints:

  • Persistent workspace: the same Docker container carries files, dependencies, and generated artifacts across steps.
  • Continuous agent session: the agent receives a sequence of user requests rather than independent prompts.
  • Cumulative tests: round i verifies every still-active requirement from rounds 1..i, so regressions are caught immediately. Each step's tests/test.sh writes a binary reward to /logs/verifier/reward.txt.

Framework

EvoCode-Bench's standard multi-step evaluation runs on upstream Harbor — the same framework used by Terminal-Bench 2.0 — using its native multi-step support. No fork is required to run a full task (all steps).

uv tool install harbor      # or: pip install harbor
harbor run --help

Upstream Harbor's official multi-step runner provides:

  • native [[steps]] sequencing in the order declared in task.toml;
  • a single persistent Docker workspace shared across all steps;
  • a continuous agent session across steps;
  • a per-step verifier run against the cumulative test suite after each step;
  • trial-level reward aggregation via multi_step_reward_strategy (mean for EvoCode-Bench).

Need single-round fast-forward (SR), or want to reproduce the paper? See legacy/ — it covers our Harbor fork and the original harbor_multiturn framework. Not required for normal use.

Quick Start

1. Prerequisites

  • Python 3.11+ (the evaluation/*.py helpers use the stdlib tomllib).
  • Docker running, or a remote Daytona target.
  • A model endpoint for your agent.

Install the Harbor CLI:

# uv runs the Harbor CLI. See https://docs.astral.sh/uv/getting-started/installation/
curl -LsSf https://astral.sh/uv/install.sh | sh
uv tool install harbor      # or: pip install harbor

The upstream Harbor package (pip install harbor) is all you need to run full tasks (all steps).

2. Prepare Tasks

Download the released EvoCode-Bench task directories from Hugging Face and place them under data/EvoCodeBench. If you already have the Terminal-X repository, the tasks are also available under Terminal-X/data/EvoCodeBench/.

3. Configure Model Endpoint

For the claude-code agent:

export AGENT_TYPE="claude-code"
export AGENT_MODEL="claude-opus-4-7"
export ANTHROPIC_BASE_URL="https://api.your-provider.com"
export ANTHROPIC_AUTH_TOKEN="sk-..."

For the terminus-2 agent (OpenAI-compatible):

export AGENT_TYPE="terminus-2"
export AGENT_MODEL="openai/gpt-5.5"
export OPENAI_API_KEY="sk-..."
export OPENAI_API_BASE="https://api.your-provider.com/v1"

4. Validate the Dataset

python evaluation/validate_dataset.py data/EvoCodeBench

The released benchmark should report 26 tasks and 227 steps.

5. Run One Task

# Agent (pass@1 by default; set AGENT_ATTEMPTS for pass@k)
AGENT_TYPE=claude-code AGENT_MODEL=claude-opus-4-7 \
  ./evaluation/run_single.sh data/EvoCodeBench/theme_d1_w1_code_build_greenfield_implementation agent

# Oracle verification (reference solutions; should score 1.0 on every step)
./evaluation/run_single.sh data/EvoCodeBench/theme_d1_w1_code_build_greenfield_implementation oracle

# No-op baseline (empty submission; should score 0)
./evaluation/run_single.sh data/EvoCodeBench/theme_d1_w1_code_build_greenfield_implementation nop

6. Run All Tasks

AGENT_TYPE=claude-code AGENT_MODEL=claude-opus-4-7 \
  ./evaluation/run_all.sh data/EvoCodeBench agent

Each task writes Harbor outputs under:

data/EvoCodeBench/<task>/harbor_jobs/<model>/

Metrics

Each step is scored with a binary reward — 1 if all of that step's key requirements pass, 0 otherwise — written by the verifier to /logs/verifier/reward.txt. Harbor aggregates a trial's per-step rewards into a trial-level reward via multi_step_reward_strategy = "mean".

The score is the mean per-step reward:

  • per-task score = (passed steps) / (total steps) for the trial;
  • dataset score = mean of per-task scores across the 26 tasks (reported ×100 in Results).
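
The two definitions above can be sketched in a few lines of Python. This illustrates the arithmetic only and is not the code in evaluation/compute_metrics.py; the function names and the example rewards are made up. Steps the chain never reached are simply absent from the reward list and therefore count as 0.

```python
from statistics import mean


def task_score(step_rewards: list[int], total_steps: int) -> float:
    """Passed steps / total steps; unreached steps contribute 0."""
    if len(step_rewards) > total_steps:
        raise ValueError("more rewards than declared steps")
    return sum(step_rewards) / total_steps


def dataset_score(task_scores: list[float]) -> float:
    """Unweighted mean of per-task scores (each task counts equally)."""
    return mean(task_scores)


if __name__ == "__main__":
    # Hypothetical 8-step task whose chain aborted after step 5.
    aborted = task_score([1, 1, 0, 1, 0], total_steps=8)
    # Hypothetical 5-step task with every step passed.
    perfect = task_score([1, 1, 1, 1, 1], total_steps=5)
    print(aborted, perfect, dataset_score([aborted, perfect]))  # 0.375 1.0 0.6875
```

Because the dataset score averages per-task scores rather than pooling all 227 steps, a 5-step task carries the same weight as a 15-step task.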

A complementary case score uses the same shape at finer granularity. Each step's verifier also reports per-test-case results (CASE_SUMMARY total_cases=… success_count=…). Define:

  • per-step case ratio = success_count / total_cases for the step (a step whose code fails to build, or that the chain never reached, has ratio 0);
  • per-task case score = mean of the per-step case ratios over the task's steps;
  • dataset case score = mean of per-task case scores across the 26 tasks ×100.

The dataset score is all-or-nothing per step, so it rewards finishing a step exactly; the case score credits partial progress (e.g. passing 44 of 45 cases). A large gap between the two for a model means it gets most of the work right but rarely lands a whole step.
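
To make the gap concrete, the sketch below scores one hypothetical two-step task both ways, using the 44-of-45 example above. The function names and case counts are illustrative assumptions, and treating a step with zero reported cases as ratio 0 is a choice made for this sketch.

```python
def step_case_ratio(success_count: int, total_cases: int) -> float:
    """success_count / total_cases; a step with no reported cases counts as 0."""
    if total_cases == 0:
        return 0.0
    return success_count / total_cases


def task_case_score(case_counts: list[tuple[int, int]], total_steps: int) -> float:
    """Mean per-step case ratio over all declared steps (unreached steps add 0)."""
    ratios = [step_case_ratio(ok, total) for ok, total in case_counts]
    return sum(ratios) / total_steps


def task_step_score(case_counts: list[tuple[int, int]], total_steps: int) -> float:
    """All-or-nothing variant: a step passes only if every case passes."""
    passed = sum(1 for ok, total in case_counts if total > 0 and ok == total)
    return passed / total_steps


if __name__ == "__main__":
    # Hypothetical task: step 1 misses one case, step 2 passes everything.
    counts = [(44, 45), (50, 50)]
    print(task_step_score(counts, total_steps=2))            # 0.5
    print(round(task_case_score(counts, total_steps=2), 3))  # 0.989
```

A single missed case halves the step-based score here while barely moving the case score, which is the pattern described above.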

python evaluation/compute_metrics.py \
  --tasks-dir data/EvoCodeBench \
  --results-dir data/EvoCodeBench \
  --model claude-opus-4-7          # score one agent; add --json for machine-readable output

--model selects the harbor_jobs/<model>/ results to score (the oracle and nop baselines are excluded by default).

The paper's MT@4 / SR / Comp metrics and single-round fast-forward evaluation live in legacy/.

Results

Evaluated on the current dataset release with the Harbor official multi-step runner: full 5–15 round chains, one attempt per task (no best-of-k). The score is the dataset score defined in Metrics — the mean over 26 tasks of each task's passed_rounds / total_rounds. oracle scores 1.0 and nop scores 0 on every task. On the hardest, longest tasks some agents exhaust the 30-minute-per-round time budget and the chain aborts before later rounds; those rounds count as 0 (see CHANGELOG.md).

Numbers below are the 2026-06-20 re-release. The previous (June 13–16) leaderboard is superseded — see Known issues & responsible disclosure. The old values are shown in parentheses where they moved.

| Agent | Reasoning | Dataset score | Case score | Perfect tasks |
|---|---|---|---|---|
| Claude-Opus-4.8 | effort xhigh | 59.1 (42.5) | 96.6 (89.9) | 9/26 |
| GPT-5.5 | effort high | 29.5 (23.5) | 81.8 (77.2) | 0/26 |
| MiniMax-M3 | thinking adaptive | 23.4 (15.2) | 61.5 (69.2) | 2/26 |
| GLM-5.2 | thinking on¹ | 16.2 | 47.3 | 1/26 |
| DeepSeek-V4-Pro | effort high | 14.1 (10.8) | 61.7 (58.3) | 1/26 |
| Kimi-K2.6 | thinking on¹ | 13.2 (23.1) | 65.7 (75.2) | 0/26 |
| DeepSeek-V4-Flash | effort high | 12.2 (4.6) | 58.9 (52.5) | 0/26 |
| Qwen3.7-Max | thinking on¹ | 11.9 (7.6) | 67.4 (64.7) | 0/26 |
| Qwen3.6-Plus | thinking on¹ | 9.7 (10.1) | 67.7 (64.4) | 0/26 |
| Kimi-K2.7-Code | thinking on¹ | 7.8 | 45.4 | 0/26 |
| GLM-5.1 | thinking on¹ | 5.9 (6.3) | 52.5 (48.4) | 0/26 |
| MiniMax-M2.7 | reasoning split | 5.1 (0.8) | 44.9 (42.6) | 0/26 |

GLM-5.2 and Kimi-K2.7-Code are new in this release (no prior value).

Dataset score is the mean per-task score ×100, where a task's score is passed_rounds / total_rounds and a round is "passed" only if it earns the binary reward 1 (every test case of that round passes).

Case score is the finer-grained companion. For each task, take each round's passed_test_cases / total_test_cases, average over the task's rounds (a round whose code fails to build, or that the chain never reached, counts as 0), then average over the 26 tasks ×100. It credits the partial progress the all-or-nothing round reward hides — e.g. GPT-5.5 scores 29.5 on rounds but passes 81.8% of test cases, because it often misses a round by just one or two cases. Both scores rank Opus-4.8 first, but the case score spreads the field more smoothly.

Reasoning is the thinking configuration used for each model: models with an effort knob ran at the listed level (Opus at its highest, xhigh; the rest at high); ¹ models without an effort knob ran with their native thinking simply enabled (Qwen enable_thinking, GLM/Kimi thinking.type=enabled), and MiniMax M3/M2.7 used their adaptive / split reasoning modes. All agents used the terminus-2 scaffold. Per-task / per-round / per-test-case detail: evaluation/sweeps/sweep_2026-06_single_shot.csv and the interactive results site.

Explore the results interactively → unipat-ai.github.io/EvoCodeBench — one page per task with a per-round × per-model test-case heatmap, drill-down into the exact cases each model failed (intent / expected / actual / reason), and a written difficulty and performance-gap analysis. The same pages are under docs/ and render locally with any static server (python3 -m http.server from docs/).

The original paper results (MT@4 / SR / Comp, legacy runner) are in legacy/.

Known issues & responsible disclosure

Verifier readable during the agent phase (Harbor shared-step leak)

While auditing per-model, per-round trajectories from our first evaluation (the v1 leaderboard, June 13–16), we found that on some tasks the agent could read the verifier's grading script (/tests/test.sh) and the previous step's verifier output (/logs/verifier/reward.txt, test-stdout.txt) from inside its own step.

Root cause (framework, not the tasks). This is a property of Harbor's default shared multi-step verifier mode: the verifier runs in the agent's container, and /tests and /logs/verifier are cleared only right before each verifier run, never before the next step's agent phase. So from step 2 onward, the previous step's cumulative grading script and reward persist and are readable by the agent. The leak reproduces on upstream Harbor and is not specific to EvoCode-Bench. We reported it upstream as #1960 / #1961.

Remediation. We patched our evaluation harness (same fix as PR #1961: clear /tests and /logs/verifier at the start of every agent phase) and re-ran the entire benchmark on the patched harness. The current Results are from these clean runs. The numbers and trajectories published before 2026-06-20 are superseded — see CHANGELOG.md.
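
The idea behind the fix can be sketched as below. This is not the code from PR #1961 or from our harness, only an illustration of "empty both directories before the agent starts". The root parameter and all function names are assumptions made so the sketch can run against a temporary directory; inside a task container the root would be /.

```python
import shutil
import tempfile
from pathlib import Path


def clear_directory(path: Path) -> None:
    """Delete everything inside `path` but keep the directory itself."""
    if not path.is_dir():
        return
    for child in path.iterdir():
        if child.is_dir() and not child.is_symlink():
            shutil.rmtree(child)
        else:
            child.unlink()


def sanitize_before_agent_phase(root: Path) -> None:
    """Empty <root>/tests and <root>/logs/verifier before an agent phase."""
    for relative in ("tests", "logs/verifier"):
        clear_directory(root / relative)


def make_leaky_root(root: Path) -> None:
    """Recreate the leaked state: a prior step's grader and reward left on disk."""
    (root / "tests").mkdir(parents=True)
    (root / "tests" / "test.sh").write_text("#!/bin/sh\n")
    (root / "logs" / "verifier").mkdir(parents=True)
    (root / "logs" / "verifier" / "reward.txt").write_text("1\n")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        make_leaky_root(root)
        sanitize_before_agent_phase(root)
        print(list((root / "tests").iterdir()))  # []
```

Keeping the directories and deleting only their contents means the verifier can still write to the same paths after the agent phase ends.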

What we observed in v1 (now withdrawn). Across the 26 tasks, agents read or ran the leaked grader (or read the prior reward) in at least one round on 12 tasks / 22 (task, model) pairs / 47 round-level occurrences. Because the leaked file is the previous step's grader, accesses only land from round 2 on. The behavior was uneven across models — heavily concentrated in a few:

| Task | Model | Rounds | Behavior | Access |
|---|---|---|---|---|
| d5_w1 | DeepSeek-V4-Flash | R7 | read grader, read reward | succeeded |
| d12_w1 | DeepSeek-V4-Pro | R2 | read grader | succeeded |
| d9_w11 | DeepSeek-V4-Pro | R2,R3,R5,R6 | read+ran grader | succeeded |
| d11_w9 | Kimi-K2.6 | R4,R5,R6 | read+ran grader | succeeded |
| d12_w1 | Kimi-K2.6 | R3 | read grader, read reward | succeeded |
| d1_w9 | Kimi-K2.6 | R7 | read grader | succeeded |
| d5_w1 | Kimi-K2.6 | R2,R4,R13 | read+ran grader, read reward | succeeded |
| d9_w11 | Kimi-K2.6 | R2,R7 | read grader, read reward | succeeded |
| d1_w9 | MiniMax-M2.7 | R2 | read grader | unclear |
| d10_w12 | MiniMax-M3 | R5,R6,R7 | read+ran grader | succeeded |
| d10_w4 | MiniMax-M3 | R6 | read+ran grader | succeeded |
| d10_w5 | MiniMax-M3 | R4 | read+ran grader, read reward | succeeded |
| d10_w9 | MiniMax-M3 | R5,R6,R7 | read+ran grader | succeeded |
| d11_w2 | MiniMax-M3 | R5 | read+ran grader | succeeded |
| d12_w1 | MiniMax-M3 | R7 | read+ran grader | succeeded |
| d1_w5 | MiniMax-M3 | R2,R7,R8 | read+ran grader, read reward | succeeded |
| d5_w1 | MiniMax-M3 | R7,R9,R10,R11,R12 | read+ran grader, read reward | succeeded |
| d8_w5 | MiniMax-M3 | R4,R5,R6,R10 | read+ran grader | succeeded |
| d9_w11 | MiniMax-M3 | R3,R4,R6,R7,R15 | read+ran grader, read reward | succeeded |
| d12_w1 | Opus-4.8 | R7 | read+ran grader, read reward | succeeded |
| d5_w1 | Opus-4.8 | R9 | read+ran grader, read reward | succeeded |
| d12_w1 | Qwen3.7-Max | R7 | read+ran grader | unclear |

"Access = succeeded" means the agent's terminal actually received grader/reward content. In almost all cases reading the grader did not help (the models still failed the round); we withdrew the affected v1 results regardless. If you evaluate with stock upstream Harbor in shared mode, apply the same sanitization or use separate-verifier mode.

Other corrections in the 2026-06-20 re-release

The re-run also fixed one contaminated task (d12_w1) and 11 task/test defects. Full details, dates, and the old-vs-new framing are in CHANGELOG.md.

Relation to Terminal-X

EvoCode-Bench is the iteration component of Terminal-X, alongside DeepTerminalBench for single-shot depth and RoadmapBench for version upgrades. Terminal-X contains the combined benchmark suite and cross-dataset blog; this repository focuses on the EvoCode-Bench task format, evaluation protocol, and official-Harbor runner.

Citation

@misc{shen2026evocodebench,
  title = {EvoCode-Bench: Evaluating Coding Agents in Multi-Turn Iterative Interactions},
  author = {Haiyang Shen and Xuanzhong Chen and Wendong Xu and Yun Ma and Liang Chen and Kuan Li},
  year = {2026},
  eprint = {2605.24110},
  archivePrefix = {arXiv},
  primaryClass = {cs.SE},
  url = {https://arxiv.org/abs/2605.24110}
}

License

Code in this repository is released under the MIT License. Dataset terms follow the dataset release metadata.
