2026-06-20 — Results re-release (v2). The leaderboard, per-task results, and
trajectories published before this date are superseded. We re-ran the entire
benchmark on a patched harness after fixing an evaluation-integrity leak in Harbor's
shared multi-step verifier mode (agents could read the grader during their own turn;
reported upstream as #1960 /
#1961), one contaminated task,
and 11 task/test defects. See
Known issues & responsible disclosure
and CHANGELOG.md. Treat any EvoCode-Bench number dated before 2026-06-20
as outdated.
June 2026. EvoCode-Bench runs on the Harbor official multi-step task format. Each task is a sequence of [[steps]] run in one persistent container, with a per-step verifier after each step and trial-level reward aggregation.
EvoCode-Bench tests whether coding agents can keep a project working as user requests change. It contains 26 stateful coding tasks and 227 evaluated rounds (Harbor steps). Each task keeps the same workspace and agent session for 5-15 rounds, while cumulative executable tests check new requirements and still-active prior requirements.
The original paper evaluation used a different runner (harbor_multiturn) and the MT@4 / SR / Comp metrics. That framework, the legacy task layout, and the paper results are kept under legacy/ for reproducibility; they are not needed for normal use.
Most coding benchmarks evaluate one specification followed by one final assessment. EvoCode-Bench instead evaluates an interactive coding session. Later rounds inherit earlier implementation decisions, dependencies, file layouts, API choices, and test behavior. Each round (Harbor step) is scored by a cumulative verifier, and the trial reward is the mean of the per-step rewards.
The benchmark is organized along two axes from the paper:
| Engineering activity | Explorative | Contractual | Document-driven | Total |
|---|---|---|---|---|
| Construction | 9 / 80 | 3 / 37 | 1 / 7 | 13 / 124 |
| Spec Evolution | 1 / 8 | 1 / 7 | 1 / 7 | 3 / 22 |
| Review | 3 / 21 | 1 / 7 | 1 / 9 | 5 / 37 |
| Migration | 3 / 29 | 1 / 7 | 1 / 8 | 5 / 44 |
| Total | 16 / 138 | 6 / 58 | 4 / 31 | 26 / 227 |
Each cell reports tasks / rounds. A round maps one-to-one to a Harbor step.
EvoCode-Bench tasks use the Harbor official multi-step layout — one sub-directory per step under steps/, executed in the order declared by the [[steps]] array in task.toml:
task/
├── task.toml                  # metadata + [[steps]] list + reward strategy
├── environment/
│   └── Dockerfile             # single container shared across all steps
└── steps/
    ├── round-1/
    │   ├── instruction.md     # this round's user request (WHAT, not HOW)
    │   ├── solution/solve.sh  # reference delta for this round
    │   └── tests/test.sh      # cumulative tests through this round
    ├── round-2/
    │   ├── instruction.md
    │   ├── solution/solve.sh
    │   └── tests/test.sh
    └── round-N/ ...
task.toml follows the official schema (schema_version = "1.2"):
schema_version = "1.2"
multi_step_reward_strategy = "mean" # trial reward = mean of per-step rewards
[metadata]
name = "service-mesh-health-router"
difficulty = "hard"
category = "systems-networking"
[metadata.requirement_chain]
num_steps = 8
[[metadata.requirement_chain.steps]]
step = "round-1"
change_types = ["extension"]
# ... one entry per step (extension / correction / conflict)
[agent]
timeout_sec = 1800.0 # global default; override per step via [steps.agent]
[verifier]
timeout_sec = 1800.0 # global default; override per step via [steps.verifier]
[environment]
build_timeout_sec = 600.0
cpus = 1
memory_mb = 4096
storage_mb = 10240
[[steps]]
name = "round-1" # matches steps/round-1/
[[steps]]
name = "round-2"
# ... one [[steps]] entry per step, in execution order
The task format is built around three constraints:
- Persistent workspace: the same Docker container carries files, dependencies, and generated artifacts across steps.
- Continuous agent session: the agent receives a sequence of user requests rather than independent prompts.
- Cumulative tests: round i verifies every still-active requirement from rounds 1..i, so regressions are caught immediately. Each step's tests/test.sh writes a binary reward to /logs/verifier/reward.txt.
EvoCode-Bench's standard multi-step evaluation runs on upstream Harbor — the same framework used by Terminal-Bench 2.0 — using its native multi-step support. No fork is required to run a full task (all steps).
uv tool install harbor # or: pip install harbor
harbor run --help
Upstream Harbor's official multi-step runner provides:
- native [[steps]] sequencing in the order declared in task.toml;
- a single persistent Docker workspace shared across all steps;
- a continuous agent session across steps;
- a per-step verifier run against the cumulative test suite after each step;
- trial-level reward aggregation via multi_step_reward_strategy (mean for EvoCode-Bench).
Need single-round fast-forward (SR), or want to reproduce the paper? See legacy/, which covers our Harbor fork and the original harbor_multiturn framework. Not required for normal use.
- Python 3.11+ (the evaluation/*.py helpers use the stdlib tomllib).
- Docker running, or a remote Daytona target.
- A model endpoint for your agent.
Install the Harbor CLI:
# uv runs the Harbor CLI. See https://docs.astral.sh/uv/getting-started/installation/
curl -LsSf https://astral.sh/uv/install.sh | sh
uv tool install harbor # or: pip install harbor
The upstream package (pip install harbor) is enough to run full tasks (all steps).
Download the released EvoCode-Bench task directories from Hugging Face and place them under data/EvoCodeBench. If you already have the Terminal-X repository, the tasks are also available under Terminal-X/data/EvoCodeBench/.
For the claude-code agent:
export AGENT_TYPE="claude-code"
export AGENT_MODEL="claude-opus-4-7"
export ANTHROPIC_BASE_URL="https://api.your-provider.com"
export ANTHROPIC_AUTH_TOKEN="sk-..."
For the terminus-2 agent (OpenAI-compatible):
export AGENT_TYPE="terminus-2"
export AGENT_MODEL="openai/gpt-5.5"
export OPENAI_API_KEY="sk-..."
export OPENAI_API_BASE="https://api.your-provider.com/v1"
Validate the downloaded tasks:
python evaluation/validate_dataset.py data/EvoCodeBench
The released benchmark should report 26 tasks and 227 steps.
# Agent (pass@1 by default; set AGENT_ATTEMPTS for pass@k)
AGENT_TYPE=claude-code AGENT_MODEL=claude-opus-4-7 \
./evaluation/run_single.sh data/EvoCodeBench/theme_d1_w1_code_build_greenfield_implementation agent
# Oracle verification (reference solutions; should score 1.0 on every step)
./evaluation/run_single.sh data/EvoCodeBench/theme_d1_w1_code_build_greenfield_implementation oracle
# No-op baseline (empty submission; should score 0)
./evaluation/run_single.sh data/EvoCodeBench/theme_d1_w1_code_build_greenfield_implementation nop
To run every task:
AGENT_TYPE=claude-code AGENT_MODEL=claude-opus-4-7 \
./evaluation/run_all.sh data/EvoCodeBench agent
Each task writes Harbor outputs under:
data/EvoCodeBench/<task>/harbor_jobs/<model>/
Each step is scored with a binary reward — 1 if all of that step's key requirements pass, 0 otherwise — written by the verifier to /logs/verifier/reward.txt. Harbor aggregates a trial's per-step rewards into a trial-level reward via multi_step_reward_strategy = "mean".
The score is the mean per-step reward:
- per-task score = (passed steps) / (total steps) for the trial;
- dataset score = mean of per-task scores across the 26 tasks (reported ×100 in Results).
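The two definitions above can be sketched in a few lines. The function names and the example rewards are made up for illustration; this is not the logic of evaluation/compute_metrics.py, only the arithmetic as described.

```python
from statistics import mean


def task_score(step_rewards):
    """Per-task score: passed steps / total steps (each reward is 0 or 1)."""
    if not step_rewards:
        raise ValueError("a task needs at least one step")
    return sum(step_rewards) / len(step_rewards)


def dataset_score(rewards_by_task):
    """Dataset score: unweighted mean of the per-task scores."""
    return mean(task_score(r) for r in rewards_by_task.values())


# Hypothetical rewards; steps the chain never reached are entered as 0.
example = {
    "task-a": [1, 0],        # 1 of 2 steps passed -> 0.5
    "task-b": [1, 1, 1, 0],  # 3 of 4 steps passed -> 0.75
}
print(dataset_score(example))  # 0.625
```

The dataset score is a mean of means, so every task carries equal weight regardless of chain length. Pooling all steps in the example would give 4/6 instead of 0.625.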
A complementary case score uses the same shape at finer granularity. Each step's verifier
also reports per-test-case results (CASE_SUMMARY total_cases=… success_count=…). Define:
- per-step case ratio = success_count / total_cases for the step (a step whose code fails to build, or that the chain never reached, has ratio 0);
- per-task case score = mean of the per-step case ratios over the task's steps;
- dataset case score = mean of per-task case scores across the 26 tasks ×100.
The dataset score is all-or-nothing per step, so it rewards finishing a step exactly; the case score credits partial progress (e.g. passing 44 of 45 cases). A large gap between the two for a model means it gets most of the work right but rarely lands a whole step.
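A minimal sketch of the case score follows. It assumes the CASE_SUMMARY line carries integer total_cases and success_count fields separated by whitespace, which is a guess based on the field names above; the real verifier output may differ. All function names are hypothetical.

```python
import re
from statistics import mean

# Assumed line shape; the actual verifier output may carry extra fields.
CASE_SUMMARY = re.compile(r"CASE_SUMMARY\s+total_cases=(\d+)\s+success_count=(\d+)")


def step_case_ratio(verifier_stdout):
    """success_count / total_cases for one step, or 0.0 if no summary exists.

    A missing summary (None or no match) stands in for a step that failed
    to build or that the chain never reached.
    """
    if verifier_stdout is None:
        return 0.0
    match = CASE_SUMMARY.search(verifier_stdout)
    if match is None:
        return 0.0
    total, passed = int(match.group(1)), int(match.group(2))
    return passed / total if total else 0.0


def task_case_score(step_outputs):
    """Mean of the per-step case ratios over all of a task's steps."""
    return mean(step_case_ratio(out) for out in step_outputs)


def dataset_case_score(outputs_by_task):
    """Mean of per-task case scores across tasks, scaled by 100."""
    return 100 * mean(task_case_score(o) for o in outputs_by_task.values())
```

Under this sketch, a step that passes 44 of 45 cases contributes a ratio of 44/45 to the case score while still earning a binary reward of 0.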
python evaluation/compute_metrics.py \
--tasks-dir data/EvoCodeBench \
--results-dir data/EvoCodeBench \
--model claude-opus-4-7 # score one agent; add --json for machine-readable output
--model selects the harbor_jobs/<model>/ results to score (the oracle and nop baselines are excluded by default).
The paper's MT@4 / SR / Comp metrics and single-round fast-forward evaluation live in
legacy/.
Evaluated on the current dataset release with the Harbor official multi-step runner:
full 5–15 round chains, one attempt per task (no best-of-k). The score is the
dataset score defined in Metrics — the mean over 26 tasks of each
task's passed_rounds / total_rounds. oracle scores 1.0 and nop scores 0 on every
task. On the hardest, longest tasks some agents exhaust the 30-minute-per-round time
budget and the chain aborts before later rounds; those rounds count as 0 (see
CHANGELOG.md).
Numbers below are the 2026-06-20 re-release. The previous (June 13–16) leaderboard is superseded — see Known issues & responsible disclosure. The old values are shown in parentheses where they moved.
| Agent | Reasoning | Dataset score | Case score | Perfect tasks |
|---|---|---|---|---|
| Claude-Opus-4.8 | effort xhigh | 59.1 (42.5) | 96.6 (89.9) | 9/26 |
| GPT-5.5 | effort high | 29.5 (23.5) | 81.8 (77.2) | 0/26 |
| MiniMax-M3 | thinking adaptive | 23.4 (15.2) | 61.5 (69.2) | 2/26 |
| GLM-5.2 | thinking on¹ | 16.2 | 47.3 | 1/26 |
| DeepSeek-V4-Pro | effort high | 14.1 (10.8) | 61.7 (58.3) | 1/26 |
| Kimi-K2.6 | thinking on¹ | 13.2 (23.1) | 65.7 (75.2) | 0/26 |
| DeepSeek-V4-Flash | effort high | 12.2 (4.6) | 58.9 (52.5) | 0/26 |
| Qwen3.7-Max | thinking on¹ | 11.9 (7.6) | 67.4 (64.7) | 0/26 |
| Qwen3.6-Plus | thinking on¹ | 9.7 (10.1) | 67.7 (64.4) | 0/26 |
| Kimi-K2.7-Code | thinking on¹ | 7.8 | 45.4 | 0/26 |
| GLM-5.1 | thinking on¹ | 5.9 (6.3) | 52.5 (48.4) | 0/26 |
| MiniMax-M2.7 | reasoning split | 5.1 (0.8) | 44.9 (42.6) | 0/26 |
GLM-5.2 and Kimi-K2.7-Code are new in this release (no prior value).
Dataset score is the mean per-task score ×100, where a task's score is passed_rounds / total_rounds
and a round is "passed" only if it earns the binary reward 1 (every test case of that round passes).
Case score is the finer-grained companion. For each task, take each round's
passed_test_cases / total_test_cases, average over the task's rounds (a round whose code fails to build,
or that the chain never reached, counts as 0), then average over the 26 tasks ×100. It credits the partial
progress the all-or-nothing round reward hides — e.g. GPT-5.5 scores 29.5 on rounds but passes 81.8% of
test cases, because it often misses a round by just one or two cases. Both scores rank Opus-4.8 first, but
the case score spreads the field more smoothly.
Reasoning is the thinking configuration used for each model: models with an effort knob ran at the
listed level (Opus at its highest, xhigh; the rest at high); ¹ models without an
effort knob ran with their native thinking simply enabled (Qwen enable_thinking,
GLM/Kimi thinking.type=enabled), and MiniMax M3/M2.7 used their adaptive / split
reasoning modes. All agents used the terminus-2 scaffold.
Per-task / per-round / per-test-case detail: evaluation/sweeps/sweep_2026-06_single_shot.csv
and the interactive results site.
Explore the results interactively → unipat-ai.github.io/EvoCodeBench —
one page per task with a per-round × per-model test-case heatmap, drill-down into the exact
cases each model failed (intent / expected / actual / reason), and a written difficulty and
performance-gap analysis. The same pages are under docs/ and render locally with any
static server (python3 -m http.server from docs/).
The original paper results (MT@4 / SR / Comp, legacy runner) are in
legacy/.
While auditing per-model, per-round trajectories from our first evaluation (the
v1 leaderboard, June 13–16), we found that on some tasks the agent could read the
verifier's grading script (/tests/test.sh) and the previous step's verifier output
(/logs/verifier/reward.txt, test-stdout.txt) from inside its own step.
Root cause (framework, not the tasks). This is a property of Harbor's default
shared multi-step verifier mode: the verifier runs in the agent's container, and
/tests + /logs/verifier are cleared only right before each verifier — never before
the next step's agent phase. So from step 2 onward, the previous step's cumulative
grading script and reward persist and are readable to the agent. It reproduces on
upstream Harbor and is not specific to EvoCode-Bench. We reported it upstream:
- Issue: harbor-framework/harbor#1960
- Fix PR: harbor-framework/harbor#1961
Remediation. We patched our evaluation harness (same fix as PR #1961: clear
/tests and /logs/verifier at the start of every agent phase) and re-ran the entire
benchmark on the patched harness. The current Results are from these clean
runs. The numbers and trajectories published before 2026-06-20 are superseded — see
CHANGELOG.md.
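The idea behind the remediation can be sketched as follows: before each agent phase, empty the directories that hold the previous step's grader and verifier output. This is an illustration only, not the code from the upstream PR or from our harness. The function name is hypothetical, and the container root is passed in as a plain path so the sketch can run against any directory.

```python
import shutil
from pathlib import Path

# Directories named in the root-cause description, relative to the
# container root supplied by the caller.
LEAKY_DIRS = ("tests", "logs/verifier")


def sanitize_before_agent_phase(container_root):
    """Empty the grader and verifier-output directories before an agent turn.

    Hypothetical helper: removes each directory tree if present, then
    recreates it empty. Everything else under the root is left untouched.
    """
    root = Path(container_root)
    for rel in LEAKY_DIRS:
        target = root / rel
        if target.exists():
            shutil.rmtree(target)
        target.mkdir(parents=True, exist_ok=True)
```

Only the two verifier-owned directories are cleared, so the agent's workspace files persist across steps as the task format requires.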
What we observed in v1 (now withdrawn). Across the 26 tasks, agents read or ran the leaked grader (or read the prior reward) in at least one round on 12 tasks / 22 (task, model) pairs / 47 round-level occurrences. Because the leaked file is the previous step's grader, accesses only land from round 2 on. The behavior was uneven across models — heavily concentrated in a few:
| Task | Model | Rounds | Behavior | Access |
|---|---|---|---|---|
| d5_w1 | DeepSeek-V4-Flash | R7 | read grader, read reward | succeeded |
| d12_w1 | DeepSeek-V4-Pro | R2 | read grader | succeeded |
| d9_w11 | DeepSeek-V4-Pro | R2,R3,R5,R6 | read+ran grader | succeeded |
| d11_w9 | Kimi-K2.6 | R4,R5,R6 | read+ran grader | succeeded |
| d12_w1 | Kimi-K2.6 | R3 | read grader, read reward | succeeded |
| d1_w9 | Kimi-K2.6 | R7 | read grader | succeeded |
| d5_w1 | Kimi-K2.6 | R2,R4,R13 | read+ran grader, read reward | succeeded |
| d9_w11 | Kimi-K2.6 | R2,R7 | read grader, read reward | succeeded |
| d1_w9 | MiniMax-M2.7 | R2 | read grader | unclear |
| d10_w12 | MiniMax-M3 | R5,R6,R7 | read+ran grader | succeeded |
| d10_w4 | MiniMax-M3 | R6 | read+ran grader | succeeded |
| d10_w5 | MiniMax-M3 | R4 | read+ran grader, read reward | succeeded |
| d10_w9 | MiniMax-M3 | R5,R6,R7 | read+ran grader | succeeded |
| d11_w2 | MiniMax-M3 | R5 | read+ran grader | succeeded |
| d12_w1 | MiniMax-M3 | R7 | read+ran grader | succeeded |
| d1_w5 | MiniMax-M3 | R2,R7,R8 | read+ran grader, read reward | succeeded |
| d5_w1 | MiniMax-M3 | R7,R9,R10,R11,R12 | read+ran grader, read reward | succeeded |
| d8_w5 | MiniMax-M3 | R4,R5,R6,R10 | read+ran grader | succeeded |
| d9_w11 | MiniMax-M3 | R3,R4,R6,R7,R15 | read+ran grader, read reward | succeeded |
| d12_w1 | Opus-4.8 | R7 | read+ran grader, read reward | succeeded |
| d5_w1 | Opus-4.8 | R9 | read+ran grader, read reward | succeeded |
| d12_w1 | Qwen3.7-Max | R7 | read+ran grader | unclear |
"Access = succeeded" means the agent's terminal
actually received grader/reward content. In almost all cases reading the grader did not
help (the models still failed the round); we withdrew the affected v1 results regardless.
If you evaluate with stock upstream Harbor in shared mode, apply the same sanitization or
use separate-verifier mode.
The re-run also fixed one contaminated task (d12_w1) and 11 task/test defects. Full
details, dates, and the old-vs-new framing are in CHANGELOG.md.
EvoCode-Bench is the iteration component of Terminal-X, alongside DeepTerminalBench for single-shot depth and RoadmapBench for version upgrades. Terminal-X contains the combined benchmark suite and cross-dataset blog; this repository focuses on the EvoCode-Bench task format, evaluation protocol, and official-Harbor runner.
@misc{shen2026evocodebench,
title = {EvoCode-Bench: Evaluating Coding Agents in Multi-Turn Iterative Interactions},
author = {Haiyang Shen and Xuanzhong Chen and Wendong Xu and Yun Ma and Liang Chen and Kuan Li},
year = {2026},
eprint = {2605.24110},
archivePrefix = {arXiv},
primaryClass = {cs.SE},
url = {https://arxiv.org/abs/2605.24110}
}
Code in this repository is released under the MIT License. Dataset terms follow the dataset release metadata.