EvoCode-Bench: Evaluating Coding Agents in Multi-Turn Iterative Interactions

News

2026-06-20 — Results re-release (v2). The leaderboard, per-task results, and trajectories published before this date are superseded. We re-ran the entire benchmark on a patched harness after fixing an evaluation-integrity leak in Harbor's shared multi-step verifier mode (agents could read the grader during their own turn; reported upstream as #1960 / #1961), one contaminated task, and 11 task/test defects. See Known issues & responsible disclosure and CHANGELOG.md. Treat any EvoCode-Bench number dated before 2026-06-20 as outdated.

June 2026. EvoCode-Bench runs on the Harbor official multi-step task format. Each task is a sequence of [[steps]] run in one persistent container, with a per-step verifier after each step and trial-level reward aggregation.

EvoCode-Bench tests whether coding agents can keep a project working as user requests change. It contains 26 stateful coding tasks and 227 evaluated rounds (Harbor steps). Each task keeps the same workspace and agent session for 5-15 rounds, while cumulative executable tests check new requirements and still-active prior requirements.

The original paper evaluation used a different runner (harbor_multiturn) and the MT@4 / SR / Comp metrics. That framework, the legacy task layout, and the paper results are kept under legacy/ for reproducibility — not needed for normal use.

Overview

Most coding benchmarks evaluate one specification followed by one final assessment. EvoCode-Bench instead evaluates an interactive coding session. Later rounds inherit earlier implementation decisions, dependencies, file layouts, API choices, and test behavior. Each round (Harbor step) is scored by a cumulative verifier, and the trial reward is the mean of the per-step rewards.

The benchmark is organized along two axes from the paper:

Engineering activity	Explorative	Contractual	Document-driven	Total
Construction	9 / 80	3 / 37	1 / 7	13 / 124
Spec Evolution	1 / 8	1 / 7	1 / 7	3 / 22
Review	3 / 21	1 / 7	1 / 9	5 / 37
Migration	3 / 29	1 / 7	1 / 8	5 / 44
Total	16 / 138	6 / 58	4 / 31	26 / 227

Each cell reports tasks / rounds. A round maps one-to-one to a Harbor step.

Task Format

EvoCode-Bench tasks use the Harbor official multi-step layout — one sub-directory per step under steps/, executed in the order declared by the [[steps]] array in task.toml:

task/
├── task.toml                       # metadata + [[steps]] list + reward strategy
├── environment/
│   └── Dockerfile                  # single container shared across all steps
└── steps/
    ├── round-1/
    │   ├── instruction.md          # this round's user request (WHAT, not HOW)
    │   ├── solution/solve.sh        # reference delta for this round
    │   └── tests/test.sh           # cumulative tests through this round
    ├── round-2/
    │   ├── instruction.md
    │   ├── solution/solve.sh
    │   └── tests/test.sh
    └── round-N/ ...

task.toml follows the official schema (schema_version = "1.2"):

schema_version = "1.2"
multi_step_reward_strategy = "mean"      # trial reward = mean of per-step rewards

[metadata]
name = "service-mesh-health-router"
difficulty = "hard"
category = "systems-networking"

[metadata.requirement_chain]
num_steps = 8

[[metadata.requirement_chain.steps]]
step = "round-1"
change_types = ["extension"]
# ... one entry per step (extension / correction / conflict)

[agent]
timeout_sec = 1800.0                      # global default; override per step via [steps.agent]

[verifier]
timeout_sec = 1800.0                      # global default; override per step via [steps.verifier]

[environment]
build_timeout_sec = 600.0
cpus = 1
memory_mb = 4096
storage_mb = 10240

[[steps]]
name = "round-1"                          # matches steps/round-1/

[[steps]]
name = "round-2"
# ... one [[steps]] entry per step, in execution order

The task format is built around three constraints:

Persistent workspace: the same Docker container carries files, dependencies, and generated artifacts across steps.
Continuous agent session: the agent receives a sequence of user requests rather than independent prompts.
Cumulative tests: round i verifies every still-active requirement from rounds 1..i, so regressions are caught immediately. Each step's tests/test.sh writes a binary reward to /logs/verifier/reward.txt.

Framework

EvoCode-Bench's standard multi-step evaluation runs on upstream Harbor — the same framework used by Terminal-Bench 2.0 — using its native multi-step support. No fork is required to run a full task (all steps).

uv tool install harbor      # or: pip install harbor
harbor run --help

Upstream Harbor's official multi-step runner provides:

native [[steps]] sequencing in the order declared in task.toml;
a single persistent Docker workspace shared across all steps;
a continuous agent session across steps;
a per-step verifier run against the cumulative test suite after each step;
trial-level reward aggregation via multi_step_reward_strategy (mean for EvoCode-Bench).

Need single-round fast-forward (SR), or want to reproduce the paper? See legacy/ — it covers our Harbor fork and the original harbor_multiturn framework. Not required for normal use.

Quick Start

1. Prerequisites

Python 3.11+ (the evaluation/*.py helpers use the stdlib tomllib).
Docker running, or a remote Daytona target.
A model endpoint for your agent.

Install the Harbor CLI:

# uv runs the Harbor CLI. See https://docs.astral.sh/uv/getting-started/installation/
curl -LsSf https://astral.sh/uv/install.sh | sh
uv tool install harbor      # or: pip install harbor

The upstream harbor package, installed with either command, is sufficient to run full tasks (all steps).

2. Prepare Tasks

Download the released EvoCode-Bench task directories from Hugging Face and place them under data/EvoCodeBench. If you already have the Terminal-X repository, the tasks are also available under Terminal-X/data/EvoCodeBench/.

3. Configure Model Endpoint

For the claude-code agent:

export AGENT_TYPE="claude-code"
export AGENT_MODEL="claude-opus-4-7"
export ANTHROPIC_BASE_URL="https://api.your-provider.com"
export ANTHROPIC_AUTH_TOKEN="sk-..."

For the terminus-2 agent (OpenAI-compatible):

export AGENT_TYPE="terminus-2"
export AGENT_MODEL="openai/gpt-5.5"
export OPENAI_API_KEY="sk-..."
export OPENAI_API_BASE="https://api.your-provider.com/v1"

4. Validate the Dataset

python evaluation/validate_dataset.py data/EvoCodeBench

The released benchmark should report 26 tasks and 227 steps.

5. Run One Task

# Agent (pass@1 by default; set AGENT_ATTEMPTS for pass@k)
AGENT_TYPE=claude-code AGENT_MODEL=claude-opus-4-7 \
  ./evaluation/run_single.sh data/EvoCodeBench/theme_d1_w1_code_build_greenfield_implementation agent

# Oracle verification (reference solutions; should score 1.0 on every step)
./evaluation/run_single.sh data/EvoCodeBench/theme_d1_w1_code_build_greenfield_implementation oracle

# No-op baseline (empty submission; should score 0)
./evaluation/run_single.sh data/EvoCodeBench/theme_d1_w1_code_build_greenfield_implementation nop

6. Run All Tasks

AGENT_TYPE=claude-code AGENT_MODEL=claude-opus-4-7 \
  ./evaluation/run_all.sh data/EvoCodeBench agent

Each task writes Harbor outputs under:

data/EvoCodeBench/<task>/harbor_jobs/<model>/

Metrics

Each step is scored with a binary reward — 1 if all of that step's key requirements pass, 0 otherwise — written by the verifier to /logs/verifier/reward.txt. Harbor aggregates a trial's per-step rewards into a trial-level reward via multi_step_reward_strategy = "mean".

The score is the mean per-step reward:

per-task score = (passed steps) / (total steps) for the trial;
dataset score = mean of per-task scores across the 26 tasks (reported ×100 in Results).
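
As a concrete reading of these two definitions, the sketch below scores hypothetical per-step rewards. The function names and the sample rewards are illustrative and are not taken from evaluation/compute_metrics.py.

```python
from statistics import fmean


def per_task_score(step_rewards: list[int]) -> float:
    """Fraction of steps with binary reward 1 (the "mean" reward strategy)."""
    return sum(step_rewards) / len(step_rewards)


def dataset_score(rewards_by_task: dict[str, list[int]]) -> float:
    """Unweighted mean of per-task scores: a 5-step task counts as much as a 10-step one."""
    return fmean(per_task_score(r) for r in rewards_by_task.values())


# Hypothetical trial results: task name -> binary reward per step.
rewards = {
    "task-a": [1, 1, 0, 0, 0],                 # 2 of 5 steps passed
    "task-b": [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],  # 1 of 10 steps passed
}
print(per_task_score(rewards["task-a"]))
print(dataset_score(rewards))
```

Because the dataset score averages per-task scores, it generally differs from pooling all steps: here the two tasks score 0.4 and 0.1, while the pooled pass rate would be 3 of 15 steps.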

A complementary case score uses the same shape at finer granularity. Each step's verifier also reports per-test-case results (CASE_SUMMARY total_cases=… success_count=…). Define:

per-step case ratio = success_count / total_cases for the step (a step whose code fails to build, or that the chain never reached, has ratio 0);
per-task case score = mean of the per-step case ratios over the task's steps;
dataset case score = mean of per-task case scores across the 26 tasks ×100.

The dataset score is all-or-nothing per step, so it rewards finishing a step exactly; the case score credits partial progress (e.g. passing 44 of 45 cases). A large gap between the two for a model means it gets most of the work right but rarely lands a whole step.

python evaluation/compute_metrics.py \
  --tasks-dir data/EvoCodeBench \
  --results-dir data/EvoCodeBench \
  --model claude-opus-4-7          # score one agent; add --json for machine-readable output

--model selects the harbor_jobs/<model>/ results to score (the oracle and nop baselines are excluded by default).

The paper's MT@4 / SR / Comp metrics and single-round fast-forward evaluation live in legacy/.

Results

Evaluated on the current dataset release with the Harbor official multi-step runner: full 5–15 round chains, one attempt per task (no best-of-k). The score is the dataset score defined in Metrics — the mean over 26 tasks of each task's passed_rounds / total_rounds. oracle scores 1.0 and nop scores 0 on every task. On the hardest, longest tasks some agents exhaust the 30-minute-per-round time budget and the chain aborts before later rounds; those rounds count as 0 (see CHANGELOG.md).

Numbers below are the 2026-06-20 re-release. The previous (June 13–16) leaderboard is superseded; see Known issues & responsible disclosure. Where a value changed, the old value is shown in parentheses.

Agent	Reasoning	Dataset score	Case score	Perfect tasks
Claude-Opus-4.8	effort `xhigh`	59.1 (42.5)	96.6 (89.9)	9/26
GPT-5.5	effort `high`	29.5 (23.5)	81.8 (77.2)	0/26
MiniMax-M3	thinking `adaptive`	23.4 (15.2)	61.5 (69.2)	2/26
GLM-5.2	thinking on¹	16.2	47.3	1/26
DeepSeek-V4-Pro	effort `high`	14.1 (10.8)	61.7 (58.3)	1/26
Kimi-K2.6	thinking on¹	13.2 (23.1)	65.7 (75.2)	0/26
DeepSeek-V4-Flash	effort `high`	12.2 (4.6)	58.9 (52.5)	0/26
Qwen3.7-Max	thinking on¹	11.9 (7.6)	67.4 (64.7)	0/26
Qwen3.6-Plus	thinking on¹	9.7 (10.1)	67.7 (64.4)	0/26
Kimi-K2.7-Code	thinking on¹	7.8	45.4	0/26
GLM-5.1	thinking on¹	5.9 (6.3)	52.5 (48.4)	0/26
MiniMax-M2.7	reasoning split	5.1 (0.8)	44.9 (42.6)	0/26

GLM-5.2 and Kimi-K2.7-Code are new in this release (no prior value).

Dataset score is the mean per-task score ×100, where a task's score is passed_rounds / total_rounds and a round is "passed" only if it earns the binary reward 1 (every test case of that round passes).

Case score is the finer-grained companion. For each task, take each round's passed_test_cases / total_test_cases, average over the task's rounds (a round whose code fails to build, or that the chain never reached, counts as 0), then average over the 26 tasks ×100. It credits the partial progress the all-or-nothing round reward hides — e.g. GPT-5.5 scores 29.5 on rounds but passes 81.8% of test cases, because it often misses a round by just one or two cases. Both scores rank Opus-4.8 first, but the case score spreads the field more smoothly.

Reasoning is the thinking configuration used for each model. Models with an effort knob ran at the listed level (Opus at its highest, xhigh; the rest at high). ¹ Models without an effort knob ran with their native thinking enabled (Qwen enable_thinking, GLM/Kimi thinking.type=enabled). MiniMax-M3 and MiniMax-M2.7 used their adaptive and split reasoning modes, respectively. All agents used the terminus-2 scaffold. Per-task, per-round, and per-test-case detail is in evaluation/sweeps/sweep_2026-06_single_shot.csv and on the interactive results site.

Explore the results interactively → unipat-ai.github.io/EvoCodeBench — one page per task with a per-round × per-model test-case heatmap, drill-down into the exact cases each model failed (intent / expected / actual / reason), and a written difficulty and performance-gap analysis. The same pages are under docs/ and render locally with any static server (python3 -m http.server from docs/).

The original paper results (MT@4 / SR / Comp, legacy runner) are in legacy/.

Known issues & responsible disclosure

Verifier readable during the agent phase (Harbor shared-step leak)

While auditing per-model, per-round trajectories from our first evaluation (the v1 leaderboard, June 13–16), we found that on some tasks the agent could read the verifier's grading script (/tests/test.sh) and the previous step's verifier output (/logs/verifier/reward.txt, test-stdout.txt) from inside its own step.

Root cause (framework, not the tasks). This is a property of Harbor's default shared multi-step verifier mode: the verifier runs in the agent's container, and /tests and /logs/verifier are cleared only right before each verifier run, never before the next step's agent phase. So from step 2 onward, the previous step's cumulative grading script and reward persist and are readable by the agent. The leak reproduces on upstream Harbor and is not specific to EvoCode-Bench. We reported it upstream:

Issue: harbor-framework/harbor#1960
Fix PR: harbor-framework/harbor#1961

Remediation. We patched our evaluation harness (same fix as PR #1961: clear /tests and /logs/verifier at the start of every agent phase) and re-ran the entire benchmark on the patched harness. The current Results are from these clean runs. The numbers and trajectories published before 2026-06-20 are superseded — see CHANGELOG.md.
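
The remediation amounts to emptying two directories before the agent starts each step. The sketch below illustrates that idea on a throwaway directory. It is not the code from PR #1961 or from our harness, and the names clear_directories and demo are hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


def clear_directories(paths: list[Path]) -> None:
    """Delete everything inside each directory while keeping the directory itself."""
    for root in paths:
        if not root.is_dir():
            continue  # nothing to sanitize yet, e.g. before the first verifier run
        for entry in root.iterdir():
            if entry.is_dir() and not entry.is_symlink():
                shutil.rmtree(entry)
            else:
                entry.unlink()


# In the leak described above, the two directories are /tests and /logs/verifier
# inside the task container; a harness would clear them before each agent phase.
LEAKED_DIRS = [Path("/tests"), Path("/logs/verifier")]


def demo() -> list[str]:
    """Populate a temporary directory, sanitize it, and return what is left inside."""
    with tempfile.TemporaryDirectory() as tmp:
        tests = Path(tmp) / "tests"
        (tests / "nested").mkdir(parents=True)
        (tests / "test.sh").write_text("exit 0\n")
        (tests / "nested" / "reward.txt").write_text("1\n")
        clear_directories([tests, Path(tmp) / "not-created"])
        return sorted(p.name for p in tests.iterdir())


print(demo())  # the directory survives, its contents do not
```

Keeping the directories themselves in place is a design choice of this sketch, so that a later verifier run can still write into them.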

What we observed in v1 (now withdrawn). Across the 26 tasks, agents read or ran the leaked grader (or read the prior reward) in at least one round on 12 tasks / 22 (task, model) pairs / 47 round-level occurrences. Because the leaked file is the previous step's grader, accesses only land from round 2 on. The behavior was uneven across models — heavily concentrated in a few:

Task	Model	Rounds	Behavior	Access
d5_w1	DeepSeek-V4-Flash	R7	read grader, read reward	succeeded
d12_w1	DeepSeek-V4-Pro	R2	read grader	succeeded
d9_w11	DeepSeek-V4-Pro	R2,R3,R5,R6	read+ran grader	succeeded
d11_w9	Kimi-K2.6	R4,R5,R6	read+ran grader	succeeded
d12_w1	Kimi-K2.6	R3	read grader, read reward	succeeded
d1_w9	Kimi-K2.6	R7	read grader	succeeded
d5_w1	Kimi-K2.6	R2,R4,R13	read+ran grader, read reward	succeeded
d9_w11	Kimi-K2.6	R2,R7	read grader, read reward	succeeded
d1_w9	MiniMax-M2.7	R2	read grader	unclear
d10_w12	MiniMax-M3	R5,R6,R7	read+ran grader	succeeded
d10_w4	MiniMax-M3	R6	read+ran grader	succeeded
d10_w5	MiniMax-M3	R4	read+ran grader, read reward	succeeded
d10_w9	MiniMax-M3	R5,R6,R7	read+ran grader	succeeded
d11_w2	MiniMax-M3	R5	read+ran grader	succeeded
d12_w1	MiniMax-M3	R7	read+ran grader	succeeded
d1_w5	MiniMax-M3	R2,R7,R8	read+ran grader, read reward	succeeded
d5_w1	MiniMax-M3	R7,R9,R10,R11,R12	read+ran grader, read reward	succeeded
d8_w5	MiniMax-M3	R4,R5,R6,R10	read+ran grader	succeeded
d9_w11	MiniMax-M3	R3,R4,R6,R7,R15	read+ran grader, read reward	succeeded
d12_w1	Opus-4.8	R7	read+ran grader, read reward	succeeded
d5_w1	Opus-4.8	R9	read+ran grader, read reward	succeeded
d12_w1	Qwen3.7-Max	R7	read+ran grader	unclear
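
The summary counts quoted before the table can be re-derived from its rows. The snippet below is only a tally of the table as printed (task, model, rounds); the variable names are illustrative.

```python
from collections import Counter

# (task, model, rounds) copied from the table above.
ROWS = [
    ("d5_w1", "DeepSeek-V4-Flash", "R7"),
    ("d12_w1", "DeepSeek-V4-Pro", "R2"),
    ("d9_w11", "DeepSeek-V4-Pro", "R2,R3,R5,R6"),
    ("d11_w9", "Kimi-K2.6", "R4,R5,R6"),
    ("d12_w1", "Kimi-K2.6", "R3"),
    ("d1_w9", "Kimi-K2.6", "R7"),
    ("d5_w1", "Kimi-K2.6", "R2,R4,R13"),
    ("d9_w11", "Kimi-K2.6", "R2,R7"),
    ("d1_w9", "MiniMax-M2.7", "R2"),
    ("d10_w12", "MiniMax-M3", "R5,R6,R7"),
    ("d10_w4", "MiniMax-M3", "R6"),
    ("d10_w5", "MiniMax-M3", "R4"),
    ("d10_w9", "MiniMax-M3", "R5,R6,R7"),
    ("d11_w2", "MiniMax-M3", "R5"),
    ("d12_w1", "MiniMax-M3", "R7"),
    ("d1_w5", "MiniMax-M3", "R2,R7,R8"),
    ("d5_w1", "MiniMax-M3", "R7,R9,R10,R11,R12"),
    ("d8_w5", "MiniMax-M3", "R4,R5,R6,R10"),
    ("d9_w11", "MiniMax-M3", "R3,R4,R6,R7,R15"),
    ("d12_w1", "Opus-4.8", "R7"),
    ("d5_w1", "Opus-4.8", "R9"),
    ("d12_w1", "Qwen3.7-Max", "R7"),
]

tasks = {task for task, _, _ in ROWS}
pairs = {(task, model) for task, model, _ in ROWS}
occurrences = sum(len(rounds.split(",")) for _, _, rounds in ROWS)
earliest_round = min(int(r[1:]) for _, _, rounds in ROWS for r in rounds.split(","))

# Round-level occurrences per model, to show how concentrated the behavior was.
per_model = Counter()
for _, model, rounds in ROWS:
    per_model[model] += len(rounds.split(","))

print(len(tasks), len(pairs), occurrences, earliest_round)
print(per_model.most_common())
```

The tally gives 12 tasks, 22 (task, model) pairs, and 47 round-level occurrences, with no access earlier than round 2; MiniMax-M3 alone accounts for 27 of the 47 occurrences.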

"Access = succeeded" means the agent's terminal actually received grader/reward content. In almost all cases reading the grader did not help (the models still failed the round); we withdrew the affected v1 results regardless. If you evaluate with stock upstream Harbor in shared mode, apply the same sanitization or use separate-verifier mode.

Other corrections in the 2026-06-20 re-release

The re-run also fixed one contaminated task (d12_w1) and 11 task/test defects. Full details, dates, and the old-vs-new framing are in CHANGELOG.md.

Relation to Terminal-X

EvoCode-Bench is the iteration component of Terminal-X, alongside DeepTerminalBench for single-shot depth and RoadmapBench for version upgrades. Terminal-X contains the combined benchmark suite and cross-dataset blog; this repository focuses on the EvoCode-Bench task format, evaluation protocol, and official-Harbor runner.

Citation

@misc{shen2026evocodebench,
  title = {EvoCode-Bench: Evaluating Coding Agents in Multi-Turn Iterative Interactions},
  author = {Haiyang Shen and Xuanzhong Chen and Wendong Xu and Yun Ma and Liang Chen and Kuan Li},
  year = {2026},
  eprint = {2605.24110},
  archivePrefix = {arXiv},
  primaryClass = {cs.SE},
  url = {https://arxiv.org/abs/2605.24110}
}

License

Code in this repository is released under the MIT License. Dataset terms follow the dataset release metadata.
