Description
EvoCode-Bench contains 26 executable tasks with 227 total rounds for evaluating coding agents in persistent software engineering interactions. The dataset, created by UnipatAI, uses the Harbor multi-step task format and includes workspaces, task metadata, and verification assets. It was last updated on June 20, 2026.
Use Cases
Benchmarking coding agents on multi-turn software tasks packaged in the Harbor multi-step task format.
Evaluating how agent performance holds up across the dataset's 227 rounds of interaction.
Testing AI systems against the executable verification assets included with each task.
Strengths
Contains 26 distinct executable tasks for structured evaluation.
Provides 227 total rounds of interaction (about 8.7 rounds per task on average) for multi-turn analysis.
Includes workspaces, task metadata, round-level instructions, and verification assets.
Limitations
Column-level documentation is absent; field semantics must be inferred after download.
Row count is not reported, which makes suitability harder to assess before download.
Last updated 2026-06-20 03:55:59; freshness should be verified.
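Because column-level documentation is absent, a first pass over a downloaded copy can help with inferring the layout. The following is a minimal sketch, not part of the dataset's tooling: it assumes the dataset has already been downloaded to a local directory, and the function name `summarize_tree` is hypothetical. It lists the top-level entries and counts files by extension without assuming any particular file names or schema.

```python
from collections import Counter
from pathlib import Path


def summarize_tree(root):
    """Return (top-level entry names, file counts by extension) for `root`.

    `root` is assumed to be a local directory holding a downloaded copy
    of the dataset; nothing here depends on the dataset's actual layout.
    """
    root_path = Path(root)
    by_suffix = Counter()
    for path in root_path.rglob("*"):
        if path.is_file():
            # Files without an extension are grouped under "(none)".
            by_suffix[path.suffix.lower() or "(none)"] += 1
    top_level = sorted(p.name for p in root_path.iterdir())
    return top_level, by_suffix
```

Calling `summarize_tree("path/to/local/copy")` and printing both return values gives a quick picture of how tasks, metadata, and verification assets are laid out before any field is relied on.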
Provenance
Source
UnipatAI via Hugging Face.
Collection Method
Not documented; the dataset was likely constructed as a benchmark for evaluating coding agents.
Freshness
Last updated 2026-06-20 03:55:59.
License
Unknown; terms of use must be verified.