Terminal-Bench Verified Dataset for Code Agent Evaluation

Name: Terminal-Bench Verified Dataset for Code Agent Evaluation
Creator: zai-org
Published: 2026-02-05T06:41:09
Keywords: Benchmark Evaluation, Code Agents, arXiv:2601.11868, arXiv:2602.15763, Text, Region: US, Reinforcement Learning, Instruction Tuning, License: Apache 2.0

Description

Terminal-Bench 2.0 Verified is a corrected version of Terminal-Bench 2.0, a benchmark for evaluating AI code agents. It addresses environment and instruction issues identified in the original. The organization zai-org reviewed and modified the dataset and released the verified version in February 2026. It includes updated Dockerfiles and instructions intended to support the Claude Code Agent runtime.

Use Cases

Benchmarking the performance of code agents, such as GLM-5, using the verified instruction set.
Testing the stability of agent-environment interaction with the updated Dockerfiles provided.
Comparing model outputs, such as those from Step 3.5-Flash, against a standardized, corrected task suite.
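
The comparison use case above can be sketched in a few lines of Python. This is a minimal illustration, not the dataset's own tooling: the function names `pass_rate` and `compare_agents`, the task identifiers, and the result values are all hypothetical placeholders, and the sketch assumes each agent's run has already been reduced to a mapping from task ID to a solved/unsolved flag.

```python
from typing import Dict, Union


def pass_rate(results: Dict[str, bool]) -> float:
    """Return the fraction of tasks marked solved (0.0 for an empty mapping)."""
    if not results:
        return 0.0
    return sum(results.values()) / len(results)


def compare_agents(
    a: Dict[str, bool], b: Dict[str, bool]
) -> Dict[str, Union[int, float]]:
    """Compare two agents only on tasks both attempted, so the rates share one task set."""
    shared = set(a) & set(b)
    rate_a = pass_rate({task: a[task] for task in shared})
    rate_b = pass_rate({task: b[task] for task in shared})
    return {
        "shared_tasks": len(shared),
        "rate_a": rate_a,
        "rate_b": rate_b,
        "delta": rate_a - rate_b,
    }


# Hypothetical placeholder results; these are NOT real benchmark outcomes.
example_agent_a = {"task-001": True, "task-002": False, "task-003": True, "task-004": True}
example_agent_b = {"task-001": True, "task-002": False, "task-003": False, "task-005": True}

if __name__ == "__main__":
    print(compare_agents(example_agent_a, example_agent_b))
```

Restricting the comparison to shared tasks keeps the two rates comparable when one run skipped or failed to start some environments.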

Strengths

Dataset underwent a comprehensive review to identify and fix various issues from the original Terminal-Bench 2.0.
Includes specific fixes for environment setup and instruction clarity to support the Claude Code Agent runtime.

Limitations

The exact size, row count, and specific data columns are not provided in the description.
The verification scope is limited to issues found by the maintainers, which may not cover all potential problems.

Provenance

Source: zai-org on Hugging Face.
Collection Method: Modification and verification of the original Terminal-Bench 2.0 dataset.
Freshness: Last updated on 2026-02-27.

The full description, including details on fixes and usage, is hosted externally on the Hugging Face dataset page. The dataset's tags list the license as Apache 2.0.

Quality Score

Overall: C (41)
Description: 42
Source: 39
Reputation: 56
Access: 22

Community

Downloads: 1.8K
Likes: 64
Views: 0

Dataset Info

Author: zai-org
Created: Feb 5, 2026
Updated: Feb 27, 2026
Last synced: Jun 20, 2026
