Description
Terminal-Bench Pro is a benchmark dataset for evaluating AI agents on terminal-based tasks. It contains 400 tasks across eight domains, including data processing, games, debugging, and machine learning, derived from real-world scenarios and GitHub issues. The dataset was created by alibabagroup and last updated on January 5, 2026.
Use Cases
Benchmarking terminal agent performance based on the 400 expert-designed tasks.
Training AI agents for system administration based on tasks in that domain.
Evaluating agent debugging capabilities based on tasks derived from GitHub issues.
Assessing agent proficiency in scientific computing and machine learning workflows.
Strengths
Contains 400 tasks, with 200 public and 200 private.
Tasks are derived from real-world scenarios and GitHub issues.
Covers eight distinct domains, including system administration and security.
Limitations
Column-level documentation is absent; field semantics must be inferred after download.
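Row count is not reported in the dataset metadata; since only 200 of the 400 tasks are public, the number of rows actually available should be confirmed before assessing suitability.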
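Because column-level documentation is absent, field semantics have to be inferred from the data itself. The following is a minimal sketch of one way to do that, assuming the downloaded rows are available as a list of dictionaries. The helper `summarize_columns` and the field names in `example_rows` are hypothetical and are not taken from the dataset.

```python
from collections import defaultdict
from typing import Any


def summarize_columns(
    rows: list[dict[str, Any]], max_samples: int = 3
) -> dict[str, dict[str, Any]]:
    """Report observed types, missing-value counts, and sample values per column."""
    summary: dict[str, dict[str, Any]] = defaultdict(
        lambda: {"types": set(), "missing": 0, "samples": []}
    )
    # Collect every key that appears in any row, so absent keys count as missing.
    all_keys = {key for row in rows for key in row}
    for row in rows:
        for key in all_keys:
            info = summary[key]
            value = row.get(key)
            if value is None:
                info["missing"] += 1
                continue
            info["types"].add(type(value).__name__)
            # Keep a few distinct sample values to help guess the field's meaning.
            if len(info["samples"]) < max_samples and value not in info["samples"]:
                info["samples"].append(value)
    return dict(summary)


# Hypothetical rows: the real field names are not documented in the source.
example_rows = [
    {"task_id": "t1", "domain": "debugging", "instruction": "Fix the failing test."},
    {"task_id": "t2", "domain": "games", "instruction": None},
]

for name, info in sorted(summarize_columns(example_rows).items()):
    print(name, sorted(info["types"]), "missing:", info["missing"], "samples:", info["samples"])
```

Running the same summary over the real rows also yields the row count, since `len(rows)` is known once the data is loaded.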
Provenance
Source
alibabagroup via Hugging Face
Collection Method
Expert-designed tasks derived from real-world scenarios and GitHub issues.
Time Range
Not specified
Freshness
Last updated 2026-01-05 22:15:49; freshness should be verified.
Geography
Not specified
License
Unknown; restrictions should be verified before use.