Terminal-Bench Pro: 400 Expert-Designed Tasks for AI Agent Evaluation | DataSalon

Software Engineering & Security

Name: Terminal-Bench Pro: 400 Expert-Designed Tasks for AI Agent Evaluation
Creator: alibabagroup
Published: 2025-12-30T01:57:52
Keywords: AI Evaluation, Task Dataset, Text, Terminal Agent, Software Benchmark

Available on 1 platform

Description

Terminal-Bench Pro is a benchmark dataset for evaluating AI agents on terminal-based tasks. It contains 400 tasks across eight domains, including data processing, games, debugging, and machine learning, derived from real-world scenarios and GitHub issues. The dataset was created by alibabagroup and last updated on January 5, 2026.

Use Cases

Benchmarking terminal agent performance on the 400 expert-designed tasks.
Training AI agents for system administration using the tasks in that domain.
Evaluating agent debugging capabilities using the tasks derived from GitHub issues.
Assessing agent proficiency in scientific computing and machine learning workflows.

Strengths

Contains 400 tasks, with 200 public and 200 private.
Tasks are derived from real-world scenarios and GitHub issues.
Covers eight distinct domains, including system administration and security.

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
Row count is unknown, which may limit suitability assessment.
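
Because column-level documentation is absent, one way to get oriented after download is to summarize which fields appear in the records and what types they hold. The following is a minimal, standard-library sketch of that idea, not a documented workflow for this dataset. The helper name `summarize_fields` and the sample records (including the field names `task_id`, `domain`, `instruction`, and `timeout`) are hypothetical placeholders; the dataset's actual schema is not stated on this page.

```python
from collections import Counter, defaultdict


def summarize_fields(records):
    """Summarize a list of dict records.

    Returns {field: {"count": n, "types": [...], "example": value}}, where
    "count" is the number of records containing the field, "types" lists the
    Python type names seen for it, and "example" is the first value seen.
    """
    counts = Counter()
    types = defaultdict(set)
    examples = {}
    for record in records:
        for key, value in record.items():
            counts[key] += 1
            types[key].add(type(value).__name__)
            examples.setdefault(key, value)  # keep only the first example
    return {
        key: {
            "count": counts[key],
            "types": sorted(types[key]),
            "example": examples[key],
        }
        for key in counts
    }


# Hypothetical records standing in for downloaded rows (not the real schema).
sample = [
    {"task_id": "t-001", "domain": "debugging",
     "instruction": "Fix the failing test."},
    {"task_id": "t-002", "domain": "data processing",
     "instruction": "Aggregate the CSV.", "timeout": 600},
]

summary = summarize_fields(sample)
for name, info in summary.items():
    print(name, info["count"], info["types"])
```

A field whose count is lower than the number of records (here, the hypothetical `timeout`) is optional, and a field with more than one entry in `types` may need normalizing before use.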

Provenance

Source: alibabagroup via Hugging Face
Collection Method: Expert-designed tasks derived from real-world scenarios and GitHub issues.
Time Range: Not specified
Freshness: Last updated 2026-01-05 22:15:49; freshness should be verified.
Geography: Not specified

License is unknown; restrictions should be verified before use.

Quality Score

D40

Community

112 downloads

4 likes

0 views

Dataset Info

Author: alibabagroup
Created: Dec 30, 2025
Updated: Jan 5, 2026
Last synced: Jun 1, 2026
