Name: Auto-ClawEval: 1,040 AI Agent Tasks Across 104 Scenarios
Creator: AIcell
Published: 2026-04-11T17:23:33
Keywords: Benchmark, API Testing, Text, AI Agent Evaluation, Synthetic

Description

Auto-ClawEval is an auto-generated benchmark for evaluating AI agents, containing 1,040 tasks across 104 unique scenarios. It was generated with ClawEnvKit, published by AIcell on Hugging Face, and last updated on 2026-04-21. The tasks are a mix of API-based (77%) and file-dependent (23%) types, spanning 24 categories and involving 20 mock services.
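
The headline figures imply some simple arithmetic, sketched here in Python. This is only an illustration: it assumes tasks are spread evenly across scenarios, which the card does not state, and it treats the 77% and 23% shares as rounded values, so the derived counts are approximate.

```python
# Figures quoted in the description.
TOTAL_TASKS = 1040
SCENARIOS = 104
API_SHARE = 0.77  # rounded percentage as given in the card

# Assumption: tasks are distributed evenly across scenarios.
tasks_per_scenario = TOTAL_TASKS / SCENARIOS

# Approximate counts only, because the percentages are rounded.
approx_api_tasks = round(TOTAL_TASKS * API_SHARE)
approx_file_tasks = TOTAL_TASKS - approx_api_tasks

print(tasks_per_scenario, approx_api_tasks, approx_file_tasks)
```

Under these assumptions the split works out to roughly 800 API-based tasks and roughly 240 file-dependent tasks; the exact counts should be taken from the dataset itself.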

Use Cases

Benchmarking agent performance on API-based tasks, which make up 77% of the benchmark.
Testing agent capabilities on file-dependent tasks, which make up the remaining 23%.
Evaluating agent robustness across the 104 unique scenarios and 24 categories.
Developing and validating agent evaluation harnesses that integrate with ClawEnvKit.

Strengths

Contains 1,040 distinct evaluation tasks, providing a substantial test suite.
Covers 104 unique scenarios across 24 categories, suggesting diversity in test conditions.
Includes 20 mock services, which likely enables controlled testing environments.

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
The row count and file size of the published files are not stated, which makes suitability harder to assess before download.
The dataset is auto-generated, which may introduce patterns or biases not present in real-world data.
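
Because column-level documentation is absent, one way to infer field semantics after download is to tally which value types appear under each field. The sketch below is a minimal illustration, not the dataset's own tooling: the `summarize_fields` helper and the sample records (including the field names `task_id`, `category`, and `files`) are hypothetical, and it assumes the tasks have already been loaded as a list of dictionaries.

```python
from collections import defaultdict
from typing import Any, Iterable


def summarize_fields(records: Iterable[dict[str, Any]]) -> dict[str, set[str]]:
    """Map each field name to the set of Python type names observed for it."""
    seen: dict[str, set[str]] = defaultdict(set)
    for record in records:
        for key, value in record.items():
            seen[key].add(type(value).__name__)
    return dict(seen)


# Hypothetical records; the real field names are not documented in the card.
sample = [
    {"task_id": "t-001", "category": "example", "files": []},
    {"task_id": "t-002", "category": "example", "files": None},
]

summary = summarize_fields(sample)
for field, types in sorted(summary.items()):
    print(field, sorted(types))
```

Fields that show more than one type, or that are sometimes `NoneType`, are the ones most worth inspecting by hand before building an evaluation on them.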

Provenance

Source: AIcell via Hugging Face.
Collection Method: Auto-generated by ClawEnvKit.
Time Range: Published 2026-04-11 and last updated 2026-04-21.
Freshness: Last updated 2026-04-21 13:19:51; freshness should be verified.
Geography: Not specified.

Evaluation requires the ClawEnvKit Docker harness, as indicated in the quick start instructions.
