Available on 1 platform
This benchmark, created by claw-eval, comprises 139 agent tasks across general and multimodal splits that evaluate real-world AI agent performance. It covers 24 categories, including communication, finance, and operations, and was last updated in March 2026.
The dataset structure, columns, and sample data are not documented here; see the dataset page for full details. The license is listed as MIT in the tags but is not otherwise confirmed.