Available on 1 platform
A pilot benchmark of 30 manually written tasks for evaluating LLM agents, created by Mosi-AI and last updated in April 2026. The dataset focuses on complex, real-world assistant tasks such as booking flights and debugging code. It introduces a Triple-Axis Complexity Framework intended to address gaps in existing benchmarks.
The license is unknown; verify the terms of use before using the dataset.