Description
ClawsBench is a dataset for evaluating the capability and safety of LLM productivity agents across 5 high-fidelity mock services (Gmail, Calendar, Docs, Drive, Slack). It contains 44 tasks, including 30 single-service and 14 cross-service tasks, with 24 safety-critical scenarios, and was used to test 6 models. The dataset was created by benchflow and last updated on 2026-04-08.
Use Cases
Benchmarking LLM agent task success rates on the 44 simulated productivity tasks.
Evaluating agent safety and harmful-action prevention on the 24 safety-critical scenarios.
Comparing model performance per service across Gmail, Calendar, Docs, Drive, and Slack.
Analyzing agent capability on cross-service workflows using the 14 cross-service tasks.
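The use cases above all reduce to aggregating per-task outcomes. The following is a minimal sketch of that aggregation, not the dataset's own scoring code. Because column-level documentation is absent, the record keys used here (`services`, `success`, `safety_critical`, `harmful_action`) are hypothetical and would need to be mapped to the real fields after download.

```python
from collections import defaultdict


def rate(successes, total):
    """Return successes / total, or None when there are no runs to average."""
    return successes / total if total else None


def summarize_results(results):
    """Aggregate task outcomes into capability and safety rates.

    Each result is a dict with HYPOTHETICAL keys (not confirmed by the dataset):
      services        - list of service names the task touches
      success         - True if the agent completed the task
      safety_critical - True if the task is a safety-critical scenario
      harmful_action  - True if the agent took a harmful action
    """
    by_service = defaultdict(lambda: [0, 0])  # service -> [successes, runs]
    overall = [0, 0]
    cross = [0, 0]    # tasks touching more than one service
    safety = [0, 0]   # [runs without a harmful action, safety-critical runs]
    for r in results:
        ok = int(r["success"])
        overall[0] += ok
        overall[1] += 1
        for service in r["services"]:
            by_service[service][0] += ok
            by_service[service][1] += 1
        if len(r["services"]) > 1:
            cross[0] += ok
            cross[1] += 1
        if r["safety_critical"]:
            safety[0] += int(not r["harmful_action"])
            safety[1] += 1
    return {
        "overall": rate(*overall),
        "by_service": {s: rate(*c) for s, c in sorted(by_service.items())},
        "cross_service": rate(*cross),
        "safety": rate(*safety),
    }
```

A cross-service task counts toward every service it touches, so the per-service rates are not independent; whether that matches the benchmark's own reporting is an assumption.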
Strengths
Includes 44 distinct tasks for evaluation, providing a structured test suite.
Covers 5 high-fidelity mock services (Gmail, Calendar, Docs, Drive, Slack) for realistic simulation.
Explicitly measures both capability (task success) and safety (harmful action prevention).
Contains 24 safety-critical scenarios for focused safety testing.
Limitations
Column-level documentation is absent; field semantics must be inferred after download.
Row count is not reported, so it is unclear how rows map to the 44 tasks; this makes suitability harder to assess before download.
Last updated 2026-04-08 09:27:29; freshness should be verified.
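Given the missing row count and column documentation, one way to sanity-check a download is to compare it against the counts stated in the description (44 tasks, 30 single-service, 14 cross-service, 24 safety-critical). The sketch below assumes the tasks have already been loaded as a list of dicts, one per task; the keys `services` and `safety_critical` are hypothetical placeholders for whatever the real fields turn out to be.

```python
# Counts taken from the dataset description.
EXPECTED = {
    "tasks": 44,
    "single_service": 30,
    "cross_service": 14,
    "safety_critical": 24,
}
KNOWN_SERVICES = {"gmail", "calendar", "docs", "drive", "slack"}


def check_composition(tasks):
    """Compare loaded tasks against the described composition.

    `tasks` is a list of dicts with HYPOTHETICAL keys `services` (list of
    service names) and `safety_critical` (bool).
    Returns (mismatches, unknown_services), where mismatches maps a count
    name to an (expected, observed) pair.
    """
    observed = {
        "tasks": len(tasks),
        "single_service": sum(1 for t in tasks if len(t["services"]) == 1),
        "cross_service": sum(1 for t in tasks if len(t["services"]) > 1),
        "safety_critical": sum(1 for t in tasks if t["safety_critical"]),
    }
    mismatches = {
        name: (EXPECTED[name], observed[name])
        for name in EXPECTED
        if observed[name] != EXPECTED[name]
    }
    seen = {s.lower() for t in tasks for s in t["services"]}
    unknown_services = sorted(seen - KNOWN_SERVICES)
    return mismatches, unknown_services
```

An empty `mismatches` dict and an empty `unknown_services` list would suggest the download matches the description; anything else is a prompt to inspect the fields by hand rather than proof of an error.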
Provenance
Source: benchflow
Collection Method: Not documented; likely simulated tasks and evaluations run against the mock services.
Freshness: 2026-04-08
License: Unknown; verify before use.