Name: ClawsBench: LLM Agent Evaluation on Simulated Productivity Tasks
Creator: benchflow
Published: 2026-04-08T09:04:46
Keywords: Safety Testing, Agent Benchmark, LLM Evaluation, Tabular, Productivity Tasks, Synthetic

Description

ClawsBench is a dataset for evaluating the capability and safety of LLM productivity agents across 5 high-fidelity mock services (Gmail, Calendar, Docs, Drive, Slack). It contains 44 tasks, including 30 single-service and 14 cross-service tasks, with 24 safety-critical scenarios, and was used to test 6 models. The dataset was created by benchflow and last updated on 2026-04-08.

Use Cases

Benchmarking LLM agent task success rates across the 44 simulated productivity tasks.
Evaluating agent safety and harmful-action prevention using the 24 safety-critical scenarios.
Comparing model performance across services, using tasks that involve Gmail, Calendar, Docs, Drive, and Slack.
Analyzing agent capability on cross-service workflows using the 14 cross-service tasks.
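
The comparisons listed above reduce to grouping per-task outcomes and computing a pass fraction per group. The following is a minimal sketch of that aggregation, not the benchmark's own scoring code. The field names (`model`, `scope`, `success`) and the toy rows are assumptions made for illustration, since the dataset's actual schema is not documented here.

```python
from collections import defaultdict


def success_rates(records, group_key):
    """Return {group: fraction of records in that group with a truthy 'success'}."""
    totals = defaultdict(int)
    passed = defaultdict(int)
    for rec in records:
        group = rec[group_key]
        totals[group] += 1
        if rec["success"]:
            passed[group] += 1
    return {group: passed[group] / totals[group] for group in totals}


# Toy rows invented for illustration; these are NOT ClawsBench results,
# and the field names are hypothetical.
records = [
    {"model": "model-a", "scope": "single-service", "success": True},
    {"model": "model-a", "scope": "cross-service", "success": False},
    {"model": "model-b", "scope": "single-service", "success": True},
    {"model": "model-b", "scope": "cross-service", "success": True},
]

by_scope = success_rates(records, "scope")
by_model = success_rates(records, "model")
```

The same helper could be pointed at a service or safety-flag field, if the downloaded data turns out to expose one.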

Strengths

Includes 44 distinct tasks for evaluation, providing a structured test suite.
Covers 5 high-fidelity mock services (Gmail, Calendar, Docs, Drive, Slack) for realistic simulation.
Explicitly measures both capability (task success) and safety (harmful action prevention).
Contains 24 safety-critical scenarios for focused safety testing.

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
Row count is not reported, which makes it harder to assess suitability before download.
Last updated 2026-04-08 09:27:29; freshness should be verified.
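
Because column documentation and row count are both missing, a first pass after download might simply list each column with its fill count and a few example values. The sketch below assumes the data can be read as CSV text, which is not confirmed by the source. The column names in the sample string are hypothetical placeholders, not the real ClawsBench fields.

```python
import csv
import io


def summarize_columns(csv_text, max_examples=3):
    """Return (row_count, {column: {"non_empty": int, "examples": [str, ...]}})."""
    reader = csv.DictReader(io.StringIO(csv_text))
    summary = {
        name: {"non_empty": 0, "examples": []}
        for name in (reader.fieldnames or [])
    }
    row_count = 0
    for row in reader:
        row_count += 1
        for name, value in row.items():
            # Skip overflow cells (key None) and empty or missing values.
            if name in summary and value:
                info = summary[name]
                info["non_empty"] += 1
                if value not in info["examples"] and len(info["examples"]) < max_examples:
                    info["examples"].append(value)
    return row_count, summary


# Hypothetical sample; the real file layout must be checked after download.
sample = "task_id,service,is_safety_critical\nt1,gmail,true\nt2,slack,\n"
row_count, summary = summarize_columns(sample)
```

A summary like this gives a starting point for inferring field semantics, but it does not replace documentation from the dataset creator.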

Provenance

Source: benchflow
Collection Method: Not documented; the dataset likely consists of simulated tasks and agent evaluations run against mock services.
Freshness: 2026-04-08

License is unknown and should be verified before use.
