NAIL-Group's ClawBench dataset evaluates AI agents on 153 everyday online tasks across 144 live websites. It captures five layers of behavioral data (session replays, screenshots, HTTP traffic, reasoning traces, and browser actions) and provides human ground truth and step-level diagnostics via an agentic evaluator. The dataset was last updated on April 10, 2026.
Use Cases
- Benchmarking AI agent performance on real-world web tasks using the 153 defined tasks.
- Analyzing agent reasoning and decision-making through the captured reasoning traces.
- Developing or training agentic evaluators using the step-level diagnostic scoring.
- Studying multimodal agent behavior across the five captured data layers (session replays, screenshots, HTTP traffic, reasoning traces, and browser actions).
Strengths
- Includes 153 distinct everyday tasks for evaluation.
- Captures data across 144 live websites, providing real-world context.
- Collects five complementary layers of behavioral data per task.
- Provides human ground truth and step-level diagnostic scoring for each task.
Limitations
- Column-level documentation is absent; field semantics must be inferred after download.
- Row count is not reported, which makes it harder to judge whether the dataset suits a given use.
- Freshness should be verified, because the data records interactions with live websites that can change after capture.
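Because column documentation and row count are both missing, a first pass after download could profile the records directly. The following is a minimal sketch, assuming the data has already been loaded as a list of dictionaries; the field names in `sample` (`task_id`, `website`, `actions`) are hypothetical placeholders, not ClawBench's actual columns.

```python
from collections import defaultdict


def summarize_fields(records):
    """Report the row count and, per field, the observed value types and missing count."""
    types = defaultdict(set)
    present = defaultdict(int)
    for row in records:
        for key, value in row.items():
            types[key].add(type(value).__name__)
            present[key] += 1
    total = len(records)
    fields = {
        key: {"types": sorted(types[key]), "missing": total - present[key]}
        for key in sorted(types)
    }
    return {"rows": total, "fields": fields}


# Hypothetical rows for illustration only; real field names must be read from the download.
sample = [
    {"task_id": 1, "website": "example.com", "actions": ["click"]},
    {"task_id": 2, "website": None},
]
summary = summarize_fields(sample)
print(summary["rows"])
for name, info in summary["fields"].items():
    print(name, info)
```

Fields that show several types or a high missing count are the ones whose semantics most need manual inspection.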
Provenance
- Source: NAIL-Group
- Collection Method: Data captured from AI agents performing tasks on live websites.
- Time Range: Not specified
- Freshness: Last updated 2026-04-10 12:41:07
- Geography: Not specified
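The freshness timestamp above can be turned into an age in days before deciding whether the captured sessions are recent enough. This is a rough sketch that assumes the timestamp is a naive value in the `YYYY-MM-DD HH:MM:SS` layout shown above; its time zone is not stated, so none is applied, and the comparison date is a made-up example.

```python
from datetime import datetime

LAST_UPDATED = "2026-04-10 12:41:07"  # value listed under Freshness


def days_since_update(last_updated, now):
    """Return the number of whole days between the last update and `now`."""
    updated = datetime.strptime(last_updated, "%Y-%m-%d %H:%M:%S")
    return (now - updated).days


# Hypothetical "current" date chosen for illustration; pass datetime.now() in practice.
age_days = days_since_update(LAST_UPDATED, datetime(2026, 5, 10, 12, 41, 7))
print(age_days)
```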