ClawBench: AI Agent Performance on 153 Everyday Web Tasks

Name: ClawBench: AI Agent Performance on 153 Everyday Web Tasks
Creator: NAIL-Group
Published: 2026-04-10T12:20:01
Keywords: Task Evaluation, AI Agents, Web Interaction, Behavioral Data, Multimodal

Description

NAIL-Group's ClawBench dataset evaluates AI agents on 153 everyday online tasks across 144 live websites. It captures five layers of behavioral data (session replays, screenshots, HTTP traffic, reasoning traces, and browser actions) and provides human ground truth and step-level diagnostics via an agentic evaluator. The dataset was last updated on April 10, 2026.

Use Cases

Benchmarking AI agent performance on real-world web tasks using the 153 defined tasks.
Analyzing agent reasoning and decision-making through the captured reasoning traces.
Developing or training agentic evaluators using the step-level diagnostic scoring methodology.
Studying multimodal agent behavior across the five captured data layers (session replays, screenshots, HTTP traffic, reasoning traces, and browser actions).

Strengths

Includes 153 distinct everyday tasks for evaluation.
Captures data across 144 live websites, providing real-world context.
Collects five complementary layers of behavioral data per task.
Provides human ground-truth and step-level diagnostic scoring for each task.

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
Row count is not reported, which makes suitability harder to assess before download.
Freshness should be verified, because the tasks run against live websites that may have changed since the data was captured.
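
Because column-level documentation is absent, a first pass after download is usually to list which fields appear and what value types they hold. The following is a minimal sketch of that idea, not part of ClawBench itself: it assumes the records have already been loaded as Python dictionaries, and the field names in the sample (task_id, website, success) are hypothetical placeholders rather than the dataset's actual columns.

```python
from collections import defaultdict


def summarize_fields(records):
    """Map each field name to the value types observed and how many records contain it."""
    types = defaultdict(set)
    counts = defaultdict(int)
    for record in records:
        for key, value in record.items():
            types[key].add(type(value).__name__)
            counts[key] += 1
    return {
        key: {"types": sorted(types[key]), "present_in": counts[key]}
        for key in sorted(types)
    }


# Hypothetical records: the real ClawBench field names are undocumented,
# so these keys are placeholders for illustration only.
sample = [
    {"task_id": 1, "website": "example.com", "success": True},
    {"task_id": 2, "website": "example.org", "success": None},
]

summary = summarize_fields(sample)
for field, info in summary.items():
    print(field, info)
```

Fields that show more than one type, or that are present in fewer records than the total, are the ones most worth checking by hand before relying on them.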

Provenance

Source: NAIL-Group
Collection Method: Data captured from AI agents performing tasks on live websites.
Time Range: Not specified
Freshness: Last updated 2026-04-10 12:41:07
Geography: Not specified

Related Topics

Multimodal, Task Evaluation, AI Agents, Web Interaction, Behavioral Data

Quality Score

Overall: D36
Description: 39
Source: 36
Reputation: 35
Access: 26

Community

1 like

0 views

Dataset Info

Author: NAIL-Group
Created: Apr 10, 2026
Updated: Apr 10, 2026
Last synced: Apr 29, 2026
