Claw Eval Agent Benchmark with 24 Task Categories | DataSalon

Government & Legal

Name: Claw Eval Agent Benchmark with 24 Task Categories
Creator: claw-eval
Published: 2026-03-25T08:45:55
Keywords: Library: polars, Language: zh, OPTIMIZED-PARQUET, Language: en, Size Categories: n<1K, Modality: text, Real World, Agent Bench, Library: mlcroissant, Evaluation, Library: datasets, Library: pandas, Parquet, Region: us, License: mit, Multimodal

by claw-eval · Updated 2mo ago

Available on 1 platform

Description

The benchmark comprises 139 agent tasks, divided into general and multimodal splits, for evaluating real-world AI agent performance. Created by claw-eval, it covers 24 task categories, including communication, finance, and operations. It was published on March 25, 2026; Dataset Info lists the most recent update as April 3, 2026.

Use Cases

Benchmark agent performance across the 24 task categories, such as communication and finance
Evaluate multimodal agent capabilities on tasks that require perception and creation, such as webpage generation
Analyze per-task results (keyed by task_id) to compare agent strategies on the general and multimodal splits
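
As a rough illustration of the last use case, the sketch below groups per-task outcomes by split and computes a pass rate for each group. The record fields (`task_id`, `split`, `category`, `passed`), the sample values, and the helper `pass_rate_by` are all hypothetical; this listing does not document the dataset's actual columns, so adapt the field names after inspecting the real data.

```python
from collections import defaultdict

# Hypothetical per-task results produced by your own evaluation harness.
# Field names and values are assumptions, not the dataset's documented schema.
results = [
    {"task_id": "g-001", "split": "general", "category": "finance", "passed": True},
    {"task_id": "g-002", "split": "general", "category": "communication", "passed": False},
    {"task_id": "m-001", "split": "multimodal", "category": "webpage generation", "passed": True},
]

def pass_rate_by(records, key):
    """Return {group: fraction of records with passed=True}, grouped by `key`."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for rec in records:
        group = rec[key]
        totals[group] += 1
        passes[group] += int(rec["passed"])
    return {group: passes[group] / totals[group] for group in totals}

print(pass_rate_by(results, "split"))     # {'general': 0.5, 'multimodal': 1.0}
print(pass_rate_by(results, "category"))  # one entry per category
```

Passing `"category"` instead of `"split"` gives the same breakdown per task category, which may be more informative given that the benchmark spans 24 categories but only 139 tasks.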

Strengths

139 total examples across two defined splits
Covers 24 distinct real-world task categories
Includes a dedicated multimodal split of 35 tasks

Limitations

Small scale with only 139 total task examples
Unknown data structure, columns, and sample data details limit reproducibility
Potential bias from unknown data collection and curation methods

Provenance

Source: claw-eval on Hugging Face
Collection Method: Benchmark dataset for evaluating AI agents
Time Range: Not specified
Freshness: Published March 2026; Dataset Info lists the last update as Apr 3, 2026
Geography: Not specified

The dataset structure, columns, and sample data are not documented here; see the dataset page on Hugging Face for full details. The tags list the license as MIT, but this has not been confirmed.
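
Because the columns are undocumented, a reasonable first step after downloading the data is to inspect the schema yourself. The sketch below assumes the rows have already been loaded as a list of dictionaries (for example, from the Parquet files via a library of your choice); the helper `summarize_columns` and the placeholder rows are hypothetical and do not reflect the real column names.

```python
def summarize_columns(rows):
    """Map each column name to the sorted list of Python type names seen in it."""
    seen = {}
    for row in rows:
        for name, value in row.items():
            seen.setdefault(name, set()).add(type(value).__name__)
    return {name: sorted(types) for name, types in seen.items()}

# Placeholder rows standing in for real records; the column names are assumptions.
sample_rows = [
    {"task_id": "example-1", "category": "finance", "attachments": None},
    {"task_id": "example-2", "category": "operations", "attachments": ["a.png"]},
]

print(summarize_columns(sample_rows))
# {'task_id': ['str'], 'category': ['str'], 'attachments': ['NoneType', 'list']}
```

Columns that report more than one type (here, the placeholder `attachments`) are worth checking by hand, since they often indicate optional or multimodal fields.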

Quality Score

D37

Community

426 downloads

10 likes

0 views

Dataset Info

Author: claw-eval
Created: Mar 25, 2026
Updated: Apr 3, 2026
Last synced: Jun 11, 2026
