DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.

Government & Legal

ClawGym-Bench: 200 Diagnostic Tasks for AI Agent Evaluation

Name: ClawGym-Bench: 200 Diagnostic Tasks for AI Agent Evaluation
Creator: RUC-AIBOX
Published: 2026-05-15T06:57:49
Keywords: Claw-Style Agents, Code Verification, Benchmark, AI Agent Benchmark, Text, Task Diagnostic

Available on 1 platform

Description

200 diagnostic instances for Claw-style agents, each containing a user instruction, mock workspace resources, and a task-specific verifier. The benchmark was created by RUC-AIBOX and last updated on May 15, 2026. It uses a difficulty-aware filtering process for task selection.
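The three parts of an instance named above (user instruction, mock workspace resources, task-specific verifier) can be pictured with a minimal sketch. The listing provides no column-level documentation, so every name here (`DiagnosticInstance`, `instruction`, `workspace`, `verifier`, `run_instance`, `toy_agent`) is a hypothetical stand-in rather than the dataset's actual schema, and the workspace is assumed to be a simple mapping from file path to file content.

```python
from dataclasses import dataclass
from typing import Callable, Dict

Workspace = Dict[str, str]  # assumed shape: file path -> file content


@dataclass
class DiagnosticInstance:
    """Hypothetical container; field names are NOT taken from the dataset."""
    instruction: str                        # what the user asks the agent to do
    workspace: Workspace                    # mock resources the agent may use
    verifier: Callable[[Workspace], bool]   # task-specific check on the result


def run_instance(instance: DiagnosticInstance,
                 agent: Callable[[str, Workspace], Workspace]) -> bool:
    """Run an agent on a copy of the workspace, then apply the verifier."""
    working_copy = dict(instance.workspace)  # keep the original untouched
    final_workspace = agent(instance.instruction, working_copy)
    return instance.verifier(final_workspace)


# Toy example with an invented task and a trivial stand-in "agent".
example = DiagnosticInstance(
    instruction="Create notes.txt containing the word done",
    workspace={"README.md": "demo"},
    verifier=lambda ws: "done" in ws.get("notes.txt", ""),
)


def toy_agent(instruction: str, ws: Workspace) -> Workspace:
    ws["notes.txt"] = "done"
    return ws


passed = run_instance(example, toy_agent)
```

Running the agent on a copy keeps each instance reusable across agents, which is one plausible way to make repeated evaluations comparable.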

Use Cases

Benchmarking AI agent performance based on the 200 diagnostic tasks.
Analyzing agent failure modes using the task-specific verifiers described.
Comparing verification methods based on the 156 code-based and 44 hybrid verification tasks.
Developing new agent training curricula based on difficulty-aware filtered tasks.
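For the benchmarking and verifier-comparison use cases above, one simple starting point is a pass rate per verification type. The sketch below assumes results are already available as `(verifier_type, passed)` pairs; the labels `"code"` and `"hybrid"` and the function name `pass_rate_by_verifier` are illustrative choices, not identifiers from the dataset.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def pass_rate_by_verifier(
    results: Iterable[Tuple[str, bool]]
) -> Dict[str, float]:
    """Return the fraction of passed tasks for each verifier type.

    `results` holds (verifier_type, passed) pairs; the type labels are
    whatever the caller chooses (assumed here: "code" and "hybrid").
    """
    totals: Dict[str, int] = defaultdict(int)
    passes: Dict[str, int] = defaultdict(int)
    for verifier_type, passed in results:
        totals[verifier_type] += 1
        passes[verifier_type] += int(passed)
    return {vt: passes[vt] / totals[vt] for vt in totals}


# Synthetic outcomes for illustration only; these are not benchmark results.
sample = [("code", True), ("code", False), ("hybrid", True)]
rates = pass_rate_by_verifier(sample)
```

Reporting the two groups separately avoids mixing binary code checks with weighted hybrid scores in a single number.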

Strengths

Contains 200 distinct diagnostic instances for agent evaluation.
156 tasks use code-based verification, providing objective scoring.
44 tasks use a hybrid verification method with a defined 0.7/0.3 weighting scheme.
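The 0.7/0.3 weighting mentioned for hybrid tasks can be read as a weighted sum of two component scores. The listing does not say which component receives which weight or how each component is scored, so the sketch below assumes two scores in the range 0 to 1, with 0.7 applied to the first and 0.3 to the second; the names `hybrid_score`, `primary`, and `secondary` are hypothetical.

```python
def hybrid_score(primary: float, secondary: float,
                 w_primary: float = 0.7, w_secondary: float = 0.3) -> float:
    """Combine two component scores with a fixed weighting.

    Assumption: both scores lie in [0, 1]; which verifier component maps to
    `primary` (weight 0.7) or `secondary` (weight 0.3) is not stated in the
    listing and must be checked against the dataset itself.
    """
    for name, value in (("primary", primary), ("secondary", secondary)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} score must be in [0, 1], got {value}")
    return w_primary * primary + w_secondary * secondary


# Illustrative call: first component fully satisfied, second half satisfied.
score = hybrid_score(1.0, 0.5)
```

Because the weights sum to 1, the combined score stays in the same 0 to 1 range as its inputs.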

Limitations

Description metadata is limited; actual data quality requires manual inspection after download.
Column-level documentation is absent; field semantics must be inferred after download.
Row count is not reported in the listing metadata; the description states 200 instances, which should be confirmed after download.

Provenance

Source: RUC-AIBOX
Collection Method: Selected through difficulty-aware filtering.
Time Range: not specified
Freshness: Last updated 2026-05-15 07:25:39; freshness should be verified.
Geography: not specified

License restrictions are unknown and should be verified before use.

Quality Score

D38

Community

26 downloads

1 like

0 views

Dataset Info

Author: RUC-AIBOX
Created: May 15, 2026
Updated: May 15, 2026
Last synced: Jun 8, 2026
