DataSalon

Claw-Eval-Live: 105 Controlled Tasks for Workflow Agent Benchmarking | DataSalon

Government & Legal

Name: Claw-Eval-Live: 105 Controlled Tasks for Workflow Agent Benchmarking
Creator: claw-eval-live
Published: 2026-05-07T07:00:47
Keywords: Agent Evaluation, AI Testing, Benchmark, Workflow Benchmark, Tabular, Controlled Tasks

Available on 1 platform

Description

Claw-Eval-Live is a live benchmark dataset for workflow agents containing 105 controlled tasks. It includes fixtures, mock services, sandboxed workspaces, task-specific graders, and recorded execution evidence. The dataset is a time-stamped snapshot built from public workflow-demand signals and accompanies an anonymous submission to NeurIPS 2026.

Use Cases

Benchmarking workflow agent performance on the 105 controlled tasks.
Evaluating agent execution in sandboxed workspaces using the recorded execution evidence.
Testing agent interaction with mock services using the provided fixtures.
Assessing task-specific grading criteria as defined by the benchmark design.

Strengths

Contains 105 distinct controlled tasks for structured evaluation.
Includes task-specific graders and recorded execution evidence for reproducibility.
Designed as a live benchmark with a rerunnable signal-to-task pipeline for evolving demand and models.

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
Row count is unknown, which may limit suitability assessment.
Description metadata is limited; actual data quality requires manual inspection after download.
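
Because column semantics and row count are undocumented, a quick profile of a downloaded file is a reasonable first inspection step. The following is a minimal sketch, assuming the data can be exported as CSV (the listing only says "Tabular", so the format is an assumption). The function names `guess_type` and `profile_csv` and the sample columns `task_id`, `score`, and `notes` are hypothetical and are not taken from the dataset.

```python
import csv
import io
from collections import Counter


def guess_type(value: str) -> str:
    """Classify one raw cell as 'empty', 'int', 'float', or 'str'."""
    text = value.strip()
    if not text:
        return "empty"
    try:
        int(text)
        return "int"
    except ValueError:
        pass
    try:
        float(text)
        return "float"
    except ValueError:
        return "str"


def profile_csv(stream) -> dict:
    """Return the row count and per-column type counts for a CSV stream."""
    reader = csv.DictReader(stream)
    columns = {name: Counter() for name in (reader.fieldnames or [])}
    row_count = 0
    for row in reader:
        row_count += 1
        for name in columns:
            # Short rows yield None for missing cells; treat them as empty.
            columns[name][guess_type(row.get(name) or "")] += 1
    return {"rows": row_count, "columns": columns}


if __name__ == "__main__":
    # Hypothetical sample; the real column names are not documented.
    sample = io.StringIO("task_id,score,notes\n1,0.5,ok\n2,,\n")
    report = profile_csv(sample)
    print("rows:", report["rows"])
    for name, counts in report["columns"].items():
        print(name, dict(counts))
```

To profile a real download, pass an open file handle instead of the in-memory sample, for example `profile_csv(open(path, newline="", encoding="utf-8"))`, where `path` is wherever the file was saved. The type counts only hint at field semantics; they do not replace manual inspection.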

Provenance

Source: Built from public workflow-demand signals.
Collection Method: Signal-to-task pipeline designed to be rerun.
Freshness: Last updated 2026-05-07 07:32:27.

Quality Score

D37

Community

1 like

0 views

Dataset Info

Author: claw-eval-live
Created: May 7, 2026
Updated: May 7, 2026
Last synced: May 14, 2026
