Claw-Eval-Live is a live benchmark dataset of 105 controlled tasks for workflow agents. It includes fixtures, mock services, sandboxed workspaces, task-specific graders, and recorded execution evidence. The dataset is a time-stamped snapshot built from public workflow-demand signals and accompanies an anonymous submission to NeurIPS 2026.
Use Cases
- Benchmarking workflow agent performance on the 105 controlled tasks.
- Evaluating agent execution in sandboxed workspaces using the recorded execution evidence.
- Testing agent interaction with mock services using the provided fixtures.
- Assessing the benchmark's task-specific grading criteria.
Strengths
- Contains 105 distinct controlled tasks for structured evaluation.
- Includes task-specific graders and recorded execution evidence for reproducibility.
- Designed as a live benchmark: the signal-to-task pipeline can be rerun as demand and models evolve.
Limitations
- Column-level documentation is absent; field semantics must be inferred after download.
- Row count is not reported, which makes suitability harder to assess before download.
- Description metadata is limited; actual data quality requires manual inspection after download.
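Because field semantics and row count have to be established after download, a first inspection pass can be sketched as below. This is a minimal illustration, not part of the dataset's tooling: the function `summarize_fields`, the variable names, and the sample records (including the field names `task_id`, `prompt`, and `score`) are all hypothetical, and it assumes the downloaded records have already been loaded as a list of dictionaries.

```python
from collections import defaultdict


def summarize_fields(records):
    """Map each field name to the value types seen and the number of records containing it."""
    seen_types = defaultdict(set)
    counts = defaultdict(int)
    for record in records:
        for key, value in record.items():
            seen_types[key].add(type(value).__name__)
            counts[key] += 1
    return {
        key: {"types": sorted(seen_types[key]), "present_in": counts[key]}
        for key in seen_types
    }


# Hypothetical records standing in for the downloaded data;
# the real field names are undocumented and may differ.
sample_records = [
    {"task_id": "t-001", "prompt": "example", "score": 1.0},
    {"task_id": "t-002", "prompt": "example", "score": None},
]

summary = summarize_fields(sample_records)
row_count = len(sample_records)

for field, info in summary.items():
    print(field, info["types"], f"{info['present_in']}/{row_count}")
```

A field that shows more than one type, or that appears in fewer records than the row count, is a good candidate for manual inspection.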
Provenance
- Source: Built from public workflow-demand signals.
- Collection Method: Signal-to-task pipeline designed to be rerun.
- Freshness: Last updated 2026-05-07 07:32:27.