Description
WeaveBench is a real-world benchmark for evaluating computer-use agents across hybrid interfaces. The dataset, created by wanlilll and last updated on June 5, 2026, assesses an agent's ability to orchestrate visual desktop control, command-line execution, code editing, browsers, and external tools within a single long-horizon workflow. The associated paper reports that the best observed pairing, Claude Opus 4.7 + Claude Code, achieves a 41.2% PassRate.
Use Cases
Benchmarking agent performance on long-horizon workflows that span GUI, CLI, and code interfaces.
Training agents for real-world computer-use tasks that combine visual desktop control with external tools.
Evaluating how well a single agent integrates code editing and command-line execution.
Comparing AI models on complex, multi-modal interaction tasks using the reported pass rates.
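To make the model-comparison use case concrete, here is a minimal sketch of ranking agents by pass rate. It assumes PassRate is the percentage of tasks an agent completes successfully; the paper's exact definition should be checked. The record layout, the `pass_rates` helper, and the agent names are all hypothetical and are not taken from WeaveBench.

```python
from collections import defaultdict


def pass_rates(results):
    """Return {agent: percentage of tasks passed}.

    `results` is an iterable of (agent, task_id, passed) tuples.
    Assumption: PassRate = 100 * passed tasks / attempted tasks.
    """
    totals = defaultdict(int)
    passes = defaultdict(int)
    for agent, _task_id, passed in results:
        totals[agent] += 1
        passes[agent] += int(bool(passed))
    return {agent: 100.0 * passes[agent] / totals[agent] for agent in totals}


# Hypothetical per-task outcomes; these are illustrative, not benchmark data.
results = [
    ("agent_a", "t1", True),
    ("agent_a", "t2", False),
    ("agent_a", "t3", True),
    ("agent_a", "t4", False),
    ("agent_b", "t1", True),
    ("agent_b", "t2", False),
    ("agent_b", "t3", False),
    ("agent_b", "t4", False),
]

rates = pass_rates(results)

# Rank agents from highest to lowest pass rate.
ranking = sorted(rates.items(), key=lambda item: item[1], reverse=True)
for agent, rate in ranking:
    print(f"{agent}: {rate:.1f}%")
```

Comparisons are only meaningful when every agent is scored on the same task set, so it may be worth checking that the task IDs match across agents before ranking.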
Strengths
Focuses on long-horizon, real-world agent evaluation, a complex and relevant task.
Benchmarks hybrid interaction across multiple interfaces: GUI, CLI, code, browsers, and external tools.
Provides a concrete performance metric, with a reported best pass rate of 41.2% for a specific model pairing.
Limitations
Column-level documentation is absent; field semantics must be inferred after download.
Row count and dataset size are unknown, which may limit suitability assessment.
Description metadata is limited; actual data quality and structure require manual inspection after download.
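Because field semantics must be inferred after download, a first inspection pass might look like the sketch below. It assumes the downloaded rows have already been loaded as a list of Python dictionaries; the `summarize_fields` helper and the field names (`task_id`, `instruction`, `max_steps`) are hypothetical, and the actual WeaveBench schema is not documented here.

```python
from collections import defaultdict


def summarize_fields(rows):
    """Summarize which fields appear in `rows` and which value types they hold.

    Returns {field: {"types": sorted type names, "present_in": row count}}.
    """
    types = defaultdict(set)
    counts = defaultdict(int)
    for row in rows:
        for key, value in row.items():
            types[key].add(type(value).__name__)
            counts[key] += 1
    return {
        key: {"types": sorted(types[key]), "present_in": counts[key]}
        for key in types
    }


# Hypothetical rows standing in for the downloaded data.
rows = [
    {"task_id": "t1", "instruction": "open the report", "max_steps": 50},
    {"task_id": "t2", "instruction": "run the tests", "max_steps": None},
]

summary = summarize_fields(rows)
print(f"rows: {len(rows)}")
for field, info in summary.items():
    print(f"{field}: types={info['types']}, present in {info['present_in']} rows")
```

A field that shows more than one type, or that is missing from some rows, is a good candidate for manual review before the data is used for evaluation.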
Provenance
Source
wanlilll via Hugging Face Datasets.
Collection Method
Likely constructed for benchmarking purposes, as described in the associated paper.
Time Range
Not specified.
Freshness
Last updated 2026-06-05 08:41:37; freshness should be verified.
Geography
Not specified.
License
Unknown; terms of use must be verified before use.