Name: WeaveBench: A Long-Horizon Benchmark for Computer-Use Agents
Creator: wanlilll
Published: 2026-06-03T03:02:42
Keywords: Computer Use, Agent Benchmark, Benchmark, Multimodal Evaluation, Long Horizon, Multimodal

Description

WeaveBench is a real-world benchmark for evaluating computer-use agents across hybrid interfaces. The dataset, created by wanlilll and last updated on June 5, 2026, assesses an agent's ability to orchestrate visual desktop control, command-line execution, code editing, browsers, and external tools within a single long-horizon workflow. The associated paper reports that the best observed pairing, Claude Opus 4.7 with Claude Code, achieves a PassRate of 41.2%.

Use Cases

Benchmarking agent performance on long-horizon workflows that span GUI, CLI, and code interfaces.
Training agents for real-world computer-use tasks that combine visual desktop control with external tools.
Evaluating how well a single agent integrates code editing and command-line execution.
Comparing AI models on complex, multimodal interaction tasks using the reported pass rates.

Strengths

Focuses on long-horizon, real-world agent evaluation, a complex and relevant task.
Benchmarks hybrid interaction across multiple interfaces: GUI, CLI, code, browsers, and external tools.
Provides a concrete performance metric, with a reported best pass rate of 41.2% for a specific model pairing.
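
The source does not define how PassRate is computed. As a minimal sketch, assuming it is simply the percentage of tasks an agent completes successfully, a figure like the one above could be derived from per-task outcomes as follows. The function name `pass_rate` and the boolean outcome format are illustrative assumptions, not the benchmark's actual scoring code.

```python
from typing import Iterable


def pass_rate(outcomes: Iterable[bool]) -> float:
    """Return the percentage of tasks marked as passed.

    ASSUMPTION: PassRate = passed tasks / total tasks * 100.
    The benchmark's real definition may differ (for example, it could
    use partial credit or per-category weighting).
    """
    results = list(outcomes)
    if not results:
        raise ValueError("at least one task outcome is required")
    passed = sum(1 for ok in results if ok)
    return 100.0 * passed / len(results)
```

For example, `pass_rate([True, False, True, True])` gives 75.0 for four hypothetical tasks with three passes. Under this reading, comparing two model pairings is only meaningful when both were scored on the same task set.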

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
Row count and dataset size are unknown, which may limit suitability assessment.
Description metadata is limited; actual data quality and structure require manual inspection after download.
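
Since the row count and field semantics are undocumented, a first manual inspection might look like the sketch below. It assumes the data has already been downloaded and can be read as a sequence of dictionaries, for instance from a JSON Lines file; the actual file format of WeaveBench is not stated in the source. The helper names `summarize_records` and `read_jsonl` are illustrative.

```python
from __future__ import annotations

import json
from collections import Counter
from typing import Any, Iterable, Iterator


def summarize_records(records: Iterable[dict[str, Any]]) -> dict[str, Any]:
    """Count rows and tally which fields appear, with the value types seen."""
    row_count = 0
    field_counts: Counter[str] = Counter()
    field_types: dict[str, set[str]] = {}
    for record in records:
        row_count += 1
        for name, value in record.items():
            field_counts[name] += 1
            field_types.setdefault(name, set()).add(type(value).__name__)
    return {
        "row_count": row_count,
        # How many rows contain each field (reveals optional fields).
        "field_counts": dict(field_counts),
        # Sorted Python type names observed for each field.
        "field_types": {name: sorted(types) for name, types in field_types.items()},
    }


def read_jsonl(path: str) -> Iterator[dict[str, Any]]:
    """Yield one dict per non-empty line. ASSUMPTION: the file is JSON Lines."""
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                yield json.loads(line)
```

Calling `summarize_records(read_jsonl(path))` on a local copy would report the row count and the set of field names. A field that appears in fewer rows than `row_count` is optional and deserves a closer look before it is relied on.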

Provenance

Source: wanlilll via Hugging Face Datasets.
Collection Method: Likely constructed for benchmarking purposes, as described in the associated paper.
Time Range: Not specified
Freshness: Last updated 2026-06-05 08:41:37; freshness should be verified.
Geography: Not specified

License is unknown; terms of use must be verified before application.
