This benchmark for evaluating OpenClaw agents comprises 100 tasks across 8 core domains. It was originally built as an internal benchmark during the development of Qwen3.6-Plus, then optimized and open-sourced by skylenage-ai. The dataset was last updated on April 10, 2026.
Use Cases
- Benchmarking agent performance on the 100 tasks across 8 domains
- Evaluating agent robustness at scale, drawing on the real-user-distribution design
- Testing agent capabilities in the isolated simulated workspace provided for each task
- Comparing agent architectures on the same standardized tasks
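As an illustration of the comparison use case, the sketch below aggregates per-task outcomes into a pass rate per domain. It is only a sketch under stated assumptions: the `domain` and `passed` keys and the domain names are hypothetical, since the dataset's actual fields and scoring scheme are not documented here.

```python
from collections import defaultdict


def pass_rate_by_domain(results):
    """Return {domain: fraction of tasks passed}.

    `results` is an iterable of dicts with the hypothetical keys
    "domain" (str) and "passed" (bool); neither name is confirmed
    by the dataset documentation.
    """
    totals = defaultdict(int)
    passed = defaultdict(int)
    for result in results:
        domain = result["domain"]
        totals[domain] += 1
        passed[domain] += int(bool(result["passed"]))
    return {domain: passed[domain] / totals[domain] for domain in totals}


# Made-up outcomes for one agent; real runs would cover all 100 tasks.
example_results = [
    {"domain": "domain_a", "passed": True},
    {"domain": "domain_a", "passed": False},
    {"domain": "domain_b", "passed": True},
]
rates = pass_rate_by_domain(example_results)
```

Running the same aggregation for each agent on the same task set gives per-domain figures that can be compared side by side.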
Strengths
- Contains 100 tasks for evaluation
- Covers 8 core domains for broad assessment
- Designed for robust evaluation at scale
- Features isolated simulated workspaces per task
Limitations
- Column-level documentation is absent; field semantics must be inferred after download
- Row count is not reported in the metadata, so how the 100 tasks map to rows must be checked after download
- Description metadata is limited; actual data quality requires manual inspection after download
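Because field semantics have to be inferred after download, a first inspection pass might look like the sketch below. It assumes the data has been downloaded locally as a JSON Lines file; the file format and the path `tasks.jsonl` are assumptions, not something the dataset documentation confirms, and the field names in the sample are made up.

```python
import json
from collections import defaultdict


def load_jsonl(path):
    """Read one JSON object per non-empty line (assumed file layout)."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


def summarize_fields(records):
    """Return {field: {"count": non-null values, "types": type names}}."""
    summary = defaultdict(lambda: {"count": 0, "types": set()})
    for record in records:
        for key, value in record.items():
            entry = summary[key]  # register the field even if all values are null
            if value is not None:
                entry["count"] += 1
                entry["types"].add(type(value).__name__)
    return {
        key: {"count": entry["count"], "types": sorted(entry["types"])}
        for key, entry in summary.items()
    }


# Made-up records standing in for the downloaded rows.
sample = [
    {"task_id": 1, "domain": "domain_a"},
    {"task_id": 2, "domain": None},
]
field_summary = summarize_fields(sample)
```

On the real data, `summarize_fields(load_jsonl("tasks.jsonl"))` would give the row count per field and the value types seen, which is a starting point for inferring what each column means.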
Provenance
- Source: skylenage-ai
- Collection Method: Originally built as an internal benchmark during the development of Qwen3.6-Plus, then optimized and open-sourced.
- Time Range: not specified
- Freshness: Last updated 2026-04-10 04:16:45; freshness should be verified
- Geography: not specified