The benchmark consists of 200 diagnostic instances for Claw-style agents. Each instance contains a user instruction, mock workspace resources, and a task-specific verifier. Tasks were selected through a difficulty-aware filtering process. The benchmark was created by RUC-AIBOX and last updated on May 15, 2026.
License restrictions are unknown and should be verified before use.
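As a rough illustration of the instance structure described above (instruction, mock workspace, verifier), the following is a minimal sketch. The class name `DiagnosticInstance`, its field names, the `run_instance` helper, and the toy task are all hypothetical and are not taken from the dataset; the actual on-disk format may differ and should be checked against the dataset files.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Hypothetical in-memory view of one benchmark instance.
# Field names are assumptions, not the dataset's real schema.
Workspace = Dict[str, str]  # file name -> file contents (assumed shape)


@dataclass
class DiagnosticInstance:
    instruction: str                        # the user instruction given to the agent
    workspace: Workspace                    # mock workspace resources
    verifier: Callable[[Workspace], bool]   # task-specific check on the final workspace


def run_instance(instance: DiagnosticInstance,
                 agent: Callable[[str, Workspace], None]) -> bool:
    """Run an agent on a copy of the mock workspace, then apply the verifier."""
    ws = dict(instance.workspace)  # copy so the original instance stays untouched
    agent(instance.instruction, ws)
    return bool(instance.verifier(ws))


# Toy instance, invented purely for illustration.
example = DiagnosticInstance(
    instruction="Rename notes.txt to archive.txt",
    workspace={"notes.txt": "hello"},
    verifier=lambda ws: "archive.txt" in ws and "notes.txt" not in ws,
)


def toy_agent(instruction: str, ws: Workspace) -> None:
    # Stand-in for a real agent: performs the rename directly.
    ws["archive.txt"] = ws.pop("notes.txt")


passed = run_instance(example, toy_agent)
```

Keeping the verifier as a per-instance callable mirrors the "task-specific verifier" wording: each task carries its own success check rather than relying on one shared metric.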