Name: WildClawBench Agent Evaluation with 60 Real-World Tasks
Creator: internlm
Published: 2026-03-24T13:35:21
Keywords: task_categories:image-text-to-text, language:zh, task_categories:question-answering, language:en, task_categories:visual-question-answering, agents, size_categories:n<1K, openclaw, evaluation, benchmark, region:us, license:mit, multimodal

Description

WildClawBench is a benchmark of 60 original tasks for evaluating AI agents in a live OpenClaw environment. It tests agents on end-to-end, practical work such as clipping football highlights and negotiating meeting times. The benchmark is multimodal, covers English and Chinese, and was created by internlm.

Use Cases

Evaluate agent performance on the 60 original tasks, including multimodal activities like clipping goal highlights from football matches.
Benchmark AI agents on multi-round negotiation tasks, such as coordinating meeting times, within the OpenClaw environment.
Assess agent capabilities on visual question answering and image-to-text tasks specified in the benchmark tags.
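
As a rough illustration of how results from the use cases above might be summarized, the sketch below computes a per-category success rate. The record fields (task_id, category, passed) and the category names are hypothetical; this card does not describe the benchmark's actual output schema or scoring method, so treat this as one possible aggregation, not the benchmark's own.

```python
from collections import defaultdict


def success_rate_by_category(results):
    """Return {category: fraction of tasks passed} for a list of result dicts.

    Assumes each record has a "category" string and a boolean "passed" field.
    These field names are illustrative, not taken from WildClawBench.
    """
    totals = defaultdict(int)
    passed = defaultdict(int)
    for record in results:
        category = record["category"]
        totals[category] += 1
        if record["passed"]:
            passed[category] += 1
    return {category: passed[category] / totals[category] for category in totals}


# Made-up records purely for demonstration; not real benchmark results.
example_results = [
    {"task_id": "t01", "category": "video-editing", "passed": True},
    {"task_id": "t02", "category": "video-editing", "passed": False},
    {"task_id": "t03", "category": "negotiation", "passed": True},
]
rates = success_rate_by_category(example_results)
print(rates)
```

Reporting rates per category rather than a single overall score keeps differences between task types (for example, video clipping versus negotiation) visible.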

Strengths

Contains 60 original, practical tasks designed for end-to-end agent evaluation.
The benchmark is multimodal, spanning visual question answering and text-based task categories.
Tasks are executed in a live OpenClaw environment, mirroring real-user conditions.

Limitations

The exact row count and column structure are unknown (the size tag indicates only that there are fewer than 1,000 rows), limiting reproducibility and detailed analysis.
Task descriptions and sample data are not included here, so users must visit the external dataset page for full details.
Geographic and temporal coverage for the tasks is unspecified, potentially limiting generalizability.

Provenance

Source: Hugging Face dataset by internlm.
Collection Method: Benchmark tasks created for evaluation within the OpenClaw personal AI assistant environment.
Freshness: Last updated on 2026-03-25.
Geography: Region tag indicates 'us', but specific spatial coverage for tasks is unknown.

Full task descriptions and data are not included here; users must visit the Hugging Face dataset page. The tags indicate an MIT license, but the main description does not confirm it.
