Name: LiveClawBench: Benchmark Trajectories for LLM Agents on Real-World Assistant Tasks
Creator: Mosi-AI
Published: 2026-04-08T07:00:37
Keywords: LLM Agents, AI Benchmark, Benchmark, Tabular, Real-World Tasks, Complexity Framework, Agent Trajectories

Description

LiveClawBench is a pilot benchmark of 30 manually created tasks for evaluating LLM agents, published by Mosi-AI and last updated in April 2026. It focuses on complex, real-world assistant tasks such as booking flights and debugging code, and it introduces a Triple-Axis Complexity Framework to address gaps in existing benchmarks.

Use Cases

Benchmarking agent performance on real-world assistant tasks such as flight booking
Evaluating agent robustness along the difficulty axes of the Triple-Axis Complexity Framework
Training or fine-tuning agents for knowledge base curation, per the described task scope
Analyzing agent failure modes in complex, integrated scenarios, in line with the benchmark's design goal

Strengths

Introduces a Triple-Axis Complexity Framework for more holistic evaluation
Focuses on 30 manually created, complex real-world tasks
Last updated on 2026-04-08, suggesting recent maintenance

Limitations

Row count and dataset size are not reported, which makes suitability hard to assess before download
Column-level documentation is absent; field semantics must be inferred after download
Description metadata is limited; actual data quality requires manual inspection after download
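
Because column-level documentation is absent, a first inspection pass after download might look like the sketch below. It assumes the records have already been loaded as a list of Python dicts; the helper name `summarize_fields` and the sample field names (`task_id`, `steps`, `success`) are hypothetical and are not taken from the dataset.

```python
from collections import Counter, defaultdict


def summarize_fields(records):
    """Map each field name to a count of the Python types seen in that field."""
    summary = defaultdict(Counter)
    for record in records:
        for field, value in record.items():
            summary[field][type(value).__name__] += 1
    # Convert to plain dicts so the result is easy to print or compare.
    return {field: dict(counts) for field, counts in summary.items()}


# Hypothetical rows for illustration only; the real column names are undocumented.
sample = [
    {"task_id": "t1", "steps": 12, "success": True},
    {"task_id": "t2", "steps": None, "success": False},
]

for field, type_counts in summarize_fields(sample).items():
    print(field, type_counts)
```

Fields that show mixed types or frequent `NoneType` values are good candidates for closer manual inspection before the data is used for evaluation.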

Provenance

Source: Mosi-AI via Hugging Face
Collection Method: Manually created benchmark tasks
Time Range: Creation and update dates fall in 2026
Freshness: Last updated 2026-04-08 11:10:10
Geography: Not specified

License is unknown; terms of use must be verified before application.
