Name: ToolMaze: Benchmark for LLM Agent Tool-Use Under Perturbations
Creator: dongsheng
Published: 2026-06-01T03:22:43
Keywords: Machine Learning, LLM Agents, Tool Use, Tools, Evaluation Benchmark, Benchmark, Tabular, Dynamic Replanning, Software, Anomaly Recovery

Description

ToolMaze is an evaluation framework for testing LLM agents on tool-use tasks across task complexity levels C1-C4 and perturbation modes P0-P4. The framework runs an agent in a sandboxed tool runtime, injects perturbations into tool calls, and scores results with a complexity-aware judge. It originates from the 2026 paper "When Tools Fail: Benchmarking Dynamic Replanning and Anomaly Recovery in LLM Agents" and was uploaded by author dongsheng.
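To make the idea of injecting a perturbation into a tool call concrete, here is a minimal sketch. It is not ToolMaze's API: the names `with_failure`, `InjectedToolError`, `lookup_price`, and `agent_with_retry` are hypothetical, and a single forced failure stands in for whatever the P0-P4 modes actually do, which this listing does not document.

```python
from typing import Any, Callable, Optional, Set

Tool = Callable[..., Any]


class InjectedToolError(RuntimeError):
    """Raised by the wrapper to simulate a failing tool call (hypothetical)."""


def with_failure(tool: Tool, fail_on_calls: Set[int]) -> Tool:
    """Wrap `tool` so that the listed 0-based call indices raise an error."""
    state = {"calls": 0}

    def wrapped(*args: Any, **kwargs: Any) -> Any:
        index = state["calls"]
        state["calls"] += 1
        if index in fail_on_calls:
            raise InjectedToolError(f"injected failure on call {index}")
        return tool(*args, **kwargs)

    return wrapped


def lookup_price(item: str) -> float:
    """Toy tool used only for this illustration."""
    prices = {"apple": 0.5, "pear": 0.75}
    return prices[item]


def agent_with_retry(tool: Tool, item: str, max_attempts: int = 2) -> Optional[float]:
    """Stand-in for an agent: retry the tool call after an injected failure."""
    for _ in range(max_attempts):
        try:
            return tool(item)
        except InjectedToolError:
            continue  # a real agent would replan here instead of blindly retrying
    return None


flaky_tool = with_failure(lookup_price, {0})
recovered = agent_with_retry(flaky_tool, "apple")
```

With the first call forced to fail, the stand-in agent recovers on its second attempt; if every attempt is perturbed it returns `None`, which a judge could count as an unrecovered anomaly.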

Use Cases

Benchmarking LLM agent robustness across the C1-C4 task complexity levels and P0-P4 perturbation modes.
Evaluating dynamic replanning strategies by injecting perturbations into an agent's tool calls.
Scoring agent performance on tool-use tasks with the complexity-aware judge.
Studying anomaly recovery in LLM agents inside the sandboxed tool runtime.
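A robustness benchmark over C1-C4 and P0-P4 amounts to filling in a 4 x 5 grid of results. The sketch below enumerates that grid and aggregates a success rate per cell. The result tuples `(complexity, perturbation, success)` and the function names are assumptions made for illustration; the actual output schema of ToolMaze is not documented in this listing.

```python
from collections import defaultdict
from itertools import product
from typing import Dict, Iterable, List, Tuple

COMPLEXITIES = ["C1", "C2", "C3", "C4"]
PERTURBATIONS = ["P0", "P1", "P2", "P3", "P4"]

Cell = Tuple[str, str]


def evaluation_grid() -> List[Cell]:
    """Every (complexity, perturbation) pair an evaluation run would cover."""
    return list(product(COMPLEXITIES, PERTURBATIONS))


def success_rate_by_cell(results: Iterable[Tuple[str, str, bool]]) -> Dict[Cell, float]:
    """Aggregate hypothetical per-episode outcomes into a success rate per cell."""
    totals = defaultdict(lambda: [0, 0])  # cell -> [successes, episodes]
    for complexity, perturbation, success in results:
        cell = totals[(complexity, perturbation)]
        cell[0] += int(success)
        cell[1] += 1
    return {key: passed / count for key, (passed, count) in totals.items()}


grid = evaluation_grid()
rates = success_rate_by_cell([
    ("C1", "P0", True),
    ("C1", "P0", False),
    ("C4", "P4", True),
])
```

Comparing a cell's rate against the same complexity level under a different perturbation mode is one simple way to read off how much a given perturbation costs the agent.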

Strengths

Provides a structured evaluation framework with defined task complexities (C1-C4) and perturbation modes (P0-P4).
Includes a sandboxed tool runtime and a complexity-aware judge for scoring, as described in the source paper.

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
Row count and dataset scale are unknown, which makes suitability hard to assess before download.
Description metadata is limited; actual data quality requires manual inspection after download.
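Since field semantics and row count have to be established after download, a first inspection pass might look like the sketch below. It assumes the data has already been loaded into an iterable of dictionaries; the file format is not stated in this listing, so loading is deliberately left out, and `profile_records` along with the sample field names is hypothetical.

```python
from collections import Counter, defaultdict
from typing import Any, Dict, Iterable, Mapping


def profile_records(records: Iterable[Mapping[str, Any]]) -> Dict[str, Any]:
    """Count rows and, per field, how often it appears and with which value types."""
    row_count = 0
    present: Counter = Counter()
    types: Dict[str, Counter] = defaultdict(Counter)
    for record in records:
        row_count += 1
        for field, value in record.items():
            present[field] += 1
            types[field][type(value).__name__] += 1
    return {
        "rows": row_count,
        "fields": {
            field: {"present": present[field], "types": dict(types[field])}
            for field in present
        },
    }


# Made-up records, only to show the shape of the report.
sample = [
    {"task_id": "t1", "score": 1.0},
    {"task_id": "t2"},
]
profile = profile_records(sample)
```

Fields that appear in only some rows, or that mix value types, are the ones most worth checking by hand before relying on them.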

Provenance

Source: dongsheng on Hugging Face, associated with the paper "When Tools Fail: Benchmarking Dynamic Replanning and Anomaly Recovery in LLM Agents".
Collection Method: Not documented; the paper's description suggests synthetic or curated tasks and perturbations for evaluating LLM agents.
Freshness: Last updated 2026-06-05 07:18:08; freshness should be verified.

License is unknown; terms of use must be verified before application.
