Loading...
Loading...
Available on 1 platform
Sign in to view source links and access this dataset
ToolMaze is an evaluation framework for testing LLM agents on tool-use tasks under various perturbation modes. The framework runs an agent in a sandboxed tool runtime, injects perturbations into tool calls, and scores results with a complexity-aware judge. It originates from the 2026 paper "When Tools Fail: Benchmarking Dynamic Replanning and Anomaly Recovery in LLM Agents" and was uploaded by author dongsheng.
License is unknown; terms of use must be verified before application.