Microsoft's AgentRx benchmark, updated in February 2026, contains fewer than 1,000 annotated records of failed multi-agent LLM trajectories. Records carry step-level failure categories and designated root-cause attributions, across domains such as retail.
The data requires JSON processing; refer to arXiv paper 2602.02475 for methodology details.
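
As a rough illustration of the JSON processing involved, the sketch below tallies which failure category appears at the root-cause step of each trajectory. The record layout is an assumption made for illustration only: the field names `trajectory_id`, `domain`, `root_cause_step`, `steps`, `index`, and `failure_category`, the category labels, and the JSON Lines packaging are all hypothetical and not taken from the dataset, so they should be checked against the actual files and the paper before use.

```python
import json
from collections import Counter

# Hypothetical sample in JSON Lines form; the real AgentRx schema,
# field names, and category labels may differ.
SAMPLE_JSONL = """\
{"trajectory_id": "t1", "domain": "retail", "root_cause_step": 2, "steps": [{"index": 1, "failure_category": null}, {"index": 2, "failure_category": "wrong_tool_call"}, {"index": 3, "failure_category": "cascading_error"}]}
{"trajectory_id": "t2", "domain": "retail", "root_cause_step": 1, "steps": [{"index": 1, "failure_category": "misread_instruction"}, {"index": 2, "failure_category": "cascading_error"}]}
"""


def load_records(text):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def root_cause_categories(records):
    """Count the failure category found at each record's root-cause step."""
    counts = Counter()
    for record in records:
        root = record["root_cause_step"]
        for step in record["steps"]:
            # Only the step flagged as the root cause contributes to the tally.
            if step["index"] == root and step["failure_category"] is not None:
                counts[step["failure_category"]] += 1
    return counts


records = load_records(SAMPLE_JSONL)
root_counts = root_cause_categories(records)
print(dict(root_counts))
```

Separating the root-cause step from later steps keeps downstream, cascading failures out of the tally, which is the distinction the root-cause attribution is meant to capture.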