Available on 1 platform
An adaptive benchmark for evaluating LLM tool-use agents on airline customer service tasks. It was generated using EnvScaler by VibrantLabs and last updated in April 2026. Each task requires an agent to transform an initial database state into a final state by executing a sequence of tool calls such as flight searches, bookings, and cancellations.
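As a rough illustration of the state-transformation framing, the sketch below replays a sequence of tool calls on a copy of an initial database and checks whether the result matches an expected final state. Everything in it is an assumption made for illustration: the database layout, the `book_flight` and `cancel_booking` tools, the `run_task` helper, and the exact-match scoring rule are hypothetical and are not taken from the dataset or from EnvScaler.

```python
import copy

# Hypothetical tools: each one mutates an in-memory "database" dict.
def book_flight(db, user_id, flight_id):
    """Reserve one seat on a flight and record the booking."""
    flight = db["flights"][flight_id]
    if flight["seats"] <= 0:
        raise ValueError("no seats left")
    flight["seats"] -= 1
    db["bookings"].append({"user": user_id, "flight": flight_id})

def cancel_booking(db, user_id, flight_id):
    """Remove a booking and release its seat."""
    db["bookings"].remove({"user": user_id, "flight": flight_id})
    db["flights"][flight_id]["seats"] += 1

TOOLS = {"book_flight": book_flight, "cancel_booking": cancel_booking}

def run_task(initial_state, tool_calls, expected_state):
    """Apply tool calls to a copy of the initial state, then compare
    the result with the expected final state (assumed exact match)."""
    db = copy.deepcopy(initial_state)
    for name, kwargs in tool_calls:
        TOOLS[name](db, **kwargs)
    return db == expected_state

# Illustrative task: move user u1 from flight F1 to flight F2.
initial = {
    "flights": {"F1": {"seats": 1}, "F2": {"seats": 2}},
    "bookings": [{"user": "u1", "flight": "F1"}],
}
expected = {
    "flights": {"F1": {"seats": 2}, "F2": {"seats": 1}},
    "bookings": [{"user": "u1", "flight": "F2"}],
}
calls = [
    ("cancel_booking", {"user_id": "u1", "flight_id": "F1"}),
    ("book_flight", {"user_id": "u1", "flight_id": "F2"}),
]
success = run_task(initial, calls, expected)
```

Comparing whole states rather than individual tool calls means any call sequence that reaches the expected final state would count as a success under this assumed scoring rule; the benchmark's real grading may differ.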
The license is unknown, which may restrict usage.