Description

An adaptive benchmark for evaluating LLM tool-use agents on airline customer service tasks. It was generated with EnvScaler by VibrantLabs and last updated in April 2026. Each task requires an agent to transform an initial database state into a target final state by executing a sequence of tool calls such as flight searches, bookings, and cancellations.
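
The state-transformation framing can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: the dataset does not document its schema, so the tool names (`book_flight`, `cancel_booking`), the state layout, and the `run_task` helper are hypothetical and are not the benchmark's actual API.

```python
from copy import deepcopy

# Hypothetical tools: the real tool names and argument schemas are not documented.
def book_flight(db, user_id, flight_id):
    """Reserve one seat on a flight and record the reservation."""
    flight = db["flights"][flight_id]
    if flight["seats"] <= 0:
        raise ValueError("no seats left")
    flight["seats"] -= 1
    db["reservations"].append({"user_id": user_id, "flight_id": flight_id})

def cancel_booking(db, user_id, flight_id):
    """Remove a reservation and release its seat."""
    booking = {"user_id": user_id, "flight_id": flight_id}
    db["reservations"].remove(booking)  # raises ValueError if no such booking
    db["flights"][flight_id]["seats"] += 1

TOOLS = {"book_flight": book_flight, "cancel_booking": cancel_booking}

def run_task(initial_state, tool_calls, expected_final_state):
    """Apply the agent's tool calls to a copy of the initial state and
    report whether the result matches the expected final state."""
    db = deepcopy(initial_state)  # leave the caller's state untouched
    for call in tool_calls:
        TOOLS[call["name"]](db, **call["arguments"])
    return db == expected_final_state
```

Under these assumptions, a task passes only when the sequence of calls reproduces the expected final database exactly, so a missing or extra call counts as a failure.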

Use Cases

Benchmarking LLM agent performance on tool-use tasks in an airline customer service scenario.
Evaluating how well agents handle tasks generated at different difficulty levels.
Training or fine-tuning LLMs for sequential decision-making on state transformation tasks.
Studying agent failure modes in multi-step workflows involving booking, cancellation, and update operations.

Strengths

Tasks are adaptively generated to target specific difficulty levels, measured against a calibration model.
Focuses on a concrete, complex domain (airline customer service) in which tasks require multiple tool calls.
Generated by VibrantLabs with their EnvScaler tool, which suggests a structured creation process.
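
How EnvScaler targets difficulty is not described here. As a rough illustration only, one common way to calibrate difficulty is to run a calibration model on each candidate task several times and keep the tasks whose failure rate falls near a target. The function names, the failure-rate definition, and the tolerance below are all assumptions, not the tool's documented method.

```python
def estimate_difficulty(outcomes):
    """Fraction of calibration-model attempts that failed.

    `outcomes` is a list of booleans (True = attempt succeeded),
    so 0.0 means always solved and 1.0 means never solved.
    """
    if not outcomes:
        raise ValueError("need at least one attempt")
    return 1 - sum(outcomes) / len(outcomes)

def select_tasks(attempts_by_task, target, tolerance=0.1):
    """Keep task ids whose estimated difficulty is within `tolerance` of `target`."""
    selected = []
    for task_id, outcomes in attempts_by_task.items():
        if abs(estimate_difficulty(outcomes) - target) <= tolerance:
            selected.append(task_id)
    return selected
```

In this sketch, difficulty is relative to the chosen calibration model, so scores for a much stronger or weaker agent would need to be interpreted with that in mind.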

Limitations

Description metadata is limited; actual data quality requires manual inspection after download.
Column-level documentation is absent; field semantics must be inferred after download.
Row count is unknown, which makes it harder to assess suitability before download.
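
Because the row count and column semantics are undocumented, a quick profiling pass after download can help with the manual inspection mentioned above. The sketch below assumes the rows have already been loaded into a list of dictionaries; the loading step and the `profile_records` helper are hypothetical, and the column names in any real output will differ.

```python
from collections import defaultdict

def profile_records(records):
    """Summarize a list of dict rows: total row count and, for each
    column, the sorted names of the Python types observed in it."""
    columns = defaultdict(set)
    for row in records:
        for key, value in row.items():
            columns[key].add(type(value).__name__)
    return {
        "row_count": len(records),
        "columns": {name: sorted(types) for name, types in columns.items()},
    }
```

Columns that report more than one type, or that appear in only some rows, are good candidates for closer manual review.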

Provenance

Source: vibrantlabsai
Collection Method: Generated using EnvScaler by VibrantLabs.
Freshness: Last updated 2026-04-21 18:10:42

License: Unknown, which may restrict usage.

Tags: Tabular, LLM Benchmark, Tool-Use Benchmark, Adaptive Generation, Airline Customer Service, Synthetic

Tau2 Infinity Dag: Adaptive Benchmark for LLM Tool-Use Agents
