DTap-Bench is a dataset of agent evaluation tasks created by DecodingTrust-Agent. It covers 14 domains, including browser, code, CRM, and medical. Each domain has one benign task split and two red-teaming splits (direct and indirect).
Use Cases
- Benchmarking agent performance on tasks across 14 domains.
- Red-teaming agent vulnerabilities using the direct and indirect adversarial splits.
- Training or testing specialized agents for specific domains like finance or legal.
- Comparing agent robustness between benign and adversarial scenarios.
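The benign-versus-adversarial comparison above can be sketched as a per-domain success-rate gap. This is a minimal illustration, not part of the dataset: the result records, the field names `domain`, `split`, and `success`, and the split labels `benign`, `direct`, and `indirect` are all assumptions, and `success` is taken here to mean that the agent completed the task safely.

```python
from collections import defaultdict


def success_rates(results):
    """Return {(domain, split): success rate} for an iterable of result dicts."""
    totals = defaultdict(int)
    passed = defaultdict(int)
    for r in results:
        key = (r["domain"], r["split"])
        totals[key] += 1
        if r["success"]:
            passed[key] += 1
    return {key: passed[key] / totals[key] for key in totals}


def robustness_gap(rates, domain, adversarial_split):
    """Benign success rate minus the success rate on one adversarial split."""
    return rates[(domain, "benign")] - rates[(domain, adversarial_split)]


# Hypothetical evaluation results, for illustration only.
example_results = [
    {"domain": "crm", "split": "benign", "success": True},
    {"domain": "crm", "split": "benign", "success": True},
    {"domain": "crm", "split": "direct", "success": True},
    {"domain": "crm", "split": "direct", "success": False},
    {"domain": "crm", "split": "indirect", "success": False},
    {"domain": "crm", "split": "indirect", "success": False},
]

rates = success_rates(example_results)
for split in ("direct", "indirect"):
    print(split, robustness_gap(rates, "crm", split))
```

A larger gap suggests the agent degrades more under that attack type. The real split and field names should be checked against the downloaded data before reusing this.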
Strengths
- Tasks span 14 distinct domains, providing broad coverage.
- Includes both benign and adversarial (red-teaming) splits for each domain.
- Each record represents a single task case, likely enabling granular evaluation.
Limitations
- Column-level documentation is absent; field semantics must be inferred after download.
- Row count is unknown, so the dataset's size and suitability are hard to assess before download.
- Description metadata is sparse, so data quality must be checked by manual inspection after download.
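Because the columns are undocumented and the row count is unknown, a first pass after download might simply profile the records. The sketch below assumes the records have already been loaded as a list of Python dicts. The helper `profile_records` and the sample fields (`task_id`, `instruction`, `tools`) are hypothetical and do not reflect the dataset's actual schema.

```python
from collections import defaultdict


def profile_records(records):
    """Count rows and record, per field, how often it appears and which types it holds."""
    row_count = 0
    fields = defaultdict(lambda: {"present": 0, "types": set()})
    for record in records:
        row_count += 1
        for name, value in record.items():
            fields[name]["present"] += 1
            fields[name]["types"].add(type(value).__name__)
    return row_count, dict(fields)


# Hypothetical records, for illustration only; the real field names are unknown.
sample = [
    {"task_id": "t1", "instruction": "Open the settings page", "tools": ["browser"]},
    {"task_id": "t2", "instruction": "Summarize the ticket"},
]

row_count, fields = profile_records(sample)
print("rows:", row_count)
for name, info in sorted(fields.items()):
    print(name, info["present"], sorted(info["types"]))
```

Fields that appear in only some rows, or that hold more than one type, are the ones most worth inspecting by hand.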
Provenance
- Source: DecodingTrust-Agent
- Freshness: Last updated 2026-05-07 08:36:15; freshness should be verified.