Name: OpenClaw Agent Trajectory Safety Benchmark
Creator: AI45Research
Published: 2026-03-24T07:16:41
Keywords: Benchmark, Agent Trajectory, Safety Benchmark, Tabular, AI Agent Safety, Executable Agents

Description

ATBench-Claw is a benchmark for evaluating the safety of executable AI agent trajectories. It focuses on the critical decision points that precede actions such as file deletion or code execution. Created by AI45Research, the dataset extends ATBench and accompanies the AgentDoG diagnostic framework. It was last updated in March 2026.

Use Cases

Train models to classify unsafe agent trajectories using features like action_type and decision_point.
Evaluate the performance of safety guardrails on trajectory sequences containing potential hazards.
Analyze the correlation between specific agent intents and subsequent risky actions like file_deletion or message_sending.
Benchmark different agent architectures on safety metrics derived from trajectory-level annotations.
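
As a rough illustration of the guardrail-evaluation use case, the sketch below scores a simple keyword guardrail against labeled trajectories. Everything in it is an assumption made for illustration: the Trajectory record, the unsafe label, the set of risky action names, and the four toy rows are hypothetical and are not taken from the dataset, whose schema is not documented in this listing.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    # Hypothetical record layout; the real column schema is not documented here.
    actions: list[str]  # ordered action types taken by the agent
    unsafe: bool        # assumed trajectory-level ground-truth label


# Assumed set of action types that a naive guardrail treats as hazardous.
RISKY_ACTIONS = {"file_deletion", "code_execution", "message_sending"}


def keyword_guardrail(traj: Trajectory) -> bool:
    """Flag a trajectory if any of its actions is in the risky set."""
    return any(action in RISKY_ACTIONS for action in traj.actions)


def evaluate(trajectories, guardrail):
    """Compare guardrail flags with labels and return confusion counts and rates."""
    tp = fp = fn = tn = 0
    for traj in trajectories:
        flagged = guardrail(traj)
        if flagged and traj.unsafe:
            tp += 1
        elif flagged and not traj.unsafe:
            fp += 1
        elif not flagged and traj.unsafe:
            fn += 1
        else:
            tn += 1
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return {"tp": tp, "fp": fp, "fn": fn, "tn": tn,
            "precision": precision, "recall": recall}


# Toy rows invented for this example; they are not benchmark data.
sample = [
    Trajectory(["read_file", "file_deletion"], unsafe=True),
    Trajectory(["read_file", "summarize"], unsafe=False),
    Trajectory(["code_execution"], unsafe=False),
    Trajectory(["credential_lookup"], unsafe=True),
]

metrics = evaluate(sample, keyword_guardrail)
print(metrics)
```

A real evaluation would replace the toy rows with records loaded from the dataset and map its actual fields onto this structure once the schema is known.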

Strengths

Designed specifically for executable agent settings, giving it a focused scope for safety evaluation.
Extends the established ATBench framework, suggesting methodological continuity.
Companion to the published AgentDoG diagnostic framework, indicating practical application.

Limitations

Unknown row count and dataset size prevent assessment of statistical power.
Specific column schema and data features are not publicly documented.
Temporal coverage and update frequency beyond the last metadata change are unclear.

Provenance

Source: AI45Research via Hugging Face.
Collection Method: Designed as a benchmark companion to the AgentDoG framework; the specific data collection method is not specified.
Freshness: Last metadata update was March 24, 2026.

The full description and data details are hosted on the Hugging Face dataset page; consult that page for complete documentation and access.
