Description
Fourteen model/policy combinations were evaluated across 10 environments, and the full agent trajectory of each evaluation was recorded. Each row contains one complete evaluation trajectory with every tool call, argument, output, error, and reasoning step. The dataset was created by eval-aware and last updated on March 28, 2026.
Use Cases
Analyzing agent failure modes based on detailed error and output logs.
Benchmarking model performance across different Linux environments.
Training or fine-tuning agents using demonstrated reasoning and action sequences.
Studying the relationship between agent policies and tool-call patterns.
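As a rough illustration of the failure-mode and policy analyses listed above, the sketch below counts actions that carry an error for each model/policy combination. The column layout is undocumented, so every field name here (`model`, `policy`, `actions`, `error`) is a hypothetical placeholder, as is the helper `error_counts`; replace them with the real names once the downloaded rows have been inspected.

```python
from collections import Counter


def error_counts(rows, combo_keys=("model", "policy"),
                 actions_key="actions", error_key="error"):
    """Count actions with a non-empty error, grouped by model/policy combination.

    All key names are assumptions; the dataset's real schema is not documented.
    """
    counts = Counter()
    for row in rows:
        # Identify the combination this trajectory belongs to.
        combo = tuple(row.get(key) for key in combo_keys)
        # A missing or null action list is treated as empty.
        for action in row.get(actions_key) or []:
            if action.get(error_key):
                counts[combo] += 1
    return counts


# Tiny made-up sample, only to show the expected input shape.
sample_rows = [
    {"model": "model-a", "policy": "default",
     "actions": [{"tool": "bash", "error": None},
                 {"tool": "bash", "error": "exit status 1"}]},
    {"model": "model-b", "policy": "default",
     "actions": [{"tool": "bash", "error": "timeout"},
                 {"tool": "edit", "error": "file not found"}]},
]

for combo, count in error_counts(sample_rows).most_common():
    print(combo, count)
```

Passing the key names as parameters keeps the function usable once the real column names are known, without editing its body.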
Strengths
Contains trajectories from 14 distinct model/policy combinations.
Evaluations were conducted across 10 different environments.
Provides two levels of granularity per action.
Limitations
Column-level documentation is absent; field semantics must be inferred after download.
Row count is unknown, which makes it harder to assess suitability before download.
Description metadata is limited; actual data quality requires manual inspection after download.
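Because field semantics and row count have to be inferred after download, a first pass over the rows might look like the sketch below. It assumes the rows have already been loaded as a list of dictionaries (the loading step is omitted, since the dataset identifier is not given here), and the helper name `summarize_fields` and the sample field names are illustrative only.

```python
from collections import defaultdict


def summarize_fields(rows):
    """Summarize each field: observed value types, non-null count, first example."""
    summary = defaultdict(lambda: {"types": set(), "non_null": 0, "example": None})
    for row in rows:
        for key, value in row.items():
            info = summary[key]
            info["types"].add(type(value).__name__)
            if value is not None:
                # Keep the first non-null value as a representative example.
                if info["non_null"] == 0:
                    info["example"] = value
                info["non_null"] += 1
    # Convert the type sets to sorted lists so the result is stable and printable.
    return {
        key: {"types": sorted(info["types"]),
              "non_null": info["non_null"],
              "example": info["example"]}
        for key, info in summary.items()
    }


# Made-up rows standing in for the downloaded data.
rows = [
    {"model": "model-a", "error": None},
    {"model": "model-b", "error": "timeout"},
]

print("row count:", len(rows))
for field, info in summarize_fields(rows).items():
    print(field, info)
```

A summary like this does not replace column documentation, but it gives a quick view of which fields are sparsely populated or hold mixed types before any analysis is built on them.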
Provenance
Source
Hugging Face
Collection Method
Collected from LinuxBench/LinuxArena evaluations.
Freshness
Last updated 2026-03-28 14:17:55; freshness should be verified.
License is unknown; users should verify permissions before use.