Description
SRA-Bench is a benchmark dataset for evaluating skill-retrieval-augmented large language model agents, created by WeihangSu and last updated on April 22, 2026. It contains 5,400 test instances and a skill library of 26,262 skills, of which 636 are gold skills and 25,626 are web-collected distractors. Its sub-benchmarks include TheoremQA, which targets theorem application, and LogicBench, which targets logical reasoning.
Use Cases
Benchmarking agent performance on theorem application tasks with the TheoremQA sub-benchmark.
Evaluating logical reasoning patterns in agents with the LogicBench sub-benchmark.
Testing skill retrieval accuracy in a noisy environment, using the library that mixes gold skills with distractors.
Developing and comparing skill-retrieval augmentation methods for LLMs on the provided test instances.
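Retrieval accuracy against a library of gold skills and distractors can be scored in several ways. The sketch below uses recall at k, one common choice; it is not taken from the SRA-Bench code. The skill ids and the ranked list are hypothetical, since the dataset's actual field names are not documented here.

```python
from typing import Iterable, Sequence


def recall_at_k(ranked_ids: Sequence[str], gold_ids: Iterable[str], k: int) -> float:
    """Fraction of gold skills that appear among the top-k retrieved skills."""
    gold = set(gold_ids)
    if not gold:
        raise ValueError("gold_ids must not be empty")
    top_k = set(ranked_ids[:k])
    return len(gold & top_k) / len(gold)


# Hypothetical retriever output for one test instance: skill ids, best first.
ranked = ["distractor_17", "gold_a", "distractor_3", "gold_b", "distractor_9"]
gold = ["gold_a", "gold_b"]

print(recall_at_k(ranked, gold, k=1))  # 0.0
print(recall_at_k(ranked, gold, k=3))  # 0.5
print(recall_at_k(ranked, gold, k=5))  # 1.0
```

Averaging this value over all test instances gives a single retrieval score per method, which may be reported alongside end-task accuracy.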
Strengths
Provides 5,400 test instances for agent evaluation.
Includes a skill library of 26,262 skills in which the 636 gold skills are explicitly distinguished from the 25,626 distractors.
Contains structured sub-benchmarks targeting specific capability types like theorem application and logical reasoning.
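The library counts quoted in this card can be checked with a few lines of arithmetic. The sketch below uses only the three numbers stated in the description; the variable names are illustrative.

```python
# Counts as stated in the dataset description.
GOLD_SKILLS = 636
DISTRACTOR_SKILLS = 25_626
LIBRARY_SIZE = 26_262

# The two parts should add up to the stated library size.
assert GOLD_SKILLS + DISTRACTOR_SKILLS == LIBRARY_SIZE

gold_share = GOLD_SKILLS / LIBRARY_SIZE
distractors_per_gold = DISTRACTOR_SKILLS / GOLD_SKILLS

print(f"gold share: {gold_share:.2%}")  # gold share: 2.42%
print(f"distractors per gold skill: {distractors_per_gold:.1f}")  # 40.3
```

Roughly 40 distractors per gold skill indicates how sparse the relevant skills are in the retrieval pool.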
Limitations
Column-level documentation is absent; field semantics must be inferred after download.
The row count for the full dataset is not reported, which makes suitability harder to assess before download.
The description metadata is limited; actual data quality requires manual inspection after download.
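Because field semantics are undocumented, a first inspection pass after download might simply list which fields occur and what types they hold. The sketch below assumes the records have already been loaded as a list of dictionaries; the sample records and field names are hypothetical and stand in for the real data.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, Set


def summarize_fields(records: Iterable[Dict[str, Any]]) -> Dict[str, Set[str]]:
    """Map each field name to the set of Python type names observed for it."""
    seen: Dict[str, Set[str]] = defaultdict(set)
    for record in records:
        for key, value in record.items():
            seen[key].add(type(value).__name__)
    return dict(seen)


# Hypothetical records; real field names must be read from the downloaded files.
sample = [
    {"question": "What is 2 + 2?", "answer": 4, "skill_ids": ["s1"]},
    {"question": "Is p implied by q?", "answer": "yes", "skill_ids": []},
]

for field, types in sorted(summarize_fields(sample).items()):
    print(field, sorted(types))
```

A field that shows more than one type, like the hypothetical `answer` above, is a natural place to start manual inspection.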
Provenance
Source
WeihangSu on Hugging Face, with associated code at github.com/oneal2000/SR-Agents.
Collection Method
Not documented. The dataset was likely constructed for research; its skill library combines gold skills with web-collected distractors.
Time Range
Not specified
Freshness
Last updated 2026-04-22 10:28:53; freshness should be verified.
Geography
Not specified
License
Unknown; users should verify licensing terms before use.