Description
RULER is a benchmark designed to evaluate effective context length and long-context behavior beyond simple retrieval. The dataset contains pre-generated JSONL files organized by target context lengths of 4096, 8192, 16384, 32768, and 49152 tokens. It was authored by sxiong and last updated on June 19, 2026.
Use Cases
Benchmarking model performance on the benchmark's long-context retrieval tasks.
Evaluating multi-hop reasoning with the multi-hop tracing tasks.
Testing a model's ability to aggregate information spread across a long context with the aggregation tasks.
Assessing question-answering performance in long-context settings with the QA-style tasks.
Strengths
Files are organized by five fixed target context lengths, from 4096 to 49152 tokens.
Covers multiple task types including retrieval, multi-hop tracing, aggregation, and question answering.
Last updated on June 19, 2026, suggesting recent maintenance.
Limitations
Column-level documentation is absent; field semantics must be inferred after download.
Row count and sample data are unknown, and the file format is described only as JSONL, which makes suitability hard to assess before download.
License information is unknown, which may restrict usage.
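Because column-level documentation is absent, one way to infer field semantics after download is to scan a few records of a JSONL file and list the top-level keys with the value types observed. The following is a minimal sketch, not part of the dataset or the RULER tooling: the function name `summarize_fields` and the field names in the sample records (`index`, `input`, `outputs`, `length`) are hypothetical and chosen only for illustration.

```python
import json
from collections import defaultdict
from typing import Dict, Iterable, Set


def summarize_fields(lines: Iterable[str], max_records: int = 100) -> Dict[str, Set[str]]:
    """Map each top-level field name to the set of Python types seen for it.

    `lines` is any iterable of JSONL lines (a list of strings or an open file).
    Only the first `max_records` non-blank lines are parsed.
    """
    seen: Dict[str, Set[str]] = defaultdict(set)
    parsed = 0
    for line in lines:
        if parsed >= max_records:
            break
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between records
        record = json.loads(line)
        parsed += 1
        if not isinstance(record, dict):
            # A JSONL line may hold any JSON value; note it and move on.
            seen["<non-object record>"].add(type(record).__name__)
            continue
        for key, value in record.items():
            seen[key].add(type(value).__name__)
    return dict(seen)


# Hypothetical sample records; the real field names are not documented.
sample = [
    '{"index": 0, "input": "some long prompt", "outputs": ["42"]}',
    '{"index": 1, "input": "another prompt", "outputs": ["7"], "length": 4096}',
]
summary = summarize_fields(sample)
for field, types in sorted(summary.items()):
    print(field, sorted(types))
```

To run it on a downloaded file, pass an open file handle instead of the sample list, for example `with open(path, encoding="utf-8") as f: summarize_fields(f)`, where `path` points at one of the JSONL files. Fields that appear in only some records, or with more than one type, are worth checking by hand before building an evaluation on them.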
Provenance
Source
Hugging Face
Collection Method
Pre-generated for the RULER benchmark; specific collection method not detailed.
Freshness
Last updated 2026-06-19 02:42:01
License is unknown; users must verify terms of use before downloading.