Description
100 billion tokens of high-quality synthetic data used for the second-stage training of the OLMo 3 7B model. The corpus was created by AllenAI and includes sources such as Dolmino Math (10.7B tokens), CraneMath (5.62B tokens), and StackEdu (10.0B tokens). The dataset was last updated on January 5, 2026.
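To put the quoted figures in proportion, the following minimal sketch computes each named source's share of the corpus. It assumes the 100B total is exact rather than rounded, and it uses only the three token counts stated in the description; the variable and function names are illustrative.

```python
# Token counts in billions, as quoted in the description.
# Assumption: the 100B total is treated as exact, though it may be rounded.
TOTAL_B = 100.0
named_sources_b = {
    "Dolmino Math": 10.7,
    "CraneMath": 5.62,
    "StackEdu": 10.0,
}


def source_shares(sources_b, total_b):
    """Return each source's share of the total, as a percentage."""
    return {name: 100.0 * tokens / total_b for name, tokens in sources_b.items()}


shares = source_shares(named_sources_b, TOTAL_B)
named_total_b = sum(named_sources_b.values())
unlisted_b = TOTAL_B - named_total_b

for name, pct in shares.items():
    print(f"{name}: {pct:.2f}% of the corpus")
print(f"Named sources combined: {named_total_b:.2f}B tokens")
print(f"Not itemized in the description: {unlisted_b:.2f}B tokens")
```

Under these assumptions the three named sources account for roughly a quarter of the corpus, so most of the token budget comes from sources the description does not itemize.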
Use Cases
Fine-tuning language models for mathematical problem-solving using the synthetic math data.
Training models for code generation and understanding using the StackEdu (FIM) code corpus.
Creating specialized instruction-tuning datasets for educational AI assistants from the math and educational content.
Benchmarking model performance on synthetic versus real-world data for tasks in mathematics and coding.
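For the code use case, "FIM" is assumed here to mean fill-in-the-middle training, where a document is split into a prefix, a middle, and a suffix, and the model learns to produce the middle last. The sketch below shows one common layout (prefix, suffix, middle). The sentinel strings are hypothetical placeholders; the tokens and ordering actually used in this dataset are not documented in the listing and should be checked against the data.

```python
# Hypothetical sentinel strings, chosen for illustration only.
# The real markers depend on the tokenizer used to build the corpus.
FIM_PREFIX = "<|fim_prefix|>"
FIM_SUFFIX = "<|fim_suffix|>"
FIM_MIDDLE = "<|fim_middle|>"


def to_fim_example(code: str, start: int, end: int) -> str:
    """Rearrange `code` into prefix-suffix-middle order.

    The span code[start:end] becomes the "middle" that a model
    would be trained to generate after seeing prefix and suffix.
    """
    if not 0 <= start <= end <= len(code):
        raise ValueError("start and end must satisfy 0 <= start <= end <= len(code)")
    prefix, middle, suffix = code[:start], code[start:end], code[end:]
    return f"{FIM_PREFIX}{prefix}{FIM_SUFFIX}{suffix}{FIM_MIDDLE}{middle}"


snippet = "def add(a, b):\n    return a + b\n"
body_start = snippet.index("return")
example = to_fim_example(snippet, body_start, len(snippet))
print(example)
```

Taking explicit split points keeps the function deterministic; a training pipeline would typically sample them at random.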
Strengths
Contains 100 billion tokens, providing a substantial volume of training data.
Includes multiple high-quality, specialized sources such as Dolmino Math (10.7B tokens) and CraneMath (5.62B tokens).
Designed and used for training a specific, documented model (OLMo 3 7B), which suggests practical utility.
Limitations
Description metadata is limited; actual data quality, formatting, and structure require manual inspection after download.
Column-level documentation is absent; field semantics and data organization must be inferred after download.
The dataset consists entirely of synthetic data, which may not fully capture the complexity and distribution of real-world problems.
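Because column-level documentation is absent, a first inspection pass might tally which fields appear and what types they hold. The sketch below assumes the downloaded files are JSON Lines with one object per line, which the listing does not confirm. The sample records and field names are invented for illustration and are not taken from the dataset.

```python
import json
from collections import Counter, defaultdict
from typing import Iterable


def summarize_fields(lines: Iterable[str], limit: int = 1000):
    """Tally the JSON value types seen per field in the first `limit` records."""
    field_types = defaultdict(Counter)
    seen = 0
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        if not isinstance(record, dict):
            continue  # this sketch only handles top-level objects
        for key, value in record.items():
            field_types[key][type(value).__name__] += 1
        seen += 1
        if seen >= limit:
            break
    return seen, {key: dict(counts) for key, counts in field_types.items()}


# Invented sample records standing in for lines read from a downloaded file.
sample_lines = [
    '{"id": "a1", "text": "2 + 2 = 4", "source": "example"}',
    '{"id": "a2", "text": "def f(): pass"}',
]
count, summary = summarize_fields(sample_lines)
print(count, summary)
```

To run it on a real file, pass an open file handle in place of `sample_lines`. Fields that appear in only some records, like `source` in the sample, show up with a lower count than the number of records scanned.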
Provenance
Source
AllenAI
Collection Method
Mixture of high-quality synthetic data from sources like TinyMATH, CraneMath, MegaMatt, Dolmino Math, and StackEdu.
Time Range
The temporal coverage of the source data is not specified.
Freshness
Last updated 2026-01-05 16:25:35; freshness should be verified.
Geography
The geographic origin of the source data is not specified.
License
License information is unknown; users must verify permissions before use.