Description
AllenAI's Dolma3 Dolmino Mix 100B 1125 is a curated data pool assembled for the second-stage training of the OLMo 3 32B language model. The dataset sources include synthetic math problems, code, and question-answering data from repositories like TinyMATH, CraneMath, StackEdu, and Nemotron. It was last updated on Hugging Face in February 2026.
Use Cases
Training language models for mathematical reasoning based on synthetic math problems from sources like TinyMATH and CraneMath.
Fine-tuning models for code generation and understanding based on the StackEdu and CraneCode sources.
Improving question-answering performance using synthetic QA data from Reddit To Flashcards and Wiki To RCQA.
Conducting research on the impact of high-quality, multi-source data pools on model performance during annealing stages.
Strengths
Designed for a specific training stage (stage 2 annealing) of the OLMo 3 32B model.
Aggregates data from multiple named, high-quality sources across math, code, and QA domains.
Last updated on 2026-02-23, indicating recent maintenance.
Limitations
Description metadata is limited; actual data quality, size, and format require manual inspection after download.
Column-level documentation is absent; field semantics must be inferred after download.
Data may reflect biases inherent in the synthetic generation methods and the original source platforms.
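Because column-level documentation is absent, one way to infer field semantics is to profile a small sample of records after download. The following is a minimal sketch, assuming the records are available as Python dictionaries (for example, rows read from the downloaded files). The `summarize_fields` helper and the sample rows are hypothetical and for illustration only; the dataset's actual field names are not documented here.

```python
from collections import Counter, defaultdict
from typing import Any, Dict, Iterable


def summarize_fields(
    records: Iterable[Dict[str, Any]], limit: int = 1000
) -> Dict[str, Dict[str, Any]]:
    """Profile the first `limit` records: field coverage, value types, one example."""
    type_counts: Dict[str, Counter] = defaultdict(Counter)
    examples: Dict[str, str] = {}
    seen = 0
    for record in records:
        if seen >= limit:
            break
        seen += 1
        for key, value in record.items():
            # Count the Python type name observed for each field.
            type_counts[key][type(value).__name__] += 1
            # Keep a truncated repr of the first value seen for each field.
            examples.setdefault(key, repr(value)[:80])
    return {
        key: {
            "coverage": sum(counts.values()) / seen,  # share of records with the field
            "types": dict(counts),
            "example": examples[key],
        }
        for key, counts in type_counts.items()
    }


# Hypothetical sample rows (assumed field names, not taken from the dataset).
sample = [
    {"id": "a1", "text": "2 + 2 = 4", "source": "math"},
    {"id": "a2", "text": "def f(): pass"},
]
summary = summarize_fields(sample)
for field, stats in summary.items():
    print(field, stats)
```

Fields with coverage below 1.0 or with more than one observed type are good candidates for closer manual inspection.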
Provenance
Source
AllenAI, via Hugging Face.
Collection Method
Curated pool from multiple synthetic and educational data sources.
Time Range
Not specified.
Freshness
Last updated 2026-02-23 19:03:37.
Geography
Not specified.
License is unknown; users must verify terms before use.