Name: Dolma3 Dolmino Mix: High-Quality Data Pool for Large Language Model Training
Creator: allenai
Published: 2025-11-18T22:25:49
Keywords: Language: en, Language Model Training, Math Problems, Question Answering, Text, arXiv:2512.13961, Region: US, License: ODC-By, Synthetic Data

Description

AllenAI's Dolma3 Dolmino Mix 100B 1125 is a curated data pool assembled for the second-stage (annealing) training of the OLMo 3 32B language model. It combines synthetic math problems, code, and question-answering data drawn from sources such as TinyMATH, CraneMath, StackEdu, and Nemotron. It was last updated on Hugging Face on 2026-02-23.

Use Cases

Training language models for mathematical reasoning based on synthetic math problems from sources like TinyMATH and CraneMath.
Fine-tuning models for code generation and understanding based on the StackEdu and CraneCode sources.
Improving question-answering performance using synthetic QA data from Reddit To Flashcards and Wiki To RCQA.
Conducting research on the impact of high-quality, multi-source data pools on model performance during annealing stages.

Strengths

Designed for a specific, high-stakes training stage (stage 2 annealing) of the OLMo 3 32B model.
Aggregates data from multiple named, high-quality sources across math, code, and QA domains.
Last updated on 2026-02-23, indicating recent maintenance.

Limitations

Description metadata is limited; actual data quality, size, and format require manual inspection after download.
Column-level documentation is absent; field semantics must be inferred after download.
Data may reflect source bias inherent to the synthetic generation methods and original platforms.
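
Because column-level documentation is absent, a quick way to infer field names and value types is to sample a few records before committing to a full download. The following is a minimal sketch, not an official loader: the helper names `summarize_fields` and `peek_remote` are hypothetical, and the repository id and split name shown in the comment are assumptions that should be confirmed on the Hugging Face page. `peek_remote` relies on the third-party `datasets` package and network access.

```python
from collections import defaultdict
from itertools import islice


def summarize_fields(records, limit=100):
    """Map each field name to the sorted type names seen in the first `limit` records."""
    seen = defaultdict(set)
    for record in islice(records, limit):
        for name, value in record.items():
            seen[name].add(type(value).__name__)
    return {name: sorted(types) for name, types in seen.items()}


def peek_remote(repo_id, split="train", limit=100):
    """Stream a few records from the Hugging Face Hub instead of downloading everything.

    The split name "train" is an assumption; check the dataset page.
    """
    # Imported lazily so summarize_fields works without the `datasets` package.
    from datasets import load_dataset

    stream = load_dataset(repo_id, split=split, streaming=True)
    return summarize_fields(stream, limit=limit)


# Example call (assumed repository id, verify before use):
# peek_remote("allenai/dolma3_dolmino_mix-100B-1125")
```

`summarize_fields` also accepts any local iterable of dictionaries, so it can be tried offline on a few hand-written records before pointing it at the remote data.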

Provenance

Source: AllenAI, via Hugging Face.
Collection Method: Curated pool from multiple synthetic and educational data sources.
Time Range: Not specified.
Freshness: Last updated 2026-02-23 19:03:37.
Geography: Not specified.

License: ODC-By, according to the dataset tags; users should verify the terms on the Hugging Face page before use.
