Name: Dolma3 Dolmino Mix 100B: Synthetic Math and Code Corpus for Language Model Training
Creator: allenai
Published: 2025-10-12T22:14:27
Keywords: Educational, Mathematics, Code, Language Model Training, Text, Synthetic Data

Description

A 100-billion-token corpus of synthetic data, described by its creator as high-quality, used for the second-stage training of the OLMo 3 7B model. The corpus was created by AllenAI and includes sources such as Dolmino Math (10.7B tokens), CraneMath (5.62B tokens), and StackEdu (10.0B tokens). The dataset was last updated on January 5, 2026.
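The token counts above imply that the three named sources cover only part of the mix. The following is a minimal sketch of that arithmetic, using only the figures stated in the description; the variable and function names are illustrative, and the remainder is simply whatever the description does not itemize.

```python
# Token counts in billions, taken from the description above.
TOTAL_B = 100.0
named_sources_b = {
    "Dolmino Math": 10.7,
    "CraneMath": 5.62,
    "StackEdu": 10.0,
}


def source_shares(sources, total):
    """Return each source's share of the total, in percent, rounded to 2 places."""
    return {name: round(100.0 * tokens / total, 2) for name, tokens in sources.items()}


shares = source_shares(named_sources_b, TOTAL_B)

# Tokens covered by the three named sources, and the part not itemized here.
named_total_b = round(sum(named_sources_b.values()), 2)
unlisted_b = round(TOTAL_B - named_total_b, 2)

for name, pct in shares.items():
    print(f"{name}: {pct}% of the mix")
print(f"Named sources: {named_total_b}B tokens; not itemized: {unlisted_b}B tokens")
```

By this calculation the three named sources account for roughly a quarter of the corpus, so most of the mix comes from sources whose sizes are not given in this description.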

Use Cases

Fine-tuning language models for mathematical problem-solving using the synthetic math sources.
Training models for code generation and code understanding using the StackEdu (FIM) code source.
Building specialized instruction-tuning datasets for educational AI assistants from the math and educational content.
Comparing model performance when trained on synthetic versus real-world data for mathematics and coding tasks.

Strengths

Contains 100 billion tokens, a substantial volume of training data.
Includes multiple specialized sources, such as Dolmino Math (10.7B tokens) and CraneMath (5.62B tokens).
Designed for and used in training a specific, documented model (OLMo 3 7B), which suggests practical utility.

Limitations

Description metadata is limited; actual data quality, formatting, and structure require manual inspection after download.
Column-level documentation is absent; field semantics and data organization must be inferred after download.
The dataset consists entirely of synthetic data, which may not fully capture the complexity and distribution of real-world problems.
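Because field-level documentation is absent, a first pass after download can tally which fields appear in the records and which value types they hold. The sketch below assumes the records can be read as Python dictionaries; the sample rows and their field names ("id", "text", "source") are hypothetical placeholders, not the dataset's actual schema.

```python
from collections import Counter, defaultdict


def summarize_fields(records):
    """Return {field: {type name: count}} for an iterable of dict records."""
    summary = defaultdict(Counter)
    for record in records:
        for field, value in record.items():
            summary[field][type(value).__name__] += 1
    return {field: dict(counts) for field, counts in summary.items()}


# Hypothetical sample rows; the real field names are not documented.
sample = [
    {"id": "a1", "text": "2 + 2 = 4", "source": "example-math"},
    {"id": "a2", "text": "def f(): pass", "source": None},
]

field_summary = summarize_fields(sample)
for field, type_counts in field_summary.items():
    print(field, type_counts)
```

Running the same function over a few thousand real records would show whether fields are consistently present and consistently typed, which is the information the missing column documentation would otherwise provide.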

Provenance

Source: AllenAI
Collection Method: Mixture of high-quality synthetic data from sources like TinyMATH, CraneMath, MegaMatt, Dolmino Math, and StackEdu.
Time Range: The temporal coverage of the source data is not specified.
Freshness: Last updated 2026-01-05 16:25:35; freshness should be verified.
Geography: The geographic origin of the source data is not specified.

License information is unknown; users must verify permissions before use.
