Name: MegaMath Web Pro Max: English Mathematics Text for Language Models
Creator: OctoThinker
Published: 2025-06-25T11:54:53
Keywords: library:polars, library:dask, size_categories:10M<n<100M, modality:text, modality:tabular, library:mlcroissant, library:datasets, parquet, arxiv:2506.20512, region:us, license:odc-by

Description

The dataset is a curated collection of English mathematics text documents, created by OctoThinker and last updated in July 2025. It is derived from the MegaMath-Web corpus and annotated using the Llama-3.1-70B-instruct model. The dataset is intended for natural language processing and large language model training, with content filtered based on a quality scoring threshold.

Use Cases

Fine-tune language models on English mathematics text for improved mathematical reasoning tasks.
Analyze the relationship between document publication year and text quality scores from the annotation process.
Use the filtered, high-scoring text documents as a training corpus for specialized mathematical language models.
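
As a rough illustration of the year-versus-score analysis mentioned above, the sketch below groups records by publication year and averages their quality scores. The field names `year` and `score` are assumptions made for this example, since the column structure is not documented here, and the toy records are invented rather than drawn from the dataset.

```python
from collections import defaultdict
from statistics import mean


def mean_score_by_year(records, year_key="year", score_key="score"):
    """Group records by year and return {year: mean score}, ordered by year.

    `year_key` and `score_key` are hypothetical field names; adjust them to
    whatever columns the real dataset exposes.
    """
    buckets = defaultdict(list)
    for rec in records:
        buckets[rec[year_key]].append(rec[score_key])
    return {year: mean(scores) for year, scores in sorted(buckets.items())}


# Toy records standing in for real rows; values are made up for illustration.
sample = [
    {"year": 2019, "score": 3},
    {"year": 2019, "score": 5},
    {"year": 2021, "score": 4},
]
by_year = mean_score_by_year(sample)
print(by_year)
```

On the real data, the same grouping could be done with polars or dask, both of which are listed among the dataset's library tags.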

Strengths

Documents are annotated for quality with the Llama-3.1-70B-instruct model.
The curation process involves stratified sampling by publication year from the MegaMath-Web corpus.
Content is filtered using a defined scoring threshold to ensure quality.

Limitations

The dataset's exact size, row count, and column structure are not stated; the size-category tag only places it in the 10M to 100M row range.
The quality scoring and filtering methodology may introduce specific biases not described in detail.
The license tag indicates ODC-By, but the detailed terms of use are not restated here.

Provenance

Source: MegaMath-Web corpus.
Collection Method: Documents were uniformly and randomly sampled, stratified by publication year, then annotated and filtered based on a scoring prompt.
Freshness: Last updated on 2025-07-06.
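
The collection method described above (sampling stratified by publication year, then filtering on a score) can be sketched as follows. This is a minimal sketch under stated assumptions, not the authors' pipeline: the field names `year` and `score`, the per-year sample size, and the threshold value are all hypothetical, and the LLM annotation step is replaced by pre-filled toy scores.

```python
import random
from collections import defaultdict


def stratified_sample(records, per_year, year_key="year", seed=0):
    """Draw up to `per_year` records uniformly at random from each year stratum."""
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    strata = defaultdict(list)
    for rec in records:
        strata[rec[year_key]].append(rec)
    sampled = []
    for year in sorted(strata):
        group = strata[year]
        k = min(per_year, len(group))  # small strata contribute all they have
        sampled.extend(rng.sample(group, k))
    return sampled


def filter_by_score(records, threshold, score_key="score"):
    """Keep records whose score is at or above the (hypothetical) threshold."""
    return [rec for rec in records if rec[score_key] >= threshold]


# Toy corpus: years, scores, and text are invented for illustration only.
corpus = [
    {"year": 2018 + i % 3, "score": i % 6, "text": f"doc {i}"}
    for i in range(30)
]
subset = stratified_sample(corpus, per_year=4)
kept = filter_by_score(subset, threshold=3)
print(len(subset), len(kept))
```

Sampling the same number of documents from each year keeps years with many documents from dominating the annotated sample.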

The full description is available only on the Hugging Face dataset page. The specific filtering threshold and detailed preprocessing steps are not given in this summary.
