This collection contains 55 billion tokens of mathematical text drawn from three sources: arXiv papers, OpenWebMath web content, and the Algebraic Stack code collection. It combines LaTeX-formatted scientific documents, formal proof scripts, and general mathematical discourse.
Use Cases
- Train language models to generate formal proofs using the Lean and Coq source code in the Algebraic Stack
- Improve LaTeX document synthesis by training on the arXiv subset
- Develop mathematical reasoning capabilities by fine-tuning on the OpenWebMath web-crawled discussions
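
The use cases above each draw on a different subset, so a first step is often to route documents by source. The following is a minimal sketch only: the record layout (`subset`, `lang`, `text` keys), the subset tag strings, and the `route_by_use_case` helper are all assumptions made for illustration and are not taken from the collection's actual schema.

```python
from collections import defaultdict

# Hypothetical record layout (an assumption, not the collection's real schema):
# each document is a dict with a "subset" tag, an optional "lang" tag for
# code files, and the raw "text".
FORMAL_PROOF_LANGS = {"lean", "coq"}


def route_by_use_case(records):
    """Group documents by the training use case they would feed."""
    buckets = defaultdict(list)
    for rec in records:
        subset = rec.get("subset")
        if subset == "algebraic-stack" and rec.get("lang") in FORMAL_PROOF_LANGS:
            # Formal proof generation: Lean and Coq sources only.
            buckets["formal_proofs"].append(rec)
        elif subset == "arxiv":
            # LaTeX document synthesis.
            buckets["latex_synthesis"].append(rec)
        elif subset == "open-web-math":
            # General mathematical reasoning.
            buckets["math_reasoning"].append(rec)
    return dict(buckets)


# Tiny made-up sample, used only to exercise the function.
sample = [
    {"subset": "arxiv", "text": r"\section{Introduction}"},
    {"subset": "algebraic-stack", "lang": "lean", "text": "theorem t : 1 + 1 = 2 := rfl"},
    {"subset": "algebraic-stack", "lang": "python", "text": "import sympy"},
    {"subset": "open-web-math", "text": "Why is the sum of two odd numbers even?"},
]

groups = route_by_use_case(sample)
print({name: len(docs) for name, docs in groups.items()})
```

Documents that match none of the three use cases (the Python file in the sample) are left out rather than forced into a bucket; a real pipeline might keep them in a separate group instead.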
Strengths
- 55 billion tokens of diverse mathematical content
- Includes the Algebraic Stack, with source code in Lean, Coq, Isabelle, and Python
- Contains 15 million documents sourced from arXiv and the OpenWebMath web crawl