Description
The Dolma3 6T training mix contains 1,258,453,709 unique documents, selected with a Bloom filter built from 1.26B deduplicated document IDs. The dataset is materialized from multiple sources, including Common Crawl, Stack Exchange, and scientific PDFs, and was created by HCAI-Lab. It was last updated on March 14, 2026.
Use Cases
Training large language models on a deduplicated, web-scale text corpus.
Analyzing the composition of modern training datasets using the listed source families (Common Crawl, Stack Exchange, and others).
Benchmarking deduplication techniques against the Bloom filter selection method.
Studying domain-specific language patterns in the scientific PDF and math content.
Creating filtered subsets for specialized NLP tasks, guided by the source family proportions.
Strengths
Contains 1,258,453,709 documents, providing a large-scale text corpus.
Selected with a Bloom filter built from 1.26B deduplicated document IDs, so each document is intended to appear only once.
Aggregates content from multiple distinct sources, including Common Crawl (1,052,782,403 documents) and Stack Exchange (135,925,364 documents).
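The per-source counts above can be turned into proportions with simple arithmetic. The sketch below uses only the three figures stated in this listing. The function name `source_shares` and the label for the remainder are illustrative. Counts for the other source families are not itemized here, so they are lumped together.

```python
# Figures copied from this listing; nothing else is assumed about the data.
TOTAL_DOCUMENTS = 1_258_453_709
LISTED_COUNTS = {
    "Common Crawl": 1_052_782_403,
    "Stack Exchange": 135_925_364,
}


def source_shares(total, counts):
    """Return (share per source, number of documents not itemized)."""
    shares = {name: n / total for name, n in counts.items()}
    remainder = total - sum(counts.values())
    # Everything not itemized in the listing is grouped under one label.
    shares["other sources (not itemized)"] = remainder / total
    return shares, remainder


shares, remainder = source_shares(TOTAL_DOCUMENTS, LISTED_COUNTS)
for name, share in shares.items():
    print(f"{name}: {share:.2%}")
print(f"documents from other sources: {remainder:,}")
```

The remainder covers the sources named under Provenance whose counts are not given in this listing.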
Limitations
Column-level documentation is absent; field semantics must be inferred after download.
Row count is not reported separately from the 1,258,453,709-document total; confirm that the two match after download.
Description metadata is limited; actual data quality requires manual inspection after download.
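Since field semantics have to be inferred after download, a first pass could simply tally which keys appear in a sample of records and what value types they hold. This is a minimal sketch, assuming each record has already been parsed into a Python dict. The sample records and their field names are invented for illustration and are not the dataset's actual schema.

```python
from collections import Counter, defaultdict


def summarize_fields(records, limit=1000):
    """Count keys and value types across up to `limit` records."""
    key_counts = Counter()
    type_counts = defaultdict(Counter)
    seen = 0
    for record in records:
        if seen >= limit:
            break
        seen += 1
        for key, value in record.items():
            key_counts[key] += 1
            type_counts[key][type(value).__name__] += 1
    return seen, key_counts, {k: dict(v) for k, v in type_counts.items()}


# Hypothetical records, for illustration only.
sample = [
    {"id": "a", "text": "hello"},
    {"id": "b", "text": "world", "source": "example"},
]
seen, keys, types = summarize_fields(sample)
for key, count in keys.most_common():
    print(f"{key}: present in {count}/{seen} records, types={types[key]}")
```

Keys that appear in only some records, or that hold mixed types, are the ones most worth inspecting by hand.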
Provenance
Source
Multiple sources including Common Crawl, Dolma1 7-wiki-en, Finemath, Olmocr Science PDFs, RPJ Proofpile arXiv, and Stack Exchange.
Collection Method
Materialized from the Dolma3 6T training mix, selected by a Bloom filter built from 1.26B deduplicated document IDs.
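To make the selection step concrete, here is a minimal Bloom filter sketch. It is not the implementation used to build this dataset: the class, the SHA-256 double-hashing scheme, the 1% error rate, and the `doc-…` IDs are all assumptions chosen for illustration. The sizing uses the standard formulas m = -n ln p / (ln 2)^2 and k = (m / n) ln 2.

```python
import hashlib
import math


class BloomFilter:
    """Set-membership filter: no false negatives, tunable false positives."""

    def __init__(self, capacity, error_rate=0.01):
        ln2 = math.log(2)
        # m bits for n expected items at target false-positive rate p.
        self.num_bits = max(8, math.ceil(-capacity * math.log(error_rate) / ln2**2))
        # k hash functions.
        self.num_hashes = max(1, round(self.num_bits / capacity * ln2))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, item):
        # Double hashing: derive k bit positions from two 64-bit values.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos >> 3] |= 1 << (pos & 7)

    def __contains__(self, item):
        return all(self.bits[pos >> 3] & (1 << (pos & 7)) for pos in self._positions(item))


# Build the filter from the IDs to keep, then select matching documents.
keep_ids = [f"doc-{i}" for i in range(1000)]  # hypothetical IDs
bloom = BloomFilter(capacity=len(keep_ids))
for doc_id in keep_ids:
    bloom.add(doc_id)

candidates = ["doc-5", "doc-999", "unseen-1"]
selected = [doc_id for doc_id in candidates if doc_id in bloom]
```

A Bloom filter never rejects an ID that was added, but it can admit a small fraction of IDs that were not, so selection through it is approximate. At a 1% error rate the sizing formula gives about 9.6 bits per ID, or roughly 1.5 GB for 1.26B IDs. The error rate actually used for this dataset is not stated.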
Freshness
Last updated 2026-03-14 05:38:15; freshness should be verified.
Format
Records are stored in zstandard-compressed JSONL files; check the full dataset page for specific format details.
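Reading zstandard-compressed JSONL could look like the sketch below. It assumes the third-party `zstandard` package is installed and that each line holds one JSON object. The function names are illustrative, the path in the usage comment is a placeholder, and no field names are assumed because the listing does not document them.

```python
import io
import json


def iter_jsonl(text_stream):
    """Yield one parsed object per non-empty line of a text stream."""
    for line_number, line in enumerate(text_stream, start=1):
        line = line.strip()
        if not line:
            continue
        try:
            yield json.loads(line)
        except json.JSONDecodeError as exc:
            raise ValueError(f"invalid JSON on line {line_number}") from exc


def iter_zst_jsonl(path):
    """Stream records from a .jsonl.zst file without loading it fully."""
    import zstandard  # third-party: pip install zstandard

    with open(path, "rb") as raw:
        reader = zstandard.ZstdDecompressor().stream_reader(raw)
        with io.TextIOWrapper(reader, encoding="utf-8") as text:
            yield from iter_jsonl(text)


# Usage (placeholder path):
# for record in iter_zst_jsonl("shard.jsonl.zst"):
#     print(sorted(record.keys()))
#     break
```

Streaming one line at a time keeps memory use flat, which matters for shards of a corpus this size.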