Name: Dolma3 6T Unique: 1.26 Billion Deduplicated Documents for Language Model Training
Creator: HCAI-Lab
Published: 2026-03-13T23:06:25
Keywords: Web Crawl, Language Model Training, Text, Deduplication, Text Corpus

Description

This dataset contains 1,258,453,709 unique documents materialized from the Dolma3 6T training mix. Documents were selected by a Bloom filter built from 1.26B deduplicated document IDs. Sources include Common Crawl, Stack Exchange, and scientific PDFs. The dataset was created by HCAI-Lab and last updated on March 14, 2026.

Use Cases

Training large language models on a deduplicated, web-scale text corpus.
Analyzing the composition of modern training datasets by source family (Common Crawl, Stack Exchange, and others).
Benchmarking deduplication techniques against the Bloom filter selection method.
Studying domain-specific language patterns in the scientific PDF and math content.
Creating filtered subsets for specialized NLP tasks, guided by the source family proportions.

Strengths

Contains 1,258,453,709 documents, providing a large-scale text corpus.
Selected with a Bloom filter built from 1.26B deduplicated document IDs, indicating a focus on uniqueness.
Aggregates content from multiple distinct sources, including Common Crawl (1,052,782,403 documents) and Stack Exchange (135,925,364 documents).
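
The counts above imply how the corpus is split across source families. The following is a minimal Python sketch of that arithmetic, using only the three figures stated in this card. The label for the remainder is an assumption: the card gives no per-source counts for the other listed sources, so they are lumped together.

```python
# Figures taken from this card; nothing else is assumed about the data.
TOTAL_DOCUMENTS = 1_258_453_709
STATED_COUNTS = {
    "Common Crawl": 1_052_782_403,
    "Stack Exchange": 135_925_364,
}

# Documents not attributed to a source with a stated count.
remainder = TOTAL_DOCUMENTS - sum(STATED_COUNTS.values())

# Fraction of the corpus per source family.
shares = {name: count / TOTAL_DOCUMENTS for name, count in STATED_COUNTS.items()}
shares["Other listed sources (combined)"] = remainder / TOTAL_DOCUMENTS

for name, share in shares.items():
    print(f"{name}: {share:.2%}")
```

Under these figures, Common Crawl accounts for roughly 83.7% of documents, Stack Exchange for roughly 10.8%, and the remaining sources together for about 5.5% (69,745,942 documents).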

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
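Per-source document counts are stated only for Common Crawl and Stack Exchange; counts for the remaining sources are not given, which may limit suitability assessment.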
Description metadata is limited; actual data quality requires manual inspection after download.

Provenance

Source: Multiple sources including Common Crawl, Dolma1 7-wiki-en, Finemath, Olmocr Science PDFs, RPJ Proofpile arXiv, and Stack Exchange.
Collection Method: Materialized from the Dolma3 6T training mix, selected by a Bloom filter built from 1.26B deduplicated document IDs.
Freshness: Last updated 2026-03-14 05:38:15; freshness should be verified.
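
To illustrate the collection method, here is a minimal Python sketch of Bloom filter selection. It is not the pipeline used to build this dataset: the `BloomFilter` class, its sizing parameters, the SHA-256 double-hashing scheme, and the `id` field name are all assumptions made for illustration.

```python
import hashlib
import math


class BloomFilter:
    """Illustrative Bloom filter; sizing and hashing choices are assumptions."""

    def __init__(self, capacity: int, error_rate: float = 0.01) -> None:
        # Standard sizing: m = -n * ln(p) / (ln 2)^2 bits, k = (m / n) * ln 2 hashes.
        self.num_bits = max(8, math.ceil(-capacity * math.log(error_rate) / math.log(2) ** 2))
        self.num_hashes = max(1, round(self.num_bits / capacity * math.log(2)))
        self.bits = bytearray((self.num_bits + 7) // 8)

    def _positions(self, key: str):
        # Double hashing: derive k bit positions from two 64-bit halves of one digest.
        digest = hashlib.sha256(key.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, key: str) -> bool:
        return all(self.bits[pos // 8] & (1 << (pos % 8)) for pos in self._positions(key))


# Hypothetical IDs that survived deduplication.
kept_ids = ["doc-001", "doc-002", "doc-003"]
bloom = BloomFilter(capacity=len(kept_ids))
for doc_id in kept_ids:
    bloom.add(doc_id)

# Hypothetical candidate documents from the training mix; "id" is an assumed field name.
candidates = [
    {"id": "doc-001", "text": "first"},
    {"id": "doc-002", "text": "second"},
    {"id": "doc-999", "text": "duplicate that was dropped"},
]
selected = [doc for doc in candidates if doc["id"] in bloom]
```

A Bloom filter never rejects an ID that was added, but it can accept an ID that was not (a false positive), so a selection made this way may admit a small number of documents outside the deduplicated ID set.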

Records are stored in zstandard-compressed JSONL files; specific format details require checking the full dataset page.
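
Because column-level documentation is absent, a first step after download is usually to stream a few records and look at their keys. The sketch below shows one way to do that in Python. It assumes the third-party `zstandard` package is installed; the helper names and the shard file name in the comments are hypothetical, and no field names are assumed.

```python
import io
import json
from typing import Iterable, Iterator


def iter_jsonl(lines: Iterable[str]) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line."""
    for line in lines:
        line = line.strip()
        if line:
            yield json.loads(line)


def open_zst_text(path: str) -> io.TextIOWrapper:
    """Open a zstandard-compressed file as a UTF-8 text stream."""
    import zstandard  # third-party package, imported lazily (assumed installed)

    raw = open(path, "rb")
    reader = zstandard.ZstdDecompressor().stream_reader(raw)
    return io.TextIOWrapper(reader, encoding="utf-8")


# In-memory demonstration of the JSONL parsing step.
sample = io.StringIO('{"a": 1}\n\n{"a": 2}\n')
records = list(iter_jsonl(sample))

# With a downloaded shard (hypothetical file name):
# with open_zst_text("shard-0000.jsonl.zst") as fh:
#     first = next(iter_jsonl(fh))
#     print(sorted(first))  # inspect the field names actually present
```

Streaming line by line avoids decompressing a whole shard into memory, which matters at this corpus size.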
