This multilingual pretraining corpus contains 45,031,396 documents across 41 European languages. It is built from HPLT Monolingual v3 web crawl sources and spans Germanic, Romance, Slavic, Celtic, Baltic, Finno-Ugric, Greek, and other language families. Every document has an HPLT WDS quality score of 10 or higher.
Use Cases
- Pretrain multilingual language models on a corpus spanning 41 European languages.
- Benchmark language model performance across the European language families the corpus covers.
- Study the characteristics of web text that passes an HPLT WDS quality threshold of 10.
- Analyze linguistic patterns across Germanic, Romance, Slavic, Celtic, Baltic, Finno-Ugric, and Greek language families.
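Before using the corpus for any of these purposes, it can help to check per-language coverage and the quality-score filter. The sketch below is one minimal way to do that, under stated assumptions: records are available as Python dictionaries, and the field names `lang` and `score` are hypothetical placeholders, since the card does not document the columns. The scores are also assumed to be numeric.

```python
from collections import Counter


def summarize_records(records, lang_key="lang", score_key="score", min_score=10):
    """Tally documents per language and per integer score bucket.

    `lang_key` and `score_key` are hypothetical field names; replace them
    with the real ones once the schema has been inspected.
    """
    per_language = Counter()
    score_buckets = Counter()
    below_threshold = 0
    for record in records:
        score = float(record[score_key])  # assumes a numeric score
        per_language[record[lang_key]] += 1
        score_buckets[int(score)] += 1  # truncate to an integer bucket
        if score < min_score:
            below_threshold += 1
    return per_language, score_buckets, below_threshold


# Illustrative records only; these are not taken from the corpus.
sample = [
    {"lang": "deu", "score": 10},
    {"lang": "deu", "score": 10.4},
    {"lang": "fin", "score": 9.5},
]
per_language, score_buckets, below_threshold = summarize_records(sample)
print(per_language.most_common())
print(sorted(score_buckets.items()))
print(below_threshold)
```

On the real data, a nonzero `below_threshold` would contradict the stated filter of 10 or higher and would be worth investigating before training.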
Strengths
- 45,031,396 documents provide substantial scale.
- The corpus spans 41 European languages across multiple language families.
- All documents have a quality score of 10 or higher, so the corpus is a quality-filtered subset of the source crawl.
Limitations
- Column-level documentation is absent; field semantics must be inferred after download.
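- Row count is not reported in the dataset metadata, so the 45,031,396 document figure cannot be cross-checked without downloading the data; this may limit suitability assessment.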
- Last updated 2026-06-10 14:45:42; freshness should be verified.
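Because the columns are undocumented, a first step after download is to look at which fields actually occur. The following sketch assumes the data can be read as JSON Lines (one JSON object per line); that format is an assumption, not something the card states, and the field names in the sample are invented for illustration.

```python
import json
from collections import Counter, defaultdict
from itertools import islice


def summarize_fields(lines, limit=1000):
    """Report field names and value types seen in the first `limit` lines."""
    presence = Counter()          # field name -> number of records containing it
    types = defaultdict(Counter)  # field name -> Python type name -> count
    n_records = 0
    for line in islice(lines, limit):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        n_records += 1
        for key, value in record.items():
            presence[key] += 1
            types[key][type(value).__name__] += 1
    return n_records, presence, types


# Illustrative lines only; the real field names are unknown.
sample_lines = [
    '{"text": "Guten Tag", "lang": "deu", "score": 10}',
    '{"text": "Hyvää päivää", "lang": "fin"}',
]
n_records, presence, types = summarize_fields(sample_lines)
for field, count in presence.most_common():
    print(field, f"{count}/{n_records}", dict(types[field]))
```

The function accepts any iterable of strings, so an open file handle can be passed in place of `sample_lines`. Fields that appear in only some records, or with mixed types, are the ones whose semantics most need checking.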
Provenance
- Source: HPLT Monolingual v3 high-quality web crawl data.
- Collection Method: Built from web crawl data.
- Geography: European languages.