A multilingual pretraining corpus of 34,605,630 documents covering 13 Indic languages and English, built from the high-quality web crawl data in HPLT Monolingual v3. It is the larger successor to Indic HPLT v1, adds 3 new Indic languages, and contains an estimated 25.5 billion tokens. The dataset was authored by ashtok897 and was last updated on Hugging Face in May 2026.
The license is unknown; verify the terms of use before using the dataset.