A multilingual pretraining corpus of 9,836,075 documents (~8.4B estimated tokens) across 10 Indic languages and English. It was built from the HPLT Monolingual v3 high-quality web crawl data and is hosted on Hugging Face by author ashtok897.
Because of its size, streaming the dataset is recommended over downloading it in full, especially for large-scale use.
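As a rough illustration of the streaming recommendation, the sketch below opens a dataset lazily with the Hugging Face `datasets` library and inspects only a small sample. The repository id is left as a placeholder because the listing does not state it. The `train` split, the `text` field name, and the helper names `open_stream`, `take`, and `rough_token_count` are assumptions made for this example, not details confirmed by the source.

```python
from itertools import islice
from typing import Dict, Iterable, Iterator


def open_stream(repo_id: str, split: str = "train"):
    """Open a Hugging Face dataset lazily instead of downloading it in full.

    The split name "train" is an assumption; check the dataset card.
    """
    # Imported here so the helpers below work without the library installed.
    from datasets import load_dataset

    return load_dataset(repo_id, split=split, streaming=True)


def take(records: Iterable[Dict], n: int) -> Iterator[Dict]:
    """Yield at most n records without materializing the rest."""
    return islice(records, n)


def rough_token_count(records: Iterable[Dict], text_field: str = "text") -> int:
    """Count whitespace-separated words; a crude stand-in for a real tokenizer.

    The field name "text" is an assumption about the record schema.
    """
    return sum(len(str(r.get(text_field, "")).split()) for r in records)


# Hypothetical usage; replace the placeholder with the actual repository id:
# stream = open_stream("<user>/<dataset-name>")
# print(rough_token_count(take(stream, 1000)))
```

Because `take` wraps the stream in `islice`, only the requested number of documents is ever fetched, which keeps memory and bandwidth use small while exploring the corpus.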