Name: Indic HPLT v2: Multilingual Pretraining Corpus Across 13 Indic Languages
Creator: ashtok897
Published: 2026-05-19T21:30:15
Keywords: Web Crawl, Text, Multilingual, Pretraining Corpus, Natural Language Processing, Indic Languages, Multilingual Text

Description

A multilingual pretraining corpus of 34,605,630 documents across 13 Indic languages and English, built from HPLT Monolingual v3 high-quality web crawl data. It is the larger successor to Indic HPLT v1, adding 3 new Indic languages and containing approximately 25.5 billion estimated tokens. The dataset was authored by ashtok897 and last updated on Hugging Face in May 2026.
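As a rough sanity check on the figures above, the stated totals imply an average document length. The sketch below only divides the two numbers quoted in the description; the tokenizer behind the token estimate is not stated, so the result is an approximation, and the function name is purely illustrative.

```python
# Figures quoted in the description; the token count is an estimate.
TOTAL_DOCUMENTS = 34_605_630
ESTIMATED_TOKENS = 25.5e9  # "approximately 25.5 billion estimated tokens"


def average_tokens_per_document(tokens: float, documents: int) -> float:
    """Return the mean number of tokens per document."""
    if documents <= 0:
        raise ValueError("documents must be positive")
    return tokens / documents


avg = average_tokens_per_document(ESTIMATED_TOKENS, TOTAL_DOCUMENTS)
print(f"about {avg:.0f} tokens per document on average")
```

This works out to roughly 737 tokens per document, which describes the corpus as a whole and says nothing about how length varies by language.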

Use Cases

Pretrain multilingual language models on a corpus spanning 13 Indic languages plus English.
Benchmark NLP model performance across Indic languages using the web-crawled text.
Analyze linguistic patterns and web content across South Asia using the high-quality crawl data.
Fine-tune translation or text generation models for low-resource Indic languages using the included Nepali, Odia, and Assamese data.

Strengths

Contains 34,605,630 documents, approximately 3.5 times larger than the previous v1 release.
Covers 13 Indic languages plus English, adding Nepali, Odia, and Assamese compared to v1.
Built from the HPLT Monolingual v3 high-quality web crawl data source.

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
Per-language row counts are not stated; only the 34,605,630-document total is given, which may limit suitability assessment for individual languages.
Freshness should be verified as the last update was 2026-05-19.
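Because column-level documentation is absent, one way to infer field semantics after download is to scan a sample of records and list the fields and value types that appear. The following is a minimal sketch under stated assumptions: records are assumed to be available as Python dictionaries, and the field names and language codes in the sample (`text`, `lang`, `score`, `hi`, `ne`) are hypothetical placeholders, not the dataset's documented schema.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, Set


def infer_field_types(records: Iterable[Dict[str, Any]]) -> Dict[str, Set[str]]:
    """Map each field name to the set of Python type names observed for it."""
    observed: Dict[str, Set[str]] = defaultdict(set)
    for record in records:
        for field, value in record.items():
            observed[field].add(type(value).__name__)
    return dict(observed)


# Hypothetical sample rows; the real column names are not documented.
sample = [
    {"text": "example document", "lang": "hi"},
    {"text": "another document", "lang": "ne", "score": 0.9},
]

summary = infer_field_types(sample)
for field, types in sorted(summary.items()):
    print(field, sorted(types))
```

Fields that appear in only some records, or that carry more than one type, are the ones most worth checking by hand before relying on them.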

Provenance

Source: HPLT Monolingual v3 web crawl data.
Collection Method: Built from high-quality web crawl data.
Freshness: Last updated 2026-05-19.
Geography: Likely covers regions where the 13 Indic languages (including Nepali, Odia, Assamese) are spoken.

License is unknown; terms of use must be verified before application.
