Indic Hplt V1: Multilingual Pretraining Corpus Across 10 Indic Languages

Name: Indic Hplt V1: Multilingual Pretraining Corpus Across 10 Indic Languages
Creator: ashtok897
Published: 2026-05-15T10:19:25
Keywords: Hplt, Web Crawl, Text, Multilingual, Pretraining Corpus, Large Scale, Natural Language Processing, Indic Languages, Multilingual Text

Description

A multilingual pretraining corpus of 9,836,075 documents (~8.4B estimated tokens) across 10 Indic languages and English. It was built from the HPLT Monolingual v3 high-quality web crawl data and is hosted on Hugging Face by author ashtok897.

Use Cases

Pretrain multilingual language models on the full corpus, which spans 10 Indic languages plus English.
Fine-tune models for a specific Indic language using language-filtered subsets.
Benchmark cross-lingual transfer learning techniques on comparable web-crawled text; the source is a monolingual crawl, so aligned parallel text should not be assumed.
Analyze how web text characteristics differ across Indic languages within the HPLT crawl source.
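
The language-filtered subsets mentioned above could be built along these lines. This is a minimal sketch, not the dataset's documented interface: the field name "lang", the language codes, and the sample records are all hypothetical, since the listing does not document the columns.

```python
from typing import Any, Iterable, Iterator, Mapping


def filter_by_language(
    records: Iterable[Mapping[str, Any]],
    language: str,
    field: str = "lang",  # hypothetical field name; check the real schema first
) -> Iterator[Mapping[str, Any]]:
    """Yield only the records whose language field equals `language`."""
    for record in records:
        if record.get(field) == language:
            yield record


# Made-up records standing in for corpus documents.
sample = [
    {"lang": "hi", "text": "doc A"},
    {"lang": "ta", "text": "doc B"},
    {"lang": "hi", "text": "doc C"},
]

hindi_texts = [r["text"] for r in filter_by_language(sample, "hi")]
```

Because the function is a generator, it can wrap a streamed iterator without loading the whole corpus into memory.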

Strengths

Contains 9,836,075 documents, providing substantial scale.
Covers 10 Indic languages plus English, offering multilingual breadth.
Built from the HPLT Monolingual v3 web crawl, which suggests a focus on quality filtering.

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
Row count is listed as unknown in the metadata, although the description cites 9,836,075 documents; the figure should be verified after download.
Data may reflect geographic or source bias inherent to web-crawled content.
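
Since field semantics have to be inferred after download, one way to start is to summarize which fields appear in a small sample and which value types they hold. The sketch below assumes records arrive as plain dictionaries; the field names in the sample are invented for illustration and are not taken from the dataset.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, List, Mapping


def summarize_fields(records: Iterable[Mapping[str, Any]]) -> Dict[str, List[str]]:
    """Map each field name to the sorted type names observed for its values."""
    seen = defaultdict(set)
    for record in records:
        for key, value in record.items():
            seen[key].add(type(value).__name__)
    return {key: sorted(types) for key, types in seen.items()}


# Made-up records; the real field names are unknown until inspected.
sample = [
    {"text": "doc A", "lang": "hi", "score": 0.9},
    {"text": "doc B", "lang": "ta", "score": None},
]

summary = summarize_fields(sample)
```

A field that reports more than one type, such as a float mixed with None, is a hint that the column is optional or inconsistently filled.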

Provenance

Source: HPLT Monolingual v3 high-quality web crawl data.
Collection Method: Built from a web crawl.
Freshness: Last updated 2026-05-20 12:14:24; freshness should be verified.
Geography: Likely covers regions where the 10 Indic languages are spoken.

The dataset is large; streaming is recommended for large-scale use.
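
Streaming could look roughly like the following, assuming the Hugging Face `datasets` library is used. The repository identifier is not given in this listing, so it is left as a parameter, and the "train" split name is an assumption to check against the dataset page.

```python
from itertools import islice
from typing import Any, Iterable, List


def take(records: Iterable[Any], n: int) -> List[Any]:
    """Return the first n items of an iterable without consuming the rest."""
    return list(islice(records, n))


def stream_corpus(repo_id: str, split: str = "train"):
    """Open the dataset in streaming mode; `split` is an assumed default."""
    # Imported lazily so `take` works even when `datasets` is not installed.
    from datasets import load_dataset

    return load_dataset(repo_id, split=split, streaming=True)


# Usage (needs network access and the `datasets` package installed);
# replace the placeholder with the real repository identifier:
# stream = stream_corpus("<owner>/<dataset-name>")
# first_docs = take(stream, 5)
```

With `streaming=True`, records are fetched lazily as they are iterated, so a quick look at a few documents does not require downloading the full corpus.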

Quality Score

Overall: C41
Description: 49
Source: 36
Reputation: 48
Access: 26

Community

Downloads: 501
Likes: 4
Views: 0

Dataset Info

Author: ashtok897
Created: May 15, 2026
Updated: May 20, 2026
Last synced: May 27, 2026
