Name: Pashto Corpus V1: Large-Scale NLP Data Covering Afghan and Pakistani Dialects
Creator: codewithnawaB
Published: 2026-05-08T17:00:22
Keywords: NLP Corpus, Text, Large Scale, Natural Language Processing, Pashto Language, Dialect Variation

Description

A large-scale Pashto corpus of 11,272,055 text items for NLP research. It comprises 2,021,382 pre-training documents and 9,250,673 instruction pairs for supervised fine-tuning (SFT). The dataset was created by codewithnawaB and last updated on Hugging Face in May 2026.
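
The two component counts can be checked against the reported total with a few lines of arithmetic. This is a minimal sketch that uses only the figures quoted in the description; the constant names are illustrative and not part of the dataset.

```python
# Figures quoted in the dataset description.
PRETRAIN_DOCS = 2_021_382
SFT_PAIRS = 9_250_673
REPORTED_TOTAL = 11_272_055

# The components should add up to the reported total.
total = PRETRAIN_DOCS + SFT_PAIRS
assert total == REPORTED_TOTAL, "component counts do not match the reported total"

# Relative share of each component in the whole corpus.
pretrain_share = PRETRAIN_DOCS / total
sft_share = SFT_PAIRS / total
print(f"pre-training: {pretrain_share:.1%}, SFT: {sft_share:.1%}")
```

The counts are consistent: instruction pairs make up roughly four fifths of the items, so the corpus is weighted towards fine-tuning data rather than raw text.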

Use Cases

Pre-train Pashto language models on the 2 million+ pre-training documents.
Fine-tune models for instruction-following tasks using the 9 million+ SFT instruction pairs.
Analyze linguistic differences between the Peshawari (Pakistani) and Afghan (Kandahari) dialects.
Benchmark NLP tools for low-resource language processing.
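
For the instruction-following use case, each pair usually has to be rendered into a single training string. The sketch below shows one common way to do that. The field names `instruction` and `response` and the prompt template are assumptions for illustration only, since the dataset's columns are not documented; adjust them after inspecting the downloaded files.

```python
from typing import Dict

# Hypothetical template and field names: the dataset does not document
# its columns, so "instruction" and "response" are assumptions.
PROMPT_TEMPLATE = "### Instruction:\n{instruction}\n\n### Response:\n{response}"


def format_pair(record: Dict[str, str]) -> str:
    """Render one instruction pair as a single SFT training string."""
    return PROMPT_TEMPLATE.format(
        instruction=record["instruction"].strip(),
        response=record["response"].strip(),
    )


# Placeholder record, not taken from the dataset.
example = {"instruction": "  <instruction text>  ", "response": "<response text>\n"}
print(format_pair(example))
```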

Strengths

Large scale with over 11 million total text items.
Explicitly segmented into 2,021,382 pre-training documents and 9,250,673 instruction-tuning pairs.
Documents its dialect distribution: 98% Peshawari (Pakistani) and 1% Afghan (Kandahari) content. The stated shares sum to 99%, so the remainder is either unattributed or lost to rounding.
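
To get a feel for what the dialect percentages mean in absolute terms, the sketch below converts them into approximate item counts. It assumes the percentages apply to the full 11,272,055 items, which the source does not state; the variable names are illustrative.

```python
TOTAL_ITEMS = 11_272_055

# Stated dialect shares in whole percent. Assumption: they refer to all
# items, not only to the pre-training or the SFT portion.
DIALECT_PERCENT = {"peshawari": 98, "kandahari": 1}


def implied_count(total: int, percent: int) -> int:
    """Integer count for a whole-number percentage, rounded to nearest."""
    return (total * percent + 50) // 100


implied = {name: implied_count(TOTAL_ITEMS, pct) for name, pct in DIALECT_PERCENT.items()}
unaccounted_percent = 100 - sum(DIALECT_PERCENT.values())

for name, count in implied.items():
    print(f"{name}: ~{count:,} items")
print(f"unaccounted share: {unaccounted_percent}%")
```

Under that assumption, the Afghan (Kandahari) portion is on the order of a hundred thousand items, which is small relative to the whole but may still be usable for dialect-specific evaluation.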

Limitations

Description metadata is limited; actual data quality requires manual inspection after download.
Column-level documentation is absent; field semantics must be inferred after download.
Data may reflect geographic bias inherent to its sources, with a heavy skew towards the Peshawari dialect.
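
One common way to soften a skew like this during training is to weight samples by inverse class frequency. The sketch below is a generic illustration, not a method described by the dataset author. It assumes each record carries a dialect label, which is not documented, and uses a toy label list that mirrors the stated 98% / 1% split plus a hypothetical `unlabeled` bucket for the remainder.

```python
from collections import Counter
from typing import Dict, List


def inverse_frequency_weights(labels: List[str]) -> Dict[str, float]:
    """Per-class weights so that every class contributes equally in total."""
    counts = Counter(labels)
    n_classes = len(counts)
    total = len(labels)
    return {label: total / (n_classes * count) for label, count in counts.items()}


# Toy labels mirroring the stated split; "unlabeled" is a hypothetical
# bucket for the share the source does not attribute to either dialect.
labels = ["peshawari"] * 98 + ["kandahari"] * 1 + ["unlabeled"] * 1
weights = inverse_frequency_weights(labels)

for label, weight in sorted(weights.items()):
    print(f"{label}: {weight:.3f}")
```

With these weights a Kandahari sample counts far more than a Peshawari one, so very rare classes can dominate individual batches. Capping the weights or oversampling moderately may be preferable in practice.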

Provenance

Source: Published by codewithnawaB on Hugging Face; the underlying text sources include BBC Pashto.
Collection Method: Aggregated from web sources, likely for language model training.
Freshness: Last updated 2026-05-14 04:49:07 (time zone not stated); check the repository for later revisions before use.
Geography: Covers both Afghanistan (Kandahari dialect) and Pakistan (Peshawari dialect).

License is unknown. Without a stated license, reuse rights are unclear, which may restrict commercial use.
