Description
A large-scale Pashto language corpus containing 11,272,055 text items for NLP research. It comprises 2,021,382 pre-training documents and 9,250,673 instruction pairs for supervised fine-tuning. The dataset was created by codewithnawaB and last updated on Hugging Face in May 2026.
Use Cases
Pre-train Pashto language models on the 2 million+ pre-training documents.
Fine-tune models for instruction-following tasks using the 9 million+ SFT instruction pairs.
Analyze linguistic differences between the Peshawari (Pakistani) and Afghan (Kandahari) dialects.
Benchmark NLP tools for low-resource language processing.
Strengths
Large scale with over 11 million total text items.
Explicitly segmented into 2,021,382 pre-training documents and 9,250,673 instruction-tuning pairs.
Dialect distribution is documented: 98% Peshawari (Pakistani) and 1% Afghan (Kandahari) content; the listing does not account for the remaining 1%.
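The figures quoted above can be cross-checked with a little arithmetic. The following is a minimal sketch that uses only the counts and percentages stated on this page; the variable and function names are illustrative and not part of the dataset.

```python
# Counts quoted in the description on this page.
PRETRAIN_DOCS = 2_021_382
SFT_PAIRS = 9_250_673
TOTAL_ITEMS = 11_272_055

# Dialect shares as listed on this page, in percent.
DIALECT_SHARE = {"Peshawari (Pakistani)": 98, "Afghan (Kandahari)": 1}


def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to two decimals."""
    return round(100 * part / whole, 2)


# Do the two subsets add up to the stated total?
counts_consistent = PRETRAIN_DOCS + SFT_PAIRS == TOTAL_ITEMS

# Relative size of each subset within the whole corpus.
pretrain_pct = share(PRETRAIN_DOCS, TOTAL_ITEMS)
sft_pct = share(SFT_PAIRS, TOTAL_ITEMS)

# Percentage of content that the listed dialect shares do not cover.
unaccounted_pct = 100 - sum(DIALECT_SHARE.values())

print(f"counts add up: {counts_consistent}")
print(f"pre-training share: {pretrain_pct}%")
print(f"SFT share: {sft_pct}%")
print(f"dialect share not listed: {unaccounted_pct}%")
```

Under these numbers the two subsets sum to the stated total, with instruction pairs making up roughly four fifths of all items, which is worth keeping in mind when sampling a balanced training mix.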
Limitations
Description metadata is limited; actual data quality requires manual inspection after download.
Column-level documentation is absent; field semantics must be inferred after download.
Data may reflect geographic bias inherent to its sources, with a heavy skew towards the Peshawari dialect.
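Because column-level documentation is absent, a first pass over a small sample can help infer field semantics. The sketch below is one possible approach, not the dataset author's tooling: `summarize_fields` is a hypothetical helper, and the sample rows use invented field names (`text`, `instruction`, `response`, `source`) that may not match the real columns.

```python
from collections import Counter, defaultdict
from typing import Any, Iterable


def summarize_fields(records: Iterable[dict[str, Any]]) -> dict[str, Counter]:
    """Count the Python types observed for each field across records."""
    summary: dict[str, Counter] = defaultdict(Counter)
    for record in records:
        for field, value in record.items():
            summary[field][type(value).__name__] += 1
    return dict(summary)


# Hypothetical rows; the real column names are undocumented and may differ.
sample = [
    {"text": "...", "source": "bbc"},
    {"instruction": "...", "response": "...", "source": None},
]

# Print each field with the value types seen for it in the sample.
for field, types in summarize_fields(sample).items():
    print(field, dict(types))
```

Fields that appear in only some rows, or that mix `str` with `NoneType`, are good candidates for manual inspection before training.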
Provenance
Source
codewithnawaB on Hugging Face; sources include BBC Pashto.
Collection Method
Aggregated from web sources, likely for language model training.
Freshness
Last updated 2026-05-14 04:49:07 (time zone not stated); verify freshness before use.
Geography
Covers both Afghanistan (Kandahari dialect) and Pakistan (Peshawari dialect).
License is unknown; without explicit license terms, reuse, including commercial use, may be restricted.