A CC0-licensed Pashto speech dataset for Automatic Speech Recognition (ASR), described by its publisher as large-scale. It is part of the Common Voice project, version 25.0, and is hosted on Kaggle. The available metadata does not state its collection method, size, or contributor details.
Use Cases
- Training Pashto speech recognition models on the speech data
- Fine-tuning multilingual ASR systems with Pashto audio
- Benchmarking ASR model performance on Pashto, a low-resource language
- Researching speech patterns and acoustic features in Pashto using the audio data
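For the benchmarking use case, one commonly used metric is word error rate (WER): the word-level edit distance between a reference transcript and the model's output, divided by the number of reference words. The dataset description does not name a metric, so the choice of WER and the function name `word_error_rate` are assumptions made for illustration. A minimal sketch:

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words.

    Illustrative helper (not part of the dataset or any library).
    Tokenization is plain whitespace splitting, which is an assumption;
    real evaluations usually normalize text first.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

Whitespace tokenization keeps the sketch short. For Pashto text, any script or punctuation normalization applied before scoring should be reported alongside the result, since it changes the number.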
Strengths
- Released under a CC0 license, which allows for maximum reuse and redistribution
- Explicitly designed for Automatic Speech Recognition (ASR) tasks
- Focuses on Pashto, a language that may be underrepresented in other speech corpora
Limitations
- Row count and total dataset size are unknown, which makes it hard to assess suitability before download
- Column-level documentation is absent; field semantics must be inferred after download
- Last update date is unknown; freshness unverified
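Because the row count and field semantics are undocumented, a first step after download is to inspect the metadata files directly. The sketch below assumes the download includes tab-separated metadata files, as Common Voice releases typically do; the actual layout of the Kaggle copy is not confirmed by the available metadata. The function name `summarize_tsv` is hypothetical.

```python
import csv
from collections import Counter


def summarize_tsv(path, max_rows=None):
    """Report column names, row count, and non-empty counts per column.

    Illustrative helper; assumes a UTF-8, tab-separated file whose
    first line is a header row.
    """
    with open(path, newline="", encoding="utf-8") as handle:
        # QUOTE_NONE treats quote characters inside transcripts as plain text.
        reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
        columns = list(reader.fieldnames or [])
        non_empty = Counter()
        rows = 0
        for row in reader:
            rows += 1
            for name in columns:
                if row.get(name):
                    non_empty[name] += 1
            if max_rows is not None and rows >= max_rows:
                break
    return {"columns": columns, "rows": rows, "non_empty": dict(non_empty)}
```

Calling `summarize_tsv` on each metadata file found in the download gives the column list and row count that the listing omits. Columns with low non-empty counts are likely optional fields.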
Provenance
- Source: Common Voice project, hosted on Kaggle.
- Collection Method: Likely contains crowd-sourced speech recordings, but the specific collection method is not detailed.
- Time Range: Not specified.
- Freshness: Last updated date is unknown.
- Geography: Not specified.