A multilingual automatic speech recognition (ASR) dataset covering 30 Indic languages and dialects. It contains over 2.8 million audio samples with corresponding transcriptions. The dataset was published on Hugging Face by author grushaaaaa and last updated in February 2026.
Use Cases
- Train multilingual automatic speech recognition models on the paired audio and transcription features.
- Benchmark ASR system performance per language, grouping results by the language label.
- Analyze dialectal variation in speech patterns across the multilingual audio samples.
- Fine-tune pre-trained speech models for specific languages using the language-specific splits.
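The per-language benchmarking use case can be sketched as follows. This is a minimal illustration, not the dataset author's evaluation method: the column names `language` and `transcription` are assumptions inferred from the feature descriptions above, the `transcribe` callable stands in for whatever ASR model is under test, and whitespace tokenization is a simplification that may not suit every script.

```python
from collections import defaultdict


def word_edit_distance(ref, hyp):
    """Levenshtein distance between two lists of word tokens."""
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer_by_language(rows, transcribe,
                    language_key="language", text_key="transcription"):
    """Aggregate word error rate per language label.

    `rows` is any iterable of dict-like records. The default key names are
    assumptions; check the real schema before relying on them.
    `transcribe` is a hypothetical callable mapping a row to predicted text.
    """
    errors = defaultdict(int)
    ref_words = defaultdict(int)
    for row in rows:
        ref = row[text_key].split()
        hyp = transcribe(row).split()
        errors[row[language_key]] += word_edit_distance(ref, hyp)
        ref_words[row[language_key]] += len(ref)
    # Skip languages whose references contain no words to avoid dividing by zero.
    return {lang: errors[lang] / n for lang, n in ref_words.items() if n > 0}
```

Because the function accepts any iterable of records, it should also work on a streamed Hugging Face dataset, which avoids downloading all audio before a first evaluation pass.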
Strengths
- Contains over 2.8 million audio samples.
- Covers 30 distinct Indic languages and dialects.
- Audio is provided in a standard 16 kHz WAV format.
Limitations
- Column-level documentation is absent; field semantics must be inferred after download.
- An exact row count is not documented beyond the approximate figure of over 2.8 million samples, which may limit suitability assessment.
- Description metadata is limited; actual data quality requires manual inspection after download.
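Since the schema is undocumented, a quick pass over a sample of rows can help infer field semantics before committing to a full download. The sketch below is one possible approach under stated assumptions: the key names `language` and `source` are guesses based on the features mentioned in this card, and the helper `summarize_rows` is hypothetical rather than part of any library.

```python
from collections import Counter
from itertools import islice


def summarize_rows(rows, limit=1000, tally_keys=("language", "source")):
    """Inspect up to `limit` rows and report observed columns and label counts.

    `rows` is any iterable of dict-like records, for example a dataset opened
    with `datasets.load_dataset(<repo id>, split=<split>, streaming=True)`.
    The repo id and split names are not given in this card, so they are left
    as placeholders here. `tally_keys` are assumed column names.
    """
    column_types = {}
    tallies = {key: Counter() for key in tally_keys}
    rows_seen = 0
    for row in islice(rows, limit):
        rows_seen += 1
        # Record every Python type observed in each column.
        for name, value in row.items():
            column_types.setdefault(name, set()).add(type(value).__name__)
        # Count label values only for the assumed keys that actually exist.
        for key in tally_keys:
            if key in row:
                tallies[key][row[key]] += 1
    return {"rows_seen": rows_seen,
            "column_types": column_types,
            "tallies": tallies}
```

The tallies give a rough view of how samples are distributed across languages and source datasets within the inspected prefix. A prefix is not a random sample, so the proportions should be treated as indicative only.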
Provenance
- Source
  - Hugging Face dataset by author grushaaaaa.
- Collection Method
  - Aggregated from multiple source datasets, as indicated by the 'source' feature.
- Time Range
  - Not specified.
- Freshness
  - Last updated 2026-02-11 17:25:14; freshness should be verified.
- Geography
  - Likely covers regions where Indic languages are spoken.