A curated evaluation set for Indic-language automatic speech recognition. It contains 6,169 audio samples across 7 dataset configurations, totaling approximately 13.3 hours of audio at 16 kHz. The dataset was created by ayush-shunyalabs and last updated on 2026-04-23.
Use Cases
- Benchmark ASR model performance on curated test samples drawn from seven source corpora.
- Evaluate multilingual speech recognition accuracy across different Indic languages.
- Compare ASR system outputs using a standardized evaluation set.
- Validate speech recognition pipelines on 16 kHz mono audio data.
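A minimal sketch of how the benchmarking and comparison use cases might be scored, assuming word error rate (WER) as the metric. The dataset description does not name a metric, so WER is an assumption here, as are the function names and the plain whitespace tokenization; real evaluations of Indic scripts usually also need text normalization, which this sketch omits.

```python
from typing import Iterable, List, Tuple


def edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    # prev[j] holds the distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER for one utterance: word edits divided by reference length."""
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, hypothesis.split()) / len(ref)


def corpus_wer(pairs: Iterable[Tuple[str, str]]) -> float:
    """Corpus-level WER: total edits over total reference words."""
    total_edits = 0
    total_words = 0
    for reference, hypothesis in pairs:
        ref = reference.split()
        total_edits += edit_distance(ref, hypothesis.split())
        total_words += len(ref)
    if total_words == 0:
        raise ValueError("references contain no words")
    return total_edits / total_words
```

Pooling edits across utterances, as `corpus_wer` does, weights long utterances more heavily than averaging per-utterance rates would. When comparing systems, it helps to state which of the two was used.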
Strengths
- Contains 6,169 audio samples across 7 configurations.
- Total audio duration is approximately 13.3 hours.
- All audio is standardized at a 16 kHz sampling rate.
- Each source corpus is published as its own dataset configuration.
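The 16 kHz mono property can be checked before audio reaches a pipeline. The sketch below assumes clips have been exported to WAV files, which the description does not confirm. It uses only Python's standard `wave` module, and `check_wav_format` and `make_silent_wav` are hypothetical helper names introduced for illustration.

```python
import io
import wave


def check_wav_format(source, expected_rate=16_000, expected_channels=1):
    """Read a WAV header and report whether it matches the expected format.

    `source` may be a file path or a binary file-like object.
    """
    with wave.open(source, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        frames = wav.getnframes()
    return {
        "ok": rate == expected_rate and channels == expected_channels,
        "sample_rate": rate,
        "channels": channels,
        "duration_s": frames / rate,
    }


def make_silent_wav(seconds=1.0, rate=16_000, channels=1):
    """Build an in-memory 16-bit PCM WAV of silence, for trying out the check."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(channels)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM (an assumption)
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate) * channels)
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    print(check_wav_format(make_silent_wav(1.0)))
    print(check_wav_format(make_silent_wav(0.5, rate=8_000)))
```

Summing `duration_s` over all clips gives a quick cross-check against the stated total of approximately 13.3 hours.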
Limitations
- Column-level documentation is absent; field semantics must be inferred after download.
- Per-configuration row counts are not given in the provided metadata.
- Samples may carry over biases from the original corpora's collection methods.
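Since the columns are undocumented, a first pass after download could simply tabulate what each field contains. The following is a rough sketch that assumes rows are available as plain Python dictionaries. `summarize_columns` is a hypothetical helper, and the column names in the usage example are invented for illustration rather than taken from the dataset.

```python
def summarize_columns(rows, limit=100):
    """Collect observed value types, non-null counts, and one example per column.

    Only the first `limit` rows are inspected, to keep the pass cheap.
    """
    summary = {}
    for index, row in enumerate(rows):
        if index >= limit:
            break
        for name, value in row.items():
            info = summary.setdefault(
                name, {"types": set(), "non_null": 0, "example": None}
            )
            info["types"].add(type(value).__name__)
            if value is not None:
                info["non_null"] += 1
                if info["example"] is None:
                    info["example"] = value
    return summary


if __name__ == "__main__":
    # Invented rows; the real column names must be read from the data itself.
    demo_rows = [
        {"text": "नमस्ते", "duration": 3.2},
        {"text": None, "duration": 4},
    ]
    for column, info in summarize_columns(demo_rows).items():
        print(column, sorted(info["types"]), info["non_null"], info["example"])
```

Running the same pass over each configuration separately also yields the per-configuration row counts that the metadata leaves out.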
Provenance
- Source: seven public Indic ASR corpora.
- Collection method: 100 samples were drawn at random (seed = 42) from each (source dataset × language) cell.
- Freshness: last updated 2026-04-23 12:48:51; check for newer revisions before use.
- Geography: Indic-language regions.
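The collection method described above, a fixed-size seeded draw from each (source dataset × language) cell, can be sketched as follows. This is not the author's actual script: the field names `source` and `language`, the use of Python's `random` module, and the behavior for cells holding fewer than 100 rows are all assumptions. Because the stated total of 6,169 is not a multiple of 100, the sketch takes everything available from small cells instead of failing.

```python
import random
from collections import defaultdict


def sample_per_cell(rows, n=100, seed=42):
    """Draw up to `n` rows from every (source, language) cell, reproducibly."""
    cells = defaultdict(list)
    for row in rows:
        # "source" and "language" are hypothetical field names.
        cells[(row["source"], row["language"])].append(row)

    rng = random.Random(seed)  # private generator, so global state is untouched
    sampled = []
    for key in sorted(cells):  # fixed cell order keeps the draw deterministic
        items = cells[key]
        sampled.extend(rng.sample(items, min(n, len(items))))
    return sampled


if __name__ == "__main__":
    demo = (
        [{"source": "corpus_a", "language": "hi", "id": i} for i in range(150)]
        + [{"source": "corpus_a", "language": "ta", "id": i} for i in range(40)]
    )
    subset = sample_per_cell(demo)
    print(len(subset))
```

Sorting the cell keys before drawing matters: with a single shared generator, visiting cells in a different order would change which rows are picked even with the same seed.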