As of March 2025, the Svarah dataset provides 9.6 hours of transcribed English audio from 117 speakers across India. It addresses the underrepresentation of Indian English speakers in existing benchmarks such as LibriSpeech and Switchboard. The dataset was created by ai4bharat.
Use Cases
- Benchmarking automatic speech recognition (ASR) systems on Indic-accented English audio
- Training or fine-tuning accent-robust speech models intended for a speaker base of roughly 130 million
- Studying phonetic variation in English speech across 117 speakers
- Evaluating model performance on accents underrepresented in benchmarks such as LibriSpeech and Switchboard
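For the benchmarking and evaluation use cases, ASR output is commonly scored with word error rate (WER). The following is a minimal sketch of that metric, not a procedure documented for this dataset: it assumes transcripts are plain strings, and the normalization (lowercasing and whitespace splitting) and the function name `word_error_rate` are illustrative choices.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumed normalization: lowercase and split on whitespace.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# One substitution ("b" -> "x") and one deletion ("d") over four reference words.
wer = word_error_rate("a b c d", "a x c")
```

Averaging this score per speaker or per region, if such fields turn out to exist in the data, would give a rough picture of how a model behaves across accents.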
Strengths
- Contains 9.6 hours of transcribed audio
- Includes data from 117 speakers
- Specifically addresses a gap in representation for Indian English speakers
Limitations
- Column-level documentation is absent; field semantics must be inferred after download
- Row count is not reported, which makes suitability hard to assess before download
- Description metadata is limited; actual data quality requires manual inspection after download
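Because neither the columns nor the row count are documented, a first pass after download might resemble the sketch below. It assumes the records have already been loaded into a list of dictionaries by whatever means the distribution format requires. The helper `summarize_records` and the field names in `sample` are hypothetical and do not reflect the dataset's actual schema.

```python
from collections import defaultdict


def summarize_records(records):
    """Return (row_count, {column: sorted list of value type names})."""
    types = defaultdict(set)
    for row in records:
        for column, value in row.items():
            types[column].add(type(value).__name__)
    return len(records), {col: sorted(names) for col, names in types.items()}


# Hypothetical rows for illustration only; the real fields must be inspected.
sample = [
    {"text": "hello world", "duration": 1.8},
    {"text": "good morning", "duration": None},
]

row_count, schema = summarize_records(sample)
for column, type_names in schema.items():
    print(f"{column}: {', '.join(type_names)}")
```

Columns that report more than one type (for example a numeric field that is sometimes empty) are a quick signal of where manual inspection is most needed.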
Provenance
- Source
  - ai4bharat
- Freshness
  - Last updated 2025-03-10 04:29:23
- Geography
  - India