This dataset provides India-focused conversational speech data for evaluating Automatic Speech Recognition (ASR) systems on Hindi-English code-mixed utterances. It was curated by soketlabs and last updated on Hugging Face in January 2026. It targets natural bilingual contexts in which Hindi in Devanagari script and English in Latin script co-occur within the same utterance.
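To make the notion of script co-occurrence concrete, the following is a minimal sketch that labels each whitespace-separated token of a transcript by script and counts the transitions between Devanagari and Latin tokens. The function names `token_script` and `switch_points` and the sample sentence are illustrative assumptions, not part of the dataset or its tooling.

```python
def token_script(token: str) -> str:
    """Label a token as 'devanagari', 'latin', 'mixed', or 'other'."""
    scripts = set()
    for ch in token:
        # U+0900..U+097F is the Unicode Devanagari block
        # (it also contains the danda punctuation marks).
        if "\u0900" <= ch <= "\u097F":
            scripts.add("devanagari")
        elif ch.isascii() and ch.isalpha():
            scripts.add("latin")
    if len(scripts) == 2:
        return "mixed"
    return scripts.pop() if scripts else "other"


def switch_points(text: str) -> int:
    """Count adjacent-token transitions between Devanagari and Latin."""
    labels = [token_script(tok) for tok in text.split()]
    # Ignore digits, punctuation-only tokens, and mixed-script tokens.
    labels = [lab for lab in labels if lab in ("devanagari", "latin")]
    return sum(1 for a, b in zip(labels, labels[1:]) if a != b)


# Hypothetical code-mixed sentence, not taken from the dataset.
sample = "मुझे कल meeting में जाना है"
print([token_script(tok) for tok in sample.split()])
print(switch_points(sample))
```

Splitting on whitespace is a simplification; real transcripts may need punctuation stripping or a proper tokenizer before the counts are meaningful.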
Use Cases
- Benchmarking ASR model performance on Hindi-English code-switching in bilingual conversational speech.
- Training or fine-tuning speech recognition systems for natural code-mixed speech prevalent in India.
- Studying linguistic patterns and recognition challenges that arise when Devanagari and Latin scripts mix within an utterance.
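For the benchmarking use case, a common metric is word error rate (WER): the word-level edit distance between the reference transcript and the model output, divided by the number of reference words. The sketch below is a plain-Python illustration under that definition; the function name `word_error_rate` and the example strings are assumptions, and the card does not state which metric or normalization the curators intend.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: the model writes the English word in Devanagari.
reference = "मुझे कल meeting में जाना है"
hypothesis = "मुझे कल मीटिंग में जाना है"
print(word_error_rate(reference, hypothesis))
```

Because the comparison is an exact string match, an English word that the model transliterates into Devanagari counts as a substitution even if it was recognized correctly. Whether to normalize scripts before scoring is a design choice worth settling before comparing models.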
Strengths
- Focuses on a specific and linguistically relevant phenomenon: Hindi-English code-switching in Indian conversational contexts.
- Hosted on Hugging Face, a widely used platform for AI datasets, which may make community use and integration easier.
Limitations
- Description metadata is limited; actual data quality, size, and column structure require manual inspection after download.
- Row count, file formats, and license are not stated, so suitability (including permitted use) cannot be assessed from the metadata alone.
Provenance
- Source: soketlabs on Hugging Face
- Collection Method: Curated evaluation dataset, likely gathered from bilingual conversational contexts.
- Time Range: Not specified
- Freshness: Last updated 2026-01-16 11:57:16; freshness should be verified.
- Geography: India