Description
A public ASR training dataset combining multiple audio sources. It includes 4,100 Hindi-dominant YouTube podcast segments, 1,160 English podcast segments, 787 Hindi-English code-mixed podcast segments, and 24,459 synthetic Hinglish entity-normalization speech clips. The dataset was created by sajalmadan0909 and was last updated on June 4, 2026.
Use Cases
Train ASR models for Hindi-dominant speech using the 4,100 YouTube podcast segments.
Develop English speech recognition models using the 1,160 English podcast segments.
Build systems that handle Hindi-English code-mixed speech using the 787 code-mixed segments.
Train or fine-tune models for entity normalization using the 24,459 synthetic Hinglish clips.
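The four subsets differ widely in size, which matters if they are trained on together. The following is a minimal sketch, using only the segment counts stated in the description, of how one might compute each subset's share and derive per-example sampling weights that draw every subset equally often. The subset keys and function names are hypothetical labels chosen for illustration; they are not field or split names from the dataset.

```python
# Segment counts as stated in the dataset description.
# The dictionary keys are hypothetical labels, not names from the dataset.
COUNTS = {
    "hindi_dominant_youtube": 4100,
    "english_podcast": 1160,
    "hindi_english_code_mixed": 787,
    "synthetic_hinglish_entity_norm": 24459,
}


def shares(counts):
    """Return each subset's fraction of the summed segment counts."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


def balanced_weights(counts):
    """Return per-example weights so each subset is sampled equally often.

    Every example in a subset of size n gets weight 1 / (k * n), where k is
    the number of subsets, so each subset's total weight is 1 / k.
    """
    k = len(counts)
    return {name: 1.0 / (k * n) for name, n in counts.items()}


total = sum(COUNTS.values())
print(f"sum of listed segments: {total}")
for name, share in shares(COUNTS).items():
    print(f"{name}: {share:.1%}")
```

Under these counts the synthetic clips make up roughly four fifths of the listed segments, so uniform sampling would expose a model mostly to synthetic speech. Whether to rebalance depends on the target task; the weights above are one option, not a recommendation from the dataset author.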
Strengths
Contains a substantial number of synthetic speech clips (24,459) for entity normalization.
Includes a diverse mix of sources: YouTube podcasts, other podcasts, and synthetic speech.
Explicitly covers three language conditions: English, Hindi, and Hindi-English code-mix.
Limitations
Column-level documentation is absent; field semantics must be inferred after download.
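Row count for the full combined dataset is not documented. The four listed subsets sum to 30,506 segments, but whether that matches the actual row count must be verified after download, which may limit suitability assessment.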
Description metadata is limited; actual data quality requires manual inspection after download.
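Because field semantics must be inferred after download, a first pass over a handful of rows can at least show which fields exist, which value types they hold, and how often they are empty. The following is a minimal sketch under the assumption that rows have been loaded as plain Python dictionaries. The field names `audio_path`, `text`, and `duration_s` and the sample values are hypothetical placeholders, not the dataset's actual columns.

```python
from collections import defaultdict


def summarize_fields(rows):
    """Report, per field, the value types seen and the number of missing values.

    A value counts as missing when the key is absent from a row or is None.
    """
    all_keys = set()
    for row in rows:
        all_keys.update(row)

    summary = defaultdict(lambda: {"types": set(), "missing": 0})
    for row in rows:
        for key in all_keys:
            value = row.get(key)
            if value is None:
                summary[key]["missing"] += 1
            else:
                summary[key]["types"].add(type(value).__name__)
    return dict(summary)


# Hypothetical rows for illustration only; the real schema is undocumented.
sample_rows = [
    {"audio_path": "clip_0001.wav", "text": "namaste", "duration_s": 3.2},
    {"audio_path": "clip_0002.wav", "text": None, "duration_s": 4},
]

report = summarize_fields(sample_rows)
for field, info in sorted(report.items()):
    print(field, sorted(info["types"]), "missing:", info["missing"])
```

A field that shows mixed types or many missing values is a candidate for closer manual inspection before it is used in training.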
Provenance
Source
huggingface
Collection Method
Combines YouTube podcast clips segmented by voice activity detection (VAD), English and Hinglish podcast segments, and synthetic Hinglish entity-normalization speech.
Freshness
Last updated 2026-06-04 16:41:30; freshness should be verified.
License
Unknown; the terms of use must be verified before the data is used.