Name: Combined Synthetic Datasets for English, Hindi, and Code-Mixed ASR Training
Creator: sajalmadan0909
Published: 2026-06-04T16:40:11
Keywords: Code Mixing, Multilingual, Audio, Synthetic Speech, Podcast Audio, Automatic Speech Recognition, Synthetic

Description

A public ASR training dataset combining multiple audio sources. It includes 4,100 Hindi-dominant YouTube podcast segments, 1,160 English podcast segments, 787 Hindi-English code-mixed podcast segments, and 24,459 synthetic Hinglish entity-normalization speech clips. The dataset was created by sajalmadan0909 and was last updated on June 4, 2026.
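The composition implied by these four counts can be worked out with a short calculation. The sketch below is illustrative only: the dictionary keys are hypothetical labels chosen here, not field or split names from the dataset, and the total is simply the sum of the listed components, which may differ from the actual row count.

```python
# Clip counts as stated in the dataset description.
# The keys are hypothetical labels, not names used by the dataset itself.
components = {
    "hindi_dominant_youtube_podcast": 4100,
    "english_podcast": 1160,
    "hindi_english_code_mixed_podcast": 787,
    "synthetic_hinglish_entity_normalization": 24459,
}


def composition(counts):
    """Return the summed clip count and each component's share of that sum."""
    total = sum(counts.values())
    shares = {name: n / total for name, n in counts.items()}
    return total, shares


total, shares = composition(components)
for name, n in components.items():
    print(f"{name:42s} {n:6d}  {shares[name]:6.1%}")
print(f"{'sum of listed components':42s} {total:6d}")
```

Under these figures the synthetic clips make up roughly four fifths of the listed material, which may matter when mixing sources during training.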

Use Cases

Train ASR models for Hindi-dominant speech using the 4,100 YouTube podcast segments.
Develop English speech recognition models using the 1,160 English podcast segments.
Build systems that handle Hindi-English code-mixed speech using the 787 code-mixed segments.
Train or fine-tune models for entity normalization using the 24,459 synthetic Hinglish clips.

Strengths

Contains 24,459 synthetic speech clips for entity normalization, roughly 80% of the clips listed in the description.
Includes a diverse mix of sources: YouTube podcasts, other podcasts, and synthetic speech.
Explicitly covers three language conditions: English, Hindi, and Hindi-English code-mix.

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
Total row count is not reported; the four listed components sum to 30,506 clips, but this figure should be confirmed after download.
Description metadata is limited; actual data quality requires manual inspection after download.
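Because column-level documentation is absent, a first inspection pass usually means looking at which fields appear in a sample of records and what types they hold. The following is a minimal sketch of that step, assuming the records can be iterated as plain dictionaries (for example, rows streamed from a downloaded copy). The function name `summarize_fields` and the sample records are hypothetical; the real column names are not documented in the source.

```python
from collections import Counter, defaultdict
from itertools import islice


def summarize_fields(records, limit=100):
    """Map each field name to a count of the Python types seen in its values.

    Only the first `limit` records are examined, so large datasets can be
    inspected without reading everything.
    """
    seen = defaultdict(Counter)
    for record in islice(records, limit):
        for field, value in record.items():
            seen[field][type(value).__name__] += 1
    return {field: dict(types) for field, types in seen.items()}


# Hypothetical records for illustration only; not taken from the dataset.
sample = [
    {"audio_path": "clip_0001.wav", "text": "namaste doston", "duration_s": 3.2},
    {"audio_path": "clip_0002.wav", "text": None, "duration_s": 4},
]

summary = summarize_fields(sample)
for field, types in summary.items():
    print(field, types)
```

Fields that show mixed types or missing values in such a summary are good candidates for closer manual inspection before training.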

Provenance

Source: Hugging Face
Collection Method: Combines YouTube podcast VAD clips, English/Hinglish podcasts, and synthetic Hinglish entity-normalization speech.
Freshness: Last updated 2026-06-04 16:41:30; check the source page for newer revisions before use.

License is unknown; terms of use must be verified before application.
