Name: Hinglish: A Large-Scale Speech Dataset for Hindi, Hinglish, and Indian English
Creator: agarwalayushi
Published: 2026-04-23T09:41:44
Keywords: Code Switching, Multilingual, Audio, Large Scale, Audio Processing, Speech Recognition

Description

The dataset contains 815,171 audio clips totaling more than 2,264 hours of speech. It was compiled by agarwalayushi and last updated in April 2026. It covers Hindi, Hinglish (Hindi-English code-switching), and Indian English, drawn from 14 public corpora plus custom recordings and unified into a single Parquet file.
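As a rough sanity check on the figures quoted above, the clip count and total duration imply an average clip length. The sketch below is plain arithmetic on those two numbers; because the duration is stated as a lower bound ("over 2,264 hours"), the result is a lower bound too, and it says nothing about how clip lengths are actually distributed.

```python
# Figures taken from the dataset description.
CLIP_COUNT = 815_171
TOTAL_HOURS = 2_264  # stated as a lower bound ("over 2,264 hours")


def mean_clip_seconds(total_hours, clip_count):
    """Average clip length in seconds implied by a total duration and a clip count."""
    return total_hours * 3600 / clip_count


mean_seconds = mean_clip_seconds(TOTAL_HOURS, CLIP_COUNT)
print(f"{mean_seconds:.1f} s per clip on average")
```

An average of roughly ten seconds is consistent with utterance-level clips, but the actual range should be checked against the data after download.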

Use Cases

Train automatic speech recognition (ASR) models on Hindi, Hinglish, and Indian English audio.
Develop language identification systems using the Hindi, Hinglish, and Indian English labels.
Study code-switching patterns in speech using the annotated Hinglish content.
Benchmark audio processing pipelines on a large, cleaned dataset.

Strengths

Large scale with 815,171 clips and over 2,264 hours of audio.
Covers three distinct language categories: Hindi, Hinglish, and Indian English.
Compiled from 14 public corpora and custom recordings, which suggests breadth of source material.
Cleaned and annotated with a consistent schema in a single Parquet file.

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
Description metadata is limited; actual data quality requires manual inspection after download.
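Because column-level documentation is absent, a reasonable first step after download is to read the schema from the Parquet footer before loading any audio. The following is a minimal sketch, assuming the third-party pyarrow package is installed; the file path is a placeholder, since the listing does not give the file name, and the helper names are illustrative only.

```python
def format_columns(columns):
    """Render (name, type) pairs as aligned text lines for quick review."""
    width = max((len(name) for name, _ in columns), default=0)
    return [f"{name.ljust(width)}  {dtype}" for name, dtype in columns]


def describe_parquet(path):
    """Return (row_count, [(column_name, type_string), ...]) for a Parquet file.

    Only the file footer is read, so no audio payload is loaded into memory.
    """
    import pyarrow.parquet as pq  # third-party dependency, imported lazily

    parquet_file = pq.ParquetFile(path)
    schema = parquet_file.schema_arrow
    columns = [(name, str(dtype)) for name, dtype in zip(schema.names, schema.types)]
    return parquet_file.metadata.num_rows, columns


# Example usage (the path below is hypothetical):
# rows, columns = describe_parquet("path/to/dataset.parquet")
# print(rows)
# print("\n".join(format_columns(columns)))
```

Comparing the reported row count with the 815,171 clips stated in the description is a quick way to confirm that the download is complete.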

Provenance

Source: Compiled from 14 public corpora and original custom recordings.
Collection Method: Unified from multiple sources into a single Parquet dataset with consistent schema.
Time Range: Not specified.
Freshness: Last updated 2026-04-24 11:41:41; freshness should be verified.
Geography: Likely focused on India, given the languages covered.

License is unknown; terms of use must be verified before the data is used.
