Description
815,171 audio clips totaling over 2,264 hours of speech, compiled by agarwalayushi and last updated in April 2026. This dataset covers Hindi, Hinglish (Hindi-English code-switching), and Indian English, sourced from 14 public corpora and custom recordings, unified into a single Parquet file.
Use Cases
Train automatic speech recognition (ASR) models on Hindi, Hinglish, and Indian English audio.
Develop language identification systems using the Hindi, Hinglish, and Indian English labels.
Research code-switching patterns in speech using the annotated Hinglish content.
Benchmark audio processing pipelines on a large, cleaned dataset.
Strengths
Large scale with 815,171 clips and over 2,264 hours of audio.
Covers three distinct language categories: Hindi, Hinglish, and Indian English.
Draws on 14 public corpora plus custom recordings, which suggests a broad range of sources.
Cleaned and annotated with a consistent schema in a single Parquet file.
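The two headline figures imply an average clip length, which is a quick sanity check when planning batching or storage. The following is a minimal sketch using only the counts stated in the description; because the total is given as "over 2,264 hours", the result is a lower bound, and the variable names are illustrative.

```python
# Figures taken from the dataset description.
clip_count = 815_171
total_hours = 2_264  # stated as "over 2,264 hours", so this is a lower bound

# Convert hours to seconds and divide by the number of clips.
total_seconds = total_hours * 3600
avg_clip_seconds = total_seconds / clip_count

print(f"Average clip length: about {avg_clip_seconds:.2f} seconds")
```

An average near 10 seconds is consistent with utterance-level clips, though the actual distribution of clip lengths can only be confirmed from the data itself.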
Limitations
Column-level documentation is absent; field semantics must be inferred after download.
The description provides little metadata; data quality can only be assessed by inspecting the files after download.
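Since field semantics have to be inferred after download, a small column profiler can help. The sketch below uses only the Python standard library and works on rows represented as dictionaries. The column names in `sample_rows` (`text`, `language`, `duration`) are hypothetical placeholders, not the dataset's documented schema, and `profile_columns` is an illustrative helper rather than part of any library.

```python
from collections import Counter
from typing import Any, Iterable


def profile_columns(rows: Iterable[dict[str, Any]], max_examples: int = 3) -> dict[str, dict[str, Any]]:
    """Summarize each column: observed Python types, null count, and a few example values."""
    profile: dict[str, dict[str, Any]] = {}
    for row in rows:
        for name, value in row.items():
            col = profile.setdefault(name, {"types": Counter(), "nulls": 0, "examples": []})
            if value is None:
                col["nulls"] += 1
                continue
            col["types"][type(value).__name__] += 1
            # Keep a handful of distinct example values per column.
            if len(col["examples"]) < max_examples and value not in col["examples"]:
                col["examples"].append(value)
    return profile


# Hypothetical rows; the real column names are undocumented and may differ.
sample_rows = [
    {"text": "namaste", "language": "hindi", "duration": 3.2},
    {"text": "kal meeting hai", "language": "hinglish", "duration": 4.8},
    {"text": None, "language": "indian_english", "duration": 2.5},
]

summary = profile_columns(sample_rows)
for name, info in summary.items():
    print(name, dict(info["types"]), "nulls:", info["nulls"], "examples:", info["examples"])
```

With the real file, rows in this shape could be obtained from any Parquet reader, for example `pyarrow.parquet.read_table(path).to_pylist()` if pyarrow is installed. For a file of this size, profiling a sample of rows is likely more practical than loading everything at once.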
Provenance
Source
Compiled from 14 public corpora and original custom recordings.
Collection Method
Unified from multiple sources into a single Parquet dataset with consistent schema.
Time Range
Not specified.
Freshness
Last updated 2026-04-24 11:41:41; freshness should be verified.
Geography
Likely focused on India, given the languages covered.
License
Unknown; terms of use must be verified before use.