Description
700 hours of processed speech data for Hindi, English, and Hinglish (code-mixed) text-to-speech applications. The dataset, created by adjaysagar, includes train and validation manifests and a preprocessing script. It was last updated in February 2026.
Use Cases
Train a TTS model on 700 hours of processed audio for Hindi, English, and Hinglish speech synthesis.
Use the provided train.jsonl and val.jsonl manifests to manage data splits for model training and validation.
Apply the included preprocessing script to replicate the dataset preparation pipeline for custom TTS projects.
Develop multilingual or code-mixed speech synthesis systems leveraging the Hinglish audio content.
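As a rough sketch of working with the train.jsonl and val.jsonl manifests, the snippet below reads a JSON Lines file into a list of records and reports which fields appear. The manifest schema is not documented here, so the code assumes only that each non-empty line is one JSON object; the helper names `load_manifest` and `summarize_fields` are hypothetical and not part of the dataset.

```python
import json
from pathlib import Path


def load_manifest(path):
    """Read a JSON Lines manifest into a list, one parsed value per non-empty line."""
    records = []
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            line = line.strip()
            if not line:
                continue  # tolerate blank lines between records
            try:
                records.append(json.loads(line))
            except json.JSONDecodeError as exc:
                raise ValueError(f"{path}:{line_number}: invalid JSON") from exc
    return records


def summarize_fields(records):
    """Return the sorted field names seen across all dict records.

    Assumption: records are JSON objects; the actual schema is unknown,
    so this only discovers keys rather than relying on any of them.
    """
    keys = set()
    for record in records:
        if isinstance(record, dict):
            keys.update(record.keys())
    return sorted(keys)
```

Calling `summarize_fields(load_manifest("train.jsonl"))` once the files are downloaded is one way to inspect the available fields before writing a data loader around them.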
Strengths
Substantial 700-hour volume of audio data suitable for training TTS models.
Includes a preprocessing script, providing transparency into the data preparation methodology.
Covers three distinct linguistic categories: Hindi, English, and code-mixed Hinglish.
Limitations
The audio files are contained in a password-protected archive, so access requires contacting the dataset maintainer.
Specific details on audio quality metrics, speaker demographics, or recording conditions are not provided.
No information on audio file formats, sample rates, or licensing terms is available.
Provenance
Source
Hugging Face
Collection Method
Processed speech data; specific gathering method unknown.
Freshness
Last updated in February 2026.
The primary data archive (voice_data.zip) is password-protected; users must contact the dataset maintainer for access.
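Once the maintainer has supplied a password, extracting voice_data.zip might look like the sketch below. It uses Python's standard `zipfile` module, which can decrypt only the legacy ZIP encryption scheme; whether this archive uses that scheme is not stated, so an AES-encrypted archive would need an external tool instead. The helper name `extract_archive` and the destination directory are hypothetical.

```python
import zipfile
from pathlib import Path


def extract_archive(archive_path, dest_dir, password=None):
    """Extract a ZIP archive into dest_dir and return the member names.

    Assumption: if the archive is encrypted, it uses legacy ZIP encryption,
    the only scheme the standard-library zipfile module can decrypt.
    """
    dest = Path(dest_dir)
    dest.mkdir(parents=True, exist_ok=True)
    # zipfile expects the password as bytes, not str.
    pwd = password.encode("utf-8") if password is not None else None
    with zipfile.ZipFile(archive_path) as archive:
        archive.extractall(path=dest, pwd=pwd)
        return archive.namelist()
```

For example, `extract_archive("voice_data.zip", "voice_data", password="...")` would unpack the archive into a `voice_data` directory, with the real password substituted for the placeholder. A wrong password surfaces as a `RuntimeError` from `zipfile`.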