SPRINGLab's IndicTTS Malayalam dataset contains high-quality speech recordings with transcriptions for text-to-speech research. The dataset includes approximately 17.89 hours of audio from male and female speakers, sourced from the Indic TTS Database project. It was last updated on January 25, 2025.
Use Cases
- Train Malayalam text-to-speech models on the recordings and their transcriptions.
- Benchmark speech synthesis systems on male and female speaker data.
- Develop multilingual TTS pipelines that draw on Indic language resources.
- Study Malayalam prosody and pronunciation using the transcribed speech.
Strengths
- Contains approximately 17.89 hours of audio data.
- Includes recordings from both male (9.7 hours) and female (8.19 hours) speakers.
- Audio files are in WAV format, which typically stores uncompressed audio and so avoids lossy-compression artifacts.
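WAV is only a container, so the format alone does not establish recording quality. A minimal sketch of how the basic audio properties of a file could be checked, using Python's standard `wave` module. The helper names `wav_summary` and `make_silent_wav` are hypothetical, and the 16 kHz mono file is synthetic demonstration data, not a statement about this dataset's actual sample rate or channel layout.

```python
import io
import wave


def wav_summary(source):
    """Return basic format facts for a WAV file path or binary file-like object.

    The standard-library reader handles PCM WAV data and raises wave.Error
    for encodings it does not support.
    """
    with wave.open(source, "rb") as wf:
        frames = wf.getnframes()
        rate = wf.getframerate()
        return {
            "channels": wf.getnchannels(),
            "sample_width_bytes": wf.getsampwidth(),
            "sample_rate_hz": rate,
            "duration_s": frames / rate if rate else 0.0,
        }


def make_silent_wav(seconds=1.0, rate=16000):
    """Build an in-memory mono 16-bit WAV of silence (demonstration data only)."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)  # 2 bytes per sample = 16-bit
        wf.setframerate(rate)
        wf.writeframes(b"\x00\x00" * int(seconds * rate))
    buf.seek(0)
    return buf


if __name__ == "__main__":
    print(wav_summary(make_silent_wav(seconds=0.5)))
```

Passing the path of a downloaded recording to `wav_summary` in place of the synthetic buffer would report that file's properties. Summing `duration_s` per speaker is one way to cross-check the stated hour totals.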
Limitations
- Column-level documentation is absent; field semantics must be inferred after download.
- Row count and total file size are unknown, which makes it harder to assess suitability before download.
- The description metadata is limited; actual data quality requires manual inspection after download.
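Because the fields are undocumented and the row count is unknown, a first inspection pass after download might simply count rows and record which fields appear and what value types they hold. The sketch below assumes the records can be iterated as dictionaries by whatever loader is used. The function `summarize_rows` and the sample field names (`text`, `gender`, `duration_s`) are hypothetical illustrations, not the dataset's actual schema.

```python
from collections import Counter, defaultdict


def summarize_rows(rows, max_examples=2):
    """Count rows and collect, per field, the observed value types and a few sample values."""
    row_count = 0
    types = defaultdict(Counter)
    examples = defaultdict(list)
    for row in rows:
        row_count += 1
        for field, value in row.items():
            types[field][type(value).__name__] += 1
            if len(examples[field]) < max_examples:
                examples[field].append(value)
    return {
        "row_count": row_count,
        "fields": {
            field: {"types": dict(types[field]), "examples": examples[field]}
            for field in types
        },
    }


# Hypothetical rows; the real field names are undocumented and may differ.
sample_rows = [
    {"text": "example transcription", "gender": "female", "duration_s": 3.2},
    {"text": "another transcription", "gender": "male", "duration_s": 4.1},
]
report = summarize_rows(sample_rows)
print(report["row_count"])
print(sorted(report["fields"]))
```

A field that shows more than one value type, or appears in fewer rows than the total, is a good candidate for closer manual inspection.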
Provenance
- Source: Indic TTS Database project.
- Collection Method: Derived from Malayalam monolingual recordings.
- Freshness: Last updated 2025-01-25 05:52:03.