This dataset contains nearly 10 hours of studio-quality English speech from a single speaker, recreating expressive utterances from the Switchboard-1 Telephone Speech Corpus. The recordings carry labels for paralanguage and disfluencies and are organized into three data components intended to simulate realistic informal conversation.
Use Cases
- Train expressive text-to-speech (TTS) models capable of generating natural disfluencies from text inputs
- Develop prosody modeling systems using the studio-quality audio and corresponding Switchboard-derived transcripts
- Evaluate how well speech recognition systems handle informal, disfluent speech when the audio itself is high-fidelity
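For the speech recognition use case, one common measure is word error rate (WER), computed once on the full transcript and once with filler words removed, so that errors caused by disfluencies can be separated from other errors. The sketch below is a minimal, dependency-free illustration. The function name `word_error_rate`, the `ignore` parameter, and the filler set `{"uh", "um"}` are assumptions made for this example and do not reflect the dataset's actual label format.

```python
def word_error_rate(reference, hypothesis, ignore=frozenset()):
    """Word-level edit distance divided by the reference length.

    `ignore` is a hypothetical set of tokens (e.g. fillers) dropped from
    both strings before scoring; it is not part of the dataset's schema.
    """
    ref = [w for w in reference.lower().split() if w not in ignore]
    hyp = [w for w in hypothesis.lower().split() if w not in ignore]
    if not ref:
        raise ValueError("reference has no words after filtering")

    # Levenshtein distance over words, keeping only one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Assumed example strings, not taken from the dataset.
reference = "uh i think um we should go"
hypothesis = "i think we should go"

wer_full = word_error_rate(reference, hypothesis)
wer_no_fillers = word_error_rate(reference, hypothesis, ignore={"uh", "um"})
```

A large gap between the two scores suggests the recognizer is mostly dropping or mangling disfluencies rather than misrecognizing content words.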
Strengths
- Nearly 10 hours of single-speaker studio-quality audio recordings
- Derived from the Switchboard-1 Telephone Speech Corpus to capture realistic informal speech patterns
- Includes labeled paralanguage and disfluency markers for expressive synthesis
- Provides three data components that support predicting and synthesizing paralanguage from text