The dataset contains 47,000 hours of speech audio and 19 million fine-grained speaking style captions, organized into splits such as FCaps-PSCBase and FCaps-Emilia. The captions are open-ended descriptions of vocal characteristics, intended for large-scale speech modeling and synthesis.
Use Cases
- Train text-to-speech models using the fine-grained speaking style descriptions to control prosody and emotion.
- Develop automated speech captioning systems to generate natural language descriptions of vocal styles.
- Fine-tune audio-language models for cross-modal retrieval using the 19 million captions.
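As a rough illustration of the first use case, the sketch below pairs a style caption with a transcript to form a text prompt for a caption-conditioned text-to-speech model. The record fields (`audio_path`, `transcript`, `style_caption`), the `[style]`/`[text]` markers, and the example values are all hypothetical; the dataset's actual schema and any model's expected prompt format may differ.

```python
from dataclasses import dataclass


@dataclass
class StyleCaptionedUtterance:
    """One training example. Field names are hypothetical, not the dataset's schema."""
    audio_path: str      # path to the speech clip (assumed)
    transcript: str      # what is said (assumed to be available)
    style_caption: str   # open-ended description of how it is said


def build_tts_prompt(example: StyleCaptionedUtterance) -> str:
    """Join the style caption and transcript into a single conditioning string.

    The "[style] ... [text] ..." layout is an illustrative convention only.
    """
    caption = example.style_caption.strip()
    text = example.transcript.strip()
    return f"[style] {caption} [text] {text}"


# Made-up example values, for illustration only.
sample = StyleCaptionedUtterance(
    audio_path="clips/000001.wav",
    transcript="Please take a seat.",
    style_caption="A calm, low-pitched voice speaking slowly.",
)
prompt = build_tts_prompt(sample)
print(prompt)
```

Keeping the caption and transcript in separate, clearly marked segments makes it straightforward to vary the style description while holding the spoken text fixed.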
Strengths
- 47,000 hours of speech audio data
- 19 million open-ended and fine-grained speaking style captions
- Includes FCaps-PSCBase, dev, test, and FCaps-Emilia subsets
- Integrates English-language audio from the Emilia-Dataset
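For a sense of scale, the two headline figures can be combined into a back-of-the-envelope caption density. This is a sketch that assumes captions are spread evenly over the audio, which the source does not state; a clip may carry several captions, so the per-caption duration is only an average.

```python
# Headline figures from the dataset description.
TOTAL_HOURS = 47_000
TOTAL_CAPTIONS = 19_000_000

# Average number of captions per hour of audio.
captions_per_hour = TOTAL_CAPTIONS / TOTAL_HOURS

# Average seconds of audio per caption (assumes an even spread; see note above).
seconds_per_caption = TOTAL_HOURS * 3600 / TOTAL_CAPTIONS

print(f"{captions_per_hour:.0f} captions per hour")      # 404 captions per hour
print(f"{seconds_per_caption:.1f} seconds per caption")  # 8.9 seconds per caption
```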