Raw Emocean is an English speech dataset designed for training autoregressive text-to-speech (TTS) models. It contains 8,649 audio segments totaling 15.39 hours, sourced from 22 videos, with an average segment duration of 6.4 seconds. The dataset was created by somu9 and was last updated on Hugging Face in April 2026.
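The headline figures can be cross-checked with quick arithmetic. The following is a minimal sketch that uses only the numbers quoted above; the variable names are illustrative, and because the average duration is rounded, the result is approximate.

```python
# Figures quoted in the dataset summary.
num_segments = 8_649
num_videos = 22
avg_duration_s = 6.4      # rounded average, so the result is approximate
stated_total_h = 15.39

# Total duration implied by the segment count and the average length.
implied_total_h = num_segments * avg_duration_s / 3600

print(f"implied total: {implied_total_h:.2f} h (stated: {stated_total_h} h)")
print(f"average segments per source video: {num_segments / num_videos:.0f}")
```

The implied total comes out within a few hundredths of an hour of the stated 15.39 hours, so the segment count, average duration, and total are mutually consistent.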
Use Cases
- Training autoregressive text-to-speech models, which is the dataset's stated purpose.
- Evaluating speech synthesis quality, using the provided signal-to-noise ratio (SNR) figures as a reference for recording quality.
- Benchmarking TTS model performance on segments within a defined duration range (3.0 s to 8.0 s).
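The summary does not say how the SNR figures were measured. As a point of reference only, the sketch below uses the standard power-ratio definition, SNR = 10 * log10(P_signal / P_noise). The function name `snr_db` and the toy sample values are assumptions for illustration, and separating speech from noise in a real recording is assumed to happen elsewhere.

```python
import math

def snr_db(signal, noise):
    """Return the signal-to-noise ratio in dB from two sample sequences."""
    # Mean power of each sequence (mean of squared samples).
    p_signal = sum(x * x for x in signal) / len(signal)
    p_noise = sum(x * x for x in noise) / len(noise)
    return 10 * math.log10(p_signal / p_noise)

# Toy example: the noise has one tenth of the signal's amplitude,
# so its power is one hundredth of the signal's power.
signal = [1.0, -1.0, 1.0, -1.0]
noise = [0.1, -0.1, 0.1, -0.1]
print(round(snr_db(signal, noise), 1))
```

A power ratio of 100 corresponds to 20 dB, which gives a sense of scale for the much higher average reported for this dataset.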
Strengths
- Contains 8,649 audio segments with a total duration of 15.39 hours.
- Provides detailed audio specifications including a sample rate of 24,000 Hz, 16-bit depth, and an average SNR of 49.1 dB.
- Segment durations are confined to a range of 3.0 to 8.0 seconds, which may help keep model inputs consistent in length.
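The stated audio specifications can be verified on downloaded files. The sketch below assumes the segments are available as uncompressed WAV files, which the summary does not confirm; `check_segment` is a hypothetical helper built on Python's standard `wave` module, and the channel count is not checked because it is not documented.

```python
import wave

# Values quoted in the dataset summary.
EXPECTED_RATE_HZ = 24_000
EXPECTED_SAMPLE_WIDTH_BYTES = 2   # 16-bit samples
MIN_DURATION_S, MAX_DURATION_S = 3.0, 8.0

def check_segment(path):
    """Compare one WAV file against the stated specs (hypothetical helper)."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        width = wav.getsampwidth()
        duration_s = wav.getnframes() / rate
    return {
        "rate_ok": rate == EXPECTED_RATE_HZ,
        "depth_ok": width == EXPECTED_SAMPLE_WIDTH_BYTES,
        "duration_ok": MIN_DURATION_S <= duration_s <= MAX_DURATION_S,
        "duration_s": duration_s,
    }
```

Running this over every file and counting the failures gives a quick view of how closely the data matches the description. If the audio is stored in another container or codec, a different reader would be needed.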
Limitations
- Column-level documentation is absent; field semantics must be inferred after download.
- The source videos and speaker diversity are not described, so potential bias cannot be assessed from the documentation alone.
- Last updated 2026-04-24 16:07:20; freshness should be verified.
Provenance
- Source: somu9
- Collection method: Likely extracted from 22 source videos.
- Freshness: 2026-04-24 16:07:20