Libri-Light is a dataset of 60,000 hours of unlabeled English speech from audiobooks. It serves as a benchmark for training automatic speech recognition (ASR) systems with limited or no supervision.
Use Cases
- Train self-supervised speech models on 60,000 hours of unlabeled English audiobook audio.
- Benchmark semi-supervised ASR systems using the unlabeled speech data for pre-training.
- Develop unsupervised feature extraction methods for raw speech waveforms from audiobooks.
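As a rough illustration of the pre-training and feature-extraction use cases, the sketch below cuts one long recording into fixed-length, overlapping windows, a common first step before feeding raw waveforms to a self-supervised model. It is a minimal sketch and not part of the dataset's tooling: the function name `iter_windows`, the 16 kHz sample rate, and the 2-second window with a 1-second hop are all assumptions chosen for illustration, and the clip is synthetic silence standing in for decoded audio.

```python
from typing import Iterator, List, Sequence

# Assumption: the card does not state a sample rate; 16 kHz is used for illustration.
SAMPLE_RATE_HZ = 16_000


def iter_windows(
    waveform: Sequence[float],
    window_seconds: float = 2.0,
    hop_seconds: float = 1.0,
    sample_rate_hz: int = SAMPLE_RATE_HZ,
) -> Iterator[List[float]]:
    """Yield fixed-length, possibly overlapping windows from one waveform.

    A trailing remainder shorter than one full window is dropped.
    """
    window = int(window_seconds * sample_rate_hz)
    hop = int(hop_seconds * sample_rate_hz)
    if window <= 0 or hop <= 0:
        raise ValueError("window and hop must be at least one sample long")
    # Stop at the last start index that still leaves room for a full window.
    for start in range(0, len(waveform) - window + 1, hop):
        yield list(waveform[start:start + window])


# Hypothetical stand-in for a decoded audiobook clip: 5 seconds of silence.
clip = [0.0] * (5 * SAMPLE_RATE_HZ)
windows = list(iter_windows(clip))
print(len(windows), len(windows[0]))
```

With real recordings, the same generator could be applied to each decoded file in turn. The window and hop lengths are tuning choices, not values prescribed by the dataset.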
Strengths
- Contains 60,000 hours of speech audio, providing a large-scale resource for unsupervised learning.
- Specifically designed as a benchmark for ASR with limited supervision, offering a clear evaluation target.
- Data is sourced from audiobooks, which typically provide clear, read speech in English.
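To give a sense of what 60,000 hours means in practice, the following back-of-the-envelope calculation estimates the uncompressed size of the audio. Only the hour count comes from this card. The sample rate, bit depth, and channel count are assumptions, and the files as distributed may be compressed and therefore considerably smaller.

```python
HOURS = 60_000            # from the card
SAMPLE_RATE_HZ = 16_000   # assumption: not stated in the card
BYTES_PER_SAMPLE = 2      # assumption: 16-bit PCM
CHANNELS = 1              # assumption: mono


def uncompressed_bytes(
    hours: int, sample_rate_hz: int, bytes_per_sample: int, channels: int
) -> int:
    """Return the raw PCM size in bytes for the given amount of audio."""
    seconds = hours * 3600
    return seconds * sample_rate_hz * bytes_per_sample * channels


total = uncompressed_bytes(HOURS, SAMPLE_RATE_HZ, BYTES_PER_SAMPLE, CHANNELS)
print(f"{total / 1e12:.2f} TB uncompressed")
```

Under these assumptions the raw audio comes to roughly 6.9 TB, which suggests planning for streaming or sharded access instead of loading everything locally.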
Limitations
- The speech data is entirely unlabeled, so supervised tasks require transcripts or labels from another source.
- The dataset consists solely of audiobook speech, which may not represent conversational or noisy acoustic environments.
- No column- or feature-level metadata is provided, which limits structured analysis.
Provenance
- Source: HugoLaurencon via Hugging Face
- Collection Method: Aggregated from audiobooks.
- Time Range: not specified
- Freshness: not specified
- Geography: not specified