Librispeech 100H is a subset of the LibriSpeech corpus containing 100 hours of English speech audio. The dataset was created by namnv1906 and uploaded to Hugging Face in May 2022. It is derived from public domain audiobooks from the LibriVox project.
Use Cases
- Train acoustic models on 100 hours of English speech audio.
- Benchmark ASR system accuracy using aligned audio and transcription pairs.
- Develop speaker-independent models using data from multiple public domain audiobook readers.
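For the benchmarking use case, ASR accuracy is commonly reported as word error rate (WER): the word-level edit distance between a reference transcription and the system output, divided by the number of reference words. The following is a minimal sketch, not part of the dataset or any particular toolkit. The function name `word_error_rate` is hypothetical, and the text normalization (lowercasing and whitespace splitting) is an assumption that may need adjusting to match how the transcriptions are formatted.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return WER = (substitutions + deletions + insertions) / reference length.

    Assumption: lowercasing and whitespace tokenization are sufficient
    normalization; real evaluations may also strip punctuation.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # Word-level Levenshtein distance, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of six reference words.
    print(word_error_rate("the cat sat on the mat", "the cat sat on mat"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference. For a whole test set, it is more usual to sum the edit counts and reference lengths across utterances than to average per-utterance scores.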
Strengths
- Contains 100 hours of speech audio paired with transcriptions.
- Derived from a well-known, public domain source corpus.
Limitations
- Limited to 100 hours, a small fraction of the full LibriSpeech corpus (roughly 1,000 hours).
- Content is restricted to read speech from English audiobooks, so diversity in accents, speaking styles, and domains is limited.
Provenance
- Source: LibriSpeech corpus (LibriVox audiobooks).
- Collection Method: Derived from public domain audiobook recordings.
- Freshness: Last updated on Hugging Face in May 2022.