This dataset contains 2,620 high-quality audio clips with transcriptions, derived from public domain audiobooks, for evaluating speech recognition systems. The data is categorized as "clean" because its recordings have lower noise levels and higher recording quality than the LibriSpeech subsets labeled "other".
Use Cases
- Calculate Word Error Rate (WER) for speech-to-text models by comparing predictions to the text field
- Test the zero-shot capabilities of Audio Large Language Models using the audio input and text ground truth
- Perform speaker verification or identification tasks using the speaker_id labels
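As an illustration of the first use case, WER can be computed as the word-level edit distance between a prediction and the reference transcription, divided by the number of reference words. The sketch below is a minimal, dependency-free version. The function names are illustrative and not part of the dataset, and lowercasing and whitespace splitting are assumed to be adequate normalization.

```python
from typing import List, Sequence


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences."""
    # prev[j] holds the distance between the reference prefix seen so far
    # and the first j hypothesis tokens.
    prev: List[int] = list(range(len(hyp) + 1))
    for i, ref_tok in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_tok in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_tok == hyp_tok else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def tokenize(text: str) -> List[str]:
    # Assumed normalization: lowercase and split on whitespace.
    return text.lower().split()


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER for a single utterance."""
    ref = tokenize(reference)
    if not ref:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref, tokenize(hypothesis)) / len(ref)


def corpus_wer(references: Sequence[str], hypotheses: Sequence[str]) -> float:
    """Corpus-level WER: total edits divided by total reference words."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    edits = 0
    words = 0
    for ref_text, hyp_text in zip(references, hypotheses):
        ref = tokenize(ref_text)
        edits += edit_distance(ref, tokenize(hyp_text))
        words += len(ref)
    if words == 0:
        raise ValueError("references contain no words")
    return edits / words
```

Corpus-level WER pools edits and reference words across all utterances rather than averaging per-utterance scores, so short clips do not dominate the result. In practice, `references` would come from the text field and `hypotheses` from the model under test.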
Strengths
- 2,620 audio samples paired with normalized text transcriptions
- Audio files provided as 16 kHz FLAC, a lossless format that preserves signal quality
- Metadata includes speaker_id, chapter_id, and id for tracking source audiobooks
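One way the metadata could support speaker verification is to build trial pairs labeled by whether both clips share a speaker_id. The sketch below assumes each record is a plain dict with `id` and `speaker_id` keys. That record shape, the helper names, and the sample values in the usage note are hypothetical rather than taken from the dataset.

```python
from collections import defaultdict
from itertools import combinations
from typing import Dict, Hashable, List, Mapping, Sequence, Tuple


def group_by_speaker(records: Sequence[Mapping]) -> Dict[Hashable, List[str]]:
    """Map each speaker_id to the list of clip ids recorded by that speaker."""
    by_speaker: Dict[Hashable, List[str]] = defaultdict(list)
    for rec in records:
        by_speaker[rec["speaker_id"]].append(rec["id"])
    return dict(by_speaker)


def make_trials(records: Sequence[Mapping]) -> List[Tuple[str, str, bool]]:
    """Return (id_a, id_b, same_speaker) for every unordered pair of clips."""
    return [
        (a["id"], b["id"], a["speaker_id"] == b["speaker_id"])
        for a, b in combinations(records, 2)
    ]
```

For example, with made-up records `[{"id": "utt-a", "speaker_id": 1}, {"id": "utt-b", "speaker_id": 1}, {"id": "utt-c", "speaker_id": 2}]`, `make_trials` yields three pairs, of which only the `utt-a`/`utt-b` pair is marked as same-speaker. Enumerating all pairs grows quadratically with the number of clips, so sampling a subset may be preferable for the full 2,620 samples.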