The LibriSpeech language modeling resources comprise roughly 800 million words of normalized text and pre-trained n-gram models derived from 14,500 public domain books. Together they supply the training text, vocabulary, and ARPA-format n-gram models used with the LibriSpeech ASR corpus.
Use Cases
- Train neural language models using the 'librispeech-lm-norm.txt.gz' corpus to improve word error rates in speech recognition
- Integrate the 4-gram ARPA models into a decoding pipeline to transcribe LibriSpeech audio
- Perform vocabulary expansion for speech-to-text systems using the 'librispeech-vocab.txt' list
- Benchmark n-gram smoothing techniques against the provided 3-gram and 4-gram ARPA models as baselines
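Before wiring an ARPA model into a decoder or a benchmark, it can help to confirm its order and size. The following is a minimal sketch, not part of the distributed resources: it reads the n-gram counts from the standard `\data\` header of an ARPA file. The helper names `open_text` and `read_arpa_counts` are hypothetical, and the path in the usage note is a placeholder.

```python
import gzip
import re


def open_text(path):
    """Open a plain or gzip-compressed text file for reading (hypothetical helper)."""
    if str(path).endswith(".gz"):
        return gzip.open(path, "rt", encoding="utf-8")
    return open(path, "rt", encoding="utf-8")


def read_arpa_counts(path):
    """Return {order: count} parsed from the header of an ARPA file.

    Assumes the standard layout: a "\\data\\" line, then lines such as
    "ngram 1=200000", then the first "\\1-grams:" section.
    """
    counts = {}
    in_header = False
    with open_text(path) as f:
        for line in f:
            line = line.strip()
            if line == "\\data\\":
                in_header = True
                continue
            if not in_header:
                continue
            match = re.match(r"ngram\s+(\d+)\s*=\s*(\d+)$", line)
            if match:
                counts[int(match.group(1))] = int(match.group(2))
            elif line.startswith("\\"):
                break  # the first "\N-grams:" section ends the header
    return counts
```

Calling `read_arpa_counts("path/to/model.arpa.gz")` on a 4-gram model should return a dictionary with keys 1 through 4. Only the header is read, so the call stays fast even on large files.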
Strengths
- 800-million-word normalized text corpus provided in the 'librispeech-lm-norm.txt.gz' file
- Pre-trained 3-gram and 4-gram language models provided in ARPA format
- Vocabulary list 'librispeech-vocab.txt' containing 200,000 unique words
- Text data sourced from 14,500 public domain books from Project Gutenberg
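One way to put the vocabulary list to work is to measure how much of a new transcript it covers. The sketch below makes two assumptions that should be checked against the actual files: that 'librispeech-vocab.txt' holds one word per line, and that its entries are uppercase like the normalized text. The function names `load_vocab` and `oov_rate` are hypothetical.

```python
def load_vocab(path):
    """Load a vocabulary file into a set (assumes one word per line)."""
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}


def oov_rate(text, vocab):
    """Return the fraction of whitespace-separated tokens not found in vocab.

    Tokens are uppercased first, on the assumption that the vocabulary
    is stored in uppercase.
    """
    tokens = text.upper().split()
    if not tokens:
        return 0.0
    missing = sum(1 for token in tokens if token not in vocab)
    return missing / len(tokens)
```

A high out-of-vocabulary rate on in-domain text suggests the tokenization or casing assumption is wrong, rather than that the words are genuinely missing.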