LibriSpeech English Speech Corpus with 1000 Hours of Audio

Name: LibriSpeech English Speech Corpus with 1000 Hours of Audio
Creator: patrickvonplaten
Published: 2022-03-02T23:29:22
Keywords: Regionus

by patrickvonplaten, updated 4y ago

Available on 1 platform

Description

The LibriSpeech corpus contains approximately 1000 hours of read English speech, sampled at 16 kHz. It was prepared by Vassil Panayotov with assistance from Daniel Povey, and is derived from read audiobooks in the LibriVox project.
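To put those figures in perspective, the sketch below works out how many samples roughly 1000 hours at 16 kHz amounts to, and how much memory the decoded audio would need as float32. It assumes single-channel audio and exactly 1000 hours; the function names are illustrative and not part of the dataset.

```python
SAMPLE_RATE_HZ = 16_000   # sampling rate stated in the description
BYTES_PER_FLOAT32 = 4     # size of one float32 sample


def total_samples(hours, sample_rate_hz=SAMPLE_RATE_HZ):
    """Number of samples in `hours` of single-channel audio."""
    return int(hours * 3600 * sample_rate_hz)


def float32_footprint_gb(hours, sample_rate_hz=SAMPLE_RATE_HZ):
    """Decoded size in gigabytes (10**9 bytes), assuming mono float32."""
    return total_samples(hours, sample_rate_hz) * BYTES_PER_FLOAT32 / 1e9


if __name__ == "__main__":
    print(total_samples(1000))         # 57600000000
    print(float32_footprint_gb(1000))  # about 230.4
```

At this scale the decoded corpus is unlikely to fit in memory at once, so decoding utterances lazily or in batches is usually the more practical choice.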

Use Cases

Train automatic speech recognition models on approximately 1000 hours of English speech data.
Segment and align audio data derived from LibriVox audiobooks for linguistic analysis.
Process audio stored in .flac format for conversion to float32 arrays using the provided mapping function.

Strengths

Approximately 1000 hours of audio data provides a substantial volume for training.
Audio is carefully segmented and aligned, indicating structured preparation.
Data is derived from the LibriVox project, a known source of public domain audiobooks.

Limitations

The audio is stored in .flac format, requiring conversion before typical array-based processing.
The dataset consists solely of read speech from audiobooks, which may not represent conversational or spontaneous speech patterns.
Specific details on speaker demographics, recording conditions, and transcript accuracy are not documented in the dataset description.

Provenance

Source: LibriVox project audiobooks.
Collection Method: Derived from read audiobooks, carefully segmented and aligned.
Time Range: not specified
Freshness: not specified
Geography: not specified

Audio files are stored in .flac format and must be converted to float32 arrays using the provided mapping function, which relies on the soundfile library. This Hugging Face entry is a dummy version of the dataset, so it may be a subset or placeholder rather than the full corpus.

Quality Score: D34

Description: 43
Source: 36
Reputation: 18
Access: 22

Community

11.4K downloads

1 like

0 views

Dataset Info

Author: patrickvonplaten
Created: Mar 2, 2022
Updated: Oct 14, 2021
Last synced: Apr 29, 2026

