Speech & Audio

LibriSpeech English Audio Corpus of 1000 Hours

Name: LibriSpeech English Audio Corpus of 1000 Hours
Creator: patrickvonplaten
Published: 2022-03-02T23:29:22
Keywords: region:us

by patrickvonplaten · Updated 4y ago

Available on 1 platform

Description

The LibriSpeech corpus contains approximately 1000 hours of read English speech, sampled at 16 kHz. It was prepared by Vassil Panayotov with assistance from Daniel Povey and is derived from audiobooks from the LibriVox project.
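
For a sense of scale, the two figures in the description (about 1000 hours, 16 kHz) can be turned into a rough estimate of the decoded size. This is only a back-of-the-envelope sketch: it assumes exactly 1000 hours, single-channel audio, and float32 samples, none of which this listing states precisely.

```python
# Rough size estimate for the decoded corpus.
HOURS = 1000               # the listing says "approximately 1000 hours"
SAMPLE_RATE_HZ = 16_000    # from the description
CHANNELS = 1               # assumption: mono audio (not stated in the listing)
BYTES_PER_SAMPLE = 4       # assumption: samples held as float32


def decoded_size(hours=HOURS, sample_rate=SAMPLE_RATE_HZ,
                 channels=CHANNELS, bytes_per_sample=BYTES_PER_SAMPLE):
    """Return (number of samples, number of bytes) for the decoded audio."""
    samples = hours * 3600 * sample_rate * channels
    return samples, samples * bytes_per_sample


samples, nbytes = decoded_size()
print(f"{samples:,} samples, {nbytes / 1e9:.1f} GB as float32")
# prints: 57,600,000,000 samples, 230.4 GB as float32
```

Under these assumptions the full corpus would not fit in memory on most machines once decoded, so reading files lazily or in batches is the more practical approach.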

Use Cases

Train automatic speech recognition models on 1000 hours of segmented and aligned English audio.
Analyze speech patterns and phonetics from read audiobook content.
Benchmark audio processing pipelines using the provided .flac format audio files.

Strengths

Approximately 1000 hours of audio data provides a substantial resource for speech tasks.
Audio is carefully segmented and aligned, facilitating direct use for training.
Data is derived from the LibriVox project, a known source of public domain audiobooks.

Limitations

Audio is stored in FLAC format and must be decoded into arrays (e.g., with the soundfile library) before use in typical ML workflows.
The dataset consists solely of read speech from audiobooks, which may not represent spontaneous conversational speech.
Specific details on speaker demographics, recording conditions, and data splits are not provided in this listing.

Provenance

Source: LibriVox project audiobooks.
Collection Method: Derived from read audiobooks, carefully segmented and aligned.
Time Range: not specified
Freshness: not specified
Geography: not specified

Audio files are in .flac format; users must convert them to float32 arrays using a library like soundfile, as demonstrated in the provided Python code snippet.
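
The snippet mentioned above does not appear in this listing. As a stand-in, here is a minimal sketch of such a conversion using `soundfile.read` with `dtype="float32"`. The helper names (`load_flac_as_float32`, `find_flac_files`), the optional `reader` parameter, and the directory name in the usage comment are illustrative assumptions, not part of the dataset or its original card.

```python
from pathlib import Path


def load_flac_as_float32(path, reader=None):
    """Decode one FLAC file into float32 samples plus its sample rate.

    `reader` is a hypothetical hook for swapping in another decoder;
    by default the third-party soundfile library is used.
    """
    if reader is None:
        import soundfile as sf  # third-party: pip install soundfile
        reader = sf.read
    # soundfile.read returns (data, samplerate); dtype selects the array type.
    audio, sample_rate = reader(str(path), dtype="float32")
    return audio, sample_rate


def find_flac_files(root):
    """Return all .flac files under `root`, sorted for a stable order."""
    return sorted(Path(root).rglob("*.flac"))


# Example usage (the directory name is a placeholder):
# for flac_path in find_flac_files("path/to/extracted/corpus"):
#     audio, sample_rate = load_flac_as_float32(flac_path)
#     print(flac_path.name, audio.shape, sample_rate)
```

Decoding one file at a time keeps memory use bounded, which matters for a corpus of this size.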

Quality Score

D33

Description

Source

Reputation

Access

Community

12 downloads

0 views

Dataset Info

Author: patrickvonplaten
Created: Mar 2, 2022
Updated: Aug 12, 2021
Last synced: Apr 29, 2026
