Unlabeled English Audiobook Speech for ASR Benchmarking

Name: Unlabeled English Audiobook Speech for ASR Benchmarking
Creator: HugoLaurencon
Published: 2022-05-09T14:31:34
Keywords: Regionus

by HugoLaurencon. Updated 1y ago.

Available on 1 platform

Description

Libri-light is a dataset of 60,000 hours of unlabeled English speech audio from audiobooks. It serves as a benchmark for training automatic speech recognition systems with limited or no supervision.
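To give a sense of what 60,000 hours means in practice, the following back-of-the-envelope sketch converts hours into sample counts and uncompressed size. The 16 kHz sample rate and 16-bit mono PCM encoding are assumptions made for illustration; this listing does not state the audio format, and the function names are hypothetical.

```python
# Rough scale estimate for 60,000 hours of speech audio.
# ASSUMPTIONS (not stated in the listing): 16 kHz sample rate, 16-bit mono PCM.
HOURS = 60_000
SAMPLE_RATE_HZ = 16_000   # assumed
BYTES_PER_SAMPLE = 2      # assumed 16-bit PCM


def total_samples(hours, sample_rate_hz=SAMPLE_RATE_HZ):
    """Number of audio samples in `hours` of audio at the given sample rate."""
    return int(hours * 3600 * sample_rate_hz)


def uncompressed_terabytes(hours, sample_rate_hz=SAMPLE_RATE_HZ,
                           bytes_per_sample=BYTES_PER_SAMPLE):
    """Uncompressed size in decimal terabytes (1 TB = 1e12 bytes)."""
    return total_samples(hours, sample_rate_hz) * bytes_per_sample / 1e12


samples = total_samples(HOURS)
size_tb = uncompressed_terabytes(HOURS)
print(f"{samples:,} samples, about {size_tb:.3f} TB uncompressed")
```

Under these assumptions the corpus is on the order of trillions of samples, so streaming or sharded loading is likely to be more practical than holding the audio in memory.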

Use Cases

Train self-supervised speech models on 60,000 hours of unlabeled English audiobook audio.
Benchmark semi-supervised ASR systems using the unlabeled speech data for pre-training.
Develop unsupervised feature extraction methods for raw speech waveforms from audiobooks.
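As a minimal sketch of label-free feature extraction from a raw waveform, the example below splits a signal into overlapping frames and computes a log-energy value per frame. This is a simple hand-crafted baseline, not a learned representation and not a method prescribed by the dataset. The 16 kHz rate, the 25 ms frame and 10 ms hop, the synthetic tone standing in for real audio, and the function names are all assumptions for illustration.

```python
import math


def frame_signal(samples, frame_len, hop):
    """Split a waveform into overlapping frames, dropping any trailing partial frame."""
    last_start = len(samples) - frame_len
    return [samples[i:i + frame_len] for i in range(0, last_start + 1, hop)]


def log_energy(frame, eps=1e-10):
    """Natural log of the mean squared amplitude; eps guards against log(0) on silence."""
    return math.log(sum(x * x for x in frame) / len(frame) + eps)


# Synthetic stand-in for real audio: 1 second of a 440 Hz tone at an assumed 16 kHz.
SAMPLE_RATE_HZ = 16_000
wave = [math.sin(2 * math.pi * 440 * t / SAMPLE_RATE_HZ) for t in range(SAMPLE_RATE_HZ)]

# 25 ms frames (400 samples) with a 10 ms hop (160 samples), a common ASR framing.
frames = frame_signal(wave, frame_len=400, hop=160)
features = [log_energy(f) for f in frames]
print(len(frames), "frames; first feature:", round(features[0], 4))
```

No transcripts are needed at any point, which is why framing of this kind is a common first step before self-supervised or unsupervised modelling of the frames.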

Strengths

Contains 60,000 hours of speech audio, providing a large-scale resource for unsupervised learning.
Specifically designed as a benchmark for ASR with limited supervision, offering a clear evaluation target.
Data is sourced from audiobooks, which typically provide clear, read speech in English.

Limitations

The speech data is entirely unlabeled, requiring significant effort or other resources for supervised tasks.
The dataset consists solely of audiobook speech, which may not represent conversational or noisy acoustic environments.
No column-level or feature-level metadata is provided, which limits structured analysis.

Provenance

Source: HugoLaurencon via Hugging Face
Collection Method: Aggregated from audiobooks.
Time Range: Not specified
Freshness: Not specified
Geography: Not specified

Quality Score

D31

Description: 24
Source: 41
Reputation: 31
Access: 22

Community

55 downloads

4 likes

0 views

Dataset Info

Author: HugoLaurencon
Created: May 9, 2022
Updated: Jul 15, 2024
Last synced: Apr 30, 2026
