Unsupervised Peoples Speech: 1 Million Hours of English Audio

Name: Unsupervised Peoples Speech: 1 Million Hours of English Audio
Creator: MLCommons
Published: 2023-11-10T02:40:09
Keywords: Modality: audio; Task IDs: audio language identification; Task Categories: audio classification, automatic speech recognition; Language: eng; Region: us; Unsupervised; Audio

Description

MLCommons provides over one million hours of English audio extracted from Archive.org for unsupervised speech research and large-scale speech model development. The collection features a diverse set of speakers and is released under CC-BY and CC-BY-SA licenses, which permit both academic and commercial use. It was last updated in February 2025.

Use Cases

Self-supervised pre-training of Automatic Speech Recognition (ASR) models using raw audio files
Audio Language Identification to distinguish English dialects or accents
Unsupervised feature extraction for audio classification tasks

Strengths

Over 1,000,000 hours of audio
Permissive CC-BY and CC-BY-SA licensing for commercial use
Diverse speaker representation from public archive sources

Limitations

Lack of ground-truth transcripts necessitates unsupervised or self-supervised approaches
Variable recording quality and background noise inherent in public archive sources

Provenance

Source: Archive.org
Collection Method: Extracted from public archives
Freshness: Last updated February 2025; source material from Archive.org is static once collected.
Geography: United States

Processing one million hours of audio requires significant storage and high-performance compute resources; users should verify specific Archive.org item licenses if redistributing individual files.
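As a rough illustration of the storage requirement, the sketch below estimates the total size of one million hours of audio at a few bitrates. The bitrates are assumptions chosen for illustration only; the listing does not state the actual encoding or file formats of the collection, and `storage_terabytes` is a hypothetical helper name.

```python
# Back-of-the-envelope storage estimate for one million hours of audio.
# The bitrates below are illustrative assumptions, not properties of the dataset.

HOURS = 1_000_000          # "over one million hours" per the description
SECONDS_PER_HOUR = 3600


def storage_terabytes(hours: float, kilobits_per_second: float) -> float:
    """Return the approximate size in decimal terabytes (1 TB = 10**12 bytes)."""
    total_bits = hours * SECONDS_PER_HOUR * kilobits_per_second * 1000
    return total_bits / 8 / 1e12


# Assumed encodings (hypothetical; check the actual files before planning capacity).
ASSUMED_BITRATES_KBPS = {
    "compressed, 64 kbps": 64,
    "compressed, 128 kbps": 128,
    "16 kHz 16-bit mono PCM": 256,  # 16,000 samples/s * 16 bits = 256 kbps
}

estimates = {
    name: storage_terabytes(HOURS, kbps)
    for name, kbps in ASSUMED_BITRATES_KBPS.items()
}

for name, tb in estimates.items():
    print(f"{name}: ~{tb:,.1f} TB")
```

Under these assumed bitrates the estimate ranges from roughly 29 TB to roughly 115 TB, before any intermediate features or checkpoints are stored.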

Quality Score

Overall: D37
Description: 39
Source: 36
Reputation: 46
Access: 22

Community

Downloads: 21.3K
Likes: 74
Views: 0

Dataset Info

Author: MLCommons
Created: Nov 10, 2023
Updated: Feb 27, 2025
Last synced: Jun 1, 2026

