People's Speech: 30,000+ Hours of Transcribed English Audio

Name: People's Speech: 30,000+ Hours of Transcribed English Audio
Creator: MLCommons
Published: 2022-08-16T14:21:49
Keywords: source_datasets:original, language_creators:machine-generated, language_creators:crowdsourced, annotations_creators:crowdsourced, task_categories:automatic-speech-recognition, multilinguality:monolingual, language:en, modality:text, size_categories:1M<n<10M, license:cc-by-2.0, license:cc-by-4.0, license:cc-by-sa-3.0, license:cc-by-sa-4.0, library:datasets, library:dask, parquet

Description

MLCommons provides the People's Speech dataset, a collection of over 30,000 hours of transcribed English speech. This corpus is designed for training large-scale speech-to-text systems and is released under permissive licenses for both academic and commercial applications.

Use Cases

Training automatic speech recognition (ASR) models using audio recordings and text transcriptions
Acoustic modeling to account for diverse speaker profiles
Benchmarking speech-to-text accuracy against machine-generated and crowdsourced labels
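
Benchmarking of this kind is commonly reported as word error rate (WER). The sketch below is one minimal way to compute it, assuming whitespace tokenization and case-insensitive matching; the function name `word_error_rate` and those normalization choices are illustrative and are not taken from the dataset's documentation.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumption: lowercase and split on whitespace; real benchmarks
    # often apply heavier text normalization before scoring.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

Because part of the corpus carries machine-generated transcripts, a WER measured against those labels reflects agreement with a noisy reference and may differ from the error rate against careful human transcription.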

Strengths

30,000+ hours of transcribed audio
CC-BY and CC-BY-SA licensing, both of which allow commercial use (CC-BY-SA additionally requires derivatives to be shared under the same license)
Diverse speaker representation

Limitations

Potential transcription noise due to machine-generated annotations
Monolingual English focus limits multilingual application

Provenance

Source: MLCommons
Collection Method: Gathered from various sources with machine-generated and crowdsourced annotations
Freshness: Last updated November 2024

The dataset is distributed in Parquet format, and a library such as Dask may be needed to handle the data volume efficiently. Users should verify which sub-license (CC-BY or CC-BY-SA) applies to their use case.
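
As a rough illustration of the Parquet and Dask combination, the sketch below reads only selected columns lazily and sums the audio duration. The column names `text` and `duration_ms` and any file path passed in are assumptions for illustration and should be checked against the actual files; `dask.dataframe.read_parquet` is the only Dask call used.

```python
def load_lazy(parquet_glob, columns=("text", "duration_ms")):
    """Open Parquet shards as a lazy Dask DataFrame without reading audio bytes."""
    # Imported inside the function so this module loads even where
    # Dask is not installed.
    import dask.dataframe as dd

    # Assumption: the shards expose these column names.
    return dd.read_parquet(parquet_glob, columns=list(columns))


def total_hours(frame, duration_column="duration_ms"):
    """Sum a millisecond duration column and convert the total to hours."""
    # .sum() builds a lazy task graph; .compute() triggers the actual read.
    total_ms = frame[duration_column].sum().compute()
    return float(total_ms) / 3_600_000
```

A call such as `total_hours(load_lazy("path/to/shards/*.parquet"))`, with a hypothetical path, would scan only the duration column. Selecting columns up front matters here because the audio column dominates the file size.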

Quality Score

Overall: D (39)
Description: 42
Source: 36
Reputation: 49
Access: 22

Community

23.8K downloads

263 likes

0 views

Dataset Info

Author: MLCommons
Created: Aug 16, 2022
Updated: Nov 20, 2024
Last synced: Jun 8, 2026
