Name: Large English Speech Recognition Corpus With 30,000+ Hours
Creator: MLCommons
Published: 2022-03-30T15:49:51
Keywords: Source datasets: original; Language creators: crowdsourced, machine-generated; Annotations creators: crowdsourced, machine-generated; Language: en; Multilinguality: monolingual; Region: US; Task categories: automatic speech recognition; Licenses: CC-BY 2.0, CC-BY 2.5, CC-BY 3.0, CC-BY 4.0, CC-BY-SA 3.0, CC-BY-SA 4.0; arXiv: 2111.09344

Description

The People's Speech Dataset contains over 30,000 hours of transcribed English speech, licensed for academic and commercial use under CC-BY-SA and CC-BY 4.0. It was created by MLCommons to train speech-to-text systems and features a diverse set of speakers.

Use Cases

Train automatic speech recognition models on over 30,000 hours of transcribed English audio.
Analyze speaker diversity patterns within the large corpus of crowdsourced and machine-generated speech.
Fine-tune speech-to-text systems using the permissively licensed CC-BY and CC-BY-SA audio and transcript pairs.
Benchmark model performance on a monolingual English speech recognition task with US regional data.
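
For the benchmarking use case, a common metric for speech recognition is word error rate (WER): the word-level edit distance between a reference transcript and a model's output, divided by the number of reference words. The following is a minimal sketch, not a method prescribed by the dataset. The function name and the lowercase, whitespace-split normalization are assumptions made for illustration.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    # Assumed normalization: lowercase and split on whitespace.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    # One substitution (sat -> sit) and one deleted word (the): 2 edits over 6 words.
    print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```

WER can exceed 1.0 when the hypothesis contains many inserted words, so it is best read as an error count per reference word rather than a percentage bounded at 100.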

Strengths

Contains over 30,000 hours of transcribed speech, making it one of the world's largest English speech corpora.
Available under permissive licenses (CC-BY-SA and CC-BY 4.0) for both academic and commercial usage.
Includes speech from a diverse set of speakers, as noted in the dataset summary.
Annotations were created through both crowdsourced and machine-generated methods, providing varied data sources.

Limitations

The dataset is monolingual (English only), limiting its applicability for multilingual speech recognition tasks.
The dataset summary gives no specific details on audio quality, speaker demographics, or transcription accuracy.
The summary does not state the time range over which the speech was collected, which may affect how well trained models reflect current usage.

Provenance

Source: MLCommons
Collection Method: Combination of crowdsourced and machine-generated annotations. The corpus is tagged as an original source dataset rather than one derived from another dataset.
Freshness: The dataset was last updated on 2024-08-25.
Geography: Primarily US region, as indicated by tags.

The dataset uses multiple Creative Commons licenses: according to its tags, CC-BY versions 2.0, 2.5, 3.0, and 4.0, and CC-BY-SA versions 3.0 and 4.0. Users must comply with the specific license terms for their intended use, which may include share-alike requirements.
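
Because share-alike terms apply only to the CC-BY-SA material, it can help to separate records by license before building a training set. The sketch below assumes each record is a dictionary with a hypothetical "license" field holding an identifier such as "cc-by-4.0" or "cc-by-sa-3.0". The field name, the identifier format, and the function names are illustrative assumptions, not part of the dataset's documented schema.

```python
from typing import Dict, Iterable, List, Tuple

Record = Dict[str, str]


def requires_share_alike(license_id: str) -> bool:
    """True for CC-BY-SA identifiers (assumed format, e.g. 'cc-by-sa-4.0')."""
    return license_id.strip().lower().startswith("cc-by-sa")


def split_by_license(records: Iterable[Record]) -> Tuple[List[Record], List[Record]]:
    """Return (attribution_only, share_alike) lists of records."""
    attribution_only: List[Record] = []
    share_alike: List[Record] = []
    for record in records:
        # 'license' is a hypothetical field name used for illustration.
        if requires_share_alike(record["license"]):
            share_alike.append(record)
        else:
            attribution_only.append(record)
    return attribution_only, share_alike


if __name__ == "__main__":
    # Made-up sample records, not drawn from the dataset.
    sample = [
        {"id": "a", "license": "cc-by-4.0"},
        {"id": "b", "license": "cc-by-sa-3.0"},
        {"id": "c", "license": "CC-BY-2.5"},
    ]
    by_only, by_sa = split_by_license(sample)
    print([r["id"] for r in by_only], [r["id"] for r in by_sa])
```

Keeping the two groups apart makes it easier to track which derived works may need to carry a share-alike license.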
