English Audio Transcripts with 3.4 Million Hours of Speech

Name: English Audio Transcripts with 3.4 Million Hours of Speech
Creator: allenai
Published: 2025-07-17T04:30:27
Keywords: Library: polars, Library: mlcroissant, Library: datasets, Library: pandas, Size Category: 10M < n < 100M, Modality: text, Region: US, arXiv: 2508.20869, Format: JSON, License: ODC-By

Description

OLMoASR-Pool contains approximately 3.4 million hours of audio and 18.8 million unique transcripts collected from the public internet. It was created by AllenAI to train English speech recognition models and includes a variety of speaking styles, accents, and audio setups.

Use Cases

Train speech recognition models on 3.4 million hours of audio to transcribe diverse English accents and speaking styles.
Analyze the relationship between audio characteristics and transcript accuracy across different audio setups.
Fine-tune language models using the 18.8 million unique transcripts for tasks like text generation or summarization.

Strengths

Contains approximately 3.4 million hours of audio data.
Includes 18.8 million unique transcript IDs.
Encompasses a variety of speaking styles, accents, and audio setups.
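
The two headline figures also give a rough sense of scale per item. The sketch below is a back-of-the-envelope estimate, not a statistic published with the dataset: it assumes each unique transcript corresponds to one audio item and that both totals cover the same set of items. The function name is hypothetical and used only for illustration.

```python
# Rough scale estimate from the two figures quoted on this page.
# Assumption (not stated by the source): one transcript per audio item,
# and both totals describe the same collection.
TOTAL_AUDIO_HOURS = 3.4e6     # "approximately 3.4 million hours"
UNIQUE_TRANSCRIPTS = 18.8e6   # "18.8 million unique transcripts"


def mean_minutes_per_transcript(total_hours: float, n_transcripts: float) -> float:
    """Return the average audio duration per transcript, in minutes."""
    if n_transcripts <= 0:
        raise ValueError("n_transcripts must be positive")
    return total_hours * 60.0 / n_transcripts


if __name__ == "__main__":
    minutes = mean_minutes_per_transcript(TOTAL_AUDIO_HOURS, UNIQUE_TRANSCRIPTS)
    print(f"about {minutes:.1f} minutes of audio per transcript on average")
```

Under these assumptions the average comes out to roughly 11 minutes of audio per transcript. Because the audio total is itself approximate, treat this as an order-of-magnitude figure only.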

Limitations

Specific column definitions, file formats, and audio quality metrics are not provided.
The dataset's geographic and temporal coverage is not specified, which may limit analysis of regional or time-based trends.

Provenance

Source: AllenAI, collected from the public internet.
Collection Method: Web-scale collection from the public internet.
Freshness: Last updated on March 20, 2026.

The full dataset description is hosted externally; users should review the page at https://huggingface.co/datasets/allenai/OLMoASR-Pool for complete details.

Quality Score

Overall: C (40)
Description: 42
Source: 41
Reputation: 47
Access: 22

Community

Downloads: 59
Likes: 12
Views: 0

Dataset Info

Author: allenai
Created: Jul 17, 2025
Updated: Mar 20, 2026
Last synced: Jun 18, 2026
