MASC: Massive Arabic Speech Corpus

Name: MASC: Massive Arabic Speech Corpus
Creator: abdusah
Published: 2022-03-02T23:29:22
Keywords: library:polars, language:ar, language_creators:crowdsourced, library:dask, modality:timeseries, size_categories:n<1K, library:mlcroissant, library:datasets, parquet, license:cc-by-nc-4.0, annotations_creators:crowdsourced, region:us

Description

1,000 hours of Arabic speech audio sampled at 16 kHz, collected from over 700 YouTube channels. The data spans multiple regions, genres, and dialects to support the development of speech recognition technologies.

Use Cases

Train automatic speech recognition (ASR) models using the 1,000 hours of multi-dialectal audio.
Develop dialect identification systems by leveraging the multi-regional nature of the speech samples.
Perform acoustic modeling for Arabic speech sampled at 16 kHz.
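
To make the scale behind these use cases concrete, the sketch below works out how many samples 1,000 hours of 16 kHz audio contains and roughly how large it would be uncompressed. The 1,000 hours and 16 kHz figures come from the description; mono channels and 16-bit PCM are assumptions made only for this estimate and are not stated in the listing, and the helper names are hypothetical.

```python
# Figures stated in the dataset description.
SAMPLE_RATE_HZ = 16_000
TOTAL_HOURS = 1_000

# Assumptions for illustration only (not stated in the listing).
ASSUMED_CHANNELS = 1          # mono
ASSUMED_BYTES_PER_SAMPLE = 2  # 16-bit PCM


def samples_for(seconds: float, sample_rate_hz: int = SAMPLE_RATE_HZ) -> int:
    """Number of audio samples in a clip of the given duration."""
    return int(round(seconds * sample_rate_hz))


def pcm_bytes(seconds: float,
              sample_rate_hz: int = SAMPLE_RATE_HZ,
              channels: int = ASSUMED_CHANNELS,
              bytes_per_sample: int = ASSUMED_BYTES_PER_SAMPLE) -> int:
    """Uncompressed size of a clip under the assumed PCM layout."""
    return samples_for(seconds, sample_rate_hz) * channels * bytes_per_sample


total_seconds = TOTAL_HOURS * 3600
total_samples = samples_for(total_seconds)
total_bytes = pcm_bytes(total_seconds)

print(f"samples: {total_samples:,}")
print(f"uncompressed size: {total_bytes / 1e9:.1f} GB (decimal)")
```

Under these assumptions the corpus comes to about 57.6 billion samples, or roughly 115 GB of raw audio, which is useful when planning storage and data-loading throughput for ASR training.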

Strengths

1,000 hours of speech audio data
Audio sampled at a consistent 16 kHz frequency
Sourced from over 700 distinct YouTube channels
Includes multi-regional and multi-dialectal Arabic speech variations

Quality Score

Overall: D34
Description: 48
Source: 36
Reputation: 10
Access: 22

Community

Downloads: 42
Views: 0

Dataset Info

Author: abdusah
Created: Mar 2, 2022
Updated: Jul 1, 2022
Last synced: Apr 29, 2026
