Description
The MASC dataset contains 1,000 hours of Arabic speech audio sampled at 16 kHz, crawled from over 700 YouTube channels. It is multi-regional, multi-genre, and multi-dialect. It was published on Hugging Face by MohamedRashad and last updated in April 2026. It is intended to advance research and development in Arabic speech technology.
Use Cases
Train automatic speech recognition (ASR) models on multi-dialect, multi-genre Arabic speech.
Benchmark speech recognition performance across Arabic dialects using the multi-regional data.
Develop speech synthesis or voice conversion systems using the 16 kHz sampled audio.
Conduct linguistic analysis of Arabic dialect variation using the speech data from diverse YouTube channels.
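For the benchmarking use case, a minimal sketch of per-dialect word error rate (WER) is shown here. It assumes each sample carries a dialect label, a reference transcript, and a model hypothesis. The field names `dialect`, `reference`, and `hypothesis`, the dialect labels, and the placeholder tokens are all hypothetical; the dataset's actual schema is undocumented and may not include dialect labels at all.

```python
from collections import defaultdict


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer_by_dialect(samples):
    """Return {dialect: WER}, where WER = total edits / total reference words."""
    errors = defaultdict(int)
    words = defaultdict(int)
    for s in samples:
        ref = s["reference"].split()
        hyp = s["hypothesis"].split()
        errors[s["dialect"]] += edit_distance(ref, hyp)
        words[s["dialect"]] += len(ref)
    return {d: errors[d] / words[d] for d in words if words[d] > 0}


# Hypothetical samples with placeholder tokens, for illustration only.
samples = [
    {"dialect": "egyptian", "reference": "a b c d", "hypothesis": "a b x d"},
    {"dialect": "gulf", "reference": "a b", "hypothesis": "a b"},
]
print(wer_by_dialect(samples))  # {'egyptian': 0.25, 'gulf': 0.0}
```

Pooling edits and reference words per dialect before dividing weights long utterances more heavily than averaging per-utterance WER would. Either convention can be reasonable, as long as it is applied consistently across dialects.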
Strengths
Contains 1,000 hours of speech data, providing substantial volume for model training.
Sourced from over 700 YouTube channels, suggesting diversity in content and speakers.
Explicitly designed as multi-regional, multi-genre, and multi-dialect to support robust Arabic speech technology.
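To put the 1,000-hour figure in practical terms, the following back-of-the-envelope sketch estimates the uncompressed audio size. It assumes 16-bit mono PCM, which the source does not confirm. The actual file format is unknown, and compressed formats would be considerably smaller.

```python
def pcm_size_bytes(hours, sample_rate_hz=16_000, bytes_per_sample=2, channels=1):
    """Uncompressed PCM size in bytes; bit depth and channel count are assumptions."""
    seconds = hours * 3600
    return int(seconds * sample_rate_hz * bytes_per_sample * channels)


size = pcm_size_bytes(1000)
print(f"{size / 1e9:.1f} GB")  # 115.2 GB
```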
Limitations
Column-level documentation is absent; field semantics must be inferred after download.
Row count and exact file formats are unknown, which makes it hard to assess suitability before download.
Data may reflect geographic, channel, or content bias inherent to the YouTube source material.
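Because column-level documentation is absent, a first step after download is usually to look at which fields appear and what types they hold. The sketch below summarizes field names and observed value types across a handful of records. The records and their field names (`id`, `text`, `duration`) are hypothetical stand-ins, not the dataset's real schema.

```python
from collections import defaultdict


def summarize_fields(records):
    """Map each field name to the sorted list of value type names seen for it."""
    summary = defaultdict(set)
    for rec in records:
        for key, value in rec.items():
            summary[key].add(type(value).__name__)
    return {k: sorted(v) for k, v in summary.items()}


# Hypothetical records, for illustration only.
records = [
    {"id": "clip-0001", "text": "...", "duration": 3.2},
    {"id": "clip-0002", "text": None, "duration": 4},
]
print(summarize_fields(records))
# {'id': ['str'], 'text': ['NoneType', 'str'], 'duration': ['float', 'int']}
```

Fields that show more than one type (for example a transcript that is sometimes missing) are worth checking before training, since they often indicate missing values or inconsistent encoding.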
Provenance
Source
MohamedRashad on Hugging Face.
Collection Method
Crawled from YouTube channels.
Time Range
Not specified.
Freshness
Last updated 2026-04-06 12:40:00; freshness should be verified.
Geography
Multi-regional, covering various Arabic-speaking regions.
License
License information is unknown; users should verify terms before use.