Name: MASC Arabic: 1,000 Hours of Multi-Dialect Speech from YouTube
Creator: MohamedRashad
Published: 2025-12-27T22:41:33
Keywords: Audio Dataset, Arabic Speech, Multi Dialect, Multi Regional, Audio, Speech Recognition

Description

MASC contains 1,000 hours of Arabic speech audio sampled at 16 kHz and crawled from more than 700 YouTube channels. The dataset is multi-regional, multi-genre, and multi-dialect; it was created by MohamedRashad and last updated in April 2026. It is intended to advance research and development in Arabic speech technology.
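To give a rough sense of scale, the stated duration and sample rate can be turned into a sample count and an uncompressed size. The sketch below assumes 16-bit mono PCM, which the card does not confirm; the actual files may be compressed or stored differently, so treat the result as an upper-level estimate only.

```python
def estimate_pcm_size(hours, sample_rate_hz=16_000, bytes_per_sample=2, channels=1):
    """Estimate sample count and uncompressed PCM size for a speech corpus.

    bytes_per_sample=2 and channels=1 are assumptions (16-bit mono),
    not facts taken from the dataset card.
    """
    seconds = hours * 3600
    samples = int(seconds * sample_rate_hz * channels)
    size_bytes = samples * bytes_per_sample
    return {
        "samples": samples,
        "bytes": size_bytes,
        "gigabytes": size_bytes / 1e9,
    }


# 1,000 hours at 16 kHz, as stated in the description.
estimate = estimate_pcm_size(1000)
print(estimate)
```

Under these assumptions the corpus works out to about 57.6 billion samples, or roughly 115 GB of raw audio, which is worth budgeting for before download.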

Use Cases

Train automatic speech recognition (ASR) models on the multi-dialect, multi-genre speech content.
Benchmark speech recognition performance across Arabic dialects using the multi-regional data.
Develop speech synthesis or voice conversion systems using the 16 kHz sampled audio.
Conduct linguistic analysis of Arabic dialect variation using the speech data from diverse YouTube channels.
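For the benchmarking use case, one common approach is to compute word error rate (WER) separately per dialect. The following is a minimal sketch, not a method prescribed by the dataset: it assumes each evaluation sample carries a dialect label, a reference transcript, and a model hypothesis. The card does not document whether dialect labels or transcripts exist as fields, so the tuple layout and the dialect names here are hypothetical.

```python
from collections import defaultdict


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer_by_dialect(samples):
    """Aggregate WER per dialect.

    samples: iterable of (dialect, reference, hypothesis) tuples.
    This layout is an assumption made for illustration.
    """
    errors = defaultdict(int)
    words = defaultdict(int)
    for dialect, reference, hypothesis in samples:
        ref_tokens = reference.split()
        errors[dialect] += edit_distance(ref_tokens, hypothesis.split())
        words[dialect] += len(ref_tokens)
    return {d: errors[d] / words[d] for d in words if words[d] > 0}


# Toy data with placeholder tokens; dialect names are illustrative only.
toy_samples = [
    ("egyptian", "a b c d", "a b x d"),
    ("gulf", "a b", "a b"),
    ("egyptian", "e f", "e"),
]
result = wer_by_dialect(toy_samples)
print(result)
```

Errors and reference words are summed per dialect before dividing, so long utterances weigh more than short ones; averaging per-utterance WER instead would be a different, equally defensible choice.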

Strengths

Contains 1,000 hours of speech data, providing substantial volume for model training.
Sourced from over 700 YouTube channels, suggesting diversity in content and speakers.
Explicitly designed as multi-regional, multi-genre, and multi-dialect to support robust Arabic speech technology.

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
Row count and exact file formats are not documented, which makes it harder to assess suitability before download.
Data may reflect geographic, channel, or content bias inherent to the YouTube source material.
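Because column-level documentation is absent, a first step after download is usually to inspect a handful of records and note which fields appear and what types they hold. The helper below is a sketch under that assumption: it works on any iterable of dictionaries, however they were loaded, and the field names in the example (`audio`, `text`, `duration`) are hypothetical stand-ins rather than the dataset's real columns.

```python
from collections import Counter, defaultdict
from itertools import islice


def summarize_fields(records, limit=100):
    """Report, per field, the value types seen and how often the field is absent.

    Only the first `limit` records are examined.
    """
    type_counts = defaultdict(Counter)
    seen = 0
    for record in islice(records, limit):
        seen += 1
        for key, value in record.items():
            type_counts[key][type(value).__name__] += 1
    summary = {}
    for key, counts in type_counts.items():
        present = sum(counts.values())
        summary[key] = {"types": dict(counts), "missing": seen - present}
    return summary


# Hypothetical records; real field names must be checked after download.
sample_records = [
    {"audio": b"\x00\x01", "text": "a b", "duration": 1.5},
    {"audio": b"", "text": None},
]
summary = summarize_fields(sample_records)
for field, info in summary.items():
    print(field, info)
```

A field that shows mixed types or a non-zero `missing` count is a good candidate for closer inspection before it is used in training.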

Provenance

Source: MohamedRashad on Hugging Face.
Collection Method: Crawled from YouTube channels.
Time Range: Not specified.
Freshness: Last updated 2026-04-06 12:40:00; verify against the source before relying on it.
Geography: Multi-regional, covering various Arabic-speaking regions.

License information is unknown; users should verify terms before use.
