MASC: 1,000 Hours of Multi-Dialect Arabic Speech from YouTube

Name: MASC: 1,000 Hours of Multi-Dialect Arabic Speech from YouTube
Creator: pain
Published: 2023-06-10T10:00:21
Keywords: Multiregional, Youtube, Multidialect, Audio, Speech Recognition


Available on 1 platform


Description

MASC contains 1,000 hours of speech audio sampled at 16 kHz, crawled from over 700 YouTube channels. The dataset is multi-regional, multi-genre, and multi-dialect, and is intended to advance research and development of Arabic speech technology. It was uploaded to Hugging Face by the user 'pain' and last updated there in June 2023.
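To give a sense of scale, the figures in the description can be turned into a rough storage estimate. The sketch below takes the 1,000 hours and 16 kHz from the description; the bit depth and channel count are assumptions (16-bit mono PCM), since the listing does not state the audio encoding, and the real download size depends on the container and any compression.

```python
# Back-of-the-envelope size estimate for the corpus described above.
HOURS = 1_000            # from the description
SAMPLE_RATE_HZ = 16_000  # from the description
BYTES_PER_SAMPLE = 2     # ASSUMPTION: 16-bit PCM (not stated in the listing)
CHANNELS = 1             # ASSUMPTION: mono (not stated in the listing)

total_seconds = HOURS * 3600
total_samples = total_seconds * SAMPLE_RATE_HZ * CHANNELS
total_bytes = total_samples * BYTES_PER_SAMPLE

print(f"samples: {total_samples:,}")
print(f"uncompressed size: {total_bytes / 1e9:.1f} GB")
```

Under these assumptions the raw audio comes to roughly 115 GB, which is worth knowing before planning a full download.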

Use Cases

Train automatic speech recognition (ASR) models based on the multi-dialect Arabic speech content.
Benchmark speech recognition performance across different Arabic dialects and genres.
Develop speech synthesis or voice cloning systems using the diverse speech samples.
Study acoustic and linguistic variations in Arabic as spoken across different regions.
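As an illustration of the benchmarking use case, the sketch below computes word error rate (WER) separately for each dialect group. It is a minimal, dependency-free example: the field names `dialect`, `reference`, and `hypothesis`, the group labels, and the toy transcripts are all hypothetical, because the dataset's actual columns are not documented in this listing.

```python
from collections import defaultdict


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer_by_group(rows, group_key="dialect"):
    """Return {group: WER}, pooling errors and reference words per group."""
    errors = defaultdict(int)
    words = defaultdict(int)
    for row in rows:
        ref = row["reference"].split()
        hyp = row["hypothesis"].split()
        errors[row[group_key]] += edit_distance(ref, hyp)
        words[row[group_key]] += len(ref)
    return {g: errors[g] / words[g] for g in words if words[g] > 0}


# Hypothetical toy data; real rows would come from model output on the corpus.
rows = [
    {"dialect": "group_a", "reference": "a b c d", "hypothesis": "a x c d"},
    {"dialect": "group_a", "reference": "e f", "hypothesis": "e f"},
    {"dialect": "group_b", "reference": "a b c d", "hypothesis": "a b"},
]

for group, wer in sorted(wer_by_group(rows).items()):
    print(f"{group}: WER = {wer:.3f}")
```

Pooling errors and word counts per group, rather than averaging per-utterance WER, keeps short utterances from dominating the score.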

Strengths

Contains 1,000 hours of speech audio, providing a substantial volume of training data.
Sourced from over 700 YouTube channels, suggesting diversity in speakers and content.
Explicitly designed to be multi-regional, multi-genre, and multi-dialect, which may address coverage gaps in Arabic speech resources.

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
Last updated 2023-06-12 19:48:45; freshness should be verified.
Data may reflect geographic, temporal, or content bias inherent to its source platform, YouTube.
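Because column-level documentation is absent, a first step after download is usually to inspect a handful of records and see which fields exist and what they hold. The following is a small sketch of that idea; the record layout and field names (`clip_id`, `duration_s`, `text`) are hypothetical placeholders, not the dataset's real schema.

```python
from collections import defaultdict


def summarize_fields(records, max_examples=2):
    """Collect observed types, example values, and missing counts per field."""
    summary = defaultdict(lambda: {"types": set(), "examples": [], "missing": 0})
    all_keys = set()
    for rec in records:
        all_keys.update(rec)
    for rec in records:
        for key in all_keys:
            if key not in rec or rec[key] is None:
                summary[key]["missing"] += 1
                continue
            entry = summary[key]
            entry["types"].add(type(rec[key]).__name__)
            if len(entry["examples"]) < max_examples:
                entry["examples"].append(rec[key])
    return dict(summary)


# Hypothetical sample records standing in for a few downloaded rows.
sample = [
    {"clip_id": "x1", "duration_s": 3.2, "text": "placeholder transcript"},
    {"clip_id": "x2", "duration_s": 5.0},
]

summary = summarize_fields(sample)
for field in sorted(summary):
    info = summary[field]
    print(field, sorted(info["types"]), info["examples"], "missing:", info["missing"])
```

A summary like this makes it easier to spot optional fields and mixed types before writing a training pipeline around them.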

Provenance

Source: Hugging Face, uploaded by author 'pain'.
Collection Method: Crawled from YouTube channels.
Freshness: Last updated 2023-06-12 19:48:45.
Geography: Multi-regional coverage of Arabic-speaking regions.

License: Unknown; terms of use must be verified before the data is used.

Related Topics: Audio, Multiregional, Youtube, Multidialect, Speech Recognition

Related Datasets

Quality Score

Overall: D (32)
Description: 33
Source: 36
Reputation: 25
Access: 26

Community

233 downloads

10 likes

0 views

Dataset Info

Author: pain
Created: Jun 10, 2023
Updated: Jun 12, 2023
Last synced: May 22, 2026

