Description
MASC is an Arabic speech dataset containing 1,000 hours of audio sampled at 16 kHz, crawled from over 700 YouTube channels. It is multi-regional, multi-genre, and multi-dialect, and is intended to advance research and development of Arabic speech technology. It was uploaded to Hugging Face by the author 'pain' and was last updated there in June 2023.
Use Cases
Train automatic speech recognition (ASR) models on multi-dialect Arabic speech.
Benchmark speech recognition performance across different Arabic dialects and genres.
Develop speech synthesis or voice cloning systems using the diverse speech samples.
Study acoustic and linguistic variations in Arabic as spoken across different regions.
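The benchmarking use case can be illustrated with a minimal sketch that computes word error rate (WER) separately for each group of utterances. It assumes each evaluation sample already carries a group label (for example a dialect or genre), a reference transcript, and a model hypothesis. The label names and the `wer_by_group` helper are hypothetical; nothing here confirms that MASC exposes such labels as columns.

```python
from collections import defaultdict


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1]


def wer_by_group(samples):
    """Compute WER per group.

    samples: iterable of (group, reference, hypothesis) strings.
    The group label (e.g. a dialect name) is an assumption of this sketch.
    """
    errors = defaultdict(int)
    words = defaultdict(int)
    for group, ref, hyp in samples:
        ref_tokens = ref.split()
        errors[group] += edit_distance(ref_tokens, hyp.split())
        words[group] += len(ref_tokens)
    # Skip groups with no reference words to avoid division by zero.
    return {g: errors[g] / words[g] for g in words if words[g] > 0}
```

Pooling errors and reference words per group before dividing weights long utterances more heavily than short ones, which is the usual convention for corpus-level WER. Arabic text normally needs normalization (for example diacritic handling) before scoring; that step is left out of this sketch.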
Strengths
Contains 1,000 hours of speech audio, providing a substantial volume of training data.
Sourced from over 700 YouTube channels, suggesting diversity in speakers and content.
Explicitly designed to be multi-regional, multi-genre, and multi-dialect, which may address coverage gaps in Arabic speech resources.
Limitations
Column-level documentation is absent; field semantics must be inferred after download.
Last updated 2023-06-12 19:48:45 (time zone not stated); check that the data is current enough for your purpose before relying on it.
Data may reflect geographic, temporal, or content bias inherent to its source platform, YouTube.
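Because column-level documentation is absent, one way to infer field semantics after download is to scan a handful of records and summarize what each field holds. The sketch below assumes the records have already been loaded as a list of dictionaries. The `summarize_fields` helper and the field names in the usage example are hypothetical and are not taken from the dataset.

```python
def summarize_fields(records):
    """Summarize observed value types and one example value per field.

    records: iterable of dicts, one per dataset row (assumed already loaded).
    Returns {field: {"types": [type names, sorted], "example": first non-None value}}.
    """
    summary = {}
    for record in records:
        for key, value in record.items():
            info = summary.setdefault(key, {"types": set(), "example": None})
            info["types"].add(type(value).__name__)
            # Keep the first non-None value as a representative example.
            if info["example"] is None and value is not None:
                info["example"] = value
    return {
        key: {"types": sorted(info["types"]), "example": info["example"]}
        for key, info in summary.items()
    }


# Usage with made-up rows; real field names must be read from the download.
rows = [
    {"text": "مرحبا", "duration": 3.2},
    {"text": None, "duration": 4},
]
for field, info in summarize_fields(rows).items():
    print(field, info["types"], info["example"])
```

A field that shows more than one type, or `NoneType` alongside another type, is a hint that values are missing or inconsistently encoded and deserves a closer look before training.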
Provenance
Source
Hugging Face, uploaded by author 'pain'.
Collection Method
Crawled from YouTube channels.
Freshness
Last updated 2023-06-12 19:48:45.
Geography
Multi-regional coverage of Arabic-speaking regions.
License
License is unknown; terms of use must be verified before the data is used.