Description

SADA (Saudi Audio Dataset for Arabic) is a large-scale Arabic speech corpus designed to support AI model development for Arabic speech processing. It contains over 667 hours of transcribed Arabic audio recordings, primarily featuring various Saudi dialects, and was curated in a collaboration involving the National Center for Artificial Intelligence. The dataset was last updated on the platform in May 2025.

Use Cases

Train automatic speech recognition (ASR) models on the large volume of transcribed Arabic audio.
Develop dialect identification systems that draw on the dataset's coverage of Saudi dialects.
Build text-to-speech (TTS) synthesis models for Arabic from the paired audio and transcriptions.
Fine-tune language models for Arabic speech understanding on the transcribed content.

Strengths

Over 667 hours of transcribed audio provides substantial training material.
Focus on Saudi dialects addresses a specific regional linguistic need.

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
Row count and file formats are not documented, which makes suitability hard to assess before download.
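
Because the schema is undocumented, a first pass after download is usually to sample some rows and see which fields exist and what they hold. The snippet below is a minimal sketch of that step in plain Python. The function name `summarize_fields` and the field names in `sample_rows` (`audio_path`, `text`, `duration_s`) are hypothetical placeholders, not the actual SADA schema; substitute rows read from the downloaded files.

```python
from collections import defaultdict


def summarize_fields(records, limit=100):
    """Collect observed value types, null counts, and one example per field.

    Only the first `limit` records are inspected.
    """
    summary = defaultdict(lambda: {"types": set(), "nulls": 0, "example": None})
    for index, record in enumerate(records):
        if index >= limit:
            break
        for name, value in record.items():
            entry = summary[name]
            if value is None:
                entry["nulls"] += 1
                continue
            entry["types"].add(type(value).__name__)
            if entry["example"] is None:
                entry["example"] = value
    return dict(summary)


# Hypothetical rows: these field names are assumptions for illustration only.
sample_rows = [
    {"audio_path": "clip_0001.wav", "text": "مرحبا", "duration_s": 3.2},
    {"audio_path": "clip_0002.wav", "text": None, "duration_s": 5},
]

report = summarize_fields(sample_rows)
for name, info in report.items():
    print(name, sorted(info["types"]), "nulls:", info["nulls"], "example:", repr(info["example"]))
```

A field that shows mixed types or a high null count in this report is a candidate for cleaning before training.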

Provenance

Source: MohamedRashad, with collaboration involving the National Center for Artificial Intelligence.
Collection Method: Curated audio recordings, likely collected and transcribed for research purposes.
Time Range: Not specified.
Freshness: Last updated 2025-05-03 17:02:55; freshness should be verified.
Geography: Primarily Saudi Arabia, based on the focus on Saudi dialects.

License is unknown; users must verify licensing terms before use.

Tags: Audio, Arabic Language, Speech Corpus, Large Scale, Natural Language Processing, Saudi Dialects, Synthetic

SADA22: Saudi Audio Dataset for Arabic with 667 Hours of Transcribed Speech
