Description
siddiqiya's dataset is a specialized Arabic corpus that combines the Quran with approximately 10,000 non-repeated hadith drawn from 14 books, including the 'magma'a el zawa'ed' compilation. It is intended for training and evaluating speech recognition systems so that AI does not alter sacred scripture. The dataset also incorporates existing speech datasets such as Common Voice, Fleurs, and Media Speech.
Use Cases
Train Arabic automatic speech recognition (ASR) models on the described combination of Quranic recitation and hadith audio.
Evaluate ASR accuracy on religious scripture to catch mis-transcriptions, in line with the dataset's stated purpose.
Fine-tune language models for Classical Arabic and religious-register Arabic using the included text corpus.
Benchmark model performance across different Arabic speech datasets, given that Common Voice, Fleurs, and Media Speech are mentioned as included sources.
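The evaluation use case above is usually measured with word error rate (WER): the word-level edit distance between a reference transcript and the model output, divided by the number of reference words. The following is a minimal sketch in plain Python, not a method taken from the dataset itself. The function names `edit_distance` and `word_error_rate` are illustrative, and the sample strings are an assumed example. No text normalization is applied, so differences in diacritics count as errors, which may or may not be the desired behavior for scripture.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists of words or characters)."""
    # prev[j] holds the distance between the reference prefix seen so far
    # and the first j items of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # reference item missing from hypothesis
                curr[j - 1] + 1,     # extra item in hypothesis
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """WER = word-level edit distance / number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


if __name__ == "__main__":
    # Assumed example: the hypothesis drops one of four reference words.
    reference = "بسم الله الرحمن الرحيم"
    hypothesis = "بسم الله الرحيم"
    print(word_error_rate(reference, hypothesis))  # 0.25
```

Passing `list(reference)` and `list(hypothesis)` to `edit_distance` gives a character-level count instead, which can be more informative when a single diacritic or letter changes the meaning of a verse.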
Strengths
Includes approximately 10,000 hadith without repetitions, providing a specific scale for the text component.
Combines multiple sources, including the full Quran and 14 hadith books, suggesting a focused domain collection.
Explicitly incorporates established speech datasets (Common Voice, Fleurs, Media Speech), likely adding variety to audio samples.
Limitations
Column-level documentation is absent; field semantics must be inferred after download.
Row count for the complete, combined dataset is unknown, which may limit suitability assessment.
Description metadata is limited; actual data quality and structure require manual inspection after download.
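Because field semantics and row count are undocumented, a first inspection pass after download can at least report how many rows exist and which value types each field holds. The sketch below assumes only that the data can be iterated as dictionaries, one per row. The helper name `summarize_fields` and the sample rows, including the field names `text`, `source`, and `duration`, are invented for illustration and are not the dataset's actual schema.

```python
from collections import Counter


def summarize_fields(records):
    """Return (row_count, {field_name: Counter of value type names})."""
    summary = {}
    n_rows = 0
    for row in records:
        n_rows += 1
        for name, value in row.items():
            # Tally the Python type of every value seen under this field.
            summary.setdefault(name, Counter())[type(value).__name__] += 1
    return n_rows, summary


if __name__ == "__main__":
    # Hypothetical rows; replace with the rows of the downloaded data.
    sample_rows = [
        {"text": "example one", "source": "quran", "duration": 3.2},
        {"text": "example two", "source": "hadith", "duration": None},
    ]
    n_rows, fields = summarize_fields(sample_rows)
    print("rows:", n_rows)
    for name, types in fields.items():
        print(name, dict(types))
```

A field whose counter shows mixed types or many `NoneType` values is a good candidate for closer manual inspection before training.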
Provenance
Source
siddiqiya on Hugging Face, combining Quran, hadith books (including 'magma'a el zawa'ed' by Nour eldin elhaithamy), and other speech datasets.
Collection Method
Likely compiled from existing textual and audio sources for a specific AI training objective.
Freshness
Last updated 2025-05-27 20:43:43; freshness should be verified.
License is unknown; terms of use must be verified before application.