An Arabic speech dataset comprising 7,168 hours of validated audio across approximately 3,957,670 segments drawn from 1,229 books. The dataset was created by AlgoRythmetic and last updated on April 21, 2026. Audio is provided as 16 kHz mono FLAC and is organized into 2,639 Parquet shards.
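The headline figures imply a few averages that help with planning storage and training batches. The sketch below only does arithmetic on the numbers quoted in the summary; the function and constant names are illustrative, and the results are averages, not a description of the actual per-segment or per-shard distribution.

```python
# Figures quoted in the dataset summary.
TOTAL_HOURS = 7_168
TOTAL_SEGMENTS = 3_957_670
TOTAL_BOOKS = 1_229
TOTAL_SHARDS = 2_639


def derived_stats() -> dict:
    """Return simple averages implied by the headline figures."""
    total_seconds = TOTAL_HOURS * 3600
    return {
        "avg_segment_seconds": total_seconds / TOTAL_SEGMENTS,
        "avg_hours_per_book": TOTAL_HOURS / TOTAL_BOOKS,
        "avg_segments_per_shard": TOTAL_SEGMENTS / TOTAL_SHARDS,
        "avg_hours_per_shard": TOTAL_HOURS / TOTAL_SHARDS,
    }


if __name__ == "__main__":
    for name, value in derived_stats().items():
        print(f"{name}: {value:.2f}")
```

On these numbers the average segment is roughly 6.5 seconds long and each shard holds a little under 1,500 segments on average.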
Use Cases
- Train Arabic automatic speech recognition (ASR) models on the large volume of validated speech segments.
- Develop Arabic text-to-speech (TTS) systems using the audiobook content described for the dataset.
- Fine-tune language models on Modern Standard Arabic and dialectal speech data.
- Conduct linguistic analysis of Arabic prosody and phonetics using the segmented audiobook data.
Strengths
- Large scale with 7,168 hours of validated audio.
- Rigorous validation process applied to every shard, checking schema, row groups, FLAC integrity, and metadata.
- Content sourced from 1,229 books, providing diverse textual material.
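The summary does not say how the FLAC integrity check was implemented, so the following is only a minimal sketch of one inexpensive spot check a user could run on audio bytes after download. It reads the FLAC stream marker and the STREAMINFO metadata block and compares the sample rate and channel count against the stated 16 kHz mono format. The function names are hypothetical, how the audio bytes are stored inside each Parquet shard is not documented here, and a header check does not replace a full decode.

```python
def read_flac_streaminfo(data: bytes) -> dict:
    """Parse the STREAMINFO block at the start of a FLAC byte stream."""
    # 4-byte marker + 4-byte metadata block header + 34-byte STREAMINFO body.
    if len(data) < 42:
        raise ValueError("too short to hold a FLAC STREAMINFO block")
    if data[:4] != b"fLaC":
        raise ValueError("missing fLaC stream marker")
    block_type = data[4] & 0x7F  # low 7 bits; the top bit is the last-block flag
    block_len = int.from_bytes(data[5:8], "big")
    if block_type != 0 or block_len != 34:
        raise ValueError("first metadata block is not STREAMINFO")
    # Bytes 18..25 pack four fields: 20-bit sample rate, 3-bit (channels - 1),
    # 5-bit (bits per sample - 1), and 36-bit total sample count.
    packed = int.from_bytes(data[18:26], "big")
    return {
        "sample_rate": packed >> 44,
        "channels": ((packed >> 41) & 0x7) + 1,
        "bits_per_sample": ((packed >> 36) & 0x1F) + 1,
        "total_samples": packed & ((1 << 36) - 1),
    }


def matches_expected_format(data: bytes, sample_rate: int = 16_000, channels: int = 1) -> bool:
    """Return True if the header parses and matches the expected format."""
    try:
        info = read_flac_streaminfo(data)
    except ValueError:
        return False
    return info["sample_rate"] == sample_rate and info["channels"] == channels
```

Calling `matches_expected_format(audio_bytes)` on a sample of segments gives a quick sanity check before committing to a full training run; decoding the audio with a FLAC library remains the more thorough test.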
Limitations
- Column-level documentation is absent; field semantics must be inferred after download.
- The genre breakdown is not fully visible in the description, which limits assessment of content coverage.
- Data may reflect source bias inherent to the specific collection of 1,229 books.
Provenance
- Source: Hugging Face, author AlgoRythmetic
- Collection Method: Likely extracted and processed from audiobook sources.
- Time Range: not specified
- Freshness: Last updated 2026-04-21 10:05:49
- Geography: not specified