Description
An unofficial Arabic-only extraction of Mozilla Common Voice Corpus 18.0, prepared for Automatic Speech Recognition research. The dataset was created by MohamedRashad and last updated on 2025-12-27. It is derived from the original Common Voice 18 release, filtered to include only Arabic speech data while preserving the original dataset structure, splits, and metadata fields.
Use Cases
Training Arabic speech recognition models based on validated and unvalidated speech data.
Benchmarking ASR system performance on a dedicated Arabic speech corpus.
Developing Arabic-specific acoustic models, using the preserved metadata fields to select or stratify training data.
Conducting research on speech data validation and filtering techniques for Arabic.
Strengths
Preserves the original Common Voice 18 dataset structure and metadata fields.
Contains both validated and unvalidated Arabic speech data segments.
Last updated on 2025-12-27, indicating recent maintenance.
Limitations
Row count, file formats, and license information are unknown, which makes it hard to assess suitability before download.
Column-level documentation is absent; field semantics must be inferred after download.
Data may reflect geographic or demographic bias inherent to the original Common Voice collection.
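Because column-level documentation is absent, one way to start inferring field semantics after download is to summarize each field across a sample of rows. The following is a minimal sketch, not the dataset author's tooling: the helper `summarize_fields` and the sample rows (including the field names `path`, `sentence`, and `gender`) are hypothetical and only illustrate the idea.

```python
def summarize_fields(rows):
    """Summarize each field in a list of row dicts.

    For every field name, report how many rows have a non-empty value,
    how many distinct non-empty values occur, and the first non-empty
    value seen as an example.
    """
    stats = {}
    for row in rows:
        for name, value in row.items():
            entry = stats.setdefault(
                name, {"non_empty": 0, "values": set(), "example": None}
            )
            if value in (None, ""):
                continue  # treat None and empty strings as missing
            entry["non_empty"] += 1
            entry["values"].add(value)
            if entry["example"] is None:
                entry["example"] = value
    return {
        name: {
            "non_empty": entry["non_empty"],
            "distinct": len(entry["values"]),
            "example": entry["example"],
        }
        for name, entry in stats.items()
    }


# Hypothetical sample rows; the real field names must be checked after download.
sample_rows = [
    {"path": "clip_001.mp3", "sentence": "مرحبا", "gender": ""},
    {"path": "clip_002.mp3", "sentence": "مرحبا", "gender": "female"},
]
field_summary = summarize_fields(sample_rows)
```

A field with few distinct values relative to the row count (such as `gender` here) is likely categorical metadata, while one that is unique per row (such as `path`) is likely an identifier.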
Provenance
Source
Mozilla Common Voice Corpus 18.0
Collection Method
Filtered extraction to include only Arabic (ar) speech data.
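The filtering step can be sketched as keeping only rows whose locale is `ar` while leaving every other field untouched. This is an illustrative sketch under stated assumptions, not the script used to build the dataset: the tab-separated layout, the `locale` column name, the sample rows, and the helper `filter_by_locale` are all hypothetical.

```python
import csv
import io

# Hypothetical tab-separated sample; column names are assumptions for illustration.
SAMPLE_TSV = (
    "path\tsentence\tlocale\n"
    "clip_001.mp3\tمرحبا بالعالم\tar\n"
    "clip_002.mp3\thello world\ten\n"
    "clip_003.mp3\tصباح الخير\tar\n"
)


def filter_by_locale(tsv_text, locale="ar", locale_column="locale"):
    """Return the rows whose locale column equals `locale`.

    Rows are returned as dicts with all original fields preserved,
    mirroring an extraction that filters rows without changing the schema.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return [row for row in reader if row.get(locale_column) == locale]


arabic_rows = filter_by_locale(SAMPLE_TSV)
```

Filtering on rows rather than rewriting them is what allows the original structure and metadata fields to carry over unchanged.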
Time Range
Not specified
Freshness
2025-12-27
Geography
Not specified
License
Unknown; restrictions should be verified before use.