Description
Common Voice Corpus 11.0 is a multilingual speech dataset consisting of MP3 audio files paired with corresponding text transcriptions. The dataset contains 24,210 recorded hours, with 16,413 validated hours across 100 languages. Many recordings include demographic metadata such as age, sex, and accent.
Use Cases
Training automatic speech recognition (ASR) models on the large volume of validated audio-text pairs.
Improving ASR accuracy for specific demographic groups using the included age, sex, and accent metadata.
Benchmarking multilingual speech recognition performance across the 100 covered languages.
Studying acoustic variation and model bias across speaker demographics (age, sex, accent).
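The demographic use cases above generally come down to scoring a model separately for each group. The following is a minimal sketch of that idea, not part of the dataset or any official tooling: it computes word error rate (WER) per group with a plain word-level edit distance. The record keys `reference`, `hypothesis`, and `age` are hypothetical names chosen for illustration, and the sample records are made up; real field names must be checked after download.

```python
from collections import defaultdict


def word_errors(reference, hypothesis):
    """Return (word-level edit distance, number of reference words)."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the distance between the ref words seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1], len(ref)


def wer_by_group(samples, key):
    """Aggregate WER per value of `key`; missing values go to 'unreported'."""
    errors = defaultdict(int)
    words = defaultdict(int)
    for sample in samples:
        group = sample.get(key) or "unreported"
        err, n_words = word_errors(sample["reference"], sample["hypothesis"])
        errors[group] += err
        words[group] += n_words
    return {g: errors[g] / words[g] for g in errors if words[g] > 0}


# Made-up records; the keys are assumptions, not the dataset's documented schema.
samples = [
    {"age": "twenties", "reference": "the cat sat", "hypothesis": "the cat sat"},
    {"age": "sixties", "reference": "the cat sat", "hypothesis": "the hat sat"},
    {"age": "", "reference": "hello there", "hypothesis": "hello"},
]
result = wer_by_group(samples, "age")
print(result)
```

Because only some recordings carry demographic metadata, the sketch keeps an explicit "unreported" group rather than silently dropping those clips.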
Strengths
Large scale with 24,210 recorded hours of speech data.
Includes 16,413 hours of validated data, suggesting a quality control process.
Broad language coverage, spanning 100 languages.
Contains demographic metadata like age, sex, and accent for many recordings.
Limitations
Column-level documentation is absent; field semantics must be inferred after download.
The total number of unique speakers and the distribution of recordings per language are unknown.
Data may reflect geographic or linguistic bias inherent to the contributor base of a crowdsourced platform.
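Since the schema, speaker count, and per-language distribution are undocumented, a first step after download is usually to inspect the metadata directly. The sketch below assumes the metadata ships as tab-separated text with `client_id` and `locale` columns, which is the layout commonly seen in Common Voice releases but is unverified for this particular copy. The inline sample is made up so the code runs without the dataset.

```python
import csv
import io
from collections import Counter


def summarize_metadata(tsv_text):
    """Return (column names, unique speaker count, clips per locale)."""
    reader = csv.DictReader(
        io.StringIO(tsv_text),
        delimiter="\t",
        quoting=csv.QUOTE_NONE,  # transcripts may contain bare quote characters
    )
    columns = list(reader.fieldnames or [])
    speakers = set()
    clips_per_locale = Counter()
    for row in reader:
        # "client_id" and "locale" are assumed column names; adjust them
        # to whatever the header printed from `columns` actually shows.
        speakers.add(row.get("client_id"))
        clips_per_locale[row.get("locale")] += 1
    return columns, len(speakers), dict(clips_per_locale)


# Made-up sample standing in for a real metadata file.
sample = (
    "client_id\tpath\tsentence\tlocale\n"
    "a1\tclip_1.mp3\thello world\ten\n"
    "a1\tclip_2.mp3\tgood morning\ten\n"
    "b2\tclip_3.mp3\tbonjour\tfr\n"
)
columns, n_speakers, per_locale = summarize_metadata(sample)
print(columns)
print(n_speakers, per_locale)
```

For a real file, the same function can be fed the file's contents read with UTF-8 encoding. Listing the header first is the quickest way to confirm or correct the assumed column names.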
Provenance
Source
Mozilla Common Voice project, hosted by user 'echodict' on Hugging Face.
Collection Method
Crowdsourced contributions from volunteers.
Freshness
Last updated 2026-04-16 07:27:52; freshness should be verified.
Geography
Global, based on the 100 languages covered.
License
Unknown; must be verified before commercial use or redistribution.