Description
Common Voice 20.0 Mongolian Dataset is a subset of Mozilla's Common Voice project containing Mongolian speech data. The dataset includes audio clips in .mp3 format, transcriptions, train/test/dev splits, and metadata such as speaker demographics. It was uploaded by user 'warmestman' to Hugging Face on March 5, 2025.
Use Cases
Train speech recognition models based on the provided audio clips and transcriptions.
Conduct voice analysis based on the audio data and speaker demographic metadata.
Perform linguistic research on the Mongolian language based on the transcribed speech corpus.
Benchmark speech processing systems using the predefined train/test/dev splits.
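For the benchmarking use case, speech recognition output is commonly scored with word error rate (WER). The card does not name a metric, so the choice of WER here is an assumption. The function below is a minimal sketch: `word_error_rate` is a hypothetical helper, it splits on whitespace only, and any text normalization (casing, punctuation) is left to the caller.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)
```

Averaging this score over the test split transcriptions would give a per-utterance mean; a corpus-level WER instead sums edit distances and reference lengths before dividing.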
Strengths
Includes structured train/test/dev splits, which supports machine learning workflows.
Contains additional metadata such as speaker demographics, which can enable bias analysis.
Part of the established Mozilla Common Voice project, suggesting a standardized collection method.
Limitations
Description metadata is limited; data quality can only be assessed by manual inspection after download.
Column-level documentation is absent; field semantics must be inferred after download.
Row count and total size are unknown, which makes it hard to assess suitability before download.
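The gaps listed above can be partly closed with a quick inspection once the files are downloaded. The sketch below assumes the metadata ships as tab-separated files, as in official Common Voice releases; this upload may be laid out differently. `summarize_tsv` is a hypothetical helper, and the file name in the usage comment is an assumption.

```python
import csv
from collections import Counter


def summarize_tsv(path: str) -> dict:
    """Report column names, row count, and empty-value counts per column."""
    with open(path, newline="", encoding="utf-8") as f:
        # QUOTE_NONE keeps quotation marks inside transcriptions as literal text.
        reader = csv.DictReader(f, delimiter="\t", quoting=csv.QUOTE_NONE)
        columns = list(reader.fieldnames or [])
        rows = 0
        empty = Counter()
        for row in reader:
            rows += 1
            for col in columns:
                # Short rows yield None for missing fields; treat them as empty.
                if not (row.get(col) or "").strip():
                    empty[col] += 1
    return {
        "columns": columns,
        "rows": rows,
        "empty": {col: empty[col] for col in columns},
    }


# Usage (assumed file name; adjust to the actual download):
# summary = summarize_tsv("train.tsv")
# print(summary["columns"], summary["rows"], summary["empty"])
```

The empty-value counts give a first indication of how complete the speaker demographic fields are, which matters before attempting any bias analysis.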
Provenance
Source
Mozilla Common Voice project
Collection Method
Crowdsourced collection, likely via the Common Voice platform.
Time Range
Part of the Common Voice 20.0 release.
Freshness
Last updated 2025-03-05 16:27:27; freshness should be verified.
Geography
Mongolian language focus.
License
Unknown; users must verify licensing terms before use.