A refined subset of the Mozilla Common Voice corpus containing only Uzbek-language voice recordings. The dataset has been cleaned and normalized, with a text field added, to improve usability for training automatic speech recognition (ASR) models. It was created by user 'yakhyo' and last updated on April 15, 2025.
Use Cases
- Training Uzbek-language ASR models on the cleaned and normalized audio-text pairs.
- Fine-tuning multilingual speech models for Uzbek using the filtered, language-specific samples.
- Benchmarking Uzbek speech recognition performance against the preprocessed dataset.
Strengths
- Focuses exclusively on the Uzbek language, providing a targeted resource.
- Includes preprocessing steps such as text normalization, which may reduce data cleaning effort.
Limitations
- Row count, file formats, and license information are unknown, limiting suitability assessment.
- Column-level documentation is absent; field semantics must be inferred after download.
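Because the fields are undocumented, one way to infer them after download is to scan a handful of rows and record which value types appear under each field name. The following is a minimal sketch, not the dataset's own tooling: the helper `summarize_fields` and the sample rows (including the `path`, `sentence`, and `text` field names) are hypothetical stand-ins, since the actual schema is unknown.

```python
from collections import defaultdict


def summarize_fields(records):
    """Map each field name to the sorted list of Python type names seen for it."""
    seen = defaultdict(set)
    for record in records:
        for key, value in record.items():
            seen[key].add(type(value).__name__)
    return {key: sorted(types) for key, types in seen.items()}


# Hypothetical rows for illustration only; the real column names are not documented.
sample_rows = [
    {"path": "clip_0001.mp3", "sentence": "Salom dunyo", "text": "salom dunyo"},
    {"path": "clip_0002.mp3", "sentence": "Rahmat!", "text": None},
]

for name, types in summarize_fields(sample_rows).items():
    print(name, types)
```

A field that reports more than one type (for example both `str` and `NoneType`) is a hint that some rows are missing values and may need filtering before training.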
Provenance
- Source: Mozilla Common Voice project, specifically version 17.0.
- Collection Method: Filtered and preprocessed from the larger multilingual corpus.
- Time Range: Not specified.
- Freshness: Last updated 2025-04-15 13:16:47.
- Geography: Not specified.