Description
Uzbek YouTube content, including IT vlogs, news, and Tashkent-dialect podcasts, forms the basis of this speech dataset. It contains at least 37,807 audio clips across two splits, totaling over 135.9 hours of audio, curated by Saidakmal and last updated in May 2026. Each audio clip is paired with two automatic speech recognition transcriptions generated by Gemini and Whisper models.
Use Cases
Training Uzbek-language ASR models based on the labeled audio-transcription pairs.
Benchmarking ASR model performance against the dual-transcription labels, which are filtered for agreement between Gemini and Whisper.
Studying dialectal variation in Uzbek speech using the Tashkent-dialect podcast content.
Analyzing transcription consistency and error rates between different ASR systems.
Strengths
Contains at least 37,807 records across two splits, providing a substantial sample size.
Features over 135.9 hours of transcribed Uzbek speech audio.
Each record includes two ASR transcriptions (Gemini and Whisper), enabling comparative analysis.
Records are filtered to those where the Character Error Rate (CER) between the two transcriptions is ≤12.5%, so retained labels are ones on which both ASR systems largely agree.
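The CER filter mentioned above can be illustrated with a minimal Python sketch. It assumes CER is the character-level Levenshtein distance divided by the length of a reference transcription, and it arbitrarily treats the Gemini transcription as the reference; the dataset card does not state which transcription is the reference or whether text is normalized first. The function names (`levenshtein`, `cer`, `keep_clip`) are hypothetical and not part of the dataset.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Edit distance normalized by reference length (assumed definition)."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return levenshtein(reference, hypothesis) / len(reference)


CER_THRESHOLD = 0.125  # the 12.5% threshold stated in the card


def keep_clip(gemini_text: str, whisper_text: str,
              threshold: float = CER_THRESHOLD) -> bool:
    """Keep a clip when the two transcriptions agree closely enough.

    Assumption: the Gemini text is used as the reference string.
    """
    return cer(gemini_text, whisper_text) <= threshold
```

For example, `keep_clip("salom dunyo", "salom dunya")` keeps the clip, since one substitution over 11 characters gives a CER of about 9.1%. Because the two transcriptions come from ASR systems, a low CER between them indicates agreement rather than verified correctness.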
Limitations
Column-level documentation is absent; field semantics must be inferred after download.
The row count for the 'news' split is incomplete, so the stated clip and hour totals are lower bounds; the total dataset size is unknown.
Data may reflect source bias inherent to YouTube content, such as over-representation of IT vlogs and news.
Provenance
Source
YouTube
Collection Method
Speech dataset collected from Uzbek YouTube content, with transcriptions generated by Gemini and Whisper models.
Freshness
Last updated 2026-05-14 10:06:05; freshness should be verified.