Name: Uzbek YouTube Speech Recognition Dataset with Gemini and Whisper Labels
Creator: Saidakmal
Published: 2026-05-14T09:58:32
Keywords: Uzbek Language, Audio, Speech Recognition, Synthetic, Multimodal Transcription, Multimodal, Youtube Content

Description

Uzbek YouTube content, including IT vlogs, news, and Tashkent-dialect podcasts, forms the basis of this speech dataset. It contains at least 37,807 audio clips across two splits, totaling over 135.9 hours of audio, curated by Saidakmal and last updated in May 2026. Each audio clip is paired with two automatic speech recognition transcriptions generated by Gemini and Whisper models.

Use Cases

Training Uzbek-language ASR models based on the labeled audio-transcription pairs.
Benchmarking ASR model performance against the dual-transcription labels, bearing in mind that both labels are machine-generated rather than human-verified.
Studying dialectal variations in Uzbek speech based on the described Tashkent-dialect podcast content.
Analyzing transcription consistency and error rates between different ASR systems.

Strengths

Contains at least 37,807 records across two splits, providing a substantial sample size.
Features over 135.9 hours of transcribed Uzbek speech audio.
Each record includes two ASR transcriptions (Gemini and Whisper), enabling comparative analysis.
Records are filtered to a Character Error Rate (CER) of ≤12.5% between the two transcriptions, which bounds how far the Gemini and Whisper labels can disagree.
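
The CER filter mentioned above can be sketched as follows. This is a minimal illustration, not the creator's actual pipeline: the function names are hypothetical, treating the Gemini transcription as the reference is an assumption (the source does not say which side is the reference), and any text normalization applied before comparison is unknown and therefore omitted.

```python
def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character Error Rate: edit distance divided by reference length."""
    if not reference:
        # Convention chosen for this sketch: empty vs. empty counts as a perfect match.
        return 0.0 if not hypothesis else 1.0
    return edit_distance(reference, hypothesis) / len(reference)


def passes_filter(gemini_text: str, whisper_text: str, threshold: float = 0.125) -> bool:
    """Keep a clip only if the two transcriptions differ by at most `threshold` CER.

    Assumption: Gemini is the reference. Swapping the arguments changes the
    denominator and can change the result near the threshold.
    """
    return cer(gemini_text, whisper_text) <= threshold
```

For example, `passes_filter("salom dunyo", "salom dunya")` keeps the clip (one substitution over 11 characters), while `passes_filter("kitob", "kitab")` rejects it (one substitution over 5 characters is 20%). Short clips are therefore much more sensitive to single-character disagreements than long ones.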

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
Row count for the 'news' split is incomplete, and total dataset size is unknown.
Data may reflect source bias inherent to YouTube content, such as over-representation of IT vlogs and news.

Provenance

Source: YouTube
Collection Method: Speech dataset collected from Uzbek YouTube content, with transcriptions generated by Gemini and Whisper models.
Freshness: Last updated 2026-05-14 10:06:05; freshness should be verified.
Geography: Uzbekistan (specifically mentions Tashkent dialect)

License: Unknown; reuse terms are unclear, which may restrict usage.
