News Youtube Uzbek Speech Dataset

Name: News Youtube Uzbek Speech Dataset
Creator: islomov
Published: 2025-06-16T06:48:39
Keywords: size_categories:10K<n<100K, library:polars, library:dask, language:uz, modality:audio, modality:text, news, library:mlcroissant, library:datasets, mixed, format:parquet, region:us, task_categories:automatic-speech-recognition, license:apache-2.0

By islomov, updated 11mo ago

Available on 1 platform

Description

Uzbek-language audio clips with text transcriptions, sourced from the YouTube news channels Kunuz and Qalampir and covering multiple regional dialects. The transcriptions were generated with Gemini 2.5 Pro, and the dataset is intended to support Automatic Speech Recognition (ASR) development.

Use Cases

Train Automatic Speech Recognition (ASR) models using the audio clips and transcription text
Evaluate dialectal accuracy of speech-to-text systems across different Uzbek regional accents
Fine-tune language models on news-specific vocabulary and syntax found in the transcriptions
Analyze linguistic variations in Uzbek news reporting using the transcription text
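
As a rough illustration of the dialect evaluation use case, the sketch below computes word error rate (WER) per group in plain Python. It is a minimal sketch under stated assumptions: the group labels, the function names and the sample sentences are hypothetical toy values, and the dataset card does not state whether a per-clip dialect label is available.

```python
from collections import defaultdict


def word_edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance between two lists of words."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer_by_group(samples):
    """Return {group: WER} for an iterable of (group, reference, hypothesis).

    WER is the total word edit distance divided by the total number of
    reference words in the group. Groups with no reference words are skipped.
    """
    errors = defaultdict(int)
    ref_counts = defaultdict(int)
    for group, reference, hypothesis in samples:
        ref_words = reference.lower().split()
        hyp_words = hypothesis.lower().split()
        errors[group] += word_edit_distance(ref_words, hyp_words)
        ref_counts[group] += len(ref_words)
    return {g: errors[g] / ref_counts[g] for g in ref_counts if ref_counts[g] > 0}


# Toy data only: the group names and sentences are made up for illustration.
toy_samples = [
    ("group_a", "bugun havo issiq", "bugun havo issiq"),
    ("group_b", "bugun havo issiq", "bugun havo sovuq edi"),
]
per_group_wer = wer_by_group(toy_samples)
```

In practice the reference would be the dataset's transcription text and the hypothesis would be the output of the speech-to-text system under test. Because the transcriptions were themselves model-generated, a score computed this way measures agreement with those transcriptions rather than with a human-verified ground truth.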

Strengths

Sourced from prominent Uzbek news channels Kunuz and Qalampir
Includes transcriptions generated and refined using the Gemini 2.5 Pro model
Features audio clips representing multiple Uzbek regional dialects
Derived from publicly available YouTube news video content

Quality Score

Overall: D (40)
Description: 39
Source: 41
Reputation: 49
Access: 22

Community

652 downloads

7 likes

0 views

Dataset Info

Author: islomov
Created: Jun 16, 2025
Updated: Jun 16, 2025
