IT YouTube Uzbek Speech: 10K-100K Technical Audio Transcriptions

Name: IT YouTube Uzbek Speech: 10K-100K Technical Audio Transcriptions
Creator: islomov
Published: 2025-06-16T05:41:50
Keywords: size_categories:10K<n<100K, library:polars, library:dask, language:uz, modality:audio, modality:text, library:mlcroissant, library:datasets, format:parquet, region:us, task_categories:automatic-speech-recognition, license:apache-2.0

Description

This dataset contains between 10,000 and 100,000 Uzbek-language audio clips with transcriptions, all drawn from the Information Technology domain. It was collected by islomov from YouTube channels such as Mohir Dev and last updated in June 2025. The transcriptions keep English technical terms as spoken, which is intended to help models generalize to mixed Uzbek and English speech. The data is designed for training and evaluating Automatic Speech Recognition (ASR) systems in a technical context.
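Evaluating an ASR system on data like this usually means comparing each model output against the reference transcription. The sketch below computes word error rate (WER) with a plain word-level edit distance. The function name is hypothetical and nothing here comes from the dataset itself; it only assumes that a reference transcription and a model hypothesis are both available as strings.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words seen so far
    # (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution or exact match
            ))
        prev = curr
    return prev[-1] / len(ref)
```

One substituted word in a four-word reference gives a WER of 0.25. Because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.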

Use Cases

Training ASR models to recognize technical Uzbek vocabulary and English IT terms
Fine-tuning speech-to-text systems for educational video content from the Mohir Dev channel
Analyzing code-switching patterns between Uzbek and English in technical transcriptions
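As a rough illustration of the code-switching use case, the sketch below measures what share of the words in a transcription appear in a small English IT lexicon. The lexicon, the function name, and the example sentence in the usage note are all hypothetical and are not taken from the dataset; a real analysis would need a much larger term list or a language-identification model.

```python
import re

# Hypothetical, hand-picked lexicon for illustration only.
ENGLISH_IT_TERMS = frozenset(
    {"api", "backend", "frontend", "framework", "server", "database"}
)

# A token is a run of letters, optionally joined by apostrophe variants,
# so that Uzbek spellings such as "o'zbek" stay in one piece.
TOKEN_RE = re.compile(r"[^\W\d_]+(?:['\u2018\u2019`][^\W\d_]+)*")


def code_switch_ratio(text: str, lexicon=ENGLISH_IT_TERMS) -> float:
    """Return the fraction of tokens that exactly match a lexicon entry."""
    tokens = [token.lower() for token in TOKEN_RE.findall(text)]
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token in lexicon)
    return hits / len(tokens)
```

For the made-up sentence "Bu API backend serverda ishlaydi", two of the five tokens match, so the ratio is 0.4. The exact-match rule is a deliberate simplification: Uzbek attaches suffixes to borrowed words, so a form like "serverda" is not counted even though its stem is in the lexicon.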

Strengths

Between 10,000 and 100,000 records
Apache 2.0 license
Domain-specific IT vocabulary and English code-switching

Limitations

Acoustic bias due to primary sourcing from the Mohir Dev YouTube channel
Domain concentration in Information Technology
Potential transcription noise inherent in public video-sourced data
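One common way to reduce the effect of transcription noise before training or scoring is to normalize the text. The sketch below is a minimal example under stated assumptions: it assumes transcriptions are in Latin-script Uzbek, where the letters oʻ and gʻ are often typed with different apostrophe characters, and it is not the dataset author's preprocessing. The function name is hypothetical.

```python
import re
import unicodedata

# Apostrophe variants often typed in Latin-script Uzbek:
# ASCII apostrophe, right and left single quotes, backtick, U+02BC.
_APOSTROPHE_VARIANTS = re.compile("['\u2019\u2018`\u02bc]")
_CANONICAL_APOSTROPHE = "\u02bb"  # modifier letter turned comma
_PUNCTUATION = re.compile(r"[^\w\s]")


def normalize_transcript(text: str) -> str:
    """Lowercase, unify apostrophes, drop punctuation, collapse whitespace."""
    text = unicodedata.normalize("NFC", text).lower()
    text = _APOSTROPHE_VARIANTS.sub(_CANONICAL_APOSTROPHE, text)
    # U+02BB is a letter character, so it survives this punctuation filter.
    text = _PUNCTUATION.sub(" ", text)
    return " ".join(text.split())
```

Mapping every variant to a single character also merges the glottal-stop apostrophe with the one used in oʻ and gʻ. That loses an orthographic distinction, which is usually acceptable for scoring but may not be for producing display text.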

Provenance

Source: Mohir Dev YouTube channel and other public IT-related videos
Collection Method: Scraped from public YouTube content
Freshness: Last updated June 2025.
Geography: Uzbekistan

The dataset is provided in Parquet format and is licensed under Apache 2.0, allowing for broad use in research and commercial applications.

Quality Score

D39

Description: 39
Source: 41
Reputation: 47
Access: 22

Community

Downloads: 301
Likes: 5
Views: 0

Dataset Info

Author: islomov
Created: Jun 16, 2025
Updated: Jun 16, 2025