Tashkent Dialect Uzbek Speech: 10K-100K Podcast Audio Transcriptions

Name: Tashkent Dialect Uzbek Speech: 10K-100K Podcast Audio Transcriptions
Creator: islomov
Published: 2025-06-16T07:55:20
Keywords: size_categories:10K<n<100K, library:polars, podcasts, library:dask, language:uz, modality:audio, modality:text, library:mlcroissant, library:datasets, tashkent, parquet, region:us, task_categories:automatic-speech-recognition, license:apache-2.0

Description

This dataset contains between 10,000 and 100,000 audio clips of Tashkent-dialect Uzbek speech, each paired with a text transcription. islomov collected the audio from YouTube podcasts such as Jahongir Latipov and Bu podcast for Automatic Speech Recognition (ASR) tasks. The dataset was last updated in June 2025.

Use Cases

Training Automatic Speech Recognition (ASR) models using the audio clips and text transcriptions
Dialect identification to distinguish Tashkent speech from other Uzbek variants
Linguistic analysis of informal Uzbek syntax within the text transcriptions
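
For the ASR use case, model output is commonly scored against the reference transcriptions with word error rate (WER). The sketch below is a minimal, dependency-free version that assumes simple whitespace tokenization and lowercasing; it is not something this page states the dataset author uses, and a real evaluation would likely need Uzbek-specific text normalization first.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.lower().split()
    hyp_words = hypothesis.lower().split()
    if not ref_words:
        raise ValueError("reference transcript must contain at least one word")

    # previous[j] is the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    previous = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            deletion = previous[j] + 1      # reference word missing from hypothesis
            insertion = current[j - 1] + 1  # extra word in hypothesis
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1] / len(ref_words)
```

Calling `word_error_rate(transcription, model_output)` on each clip and averaging over a held-out split gives a rough quality figure; the example sentences in any such check are illustrative and not drawn from the dataset.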

Strengths

Contains between 10,000 and 100,000 records
Focuses on the specific Tashkent dialect of the Uzbek language
Released under the permissive Apache 2.0 license

Limitations

Geographic bias toward the Tashkent region, limiting generalizability to other Uzbek dialects
Potential transcription noise due to the informal nature of podcast conversations
Lack of detailed speaker metadata or demographic information

Provenance

Source: YouTube (Jahongir Latipov and Bu podcast channels)
Collection Method: Scraped from public YouTube podcast videos
Freshness: Last updated June 2025.
Geography: Tashkent, Uzbekistan

Distributed in Parquet format, which can be loaded efficiently with tools such as Polars, Dask, or Hugging Face Datasets.
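
As a rough sketch of what loading might look like, the snippet below reads a local Parquet shard with Polars and summarizes transcript lengths. The column name `text` and any file path passed in are assumptions, since this page does not list the schema or file names; check the actual files before relying on them.

```python
from statistics import mean


def load_rows(parquet_path, columns=None):
    """Read a local Parquet shard into a list of dicts using Polars.

    Polars is imported lazily so the summary helper below works without it.
    """
    import polars as pl  # third-party: pip install polars

    frame = pl.read_parquet(parquet_path, columns=columns)
    return frame.to_dicts()


def summarize_transcripts(rows, text_column="text"):
    """Return basic word-count statistics for the transcription column.

    `text_column` is a hypothetical column name; adjust it to the real schema.
    """
    lengths = [
        len(str(row[text_column]).split())
        for row in rows
        if row.get(text_column)  # skip missing or empty transcriptions
    ]
    if not lengths:
        return {"clips": 0, "mean_words": 0.0, "max_words": 0}
    return {
        "clips": len(lengths),
        "mean_words": mean(lengths),
        "max_words": max(lengths),
    }
```

Passing `columns=["text"]` to `load_rows` (again assuming that column exists) would avoid reading the audio data into memory when only the transcriptions are needed.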

Quality Score

Overall: D (39)
Description: 39
Source: 41
Reputation: 47
Access: 22

Community

290 downloads

5 likes

0 views

Dataset Info

Author: islomov
Created: Jun 16, 2025
Updated: Jun 16, 2025
