Uzbekvoice Filtered: Quality-Controlled Uzbek Speech Corpus | DataSalon

Speech & Audio

Name: Uzbekvoice Filtered: Quality-Controlled Uzbek Speech Corpus
Creator: DavronSherbaev
Published: 2024-01-27T06:52:38
Keywords: library: polars, library: dask, library: mlcroissant, library: datasets, language: uz, modality: text, size category: 100K<n<1M, format: parquet, region: us, task category: automatic-speech-recognition, license: apache-2.0

Available on 1 platform

Description

A filtered collection of Uzbek speech recordings, processed through voice activity detection, noise removal, and reading-speed analysis. It excludes the original Mozilla Common Voice files in favor of a refined subset validated with automatic speech-to-text (STT) models to ensure high-quality audio-text alignment.

Use Cases

Train Uzbek speech-to-text models using the audio files and their corresponding validated text transcriptions.
Analyze natural speech prosody using the subset of recordings filtered for standard reading speeds.
Evaluate the performance of voice activity detection algorithms against a dataset pre-processed for noise and silence.
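
For the first use case, a trained model's output is typically scored against the validated transcriptions. The listing does not name a metric, so the following is only a minimal sketch that assumes character error rate (CER) with lowercase and whitespace normalization; the function names are illustrative and not part of the dataset.

```python
def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # delete from ref
                            curr[j - 1] + 1,      # insert into ref
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def char_error_rate(reference: str, hypothesis: str) -> float:
    """CER = edit distance / reference length, after simple normalization.

    Normalization (lowercasing, collapsing whitespace) is an assumption.
    """
    ref = " ".join(reference.lower().split())
    hyp = " ".join(hypothesis.lower().split())
    if not ref:
        raise ValueError("reference transcription must not be empty")
    return edit_distance(ref, hyp) / len(ref)
```

A call such as `char_error_rate(row_text, model_output)` per recording, averaged over a held-out split, gives one rough quality figure for an Uzbek STT model.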

Strengths

Filters out audio files lacking voice activity or containing only noise after denoising.
Removes 5-10% of recordings identified as reading speed outliers to maintain natural speech patterns.
Validated using an automatic speech-to-text (STT) model trained on a high-confidence subset.
Derived from the Uzbek language portion of the Mozilla Common Voice project.
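
The reading-speed filter described above can be illustrated with a short sketch. The listing does not say how speed was measured or which cutoffs were used, so this assumes characters per second (whitespace ignored) and trims an equal share from the fastest and slowest tails; the field names `text` and `duration_s` are hypothetical.

```python
def reading_speed(text: str, duration_s: float) -> float:
    """Non-whitespace characters spoken per second (assumed speed measure)."""
    if duration_s <= 0:
        raise ValueError("duration must be positive")
    return len("".join(text.split())) / duration_s


def drop_speed_outliers(clips: list[dict], total_fraction: float = 0.10) -> list[dict]:
    """Drop the slowest and fastest clips, total_fraction of the data overall.

    Half of total_fraction is removed from each tail; original order is kept.
    """
    if not 0 <= total_fraction < 1:
        raise ValueError("total_fraction must be in [0, 1)")
    speeds = [reading_speed(c["text"], c["duration_s"]) for c in clips]
    order = sorted(range(len(clips)), key=speeds.__getitem__)
    cut = int(len(clips) * total_fraction / 2)  # clips removed per tail
    keep = set(order[cut:len(order) - cut])
    return [c for i, c in enumerate(clips) if i in keep]
```

Setting `total_fraction` between 0.05 and 0.10 mirrors the 5-10% range quoted in the listing; a threshold on absolute speed would be an equally plausible design.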

Quality Score

C41

Community

278 downloads

15 likes

0 views

Dataset Info

Author: DavronSherbaev
Created: Jan 27, 2024
Updated: Feb 3, 2025
Last synced: Apr 21, 2026
