Clean Common Voice 24.0: 10K-100K Taiwan Chinese Voice Seeds

Name: Clean Common Voice 24.0: 10K-100K Taiwan Chinese Voice Seeds
Creator: OKHand
Published: 2026-03-02T12:04:37
Keywords: Size category: 10K-100K, Common Voice, Text To Speech, Library: polars, Language: zh, Library: dask, Modality: audio, OPTIMIZED-PARQUET, License: CC0 1.0, Voice Cloning, Modality: text, Library: mlcroissant, Library: datasets, Parquet, Audio, Region: US, Speech

Description

This dataset contains between 10,000 and 100,000 cleaned audio clips derived from the Mozilla Common Voice 24.0 Chinese (Taiwan) corpus. Released by OKHand in March 2026, it provides 'Voice Seeds': clips processed with Silero-VAD to remove silence and environmental noise, intended for generative speech tasks.

Use Cases

Training text-to-speech (TTS) models using the cleaned audio and text pairs
Voice cloning applications requiring high-quality voice seeds without background silence
Fine-tuning speech recognition models specifically for the Chinese (Taiwan) dialect

Strengths

Processed with Silero-VAD (0.5 threshold) to ensure clean voice segments
Optimized Parquet format for efficient data streaming and storage
Scale of 10,000 to 100,000 high-quality voice seeds
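
To make the 0.5 threshold concrete, the following is a minimal sketch of how per-frame speech probabilities can be turned into kept audio ranges. It is not the author's pipeline and it does not call Silero-VAD: the probabilities are assumed to come from a VAD model, and the names speech_segments, trim_to_speech, and frame_size are hypothetical. A real VAD pass typically adds smoothing and padding that this sketch omits.

```python
from typing import List, Sequence, Tuple


def speech_segments(probs: Sequence[float], threshold: float = 0.5) -> List[Tuple[int, int]]:
    """Return [start, end) frame ranges whose speech probability is >= threshold."""
    segments: List[Tuple[int, int]] = []
    start = None
    for i, p in enumerate(probs):
        if p >= threshold:
            if start is None:
                start = i  # a speech run begins here
        elif start is not None:
            segments.append((start, i))  # the run ended on the previous frame
            start = None
    if start is not None:
        segments.append((start, len(probs)))  # run continues to the end of the clip
    return segments


def trim_to_speech(samples: Sequence[float], probs: Sequence[float],
                   frame_size: int, threshold: float = 0.5) -> List[float]:
    """Concatenate only the samples that fall inside frames classified as speech."""
    kept: List[float] = []
    for start, end in speech_segments(probs, threshold):
        kept.extend(samples[start * frame_size:end * frame_size])
    return kept
```

Raising the threshold in a sketch like this keeps fewer frames, which is where the risk of clipped speech noted under Limitations comes from.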

Limitations

Automated cleaning via Silero-VAD may introduce minor audio artifacts or clip speech in rare edge cases
Geographic bias toward the Taiwan dialect of Chinese, which may not represent other Mandarin variants

Provenance

Source: Mozilla Common Voice 24.0
Collection Method: Automated processing of existing crowdsourced audio using Silero-VAD for silence removal.
Freshness: Last updated March 2, 2026.
Geography: Taiwan

The data is licensed under CC0 1.0 and is provided in an optimized Parquet format, making it compatible with modern data processing libraries like Polars and Dask.

Quality Score

Overall: D37
Description: 39
Source: 39
Reputation: 41
Access: 22

Community

Downloads: 47
Likes: 1
Views: 0

Dataset Info

Author: OKHand
Created: Mar 2, 2026
Updated: Mar 2, 2026
Last synced: Apr 12, 2026

