Taiwanese Hokkien Synthetic Speech Audio Dataset

Name: Taiwanese Hokkien Synthetic Speech Audio Dataset
Creator: lianghsun
Published: 2026-03-19T02:39:20
Keywords: Text To Speech, Taiwanese, Hokkien, Audio, Audio Synthesis, Tabular, Parquet, Multilingual
Task categories: text-to-speech
Languages: zh, nan
Modalities: audio, text
Libraries: datasets, pandas, polars, mlcroissant
Size category: n<1K
License: CC BY 4.0
Region: US

Description

This Taiwanese Hokkien (Min Nan) speech dataset was generated with the CosyVoice3 model from 722 seed utterances and 32,506 Common Voice samples. It includes audio files, the corresponding text, and speaker metadata. It was created by lianghsun and last updated on March 19, 2026.

Use Cases

Training Taiwanese Hokkien speech synthesis models on the paired audio and text.
Evaluating TTS model performance on Hokkien using the provided audio samples.
Studying speaker characteristics and emotion in synthetic speech using the speaker_id and emotion metadata.
Developing multilingual speech systems that include Taiwanese Hokkien, using the domain and accent classifications.

Strengths

Includes 722 seed utterances from a specific TAT source.
Incorporates 32,506 cleaned samples from the Common Voice corpus.
Audio files have a stated sample rate of 22,050 Hz.
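
The stated sample rate can be checked after download. The following is a minimal sketch, assuming each clip can be obtained as WAV-encoded bytes; this page does not confirm the audio container, and the helper names (wav_sample_rate, make_test_tone) are hypothetical. A generated tone stands in for a real clip so the sketch runs on its own.

```python
import io
import math
import struct
import wave

EXPECTED_RATE = 22050  # sample rate stated on this page


def wav_sample_rate(wav_bytes: bytes) -> int:
    """Return the sample rate recorded in a WAV header."""
    with wave.open(io.BytesIO(wav_bytes), "rb") as wav:
        return wav.getframerate()


def make_test_tone(rate: int, seconds: float = 0.1) -> bytes:
    """Build a short mono 16-bit sine tone (a stand-in for a dataset clip)."""
    n_frames = int(rate * seconds)
    frames = b"".join(
        struct.pack("<h", int(12000 * math.sin(2 * math.pi * 440 * i / rate)))
        for i in range(n_frames)
    )
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wav:
        wav.setnchannels(1)   # mono
        wav.setsampwidth(2)   # 16-bit samples
        wav.setframerate(rate)
        wav.writeframes(frames)
    return buf.getvalue()


# Assumption: in real use, `clip` would be the bytes of one downloaded audio file.
clip = make_test_tone(EXPECTED_RATE)
print(wav_sample_rate(clip) == EXPECTED_RATE)
```

If the audio turns out to be stored in another format, the header check would need a decoder for that format instead of the wave module.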

Limitations

The row count is not reported, which makes it harder to assess whether the dataset suits a given task.
Column-level documentation is absent; field semantics must be inferred after download.
Data may reflect geographic or source bias inherent to the seed text and Common Voice sources.
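
Because column-level documentation is absent, a first pass over the downloaded rows can at least show which fields exist, what types they hold, and how often they are empty. The sketch below assumes the rows have already been loaded as a list of dictionaries; summarize_fields is a hypothetical helper, and the sample rows use invented placeholder values with field names mentioned on this page.

```python
def summarize_fields(rows):
    """Map each field name to its observed value types and missing count."""
    names = {name for row in rows for name in row}
    summary = {name: {"types": set(), "missing": 0} for name in names}
    for row in rows:
        for name in names:
            value = row.get(name)
            if value is None:
                # Field is absent from this row or explicitly empty.
                summary[name]["missing"] += 1
            else:
                summary[name]["types"].add(type(value).__name__)
    return summary


# Placeholder rows: the values are invented for illustration only.
rows = [
    {"text": "placeholder sentence", "speaker_id": "spk_a", "emotion": "neutral"},
    {"text": "another placeholder", "speaker_id": "spk_b", "emotion": None},
]

for name, info in sorted(summarize_fields(rows).items()):
    print(name, sorted(info["types"]), "missing:", info["missing"])
```

A summary like this does not reveal what a field means, but it narrows down which columns need closer inspection.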

Provenance

Source: lianghsun
Collection Method: Batch-generated by the CosyVoice3 (Fun-CosyVoice3-0.5B) model.
Freshness: Last updated 2026-03-19 02:40:34; verify against the source before relying on it.
Geography: Taiwan

Quality Score

Overall: D36
Description: 39
Source: 36
Reputation: 39
Access: 22

Community

Downloads: 17
Likes: 1
Views: 0

Dataset Info

Author: lianghsun
Created: Mar 19, 2026
Updated: Mar 19, 2026
