Name: Taiwanese Hokkien Seed Text: 3 Million Sentences for Speech Synthesis and Recognition
Creator: lianghsun
Published: 2026-03-17T16:02:32
Keywords: Text To Speech, Speech Synthesis, Automatic Speech Recognition, Speech Recognition, Text Corpus, Taiwanese Hokkien, Taiwanese, Hokkien, Min Nan, Text, Parquet, Task Categories: text-to-speech, Task Categories: automatic-speech-recognition, Language: zh, Language: nan, Modality: text, Size Categories: 1M to 10M, Library: datasets, Library: polars, Library: dask, Library: mlcroissant, License: CC BY 4.0, Region: US

Description

tw-hokkien-seed-text is a dataset of approximately 3 million full-character Taiwanese Hokkien sentences designed for training text-to-speech (TTS) and automatic speech recognition (ASR) models. The dataset was created by lianghsun and was last updated on March 20, 2026. Each sentence is 50–80 characters long, corresponding to a speech duration of 10–15 seconds, and is written exclusively in Chinese characters to preserve authentic Taiwanese Hokkien vocabulary and syntax.

Use Cases

Train text-to-speech models on full-character sentences of a controlled length (50–80 characters).
Develop automatic speech recognition systems using text that reflects authentic Taiwanese Hokkien vocabulary and syntax.
Create language resources for Taiwanese Hokkien from a collection of everyday language spanning various social strata and industries.
Study linguistic patterns in Taiwanese Hokkien using a monolingual, character-only text corpus.

Strengths

Contains approximately 3 million sentences, providing substantial volume for model training.
Sentences are consistently 50–80 characters long, corresponding to 10–15 seconds of speech, which is suitable for TTS input.
Text is written exclusively in Chinese characters, following recommended character choices, which preserves authentic Taiwanese Hokkien vocabulary and syntax.
Content covers a wide range of everyday topics from traditional industries to modern technology, reflecting real usage scenarios.
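
The length and duration figures above can be sanity-checked with a few lines of Python. This is a minimal sketch, not the creator's method: it assumes that length is counted in Unicode code points with punctuation included (the description does not say how characters are counted), and the function names are hypothetical.

```python
# Bounds quoted in the dataset description.
MIN_CHARS, MAX_CHARS = 50, 80
MIN_SECONDS, MAX_SECONDS = 10, 15


def in_length_range(sentence: str, lo: int = MIN_CHARS, hi: int = MAX_CHARS) -> bool:
    """Return True if the stripped sentence has between lo and hi code points.

    Assumption: punctuation counts toward the length.
    """
    return lo <= len(sentence.strip()) <= hi


def implied_chars_per_second() -> tuple:
    """Speaking rate implied by pairing the character and duration bounds."""
    return MIN_CHARS / MIN_SECONDS, MAX_CHARS / MAX_SECONDS


if __name__ == "__main__":
    slow, fast = implied_chars_per_second()
    print(f"implied rate: {slow:.2f} to {fast:.2f} characters per second")
```

Pairing the bounds this way implies a speaking rate of roughly 5 characters per second, which can serve as a rough check when aligning synthesized or recorded audio with the text.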

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
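The exact row count is not stated; only the approximate figure of 3 million sentences is given, which may limit suitability assessment.
The license is not stated in the description; the keyword tags indicate CC BY 4.0, which should be verified before use.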
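
Because the columns are undocumented, a first step is to look at a handful of rows before committing to a full download. The sketch below assumes the Hugging Face `datasets` library, a split named "train", and a repository ID of `lianghsun/tw-hokkien-seed-text` inferred from the creator and dataset name; none of these is confirmed by this page. The `guess_text_column` helper is a hypothetical heuristic, not part of any library.

```python
from itertools import islice
from typing import Optional


def guess_text_column(rows: list) -> Optional[str]:
    """Pick the column whose string values are longest on average.

    Heuristic only: it assumes the sentence column holds longer strings
    than any ID or label column.
    """
    totals = {}
    counts = {}
    for row in rows:
        for name, value in row.items():
            if isinstance(value, str):
                totals[name] = totals.get(name, 0) + len(value)
                counts[name] = counts.get(name, 0) + 1
    if not totals:
        return None
    return max(totals, key=lambda name: totals[name] / counts[name])


def peek(repo_id: str, n: int = 5) -> list:
    """Stream the first n rows without downloading the full dataset."""
    from datasets import load_dataset  # deferred import: pip install datasets

    # The split name "train" is an assumption; adjust it if loading fails.
    stream = load_dataset(repo_id, split="train", streaming=True)
    return list(islice(stream, n))


if __name__ == "__main__":
    # Hypothetical repository ID, inferred from the creator and dataset name.
    rows = peek("lianghsun/tw-hokkien-seed-text")
    print(list(rows[0].keys()), guess_text_column(rows))
```

Streaming keeps the inspection cheap; once the sentence column is identified, the Parquet files can be loaded in full with whichever of the listed libraries fits the pipeline.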

Provenance

Source: lianghsun on Hugging Face
Collection Method: Likely collected or compiled for speech model training; specific gathering method is not detailed.
Time Range: Not specified
Freshness: Last updated 2026-03-20 21:06:43; freshness should be verified.
Geography: Taiwan (based on language focus)

The license is not confirmed in the description (the keyword tags indicate CC BY 4.0) and must be checked before use. Data is in a character-only format, which may require additional processing for models that expect phonetic or mixed-script input.
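
One way to confirm the character-only property on a sample is to flag any letter that is not a Han character, such as Latin romanization or Bopomofo. The following is a rough sketch: the function names are hypothetical, and the code point ranges cover the main CJK ideograph blocks and the supplementary ideographic planes rather than every ideograph-related block.

```python
def is_han(ch: str) -> bool:
    """Return True if ch falls in one of the main CJK ideograph ranges."""
    cp = ord(ch)
    return (
        0x4E00 <= cp <= 0x9FFF        # CJK Unified Ideographs
        or 0x3400 <= cp <= 0x4DBF     # Extension A
        or 0x20000 <= cp <= 0x2FFFF   # Supplementary Ideographic Plane
        or 0x30000 <= cp <= 0x3FFFF   # Tertiary Ideographic Plane
    )


def non_han_letters(sentence: str) -> list:
    """Return letters that are not Han characters.

    Punctuation, digits, and whitespace are ignored because str.isalpha()
    is False for them; Latin letters and Bopomofo are reported.
    """
    return [ch for ch in sentence if ch.isalpha() and not is_han(ch)]


if __name__ == "__main__":
    # Made-up strings for illustration, not rows from the dataset.
    print(non_han_letters("食飯。"))      # no non-Han letters expected
    print(non_han_letters("食 png。"))    # Latin letters are reported
```

Sentences for which `non_han_letters` returns a non-empty list would need review or conversion before being fed to a pipeline that assumes purely character-based input.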
