tw-hokkien-seed-text is a dataset of approximately 3 million Taiwanese Hokkien sentences, written entirely in Chinese characters, for training text-to-speech (TTS) and automatic speech recognition (ASR) models. The dataset was created by lianghsun and was last updated on March 20, 2026. Each sentence is 50–80 characters long, which corresponds to roughly 10–15 seconds of speech. The character-only form is intended to preserve authentic Taiwanese Hokkien vocabulary and syntax.
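The stated lengths and durations imply a speaking rate of about 5 characters per second (50 characters in 10 seconds, 80 characters in 15 seconds). The following is a minimal sketch of that arithmetic in Python. The function names and the default rate are assumptions made for illustration, not part of the dataset, and `len()` counts punctuation as well as Chinese characters.

```python
def chars_per_second(n_chars: int, seconds: float) -> float:
    """Speaking rate implied by a character count and a duration."""
    return n_chars / seconds


def estimate_seconds(sentence: str, rate: float = 5.0) -> float:
    """Rough speech duration for a sentence.

    `rate` is an assumed characters-per-second value derived from the
    dataset description (50 chars ~ 10 s); real speech will vary.
    """
    return len(sentence) / rate


# Rates implied by the two ends of the stated range.
low_end_rate = chars_per_second(50, 10)   # 50 characters in 10 seconds
high_end_rate = chars_per_second(80, 15)  # 80 characters in 15 seconds
```

An estimate like this can help budget audio length when synthesizing or recording the sentences, but it should be checked against measured durations.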
The license is unknown and must be checked before use. The data is in a character-only format, so models that expect phonetic or mixed-script input may require additional processing.
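Before feeding sentences to a model, it can be useful to confirm that a line really is character-only and falls in the stated 50–80 character range. The sketch below is one way to do that with the Python standard library. The helper names, the approximate Unicode ranges used for Chinese characters, and the sample sentences are all assumptions for illustration; they are not taken from the dataset.

```python
import unicodedata


def is_han(ch: str) -> bool:
    """Approximate test for a Chinese (Han) character by code point range."""
    cp = ord(ch)
    return (
        0x3400 <= cp <= 0x4DBF        # CJK Extension A
        or 0x4E00 <= cp <= 0x9FFF     # CJK Unified Ideographs
        or 0xF900 <= cp <= 0xFAFF     # CJK Compatibility Ideographs
        or 0x20000 <= cp <= 0x323AF   # Extensions B and later (approximate)
    )


def han_count(sentence: str) -> int:
    """Number of Han characters, ignoring punctuation and spaces."""
    return sum(1 for ch in sentence if is_han(ch))


def has_latin(sentence: str) -> bool:
    """True if any character is a Latin letter (a hint of romanized text)."""
    return any("LATIN" in unicodedata.name(ch, "") for ch in sentence)


def is_character_only(sentence: str) -> bool:
    """True if the sentence has Han characters and no Latin letters."""
    return han_count(sentence) > 0 and not has_latin(sentence)


def in_length_range(sentence: str, lo: int = 50, hi: int = 80) -> bool:
    """Check the Han-character count against the stated 50-80 range."""
    return lo <= han_count(sentence) <= hi


# Hypothetical sample lines, not drawn from the dataset.
han_only = "我欲去台北。"
mixed = "我欲去 Tâi-pak。"
```

Lines that fail `is_character_only` would need separate handling, and a model that expects phonetic input would still require a character-to-pronunciation step that this sketch does not cover.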