Hakka Speech Recognition Dataset for Taiwanese Languages

Name: Hakka Speech Recognition Dataset for Taiwanese Languages
Creator: adi-gov-tw
Published: 2025-10-01T07:27:03
Keywords: Taiwanese Languages, Audio, Audio Transcription, Speech Recognition, Hakka Language, Multimodal

Description

Hakka-language audio recordings and transcriptions form a pre-training dataset for the Taiwan-Tongues-ASR-CE project. The dataset is packaged in WebDataset format for direct use with PyTorch and Hugging Face libraries. It was created by the adi-gov-tw organization and last updated in December 2025.

Use Cases

Train automatic speech recognition models on Hakka audio files and corresponding text transcriptions.
Fine-tune pre-trained ASR models using the provided training and test splits (train.tsv, test.tsv).
Benchmark model performance on the separate test set of Hakka speech samples.
Develop multilingual ASR systems incorporating Hakka using the WebDataset-formatted tar archives.
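Benchmarking on the test set is commonly reported as a word or character error rate. The source does not say which metric the project uses, so the following is only a minimal sketch of a corpus-level error rate based on edit distance. The function names and the `unit` parameter are illustrative, and the choice between word and character units is an assumption that depends on how the Hakka transcriptions are written.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(references, hypotheses, unit="char"):
    """Total edits divided by total reference units over a whole corpus.

    unit="word" splits on whitespace; unit="char" compares characters
    with spaces removed. Both conventions are assumptions, not something
    documented for this dataset.
    """
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    total_edits = 0
    total_units = 0
    for ref, hyp in zip(references, hypotheses):
        if unit == "word":
            ref_units, hyp_units = ref.split(), hyp.split()
        else:
            ref_units = list(ref.replace(" ", ""))
            hyp_units = list(hyp.replace(" ", ""))
        total_edits += edit_distance(ref_units, hyp_units)
        total_units += len(ref_units)
    if total_units == 0:
        raise ValueError("references contain no units to score")
    return total_edits / total_units
```

The error rate is accumulated over the whole corpus rather than averaged per utterance, so short utterances do not dominate the score.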

Strengths

Data is structured for machine learning pipelines with dedicated training and test subsets.
Format compatibility with WebDataset, PyTorch, and Hugging Face Datasets simplifies loading.

Limitations

Sample count and total audio duration are not documented, which limits statistical analysis.
Column details are not documented, which prevents assessment of feature richness or label consistency.
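Because the columns are undocumented, a reasonable first step is to inspect the split files directly. The sketch below reads a tab-separated file and reports its first row, column count, and number of remaining rows. It assumes only that train.tsv and test.tsv are tab-separated text. Whether the first row is a header is unknown, so it is returned as-is. The function name `summarize_tsv` is illustrative.

```python
import csv
import io


def summarize_tsv(handle):
    """Return the first row, its column count, and the count of rows after it."""
    # QUOTE_NONE treats quote characters as ordinary text, which is the
    # safer default for transcriptions that may contain quotation marks.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    first_row = next(reader, None)
    if first_row is None:
        return {"first_row": [], "num_columns": 0, "remaining_rows": 0}
    remaining_rows = sum(1 for _ in reader)
    return {
        "first_row": first_row,
        "num_columns": len(first_row),
        "remaining_rows": remaining_rows,
    }


# Usage with a local copy of a split file (the path is a placeholder):
#
#     with open("train.tsv", encoding="utf-8", newline="") as f:
#         print(summarize_tsv(f))
```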

Provenance

Source: adi-gov-tw organization on Hugging Face.
Collection Method: Pre-training data collected for the Taiwan-Tongues-ASR-CE project, method unknown.
Time Range: Not specified.
Freshness: Last updated in December 2025.
Geography: Taiwan, focusing on the Hakka language.

Data is stored in WebDataset tar files, so users need to be familiar with this format or with the libraries that read it. License information is unavailable.
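For readers new to the format: by WebDataset convention, consecutive files in a tar archive that share the same name up to the first dot form one sample, and the remainder of the name identifies the field. The sketch below illustrates that grouping with only the Python standard library. It is not the loader used by the project, and the helper names `split_key` and `iter_samples` are illustrative. The file extensions inside these particular archives are not documented, so none are assumed.

```python
import io
import posixpath
import tarfile


def split_key(name):
    """Split 'dir/0001.ext' into ('dir/0001', 'ext'); return None if no usable dot."""
    directory, filename = posixpath.split(name)
    stem, dot, ext = filename.partition(".")
    if not dot or not stem:
        return None
    return posixpath.join(directory, stem), ext


def iter_samples(fileobj):
    """Yield (key, {extension: bytes}) for each run of files sharing a key."""
    # "r:*" opens a seekable file object with transparent compression detection.
    with tarfile.open(fileobj=fileobj, mode="r:*") as archive:
        current_key, sample = None, {}
        for member in archive:
            if not member.isfile():
                continue
            parts = split_key(member.name)
            if parts is None:
                continue
            key, ext = parts
            if key != current_key:
                if sample:
                    yield current_key, sample
                current_key, sample = key, {}
            sample[ext] = archive.extractfile(member).read()
        if sample:
            yield current_key, sample
```

In practice, the `webdataset` package or the Hugging Face `datasets` library would typically handle this grouping, along with audio decoding.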

Quality Score

D38
Description: 42
Source: 36
Reputation: 40
Access: 26

Community

33 downloads
1 like
0 views

Dataset Info

Author: adi-gov-tw
Created: Oct 1, 2025
Updated: Dec 22, 2025
Last synced: Jun 11, 2026
