Taiwanese Mandarin Speech Corpus for ASR Pre-training

Name: Taiwanese Mandarin Speech Corpus for ASR Pre-training
Creator: adi-gov-tw
Published: 2025-10-01T07:26:46
Keywords: dginfra, license: other, language: zh, library: webdataset, modality: text, size category: 100K<n<1M, library: mlcroissant, WebDataset, library: datasets, region: us, task category: automatic speech recognition

Description

This dataset offers audio and text data for pre-training automatic speech recognition (ASR) models on Taiwanese Mandarin. It is split into training and test subsets stored in WebDataset format, which can be loaded with PyTorch and Hugging Face tools. It is published by adi-gov-tw and was last updated in December 2025.

Use Cases

Pre-train an ASR model on Taiwanese Mandarin audio files paired with text transcriptions.
Fine-tune a speech recognition model using the provided training and test WebDataset tar files.
Benchmark ASR model performance on the dedicated test set for Taiwanese Mandarin.
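
As a rough illustration of the benchmarking use case, the sketch below computes character error rate (CER), a metric commonly used for Mandarin ASR because written Chinese has no whitespace word boundaries. The listing does not name an evaluation metric, so the choice of CER, the function names, and the sample sentences are all assumptions made for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences, keeping one DP row at a time."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def character_error_rate(references, hypotheses):
    """Corpus-level CER: total character edits divided by total reference characters.

    Whitespace is dropped before comparison (an assumption; real pipelines
    often also normalise punctuation and script variants).
    """
    total_edits = 0
    total_chars = 0
    for ref, hyp in zip(references, hypotheses):
        ref_chars = [c for c in ref if not c.isspace()]
        hyp_chars = [c for c in hyp if not c.isspace()]
        total_edits += edit_distance(ref_chars, hyp_chars)
        total_chars += len(ref_chars)
    if total_chars == 0:
        raise ValueError("reference transcripts contain no characters")
    return total_edits / total_chars


if __name__ == "__main__":
    # Hypothetical reference and model output, not taken from the dataset.
    refs = ["今天天氣很好"]
    hyps = ["今天天氣真好"]
    print(f"CER: {character_error_rate(refs, hyps):.3f}")
```

Aggregating edits over the whole test set, rather than averaging per-utterance rates, keeps short utterances from dominating the score.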

Strengths

Data is packaged for direct use with common machine learning tooling such as PyTorch and the Hugging Face Datasets library.
The dataset was updated in December 2025, indicating recent maintenance.

Limitations

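The exact dataset size, row count, and specific audio/text features are not provided; the size-category tag gives only a range of 100K to 1M examples.
The license is tagged "other" and its terms are not stated here, which may restrict commercial use.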

Provenance

Source: Hugging Face
Collection Method: Data is packaged in WebDataset tar format for machine learning workflows.
Freshness: Last updated on 2025-12-22.
Geography: Taiwan

Data is stored in WebDataset tar files; users must be familiar with this format or the associated libraries to load it. The full description is hosted externally on Hugging Face.
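
For readers unfamiliar with the format, the following minimal sketch shows the WebDataset convention using only the Python standard library: a shard is an ordinary tar archive, and consecutive files that share the same name up to the first dot in the basename form one sample. The file names and the `wav`/`txt` extensions are hypothetical, since this listing does not state the dataset's actual field names, and in practice the `webdataset` or Hugging Face `datasets` libraries would normally handle this step.

```python
import io
import tarfile


def split_key(name):
    """Split a tar member name into (sample key, extension) at the first dot of the basename."""
    directory, _, base = name.rpartition("/")
    stem, _, ext = base.partition(".")
    key = f"{directory}/{stem}" if directory else stem
    return key, ext


def iter_samples(fileobj):
    """Yield (key, {extension: bytes}) for each run of consecutive files sharing a key."""
    current_key, sample = None, {}
    with tarfile.open(fileobj=fileobj, mode="r") as tar:
        for member in tar:
            if not member.isfile():
                continue
            key, ext = split_key(member.name)
            if key != current_key and sample:
                yield current_key, sample
                sample = {}
            current_key = key
            sample[ext] = tar.extractfile(member).read()
    if sample:
        yield current_key, sample


def build_demo_shard():
    """Build a tiny in-memory shard with placeholder audio bytes (illustrative only)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for name, payload in [
            ("utt0001.wav", b"RIFF-placeholder"),
            ("utt0001.txt", "今天天氣很好".encode("utf-8")),
            ("utt0002.wav", b"RIFF-placeholder"),
            ("utt0002.txt", "謝謝你".encode("utf-8")),
        ]:
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    buffer.seek(0)
    return buffer


if __name__ == "__main__":
    for key, fields in iter_samples(build_demo_shard()):
        print(key, sorted(fields), fields["txt"].decode("utf-8"))
```

Because samples are read sequentially from the archive, a shard can be streamed without unpacking it to disk, which is the main reason the format suits large speech corpora.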

Quality Score

Overall: D (37)
Description: 39
Source: 36
Reputation: 43
Access: 22

Community

257 downloads

1 like

0 views

Dataset Info

Author: adi-gov-tw
Created: Oct 1, 2025
Updated: Dec 22, 2025
