Available on 1 platform
This dataset provides between 10,000 and 100,000 audio recordings with transcriptions for Hokkien speech recognition, published by adi-gov-tw in late 2024. It is split into training and test subsets and stored in the WebDataset format to support high-throughput training in PyTorch and Hugging Face environments.
The data is stored as sharded WebDataset (.tar) archives, so users should be familiar with the webdataset library or the Hugging Face datasets loader in order to read the shards efficiently.
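For readers new to the format, the sketch below illustrates the general WebDataset convention: inside each .tar shard, files that share a basename form one sample, and the file extension names the field. It uses only the Python standard library and a tiny in-memory shard. The field names (wav, txt), the keys, and the payloads are placeholders assumed for illustration, not this dataset's actual layout. In practice the webdataset library or the Hugging Face datasets loader does this grouping for you.

```python
import io
import tarfile


def split_key(member_name):
    """Split 'dir/000001.wav' into ('dir/000001', 'wav').

    The basename is split at its first dot, which is the usual
    WebDataset convention for separating sample key and field name.
    """
    directory, _, filename = member_name.rpartition("/")
    stem, _, ext = filename.partition(".")
    key = f"{directory}/{stem}" if directory else stem
    return key, ext


def iter_samples(fileobj):
    """Yield one dict per sample from a WebDataset-style tar shard.

    Members are read sequentially; consecutive files with the same key
    are merged into a single sample dict keyed by extension.
    """
    with tarfile.open(fileobj=fileobj, mode="r|*") as tar:
        current_key, sample = None, {}
        for member in tar:
            if not member.isfile():
                continue
            key, ext = split_key(member.name)
            if key != current_key:
                if sample:
                    yield sample
                current_key, sample = key, {"__key__": key}
            sample[ext] = tar.extractfile(member).read()
        if sample:
            yield sample


def build_demo_shard():
    """Build a two-sample shard in memory (placeholder bytes, not real audio)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w") as tar:
        for i in range(2):
            fields = {
                "wav": f"fake-audio-{i}".encode("utf-8"),
                "txt": f"placeholder transcript {i}".encode("utf-8"),
            }
            for ext, payload in fields.items():
                info = tarfile.TarInfo(name=f"{i:06d}.{ext}")
                info.size = len(payload)
                tar.addfile(info, io.BytesIO(payload))
    buf.seek(0)
    return buf


if __name__ == "__main__":
    for sample in iter_samples(build_demo_shard()):
        print(sample["__key__"], sorted(k for k in sample if k != "__key__"))
```

To read a real shard, open the downloaded .tar file in binary mode and pass the file object to iter_samples, then inspect the keys of the first sample to see which fields the dataset actually provides.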