Description
74,858 high-quality Vietnamese audio samples with phonemized transcripts, designed for fine-tuning modern Text-to-Speech models. The dataset was created by LanguaMan, who collected audio from YouTube, cleaned background noise, and used the Whisper-large-v3 model for transcription, followed by agent-assisted spelling correction and human feedback. The dataset page was last updated on April 21, 2026.
Use Cases
Fine-tuning TTS models on the cleaned audio samples.
Training phoneme-to-speech synthesis models on the phonemized transcripts.
Benchmarking Vietnamese speech synthesis quality using the curated, large-scale sample set.
Developing accent or speaker adaptation models, to the extent that the YouTube-sourced audio covers varied speakers.
Strengths
Contains 74,858 audio samples, providing a substantial volume for model training.
Audio was cleaned to remove background noise, according to the description.
Transcripts were generated using the Whisper-large-v3 model and include a phonemization step.
Includes a human-in-the-loop feedback process for correcting transcription errors.
Limitations
Column-level documentation is absent; field semantics must be inferred after download.
Row count is not confirmed in the metadata; the stated figure of 74,858 samples should be verified after download.
The description metadata is limited; actual data quality requires manual inspection after download.
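Because the fields are undocumented, a first pass after download usually means profiling a sample of records. The following is a minimal sketch of that inspection, not the dataset's actual schema: the field names (audio, transcript, phonemes) and the sample values are hypothetical placeholders, and the records are assumed to have already been loaded as a list of dictionaries.

```python
from collections import Counter, defaultdict


def summarize_fields(records):
    """Profile each field: missing count, observed value types, one example."""
    summary = defaultdict(lambda: {"present": 0, "types": Counter(), "example": None})
    for row in records:
        for key, value in row.items():
            info = summary[key]  # register the field even if the value is None
            if value is None:
                continue
            info["present"] += 1
            info["types"][type(value).__name__] += 1
            if info["example"] is None:
                info["example"] = value
    total = len(records)
    return {
        key: {
            "missing": total - info["present"],
            "types": dict(info["types"]),
            "example": info["example"],
        }
        for key, info in summary.items()
    }


# Hypothetical records; real field names must be read from the downloaded data.
sample = [
    {"audio": "clip_0001.wav", "transcript": "xin chào", "phonemes": "placeholder-1"},
    {"audio": "clip_0002.wav", "transcript": None, "phonemes": "placeholder-2"},
]

report = summarize_fields(sample)
print("rows:", len(sample))
for field, stats in report.items():
    print(field, stats)
```

Running the same function over the full download also gives a row count to compare against the 74,858 samples stated in the description.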
Provenance
Source
Audio collected from YouTube by the author LanguaMan.
Collection Method
Audio cleaned, transcribed using Whisper-large-v3, with agent-assisted spelling correction and human feedback.
Freshness
Last updated 2026-04-21 04:33:00; freshness should be verified.
Geography
Vietnam (inferred from language focus).
License is unknown; verify the terms of use before using the dataset.