Name: Vietnamese Speech Recognition Corpus With 100+ Hours
Creator: thanhnew2001
Published: 2026-02-19T09:43:30
Keywords: Machine Learning, Audio, Vietnamese Language, Speech Recognition

Description

This corpus contains 32,267 audio samples totaling 103.18 hours of Vietnamese speech, curated for automatic speech recognition (ASR). It was created by thanhnew2001 and last updated in February 2026. The data is split into 29,041 training samples and 3,226 development samples.

Use Cases

Train ASR models using the 16 kHz audio samples and corresponding transcriptions.
Benchmark model performance on the development set containing 3,226 samples.
Analyze speech patterns and acoustic features from the three source datasets: asr_dataset_nguoivietdailynews, asr_dataset_nguyenkhangofficial, and asr_dataset_trinhlieu.
Fine-tune pre-trained models on Vietnamese speech, with segments averaging roughly 12 seconds in length.
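
Benchmarking on the development set usually means reporting word error rate (WER). The sketch below is one minimal way to compute it, assuming whitespace tokenization and lowercasing as the only text normalization. Neither choice comes from the dataset description, and the function names are illustrative rather than part of any toolkit.

```python
from typing import Iterable, Sequence, Tuple


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Word-level Levenshtein distance between two token sequences."""
    # prev[j] is the distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution, or match at no cost
            ))
        prev = curr
    return prev[-1]


def corpus_wer(pairs: Iterable[Tuple[str, str]]) -> float:
    """Total word edits divided by total reference words.

    Each pair is (reference transcript, model hypothesis).
    Normalization here is an assumption: lowercase, split on whitespace.
    """
    edits = 0
    words = 0
    for reference, hypothesis in pairs:
        ref = reference.lower().split()
        hyp = hypothesis.lower().split()
        edits += edit_distance(ref, hyp)
        words += len(ref)
    if words == 0:
        raise ValueError("no reference words to score against")
    return edits / words
```

Summing edits over the whole set before dividing weights long and short utterances by their word counts, which is the usual convention for corpus-level WER. Published results may apply different normalization, so scores are only comparable when the same rules are used.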

Strengths

At 103.18 hours of audio, the corpus provides substantial material for model training.
Its 32,267 samples supply a large number of individual utterances.
A predefined split of 29,041 training and 3,226 development samples (roughly 90/10) supports standard ML workflows.
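
The figures quoted in this listing can be cross-checked with a few lines of arithmetic. The sketch below uses only the numbers stated above; the constant and function names are illustrative.

```python
TOTAL_HOURS = 103.18
TOTAL_SAMPLES = 32_267
TRAIN_SAMPLES = 29_041
DEV_SAMPLES = 3_226


def summarize() -> dict:
    """Derive a few consistency figures from the published totals."""
    return {
        # The two splits should account for every sample.
        "split_sum_matches": TRAIN_SAMPLES + DEV_SAMPLES == TOTAL_SAMPLES,
        # Mean segment length in seconds.
        "mean_seconds": TOTAL_HOURS * 3600 / TOTAL_SAMPLES,
        # Share of samples held out for development.
        "dev_fraction": DEV_SAMPLES / TOTAL_SAMPLES,
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(name, value)
```

The splits add up to the stated total, the mean segment length comes out at about 11.5 seconds (in line with the roughly 12 seconds quoted under Use Cases), and the development set is close to 10% of the data.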

Limitations

Dataset size is moderate compared to major multilingual speech corpora.
Specifics on speaker demographics, recording conditions, and accent coverage are not provided.
No test set is mentioned, so final evaluation requires either reusing the development set or holding out part of the training data.
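
One possible workaround for the missing test set is to hold out part of the training data deterministically. The sketch below assigns each sample to a split by hashing its identifier, so the assignment stays stable across runs and machines. It assumes every sample has a unique string identifier; the identifier format and the 10% default are hypothetical, not taken from the dataset.

```python
import hashlib
from typing import Iterable, List, Tuple


def assign_split(sample_id: str, test_percent: int = 10) -> str:
    """Map a sample identifier to 'test' or 'train' via a stable hash."""
    digest = hashlib.sha256(sample_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % 100  # bucket in 0..99
    return "test" if bucket < test_percent else "train"


def carve_test_set(
    sample_ids: Iterable[str], test_percent: int = 10
) -> Tuple[List[str], List[str]]:
    """Partition identifiers into (train, test) lists, preserving input order."""
    train: List[str] = []
    test: List[str] = []
    for sample_id in sample_ids:
        if assign_split(sample_id, test_percent) == "test":
            test.append(sample_id)
        else:
            train.append(sample_id)
    return train, test
```

Because speaker information is not provided, a split like this cannot guarantee that test speakers are unseen during training. Holding out one of the three source datasets entirely is an alternative worth considering when that matters.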

Provenance

Source: Aggregated from three datasets: asr_dataset_nguoivietdailynews, asr_dataset_nguyenkhangofficial, and asr_dataset_trinhlieu.
Collection Method: Not specified.
Time Range: Not specified.
Freshness: Last updated in February 2026.
Geography: Presumably Vietnam, given the language focus.

Data is provided in the Icefall framework format (train.json, dev.json). Refer to the Hugging Face dataset page for the full description and access instructions.
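
Since the record schema of train.json and dev.json is not described here, a sensible first step is to load a file and inspect which fields it contains. The sketch below is a hedged starting point: it accepts either a single JSON array or JSON Lines (one object per line), because the listing does not say which layout is used, and it makes no assumption about field names. The helper names are illustrative.

```python
import json
from pathlib import Path
from typing import Any, List


def parse_manifest(text: str) -> List[Any]:
    """Parse manifest text that is either a JSON array or JSON Lines."""
    text = text.strip()
    if not text:
        return []
    try:
        data = json.loads(text)
    except json.JSONDecodeError:
        # Not a single JSON document: treat each non-empty line as one record.
        return [json.loads(line) for line in text.splitlines() if line.strip()]
    # A lone top-level object is returned as a one-record list for inspection.
    return data if isinstance(data, list) else [data]


def load_manifest(path: str) -> List[Any]:
    """Read a manifest file from disk and parse it."""
    return parse_manifest(Path(path).read_text(encoding="utf-8"))


def field_names(records: List[Any]) -> List[str]:
    """Collect the sorted set of keys seen across all dict records."""
    return sorted({key for rec in records if isinstance(rec, dict) for key in rec})
```

Calling field_names(load_manifest("train.json")) on a downloaded copy would list the available keys, which can then be matched against the description on the Hugging Face dataset page.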
