Name: MDCC: A Multi-Domain Cantonese Speech Recognition Corpus
Creator: ming030890
Published: 2025-07-26T14:47:29
Keywords: Cantonese, Multidomain, Audio, Large Scale, Natural Language Processing, Speech Recognition

Description

MDCC is a large-scale Cantonese automatic speech recognition (ASR) dataset compiled from multiple domains. It provides .wav recordings of both spontaneous and read speech, paired with UTF‑8 plain‑text transcripts and speaker metadata. The dataset is published on Hugging Face by user 'ming030890' and was last updated there on 2025-07-26.

Use Cases

Train Cantonese speech recognition models on the provided .wav recordings and their transcripts.
Fine-tune ASR systems for multi-domain applications, drawing on the corpus's mix of source domains.
Compare spontaneous and read speech patterns in Cantonese, since both audio types are included.
Explore speaker metadata, such as sex, for demographic analysis in speech technology.
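
As a starting point for the training use case, the sketch below pairs each .wav file with a transcript and records its duration. It is a minimal illustration only: the directory layout (one same-named .txt file per .wav file) and the function names `pair_audio_with_transcripts` and `wav_duration_seconds` are assumptions, not something documented for MDCC, so adjust them to the actual layout after download.

```python
import wave
from pathlib import Path


def wav_duration_seconds(path):
    """Return the duration of a PCM .wav file in seconds."""
    with wave.open(str(path), "rb") as wf:
        return wf.getnframes() / wf.getframerate()


def pair_audio_with_transcripts(audio_dir, transcript_dir):
    """Pair each .wav with a same-named UTF-8 .txt transcript.

    Assumed (hypothetical) layout: audio_dir/<id>.wav, transcript_dir/<id>.txt.
    """
    pairs = []
    for wav_path in sorted(Path(audio_dir).glob("*.wav")):
        txt_path = Path(transcript_dir) / (wav_path.stem + ".txt")
        if not txt_path.exists():
            continue  # skip audio that has no matching transcript
        pairs.append({
            "audio": wav_path,
            "text": txt_path.read_text(encoding="utf-8").strip(),
            "seconds": wav_duration_seconds(wav_path),
        })
    return pairs
```

The resulting list of dictionaries can then be fed to whatever ASR training pipeline is in use. The standard-library `wave` module only reads uncompressed PCM files, so other encodings would need a different reader.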

Strengths

Compiled from multiple domains, suggesting diversity in content.
Includes both spontaneous and read speech, which may improve model robustness.
Provides speaker metadata (sex), adding a potential demographic dimension.

Limitations

Description metadata is limited; actual data quality requires manual inspection after download.
Column-level documentation is absent; field semantics must be inferred after download.
Row count and formal license terms are not stated (the audio is noted as being for research purposes only), which may limit suitability assessment.
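
Because these details are undocumented, a first inspection pass after download might simply list the columns and count the rows. The sketch below assumes the metadata arrives as a UTF-8 CSV file with a header row; the function name `summarize_metadata` and any column name passed to it (for example `sex`) are hypothetical and should be checked against the real files.

```python
import csv
from collections import Counter


def summarize_metadata(csv_path, column=None):
    """Report column names, row count, and optional value counts for one column."""
    with open(csv_path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        fields = list(reader.fieldnames or [])
        rows = 0
        counts = Counter()
        for row in reader:
            rows += 1
            if column is not None:
                # Missing or empty values are tallied under the empty string.
                counts[row.get(column) or ""] += 1
    return {"columns": fields, "rows": rows, "counts": dict(counts)}
```

Calling it with `column` set to a speaker attribute gives a quick view of how balanced that attribute is before any modeling work begins.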

Provenance

Source: Hugging Face dataset by author ming030890.
Collection Method: Compiled from multiple domains; specific gathering method is not detailed.
Time Range: Not specified.
Freshness: Last updated 2025-07-26 23:16:57; freshness should be verified.
Geography: Not specified.

The .wav data is hosted via a Google Drive link and is designated for research purposes only.
