Name: Cantonese YouTube Audio-Caption Pairs for ASR
Creator: ming030890
Published: 2025-07-11T20:20:29
Keywords: Size Categories: 10K<n<100K, Library: polars, Modality: text, Library: mlcroissant, Library: datasets, Library: pandas, Parquet, Region: US

Description

This dataset contains Cantonese audio-caption pairs drawn from YouTube videos with manually provided captions. The audio was re-transcribed with SenseVoice, and segments were filtered by comparing the ASR output against the original captions. It keeps segments where the ASR output matches the original caption, as well as segments where the two differ only in homophones or English words.

Use Cases

Fine-tune Cantonese ASR models using the filtered audio-caption pairs where transcription matches the original manual caption.
Analyze ASR error patterns by comparing segments where differences are only homophones or English words.
Train speech recognition systems on a dataset specifically built from YouTube's Cantonese caption sources.

Strengths

The dataset is built from YouTube videos with manually provided Cantonese captions, so the reference text is human-authored rather than machine-generated.
Segments where the ASR output is identical to the original caption are included, which suggests good audio-text alignment for those pairs.
Includes segments with specific error types like homophones or English words, useful for targeted ASR error analysis.

Limitations

The exact row count and file layout are not documented here; the keywords indicate only Parquet files and a size between 10K and 100K rows, which limits assessment of scale and usability.
Sample data and column structure are unavailable, preventing verification of data format and specific features.
The data may be biased toward YouTube content and toward the specific ASR model (SenseVoice) used for re-transcription, since segments that model transcribes poorly are filtered out.

Provenance

Source: YouTube videos with manually provided Cantonese captions.
Collection Method: Audio re-transcribed using SenseVoice, with segments filtered based on comparison to original captions.
Freshness: Last updated on 2025-07-20.
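
The filtering step described above can be illustrated with a minimal sketch. This is not the creator's actual pipeline, which is not documented here. The function name classify_segment, the tokenization rule, and the tiny TOY_JYUTPING pronunciation table are all hypothetical and chosen only for illustration; a real implementation would need a full Cantonese lexicon and a more careful alignment than position-by-position comparison.

```python
import re

# Hypothetical toy pronunciation table (Jyutping), for illustration only.
# A real pipeline would need a complete Cantonese pronunciation lexicon.
TOY_JYUTPING = {
    "係": "hai6",
    "系": "hai6",
    "佢": "keoi5",
    "距": "keoi5",
    "我": "ngo5",
    "你": "nei5",
}

ENGLISH_WORD = re.compile(r"[A-Za-z]+")


def tokenize(text):
    """Split text so each English word is one token and every other
    non-space character is its own token."""
    return re.findall(r"[A-Za-z]+|\S", text)


def classify_segment(caption, asr_output):
    """Return 'exact', 'near_match', or 'reject' for one segment.

    Assumption: a near match has the same number of tokens as the caption,
    and every differing position is either a homophone pair (per the toy
    table) or involves an English word on at least one side.
    """
    if caption == asr_output:
        return "exact"

    cap_tokens = tokenize(caption)
    asr_tokens = tokenize(asr_output)
    if len(cap_tokens) != len(asr_tokens):
        return "reject"

    for cap_tok, asr_tok in zip(cap_tokens, asr_tokens):
        if cap_tok == asr_tok:
            continue
        # English word difference, e.g. casing or spelling.
        if ENGLISH_WORD.fullmatch(cap_tok) or ENGLISH_WORD.fullmatch(asr_tok):
            continue
        # Homophone difference: both characters known and pronounced alike.
        cap_reading = TOY_JYUTPING.get(cap_tok)
        if cap_reading is not None and cap_reading == TOY_JYUTPING.get(asr_tok):
            continue
        return "reject"

    return "near_match"


if __name__ == "__main__":
    print(classify_segment("我係學生", "我係學生"))
    print(classify_segment("佢係我", "距系我"))
    print(classify_segment("我係你", "我係佢"))
```

Under these assumptions, "exact" segments correspond to the pairs suggested for fine-tuning, while "near_match" segments correspond to the homophone and English-word cases suggested for error analysis.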

License information is unknown. The full description is hosted on Hugging Face; see the dataset page for complete details.
