Name: HK-LegiCoST: Cantonese-English Speech Translation Corpus with 600+ Hours
Creator: Borrison
Published: 2026-06-13T22:32:39
Keywords: Cantonese English, Text, Audio Text, Audio, Natural Language Processing, Speech Translation, Parallel Corpus

Description

HK-LegiCoST is a three-way parallel corpus containing over 600 hours of Cantonese audio, aligned with standard traditional Chinese transcripts and English translations at the sentence level. It was created by researchers including Cihan Xiao, Henry Li Xinyuan, Jinyi Yang, Dongji Gao, Matthew Wiesner, Kevin Duh, and Sanjeev Khudanpur, with a paper published on arXiv in 2023. The dataset is hosted on Hugging Face by the user Borrison.

Use Cases

Train speech translation models on the parallel Cantonese audio and English text.
Develop Cantonese automatic speech recognition systems from the audio-transcript pairs.
Study how non-verbatim transcripts affect speech translation, using the aligned transcripts and translations.
Benchmark multilingual speech processing systems using the three-way parallel data.
Study linguistic phenomena in Cantonese-to-English translation using the sentence-aligned corpus.
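
As a rough illustration of how one three-way parallel entry could serve several of the tasks listed above, the sketch below derives ASR, speech translation, and text translation pairs from the same records. The class name, field names, and sample values are hypothetical placeholders; the actual column names are undocumented and must be checked after download.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Utterance:
    """One sentence-level entry. All field names are assumptions."""
    audio_path: str      # path to a Cantonese audio segment
    transcript_zh: str   # traditional Chinese transcript
    translation_en: str  # English translation


def asr_pairs(utts: List[Utterance]) -> List[Tuple[str, str]]:
    # Audio paired with its Chinese transcript, for speech recognition.
    return [(u.audio_path, u.transcript_zh) for u in utts]


def st_pairs(utts: List[Utterance]) -> List[Tuple[str, str]]:
    # Audio paired with its English translation, for speech translation.
    return [(u.audio_path, u.translation_en) for u in utts]


def mt_pairs(utts: List[Utterance]) -> List[Tuple[str, str]]:
    # Chinese transcript paired with English text, for text translation.
    return [(u.transcript_zh, u.translation_en) for u in utts]


# Placeholder record, invented for illustration only (not taken from the corpus).
sample = [Utterance("segment_0001.wav", "示例句子", "example sentence")]
```

Keeping all three views as functions over one record type means a single pass over the data can feed any of the three tasks.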

Strengths

Contains over 600 hours of Cantonese audio, providing substantial training material.
Offers three-way parallel alignment (audio, Chinese transcript, English translation) at the sentence level.
Includes non-verbatim transcripts, which may reflect real-world translation scenarios.

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
Row count is not reported, which makes it harder to assess suitability before download.
Last updated 2026-06-14 03:16:48; freshness should be verified.
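
Because neither the columns nor the row count are documented, a first step after download might be a quick schema survey. The following is a minimal sketch, assuming the records can be iterated as Python dictionaries; the helper name `summarize_fields` and the example field names are hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, Tuple


def summarize_fields(records: Iterable[Dict[str, Any]]) -> Tuple[int, Dict[str, dict]]:
    """Count rows and record, per field, the value types seen and non-null counts."""
    summary = defaultdict(lambda: {"types": set(), "non_null": 0})
    row_count = 0
    for rec in records:
        row_count += 1
        for key, value in rec.items():
            entry = summary[key]  # registers the field even if every value is None
            if value is not None:
                entry["types"].add(type(value).__name__)
                entry["non_null"] += 1
    return row_count, dict(summary)


# Placeholder rows with assumed field names, for illustration only.
rows = [
    {"audio": "a.wav", "text": "x"},
    {"audio": "b.wav", "text": None},
]
count, fields = summarize_fields(rows)
```

The returned count addresses the unknown row count, and the per-field type sets give a starting point for inferring field semantics.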

Provenance

Source: Hugging Face dataset uploaded by user Borrison.
Collection Method: Built from non-verbatim transcripts; the specific collection method is not detailed.
Freshness: Last updated 2026-06-14 03:16:48.
Geography: Likely related to Hong Kong (HK) given the dataset title and Cantonese language focus.

License is unknown; users should verify terms before use.
