Description
HK-LegiCoST is a three-way parallel corpus containing over 600 hours of Cantonese audio, aligned with standard traditional Chinese transcripts and English translations at the sentence level. It was created by researchers including Cihan Xiao, Henry Li Xinyuan, Jinyi Yang, Dongji Gao, Matthew Wiesner, Kevin Duh, and Sanjeev Khudanpur, with a paper published on arXiv in 2023. The dataset is hosted on Hugging Face by the user Borrison.
Use Cases
Train speech translation models on the paired Cantonese audio and English translations.
Develop automatic speech recognition (ASR) systems for Cantonese using the audio-transcript pairs.
Research non-verbatim translation techniques using the alignment between transcripts and translations.
Benchmark multilingual speech processing systems using the three-way parallel data.
Study linguistic phenomena in Cantonese-to-English translation using the sentence-aligned corpus.
Strengths
Contains over 600 hours of Cantonese audio, providing substantial training material.
Offers three-way parallel alignment (audio, Chinese transcript, English translation) at the sentence level.
Includes non-verbatim transcripts, which may reflect real-world translation scenarios.
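As a rough illustration of what a sentence-level, three-way aligned record could look like in code, here is a minimal sketch. The class name, the field names (audio_path, start_sec, end_sec, transcript_zh, translation_en), and the sample values are all hypothetical; the dataset's actual schema is not documented here and may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class AlignedSegment:
    """One sentence-level unit: an audio span, its transcript, its translation.

    All field names are assumptions made for illustration only.
    """
    audio_path: str        # hypothetical: path to the source recording
    start_sec: float       # hypothetical: segment start within the recording
    end_sec: float         # hypothetical: segment end within the recording
    transcript_zh: str     # traditional Chinese transcript (may be non-verbatim)
    translation_en: str    # English translation

    @property
    def duration_sec(self) -> float:
        return self.end_sec - self.start_sec


def total_hours(segments: List[AlignedSegment]) -> float:
    """Sum segment durations and convert seconds to hours."""
    return sum(s.duration_sec for s in segments) / 3600.0


# Placeholder segments, not real corpus content.
segments = [
    AlignedSegment("meeting_a.wav", 0.0, 1800.0, "<transcript 1>", "<translation 1>"),
    AlignedSegment("meeting_a.wav", 1800.0, 3600.0, "<transcript 2>", "<translation 2>"),
]
print(total_hours(segments))
```

A helper like total_hours is one way to check the "over 600 hours" figure against whatever subset is actually downloaded, assuming the data exposes segment boundaries or durations.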
Limitations
Column-level documentation is absent; field semantics must be inferred after download.
Row count is unknown, which makes suitability harder to assess.
Last updated 2026-06-14 03:16:48; freshness should be verified.
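Because neither the column semantics nor the row count is documented, a first pass after download might simply count rows and tally the value types seen per field. The following is a minimal sketch that assumes the data can be iterated as dictionaries; the function name summarize_records and the sample field names are hypothetical, and loading the actual files is left out because the dataset's format is not described here.

```python
from collections import Counter, defaultdict
from typing import Any, Dict, Iterable, Mapping


def summarize_records(records: Iterable[Mapping[str, Any]]) -> Dict[str, Any]:
    """Count rows and tally the Python type names observed for each field."""
    row_count = 0
    field_types: Dict[str, Counter] = defaultdict(Counter)
    for record in records:
        row_count += 1
        for name, value in record.items():
            field_types[name][type(value).__name__] += 1
    return {
        "row_count": row_count,
        "fields": {name: dict(counts) for name, counts in field_types.items()},
    }


# Hypothetical sample rows; the real column names are undocumented.
sample = [
    {"audio_path": "a.wav", "transcript": "...", "translation": "...", "duration": 3.2},
    {"audio_path": "b.wav", "transcript": "...", "translation": None, "duration": 4.0},
]
summary = summarize_records(sample)
print(summary["row_count"])
print(summary["fields"])
```

Fields that show more than one type (for example both str and NoneType) point to missing values that are worth inspecting before training.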
Provenance
Source
Hugging Face dataset uploaded by user Borrison.
Collection Method
Built from non-verbatim transcripts; the specific collection method is not detailed.
Freshness
Last updated 2026-06-14 03:16:48.
Geography
Likely related to Hong Kong (HK) given the dataset title and Cantonese language focus.
License
Unknown; users should verify terms before use.