This dataset contains 22,196 hours of raw audio from Hong Kong Legislative Council meetings, processed into 20,471 hours of segmented speech. Created by laubonghaudoi, it is split into raw and segmented subsets and was last updated on 2026-02-26.
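As a quick check on the two totals, the arithmetic below shows how much of the raw audio survives segmentation. Only the two hour counts come from the description; the function and variable names are illustrative. The source does not say what the discarded audio contains, so reading it as silence or non-speech is an assumption.

```python
RAW_HOURS = 22_196        # raw subset, from the description
SEGMENTED_HOURS = 20_471  # segmented subset, from the description


def retention(raw_hours: float, segmented_hours: float) -> tuple[float, float]:
    """Return (hours removed, fraction of raw audio retained)."""
    if raw_hours <= 0:
        raise ValueError("raw_hours must be positive")
    removed = raw_hours - segmented_hours
    return removed, segmented_hours / raw_hours


removed, kept = retention(RAW_HOURS, SEGMENTED_HOURS)
print(f"{removed:,} hours removed; {kept:.1%} retained")
# 1,725 hours removed; 92.2% retained
```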
Use Cases
- Training Cantonese speech recognition models on the large volume of segmented audio.
- Analyzing parliamentary speech patterns and discourse from the transcribed subtitles.
- Developing voice activity detection (VAD) systems using the raw and segmented audio subsets.
- Studying formal Cantonese language use and political terminology from the legislative proceedings.
Strengths
- Large scale: 20,471 hours of segmented audio.
- The processing pipeline is clearly described, covering download, VAD segmentation, and transcription with Qwen3-ASR-1.7B.
- Provides both raw and segmented subsets for different research needs.
Limitations
- Column-level documentation is absent; field semantics must be inferred after download.
- Data may reflect geographic and institutional bias inherent to its source, the Hong Kong Legislative Council.
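Because the columns are undocumented, a first step after download is usually to look at which fields appear and what types they hold. The sketch below does this for any iterable of row dictionaries. The sample rows and their field names (`audio_path`, `text`, `duration`) are hypothetical stand-ins, not the dataset's actual schema.

```python
from collections import defaultdict
from typing import Any, Iterable


def summarize_fields(records: Iterable[dict[str, Any]]) -> dict[str, set[str]]:
    """Map each field name to the set of Python type names observed for it."""
    seen: dict[str, set[str]] = defaultdict(set)
    for record in records:
        for name, value in record.items():
            seen[name].add(type(value).__name__)
    return dict(seen)


# Hypothetical sample rows; the real column names are not documented.
sample = [
    {"audio_path": "a.opus", "text": "多謝主席", "duration": 3.2},
    {"audio_path": "b.opus", "text": None, "duration": 5},
]

summary = summarize_fields(sample)
for name in sorted(summary):
    print(name, sorted(summary[name]))
# audio_path ['str']
# duration ['float', 'int']
# text ['NoneType', 'str']
```

Fields that show more than one type, or that include `NoneType`, are the ones worth checking by hand before training on them.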
Provenance
- Source: Hong Kong Legislative Council meeting recordings from YouTube.
- Collection method: Audio was downloaded, converted to 16 kHz Opus, segmented with fsmn-vad, and transcribed into Cantonese subtitles with Qwen3-ASR-1.7B; transcription errors were then corrected with regular expressions.
- Time range: Not specified.
- Freshness: Last updated 2026-02-26 07:09:41; freshness should be verified.
- Geography: Hong Kong
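Two of the collection steps under Provenance, the conversion to 16 kHz Opus and the regex cleanup, can be sketched as follows. This is a minimal illustration under stated assumptions, not the author's actual pipeline: the use of ffmpeg, the mono downmix, and both correction rules are assumptions, since the source names neither the conversion tool nor the regex patterns.

```python
import re


def ffmpeg_opus_command(src: str, dst: str, sample_rate: int = 16_000) -> list[str]:
    """Build an ffmpeg argument list that re-encodes `src` as Opus audio."""
    return [
        "ffmpeg", "-y",          # overwrite the output file if it exists
        "-i", src,
        "-vn",                   # drop any video stream
        "-ac", "1",              # mono downmix (assumption, not stated in the source)
        "-ar", str(sample_rate), # 16 kHz, as stated in the source
        "-c:a", "libopus",
        dst,
    ]


# Hypothetical correction rules; the dataset's real patterns are not published here.
CORRECTIONS = [
    (re.compile(r"\s+"), " "),               # collapse runs of whitespace
    (re.compile(r"([，。？！])\1+"), r"\1"),  # collapse repeated punctuation
]


def correct_transcript(text: str) -> str:
    """Apply each correction rule in order and trim the result."""
    for pattern, replacement in CORRECTIONS:
        text = pattern.sub(replacement, text)
    return text.strip()


cmd = ffmpeg_opus_command("meeting.webm", "meeting.opus")
print(" ".join(cmd))
print(correct_transcript("  多謝   主席。。 "))
# 多謝 主席。
```

The command is returned as a list so it can be passed to `subprocess.run` without shell quoting; nothing in the sketch invokes ffmpeg itself.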