This dataset contains Cambodian cultural speech data: 134.6 hours of manually curated speech-text pairs in the Khmer language. DDD-Cambodia created it with eight native speakers and last updated it in May 2026. Recordings average 8.54 seconds in length and include speaker metadata: gender, age group, and origin city.
Use Cases
- Train automatic speech recognition (ASR) models on Khmer audio recordings.
- Fine-tune speech-to-text systems for the cultural-domain topics the dataset covers.
- Analyze speech patterns and acoustic features by speaker metadata such as gender and age group.
- Develop language models for Khmer from the transcribed cultural text.
- Benchmark ASR model performance on a manually curated, culturally specific dataset.
Strengths
- 134.6 hours of manually curated speech-text pairs, providing a substantial audio corpus.
- Speaker metadata includes gender, age group, and origin city for eight distinct native speakers.
- Average recording length of 8.54 seconds with a standard deviation of 3.37 seconds (about 39% of the mean), indicating short utterances with moderate variation in duration.
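As a rough check on the duration figures above, the short Python sketch below computes the coefficient of variation and a one-standard-deviation band from the two published statistics. It uses only the mean and standard deviation stated in this card; the shape of the actual duration distribution is not documented, so the band is illustrative only.

```python
# Published summary statistics from this card (seconds).
mean_seconds = 8.54
std_seconds = 3.37

# Coefficient of variation: spread relative to the mean.
cv = std_seconds / mean_seconds

# One-standard-deviation band around the mean. This does NOT assume any
# particular distribution; it is only a descriptive range.
low = mean_seconds - std_seconds
high = mean_seconds + std_seconds

print(f"coefficient of variation: {cv:.2f}")
print(f"mean +/- 1 std: {low:.2f}s to {high:.2f}s")
```

A ratio near 0.39 suggests utterance lengths vary noticeably, so padding or bucketing by length may still be worth considering when batching audio for training.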
Limitations
- Column-level documentation is absent; field semantics must be inferred after download.
- Row count is not stated, which makes it harder to assess suitability before download.
- With only eight speakers, all from Cambodia, the data may reflect geographic and demographic bias.
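Although the row count is not published, it can be approximated from the total duration and the average recording length given in this card. The sketch below is an estimate only: it assumes the 134.6 hours and the 8.54-second mean describe the same set of recordings, and the per-speaker figure assumes an even split across the eight speakers, which the card does not confirm.

```python
# Figures stated in this card.
total_hours = 134.6
mean_seconds = 8.54
num_speakers = 8

# Total audio in seconds divided by mean utterance length gives an
# approximate utterance (row) count.
total_seconds = total_hours * 3600
estimated_utterances = total_seconds / mean_seconds

# ASSUMPTION: speakers contributed equal amounts of audio. The card does
# not document the real per-speaker distribution.
hours_per_speaker = total_hours / num_speakers

print(f"estimated utterances: about {round(estimated_utterances, -2):,.0f}")
print(f"hours per speaker if evenly split: {hours_per_speaker:.1f}")
```

Under these assumptions the dataset holds on the order of 57,000 utterances; the true count should be verified after download.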
Provenance
- Source: DDD-Cambodia
- Collection Method: Utterances were manually generated by eight native speakers based on predefined cultural topics and subtopics.
- Freshness: Last updated 2026-05-15 10:22:01; freshness should be verified.
- Geography: Cambodia