Name: Central Thai and Dialect Speech Corpus with Parallel Sentences
Creator: CMKL
Published: 2024-08-20T16:22:08
Keywords: Machine Translation, Speech Recognition, Natural Language Processing, Thai Speech, Dialect Audio, Audio, Multimodal, Parquet, Language: th, Modality: audio, Modality: text, Size category: 100K<n<1M, License: CC BY-SA 4.0, Libraries: datasets, dask, polars, mlcroissant, Region: us

Description

This corpus, created by CMKL, contains 700 hours of Central Thai speech plus 40 hours for each of three other Thai dialects. It includes parallel sentences across dialects to support speech and translation research. It was last updated in September 2024.

Use Cases

Train automatic speech recognition models on 700 hours of Central Thai audio.
Develop dialect adaptation systems using parallel sentences across the four Thai dialect recordings.
Research cross-dialect machine translation leveraging aligned audio-text pairs.
Benchmark speech model performance on Thai dialects using the provided data splits.
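The dialect adaptation and cross-dialect translation use cases both depend on lining up the same sentence across dialects. The following is a minimal sketch of that grouping step, not the corpus's documented schema: the column names `sentence_id`, `dialect`, and `text`, the dialect labels, and the toy rows are all assumptions made for illustration.

```python
from collections import defaultdict


def group_parallel(rows, id_key="sentence_id", dialect_key="dialect", text_key="text"):
    """Group rows sharing a sentence ID into {sentence_id: {dialect: text}}.

    The key names are hypothetical; substitute the actual column names.
    """
    groups = defaultdict(dict)
    for row in rows:
        groups[row[id_key]][row[dialect_key]] = row[text_key]
    return dict(groups)


def translation_pairs(groups, source, target):
    """Return (source_text, target_text) for sentences present in both dialects."""
    return [
        (by_dialect[source], by_dialect[target])
        for by_dialect in groups.values()
        if source in by_dialect and target in by_dialect
    ]


# Toy rows with placeholder strings, not actual corpus content.
rows = [
    {"sentence_id": 1, "dialect": "central", "text": "central sentence 1"},
    {"sentence_id": 1, "dialect": "dialect_a", "text": "dialect A sentence 1"},
    {"sentence_id": 2, "dialect": "central", "text": "central sentence 2"},
]

groups = group_parallel(rows)
pairs = translation_pairs(groups, "central", "dialect_a")
print(pairs)
```

Sentences recorded in only one dialect (sentence 2 in the toy data) are dropped from the pair list, which keeps the comparison controlled at the cost of discarding unpaired material.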

Strengths

700 hours of Central Thai audio provides a substantial training resource.
Includes parallel sentences across four dialects for controlled comparison.
Official data splits support reproducible evaluation.

Limitations

Each of the three other dialects has only 40 hours of audio, which limits what can be trained on dialect data alone.
Specific details on speaker demographics, recording conditions, and label quality are not provided.
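To put the size gap in numbers, the arithmetic below uses only the hour counts stated in this description. The dialect names `dialect_1` to `dialect_3` are placeholders, since the three dialects are not named here, and the oversampling factor is one illustrative way to express the imbalance, not a recommended training recipe.

```python
# Hours per subset as stated in the description; dialect names are placeholders.
hours = {"central": 700, "dialect_1": 40, "dialect_2": 40, "dialect_3": 40}

total = sum(hours.values())

# Fraction of all audio contributed by each subset.
share = {name: h / total for name, h in hours.items()}

# How many times each subset would need to be repeated to match Central Thai.
oversample = {name: hours["central"] / h for name, h in hours.items()}

print(f"total hours: {total}")
for name in hours:
    print(f"{name}: share={share[name]:.1%}, oversample x{oversample[name]:g}")
```

Central Thai makes up roughly 85% of the 820 total hours, so a model trained on a naive mix will see each of the other dialects far less often unless sampling is adjusted.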

Provenance

Source: CMKL, with Central Thai data collected via Wang Data Market.
Collection Method: Not specified.
Time Range: Not specified.
Freshness: Last updated September 2024.
Geography: Thailand, focusing on Central and three other dialect regions.

Parts of the corpus are included in the ML-SUPERB benchmark, so users who evaluate on ML-SUPERB should check for overlap with their training data. The platform tags indicate a CC BY-SA 4.0 license, but the provided description does not confirm it.
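One rough way to screen for overlap is to compare normalized transcripts between a training subset and a benchmark subset. The sketch below assumes both sides expose transcript strings; the function names and sample strings are hypothetical, and a transcript match is only a proxy, since the same sentence may be read by different speakers or in different dialects.

```python
import unicodedata


def normalize(text):
    """NFC-normalize and drop all whitespace so trivially different copies match."""
    return "".join(unicodedata.normalize("NFC", text).split())


def find_overlap(train_texts, benchmark_texts):
    """Return training transcripts whose normalized form appears in the benchmark."""
    benchmark_keys = {normalize(t) for t in benchmark_texts}
    return [t for t in train_texts if normalize(t) in benchmark_keys]


# Illustrative strings only, not taken from the corpus or from ML-SUPERB.
train = ["สวัสดี ครับ", "ขอบคุณ"]
bench = ["สวัสดีครับ"]

overlap = find_overlap(train, bench)
print(len(overlap), "of", len(train), "training transcripts also appear in the benchmark")
```

If utterance or file identifiers are shared between the two sources, matching on those would be a stricter check than matching on text.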
