Name: Malay Speech Synthesis Dataset with 241 Hours of Audio
Creator: huseinzol05
Published: 2022-04-16T09:28:21
Keywords: Regionus

Description

A collection of approximately 241 hours of high-quality Malay speech audio synthesized with the ms-MY-YasminNeural voice. The audio is split into two subsets: 99.4 hours from Malay Wikipedia and news texts, and 142 hours from Malaysian Parliament transcripts. All audio is sampled at 24000 Hz, and every sentence is between 2 and 20 words long.
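The figures in the description can be checked with a little arithmetic. The following is a minimal sketch, not code from the dataset's repository: the subset keys, function names, and the whitespace-based word count are all assumptions made for illustration.

```python
# Figures taken from the description; the dictionary keys are hypothetical labels.
SAMPLE_RATE_HZ = 24_000
SUBSET_HOURS = {"wikipedia_news": 99.4, "parliament": 142.0}


def total_hours(subsets):
    """Sum the hours of every subset."""
    return sum(subsets.values())


def hours_to_samples(hours, sample_rate=SAMPLE_RATE_HZ):
    """Number of audio samples in the given number of hours of mono audio."""
    return round(hours * 3600 * sample_rate)


def within_length(sentence, min_words=2, max_words=20):
    """True if the sentence has between min_words and max_words words.

    Words are counted by splitting on whitespace, which is an assumption;
    the source does not say how sentence length was measured.
    """
    return min_words <= len(sentence.split()) <= max_words


if __name__ == "__main__":
    print(f"Total hours: {total_hours(SUBSET_HOURS):.1f}")
    print(f"Samples in one hour: {hours_to_samples(1)}")
    print(within_length("Selamat pagi"))
```

The two subsets sum to 241.4 hours, which matches the "approximately 241 hours" quoted in the title.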

Use Cases

Train a text-to-speech model on the 24000 Hz audio to reproduce the ms-MY-YasminNeural voice characteristics.
Fine-tune an automatic speech recognition system on the 241 hours of clean Malay audio from the Wikipedia, news, and Parliament sources.
Analyze prosody and pronunciation patterns in Malay using sentences constrained to 2-20 words.
Build a voice cloning model from the clean, single-voice ms-MY-YasminNeural audio samples.
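Before any of these training uses, it can help to confirm that each clip really has the stated sample rate and to measure its duration. The sketch below uses only Python's standard `wave` module and assumes the clips are available as PCM WAV files, which the source does not state; the function names are hypothetical, and `write_silence` exists only to create a stand-in clip for the example.

```python
import wave

EXPECTED_RATE_HZ = 24_000  # sample rate stated in the dataset description


def clip_duration_seconds(path, expected_rate=EXPECTED_RATE_HZ):
    """Return the duration of a PCM WAV clip, checking its sample rate first."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        if rate != expected_rate:
            raise ValueError(f"{path}: expected {expected_rate} Hz, got {rate} Hz")
        return wav.getnframes() / rate


def write_silence(path, seconds, rate=EXPECTED_RATE_HZ):
    """Write a mono 16-bit silent WAV file, used here only as a stand-in clip."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit PCM
        wav.setframerate(rate)
        wav.writeframes(b"\x00\x00" * int(seconds * rate))
```

Summing `clip_duration_seconds` over every clip and dividing by 3600 gives a total in hours that can be compared against the 241 hours quoted above. If the audio is distributed in a compressed format instead, a decoding library would be needed in place of `wave`.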

Strengths

Large volume of 241 total hours of synthesized speech audio.
High-quality 24000 Hz sample rate audio described as 'super clean'.
Text sources are clearly defined as Malay Wikipedia, News, and Malaysian Parliament transcripts.
Sentence length is controlled, ranging from 2 to 20 words.

Limitations

All audio is synthesized from a single neural voice (ms-MY-YasminNeural), lacking natural human speaker variation.
The dataset consists solely of machine-generated speech, which may not capture all nuances of natural human speech.
The time period covered by the source texts and their original publication dates are not specified.

Provenance

Source: Texts sourced from Malay Wikipedia, News articles, and Malaysian Parliament transcripts.
Collection Method: Audio synthesized using Microsoft Azure's ms-MY-YasminNeural text-to-speech neural voice.
Freshness: Last updated on the platform in April 2022.
Geography: Malaysia, focusing on the Malay language.

Associated code and notebooks are hosted in a separate GitHub repository (malaya-speech). Consult the dataset page on Hugging Face for the full description and any license information.
