This dataset aggregates crowdsourced speech recordings and transcriptions for over 20 listed languages, including Abkhaz, Basaa, and Cantonese. It is an unofficial conversion of the Mozilla Common Voice Corpus 16.0 and provides paired audio and text data for multilingual speech technology development.
Use Cases
- Train automatic speech recognition (ASR) models using the audio recordings and their associated text transcriptions
- Build language identification models to classify audio samples into specific language categories like Breton or Bulgarian
- Perform phonetic analysis across regional variants by comparing the audio features of Chinese (Hong Kong) and Chinese (Taiwan) samples
- Develop text-to-speech (TTS) systems using the sentence strings as input and the audio clips as ground truth
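
The use cases above all start from the same paired records. The following is a minimal sketch of how those records might be turned into ASR pairs and language identification labels. The field names `audio_path`, `sentence`, and `locale`, the file paths, and the placeholder sentences are assumptions made for illustration; the actual schema of this conversion may differ and should be checked before use.

```python
# Hypothetical records: field names, paths, and sentences are illustrative
# assumptions, not the confirmed schema of this conversion.
records = [
    {"audio_path": "clips/br_0001.mp3", "sentence": "placeholder sentence one", "locale": "br"},
    {"audio_path": "clips/bg_0001.mp3", "sentence": "placeholder sentence two", "locale": "bg"},
    {"audio_path": "clips/br_0002.mp3", "sentence": "   ", "locale": "br"},
]


def asr_pairs(rows):
    """Return (audio_path, transcript) pairs, skipping rows with blank text."""
    pairs = []
    for row in rows:
        text = row["sentence"].strip()
        if text:
            pairs.append((row["audio_path"], text))
    return pairs


def language_id_examples(rows):
    """Return (audio_path, integer_label) pairs and the locale-to-label map."""
    # Sort the locales so the label assignment is stable across runs.
    label_map = {loc: i for i, loc in enumerate(sorted({r["locale"] for r in rows}))}
    examples = [(r["audio_path"], label_map[r["locale"]]) for r in rows]
    return examples, label_map


pairs = asr_pairs(records)
examples, label_map = language_id_examples(records)
```

For TTS, the same pairs could be read in the opposite direction, with the transcript as input and the clip as the target. Rows with blank transcripts are dropped only for the ASR pairs, since language identification needs the audio and locale but not the text.
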
Strengths
- Covers a wide array of languages including Amharic, Armenian, Assamese, and Central Kurdish
- Provides paired audio clips and text transcriptions sourced from the Mozilla Common Voice project
- Includes specific regional variants such as Chinese (China), Chinese (Hong Kong), and Chinese (Taiwan)
- Contains data for low-resource languages like Basaa, Abkhaz, and Chuvash