Name: Common Voice 22.0: Multilingual Speech Corpus for ASR in 20+ Languages
Creator: fsicoli
Published: 2025-06-27T20:01:16
Keywords: Languagecy, Languagecnh, Languageckb, Languagear, Languagebr, Languageca, Languagecv, Languagebn, Languagebg, Languagecs, Languageab, Languagebe, Languageaz, Languageas, Languageast, Languageam, Languagebas, Task Categoriesautomatic Speech Recognition, Languageaf, Languageba

Description

Mozilla Common Voice Corpus 22.0 is a multilingual speech dataset of audio recordings paired with text transcriptions. This version is an unofficial conversion of the Mozilla data, published by fsicoli and last updated on August 11, 2025. It covers dozens of languages, including Arabic, Bengali, and Chinese.

Use Cases

Training automatic speech recognition (ASR) models on the paired audio and transcriptions, as indicated by the 'Task Categoriesautomatic Speech Recognition' tag
Language identification for low-resource languages, using language tags such as 'Languagebas' (Basaa) and 'Languagecv' (Chuvash)
Cross-language comparative research using samples from unrelated languages such as 'Languagecnh' (Hakha Chin) and 'Languageas' (Assamese)
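
The tags quoted above appear in this listing with the prefix fused to the value ('Languagebas' rather than a separate prefix and code). A minimal Python sketch for splitting them back into language codes, assuming every language keyword is the literal prefix 'Language' followed by a lowercase language code. The function name parse_language_tags is hypothetical and not part of any library.

```python
def parse_language_tags(keywords: str) -> list[str]:
    """Extract language codes from a comma-separated keyword string.

    Assumes each language keyword is 'Language' immediately followed by
    a lowercase code (e.g. 'Languagecv'); other keywords are skipped.
    """
    prefix = "Language"
    codes = []
    for raw in keywords.split(","):
        tag = raw.strip()
        suffix = tag[len(prefix):]
        # Keep only tags of the form 'Language' + lowercase letters.
        if tag.startswith(prefix) and suffix.isalpha() and suffix.islower():
            codes.append(suffix)
    return sorted(codes)


keywords = "Languagebas, Languagecv, Task Categoriesautomatic Speech Recognition, Languagecnh"
print(parse_language_tags(keywords))
```

The task-category keyword is dropped because it does not start with the 'Language' prefix, so only the language codes remain.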

Strengths

Includes rare and low-resource languages such as Abkhaz, Basaa, and Chuvash
Covers over 20 distinct language tags including Amharic and Bashkir
Derived from the Mozilla Common Voice project, a widely used source of open speech data

Limitations

Unofficial distribution, which may lack the metadata or validation of the official Mozilla release
The record count and file size are not listed, which makes it difficult to assess storage requirements before download
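
Because the record count and file size are not given here, a rough storage estimate has to be built from numbers gathered elsewhere. A minimal sketch of the arithmetic, assuming compressed audio at a constant bitrate. The function name is hypothetical, and the clip count, average duration, and bitrate in the usage lines are placeholder values, not figures for this dataset.

```python
def estimate_audio_bytes(num_clips: int, mean_seconds: float, bitrate_kbps: float) -> int:
    """Approximate on-disk size of constant-bitrate audio clips.

    size = clips * seconds per clip * (kilobits per second * 1000 / 8) bytes
    """
    bytes_per_second = bitrate_kbps * 1000 / 8
    return int(num_clips * mean_seconds * bytes_per_second)


# Placeholder inputs (assumptions): 10,000 clips, 5 s each, 48 kbps.
size = estimate_audio_bytes(10_000, 5.0, 48)
print(f"{size / 1e9:.2f} GB")
```

The estimate covers audio only; transcripts, metadata files, and any archive overhead would add to it.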

Provenance

Source: Mozilla Common Voice (https://commonvoice.mozilla.org/)
Collection Method: Downloaded and converted from the Mozilla Common Voice project website
Freshness: Updated as of August 11, 2025.
Geography: Global

This is an unofficial version of the Mozilla Common Voice Corpus 22.0; users should verify the license and data integrity against the official Mozilla release before use.
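
One way to check data integrity is to compare file checksums against values obtained from a source you trust. A minimal sketch using Python's standard hashlib module. The helper names are hypothetical, and this listing does not say whether the official release publishes checksums, so the expected digest is something you would need to obtain yourself.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        # Read fixed-size chunks so large archives never sit fully in memory.
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path: Path, expected_hex: str) -> bool:
    """Compare a file's digest with an expected value, ignoring case."""
    return sha256_of_file(path) == expected_hex.strip().lower()
```

Calling matches_expected on each downloaded file with its trusted digest returns False for any file that differs from the reference copy.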
