Name: Common Voice 18.0: Bilingual Welsh and English Speech Recordings
Creator: techiaith
Published: 2024-08-19T20:15:12
Keywords: size_categories:10K<n<100K, Common Voice, language:cy, library:polars, library:dask, English, language:en, license:cc0-1.0, modality:text, library:mlcroissant, library:datasets, Text, Parquet, Multilingual, English Language, Audio, Welsh Language, region:us, task_categories:automatic-speech-recognition, Welsh, Multilingual Audio, Speech Recognition

Description

A bilingual dataset combining equal numbers of Welsh and English speech recordings from Common Voice version 18. The Welsh recordings were sourced from the train_all and other_with_excluded splits of the Welsh Common Voice corpus. The same number of recordings was taken from the official English Common Voice version 18 training set, prioritizing those tagged with a British Isles accent. The dataset was created by 'techiaith' and last updated on 2024-11-06.
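
The balancing step described above can be sketched roughly as follows. This is a minimal illustration, not the authors' actual pipeline: the `accent` field name, the accent label values, and both function names are hypothetical, and rows are assumed to be plain dictionaries.

```python
# Hypothetical accent labels; the real tag values used upstream are not documented here.
PRIORITY_ACCENTS = {"wales", "scotland", "england", "ireland"}


def select_english(english_rows, n_welsh, priority_accents=PRIORITY_ACCENTS):
    """Pick n_welsh English rows, taking priority-accent rows first."""
    preferred = [r for r in english_rows if r.get("accent") in priority_accents]
    others = [r for r in english_rows if r.get("accent") not in priority_accents]
    # Priority rows fill the quota first; other rows are used only if needed.
    return (preferred + others)[:n_welsh]


def build_bilingual(welsh_rows, english_rows):
    """Combine all Welsh rows with an equal number of English rows."""
    welsh_rows = list(welsh_rows)
    english = select_english(list(english_rows), len(welsh_rows))
    return welsh_rows + english
```

If fewer English rows exist than Welsh rows, this sketch simply returns all of them, so the result would no longer be balanced; a real pipeline would need to decide how to handle that case.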

Use Cases

Train or fine-tune bilingual speech recognition models on the Welsh and English recordings.
Develop accent detection or classification models, drawing on the English recordings prioritized for British Isles accents.
Compare speech patterns between Welsh and English using the equally sized Welsh and English subsets.
Create or augment training data for Welsh speech synthesis systems.

Strengths

Provides a balanced combination of Welsh and English speech recordings.
Prioritizes English recordings with specific British Isles accents (e.g., Welsh, Scottish).
Sourced from the established Common Voice version 18 corpus.

Limitations

Exact row count and column names are not documented here, which may limit suitability assessment; the keyword tags suggest Parquet files and between 10K and 100K rows.
Column-level documentation is absent; field semantics must be inferred after download.
Description metadata is limited; actual data quality requires manual inspection after download.
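
Because the columns are undocumented, a first inspection pass after download might look like the sketch below. It assumes the rows have already been loaded as dictionaries by whichever library is in use; the function name and the sample field names are hypothetical and not taken from the dataset.

```python
from collections import defaultdict


def summarize_fields(rows):
    """Map each field name to the sorted list of Python type names observed."""
    seen = defaultdict(set)
    for row in rows:
        for key, value in row.items():
            seen[key].add(type(value).__name__)
    return {key: sorted(types) for key, types in seen.items()}


# Illustrative rows only; real field names must be checked after download.
sample = [
    {"sentence": "Bore da", "accent": None},
    {"sentence": "Good morning", "accent": "wales"},
]
summary = summarize_fields(sample)
```

Fields that report more than one type (for example, both a string and a missing value) are a useful starting point for deciding how to handle gaps before training.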

Provenance

Source: Common Voice version 18 corpus.
Collection Method: Combined recordings from Welsh and English splits of the Common Voice corpus.
Time Range: Coverage corresponds to Common Voice version 18.
Freshness: Last updated 2024-11-06 13:50:01.
Geography: Includes Welsh language data and English data prioritized for British Isles accents.

License: the keyword tags indicate CC0 1.0 (cc0-1.0); this should be verified on the dataset page before use.
