Name: X Voice Dataset Train: A Multilingual Speech Corpus for Model Training
Creator: XRXRX
Published: 2026-04-08T09:46:42
Keywords: Multilingual, Audio, Training

Description

XRXRX aggregated this multilingual speech dataset from seven distinct sources, including Multilingual LibriSpeech, VoxPopuli, and GigaSpeech 2. The collection was last updated on April 12, 2026. Each constituent dataset retains its own license, with most permitting commercial use.

Use Cases

Train multilingual automatic speech recognition (ASR) models on the aggregated source corpora.
Fine‑tune text‑to‑speech (TTS) systems using the aggregated voice data.
Develop speech synthesis models leveraging the combined audio corpus.
Benchmark speech model performance across different languages and accents represented in the source datasets.

Strengths

Aggregates data from seven established speech datasets, providing a broad base.
Most listed source datasets explicitly permit commercial use under licenses like CC BY 4.0.
The dataset page was updated on 2026‑04‑12, four days after publication, suggesting active maintenance.

Limitations

Column‑level documentation is absent; field semantics must be inferred after download.
Row count and total size are not stated, which makes it hard to assess suitability before download.
License terms are not uniform; users must verify compliance for each sub‑dataset used.
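
Because column-level documentation is absent, one way to infer field semantics after download is to summarize a small sample of records. The sketch below is illustrative only: it assumes the records have already been loaded as a list of Python dictionaries, and the field names in the sample (audio_path, text, lang) are hypothetical, not taken from the dataset.

```python
from collections import defaultdict


def summarize_fields(records):
    """Return, per field name, the value types seen, a count, and one example."""
    summary = defaultdict(lambda: {"types": set(), "count": 0, "example": None})
    for record in records:
        for name, value in record.items():
            entry = summary[name]
            entry["types"].add(type(value).__name__)
            entry["count"] += 1
            # Keep the first non-null value as a representative example.
            if entry["example"] is None and value is not None:
                entry["example"] = value
    return dict(summary)


# Hypothetical sample rows; the real field names are undocumented.
sample = [
    {"audio_path": "a.wav", "text": "hello", "lang": "en"},
    {"audio_path": "b.wav", "text": None, "lang": "de"},
]

for name, info in summarize_fields(sample).items():
    print(name, sorted(info["types"]), info["count"], info["example"])
```

Fields that show mixed types or a NoneType entry are the ones most worth checking by hand before training.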

Provenance

Source: Aggregated from seven sources: Multilingual LibriSpeech, Emilia, LEMAS, VoxPopuli, Granary (MOSEL Part), GigaSpeech 2, and Reazon Speech.
Collection Method: Likely a curated compilation of existing public speech datasets.
Freshness: Last updated 2026‑04‑12 16:31:54; check the dataset page for later revisions before use.

License restrictions vary by sub‑dataset; users must comply with the license of each individual sub‑dataset they use. GigaSpeech 2 requires a separate license agreement.
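
Since compliance has to be checked per sub‑dataset, a simple checklist can track which sources have been reviewed. The following is a minimal sketch, not part of the dataset: the source names come from the Provenance entry above, while the function names and the checklist structure are assumptions. Every source starts as not cleared because this page does not state the license of each individual source.

```python
# Source names as listed under Provenance. License values are left empty on
# purpose: they must be filled in from each source's own license page.
SOURCES = [
    "Multilingual LibriSpeech",
    "Emilia",
    "LEMAS",
    "VoxPopuli",
    "Granary (MOSEL Part)",
    "GigaSpeech 2",
    "Reazon Speech",
]


def new_checklist():
    """Create a checklist in which no source is cleared yet."""
    return {name: {"license": None, "cleared": False} for name in SOURCES}


def record_review(checklist, source, license_name, cleared):
    """Store the outcome of a manual license review for one source."""
    if source not in checklist:
        raise KeyError(f"unknown source: {source}")
    checklist[source] = {"license": license_name, "cleared": bool(cleared)}


def usable_sources(checklist):
    """Sources whose license has been reviewed and cleared for the intended use."""
    return [name for name, entry in checklist.items() if entry["cleared"]]


def pending_sources(checklist):
    """Sources that still need review, or that were reviewed and not cleared."""
    return [name for name, entry in checklist.items() if not entry["cleared"]]


checklist = new_checklist()
print(pending_sources(checklist))  # all seven sources until reviews are recorded
```

Filtering training data to usable_sources(checklist) keeps unreviewed material, such as GigaSpeech 2 before its separate agreement is in place, out of a training run.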
