Description
XRXRX aggregated this multilingual speech dataset from seven distinct sources, including Multilingual LibriSpeech, VoxPopuli, and GigaSpeech 2. The collection was last updated on April 12, 2026. Each constituent dataset retains its own license, with most permitting commercial use.
Use Cases
Train multilingual automatic speech recognition (ASR) models on the aggregated multilingual sources.
Fine‑tune or train text‑to‑speech (TTS) and other speech synthesis models on the combined audio corpus.
Benchmark speech model performance across different languages and accents represented in the source datasets.
Strengths
Aggregates data from seven established speech datasets, providing a broad base.
Most listed source datasets explicitly permit commercial use under licenses like CC BY 4.0.
The dataset page was updated on 2026‑04‑12, suggesting recent maintenance.
Limitations
Column‑level documentation is absent; field semantics must be inferred after download.
Row count and total size are unknown, which makes it hard to assess suitability before download.
License terms are not uniform; users must verify compliance for each sub‑dataset used.
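Because column‑level documentation is absent, a first pass after download can simply tabulate which fields appear and what types they hold. The following is a minimal sketch under stated assumptions: the field names (`audio_path`, `text`, `duration`, `language`) and the sample rows are hypothetical placeholders, not the dataset's actual schema, and the records are assumed to be already loaded as Python dictionaries.

```python
from collections import defaultdict


def infer_schema(records):
    """Map each field name to the sorted list of Python type names observed."""
    seen = defaultdict(set)
    for record in records:
        for field, value in record.items():
            seen[field].add(type(value).__name__)
    return {field: sorted(types) for field, types in seen.items()}


def fields_missing_somewhere(records):
    """Return the fields that are absent from at least one record."""
    all_fields = set()
    for record in records:
        all_fields.update(record)
    return sorted(f for f in all_fields if any(f not in r for r in records))


# Hypothetical sample rows; the real field names are undocumented.
sample = [
    {"audio_path": "clip_0001.wav", "text": "hello", "duration": 1.5, "language": "en"},
    {"audio_path": "clip_0002.wav", "text": "bonjour", "duration": None},
]

schema = infer_schema(sample)
missing = fields_missing_somewhere(sample)
print(schema)
print(missing)
```

Running this on a few hundred rows per sub‑dataset is usually enough to spot fields that are optional, nullable, or named differently across sources.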
Provenance
Source
Aggregated from seven sources: Multilingual LibriSpeech, Emilia, LEMAS, VoxPopuli, Granary (MOSEL Part), GigaSpeech 2, and Reazon Speech.
Collection Method
Likely a curated compilation of existing public speech datasets.
Freshness
Last updated 2026‑04‑12 16:31:54 (time zone not stated); confirm this date against the dataset page before relying on it.
License
License terms vary by sub‑dataset; users must comply with the license of each sub‑dataset they use. GigaSpeech 2 requires a separate license agreement.
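One way to keep per‑sub‑dataset compliance visible is a small checklist that starts with every source unverified. This is a sketch only: `LicenseEntry`, `build_checklist`, and `unresolved` are hypothetical helper names, no license names are filled in because they must be confirmed upstream, and the only fact encoded is the one stated above, that GigaSpeech 2 requires a separate agreement.

```python
from dataclasses import dataclass
from typing import Optional

# The seven source names as listed on this page.
SOURCES = [
    "Multilingual LibriSpeech", "Emilia", "LEMAS", "VoxPopuli",
    "Granary (MOSEL Part)", "GigaSpeech 2", "Reazon Speech",
]


@dataclass
class LicenseEntry:
    source: str
    license_name: Optional[str] = None      # None until verified upstream
    needs_separate_agreement: bool = False
    agreement_in_place: bool = False


def build_checklist():
    """Create one unverified entry per source."""
    checklist = {name: LicenseEntry(source=name) for name in SOURCES}
    # Stated on this page: GigaSpeech 2 requires a separate license agreement.
    checklist["GigaSpeech 2"].needs_separate_agreement = True
    return checklist


def unresolved(checklist, used_sources):
    """Return the used sources whose license status still blocks use."""
    pending = []
    for name in used_sources:
        entry = checklist[name]
        license_unknown = entry.license_name is None
        agreement_missing = (
            entry.needs_separate_agreement and not entry.agreement_in_place
        )
        if license_unknown or agreement_missing:
            pending.append(name)
    return pending


checklist = build_checklist()
used = ["VoxPopuli", "GigaSpeech 2"]
pending_before = unresolved(checklist, used)

# Placeholder value: record the actual license name after checking upstream.
checklist["VoxPopuli"].license_name = "<license confirmed upstream>"
pending_after = unresolved(checklist, used)
print(pending_before, pending_after)
```

Recording a license name alone does not clear GigaSpeech 2 in this sketch; the entry stays pending until `agreement_in_place` is also set.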