Name: CommonVoice 22 Sidon: Denoised 48 kHz Audio for 137 Languages
Creator: sarulab-speech
Published: 2025-10-05T15:31:53
Keywords: Task category: text-to-speech; Languages: cy, cnh, ckb, ar, br, ca, cv, bn, bg, cs, ab, be, az, as, ast, am, bas, af, ba

Description

Mozilla Common Voice 22.0 audio restored using the Sidon denoising model (sarulab-speech/sidon-v0.1) at 48 kHz. Released by sarulab-speech in October 2025, this collection spans 137 languages processed into 21-second chunks. The data is formatted as WebDataset shards for efficient streaming and large-scale training.

Use Cases

Training text-to-speech (TTS) systems using the 48 kHz reconstructed audio samples
Multilingual automatic speech recognition (ASR) across 137 language folders
Benchmarking speech enhancement models against Sidon-denoised outputs

Strengths

High-resolution 48 kHz audio reconstruction
Coverage of 137 distinct languages
Standardized WebDataset shard format for high-throughput loading

Limitations

Potential algorithmic artifacts introduced by the Sidon restoration model
Fixed 21-second chunking may split natural speech segments or sentences
Inherits any transcription errors present in the original Mozilla Common Voice 22.0 source
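
To make the chunking limitation concrete, the sketch below computes where fixed 21-second boundaries fall at 48 kHz. It assumes contiguous, non-overlapping windows with a shorter final chunk; the card does not state how the actual pipeline handles overlaps or remainders, and the function name chunk_bounds is hypothetical.

```python
SAMPLE_RATE = 48_000   # Hz, as stated in the card
CHUNK_SECONDS = 21     # chunk length, as stated in the card


def chunk_bounds(num_samples, sample_rate=SAMPLE_RATE, chunk_seconds=CHUNK_SECONDS):
    """Return (start, end) sample indices for fixed-length chunks.

    Assumption: windows are contiguous and non-overlapping, and the last
    chunk keeps whatever samples remain. The real pipeline may differ.
    """
    chunk_len = sample_rate * chunk_seconds  # 1,008,000 samples per chunk
    return [
        (start, min(start + chunk_len, num_samples))
        for start in range(0, num_samples, chunk_len)
    ]


if __name__ == "__main__":
    # A 50-second recording is cut at 21 s and 42 s, regardless of where
    # words or sentences happen to end.
    for start, end in chunk_bounds(50 * SAMPLE_RATE):
        print(start / SAMPLE_RATE, end / SAMPLE_RATE)
```

Because the boundaries depend only on the sample count, any utterance that spans the 21-second or 42-second mark would be split across two chunks under this assumption.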

Provenance

Source: Mozilla Common Voice 22.0
Collection Method: Algorithmic restoration and denoising of crowdsourced audio using the Sidon-v0.1 model
Freshness: Last updated October 2025.
Geography: Global (137 languages)

The dataset is distributed as WebDataset shards (.tar.gz) and requires the included paths.yaml manifest for Hugging Face-style loading. It is licensed under CC0 1.0, matching the original source.
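
As a rough illustration of the shard layout, the following sketch reads a gzip-compressed tar shard with the Python standard library only and groups files into samples by the WebDataset naming convention (files whose names match up to the first dot of the basename belong to one sample). The function names and the file extensions used in the usage note are hypothetical; the card does not list the actual member names, and in practice a dedicated loader driven by the paths.yaml manifest would normally be used instead.

```python
import io
import tarfile
from typing import Dict, Iterator, Optional, Tuple


def split_key(name: str) -> Optional[Tuple[str, str]]:
    """Split 'dir/abc.flac' into ('dir/abc', 'flac') at the first dot of the basename."""
    head, _, base = name.rpartition("/")
    stem, dot, ext = base.partition(".")
    if not dot:
        return None  # no extension, so not a sample component
    key = f"{head}/{stem}" if head else stem
    return key, ext


def iter_samples(fileobj) -> Iterator[Dict[str, bytes]]:
    """Yield one dict per sample from a .tar.gz shard opened as a binary file object.

    Assumes, as WebDataset does, that all files of a sample are stored
    consecutively in the archive.
    """
    current_key = None
    sample: Dict[str, bytes] = {}
    # "r|gz" reads the archive as a forward-only stream, which suits large shards.
    with tarfile.open(fileobj=fileobj, mode="r|gz") as tar:
        for member in tar:
            if not member.isfile():
                continue
            parts = split_key(member.name)
            if parts is None:
                continue
            key, ext = parts
            if key != current_key and sample:
                yield sample
                sample = {}
            current_key = key
            if not sample:
                sample["__key__"] = key.encode("utf-8")
            sample[ext] = tar.extractfile(member).read()
    if sample:
        yield sample
```

With a local shard opened via open(path, "rb"), each yielded dict would map extensions (for example a hypothetical "flac" or "txt") to raw bytes, with the sample key stored under "__key__". Decoding the audio bytes is left to whichever audio library is preferred.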
