Mozilla Common Voice 22.0 audio restored using the Sidon denoising model (sarulab-speech/sidon-v0.1) at 48 kHz. Released by sarulab-speech in October 2025, this collection spans 137 languages processed into 21-second chunks. The data is formatted as WebDataset shards for efficient streaming and large-scale training.
The shards are gzip-compressed tar archives (.tar.gz), and Hugging Face-style loading requires the included paths.yaml manifest. The dataset is licensed under CC0 1.0, matching the original Common Voice source.
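Because a WebDataset shard is an ordinary tar archive, a downloaded shard can be inspected with the Python standard library alone. The sketch below is illustrative, not the dataset's official loader: it groups archive members into samples using the WebDataset naming convention, where files that share everything before the first dot of the basename belong to one sample. The function names `split_key` and `iter_samples` are hypothetical, and the source does not say which file extensions the shards contain or how paths.yaml is laid out, so neither is assumed here.

```python
import tarfile
from typing import Dict, Iterator, Tuple


def split_key(name: str) -> Tuple[str, str]:
    """Split a tar member name into (sample key, extension).

    WebDataset convention: the key is the path up to the first dot
    of the basename; the extension is everything after that dot.
    """
    directory, _, base = name.rpartition("/")
    stem, _, ext = base.partition(".")
    key = f"{directory}/{stem}" if directory else stem
    return key, ext


def iter_samples(shard_path: str) -> Iterator[Tuple[str, Dict[str, bytes]]]:
    """Yield (key, {extension: raw bytes}) for each sample in a .tar.gz shard.

    Assumes members of one sample are stored consecutively, which is
    how WebDataset shards are normally written.
    """
    current_key = None
    files: Dict[str, bytes] = {}
    with tarfile.open(shard_path, mode="r:gz") as tar:
        for member in tar:
            if not member.isfile():
                continue  # skip directory entries and links
            key, ext = split_key(member.name)
            if key != current_key:
                if current_key is not None:
                    yield current_key, files
                current_key, files = key, {}
            files[ext] = tar.extractfile(member).read()
    if current_key is not None:
        yield current_key, files
```

Calling `iter_samples` on a local shard path and printing each key with `sorted(files)` is a quick way to see which fields a sample carries before wiring the shards into a training pipeline.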