Speech recognition, text-to-speech, speaker identification, music classification, audio event detection
1,943 datasets
Visual novel audio recordings paired with transcriptions and Gemini 2.5 Pro generated captions. The collection includes descriptive metadata tags such as emotion, speaker profile, and style to facilitate controllable speech synthesis.
Treble10-Speech is an automatic speech recognition (ASR) dataset featuring 16 kHz audio files generated by convolving LibriSpeech data with high-fidelity room-acoustic simulations. Created by Treble Technologies and updated in November 2025, the collection includes between 1,000 and 10,000 records across 10 distinct furnished room environments. The dataset provides speech samples with reverberation times ranging from 0.17 to 0.84 seconds.
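Convolving clean speech with a room impulse response, as this entry describes, can be sketched in a few lines. The example below is only an illustration under stated assumptions: the impulse response is a toy exponentially decaying noise burst rather than a high-fidelity room simulation, the input is random noise standing in for LibriSpeech audio, and the function names `synthetic_rir` and `add_reverb` are hypothetical and do not come from the dataset's tooling.

```python
import numpy as np

SAMPLE_RATE = 16_000  # Hz, matching the rate stated for the dataset


def synthetic_rir(rt60_s, sample_rate=SAMPLE_RATE, seed=0):
    """Toy impulse response: white noise whose envelope reaches -60 dB at rt60_s."""
    rng = np.random.default_rng(seed)
    n = int(rt60_s * sample_rate)
    t = np.arange(n) / sample_rate
    # Amplitude falls by 60 dB (a factor of 1000) over rt60_s seconds.
    envelope = 10.0 ** (-3.0 * t / rt60_s)
    rir = rng.standard_normal(n) * envelope
    rir[0] = 1.0  # direct-path impulse
    return rir


def add_reverb(dry, rir):
    """Convolve dry speech with an impulse response, keeping the original length."""
    wet = np.convolve(dry, rir)[: len(dry)]
    peak = np.max(np.abs(wet))
    # Peak-normalize so the reverberant signal stays within [-1, 1].
    return wet / peak if peak > 0 else wet


# Stand-in for one second of speech (hypothetical signal, not LibriSpeech audio).
dry = np.random.default_rng(1).standard_normal(SAMPLE_RATE)
wet = add_reverb(dry, synthetic_rir(rt60_s=0.5))
```

The `rt60_s` value of 0.5 is an arbitrary choice inside the 0.17 to 0.84 second range quoted for the dataset.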
Mozilla Common Voice 22.0 audio restored using the Sidon denoising model (sarulab-speech/sidon-v0.1) at 48 kHz. Released by sarulab-speech in October 2025, this collection spans 137 languages processed into 21-second chunks. The data is formatted as WebDataset shards for efficient streaming and large-scale training.
MusicBench is a music audio-text pair dataset designed for text-to-music generation. It expands the MusicCaps dataset from 5,521 to 52,768 training and 400 test samples. The dataset was released by amaai-lab in March 2025 alongside the Mustango model.
218.2 hours of transcribed Turkish speech across 186,171 utterances. The collection supports research in multilingual speech recognition for Turkic languages and is hosted via the IS2AI GitHub repository.
A dataset of 20,000 audio files, split evenly between AI-generated and human-composed music. The AI-generated portion comprises 256 files from SunoCaps, 4,872 from Udio, and 4,872 from MusicSet. It was created by SleepyJesse and last updated on December 11, 2024.
A refined subset of the Mozilla Common Voice corpus containing only Uzbek language voice recordings. The dataset has been cleaned and normalized, with a text field added, to improve usability for training automatic speech recognition models. It was created by user 'yakhyo' and last updated on April 15.
7,418 professionally curated samples link film clips with high-quality music, visual descriptions, and main melodies. Proposed in the FilmComposer project, this dataset aims to advance research in music production and video-to-music generation. Author apple-jun uploaded it to Hugging Face on April 27, 2025.
20.33 hours of high-quality Tamil speech recordings from male and female speakers, with corresponding text transcriptions. The dataset, created by SPRINGLab, is derived from the Indic TTS Database project and is suitable for text-to-speech model development.
Myanmar language text corpus designed to address the lack of large-scale, openly accessible resources for Myanmar Natural Language Processing. The dataset is tailored to support tasks like text-to-speech and automatic speech recognition. It was created by author 'freococo' and last updated on May 22, 2025.
A collection of Tunisian dialect audio and corresponding annotations for automatic speech recognition tasks. The dataset was created by Linagora to train their Linto Tunisian dialect STT model, with the first packaged version released in 2023.
VieNeu-TTS-140h contains 74,858 Vietnamese audio samples and phonemized transcripts totaling 140 hours of speech data. Developed by pnnbao-ump and updated in late 2024, the collection was sourced from YouTube and refined through a pipeline involving Whisper-large-v3 transcription and human-in-the-loop correction.
Most songs collected are love songs, touching on themes of nostalgia and saudade as well as lively dances. The collection process involved interviewing people and learning about their lives through songs linked to agricultural work and annual cycles. The dataset was coordinated by Álvarez Pérez, Xosé Afonso and last updated in May 2024.
A Tatoeba-based speech corpus for Northern Kurdish (Kurmanji) containing only a test split. It is intended for automatic speech recognition (ASR) and speech-to-text translation (S2TT) evaluation. The dataset was created by aranemini and last updated on August 13, 2025.
HumynLabs provides an audio dataset of a single, standardized customer support query in English. The dataset is intended for training and evaluating models in speech recognition and emotional tone analysis. It was last updated on Hugging Face on July 28, 2025.
585 hours of 24 kHz English speech make up this multi-speaker corpus, derived from LibriVox audiobooks and Project Gutenberg texts. Heiga Zen and Google Speech/Brain team members prepared the dataset specifically for TTS research. The dataset card was last updated in February 2024.
Taiwanese Mandarin speech recordings from 203 individuals in Taiwan, including 137 females and 66 males. The data was collected in a quiet indoor environment via mobile phone guiding. It is a sample of a larger paid dataset intended for speech and language model training.
VibraVox contains between 10,000 and 100,000 French speech recordings captured using body-conduction transducers. Developed by Cnam-LMSSC and documented in arXiv:2407.11828, this dataset provides a specialized audio-text corpus for speech processing research. It includes expert-generated and crowdsourced annotations for various audio-centric machine learning tasks.
The eastern North Pacific, about 500 km from Vancouver Island, was the site of the Acoustic Surface Reverberation Experiment in 1991-1992. The Upper Ocean Processes Group deployed moorings to measure oceanographic variables, with this dataset likely containing temperature readings. Data was collected over 9.5 weeks during the winter of 1991-1992 at a sampling interval of 7.5 minutes.
Ming030890's dataset contains Cantonese audio-caption pairs sourced from YouTube videos with manually provided captions. It was built by re-transcribing audio with SenseVoice and filtering segments to create a collection supporting ASR development. The dataset includes segments where ASR output matches original captions and segments with homophone or English word differences.
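The caption-versus-ASR comparison described in this entry can be illustrated with a rough sketch. Everything here is an assumption made for illustration: the segment dictionaries, the `normalize` and `split_by_agreement` helpers, and the sample strings are hypothetical, the ASR text is supplied by hand rather than produced by SenseVoice, and the homophone and English-word handling mentioned above is not implemented.

```python
import unicodedata


def normalize(text):
    """Drop punctuation, symbols, and whitespace; lowercase any Latin letters."""
    kept = [
        ch for ch in unicodedata.normalize("NFKC", text)
        if unicodedata.category(ch)[0] not in ("P", "S", "Z", "C")
    ]
    return "".join(kept).lower()


def split_by_agreement(segments):
    """Partition segments by whether the ASR text equals the caption after normalization."""
    matched, differing = [], []
    for seg in segments:
        same = normalize(seg["caption"]) == normalize(seg["asr"])
        (matched if same else differing).append(seg)
    return matched, differing


# Hypothetical segments: the first differs only in punctuation,
# the second differs in one character.
segments = [
    {"id": "a", "caption": "你好,世界!", "asr": "你好世界"},
    {"id": "b", "caption": "我哋去食飯", "asr": "我地去食飯"},
]
matched, differing = split_by_agreement(segments)
```

In a fuller pipeline, the `differing` list would be the place to check whether a mismatch is a homophone or an English word before deciding to keep the segment.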