Speech recognition, text-to-speech, speaker identification, music classification, audio event detection
1,962 datasets
MLCommons provides over one million hours of English audio extracted from Archive.org for unsupervised speech research. The collection features a diverse set of speakers and is available under CC-BY and CC-BY-SA licenses for academic and commercial applications. It was last updated in February 2025 to support large-scale speech model development.
AVSRCocktail is an audio-visual speech recognition system designed for cocktail party scenarios. The model, authored by nguyenvulebinh, combines lip reading and audio processing to handle background noise and speaker interference. It was last updated on July 7, 2025.
A 2025 collection of audio datasets compiled by XiaomiMiMo for the MiMo-Audio-Eval toolkit. It includes datasets such as AISHELL1, LibriSpeech, and SeedTTS, covering automatic speech recognition, text-to-speech, and audio understanding tasks.
74 spoken languages and American Sign Language are covered in this comprehension dataset. It is an extension of the Belebele text dataset, built by aligning Belebele, Flores200, and Fleurs datasets. The dataset was created by Facebook and last updated on December 17, 2024.
Audio clips and transcriptions of Kalenjin speech sourced from the Mozilla Common Voice project. The dataset was created by author kln001 and last updated on July 28, 2025. It is intended for training and evaluating Automatic Speech Recognition models.
A dataset for Optical Music Recognition (OMR) containing musical scores for 163 unique jazz standards. The scores are provided in both MusicXML and Humdrum **kern formats. The dataset was created by PRAIG and last updated on Hugging Face in September 2025.
5 hours of Turkish audio and text transcripts sourced from over 40 Creative Commons-licensed YouTube videos. The collection features more than 100 distinct speakers with audio resampled to 16 kHz and segmented into clips of up to 25 seconds. It is specifically designed for training and evaluating Turkish speech-to-text models.
This collection of speech processing data, updated in January 2026 by double22a, focuses on automatic speech recognition and synthesis tasks. It includes audio files in .wav format and supports diverse applications such as speech-diarization and voice conversion.
Data for performing Inverse Text Normalization (ITN) in the Vietnamese language, transforming spoken-style text to written forms. It is authored by nguyenvulebinh and was last updated on May 8, 2025.
The Free Music Archive (FMA) provides between 100,000 and 1,000,000 audio tracks and associated metadata for Music Information Retrieval (MIR) research. Developed by Michaël Defferrard and colleagues in 2017, it facilitates tasks like genre classification and music organization.
This database was created by Nordic Language Technology for developing automatic speech recognition and dictation systems for Norwegian. The data was preserved and transferred to the National Library of Norway's Språkbanken in 2011.
Two test subsets from the WenetSpeech corpus, 'test-meeting' and 'test-net', integrated for use within the UltraEval-Audio framework. The framework uses standardized benchmarking scripts to evaluate large audio models across 12 task categories, including speech understanding and generation.
266 hours of Japanese speech, transcribed with Scribe v1 for automatic speech recognition and prefiltered using Facebook's audio aesthetics model. The dataset is derived from the Japanese portion of the Emilia Yodas collection and is licensed under CC BY 4.0. It includes text transcriptions and aesthetic scores for audio events.
ABX-accent provides evaluation items for the Accented English Speech Recognition Challenge (AESRC) dataset. The project uses the fastABX metric to assess speech representations. The repository is authored by coml and was last updated in October 2025.
ReplayDF contains between 100,000 and 1,000,000 audio samples designed to evaluate the impact of physical replay attacks on deepfake detection systems. Created by mueller91 and updated in 2025, it features re-recorded bona-fide and synthetic speech across six languages and 109 speaker-microphone combinations.
Tamazight speech segments, specifically in the Tachelhit dialect, are paired with Modern Standard Arabic transcriptions. The dataset is actively growing with regular updates, as noted on its Hugging Face page. Author SoufianeDahimi last updated the dataset on March 15, 2025.
ChildMandarin is an open-source speech dataset containing Mandarin Chinese audio from young children aged 3 to 5. Created by BAAI, it addresses a lack of public resources for this demographic, enabling research in automatic speech recognition and speaker verification.
12,743 parallel text and speech samples for Moroccan Darija, including transcriptions in Latin and Arabic scripts and English translations. It was created by atlasia to support speech recognition, language modeling, and NLP tasks for Moroccan Darija. The dataset was last updated on February 7, 2025.
Audio samples and associated metadata from the Nekopara visual novel series, covering volumes 0-4 and extra content. The dataset was uploaded by author grider-transwithai and last updated on July 20, 2024. It likely contains audio files at a 44.1 kHz sampling rate with fields for character name and source volume.
Wuthering Waves Game Character Voice Dataset is provided by AI Hobbyist, with final interpretation rights belonging to KUROGAME. The dataset integration code, created by Genius-Society, offers automated search, download, language splitting, and normalization for Python developers. It was last updated on November 4, 2025.