Speech recognition, text-to-speech, speaker identification, music classification, audio event detection
1,940 datasets
Chordonomicon is a large-scale dataset of over 666,000 contemporary music compositions represented symbolically with chords and chord progressions. It was created by ailsntua and includes metadata such as genre, sub-genre, release date, structural information, and Spotify IDs. The dataset was last updated on 2025-05-15.
3,150 audio samples at 24 kHz, created by bosonai and last updated on 2025-07-28. The dataset is designed for evaluating the HiggsTokenizer and contains four subsets: Speech, Music, Sound Event, and Audiophile. The Speech, Music, and Sound Event subsets each contain 1,000 ten-second clips, while the Audiophile subset contains 150 thirty-second high-fidelity clips.
MLCommons provides the People's Speech dataset, a collection of over 30,000 hours of transcribed English speech. This corpus is designed for training large-scale speech-to-text systems and is released under permissive licenses for both academic and commercial applications.
High-quality Gujarati speech recordings with text transcriptions, derived from the Indic TTS Database project. The dataset contains monolingual recordings from both male and female speakers, curated by SPRINGLab. It was last updated on the Hugging Face platform on 2025-01-27.
A 2016 noise classification for land transport infrastructure in France's Sarthe department, approved by a prefectural decree on March 18, 2016. It categorizes roads, railways, and public transport lanes based on generated noise levels to determine zones requiring reinforced building insulation. The dataset was produced by the Bureau de Recherches Géologiques et Minières and last updated in April 2019.
A test set for the NADI-2025 Subtask-2 challenge focused on Automatic Speech Recognition for Arabic across multiple dialects. The dataset is hosted by UBC-NLP and was last updated in July 2025. Participants are required to register for the shared task to access the data.
Data used for analyses in Martínez-Castilla et al. (2023). The dataset contains assessments of rhythm discrimination, melody discrimination, music memory, and language abilities for children with Developmental Language Disorder and typically developing peers. The research found children with DLD exhibited significantly lower performance on all three music subtests.
Amphion released the NVSpeech (Emilia-NV) dataset in 2025, providing between 100,000 and 1,000,000 Mandarin Chinese speech samples. The collection features word-level annotations for 18 categories of paralinguistic vocalizations, including non-verbal sounds and lexicalized interjections.
A collection of audio recordings containing three distinct categories: clean speech, noisy speech, and noise-only samples. The dataset was created by haydarkadioglu and last updated on August 20, 2025. It is designed for research in speech enhancement, noise reduction, and speech recognition.
LibriSpeech is a 1,000-hour corpus of 16 kHz read English speech derived from audiobooks, designed for automatic speech recognition. This version includes alignments generated by the Montreal Forced Aligner (MFA). The dataset was uploaded to Hugging Face by gilkeyio and last updated on November 22, 2023.
MusicEval is a dataset of 2,748 generated music clips with a total duration of 16.62 hours, created by BAAI. It is the first generative music assessment dataset designed to address text-to-music evaluation challenges. The dataset was last updated on August 18, 2025.
MultiMed is a multilingual automatic speech recognition dataset for the medical domain, presented at ACL 2025. The dataset is hosted on Hugging Face by author leduckhai and was last updated on June 1, 2025. It is intended to serve as a foundational resource for downstream applications like speech translation and spoken language understanding.
A derived version of the Technical Indian English (TIE) dataset, which contains approximately 8,000 hours of speech from around 9,800 technical lectures in English. The original content was sourced from the NPTEL platform, with lectures averaging 50 minutes each and delivered by instructors from various regions across India. The dataset was created by author 'raianand' and was last updated on the Hugging Face platform in November 2024.
A benchmark containing approximately 6.52 hours of human-annotated broadcast speech, totaling 8,085 utterances, across 13 distinct domains. It is designed for automatic speech recognition performance evaluation in challenging conditions. The dataset was created by SUST-CSE-Speech and last updated on March 9, 2024.
OpenSound created this dataset for training CapTTS, EmoCapTTS, and AccCapTTS models, as described in the paper 'CapSpeech: Enabling Downstream Applications in Style-Captioned Text-to-Speech'. The dataset was last updated on July 28, 2025. It contains audio-text pairs sourced from multiple original datasets.
Cantonese audio segments and creator-uploaded transcripts extracted from various YouTube channels. The dataset was created by OrcinusOrca and last updated on August 27, 2025. It is intended for training automatic speech recognition models.
InstructTTSEval is a benchmark for evaluating Text-to-Speech systems on complex natural-language instructions. The dataset provides a hierarchical framework with three progressively challenging tasks testing acoustic control and style generalization. It was created by CaasiHUANG and last updated in June 2025.
7 hours of transcribed audio recordings of Chilean Spanish sentences. The dataset was created by author ylacombe from restructured OpenSLR archives and was last updated in November 2023.
Chilean Spanish audio data consisting of 7 hours of transcribed, high-quality sentences recorded by 31 volunteers. The dataset was created by ylacombe and restructured from original OpenSLR archives for easier streaming. It was last updated on November 27, 2023.
443,641 Vietnamese audio samples and corresponding phonemized transcripts totaling 1,000 hours of speech data. The collection is structured for training and fine-tuning high-quality Text-to-Speech (TTS) and Automatic Speech Recognition (ASR) models.