Speech recognition, text-to-speech, speaker identification, music classification, audio event detection
1,909 datasets
A 2026 benchmark from KRAFTON provides 6,000 prompt–text pairs for evaluating zero-shot text-to-speech models. It covers four acoustic regimes (Clean, Noisy, Wild, and Emotional), with prompts drawn from 12 different datasets. The benchmark aims to assess model robustness in realistic and challenging recording scenarios.
A spoken corpus of traditional Mooré folk stories (contes) designed for research in low-resource speech and language processing. The dataset is created by goaicorp for academic purposes and was last updated in April 2026.
A spoken corpus of traditional Mooré folk stories (contes) designed for low-resource language research. The dataset was created by goaicorp for academic purposes and was last updated in April 2026. Access to the data is gated and requires a request.
Ytmusics is a dataset hosted on Hugging Face by NathMen12. The dataset's specific content and structure are not described in the available metadata. It was last updated on 2026-05-18 18:04:45.
A dataset for training Text-to-Speech models, including XTTS_v2, YourTTS, and Tacotron. It contains audio in the LJSpeech format, featuring multiple speakers of the Saudi dialect of Arabic. The dataset was created by Abdelrahman2922 and was last updated on March 30, 2026.
CompSpoof V2 contains over 250,000 audio samples totaling approximately 283 hours, developed by XuepingZhang for component-level anti-spoofing research. The data simulates real-world acoustic scenarios where speech, environmental sounds, or both components are spoofed, with each sample provided at multiple sampling rates.
Monthly conflict forecasts for Saint Kitts and Nevis produced by the Violence & Impacts Early-Warning System (VIEWS) consortium. The system generates predictive data for violent conflict and fatalities up to 36 months in advance using iterative research models. This CSV-formatted data is updated monthly and includes HXL tags for humanitarian interoperability.
Verified occurrence and life history data for butterflies and moths across North America. The project aggregates quality-controlled observations from citizen scientists, museum collections, literature, and professional lepidopterists. BAMONA is directed by Kelly Lotts and Thomas Naberhaus at Montana State University.
ASOS 30-Second Ceilometer Data is a high-resolution time-series of cloud layer observations from Automated Surface Observing System (ASOS) stations. The dataset contains 30-second samples of cloud base height, layer thickness, and sensor status from 25 reference sites across the contiguous United States. It is archived by the National Climatic Data Center (NCDC) under NOAA, with the earliest records from June 1998.
OLMoASR-Pool contains approximately 3.4 million hours of audio and 18.8 million unique transcripts collected from the public internet. It was created by AllenAI to train English speech recognition models and includes a variety of speaking styles, accents, and audio setups.
OVSpeech is a dataset built for the ICASSP 2026 paper titled 'OV-InstructTTS: Towards Open-Vocabulary Instruct Text-to-Speech'. It is constructed upon the ContextSpeech framework and is authored by y-ren16. The dataset was last updated on the Hugging Face platform in April 2026.
An exploratory experiment to enable frozen pretrained RWKV language models to accept speech modality input. The dataset, created by author 'echodict', is hosted on Hugging Face and was last updated on 2026-04-01. It follows the SLAM_ASR approach to bridge the gap between text-trained LLMs and speech recognition tasks.
Tricky TTS is a benchmark dataset designed to stress-test text-to-speech models on challenging English text. Each row targets a specific failure mode to separate capable systems from weaker ones. The dataset was created by Trelis and last updated in March 2026.
A unified collection of code-mixed automatic speech recognition datasets. The dataset was uploaded by author RidheshBhati to the Hugging Face platform and was last updated on May 1, 2026.
This multimodal dataset (7,998,514,482 bytes) comes from the 'Art of Virtuosity' performance in the Music-in-Medicine program. It likely contains EEG recordings and audio files, such as piano performances, to study brain synchrony. The dataset is available under a CC-BY-4.0 license.
This multimodal dataset (9,556,620,369 bytes) was collected during the 'Art of Virtuosity' rehearsal, part of the Music-in-Medicine program. It includes EEG and audio recordings, suggesting a focus on the neurological and acoustic aspects of musical performance. Its cross-platform presence and open license indicate it is intended for research in music cognition and therapy.
2.6 million audio snippets totaling 4,932 hours of speech, enhanced with emotion annotations and speaker embeddings. The dataset, created by ai-music4you3, contains mono WAV files at 48 kHz with durations ranging from 3.0 seconds to over 18 minutes. It was last updated on March 17, 2026.
Call Center Audio is a large audio dataset containing over 13,000 hours of real-world customer service calls. It features time-stamped transcripts and over 90% unique speakers, supporting tasks like speech recognition and speaker diarization. The dataset was created by ud-nlp and was last updated in March 2026.
This multimodal dataset (8,165,719,852 bytes), including EEG and audio recordings, covers the "Quasi una Fantasia" piece from the Music-in-Medicine program. It is published under a CC-BY-4.0 license and contains files in formats such as XLSX, MP3, WAV, CSV, MAT, and MP4. It is intended for research exploring the intersection of music, neuroscience, and therapeutic applications.
A multimodal data recording of the Rhapsody in Blue performance from the Music-in-Medicine program. The dataset includes brain activity and audio recordings, likely containing EEG signals synchronized with musical performance audio. Its 4.3 GB size suggests a detailed capture of the event.