Speech recognition, text-to-speech, speaker identification, music classification, audio event detection
1,926 datasets
Mandarin speech data from 100 native speakers, each reading 100 sentences ten times. The dataset provides over 55 hours per modality of synchronized audio, lip video, and surface electromyography recordings. It was created by MML-Group and last updated on the platform in October 2025.
December 1998 to February 2000 data from Crooked Lake and Lake Druzhby in Antarctica's Vestfold Hills. The dataset contains measurements of ammonia, nitrite, nitrate, temperature, oxygen, and abundances of heterotrophic bacteria, cyanobacteria, ciliates, and rotifers. Data was collected by a team led by Prof J Laybourn-Parry and organized into physical, chemical, biological, and production folders.
Annotated audio recordings for evaluating how speaking pace and background noise affect transcription accuracy for Whisper and other ASR systems. The dataset, created by danielrosehill, was last updated on December 9, 2025. Its size and row count are not provided in the available metadata.
Physical trajectory profile data from a glider deployment by the University of Massachusetts - Dartmouth in the Mid-Atlantic Bight. The dataset contains measurements of oceanographic properties like temperature and salinity, collected from August 18 to August 22, 2016. Data was submitted to the National Centers for Environmental Information (NCEI) via the IOOS National Glider Data Assembly Center.
NonverbalTTS is a 17-hour open-access English speech corpus with aligned text annotations. It includes annotations for 10 types of nonverbal vocalizations and 8 emotion categories. The dataset was created by deepvk and last updated on Hugging Face in October 2025.
Version 3.0 Level 1 science data provides calibrated Delay Doppler Maps (DDMs) from the eight-satellite CYGNSS constellation. The dataset includes geo-located measurements of Power Received, Bistatic Radar Cross Section (BRCS), Normalized BRCS, and Leading Edge Slope, produced by POCLOUD. Data from up to 8 spacecraft is typically available daily with a latency of approximately 6 days from the last measurement.
Five historic shoreline positions for Massachusetts from 1844 to 1994 document coastal erosion and accretion. The dataset was produced by the Massachusetts Coastal Zone Management office in collaboration with the USGS and Woods Hole Oceanographic Institution. It updates a previous analysis from the mid-1800s to 1978 with new 1994 shoreline data.
UrduSpeech is a high-quality multi-style speech corpus containing approximately 51,600 audio-text pairs for Urdu and Kashmiri languages. It was created by humairawan and last updated on December 11, 2025. The dataset includes professionally recorded audio with diverse speaking styles, emotional expressions, and gender representation.
Moe Speech provides 600 hours of Japanese anime-style voice recordings across approximately 100,000 to 1,000,000 audio clips. Created by litagin and updated in May 2025, the collection is sourced from 50 visual novels for character-specific audio tasks.
2,000 hours of transcribed Arabic speech collected from Aljazeera News Channel broadcasts. QASR is a large-scale corpus created by QCRI, featuring multi-layer annotation and covering multiple Arabic dialects and code-switching speech. The dataset was last updated on the platform in October 2025.
A combined Turkish text-to-speech dataset aggregating seven open-source datasets. It contains approximately 81,500 audio samples recorded at 24kHz and is described as SNAC-compliant. The dataset was created by user 'afkfatih' and was last updated on November 30, 2025.
Approximately 24 hours of high-quality speech audio in Latin American Spanish, prepared for Text-to-Speech applications requiring a 24kHz sampling rate. The audio files were derived from crowdsourced datasets made by Google and obtained via OpenSLR. The dataset was uploaded by GianDiego and last updated on April 12, 2025.
Just over 5 hours of Danish speech audio (.wav files) with corresponding reference text, created by Alvenir to evaluate automatic speech recognition models. It includes recordings from 50 speakers aged 20-60 years.
24,800 AI-generated 20-second music clips created using the facebook/musicgen-small model. The dataset is the audio companion to Prompt2MusicBench, with each clip generated from a structured text prompt encoding genre, instrument, tempo, and mood. It was created by bodhisattamaiti and last updated on Hugging Face in December 2025.
1994-1996 land cover classifications for the Massachusetts coastal zone, derived from 10 full or partial Landsat Thematic Mapper scenes. The data was produced by the Multi-Resolution Land Characteristics program for the Coastal Change Analysis Project to establish environmental baselines. It was later reprojected into the Massachusetts State Plane coordinate system by the Massachusetts Office of Coastal Zone Management in October 2006.
2013 acoustic event logs from a Southern Ocean voyage contain sonobuoy-detected whale calls and other sounds, classified by onboard acousticians. The dataset includes processed bearings, frequencies, and estimated receive levels for each event, linked to sonobuoy deployment locations. It was collected by the Australian Antarctic Data Centre during the 2013 Antarctic Blue Whale Voyage.
1,928 audio files generated from the test set of the Nemotron Content Safety Dataset V2. This multimodal extension provides spoken versions of adversarial and safety-critical prompts across 23 violation categories, enabling research in AI safety. The dataset was created by NVIDIA and was last updated on the platform in December 2025.
EMOVA-Alignment-7M is a dataset curated for omni-modal pre-training, including vision-language and speech-language alignment. It was created by Emova-ollm using open-sourced image-text pre-training datasets, OCR datasets, and 2,000 hours of ASR and TTS data. The dataset page was last updated on March 14, 2025.
WavCaps is a dataset for audio-language multimodal research, with audio clips sourced from FreeSound, BBC Sound Effects, SoundBible, and the AudioSet Strongly-labelled Subset. The dataset was created by cvssp and last updated on Hugging Face in July 2023. It uses ChatGPT to assist in generating weakly-labelled captions for the audio content.
158 hours of audio recordings with corresponding text transcriptions, curated by benjaminogbonna. The dataset includes metadata like accent and locale and was last updated on March 30, 2025. It was created to address a gap in speech and language datasets for African accents.