Speech recognition, text-to-speech, speaker identification, music classification, audio event detection
1,926 datasets
94,512 high-quality synthetic English speech audio files generated by Gemini Flash 2.0 via the Multimodal Live API. The dataset contains speech from two speakers, Puck (Male) and Kore (Female), totaling 284.31 hours with an average duration of 10.83 seconds per file. It was created by user shb777 and was trending in February 2025.
300 speech samples, each provided in its noisy form and in 22 different enhanced versions, for a total of 6,900 audio clips with human-labeled mean opinion scores (MOS). This collection was developed for the 2024 URGENT Speech Enhancement Challenge at NeurIPS to evaluate speech quality assessment and enhancement algorithms.
Over 1,100 hours of Vietnamese speech data were collected from various social resources by author NhutP and last updated on April 25, 2025. The dataset includes a diverse representation of accents from northern, central, and southern Vietnam, as well as different dialects and speaking styles. This diversity is intended to enhance the training and evaluation of automatic speech recognition models.
ASR data likely contains audio samples and transcriptions for speech recognition tasks. The dataset is hosted on Kaggle, but details about its size, source, and creation date are unknown. Its content and structure must be verified after download.
A Kaggle-hosted dataset titled 'animetts', which likely contains audio data for text-to-speech applications. The dataset's specific content, size, and origin are not detailed in the provided metadata. Further verification after download is required to confirm the exact nature and scope of the audio samples.
CS-FLEURS contains 300 hours of speech data across 113 unique code-switched language pairs. It includes both read and synthetic speech for developing and evaluating speech recognition and translation systems.
Thorsten-Voice created a small, high-quality dataset of 60 newly recorded German speech samples, last updated in December 2025. The dataset is designed for speaker refinement and voice matching in Orpheus text-to-speech models. The samples are spoken in a neutral, relaxed, everyday style, closely reflecting the natural speaking voice of the original speaker.
SADA (Saudi Audio Dataset for Arabic) is a large-scale Arabic speech corpus designed to support AI model development for Arabic speech processing. It contains over 667 hours of transcribed Arabic audio recordings, primarily featuring various Saudi dialects, and was curated in a collaboration involving the National Center for Artificial Intelligence. The dataset was last updated on the platform in May 2025.
ESpeech's Espeech Podcasts dataset contains 3,200 hours of processed audio segments extracted from various podcasts. The audio is in Russian, processed at a 44.1 kHz sample rate, and is structured as segmented audio files with JSON metadata. The dataset was last updated on November 25, 2025.
RobotsMali's Bam Asr Early dataset is a collection of Bambara language audio for automatic speech recognition (ASR). It primarily combines the Jeli-ASR dataset and the Mali-Pense data curated by Aboubacar Ouattara, with an additional hour of audio featuring children's voices reading books. The dataset was last updated on March 18, 2025.
Kyutai TTS Voices is a dataset of audio samples for text-to-speech synthesis, published on the Hugging Face platform by user jspaulsen. The dataset was last updated on January 20, 2026. Its specific content, scale, and collection methodology are not detailed in the available metadata.
September 2017 data from a NOAA Okeanos Explorer expedition focused on the Musicians Seamounts in the Pacific Ocean. It includes oceanographic, meteorological, and navigation data collected via 24-hour operations using ROVs, mapping systems, and telepresence. The dataset was compiled by NOAA's National Centers for Environmental Information.
1.26 million synthetic audio samples support research in Dhivehi, a low-resource language. The dataset was created by user alakxender and updated on October 15, 2025. It pairs Dhivehi sentences with waveforms generated through controlled synthesis, voice-cloning, and acoustic perturbations.
SongEval contains 2,399 complete songs totaling approximately 140 hours of audio, released by ASLP-lab in 2025. It serves as a benchmark for the aesthetic evaluation of music generation systems, featuring annotations from 16 expert raters across five perceptual dimensions.
289 life story interviews totaling 365 hours of audio, collected by nilc-nlp. The dataset features a broad range of speakers varying in age, education, and regional accents. It was last updated on the Hugging Face platform on July 17, 2025.
24,800 structured text prompts that systematically vary genre, tempo, instrument, and mood to study controllability in text-to-music models like MusicGen. The dataset, created by bodhisattamaiti, contains only prompts in CSV format, with no accompanying audio files. It was last updated on December 11, 2025.
Benthic fauna data from Northeast US coastal waters, collected from 1881 to the present by National Marine Fisheries Service laboratories. The dataset includes 21,000 sample sites with parameters like depth, sediment type, species name, and abundance. Major studies incorporated are Ocean Pulse, the Northeast Monitoring Program, and surveys of the New York Bight and Long Island Sound.
November 1998 data collected by the USGS survey 98015 aboard the Canadian Coast Guard vessel Frederick G. Creed. This set is a sun-illuminated topographic image of the sea floor offshore eastern Cape Cod, Massachusetts, created from multibeam sonar data. The image has a 4-meter pixel size and was reprojected into the Massachusetts State Plane coordinate system in September 2006.
16 hours of Egyptian Arabic dialect speech audio manually transcribed for automatic speech recognition tasks. The dataset was collected from multi-genre YouTube channels, cleaned, and adjusted for the Hugging Face Hub by MightyStudent. It is intended for fine-tuning or training models like Whisper.
A growing collection of captioned anime voice recordings, organized by language and speaker splits. It features audio data curated from anime media for speech synthesis and recognition applications, and is maintained as a dynamic repository.