Speech recognition, text-to-speech, speaker identification, music classification, audio event detection
1,926 datasets
Annotated Russian audio data for tasks like text-to-speech, speech recognition, and speaker identification. The dataset includes features such as text transcriptions, speaker identifiers, audio files, utterance pitch statistics, and signal-to-noise ratio (SNR). It was created by the author kijjjj and last updated in June 2025.
ContextASR-Bench is a benchmark dataset designed to evaluate the linguistic capabilities of Automatic Speech Recognition (ASR) models. It was created by the author MrSupW and last updated on Hugging Face in August 2025. According to its description, it addresses a gap in prior benchmarks by focusing on world knowledge and contextual understanding.
Physical trajectory profile data was collected by a University of Massachusetts - Dartmouth glider during a 90-day mission from August to September 2019. The dataset contains measurements of oceanographic properties like temperature, salinity, and chlorophyll a to investigate Mid-Atlantic Cold Pool dynamics. The data was assembled by the IOOS National Glider Data Assembly Center and archived by NOAA's National Centers for Environmental Information.
CTD data from a Sea-Bird SBE 37 MicroCAT sensor deployed at a depth of 5 kilometers near Station ALOHA, north of Oahu, Hawaii. The dataset contains rapidly sampled measurements of conductivity, temperature, and pressure from 2011 to 2013 and is archived by NOAA's National Centers for Environmental Information. The sampling interval changed from 1 minute to 1 second and then to 2 seconds during the deployment.
Data from a glider mission measuring physical oceanographic properties in the Mid-Atlantic Bight from August 6 to October 21, 2018. The dataset contains measurements from the glider BLUE, deployed by the University of Massachusetts - Dartmouth, and focuses on seasonally varying features of the Mid-Atlantic Cold Pool. It was archived by NOAA's National Centers for Environmental Information.
Trajectory data from the glider 'Blue', deployed by the University of Massachusetts - Dartmouth, covering August 31 to September 22, 2017. The dataset contains physical oceanographic measurements such as temperature, salinity, conductivity, density, chlorophyll, backscatter, CDOM, and oxygen. It was collected as part of the 'Investigation of Mid-Atlantic Cold Pool Dynamics' program and archived by NOAA's National Centers for Environmental Information.
A dataset titled 'Tts Mazlum Kiper Tur' was published on the Hugging Face platform by author 'omersaidd'. The dataset was last updated on 2026-01-08 10:36:40. Platform tags suggest it contains Turkish language text and audio data for text-to-speech applications.
Tamazight-NLP hosts the Tamazight-Arabic Speech Recognition Dataset containing 20,344 audio segments. The dataset provides approximately 15.5 hours of Tamazight speech in the Tachelhit dialect paired with Arabic transcriptions. It was last updated on March 29, 2025.
30,800 audio-text records totaling 66.4 hours of Japanese voice lines from the game Fate/Grand Order. The dataset, created by deepghs, was last updated on August 28, 2024, and includes only characters voiced by a single voice actor to reduce noise. Each record has an average duration of approximately 7.76 seconds.
Mozilla Common Voice Corpus 22.0 is a multilingual speech dataset featuring audio recordings and text transcriptions in a wide range of languages. This version is an unofficial conversion of the Mozilla project data, provided by fsicoli and last updated in August 2025. It covers dozens of languages, among them Arabic, Bengali, and Chinese.
BRSpeech-DF is the first publicly available dataset for deepfake speech detection in Portuguese, covering both Brazilian and European variants. It contains 459,000 audio samples, including both real and synthetic speech generated using multiple zero-shot text-to-speech models. The dataset was created by AKCIT-Deepfake and was last updated on 2025-11-25.
A dataset for speaker recognition tasks, published on the Hugging Face platform by Acouspike. The dataset was last updated on January 12, 2026. The specific content, size, and structure require verification after download.
Vegetation field plots at Scotts Bluff National Monument were visited, described, and documented in a digital database. The database consists of three parts: Physical Descriptive Data, Species Listings, and Strata Descriptive Data. Information for this metadata was obtained from a USGS site and converted to NASA's Directory Interchange Format (DIF).
A collection of game artwork assets from OpenGameArt.org, all released under the Creative Commons Attribution 4.0 International license. The dataset includes 2D art, 3D art, concept art, music, sound effects, textures, and documents with associated metadata. It was uploaded to Hugging Face by the author 'nyuuzyou' on 2025-04-20.
ViSpeR is a large-scale dataset for Visual Speech Recognition (VSR) covering four widely spoken languages: Arabic, Chinese, French, and Spanish. It was created to address the scarcity of publicly available VSR data for non-English languages and is described as larger than other datasets in its domain. The dataset and models are hosted by the author 'tiiuae' and were last updated on April 17.
30,160 audio-text records of Japanese voice lines from the game Azur Lane, totaling 75.8 hours. The dataset was created by deepghs and last updated on August 28, 2024. It includes only characters voiced by a single voice actor to reduce noise.
8,169 Egyptian-Arabic text samples are manually annotated for offensive language and hate speech. The dataset was created by IbrahimAmin, Mostafa Abbas, Rany Hatem, Andrew Ihab, and Mohamed Waleed Fahkr. It was last updated on August 17, 2025.
Baamtu Datamation created this Wolof text-to-speech dataset under the AI4D African language program. It contains 22 hours 28 minutes 41 seconds of recordings from a male speaker and 18 hours 47 minutes 19 seconds from a female speaker, each speaker contributing over 20,000 sentences sourced from news and Wikipedia.
12,508 audio-text records totaling 20.9 hours of Japanese voice lines from the game Girls Frontline. The dataset, created by deepghs and last updated in August 2024, is curated to include only characters with a single voice actor to reduce noise. Average audio clip duration is approximately 6.01 seconds.
SingMOS-v1 is a preview version of the SingMOS-Pro dataset containing 3,421 Chinese and Japanese vocal clips. The clips have a sample rate of 16 kHz and total 4.25 hours in duration. The dataset was created by TangRain and last updated on Hugging Face in October 2025.