Speech recognition, text-to-speech, speaker identification, music classification, audio event detection
1,932 datasets
A 10.4-hour sample of a Japanese speech synthesis corpus recorded by a native female speaker with an authentic accent. It features phonemically balanced coverage and annotations by professional phoneticians, and is aimed at speech synthesis R&D.
SPRINGLab's IndicTTS Malayalam dataset contains high-quality speech recordings with transcriptions for text-to-speech research. The dataset includes approximately 17.89 hours of audio from male and female speakers, sourced from the Indic TTS Database project. It was last updated on January 25, 2025.
Between 10,000 and 100,000 audio clips with transcriptions in the Uzbek language, targeting the information technology domain. Collected by islomov from YouTube channels like Mohir Dev and updated in June 2025, it includes English technical terms to improve model generalization. The data is designed for training and evaluating Automatic Speech Recognition (ASR) systems in a technical context.
Between 10,000 and 100,000 audio clips and transcriptions of Tashkent dialect Uzbek speech comprise this dataset. Collected by islomov from YouTube podcasts like Jahongir Latipov and Bu podcast, it was last updated in June 2025 for Automatic Speech Recognition (ASR) tasks.
AF-Chat is a fine-tuning dataset of approximately 75,000 multi-turn conversations involving audio clips, created by NVIDIA. The conversations are multi-audio, with an average of 4.6 clips and 6.2 turns per conversation, spanning speech, environmental sounds, and music. The dataset was last updated on July 21, 2025.
240 hours of Hindi speech recorded by 401 Indian speakers in both quiet and noisy environments. The recording content covers economic, entertainment, news, and spoken language topics, with all texts manually transcribed for high accuracy.
FeruzaSpeech contains 60 hours of high-quality Uzbek read speech recorded by a single native female speaker from Tashkent. Released by k2speech in 2024, the collection provides transcriptions in both Latin and Cyrillic alphabets for academic research. It serves as a resource for low-resource language modeling in the Central Asian region.
A 500-hour sample of Filipino speech audio recorded by native speakers using mobile phones. The accompanying text has been manually proofread for high accuracy, and recordings are compatible with mainstream mobile operating systems.
986 hours of European Portuguese speech data, recorded by 2,000 native speakers using mobile phones. The scripted monologue content was designed by professional language experts and covers categories like general purpose, interactive, vehicle-mounted, and household commands.
292,637 audio clips and transcriptions sourced from Japanese visual novels for automatic speech recognition training. Created by joujiboi and updated in November 2025, it provides a specialized corpus of character-driven dialogue. The collection is distinct from version 1 and contains mostly unique audio content.
A sample of 516 hours of Korean audio data, recorded by 1,077 speakers with a near-equal gender distribution. The recordings include daily language, interactive sentences, and command phrases, with each speaker contributing approximately half an hour of audio.
1,012 hours of Indian English audio data recorded by 2,100 native speakers using mobile phones. The recorded text was designed by linguistic experts and manually proofread for high accuracy, covering generic, interactive, on-board, and home categories.
Over 12 million links to music on YouTube, compiled by LAION. The dataset was created by recursively exploring the 'Fans might also like' artist graph, extracting metadata such as artist names and subscriber counts. It was last updated in November 2024.
199 hours of valid speech data from 346 British English speakers, all of whom are local to England. Each speaker recorded approximately 392 sentences in a quiet environment, across categories including economics, news, entertainment, and commonly used spoken language.
A 338-hour sample of Spanish speech recorded by 800 native speakers from Spain, Mexico, and Argentina. All audio was recorded in quiet environments and manually transcribed with a reported sentence accuracy rate of 95%.
10,905 Japanese voice-text pairs from the Arknights mobile game, curated for playable characters. The dataset contains 26.3 hours of audio with an average duration of 8.7 seconds per sample, created by user deepghs and last updated in August 2024. It is designed for training and evaluating automatic speech recognition and voice synthesis models.
210,000 commonly used Japanese sentences recorded by 799 local speakers. Audio was captured in quiet indoor places, streets, and restaurants using mainstream Android phones and iPhones, with a text transcription error rate below 5%.
A three-dimensional numerical model simulates circulation in Massachusetts and Cape Cod Bays, driven by tides, wind, river runoff, and thermal forcing. The U.S. Geological Survey developed this model to study the transport of nutrients, contaminants, and red tide populations. The dataset was last updated in 1992.
227 hours of Spanish speech data recorded via mobile phones by native speakers from Spain, Mexico, and Venezuela. The recordings, made in quiet environments, cover fields like economy, entertainment, and news, with all texts manually transcribed to 95% sentence accuracy.
Tornado records provide a chronological listing of tornado events by U.S. state from 1950 onward. The National Weather Service compiled these reports, which include statistics on injuries, fatalities, and damage estimates. Data is coded according to the Pearson Tornado Tape 82-column text format.