Speech recognition, text-to-speech, speaker identification, music classification, audio event detection
1,962 datasets
A sample of speech recordings from 290 children in the U.S.A., with a balanced male-female ratio. The audio content is sourced from children's books and textbooks, recorded in quiet indoor environments using mobile phones.
MIDI files represent classical compositions from renowned artists like Bach, Beethoven, Chopin, and Mozart. The collection is organized into directories by composer. It was created by user 'drengskapur' and last updated in July 2024.
140 hours of Norwegian speech recordings from 40 days of parliamentary meetings, transcribed into 65,000 sentences in both Bokmål and Nynorsk. The dataset includes 1.2 million words and links audio segments to speaker metadata such as gender, age, and dialect.
A language-labeled version of the VoxCeleb2 speaker identification dataset. It was created by applying a language identification model to the original audio clips. The dataset was authored by johbac and last updated on Hugging Face in April 2025.
The Bureau de Recherches Géologiques et Minières (BRGM) provides a dataset mapping the noise classification of land transport infrastructure in the Maine-et-Loire department, France. The classification is mandated by French law (Law No. 92-1444 and the Environmental Code) and identifies sectors affected by noise based on traffic characteristics. The dataset was last updated on 2021-09-03.
A dataset classifying railway land transport infrastructure in the Var department, France, by noise level, based on a prefectural decree of September 29, 2016. It likely contains polygons or zones representing areas affected by noise, categorized from 1 (noisiest) to 5, with defined nuisance sector widths. The dataset was produced by the Bureau de Recherches Géologiques et Minières (BRGM) and was last updated on January 10, 2020.
A collection of short audio snippets extracted from publicly shared songs generated by the Suno AI model. All excerpts, ranging from 3 to 30 seconds, have been captioned using the Gemini Flash 2.0 model to produce human-readable audio descriptions. The dataset was created by laion and last updated on Hugging Face in November 2025.
A sample of Chinese-accented English speech recordings collected via mobile phone from 1,279 speakers representing major Chinese dialect regions. The recordings cover categories including spoken English, speech, and human-computer interaction.
A Mandarin Chinese speech corpus featuring recordings from 400 speakers representing various accent regions across China. The audio was captured in quiet indoor settings using high-fidelity microphones and is provided at a 16 kHz sampling rate with manual transcriptions.
630 speakers from 8 major American English dialect regions each reading 10 phonetically rich sentences. The dataset includes high-quality audio recordings accompanied by time-aligned phonetic and word transcriptions for acoustic-phonetic research.
Part of the Speech processing Universal PERformance Benchmark (SUPERB), a leaderboard for evaluating shared self-supervised learning models across multiple speech tasks. It contains audio data stored in the .flac format.
A sample of a larger paid Vietnamese speech collection, containing recordings from 1,751 native speakers using mobile phones. The script was designed by linguists and covers generic, interactive, on-board, and home scenarios, with text manually proofread for accuracy.
A sample of a paid dataset containing female American English audio recordings for speech synthesis. The data is recorded by a native speaker with an authentic accent and balanced phoneme coverage, and is annotated with the involvement of professional phoneticians.
85 hours of emotion-neutral Mandarin speech recordings from 218 native speakers, comprising 88,035 utterances. The corpus is designed for training multi-speaker Text-to-Speech systems and includes auxiliary speaker attributes such as gender, age group, and native accent labels.
Historical upper air meteorological data collected globally over a 5-year period from 1958 to 1963. This small dataset was created by the Massachusetts Institute of Technology (MIT) and archived at the National Climatic Data Center. It was later incorporated into the larger, quality-controlled Comprehensive Aerological Data Set (CARDS).
A sample of Chinese speech recordings from 200 native speakers covering the main dialect zones. It includes recordings made in both noisy and quiet environments, with texts transcribed by professional annotators.
A collection of 532 English speech recordings from Portuguese speakers, captured in a quiet environment using mobile phones. The scripts were designed by linguists and cover generic, interactive, on-board, and home topics, with manual proofreading for high text accuracy.
ScaleAI compiled 2,912 successful jailbreak prompts across 537 multi-turn conversations for the paper 'LLM Defenses Are Not Robust to Multi-Turn Human Jailbreaks'. The dataset includes metadata such as design choice comments from red teamers and the resulting attack success rate (ASR). It was last updated on the platform in September 2024.
A sample of Korean speech recordings from 211 local speakers, comprising 99 females and 112 males. Audio was captured in a quiet indoor environment using mainstream Android phones and iPhones.
A sample of a paid dataset containing male audio data for American English speech synthesis. The audio is recorded by native speakers with authentic accents and features phoneme-balanced coverage with annotation by professional phoneticians.