Speech recognition, text-to-speech, speaker identification, music classification, audio event detection
1,932 datasets
Uzbek language audio clips and text transcriptions sourced from YouTube news channels Kunuz and Qalampir across multiple regional dialects. The dataset utilizes Gemini 2.5 Pro for transcription generation to support Automatic Speech Recognition (ASR) development.
369,510 hours of speech audio and text captions sourced from YouTube, released by the espnet team in 2024. The dataset pairs audio utterances with either user-uploaded (manual) or system-generated (automatic) captions.
Nearly 10 hours of studio-quality English speech recordings from a single speaker recreate expressive utterances from the Switchboard-1 Telephone Speech Corpus. These recordings feature labeled paralanguage and disfluencies across three different data components to simulate realistic informal conversations.
A Persian (Farsi) text-to-speech corpus built by concatenating and denoising multiple existing Farsi datasets. The dataset was created by author Thomcles and last updated on November 9, 2025. It is intended for training and evaluation of speech synthesis models in Persian.
FSD50K is an open dataset of 50,000 human-labeled sound events. It was created by researchers including Eduardo Fonseca and Xavier Serra for audio classification tasks.
Audio, music, and speech AI resources are aggregated in this community-curated repository by yyf, last updated in January 2026. It functions as a directory for signal processing practitioners looking for specialized machine learning tools and datasets.
An audio dataset collected by Willy030125 and last updated on 2025-08-19. It aggregates ambient noise recordings from sources including Kaggle datasets on hospital and general ambient noise. The primary purpose is to reduce hallucinations in Automatic Speech Recognition models like Whisper.
LibriTTS-R is a multi-speaker English speech corpus containing approximately 585 hours of read speech at a 24 kHz sampling rate. It is a version of the original LibriTTS corpus, published in 2019, with improved sound quality. It was adapted for the Hugging Face datasets library by the user 'mythicinfinity'.
Magpie-Speech-Orpheus-125k is a synthetic speech dataset containing approximately 125,000 samples. It was created by applying the Magpie instruction-synthesis approach to the Orpheus-TTS text-to-speech model and decoding audio tokens with the SNAC 24 kHz codec. The dataset was authored by Aratako and last updated on August 26.
A large-scale collection of Persian (Farsi) speech-to-text data designed for modern machine learning workflows. The dataset consolidates audio-text pairs from multiple open sources and applies a rigorous cleaning and normalization pipeline. It was created by kiarashQ and last updated on November 3, 2025.
20,000 images of randomly generated sheet music designed to be as hard to read as possible. This dataset trains models to answer music theory questions and name every note in a melody. It was created by Sweaterdog and last updated on December 18, 2025.
This dataset encompasses 201 hours of North American English speech data from 302 speakers. Recordings were made in quiet indoor environments using PC and Android mobile phone devices, capturing phrases and sentences across various scenes.
The Darija Speech To Text Dataset is a collection of 13,178 rows of transcribed speech audio totaling 8.23 GB, created by ayoubkirouane. It was last updated on 2024-07-18 and focuses on the Darija dialect, primarily from Algeria and Morocco, with slang from other Arabic-speaking countries.
Chinese-LiPS is a multimodal dataset for audio-visual speech recognition in Mandarin Chinese. It combines speech, video, and textual transcriptions to enhance automatic speech recognition performance, particularly in educational contexts. The dataset was created by BAAI and was last updated on 2025-11-18.
41 Landsat 5 and 7 scenes were analyzed to classify land cover for U.S. Coast zone 65, enabling change detection between 1995-era and 2000-era classifications. The Massachusetts Office of Coastal Zone Management reprojected the data into the Massachusetts State Plane coordinate system in 2006. This dataset was produced by the Multi-Resolution Land Characteristics program through a multi-agency effort.
Six months of field sampling and orthographic image analysis in 2018 documented ice-rafted sediment deposits in the Great Marsh, Massachusetts. The dataset, curated by NOAA's National Centers for Environmental Information, includes deposit locations, thickness measurements, and calculated total sediment coverage. Data is stored in an ArcGIS geodatabase with shapefiles and supporting Excel spreadsheets.
Massachusetts coastal land cover classifications for 1995 and 2000, produced by the Multi-Resolution Land Characteristics program. The data set was created by analyzing 41 full or partial Landsat 5 and 7 scenes following the Coastal Change Analysis Program protocol. It was last updated by the Massachusetts Office of Coastal Zone Management in October 2006.
This dataset aggregates a sample of 134 hours of Malay speech audio recorded by 156 native speakers in a quiet environment. It includes approximately 450 sentences per speaker, covering categories like economy, entertainment, news, and numbers.
Dated March 2025, the Svarah dataset provides 9.6 hours of transcribed English audio from 117 speakers across India. It addresses the underrepresentation of Indian English speakers in existing benchmarks like LibriSpeech and Switchboard. The dataset was created by ai4bharat.
This dataset aggregates 1,044 hours of Brazilian Portuguese speech recorded by 2,038 native speakers using mobile phones. The recording text was designed by linguistic experts and covers general interactive, in-car, and home categories, with texts manually proofread for high accuracy.