Speech recognition, text-to-speech, speaker identification, music classification, audio event detection
1,943 datasets
The Acoustic Surface Reverberation Experiment took place in 1991-1992 in the eastern North Pacific, about 500 km from Vancouver Island. The Upper Ocean Processes Group deployed moorings to measure oceanographic variables, with this dataset likely containing temperature readings. Data were collected over 9.5 weeks during the winter of 1991-1992 at a sampling interval of 7.5 minutes.
Ming030890's dataset contains Cantonese audio-caption pairs sourced from YouTube videos with manually provided captions. It was built by re-transcribing audio with SenseVoice and filtering segments to create a collection supporting ASR development. The dataset includes segments where ASR output matches original captions and segments with homophone or English word differences.
Over 3,900 audio recordings of medical speech comprise this evaluation set for automatic speech recognition (ASR) systems, published by ekacare in 2025. The data focuses on the transcription of clinical terminology and branded drug names specific to the Indian healthcare sector.
PSRB (Persian Speech Recognition Benchmark) is a dataset designed to evaluate Persian Automatic Speech Recognition systems. This 1-hour sample provides a representative subset capturing various accents, speech styles, speaker demographics, and acoustic environments. The dataset was created by PartAI and was last updated on 2025-08-19.
SongFormBench is a benchmark dataset for Music Structure Analysis created by a consortium including Northwestern Polytechnical University, Hong Kong University of Science and Technology, Northwestern University, and Cornell University. It was last updated on the Hugging Face platform on October 11, 2025. The dataset is described as high-quality and is intended for evaluating tasks related to understanding the structural components of music.
A Vietnamese speech synthesis dataset containing 500 hours of audio data, focusing on dialects. The dataset is hosted on Hugging Face by author pnnbao-ump and was last updated on November 11, 2025. Access is restricted to institutions or organizations with clear research use cases.
A Persian text-to-speech dataset containing paired audio recordings and transcriptions. It is designed for training and evaluating TTS systems, created by alavanaico and last updated on 2025-09-20.
Monster MIDI Dataset is a large collection of raw MIDI files intended for music information retrieval and AI music generation. The dataset includes a stand-alone Python module for GPU/CPU-powered search and filtering. It is hosted by projectlosangeles and was last updated in November 2025.
360,493 posts form a superset of English language content annotated for hate speech. Manueltonneau compiled this collection in April 2024 by merging all publicly available and documented hate speech datasets identified in a systematic survey. The dataset serves as a consolidated resource for text classification research.
This placeholder dataset contains a small collection of audio files in .flac format specifically formatted for the Speech processing Universal PERformance Benchmark (SUPERB). It provides a file column to facilitate the development of speech processing pipelines and the extraction of self-supervised learning representations.
A sample of Thai speech audio data recorded by 498 native speakers reading text in a quiet environment. The recordings cover multiple categories including economics, entertainment, news, figure, and oral content, with approximately 400 sentences per speaker.
105 hours of manually checked Uzbek speech recordings featuring 958 unique speakers. The dataset includes transcribed audio files designed for speech recognition tasks in the Uzbek language.
Approximately 20,000 high-quality Somali speech clips with transcriptions, designed for training Text-to-Speech models. The dataset was uploaded by author 'zakihassan' and last updated on July 12, 2025. Audio files are in .wav format with a 22050 Hz sampling rate.
A collection of speech recordings from 349 American English speakers, recorded in a quiet environment. The audio covers categories like economics, entertainment, news, and spoken language, and is manually transcribed with start and end time annotations.
This dataset comprises flags for Urdu Automatic Speech Recognition (ASR). It was created by author 'kingabzpro' and was last updated on October 21, 2025. The number of rows, the number of columns, and the data size are unknown.
A sample of Italian speech data recorded from 325 native speakers reading text in a quiet environment. The recordings cover multiple categories including economics, entertainment, news, and oral content, with an average of 9.2 words per sentence.
MusicScore is a large-scale dataset of music score images paired with textual metadata. It was collected and processed from the International Music Score Library Project (IMSLP) by authors Yuheng Lin, Zheqi Dai, and Qiuqiang Kong. The dataset was last updated on June 20, 2024.
CS-Dialogue is a 104-hour dataset of spontaneous Mandarin-English code-switching dialogues for speech recognition. It was created by BAAI to address limitations in existing datasets, such as small size and lack of natural conversations.
BosonAI created a dataset of 1,645 diverse test cases for evaluating Text-to-Speech models. The dataset focuses on six challenging scenarios: emotions, paralinguistics, foreign words, syntactic complexity, complex pronunciation, and questions. It was released in June 2025 to accompany a research paper.
Manueltonneau's French Hate Speech Superset contains 18,071 posts annotated as hateful or not. It merges all publicly available French hate speech datasets identified in a systematic 2024 survey. The dataset was last updated in October 2024.