Speech recognition, text-to-speech, speaker identification, music classification, audio event detection
1,926 datasets
A repository aggregating multiple public Brazilian Portuguese (PT-BR) speech corpora into a single dataset for Automatic Speech Recognition (ASR) training and research. The dataset was created by opedromartins and was last updated on 2025-09-14. Its goal is to provide a broad, standardized, and easily accessible resource for the community.
10 hours of piano music performed by 15 elite-level pianists, comprising 153 pieces. The dataset includes synchronized audio and key-press events, capturing physical performance data. It was created by rcwang and is hosted on Hugging Face, with a last update in November 2024.
ShiftySpeech is a large-scale synthetic speech dataset containing over 3000 hours of audio. It was created by user ash56 and spans seven distinct distribution shifts including reading style, podcast, YouTube, three languages, and demographic variations. The dataset was last updated on Hugging Face in October 2025.
Rural Women ASR V2 provides between 10,000 and 100,000 audio utterances of Hindi and Bhojpuri speech recorded from rural women in India. Developed by ai4bharat for the Recognizing Every Voice initiative, the collection was last updated in November 2025. It captures speech from diverse age groups, regions, and socio-economic backgrounds to improve inclusive speech technology.
A collection of over 5.8 million unique and normalized MIDI files, last updated on 2025-06-06. The dataset was created by 'projectlosangeles' for Music Information Retrieval (MIR) and symbolic music AI. Each file was converted to a proper MIDI format specification and checked for integrity.
Audio recordings of Isan (Northeastern Thai) speech are paired with transcriptions and demographic metadata. The dataset, created by typhoon-ai and last updated in November 2025, features spontaneous responses to questions across General and Finance domains. It is designed to support Automatic Speech Recognition, dialect study, and text normalization tasks.
Martínez-Castilla et al.'s dataset contains raw survey data from 507 Spanish adults collected between August and December 2020. The data was used to analyze the impact of personal and contextual variables on the perceived efficacy of music for emotional wellbeing during the COVID-19 lockdown. Personal variables include age, gender, musical training, personality, resilience, and perception of music's importance.
MMAU-Pro is a benchmark dataset for evaluating audio intelligence in multimodal models, covering speech, environmental sounds, and music. It contains 5,305 expert-annotated question–answer pairs with audio clips sourced from the wild. The dataset was created by gamma-lab-umd and updated in August 2025.
VoxBox is a curated collection of bilingual speech corpora annotated with clean transcriptions and metadata. The dataset was created by SparkAudio and was last updated on April 15, 2025. It includes audio files and JSONL metadata files organized by sub-corpus, such as aishell-3, casia, commonvoice_cn, and wenetspeech4tts.
78 hours of audio extracted from the Khan Academy Turkish YouTube channel. The dataset is segmented into short clips averaging 10.5 seconds each and was created by author ysdede, with a last recorded update on 2025-02-11.
Haitian Creole speech recordings and documentation from the Carnegie Mellon University Language Technologies Institute. This 2010 collection is released under a permissive license for unrestricted use, modification, and distribution.
700 hours of Central Thai speech and 40 hours each for three other Thai dialects form this corpus. The dataset, created by CMKL, includes parallel sentences across dialects to support speech and translation research. It was last updated in September 2024.
VietMed is a Vietnamese speech recognition dataset for the medical domain. It comprises 16 hours of labeled medical speech, 1000 hours of unlabeled medical speech, and 1200 hours of unlabeled general-domain speech. The dataset was introduced by leduckhai for the LREC-COLING 2024 conference.
Nexora Music Pd V1 Medium is a dataset hosted on Hugging Face by ArkAiLab-Adl. The dataset's title suggests it contains music-related data, likely for machine learning applications. It was last updated on January 8, 2026.
231 hours of French speech audio recorded by 406 speakers from France, Canada, and Africa. The audio is read speech recorded on mobile phones in a quiet environment and covers content from fields like economics, entertainment, and news. All audio has corresponding manually transcribed text with a reported sentence accuracy rate of 95%.
769 hours of French speech recorded by 1,623 native speakers using mainstream Android phones and iPhones. The recording text was designed by linguistic experts and manually proofread, covering general interactive, in-car, and home categories.
10.9 hours of French speech from 401 speakers, with each speaker contributing 50 sentences. The audio was recorded using mainstream Android phones and iPhones for guiding scenarios including in-car, smart home, and smart speech assistant contexts. Texts were manually transcribed for accuracy.
1,804 hours of high-quality Persian speech data designed for text-to-speech applications. The dataset, created by MohammadJRanjbar, addresses a gap in Persian speech technology by providing speaker diversity and audio quality comparable to major English corpora. It was last updated on the platform in October 2025.
Ten days of aerial surveys from July 11 to 29, 2005 recorded the distribution and relative abundance of marine mammals, birds, and large fish over a 1,672 km² study area in the Gulf of Maine. The survey, conducted by SCIOPS, involved six 46 km transects per day, with observations made from a twin-engine aircraft flying at 230 m altitude. Data includes sighting locations, species identifications, and group sizes for upper trophic level predators.
EnviroAtlas data estimates walkable road intersection density within a 750-meter radius of any 10-meter pixel in 18 U.S. communities. The U.S. Environmental Protection Agency created this dataset, which was last updated in April 2025. Intersection density is calculated using kernel density, weighting closer intersections higher than distant ones.