Speech recognition, text-to-speech, speaker identification, music classification, audio event detection
1,962 datasets
Attention2Probability is a lightweight intervention scheme for speech terminology. The core approach uses a cross-attention mechanism to retrieve terms likely to appear in audio, which are then added to an LLM prompt for term intervention. The dataset, created by ByteDance, was last updated on Hugging Face on August 27, 2025.
Japanese Anime Speech contains between 10,000 and 100,000 audio-text pairs sourced from Japanese visual novels, released by joujiboi in 2024. The collection pairs audio clips with their corresponding transcriptions to improve the accuracy of automatic speech recognition models for anime-style media.
316,000 music tracks shared on YouTube under the CC-BY license are described in this metadata collection. The dataset includes descriptions, tags, and other metadata associated with videos from a diverse range of artists and genres. It was created by WaveGenAI and last updated on October 29, 2024.
ChartQA is a multimodal dataset hosted by ahmed-masry on Hugging Face, last updated on June 22, 2024. It likely contains chart images paired with textual questions and answers for visual question answering tasks. The dataset requires manual download of a zip file and cannot be loaded directly via the standard datasets library function.
A speech dataset of recordings of two people engaging in spontaneous conversations in English. The dataset aims to fill the gap in high-quality spontaneous speech data and was created by CASPER-SSSD, last updated on June 16, 2025. Each participant joined the conversation from their own device through a custom-built web platform.
AVHRR satellite imagery of Eastern Antarctica was captured by the NOAA12 satellite. Data collection began in June 1996, covering specific coastal and ice shelf regions, but the archival service was discontinued in 2015. The data originates from the Antarctic Meteorology Centre's Casey HRPT receiver, managed by the Australian Antarctic Data Centre (AU_AADC).
Coastal seafloor physiographic zones between Nahant and Gloucester, Massachusetts, are characterized from NOAA nautical charts and aerial photographs. The dataset was created by SCIOPS and last updated in 2003. It focuses on inshore areas not covered by other high-resolution geophysical surveys.
2003 data from NASA EarthData provides geospatial statistics on internal wave packets extracted from Synthetic Aperture Radar (SAR) imagery over Massachusetts Bay. The dataset, sourced from NOAA NCEI, contains polygons representing 1x1 minute latitude/longitude grid cells with calculated statistical metrics for each cell. It was created to analyze the frequency, size, and location of these oceanographic features.
Approximately 170 square kilometers of seafloor data were collected for Boston Harbor and its approaches. The National Oceanic and Atmospheric Administration Ship Whiting gathered sidescan sonar and bathymetric measurements in 2000 and 2001. The Massachusetts Office of Coastal Zone Management and the U.S. Geological Survey reprocessed and gridded the data.
Petersham, Massachusetts is the location for these lidar-derived digital surface model (DSM) data, representing surface elevations for 'leaf-on' conditions in August 2022. The data were collected by the NSIDC_CPRD organization as part of the SMAPVEX19-22 campaign to validate satellite-derived soil moisture estimates in forested areas. The DSM captures the highest elevation of features, which may include bare-earth, vegetation, and human-made objects.
Reprocessed SEVIRI All-Sky Radiances product contains mean brightness temperatures from all thermal infrared and water vapor channels for 16x16 pixel areas. The product, generated by EUMETSAT using version 1.5.3 software and ECMWF ERA-interim data, includes clear and cloudy sky brightness temperatures, clear sky fraction, and solar zenith angle. Data is BUFR encoded and provided at 3-hourly intervals on every third quarter hour.
Approximately 170 square kilometers of seafloor data were collected by NOAA Ship Whiting in 2000 and 2001. The Massachusetts Office of Coastal Zone Management and the U.S. Geological Survey reprocessed and gridded the sidescan sonar and bathymetric measurements. These data were converted to the Massachusetts State Plane coordinate system in 2006.
2003 data from NOAA NCEI provides statistics on internal wave packets extracted from Synthetic Aperture Radar (SAR) imagery. The data is aggregated into 30x30 arc-second latitude/longitude polygon grid cells. It includes calculated metrics for each cell, such as packet frequency and area statistics.
MDCC is a large-scale Cantonese automatic speech recognition dataset compiled from multiple domains. It provides .wav recordings of both spontaneous and read speech paired with UTF-8 plain-text transcripts and speaker metadata. The dataset was created by author 'ming030890' and was last updated on the Hugging Face platform on 2025-07-26.
Audio segments and transcriptions extracted from the NPTEL Introduction to World Literature lecture series. The dataset is intended for research and educational purposes in speech recognition and literary content analysis. It was uploaded by author swastik17 to Hugging Face and last updated on 2025-05-20.
Tejasva-Maurya's English Technical Speech Dataset contains 11,247 audio recordings of technical vocabulary. The collection includes transcriptions and speaker embeddings, last updated on October 26, 2024. It is designed for developing speech and language models.
A sample of a paid corpus containing speech recordings from 10 native British English speakers. It is designed for speech synthesis research and features balanced phoneme coverage and annotations prepared with the involvement of a professional phonetician.
Nawar Halabi at the University of Southampton developed this speech corpus as part of PhD work. Recordings were made in a professional studio using the south Levantine Arabic dialect with a Damascene accent. Speech synthesized from this corpus reportedly yields a high-quality, natural voice.
A curated subset of the MTG-Jamendo Autotagging benchmark containing tracks annotated with genre, instrument, and mood/theme tags. Audio files are preprocessed to 30-second clips at a 16 kHz sampling rate for consistent music auto-tagging tasks. The dataset was uploaded by author vtsouval and last updated on 2025-05-14.
MusicSem is a multimodal dataset containing 35,977 entries of paired text and audio. It includes a withheld test set of 480 entries for leaderboard evaluation. The dataset was curated by Rebecca Salganik, Teng Tu, Fei-Yueh Chen, Xiaohao Liu, Kaifeng Lu, Ethan Luvisia, Zhiyao Duan, Guillaume Salha-Galvan, Anson Kahng, Yunshan Ma, and Jian Kang.