Speech recognition, text-to-speech, speaker identification, music classification, audio event detection
1,909 datasets
722 seed utterances and 32,506 Common Voice samples were used to generate this Taiwanese Hokkien (Min Nan) speech dataset via the CosyVoice3 model. The dataset includes audio files, corresponding text, and speaker metadata. It was created by lianghsun and last updated on March 19, 2026.
ToneWebinars Balalaika is a 248.9-hour Russian speech corpus curated from podcasts by the MTUCI lab260 team. Released in early 2026, the dataset was processed using the BALALAIKA pipeline to provide high-quality audio for generative speech tasks. It serves as a refined version of the original ToneWebinars source, specifically filtered for speech synthesis and recognition.
TWB Voice Kanuri TTS 1.0 Sample Set is a high-quality text-to-speech corpus containing read speech data in Kanuri. It was recorded by a single female speaker under acoustically optimal conditions and represents 10% of the complete dataset collected by CLEAR Global (formerly Translators without Borders). The dataset page was last updated on 2026-02-23.
This dataset features high-quality conversational audio samples for automatic speech recognition (ASR) in Vietnamese, Korean, Arabic, and Filipino. It includes paired audio and transcripts of natural, unscripted speech, covering both single-speaker and dual-speaker interactions. Audio is sampled at 16 kHz to 24 kHz with a 16-bit depth.
Approximately 865,000 AI-generated songs collected from five platforms: Mureka, Riffusion, Sonauto, Suno, and Udio. The dataset includes original audio files and full platform metadata stored as JSON sidecar files. It was created by 'ai-music' and last updated on February 25, 2026.
16 June 2019 report details the loss of control and ground collision of an amateur-built Pitts S2E aircraft, registration C-GONV, in Saint-Jean-Port-Joli, Quebec. The investigation was conducted and published by the Transportation Safety Board of Canada. The dataset is a single HTML document containing the official safety investigation narrative.
VoxCeleb2 Dev is the training subset of the VoxCeleb2 dataset, used for speaker identification and audio retrieval tasks. It is an expanded version of VoxCeleb1, containing more speakers and audio samples, and includes standardized audio files with corresponding metadata. The dataset was uploaded by 'humanify' to Hugging Face and was last updated on 2026-03-05.
This dataset offers clinical mental health labels and audio-based model scores for 35,000 individuals, totaling 863 hours of speech data. Created by KintsugiHealth in 2026, it includes demographic metadata for the validation and test sets used in model development.
Forest inventory outputs from the Eastern Massachusetts National Wildlife Refuge Complex likely contain measurements of tree cavities, canopy structure, and biomass. The data is managed by the Department of the Interior and was last updated in March 2026. It provides detailed metrics for forest communities and ecosystems.
Jazzmus provides approximately 1,000 expert-annotated jazz lead sheets for Optical Music Recognition (OMR), developed by the PRAIG research group in 2025. The dataset includes high-resolution images paired with system-level bounding boxes and musical encodings for end-to-end transcription tasks.
ViMedCSS is a Vietnamese medical speech dataset designed for code-switching automatic speech recognition. It contains 11,832 training utterances totaling 24.30 hours, with each utterance embedding at least one non-Vietnamese medical term, primarily English. The dataset was created by tensorxt and is associated with the LREC 2026 conference.
Tts Female 70H is a text-to-speech dataset published on Hugging Face by vfdanil. It was last updated on April 24, 2026. Its specific content and scale are unknown from the available metadata.
A collection of open-source musical instruments in the SFZ format, sourced from the sfzinstruments website. The dataset was created by 'projectlosangeles' and was last updated in March 2026.
Ghana NLP Community released this 2,700-hour collection of Ghanaian English speech and transcriptions in March 2026. Sourced from news media broadcasts, it contains up to 1,000,000 audio segments specifically for West African accent modeling.
Top 100 songs from over 20 electronic music subgenres on Beatport as of November 2018. The dataset contains audio features extracted from two-minute samples of each song using the pyAudioAnalysis library. It was used in a publication on automatic subgenre classification in electronic dance music.
Expert scores and audio features for music assessment provide a structured evaluation of musical performances. The dataset likely contains quantitative metrics derived from audio recordings alongside subjective ratings from experts. Its origin and scale are unspecified.
A cleaned, metadata-rich Shona speech dataset prepared through a reproducible data engineering pipeline. The dataset is derived from the google/WaxalNLP source, specifically the sna_asr subset, and was last updated on March 20, 2026. It is intended as a general-purpose standard corpus for downstream tasks.
A multimodal dataset combining audio and text data for music genre classification. It likely contains audio features from the GTZAN benchmark dataset paired with corresponding song lyrics. The dataset is published on Kaggle, but its specific creation date and author are unknown.
Dataset_train_xtts is a dataset for training text-to-speech models, published on Kaggle. Its specific content, size, origin, collection method, author, and temporal coverage are not detailed in the available metadata.
10,000+ hours of interview audio and video sourced for AI training. The data is described as ethically sourced. The dataset is hosted on Kaggle, but details about the author, organization, and specific collection dates are unknown.