Speech recognition, text-to-speech, speaker identification, music classification, audio event detection
2,013 datasets
A collection of books, sheet music, playbills, programs, clippings, drawings, and photographs related to the 1866 American musical 'The Black Crook'. The musical was a significant early success, featuring the first American performance of the Can-Can and noted by Mark Twain in 1868. The dataset was authored by Jullianne Ballou and harvested by the Texas Data Repository.
The Edward Lear collection from the Texas Data Repository includes manuscripts and letters from the English artist and poet. The collection contains a handwritten manuscript of his 1846 poem "There was an old man of Narkunder" and letters to figures like William Holman Hunt, some with ink sketches. It was digitized as part of Project REVEAL (Read and View English & American Literature).
Over 300 hours of Farsi speech data chunked into 30-second segments, derived from YouTube videos. The dataset was created by pourmand1376 and last updated on March 7, 2024. It is intended for training and testing automatic speech recognition models.
372 questions designed to assess the advanced music understanding capabilities of current large language models. The dataset was created by author 'm-a-p' and was last updated on March 1.
LibriTTS-R is a multi-speaker English speech corpus containing approximately 585 hours of read speech at a 24kHz sampling rate. The dataset is a sound quality improved version of the original LibriTTS corpus published in 2019. It was adapted for the Hugging Face datasets library by the user 'mythicinfinity'.
A dataset for testing Automatic Speech Recognition (ASR) systems in the Vietnamese language. The dataset was published on the Hugging Face platform by DataStudio and was last updated on April 11, 2024. The specific content, size, and structure require verification after download.
Myrtle.ai provides background noise audio for augmenting training data for their CAIMAN-ASR models. The dataset's modifications are licensed under CC BY 4.0, while the original source data is under CC BY 3.0 or in the public domain. It was last updated on February 19, 2024.
585 hours of 24kHz English speech audio form this multi-speaker corpus derived from LibriVox audiobooks and Project Gutenberg texts. Heiga Zen and Google Speech/Brain team members prepared the dataset specifically for TTS research. The dataset card was last updated in February 2024.
25,900 audio samples totaling 100 hours of Vietnamese speech data, originally released by FPT Corporation in 2018. This is an unofficial mirror hosted by 'doof-ferb' after the official link became inactive. The data has been pre-processed to remove nonsense strings and four files that were missing transcriptions.
This repository aggregates multiple datasets specialized for Optical Music Recognition (OMR), curated by user apacha and updated through April 2024. It provides a centralized resource for music scores and annotations used in Music Information Retrieval (MIR) research.
Los Angeles MIDI Dataset is a collection of MIDI files for music information retrieval and AI purposes, described as a state-of-the-art kilo-scale resource. It was created by projectlosangeles and was last updated in February 2024.
Five questions used for one-on-one interviews with music artist participants in the Centering Donor Consent research study. The document was authored by Itza Carbajal and harvested from the Texas Data Repository to Dataverse. It was last updated on March 18, 2024.
A talk presented at the Society for the Study of the Indigenous Languages of the Americas meeting in Boston, Massachusetts. The dataset likely contains audio recordings or analyses of tones in the Tataltepec variety of Chatino, an indigenous language. It was contributed by J. Ryan Sullivant and last updated on March 18, 2024.
AniSpeech is a continually expanding collection of captioned anime voices provided by ShoukanLabs. The dataset is separated by language and is automatically updated as more audio is labeled. The last recorded update was on 2024-01-29.
ESC-50 contains 2,000 environmental audio recordings organized into 50 semantic categories, created by Karol Piczak in 2015. Each recording is a 5-second clip extracted from the Freesound project, covering animal sounds, natural soundscapes, and domestic noises.
This dataset contains 10-second audio chunks extracted from Farsi-language YouTube videos. The segments are designed for speech processing tasks such as automatic speech recognition, speaker identification, or language modeling. The data is likely raw audio waveforms or spectrograms, suitable for training models on Persian speech patterns.
Dolly 15K Pirate Speech contains text responses transformed into a pirate tone of voice using the 'arrr' Python library. The dataset is intended for writing style transfer experiments, based on an article about fine-tuning language models. It was created by author TeeZee and last updated in February 2024.
Common Voice Romanian Speech Synthesis is a dataset hosted on HuggingFace by user VladS159, last updated on 2024-02-26. The dataset likely contains audio recordings and associated metadata for Romanian speech synthesis tasks. Its specific size, format, and detailed content require verification after download.
A Chinese Mandarin speech corpus featuring recordings from 400 speakers representing various accent regions across China. The audio was captured in quiet indoor settings using high-fidelity microphones and is provided at a 16kHz sampling rate with manual transcriptions.
85 hours of emotion-neutral Mandarin speech recordings from 218 native speakers, comprising 88,035 utterances. The corpus is designed for training multi-speaker Text-to-Speech systems and includes auxiliary speaker attributes such as gender, age group, and native accent labels.