Speech recognition, text-to-speech, speaker identification, music classification, audio event detection
2,013 datasets
Descriptive text data on folk music and dance traditions from the Olivenza region, likely documenting cultural practices. The dataset was coordinated by Álvarez Pérez, Xosé Afonso and harvested into the e-cienciaDatos Dataverse platform. It was last updated on May 5, 2024.
An interview with Sara Delgado from Piedras Albas, harvested by e-cienciaDatos. The audio recording captures personal recollections about childhood in the town, local livelihoods, life on the border, and cultural topics like music festivals and contraband. The dataset was last updated on May 5, 2024.
An interview with Francisco and Lola in Rubiás, focusing on language similarities and differences across the border. The dataset likely contains discussions on the assessment of Galician spoken on television, dialectal variations in border villages like Montalegre, and comparisons between Galician and Portuguese. It was coordinated by Álvarez Pérez, Xosé Afonso and last updated on May 5, 2024.
445,793 synthetic Japanese voice recordings totaling over 577 hours of audio, generated with the VOICEVOX engine. Created by ayousanz and updated in May 2024, the data is based on the ITA, Tsukuyomi-chan, and ROHAN text corpora.
Audio recordings and text transcripts of the Ewe Bible organized for Text-to-Speech (TTS) development. These linguistic resources support speech synthesis for the Ewe language, a Gbe language spoken primarily in Ghana and Togo.
A dataset named 'Meld Tts Gender Speaker3' was published on the HuggingFace platform by author TAESOO98 on 2024-05-28. The title suggests it contains audio samples for a specific speaker, likely intended for text-to-speech synthesis tasks. The dataset's specific content, size, and structure require verification after download.
95+ open-source datasets across voice and sound computing categories. The index facilitates discovery of specialized audio resources for various machine learning applications.
MSDWild is a dataset designed for testing multimodal analysis in tasks including multimodal speaker diarization, multimodal speaker localization, and audio-visual lip synchronization. The dataset is hosted on Hugging Face by author 'taocode' and was last updated on April 29, 2024. A sample can be viewed on the associated GitHub repository.
A dataset containing 15,000 audio samples of a male Dutch Flemish voice. It was created by fibleep and ported from the dutch-vl-tts GitHub repository to the Hugging Face platform. The data was last updated on April 16, 2024, and originates from the Mozilla Common Voice project's Dutch language data.
Telugu TTS is a dataset for speech synthesis published on HuggingFace by author deboleen6. Platform tags indicate it contains text and audio data for generating Telugu speech. The dataset was last updated on May 27, 2024.
3,674 denoised audio files from the Reazon Speech v2 dataset, processed using UVR to remove background music and noise. The dataset was cleaned by author Stardust-minus using eight A800 GPUs over approximately 10 days and was mirrored to Hugging Face by litagin in April 2024.
6,898,333 rows of chart images paired with text queries and labels, hosted on Hugging Face by ahmed-masry and last updated in March 2024. The dataset is structured for training multimodal models, with each row containing an image name, an input query, and an output label. Its primary use appears to be pretraining models for chart understanding and generation tasks.
216,284 Irish tunes in ABC notation, split into 214,122 for training and 2,162 for validation. The Irish Massive ABC Notation (IrishMAN) dataset was compiled from the traditional music sources thesession.org and abcnotation.com. It was created by sander-wood and last updated on March 16.
Jenny TTS 6H is a text-to-speech dataset published on HuggingFace by author shacharu. The dataset was last updated on 2024-05-06. The specific content and scale of the audio samples are not detailed in the available metadata.
A benchmark containing approximately 6.52 hours of human-annotated broadcast speech, totaling 8,085 utterances, across 13 distinct domains. It is designed for automatic speech recognition performance evaluation in challenging conditions. The dataset was created by SUST-CSE-Speech and last updated on March 9, 2024.
10 categories of spoken digits (0-9) provided in an audio format. This dataset serves as an acoustic counterpart to the MNIST handwritten digit collection for speech recognition tasks.
Musica is a multimodal dataset hosted on HuggingFace by author zaibutcooler, last updated on May 2, 2024. Its platform tags indicate it contains both audio and text data, likely related to music. The specific content, size, and structure require verification after download.
A collection of 5,000 hours of Bengali speech audio for automatic speech recognition, aggregated from nine public sources including Common Voice and OpenSLR. The dataset, created by SKNahin and last updated in March 2024, includes a filtering column to identify higher-quality audio segments based on word error rate and word-per-second metrics.
A dataset released as part of a 2015 IEEE ASRU workshop paper by Feng, Minwei, et al. titled 'Applying deep learning to answer selection: A study and an open task.' The data was deconstructed from tokens provided in a GitHub repository by the user 'deccan-ai'.
EdAcc (The Edinburgh International Accents of English Corpus) is an automatic speech recognition dataset composed of 40 hours of English dyadic conversations. It was created by edinburghcstr and includes speakers with a diverse set of first and second-language English accents, along with linguistic background profiles. The dataset was last updated on February 22.