Speech recognition, text-to-speech, speaker identification, music classification, audio event detection
2,013 datasets
A 30-hour voice dataset recorded by an Irish speaker named Jenny. The dataset includes audio of newspaper headlines, YouTube video transcripts, sections from the books '1984' and 'Little Women', Wikipedia articles, recipes, Reddit comments, song lyrics, and transcripts from the show 'Friends'. Audio files are in 48 kHz, 16-bit PCM format, and the dataset was last updated on Hugging Face in January 2024.
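The stated audio format can be checked after download. The following is a minimal sketch using Python's standard `wave` module, assuming the files are plain WAV containers; the function name `wav_matches` and its default arguments are illustrative and not part of the dataset.

```python
import wave


def wav_matches(source, expected_rate=48_000, expected_sample_width=2):
    """Return True if a WAV file has the expected sample rate and sample width.

    `source` is a file path or a binary file-like object.
    `expected_sample_width` is given in bytes (2 bytes = 16-bit PCM).
    """
    with wave.open(source, "rb") as wav_file:
        return (
            wav_file.getframerate() == expected_rate
            and wav_file.getsampwidth() == expected_sample_width
        )
```

Calling `wav_matches("clip.wav")` on a downloaded file (the path here is hypothetical) returns `False` if the file was resampled or stored at a different bit depth.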
A collection of isolated dry voice recordings and a corresponding nene_org.txt label file for the character Nene Kusanagi. These vocal stems are voiced by Machico and sourced from the game Project Sekai for use in speech synthesis and voice cloning.
A 2024 release from ASAPP contains a subset of the Gridspace-Stanford Harper Valley speech corpus, annotated for dialog act classification. The dataset includes English audio and text data tagged for customer service applications.
The Spoken Language Understanding Evaluation (SLUE) benchmark tracks research progress on multiple SLU tasks. It facilitates the development of pre-trained representations by providing fine-tuning and evaluation sets for a variety of SLU tasks. The benchmark was created by ASAPP and focuses on freely available datasets.
Paulmooney Medical ASR Data is a dataset for automatic speech recognition in a medical context, published on Hugging Face by yashtiwari. It was last updated on February 16, 2024. The specific content, scale, and collection methodology require verification after download.
13,203 music files with a total playtime of 36.72 hours, generated using the MU-LLaMA and VideoMAE captioning models. The dataset was created by M2UGen to train the M2UGen model and was last updated on January 2, 2024.
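The file count and total playtime together imply an average clip length. A short sketch of that arithmetic, using only the two figures from the description; the helper name `average_clip_seconds` is illustrative.

```python
def average_clip_seconds(total_hours, num_files):
    """Average duration per file in seconds, from total hours and file count."""
    return total_hours * 3600 / num_files


avg = average_clip_seconds(36.72, 13_203)
print(f"{avg:.2f} s per file")  # prints "10.01 s per file"
```

An average of about 10 seconds per file suggests short clips rather than full-length tracks, although the description does not say how durations are distributed.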
Crowdsourced speech recordings and transcriptions in over 20 listed languages, including Abkhaz, Basaa, and Cantonese. This is an unofficial conversion of the Mozilla Common Voice Corpus 16.0 that provides paired audio and text data for multilingual speech technology development.
MohamedRashad compiled a dataset of text-to-speech samples designed to showcase linguistic diversity. The dataset page was last updated on December 12, 2023. The description suggests that the collection contains speech samples across multiple languages.
800 million words of normalized text and pre-trained n-gram models derived from 14,500 public domain books. These resources provide the linguistic foundation for the LibriSpeech ASR corpus across multiple model formats.
7 hours of transcribed audio recordings of Chilean Spanish sentences. The dataset was created by author ylacombe from restructured OpenSLR archives and was last updated in November 2023.
Chilean Spanish audio data consisting of 7 hours of transcribed, high-quality sentences recorded by 31 volunteers. The dataset was created by ylacombe and restructured from original OpenSLR archives for easier streaming. It was last updated on November 27, 2023.
LibriSpeech is a 1,000-hour corpus of 16 kHz read English speech derived from audiobooks, designed for automatic speech recognition. This version includes alignments generated by the Montreal Forced Aligner (MFA). The dataset was uploaded to Hugging Face by gilkeyio and last updated on November 22, 2023.
English Wikipedia text and ASR error data presented in an ASRU-2023 paper. It contains 4.3 million unique words or phrases from Wikipedia titles occurring in 33.8 million paragraphs, plus 26 million phrase pairs representing ASR recognition errors. The dataset was created by bene-ges and last updated on Hugging Face in December 2023.
CML-TTS is a multilingual text-to-speech dataset developed at the Center of Excellence in Artificial Intelligence (CEIA) of the Federal University of Goias. It comprises audiobooks sourced from public domain books on Project Gutenberg, read by volunteers from the LibriVox project, and includes recordings in languages such as Dutch, German, French, Italian, and Polish. The dataset was last updated on Hugging Face on November 24, 2023.
1,000 hours of Arabic speech audio sampled at 16 kHz, sourced from over 700 YouTube channels. The collection spans multiple regions, genres, and dialects to support the development of speech recognition technologies.
Hi-Fi TTS is a multi-speaker English text-to-speech dataset derived from LibriVox's public domain audio books and Project Gutenberg texts. The dataset was created by MikhailT and was last updated on Hugging Face on November 30, 2023. Its specific size, row count, and file formats are not detailed in the provided metadata.
Dummy Optimus Prime Tts is a dataset hosted on Hugging Face by the author ylacombe. It was last updated on December 20, 2023. Judging by its title, the dataset likely contains audio samples or related data for text-to-speech synthesis.
The GTZAN dataset contains 1,000 audio tracks for musical genre classification, each 30 seconds long. It includes 10 distinct genres, with 100 tracks per genre, all formatted as 22,050 Hz mono 16-bit WAV files.
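The figures in the GTZAN description are enough to estimate total playtime and raw audio size. A rough sketch of that arithmetic follows; it assumes every track is exactly 30 seconds and ignores WAV header overhead, and the constant and function names are illustrative.

```python
# Figures taken from the description above.
NUM_GENRES = 10
TRACKS_PER_GENRE = 100
TRACK_SECONDS = 30
SAMPLE_RATE_HZ = 22_050
BYTES_PER_SAMPLE = 2  # 16-bit samples
CHANNELS = 1          # mono


def gtzan_totals():
    """Return (track count, total hours, PCM bytes per track, total PCM bytes)."""
    tracks = NUM_GENRES * TRACKS_PER_GENRE
    total_hours = tracks * TRACK_SECONDS / 3600
    pcm_bytes_per_track = SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS * TRACK_SECONDS
    total_pcm_bytes = pcm_bytes_per_track * tracks
    return tracks, total_hours, pcm_bytes_per_track, total_pcm_bytes


tracks, hours, per_track, total = gtzan_totals()
print(f"{tracks} tracks, {hours:.2f} hours, {total / 1e9:.2f} GB of raw PCM")
```

Under these assumptions the collection comes to roughly 8.33 hours of audio and about 1.32 GB of uncompressed sample data.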
MixologyDB is a dataset created to advance the field of intelligent music production, specifically targeting music mixing in a digital audio workstation. The dataset was created by mclemcrew and was last updated on Hugging Face in November 2023. Its specific size, row count, and column structure are not detailed in the provided metadata.
Audio files and corresponding transcriptions for training Automatic Speech Recognition models for the Iban language. The dataset was created by user 'meisin123' and was last updated in November 2023. It is hosted on the Hugging Face platform.