Speech recognition, text-to-speech, speaker identification, music classification, audio event detection
2,013 datasets
Vocalforge is a Python toolkit designed for generating synthetic voice datasets. The project, authored by rioharper on GitHub, was last updated in December 2023. It is released under the permissive MIT license.
Semantic and acoustic tokens for the LibriLight and LibriTTS English speech corpora, formatted for training SPEAR TTS-like models. The dataset features 24 kHz EnCodec acoustic tokens at 6 kbps and semantic tokens generated through a Whisper tiny VQ bottleneck trained on LibriLight subsets.
An upload of the NST Danish ASR Database, reorganized for use on the Hugging Face platform. The dataset is intended for training automatic speech recognition models and is available in the Danish language. The training and test splits are the original ones from the source database.
Librispeech Long is a speech audio dataset derived from the LibriSpeech corpus, likely containing longer-form English audio segments. The dataset was created by distil-whisper and was last updated on Hugging Face in November 2023. Its specific size, format, and license details are not provided in the available metadata.
Sectors affected by rail noise in the French department of Maine-et-Loire, determined by the Prefect under national noise control laws. The dataset is provided by the Bureau de Recherches Géologiques et Minières and was last updated on August 18, 2023. It likely contains geographic boundaries for areas where specific acoustic requirements apply for new construction.
Multiple human-labeled audio collections spanning various sound categories are hosted on this platform, all built from content in the Freesound repository. The labels are produced through a collaborative framework in which users label and verify open-source audio samples.
LP-MusicCaps, Music Negation/Temporal Ordering, and WavCaps datasets were re-organized into instruction form by seungheondoh. The dataset was last updated on August 16, 2023. It likely contains pseudo-captions for music and audio content generated using ChatGPT.
A speech dataset in the Persian language, published on the Hugging Face platform by SeyedAli and last updated on September 15, 2023. The dataset's specific content, size, and structure are not detailed in the provided metadata. Its primary modality is indicated as audio, with associated text for processing tasks.
WavCaps is a dataset for audio-language multimodal research, with audio clips sourced from FreeSound, BBC Sound Effects, SoundBible, and the AudioSet Strongly-labelled Subset. The dataset was created by cvssp and last updated on Hugging Face in July 2023. It uses ChatGPT to assist in generating weakly-labelled captions for the audio content.
Piano patterns for jazz music audio machine learning research. The data focuses on the transcription and analysis of jazz piano performances and supports the development of models for genre-specific transcription and pattern recognition.
LP-MusicCaps-MTT is a dataset of pseudo music captions generated by a Large Language Model for text-to-music and music-to-text tasks. The dataset was constructed from three existing multi-label tag datasets using four task-specific instructions. It was created by seungheondoh and last updated on August 4, 2023.
LLM-generated pseudo music captions derived from three multi-label tag datasets for audio-language tasks. The dataset features music-to-caption pairs across four distinct generation tasks to support text-to-music and music-to-text model training.
A combined dataset from the ATCO2-ASR and ATCOSIM collections, likely containing air traffic control speech audio. The dataset was created by author jlvdoorn and last updated on July 7, 2023. It is split into 80% training and 20% validation partitions, with some files containing additional metadata.
A collection of 3,992 audio clips of Kinyarwanda text-to-speech recordings made by a single voice actress in a studio setting. It was collected as part of the Mbaza project and includes a CSV file linking audio file names to their corresponding written text.
1,000 hours of speech audio sampled at 16 kHz, crawled from over 700 YouTube channels. The MASC dataset is multi-regional, multi-genre, and multi-dialect, intended to advance research and development of Arabic speech technology. It was authored by 'pain' and last updated on the Hugging Face platform in June 2023.
15 hours of Vietnamese speech recordings specifically curated for Automatic Speech Recognition (ASR) tasks. The corpus was developed by AILAB at VNUHCM - University of Science and includes audio data paired with corresponding transcriptions for linguistic research.
Naija-Stopwords is a collection of stopwords from the four most widely spoken languages in Nigeria: Hausa, Igbo, Nigerian-Pidgin, and Yorùbá. It is part of the Naija-Senti project and was authored by HausaNLP. The dataset was last updated on June 18, 2023.
10 hours of speech recordings and transcriptions from the ATCOSIM project for Air Traffic Management. The data captures interactions between controllers and pilots during real-time simulations to support automatic speech recognition research.
6 pre-trained base models for SoVITS 4.0 voice conversion, featuring 768-dimensional vectors and layer 12 configurations. These models were trained on the m4singer and vctk datasets, reaching up to 320,000 training steps with loss values as low as 14.1.
A dataset for Automatic Speech Recognition (ASR) containing Hebrew speech audio files. The dataset was created by author 'imvladikon' and was last updated in May 2023.