Speech recognition, text-to-speech, speaker identification, music classification, audio event detection
1,962 datasets
79 songs across multiple genres and languages feature lyrics aligned to audio on a word-by-word basis with start and end times. Created by jamendolyrics, this dataset serves primarily as a benchmark for automatic lyrics alignment tasks. It was last updated in March 2025.
Semantic and acoustic tokens for the LibriLight and LibriTTS English speech corpora, formatted for training SPEAR TTS-like models. The dataset provides 24 kHz EnCodec acoustic tokens at 6 kbps and semantic tokens generated through a Whisper tiny VQ bottleneck trained on LibriLight subsets.
Audio recordings of sung poetry and music collected during fieldwork in the Pamir Mountains of Tajikistan in 1993. The recordings were made in the regions of Shughnan, Rushan, and Wakhan by researcher G. van den Berg.
MIDI Loops is a collection of quality-labeled MIDI music loops, each precisely 32 beats or 8 bars long. The dataset was compiled by asigalov61 from the LAKH MIDI dataset and the Annotated MIDI dataset, with loops numbered in order of composition. It was last updated in November 2025.
European language speech recognition metadata sourced from CommonVoice and Multilingual LibriSpeech datasets. The dataset contains only metadata files in JSON or Parquet format, with audio files not included. It was created by WhissleAI and last updated on April 30, 2025.
MuChoMusic is a benchmark containing 1,187 multiple-choice questions validated by human annotators, based on 644 music tracks from two publicly available datasets. It was created by mulab-mir and last updated on August 5, 2024. The questions cover a wide variety of genres and assess knowledge and reasoning across several musical concepts.
An upload of the NST Danish ASR Database, reorganized for use on the Hugging Face platform. The dataset is intended for training automatic speech recognition models and is available in the Danish language. The training and test splits are the original ones from the source database.
Open Text-to-Speech voices for the Ukrainian language. The dataset was created by a user named Smoliakov and is hosted on Hugging Face by the organization 'speech-uk'. It was last updated on February 24, 2025.
The Free Music Archive (FMA) is an open dataset for evaluating tasks in Music Information Retrieval (MIR). It was introduced by Michaël Defferrard et al. at the ISMIR conference in 2017.
Common Voice 20.0 Mongolian Dataset is a subset of Mozilla's Common Voice project containing Mongolian speech data. The dataset includes audio clips in .mp3 format, transcriptions, train/test/dev splits, and metadata such as speaker demographics. It was uploaded by user 'warmestman' to Hugging Face on March 5, 2025.
13,203 music files with a total playtime of 36.72 hours, generated using the MU-LLaMA and VideoMAE captioning models. The dataset was created by M2UGen to train the M2UGen model and was last updated on January 2, 2024.
A 30-hour voice dataset recorded by an Irish speaker named Jenny. The dataset includes readings of newspaper headlines, YouTube video transcripts, sections from the books '1984' and 'Little Women', Wikipedia articles, recipes, Reddit comments, song lyrics, and transcripts from the show 'Friends'. Audio files are in 48 kHz, 16-bit PCM format, and the dataset was last updated on Hugging Face in January 2024.
675 map sheets comprise the first large-scale, nationwide map series for the German Empire, completed in 1909. The series was designed in polyhedral projection with each sheet covering an area of approximately 35 km by 28 km. The Bundesamt für Kartographie und Geodäsie provides this historical map sheet, originally produced in monochrome.
Prussian Original Survey Maps are hand-drawn, one-off topographic maps produced starting in 1822 for the entire territory of Prussia. The maps were created at a scale of 1:25,000 and were not published, serving as the basis for smaller-scale maps. The dataset includes a specific sheet for the Wittstock/Dosse area, produced by the Bundesamt für Kartographie und Geodäsie.
A corpus of high-quality audio recordings in the South Levantine Arabic dialect, focusing on the Damascene accent. The corpus was recorded in a professional studio and is provided in .flac format to reduce storage while maintaining audio fidelity.
Swedish mountain forest boundary digitized by the Swedish Forest Agency (Skogsstyrelsen). The boundary is suitable for overview purposes but lacks legal effect. The legal basis is the regulation SKSFS 1991:3.
This benchmark contains evaluation data for long-form Text-to-Speech (TTS) and speech-audio understanding tasks in English and Chinese. It is designed to test the capabilities of omni-modal large language models in generating personalized, long-horizon speech and interpreting complex audio signals.
A corrected version of the Mozilla CommonVoice 17 Turkish corpus for speech recognition tasks. It utilizes filename stems as unique keys to reorganize the data structure and improve split consistency for model training.
MohamedRashad compiled a dataset of text-to-speech samples designed to showcase linguistic diversity. The dataset page was last updated on December 12, 2023. The description suggests the collection likely contains speech samples across multiple languages.
CORAA v1.1 contains 290.77 hours of Brazilian Portuguese audio with transcriptions, segmented into over 400,000 audio files. The dataset is compiled from five distinct speech projects, including academic recordings and TEDx talks, and is validated for automatic speech recognition research.