Speech recognition, text-to-speech, speaker identification, music classification, audio event detection
1,926 datasets
VoxLingua107 is a speech dataset for training spoken language identification models. It contains 6,628 hours of short speech segments automatically extracted from YouTube videos and labeled for 107 languages. The dataset was created by TalTechNLP and was last updated on September 4, 2025.
severyn-k created a dataset of isolated guitar chord recordings for audio classification tasks. The data was recorded in realistic acoustic conditions, including minor background sounds, to improve robustness for real-world inference. The dataset was last updated on December 12, 2025.
A dataset that provides a baseline for general music object detection in the context of Optical Music Recognition (OMR), created by apacha to accompany a journal publication. It focuses on identifying and localizing musical symbols within sheet music images and was updated in February 2026.
A large-scale Chinese multi-turn dialogue speech synthesis dataset containing 46,080 conversations across domains such as literary Q&A, natural dialogue, and poetry. The dataset comprises approximately 275,000 WAV audio files totaling 1,000-1,200 hours of speech at a 16 kHz sampling rate. It was created by MYJOKERML and last updated on Hugging Face in November 2025.
Over 1,500 fifteen-minute intervals document real-time acoustic whale detections during a dedicated Antarctic voyage. The Australian Antarctic Data Centre compiled this log, where acousticians recorded whale call bearings, group counts, and vessel interactions. Data collection occurred during the 2013 Antarctic Blue Whale Voyage to the Southern Ocean.
LibriSpeech contains 1,000 hours of 16kHz read English speech derived from LibriVox audiobooks, prepared by Vassil Panayotov and Daniel Povey. The corpus features segmented and aligned audio paired with corresponding text transcripts for speech recognition and speaker identification tasks. The dataset is organized into subsets based on the difficulty of the speech recognition task and the quality of the recordings.
A Kaggle dataset titled 'face-tts-resources2' likely contains resources for text-to-speech (TTS) model development. The specific contents, such as audio samples, text transcripts, or model checkpoints, require verification after download. Metadata regarding size, format, and origin is not provided.
A Kaggle dataset likely containing audio samples and associated text for speech synthesis. The title suggests a focus on a Gemini style of speech generation. The dataset's author, size, and specific temporal coverage are unknown.
Bengali-language audio data for training and evaluating long-form automatic speech recognition (ASR) models. The dataset is hosted on Kaggle, but its specific size, collection method, and origin are not detailed in the provided metadata. Metadata is minimal; actual content requires verification after download.
A cleaned dataset for text-to-speech (TTS) applications, sourced from Kaggle. The dataset's specific size, format, and creation details are not provided in the available metadata. It likely contains processed audio samples and corresponding text transcripts for training speech synthesis models.
Bengali Raw Subtitles Music-Speech Split JSON3 is a dataset from Kaggle. The title suggests it contains subtitle text for Bengali audio or video, with annotations to distinguish between speech and music segments. The dataset's specific size, origin, and update history are not provided in the available metadata.
LibriSpeech is a widely used corpus for automatic speech recognition research. This specific subset, 'train_clean_100', likely contains 100 hours of read English speech audio and corresponding transcripts. It is published on Kaggle, but detailed metadata about its exact composition and origin is not provided in the available metadata.
An audio dataset published on Kaggle. The title suggests it contains audio files, but the specific content, size, and collection details are not provided in the metadata. The author, organization, and license are unknown.
A dataset of Egyptian music, sourced from Kaggle. The dataset's specific contents, size, and creation details are not provided in the available metadata. Further verification after download is required to determine the exact scope and nature of the audio files.
A dataset titled 'khmer_asr_cache' published on Kaggle. The title suggests it contains audio data for Khmer language automatic speech recognition. No further metadata on size, origin, or content is available.
A dataset of speech transcripts, likely aligned using the Montreal Forced Aligner (MFA) tool. The dataset is published on Kaggle, but details on its size, creation date, and specific source are not provided in the metadata. The title suggests it contains phonetic or word-level alignment data for audio recordings.
Transcripts from the Hue Voice Dataset, a collection of recorded speech data. The dataset is hosted on Kaggle, but the volume of transcripts, the recording source, and the creation date are not specified in the available metadata. Further details about the audio recordings, speakers, and transcription methodology require inspection of the actual data files.
Kaggle hosts a dataset titled 'vgis-asr-model'. The dataset likely contains audio data and associated metadata for training or evaluating automatic speech recognition models. Its author, organization, size, and specific content are unknown.
MCV Assamese Speech Corpus is a dataset of Assamese speech audio. It is hosted on Kaggle, but the author, organization, and collection details are not provided. The dataset's size, format, and specific content require verification after download.
A Kaggle-hosted dataset focused on wake words, likely containing audio samples for training speech recognition systems. The dataset's author is Deepa Deepak. Metadata is minimal; specifics on size, format, and content require verification after download.