Speech recognition, text-to-speech, speaker identification, music classification, audio event detection
1,910 datasets
2,800 audio stimuli of 200 target words spoken in a carrier phrase by two actresses. The set includes recordings for seven distinct emotions: anger, disgust, fear, happiness, pleasant surprise, sadness, and neutral. It was created by M. Kathleen Pichora‐Fuller at the University of Toronto, modeled on the Northwestern University Auditory Test No. 6.
Aerial imagery of roads in Massachusetts has been processed to remove censored regions from associated image masks. The dataset likely contains georeferenced images suitable for computer vision tasks. The author, organization, and specific collection details are unknown.
SMB is a benchmark dataset of printed Common Western Modern Notation scores developed by the Pattern Recognition and Artificial Intelligence Group at the University of Alicante. It is designed for Optical Music Recognition and image segmentation tasks involving full-page music scores.
81 sentences across three CSV files provide the first phonetically balanced corpus for Tarifit (Riffian Berber) text-to-speech training, created by jamalinu in 2026. The collection includes IPA transcriptions and a native-validated customer service subset specifically formatted for Coqui TTS.
AxonData's English Contact Center Audio Dataset provides over 1,000 hours of inbound and outbound telephone call audio paired with English transcripts. The data consists of real-world, non-synthetic conversations featuring diverse English accents. The dataset was last updated on February 13, 2026.
A restructured subset of the AVSpeech dataset provides separated video and audio streams. The dataset was created by ProgramComputer and was last updated on February 20, 2026. Each clip has a unique identifier derived from the original YouTube ID and timestamps.
Various routines for drawing ellipses and ellipse-like confidence regions, implementing plots from Murdoch and Chow (1996). The package also includes routines for profile plots described in Bates and Watts (1988). It was ported to R by Jesus M. Frias Celayeta.
An integrated set of tools for analyzing and simulating networks using exponential-family random graph models (ERGMs). The package is part of the Statnet suite for network analysis and is authored by Mark S. Handcock. It is described in peer-reviewed publications from the Journal of Statistical Software.
Music Emotion IoT Multimodal Dataset is a collection of data for analyzing emotional responses to music. It likely contains synchronized audio, physiological, and image features gathered from IoT devices. The dataset's author, organization, size, and update history are unknown.
Librispeech Synth 300h is a synthetic speech dataset derived from the LibriSpeech corpus, containing up to 300 hours of audio. It is hosted on Kaggle and appears to be a processed version for speech synthesis tasks, likely containing audio generated by text-to-speech systems. The specific creator, generation method, and exact audio characteristics require verification after download.
A processed speech dataset derived from i4ds/spc_r. Each row represents a merged speech segment from a single speaker, created by applying speaker diarization and merging consecutive segments from the same speaker. The dataset was created by i4ds and last updated on Hugging Face in February 2026.
Polish-language training data for text-to-speech models, published on the Hugging Face platform. The dataset was uploaded by the user 'agnostic' and last updated on April 3, 2026. Its specific content, size, and structure require verification after download.
ONEMUSIC is a free, open-source dataset available on Kaggle. The dataset originates from a GitHub project of the same name. The specific contents, size, and creation details are not provided in the available metadata.
Chat-TTS_SM is a dataset published on Kaggle. Its title suggests it contains data related to a text-to-speech model, likely for training or evaluation. The dataset's specific content, size, and origin are not detailed in the provided metadata.
An audio dataset likely containing samples generated by the ChatTTS text-to-speech model. The dataset is published on Kaggle, but details about its size, creation date, and specific content are not provided in the metadata. The author and organization are unknown.
A dataset titled 'F5-TTS_Marathi_SD' is hosted on Kaggle. The title suggests it contains audio data for Marathi text-to-speech synthesis. Metadata such as size, row count, columns, and license details are unknown.
F5-TTS_Urdu_SD is a dataset for Urdu text-to-speech synthesis, published on Kaggle. The dataset likely contains audio samples and corresponding text transcripts. Metadata is minimal; specifics on size, format, and collection details require verification after download.
A collection of over 120,000 published Steam games, compiled by Fronkon Games using the Steam API and Steam Spy. As of February 2026, the data covers standalone games only, omitting DLCs, soundtracks, and videos.
Arabic calligraphy styles likely collected for machine learning applications. The dataset is published on Kaggle. Its specific size, creation date, and author are unknown.
A text-to-speech dataset for the Bible in the Ewe language, likely containing 50 audio files or chapters. It was published by the Ghana NLP Community on the Hugging Face platform and was last updated on April 10, 2026. The dataset's primary purpose appears to be generating spoken audio from biblical text.