Speech recognition, text-to-speech, speaker identification, music classification, audio event detection
1,909 datasets
125 terms define the vocabulary of the UK live music sector. The glossary includes 348 cross-references linking related concepts and 145 question-answer pairs for clarification. It originates from the GigXchange platform.
Wu dialect speech data provides a manually curated benchmark for multiple speech processing tasks. It includes 9.75 hours of Wu dialect ASR data, covering Shanghainese, Suzhounese, and Mandarin code-mixed speech. The benchmark was created by ASLP-lab and updated in February 2026.
900 hours of Finnish and 5,090 hours of Swedish speech data extracted from recordings of Nordic parliamentary proceedings. The dataset was created by Aalto-Speech-Synthesis and announced as accepted for presentation at ICASSP 2026. The dataset page was last updated on February 18, 2026.
TTS_BETE is a dataset hosted on Kaggle. Its title suggests it contains data related to text-to-speech synthesis. The dataset's specific content, size, and origin are not detailed in the available metadata.
test_music is a dataset hosted on Kaggle. Its specific content and structure are not detailed in the provided metadata. The dataset's scale, origin, and creation date are unknown.
Spectrogram images generated from audio clips for training machine learning models. The dataset's author, organization, and specific scale are unknown. It was sourced from the Kaggle platform.
Leyu Amharic provides a parallel corpus of audio recordings paired with text transcripts for the Wello dialect of Amharic. The dataset is curated by leyu-amharic to capture dialect-specific phonetic and prosodic variations. It was last updated in February 2026.
Rick Perlstein's book NIXONLAND analyzes the political and social divisions in the United States during the 1960s. The text begins with the 1965 Watts riots and traces the political resurgence of Richard Nixon, culminating in his 1972 landslide re-election. The work is described as a bestseller and is published under a closed license.
A dataset titled 'Heldout Product Hard Negative TTS' is hosted on Kaggle. The dataset's content likely relates to text-to-speech (TTS) evaluation, specifically containing 'hard negative' audio samples for a held-out product set. Metadata is minimal; the exact number of samples, audio characteristics, and creation details are unknown.
TTS-Romanian contains 267,410 speech samples totaling 720 hours of audio derived from Romanian audiobooks. Released by datadriven-company and updated in early 2026, the collection features 456 unique speakers with an average DNSMOS quality score of 3.84.
383,710 audio samples totaling 997 hours of Polish speech, derived from the public domain Wolne Lektury digital library. The dataset features recordings from 1,207 unique professional narrators, with a split of 294,756 male and 88,945 female samples. It was created by datadriven-company and last updated on Hugging Face in February 2026.
UTS provides a unified tag vocabulary bridging speech, music, and environmental sounds derived from high-fidelity audio captions. It was created by AudenAI using Qwen3-Omni-Captioner and Qwen2.5-7B-Instruct models on a subset of CaptionStew. The dataset was last updated on March 11, 2026.
Between 10,000 and 100,000 cleaned audio clips derived from the Mozilla Common Voice 24.0 Chinese (Taiwan) corpus. Released by OKHand in early 2026, the dataset provides 'Voice Seeds' processed through Silero-VAD to remove silence and environmental noise for generative speech tasks.
A parallel speech corpus contains audio recordings paired with text transcripts for the Shewa dialect of Amharic. The dataset, created by leyu-amharic, is designed for speech technology research and was last updated in February 2026.
A text-to-speech model named Qwen3-TTS-Model is available on Kaggle. The dataset likely contains model weights, configuration files, or audio samples for speech synthesis. No further details on size, author, or specific contents are provided in the available metadata.
Product 1000 ASR TTS is a dataset hosted on Kaggle. The title suggests it contains data related to automatic speech recognition and text-to-speech synthesis. The dataset's specific content, size, and origin are not detailed in the provided metadata.
A dataset for text-to-speech synthesis, likely containing audio samples and corresponding text transcripts. It appears to focus on code-switching, where speakers alternate between two or more languages. The dataset is hosted on Kaggle, but its specific size, creation date, and author are unknown.
Kaggle hosts a dataset titled 'Disjoint Code Switch Entity TTS'. The title suggests it contains data for text-to-speech synthesis, likely involving code-switching between languages and named entities. The dataset's author, organization, and specific contents are unknown from the provided metadata.
MMedFD is a benchmark dataset for multi-turn, full-duplex automatic speech recognition in real-world healthcare settings. The dataset was created by HanselZz and was last updated on Hugging Face in February 2026. Full access requires internal approval and a research-only data use agreement.
CAMMDLS is an academic dataset of metal music lyrics and subgenres. The dataset was sourced from Kaggle, but the author, organization, and last update date are unknown. The description indicates a focus on academic analysis of lyrics and subgenre classification.