Speech recognition, text-to-speech, speaker identification, music classification, audio event detection
1,909 datasets
Vivoice Relabeled is a speech dataset derived from the original capleaf/viVoice collection. The dataset has been processed using the Qwen/Qwen3-ASR-1.7B model to update audio-text labels, retaining samples with a Word Error Rate below 15%. It was uploaded by author JayLL13 to Hugging Face in March 2026.
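As a rough illustration of the kind of filtering described for Vivoice Relabeled, the sketch below computes a word-level Word Error Rate and keeps only samples under a 15% threshold. It is not the dataset author's pipeline: the field names `text` and `asr_text`, the whitespace tokenization, and the helper names are all assumptions made for this example.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def keep_low_wer(samples, threshold=0.15):
    """Keep samples whose WER is strictly below the threshold.

    `samples` is a list of dicts with hypothetical keys `text`
    (existing label) and `asr_text` (ASR model output).
    """
    return [
        s for s in samples
        if word_error_rate(s["text"], s["asr_text"]) < threshold
    ]
```

With whitespace tokenization, one wrong word in a four-word reference already gives a WER of 25%, so short utterances are filtered out more aggressively than long ones under a fixed threshold.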
CMI-Pref provides between 1,000 and 10,000 human preference comparisons for multimodal music generation, published by HaiwenXia in 2026. Each record captures a human vote comparing two generated audio samples based on musicality, alignment, and confidence.
33,228 synthetic audio clips for Taiwanese Hokkien text-to-speech, generated using the Qwen3-TTS-1.7B-Base model with voice cloning. The dataset was created by lianghsun and last updated in March 2026.
2,700 pairs of text-to-speech audio renderings, each with 15 human preference annotations. Produced by datapointai and updated in March 2026, the dataset provides comparative naturalness ratings for audio generated from identical text prompts. The collection totals 40,500 individual human judgments to support high-confidence audio quality evaluation.
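One way to use 15 annotations per pair is to collapse them into a single preferred rendering plus an agreement score. The sketch below uses a simple majority vote, which is an assumption for illustration and not a documented part of the dataset. The vote labels `"A"` and `"B"` and the function name are hypothetical.

```python
from collections import Counter


def aggregate_votes(votes):
    """Return (winning label, share of votes it received).

    `votes` is a list of labels such as "A" or "B", one per annotator.
    With an odd number of annotators and two options there is no tie.
    """
    if not votes:
        raise ValueError("votes must not be empty")
    counts = Counter(votes)
    winner, top_count = counts.most_common(1)[0]
    return winner, top_count / len(votes)


# Sanity check on the totals stated in the description:
# 2,700 pairs with 15 annotations each.
total_judgments = 2_700 * 15
```

The agreement share can then be used to drop or down-weight pairs where annotators were close to evenly split.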
NileTTS provides 38.1 hours of transcribed Egyptian Arabic speech across 9,521 utterances, published by KickItLikeShika in February 2026. The collection is segmented into specific domains, including over 21 hours dedicated to sales and customer service interactions.
TTS Human Preferences (Medium) is a dataset for text-to-speech audio quality evaluation. It contains 2,000 rows, each with two TTS audio renderings and 15 human preference annotations, totaling 30,000 annotations. The dataset was created by datapointai and last updated in March 2026.
Ttsmodels is a dataset published on Hugging Face by author phongluong197. The dataset was last updated on 2026-05-07 07:59:33. Its specific content and scale are not detailed in the available metadata.
A multi-speaker clinical speech corpus containing nursing handover statements. It is designed for research in Automatic Speech Recognition and speech-driven clinical documentation, featuring speakers with different English accents.
PersianPunc is a large-scale dataset for Persian punctuation restoration, containing 17 million token-level sequence labeling samples aggregated from 6 source corpora. It was created by MohammadJRanjbar and accepted at the EACL 2026 SilkRoad NLP Workshop.
tw-hokkien-seed-text is a dataset of approximately 3 million full-character Taiwanese Hokkien sentences designed for training text-to-speech (TTS) and automatic speech recognition (ASR) models. The dataset was created by lianghsun and was last updated on March 20, 2026. Each sentence is 50–80 characters long, corresponding to a speech duration of 10–15 seconds, and is written exclusively in Chinese characters to preserve authentic Taiwanese Hokkien vocabulary and syntax.
Abjad-Kids is an Arabic speech classification dataset containing spoken recordings of the Arabic alphabet, numbers, and colors from multiple child speakers. It supports research in automatic speech recognition and educational technology for Arabic-speaking children. The dataset was created by Aziz-snoubra and was last updated on March 14, 2026.
A speech dataset intended as an example for training a text-to-speech fine-tuning platform. It contains audio files with associated transcripts and speaker identifiers, with missing transcripts generated automatically by the Whisper-large v3 model. The dataset was created by mgrei and was last updated on April 12, 2026.
13 primary spoken languages, including English, Spanish, Mandarin, and Hmong, are tracked for individuals who enrolled in a Covered California Qualified Health Plan. The data originates from the California Healthcare Eligibility, Enrollment and Retention System (CalHEERS) and is part of public reporting requirements. Enrollment counts are reported by period for individuals who paid their first premium.
A sample dataset of high-fidelity, ethically sourced conversational audio data. The description indicates it is intended for voice cloning applications. The dataset's size, specific source, and temporal coverage are unknown.
James D. Tucker's fdasrvf package implements the square-root velocity framework for elastic functional data analysis. The method, based on research by Srivastava et al. (2011) and Tucker et al. (2014), performs alignment, PCA, and modeling of multidimensional and unidimensional functions. It is sourced from the paperswithcode platform.
Approximately 1,700 musical pieces in MP3 format, sourced from NetEase music. The audio clips are 270 to 300 seconds long and sampled at 22,050 Hz. The dataset was created by ccmusic-database and last updated on 2026-02-27.
A speech dataset covering multiple regional dialects of the Bangla language, intended for automatic speech recognition tasks. The dataset is hosted on Kaggle, but details on its size, collection method, and creator are unspecified. Its primary focus is on capturing linguistic diversity within the Bengali-speaking regions.
A high-quality, speaker-paired subset of the LAION Emolia dataset, created by TTS-AGI and last updated on March 9, 2026. Each sample includes a target and a reference utterance from the same speaker, filtered for quality using a DNSMOS score threshold of 3.0.
ToneWebinars Balalaika is a 248.9-hour Russian speech corpus curated from podcasts by the MTUCI lab260 team. Released in early 2026, the dataset was processed using the BALALAIKA pipeline to provide high-quality audio for generative speech tasks. It serves as a refined version of the original ToneWebinars source, specifically filtered for speech synthesis and recognition.
CMI Pref Pseudo contains 56,000 music generations from 23 models and 165,000 pairwise comparisons for preference modeling research. The dataset was created by HaiwenXia and last updated on March 3, 2026. Prompts are compositional, including text, optional lyrics, and reference audio.