Speech recognition, text-to-speech, speaker identification, music classification, audio event detection
1,909 datasets
A filtered dataset for automatic speech recognition (ASR) created by OpenSpeechHub. The dataset has been cleaned by removing samples with fewer than three words, repetitive tokens, or chat token leaks. It was last updated on March 31, 2026.
MTUCI's lab260 team released this Russian speech corpus in early 2026, containing between 100,000 and 1,000,000 records. The dataset consists of audiobook recordings filtered and annotated using the BALALAIKA pipeline to support advanced generative speech tasks.
Weak class augmentation part 3 focuses on music genres and silence noise. The dataset appears to be part of a larger series for sleep visualization or analysis. Its specific scale and creation details are not provided.
SleepViz V12 Batch 10-6 is an audio dataset focused on environmental sounds and fan noise. The dataset appears to be part of a series for weak class augmentation, suggesting its use in machine learning tasks. Its author, organization, and specific scale are unknown.
Between 100,000 and 1,000,000 Uzbek-language audio segments with transcriptions, sourced from YouTube by openbank-uz in early 2026. The collection uses vocal isolation to separate speakers and Google's Gemini 2.0 Flash model for automated transcription.
CommonVoice is a dataset hosted on Kaggle. The title suggests it is a speech and audio dataset, likely containing voice recordings. The specific content, size, and collection details are not provided in the available metadata.
The Chinese Musical Instruments Timbre Evaluation Database contains subjective timbre evaluation scores for 37 Chinese and 24 Western instruments. The data was collected from Chinese participants with musical backgrounds in a subjective evaluation experiment using 16 descriptive terms. The dataset also includes 10 spectrogram analysis reports.
A Kaggle dataset titled 'codemix_tts' likely contains audio data for text-to-speech synthesis. The dataset's specific content, such as the number of audio samples or languages covered, is not detailed in the provided metadata. It is hosted on the Kaggle platform, but the author, organization, and last update date are unknown.
Between 10,000 and 100,000 synthetic Turkish audio-text pairs across 13 specialized domains were generated by Anilosan15 and updated in March 2026. The data includes synthesized speech for sectors such as finance, healthcare, and technical support, created using a high-quality TTS model.
A dataset by Ganaa0614, last updated on April 14, 2026. The title suggests it contains Mongolian speech audio and corresponding text translations, likely derived from the Common Voice project. The specific volume of audio clips and translated sentences is unknown.
United States historical data on the partisan composition of state legislatures and the party affiliation of governors from 1834 to 1985. The collection provides annual and biennial records for each legislature. Data from 1834-1868 were collected by W. Dean Burnham of MIT, with subsequent years added by ICPSR staff.
Time-series question-answering evaluation data for ChatTS, sourced from Kaggle. The dataset's author, organization, size, and last update date are unspecified.
Sentence classification datasets containing Automatic Speech Recognition (ASR) errors, hosted on AWS Open Data. The data is provided by Amazon and is associated with a research project on ASR error robustness. The license details are available via a linked GitHub repository.
Standard Moroccan Amazigh audio recordings and text transcripts totaling fewer than 1,000 records, created by abdelhaqueidali and updated in March 2026. The dataset provides raw, unprocessed speech data for the development of Automatic Speech Recognition and Text-to-Speech models.
The customer_service_persian_diarization_dataset is a synthetic multi-speaker speech dataset designed for training and evaluating speaker diarization models in Persian (Farsi). It contains approximately 80 hours of audio, built using utterances from a customer service dataset and processed through a synthesis framework to simulate realistic conversational dynamics. The dataset was created by atiyehghm and was last updated on the platform in February 2026.
A sample from the Silencio corpus, which contains over 100,000 hours of speech data. The full dataset is collected from a community of over 2 million contributors across more than 180 countries and 100 languages.
A collection of audio samples likely generated by a text-to-speech model named Qwen3, potentially for automotive voice interface applications. The dataset is published on Kaggle, but its specific size, creation date, and author are unknown. The content appears to focus on synthesized speech, possibly for testing or training voice systems.
Car-Voice-Qwen3-TTS-Models is a collection of text-to-speech models likely designed for automotive voice interfaces. The dataset is hosted on Kaggle, but its specific contents, scale, and creation details are not provided in the available metadata. Further verification is required to determine the exact model architectures, audio samples, and performance characteristics included.
OpenSpeechHub provides a filtered dataset for automatic speech recognition. The dataset has been processed to remove samples with fewer than three words, repetitive tokens, or chat token leaks. It was last updated on March 31, 2026.
Kazattsd B1 B2 B3 is a speech audio dataset authored by 'issai' and published on the Hugging Face platform. The dataset's title suggests it contains Kazakh language audio recordings, potentially categorized by proficiency levels B1, B2, and B3. It was last updated on April 15, 2026, but specific details on size, format, and content are not provided in the metadata.