Speech recognition, text-to-speech, speaker identification, music classification, audio event detection
1,926 datasets
A Kaggle-hosted dataset for classifying sounds from musical instruments. The dataset likely contains audio samples or features for various instrument types. Metadata is minimal; actual content requires verification after download.
A corpus for Urdu text-to-speech (TTS) applications, published on Kaggle. The dataset likely contains audio recordings paired with corresponding text transcripts. Specific details on size, collection method, and contributors are not provided in the minimal metadata.
Hindi-language audio data intended for training Automatic Speech Recognition (ASR) systems. The dataset is hosted on Kaggle, but its specific size, collection method, and creator are not detailed in the provided metadata. The content likely contains speech recordings and corresponding transcriptions.
Anime TTS 3 is a dataset for text-to-speech and voice generation tasks, likely containing audio samples or related features. It is published on Kaggle, but the author, organization, and specific creation date are unknown. The dataset's exact size, format, and content require verification after download.
84,641 audio samples totaling 66 hours and 51 minutes of speech data, averaging about 2.84 seconds per sample (240,660 seconds across 84,641 samples). The dataset was curated by PapaRazi from various online sources for research and non-commercial purposes. It was last updated on July 15, 2025.
A dataset titled 'UA_ASR' hosted on Kaggle, likely containing audio recordings and transcriptions for Ukrainian speech. The dataset's specific size, origin, and update history are not detailed in the provided metadata. Its content and structure require verification after download.
Between 10,000 and 100,000 audio recordings with transcriptions for Hokkien speech recognition, published by adi-gov-tw in late 2024. The dataset is organized into training and test subsets in the WebDataset format to facilitate high-throughput training in PyTorch and Hugging Face environments.
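As a rough illustration of what the WebDataset format mentioned above implies: a shard is an ordinary tar archive, and files that share a key (the name up to the first dot) are treated as the fields of one sample. The sketch below uses only the Python standard library and builds a tiny in-memory shard. The file names, the `wav`/`txt` extensions, and the placeholder bytes are assumptions for demonstration; the actual layout of this dataset requires verification after download.

```python
import io
import tarfile
from collections import defaultdict


def build_demo_shard() -> bytes:
    """Create an in-memory tar shard holding two hypothetical samples."""
    files = {
        "sample_0001.wav": b"placeholder-audio-1",  # not real audio data
        "sample_0001.txt": "transcript one".encode("utf-8"),
        "sample_0002.wav": b"placeholder-audio-2",
        "sample_0002.txt": "transcript two".encode("utf-8"),
    }
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w") as tar:
        for name, payload in files.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()


def read_samples(shard_bytes: bytes) -> dict:
    """Group tar members into samples keyed by the name before the first dot."""
    samples = defaultdict(dict)
    with tarfile.open(fileobj=io.BytesIO(shard_bytes), mode="r") as tar:
        for member in tar:
            if not member.isfile():
                continue
            # Simplification: assumes no dots in directory components.
            key, _, extension = member.name.partition(".")
            samples[key][extension] = tar.extractfile(member).read()
    return dict(samples)


samples = read_samples(build_demo_shard())
for key, fields in sorted(samples.items()):
    print(key, sorted(fields), fields["txt"].decode("utf-8"))
```

In practice a dedicated loader would typically replace `read_samples`; the point here is only that each audio file and its transcript travel together inside the same archive.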
The dataset likely contains historical records related to music, denazification, and American policy in Germany from 1945 to 1953. It is published on Papers with Code. The specific content, format, and scale are unknown.
A corpus of Ukrainian audio data with associated labels, likely for automatic speech recognition (ASR) tasks. The dataset is hosted on Kaggle, but its specific size, origin, and creation date are not provided in the available metadata. Columns suggest it contains audio files and corresponding text transcriptions.
Over 100,000 audio clips, likely up to 1 million, form this collection focused on Arabic speech and audio. The dataset was created by MohamedGomaa30 and was last updated in February 2026. It includes a text modality, suggesting paired audio and transcript data.
A dataset titled 'vibeasr' published on Kaggle. The title suggests it contains audio data likely intended for Automatic Speech Recognition (ASR) tasks. The dataset's specific content, size, and origin require verification after download.
Over 114 hours of high-quality Persian audio sampled at 44.1 kHz, released under the CC0 license. Collected from Nasl-e-Mana magazine, the dataset covers a diverse range of topics. It was created by MahtaFetrat and last updated on July 12, 2025.
Audio and text data for pre-training Automatic Speech Recognition (ASR) models for Taiwanese Mandarin. The dataset is structured into training and test subsets in the WebDataset format for direct integration with PyTorch and Hugging Face tools. It is published by adi-gov-tw and was last updated in December 2025.
Rasa provides at least 20 hours of audio per speaker for expressive Text-to-Speech (TTS) across multiple Indian languages. Created by ai4bharat and funded by Bhashini (Ministry of Electronics and Information Technology, India), the collection targets male and female voices for high-quality synthesis. The dataset includes recordings for over 13 languages including Bengali, Malayalam, and Sanskrit.
205 human-recorded audio samples totaling 38 minutes and 43 seconds of speech focused on technical and developer vocabulary. The dataset, created by danielrosehill, is a work in progress with a target of 5 hours of audio and 50,000 words. It was last updated on November 26, 2025.
MusicMoveArr provides a collection of music metadata sourced from MusicBrainz, Tidal, Spotify, and Deezer as of March 2026. The repository aggregates data across multiple major streaming platforms to support music information retrieval (MIR) tasks. It serves as a cross-reference point for music entities across different commercial and open-source databases.
A voice dataset published on Kaggle. The dataset likely contains audio recordings for speech-related tasks. Specific details on size, format, and collection methodology are not provided in the metadata.
Nearly 1,000 hours of professionally cleaned Vietnamese audio form this large-scale corpus created by the Dolly AI Team. The dataset features 152 speakers from different regions of Vietnam, aiming to advance research in speech synthesis and recognition. It was last updated on Hugging Face in November 2025.
A collection of long Bengali audio recordings paired with transcriptions. The description mentions the inclusion of augmented audio data, suggesting techniques to increase dataset size or variability. The dataset is hosted on Kaggle, but details on its size, creation date, and author are unknown.
Malayalam Tts Pro Voice is a Malayalam speech dataset containing audio clips paired with text transcripts. The dataset is designed for training and fine-tuning ASR, TTS, and speech-to-speech translation systems. It was uploaded by author sachin6624 and last updated on December 16, 2025.