Speech recognition, text-to-speech, speaker identification, music classification, audio event detection
1,926 datasets
An annotated version of the LibriTTS-R corpus, which is a multi-speaker English speech dataset. The dataset provides approximately 960 hours of read English speech at a 24 kHz sampling rate, originally published in 2019. It was annotated by parler-tts and updated on the Hugging Face platform in August 2024.
Multitask-National-Speech-Corpus V1 is a multitask speech understanding dataset derived from IMDA's NSC Corpus. It focuses on Singapore's local accent, localized terms, and code-switching, and was created by MERaLiON. The dataset page was last updated on 2025-01-21.
AfriSpeech-Dialog v1 provides 6 hours of recorded dialogue designed for speech recognition and speaker diarization applications. The dataset was collected from diverse accents across Nigeria, Kenya, and South Africa and is authored by intronhealth. It was last updated on the Hugging Face platform in October 2024.
A cleaned version of a Wolof text-to-speech dataset by GalsenAI. The dataset features a female voice that has been denoised and enhanced using the Resemble Enhance library, with annotations cleaned of special characters, emojis, and non-Latin scripts. The author notes that some lines and audio clips judged to be of insufficient quality have been removed, and that annotation corrections are ongoing.
251.86 hours of Turkmen speech audio paired with transcriptions, comprising 119,847 clips. The dataset was created by mamed0v and is hosted on Hugging Face, with a last recorded update in November 2025. It is described as one of the largest publicly available resources for the Turkmen language.
Meta FAIR released the Omnilingual ASR Corpus in 2024, providing spontaneous speech recordings and transcriptions for 348 under-served languages. The collection was developed to facilitate the training of automatic speech recognition and spoken language identification models for low-resource linguistic contexts.
Filtered LibriTTS-R is a multi-speaker English speech corpus of approximately 585 hours of read speech at a 24 kHz sampling rate. It is a filtered version of the LibriTTS-R corpus that excludes samples flagged for failed speech restoration and speakers with detected multi-speaker issues. The dataset was created by parler-tts and last updated on August 6, 2024.
A collection of 849 hours of Saudi Arabic spontaneous speech audio covering multiple topics. All audio is manually transcribed into text, with annotations for speaker identity and gender. It was created by Nexdata and last updated in April 2025.
15 hours of Thai speech recordings from 490 native speakers, each contributing 50 sentences. The audio was recorded in a quiet environment and covers categories such as in-car scenes, smart homes, and speech assistants.
1,704 hours of speech data from 10,496 speakers across 22 Indian languages, derived from an automatic speech recognition dataset. Created by ai4bharat, this corpus was updated in March 2025 to support the development of text-to-speech models.
9.9 hours of Spanish speech recordings from 343 native speakers in Spain, Mexico, and Argentina. Each speaker contributed 50 sentences, recorded in a quiet environment using mainstream Android phones and iPhones. All audio has been manually transcribed with high accuracy.
Approximately 10.33 hours of high-quality Hindi speech recordings from the Indic TTS Database project, contributed by SPRINGLab. The dataset contains audio from both male and female speakers, with corresponding text transcriptions. It was last updated on November 5, 2024.
A collection of 347 hours of Italian speech audio recorded by 800 native speakers, with gender balance. Recordings were made in quiet environments and texts were manually transcribed with high accuracy.
114,036 preprocessed Indonesian speech samples totaling approximately 4GB of data. The dataset includes WAV audio recordings sampled at 16,000 Hz paired with corresponding text transcriptions.
360 hours of Indonesian speech data collected from 496 native speakers. Each speaker recorded approximately 400 sentences in a quiet environment, covering categories such as economics, entertainment, news, figures, letters, and oral content.
9.8 hours of Italian speech data recorded by 351 native speakers in quiet environments. Each speaker contributed 50 sentences, covering categories such as in-car scenes, smart home, and speech assistant interactions.
Over 1 million utterances from 6,112 celebrities were extracted from YouTube videos for speaker identification research. The dataset was created by Reverb and was last updated on the Hugging Face platform in August 2025.
A filtered collection of Uzbek speech recordings processed through voice activity detection, noise removal, and reading speed analysis. It excludes original Mozilla Common Voice files in favor of a refined subset validated via automatic speech-to-text (STT) models to ensure high-quality audio-text alignment.
A 357-hour sample of Korean speech audio recorded by 999 speakers in quiet environments. All audio has corresponding text transcripts created by professional annotators, with a stated sentence accuracy rate of 95%.
Persian-language speech data derived from Mozilla's Common Voice 17 dataset. The dataset has been refined to correct spelling inconsistencies and typographical errors present in the original collection. It was created by vhdm and last updated on June 11, 2025.
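Several listings above quote a total duration together with a clip count or a speakers-times-sentences figure, which is enough for a rough sanity check of average utterance length before downloading anything. The sketch below is illustrative only: the helper names `average_clip_seconds` and `utterance_count` are hypothetical, and the per-speaker estimates assume every speaker contributed exactly the stated 50 sentences with none removed, which the listings do not confirm.

```python
"""Back-of-the-envelope checks on figures quoted in the dataset listings."""


def average_clip_seconds(total_hours: float, clip_count: int) -> float:
    """Mean clip length in seconds, given total audio hours and a clip count."""
    if clip_count <= 0:
        raise ValueError("clip_count must be positive")
    return total_hours * 3600 / clip_count


def utterance_count(speakers: int, sentences_per_speaker: int) -> int:
    """Utterance total, ASSUMING every speaker recorded the same number of sentences."""
    return speakers * sentences_per_speaker


# Turkmen corpus: both the hours and the clip count are stated in the listing.
turkmen_avg = average_clip_seconds(251.86, 119_847)
print(f"Turkmen: about {turkmen_avg:.1f} s per clip")

# Prompted-speech sets: (hours, speakers, sentences per speaker) as listed.
# The utterance totals are estimates derived under the assumption above.
prompted = {
    "Thai": (15.0, 490, 50),
    "Spanish": (9.9, 343, 50),
    "Italian": (9.8, 351, 50),
}

for name, (hours, speakers, per_speaker) in prompted.items():
    n = utterance_count(speakers, per_speaker)
    avg = average_clip_seconds(hours, n)
    print(f"{name}: ~{n} utterances, about {avg:.1f} s each")
```

Short averages of a few seconds are consistent with prompted command-style recordings, while the longer Turkmen average suggests sentence-length clips. These are estimates from the listing text, not measured values.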