DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.


Speech & Audio Datasets | DataSalon


🎤

Speech & Audio

Speech recognition, text-to-speech, speaker identification, music classification, audio event detection

2,602 datasets

Speech & Audio

vi_asr_dataset: Vietnamese Speech Recognition Audio

vi_asr_dataset is a dataset for Vietnamese automatic speech recognition, published on Kaggle. The dataset likely contains audio files and corresponding transcriptions. Its specific size, collection method, and authorship are currently unknown.

Audio, Vietnamese, Speech Recognition, +1


UK Live Music Industry Glossary with 125 Terms and Cross-References

125 terms define the vocabulary of the UK live music sector. The glossary includes 348 cross-references linking related concepts and 145 question-answer pairs for clarification. It originates from the GigXchange platform.

Text, Audio, Music Industry, Glossary, Live Events, Terminology, +1


Wenetspeech Wu Bench: Wu Dialect Speech Processing Benchmark

Wu dialect speech data provides a manually curated benchmark for multiple speech processing tasks. It includes 9.75 hours of Wu dialect ASR data, covering Shanghainese, Suzhounese, and Mandarin code-mixed speech. The benchmark was created by ASLP-lab and updated in February 2026.

Audio, Multilingual, Benchmark, Speech Processing, Speech Recognition, Wu Dialect, +1


Nord-Parl-TTS: Finnish and Swedish Speech Synthesis Data from Parliament Recordings

Nord-Parl-TTS provides 900 hours of Finnish and 5,090 hours of Swedish speech data extracted from recordings of Nordic parliamentary proceedings. The dataset was created by Aalto-Speech-Synthesis and announced as accepted for presentation at ICASSP 2026. The dataset page was last updated on February 18, 2026.

Text, Audio, Text To Speech, Speech Synthesis, Parliamentary Speech, Swedish, Finnish, +1


TTS_BETE: Text-to-Speech Audio Samples

TTS_BETE is a dataset hosted on Kaggle. Its title suggests it contains data related to text-to-speech synthesis. The dataset's specific content, size, and origin are not detailed in the available metadata.

Audio, Text To Speech, Speech Synthesis, Audio Generation, +1


Test Music Dataset for Audio Analysis

test_music is a dataset hosted on Kaggle. Its specific content and structure are not detailed in the provided metadata. The dataset's scale, origin, and creation date are unknown.

Audio, Test Data, +1


Spectrogram Images for Multi-class Audio Classification

Spectrogram images generated from audio clips for training machine learning models. The dataset's author, organization, and specific scale are unknown. It was sourced from the Kaggle platform.

Image, Audio, Machine Learning, Audio Classification, Spectrogram, Audio Processing, Synthetic, +1


Amharic Wello Dialect Parallel Speech Corpus

Leyu Amharic provides a parallel corpus of audio recordings paired with text transcripts for the Wello dialect of Amharic. The dataset is curated by leyu-amharic to capture dialect-specific phonetic and prosodic variations. It was last updated in February 2026.

Text, Audio, Text To Speech, Amharic Language, Speech Corpus, Dialectology, Natural Language Processing, Speech Recognition, +1


Nixonland: Historical Text on US Political Fracturing in the 1960s

Rick Perlstein's book NIXONLAND analyzes the political and social divisions in the United States during the 1960s. The text begins with the 1965 Watts riots and traces the political resurgence of Richard Nixon, culminating in his 1972 landslide re-election. The work is described as a bestseller and is published under a closed license.

Text, History, Historical Analysis, Voting, US Politics, Law, Victory, Political History, Resentment, Social Division, Economic History, Political Science, Politics, +1


Heldout Product Hard Negative TTS: Speech Synthesis Evaluation Samples

A dataset titled 'Heldout Product Hard Negative TTS' is hosted on Kaggle. The dataset's content likely relates to text-to-speech (TTS) evaluation, specifically containing 'hard negative' audio samples for a held-out product set. Metadata is minimal; the exact number of samples, audio characteristics, and creation details are unknown.

Audio, Text To Speech, Audio Data, Speech Synthesis, +1


TTS-Romanian: 720 Hours of Speech from 456 Speakers

TTS-Romanian contains 267,410 speech samples totaling 720 hours of audio derived from Romanian audiobooks. Released by datadriven-company and updated in early 2026, the collection features 456 unique speakers with an average DNSMOS quality score of 3.84.

Audio, Parquet, Text To Speech, Task Categories: text-to-speech, Library: polars, Library: dask, Modality: audio, Modality: text, Size Categories: 100K<n<1M, Library: mlcroissant, Library: datasets, License: cc-by-4.0, Audiobooks, Romanian, Region: us, Task Categories: automatic-speech-recognition, Speech, Language: ro, +1


WolneLektury-TTS-Polish: 997 Hours of Polish Audiobook Speech

The dataset contains 383,710 audio samples totaling 997 hours of Polish speech, derived from the public domain Wolne Lektury digital library. It features recordings from 1,207 unique professional narrators, with a split of 294,756 male and 88,945 female samples. It was created by datadriven-company and last updated on Hugging Face in February 2026.

Audio, Speech Synthesis, Audiobooks, Large Scale, Polish Language, Automatic Speech Recognition, +1


Unified Tag System For Speech Music And Environmental Sounds

UTS provides a unified tag vocabulary bridging speech, music, and environmental sounds derived from high-fidelity audio captions. It was created by AudenAI using Qwen3-Omni-Captioner and Qwen2.5-7B-Instruct models on a subset of CaptionStew. The dataset was last updated on March 11, 2026.

Audio, Multimodal, JSON, Library: polars, Speech Music Sounds, Modality: audio, Language: en, Audio Captions, Modality: text, Size Categories: 100K<n<1M, Library: mlcroissant, Audio Tagging, Task Categories: audio-classification, Library: datasets, Library: pandas, Unified Labeling, Region: us, Large Scale, Unified Tag System, arXiv: 2511.16757, Audio Captioning, License: mit, Synthetic, +1


Clean Common Voice 24.0: 10K-100K Taiwan Chinese Voice Seeds

This dataset contains between 10,000 and 100,000 cleaned audio clips derived from the Mozilla Common Voice 24.0 Chinese (Taiwan) corpus. Released by OKHand in early 2026, it provides 'Voice Seeds' processed through Silero-VAD to remove silence and environmental noise for generative speech tasks.

Audio, OPTIMIZED-PARQUET, Parquet, Size Categories: 10K<n<100K, Common Voice, Text To Speech, Library: polars, Language: zh, Library: dask, Modality: audio, License: cc0-1.0, Voice Cloning, Modality: text, Library: mlcroissant, Library: datasets, Region: us, Speech, +1


Amharic Shewa Dialect Parallel Speech Corpus

A parallel speech corpus contains audio recordings paired with text transcripts for the Shewa dialect of Amharic. The dataset, created by leyu-amharic, is designed for speech technology research and was last updated in February 2026.

Text, Audio, Text To Speech, Amharic Language, Speech Corpus, Dialectology, Natural Language Processing, Speech Recognition, +1


Qwen3-TTS-Model: Text-to-Speech AI Model

A text-to-speech model named Qwen3-TTS-Model is available on Kaggle. The dataset likely contains model weights, configuration files, or audio samples for speech synthesis. No further details on size, author, or specific contents are provided in the available metadata.

Audio, AI Model, Text To Speech, Speech Synthesis, +1


Product 1000 ASR TTS: Speech Recognition and Synthesis Data

Product 1000 ASR TTS is a dataset hosted on Kaggle. The title suggests it contains data related to automatic speech recognition and text-to-speech synthesis. The dataset's specific content, size, and origin are not detailed in the provided metadata.

Audio, Text To Speech, Speech Recognition, +1


Disjoint Code Switch Entity TTS: Speech Synthesis for Mixed-Language Text

Kaggle hosts a dataset titled 'Disjoint Code Switch Entity TTS'. The title suggests it contains data for text-to-speech synthesis, likely involving code-switching between languages and named entities. The dataset's author, organization, and specific contents are unknown from the provided metadata.

Text, Audio, Text To Speech, Speech Synthesis, Code Switching, Entity TTS, +1


Targeted Code Switch Entity TTS: Speech Synthesis for Mixed-Language Text

A dataset for text-to-speech synthesis, likely containing audio samples and corresponding text transcripts. It appears to focus on code-switching, where speakers alternate between two or more languages. The dataset is hosted on Kaggle, but its specific size, creation date, and author are unknown.

Text, Audio, Text To Speech, Speech Synthesis, Code Switching, Entity TTS, +1


MMedFD: A Real-World Healthcare Benchmark for Multi-Turn Full-Duplex ASR

MMedFD is a benchmark dataset for multi-turn, full-duplex automatic speech recognition in real-world healthcare settings. The dataset was created by HanselZz and was last updated on Hugging Face in February 2026. Full access requires internal approval and a research-only data use agreement.

Audio, Multimodal, Multi Turn, Medical Speech, Benchmark, Full Duplex, Healthcare Benchmark, Automatic Speech Recognition, +1
