DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.

Speech & Audio Datasets | DataSalon

🎤

Speech & Audio

Speech recognition, text-to-speech, speaker identification, music classification, audio event detection

2,602 datasets

Nursing Handover Speech Recordings with Multiple English Accents

A multi-speaker clinical speech corpus containing nursing handover statements. It is designed for research in Automatic Speech Recognition and speech-driven clinical documentation, featuring speakers with different English accents.

Tags: Audio, AUDIOFOLDER, Language: en, Size Categories: n<1K, Library: mlcroissant, Accent, Library: datasets, Healthcare, Clinical, Region: us, Task Categories: Automatic Speech Recognition, Speech Recognition

Persian Punctuation Restoration Dataset with 17 Million Samples

PersianPunc is a large-scale dataset for Persian punctuation restoration, containing 17 million token-level sequence labeling samples aggregated from 6 source corpora. It was created by MohammadJRanjbar and accepted at the EACL 2026 SilkRoad NLP Workshop.

Tags: Parquet, Token Classification, Library: polars, Sequence Labeling, Punctuation, arXiv: 2603.05314, Persian, Modality: text, Size Categories: 100K<n<1M, Library: mlcroissant, ParsBERT, Library: datasets, Library: pandas, License: CC BY 4.0, Language: fa, Region: us, Natural Language Processing, Farsi, Punctuation Restoration, Task Categories: Token Classification

Taiwanese Hokkien Seed Text: 3 Million Sentences for Speech Synthesis and Recognition

tw-hokkien-seed-text is a dataset of approximately 3 million full-character Taiwanese Hokkien sentences designed for training text-to-speech (TTS) and automatic speech recognition (ASR) models. The dataset was created by lianghsun and was last updated on March 20, 2026. Each sentence is 50–80 characters long, corresponding to a speech duration of 10–15 seconds, and is written exclusively in Chinese characters to preserve authentic Taiwanese Hokkien vocabulary and syntax.

Tags: Text, Parquet, Text To Speech, Task Categories: Text To Speech, Library: polars, Language: zh, Library: dask, Size Categories: 1M<n<10M, Speech Synthesis, Modality: text, Taiwanese, Library: mlcroissant, Taiwanese Hokkien, Library: datasets, License: CC BY 4.0, Language: nan, Region: us, Task Categories: Automatic Speech Recognition, Min Nan, Speech Recognition, Automatic Speech Recognition, Text Corpus, Hokkien

IndicTTS-p2: Indic Language Text-to-Speech Data

IndicTTS-p2 is a dataset for text-to-speech synthesis, likely containing audio recordings and corresponding text transcripts. It is hosted on Kaggle, but the author, organization, and specific data characteristics are not provided. The dataset's size, format, and exact language coverage are unknown from the available metadata.

Tags: Audio, Text To Speech, Speech Synthesis, Indic Languages

Genshin Matcha TTS: Text-to-Speech Audio Samples

Genshin Matcha TTS is a dataset hosted on Kaggle. The title suggests it contains audio data for text-to-speech synthesis, likely related to 'Genshin'. No further metadata on size, source, or creation date is available.

Tags: Audio, Text To Speech, Speech Synthesis, Voice Generation

Abjad-Kids: Arabic Speech Recordings for Primary Education

Abjad-Kids is an Arabic speech classification dataset containing spoken recordings of the Arabic alphabet, numbers, and colors from multiple child speakers. It supports research in automatic speech recognition and educational technology for Arabic-speaking children. The dataset was created by Aziz-snoubra and was last updated on March 14, 2026.

Tags: Audio, AUDIOFOLDER, Size Categories: 10K<n<100K, Speech Classification, Language: ar, Modality: audio, Arabic Language, Library: mlcroissant, Task Categories: Audio Classification, Numbers, Library: datasets, Primary Education, Region: us, Task Categories: Automatic Speech Recognition, Alphabet, Audio Recognition, Colors, Speech Recognition, License: MIT

Trump Voice Audio Samples for TTS Fine-Tuning

A speech dataset intended as an example for a text-to-speech fine-tuning platform. It contains audio files with associated transcripts and speaker identifiers; missing transcripts were generated automatically by the Whisper-large v3 model. The dataset was created by mgrei and was last updated on April 12, 2026.

Tags: Audio, Text To Speech, Speech Synthesis, Voice Cloning, Audio Processing, Synthetic

Covered California Health Plan Enrollment by Primary Spoken Language

13 primary spoken languages, including English, Spanish, Mandarin, and Hmong, are tracked for individuals who enrolled in a Covered California Qualified Health Plan. The data originates from the California Healthcare Eligibility, Enrollment and Retention System (CalHEERS) and is part of public reporting requirements. Enrollment counts are reported by period for individuals who paid their first premium.

Tags: Tabular, Multilingual, ZIP, CSV, Health Insurance, Enrollment Statistics, Healthcare, United States, Public Health

SaaS Corporate Voice Dataset Sample for Voice Cloning

A sample dataset of high-fidelity, ethically sourced conversational audio data. The description indicates it is intended for voice cloning applications. The dataset's size, specific source, and temporal coverage are unknown.

Tags: Audio, Audio Data, Voice Cloning, SaaS, Corporate Speech

Massachusetts Corporate Accounting Practices, 1870-1895

Between 1875 and 1895, the prevalence of double-entry bookkeeping among Massachusetts corporations surged from 60% to over 96%. This dataset supports a quantitative analysis of accounting innovation, tracking the adoption of depreciation and its correlation with firm survival. It includes balance statement data for corporations and supplementary citation counts from the Accountants' Index.

Tags: Tabular, Time Series, Massachusetts, Corporate Finance, Accounting History, Historical Data, Business Practices

fdasrvf: Elastic Functional Data Analysis for Phase and Amplitude Separation

James D. Tucker's fdasrvf package implements the square-root velocity framework for elastic functional data analysis. The method, based on research by Srivastava et al. (2011) and Tucker et al. (2014), performs alignment, PCA, and modeling of multidimensional and unidimensional functions. It is sourced from the paperswithcode platform.

Tags: Time Series, Elastic Alignment, Computer Science, Square Root Velocity, Statistical Modeling

Music Genre Audio Dataset with 16 Style Labels from NetEase

Approximately 1,700 musical pieces in MP3 format, sourced from NetEase music. The audio clips are 270 to 300 seconds long and sampled at 22,050 Hz. The dataset was created by ccmusic-database and last updated on 2026-02-27.

Tags: Audio, Music Information Retrieval, Audio Classification, Music Genre

Bangla Regional Dialects Speech Dataset

A speech dataset covering multiple regional dialects of the Bangla language, intended for automatic speech recognition tasks. The dataset is hosted on Kaggle, but its size, collection method, and creator are unspecified. Its primary focus is on capturing linguistic diversity across Bengali-speaking regions.

Tags: Audio, Audio Data, Bengali Language, Regional Dialects, Speech Recognition

Emolia-HQ: High-Quality Speaker-Paired Audio for Voice Synthesis

A high-quality, speaker-paired subset of the LAION Emolia dataset, created by TTS-AGI and last updated on March 9, 2026. Each sample includes a target and a reference utterance from the same speaker, filtered for quality using a DNSMOS score threshold of 3.0.

Tags: Audio, WEBDATASET, Text To Speech, Task Categories: Text To Speech, Language: zh, Language: en, Library: webdataset, Size Categories: 10M<n<100M, Speech Synthesis, Modality: text, Library: mlcroissant, Task Categories: Audio Classification, Library: datasets, License: CC BY 4.0, Emotion Recognition, Voice Conversion, Speaker Identity, Audio Pairs, Language: ko, Region: us, Emotion, Language: fr, Language: ja, Language: de

CMI Pref Pseudo: Multimodal Music Generation Preference Comparisons

CMI Pref Pseudo contains 56,000 music generations from 23 models and 165,000 pairwise comparisons for preference modeling research. The dataset was created by HaiwenXia and last updated on March 3, 2026. Prompts are compositional, including text, optional lyrics, and reference audio.

Tags: Audio, Multimodal, AI Evaluation, Multimodal Data, Music Generation, Benchmark, Preference Modeling

IMSLP MIDI Files and Metadata Crawled in July 2024

IMSLP MIDI Dataset contains MIDI files and associated metadata crawled from the International Music Score Library Project on July 21-22, 2024. The dataset includes fields such as composer, year, era, style, key, and license, along with raw MIDI bytes and serialized mido objects. It was created by TiMauzi and is available under a CC-BY-SA-4.0 license.

Tags: Tabular, Audio, Classical Music, Music Scores, MIDI

Taiwanese Hokkien Synthetic Speech Audio Dataset

722 seed utterances and 32,506 Common Voice samples were used to generate this Taiwanese Hokkien (Min Nan) speech dataset via the CosyVoice3 model. The dataset includes audio files, corresponding text, and speaker metadata. It was created by lianghsun and last updated on March 19, 2026.

Tags: Tabular, Audio, Multilingual, Parquet, Text To Speech, Task Categories: Text To Speech, Library: polars, Language: zh, Modality: audio, Size Categories: n<1K, Modality: text, Taiwanese, Library: mlcroissant, Library: datasets, Library: pandas, License: CC BY 4.0, Language: nan, Region: us, Hokkien, Audio Synthesis

WeWe Pidgin TTS Dataset

WeWe Pidgin TTS Dataset is a speech synthesis dataset published on Kaggle. The dataset likely contains audio recordings and corresponding text transcriptions for text-to-speech applications. Its specific size, creation details, and update history are not provided in the available metadata.

Tags: Audio, Text To Speech, Speech Synthesis, Pidgin, Low Resource Language

ToneWebinars Balalaika: 248 Hours of Annotated Russian Podcast Speech

ToneWebinars Balalaika is a 248.9-hour Russian speech corpus curated from podcasts by the MTUCI lab260 team. Released in early 2026, the dataset was processed using the BALALAIKA pipeline to provide high-quality audio for generative speech tasks. It serves as a refined version of the original ToneWebinars source, specifically filtered for speech synthesis and recognition.

Tags: Parquet, Size Categories: 10K<n<100K, Task Categories: Text To Speech, Library: polars, Modality: text, arXiv: 2507.13563, Modality: tabular, Library: mlcroissant, Library: datasets, Library: pandas, Region: us, Task Categories: Automatic Speech Recognition, Language: ru, License: Apache 2.0

TWB Voice Kanuri TTS 1.0 Sample Set: High-Quality Read Speech

TWB Voice Kanuri TTS 1.0 Sample Set is a high-quality text-to-speech corpus containing read speech data in Kanuri. It was recorded by a single female speaker under acoustically optimal conditions and represents 10% of the complete dataset collected by CLEAR Global (formerly Translators without Borders). The dataset page was last updated on 2026-02-23.

Tags: Audio, Text To Speech, Speech Synthesis, Speech Corpus, Natural Language Processing, Kanuri Language
