DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.


Speech & Audio Datasets | DataSalon

Speech & Audio

Speech recognition, text-to-speech, speaker identification, music classification, audio event detection

2,602 datasets

Tobydata TTS Dataset: Luganda Read Speech on Tailoring and Fashion

Tobydata TTS Dataset is a Luganda text-to-speech collection created by Bateesa. It contains read speech recordings, primarily on topics related to tailoring, fashion, and vocational training. The data was recorded via a mobile application.

Tags: Audio, Vocational Training, Speech Synthesis, Fashion, Luganda, Read Speech (+1 more)

CAMMDLS: Academic Metal Music Lyrics and Subgenres

CAMMDLS is an academic dataset of metal music lyrics and subgenres. The dataset was sourced from Kaggle, but the author, organization, and last update date are unknown. The description indicates a focus on academic analysis of lyrics and subgenre classification.

Tags: Text, Audio, Lyrics, Music Analysis, Subgenres (+1 more)

Toronto Emotional Speech Set (TESS): 2,800 Audio Samples of 7 Emotions

The Toronto Emotional Speech Set (TESS) contains 2,800 audio stimuli: 200 target words spoken in a carrier phrase by two actresses. The recordings cover seven emotions: anger, disgust, fear, happiness, pleasant surprise, sadness, and neutral. The set was created by M. Kathleen Pichora‐Fuller at the University of Toronto and is modeled on the Northwestern University Auditory Test No. 6.

Tags: Audio, Happiness, Audio Dataset, Anger, Set Abstract Data Type, Surprise, Speech Emotion, Computer Science, Affective Computing, Sadness, Psychology, Phrase, Linguistics, Social Psychology, Disgust (+1 more)

Open TTS Tracker: Open-Access Text-to-Speech Models

Open TTS Tracker is a dataset tracking open-access and open-source Text-to-Speech models. The dataset, created by Pendrokar, was last updated on February 20, 2026. It is hosted on Hugging Face and aims to be a central resource for awareness of these models.

Tags: Tabular, Audio, Text To Speech, Open Source Models, Speech Synthesis, Model Tracking (+1 more)

Aerial Images of Massachusetts Roads with Censored Regions Removed

Aerial imagery of roads in Massachusetts has been processed to remove censored regions from associated image masks. The dataset likely contains georeferenced images suitable for computer vision tasks. The author, organization, and specific collection details are unknown.

Tags: Image, Geospatial, Roads, Massachusetts, Satellite Imagery, Mask Cleaning, Aerial Photography (+1 more)

Multi-Texture Sheet Music Recognition Benchmark

SMB is a benchmark dataset of printed Common Western Modern Notation scores developed by the Pattern Recognition and Artificial Intelligence Group at the University of Alicante. It is designed for Optical Music Recognition and image segmentation tasks involving full-page music scores.

Tags: Text, Audio, IMAGEFOLDER, System Level, Task Categories: image To Text, Annotations Creators: manually Expert Generated, Size Categories: n<1K, Modality: text, Task Categories: text Retrieval, Library: mlcroissant, Modality: image, Library: datasets, Task Categories: image Segmentation, License: CC BY-NC 4.0, Full Page, Region: us (+1 more)

Tarifit PBC: 81 Phonetically Balanced Sentences for Riffian Berber TTS

81 sentences across three CSV files provide the first phonetically balanced corpus for Tarifit (Riffian Berber) text-to-speech training, created by jamalinu in 2026. The collection includes IPA transcriptions and a native-validated customer service subset specifically formatted for Coqui TTS.

Tags: Text To Speech, Task Categories: text To Speech, Customer Service, Berber, Tarifit, Phonetics, IPA, Amazigh, Region: us, Language: zgh, Low Resource, License: MIT (+1 more)

English Contact Center Audio with Transcripts, 1000+ Hours

AxonData's English Contact Center Audio Dataset provides over 1,000 hours of inbound and outbound telephone call audio paired with English transcripts. The data consists of real-world, non-synthetic conversations featuring diverse English accents. The dataset was last updated on February 13, 2026.

Tags: Text, Audio, Contact Center, Customer Support, Sentiment Analysis, Natural Language Processing, Speech Recognition, Synthetic (+1 more)

AVSpeech: Separated Video and Audio Streams from YouTube Clips

A restructured subset of the AVSpeech dataset provides separated video and audio streams. The dataset was created by ProgramComputer and was last updated on February 20, 2026. Each clip has a unique identifier derived from the original YouTube ID and timestamps.

Tags: Audio, Video, Multimodal, Audio Visual, Youtube Derived, Speech Recognition (+1 more)

Ellipse: Functions for Drawing Ellipses and Confidence Regions

Various routines for drawing ellipses and ellipse-like confidence regions, implementing plots from Murdoch and Chow (1996). The dataset also includes routines for profile plots described in Bates and Watts (1988). It was ported to R by Jesus M. Frias Celayeta.

Tags: Tabular, Ellipse, Mathematics, Geometry, Confidence Regions, Art (+1 more)

ergm: Tools for Exponential-Family Random Graph Model Analysis

An integrated set of tools for analyzing and simulating networks using exponential-family random graph models (ERGMs). The package is part of the Statnet suite for network analysis and is authored by Mark S. Handcock. It is described in peer-reviewed publications from the Journal of Statistical Software.

Tags: Graph, Mathematical Analysis, Econometrics, Computer Science, Mathematics, Exponential Random Graph Models, Natural Exponential Family, Theoretical Computer Science, Network Analysis, Physics, Statistics, Computational Sociology, Social Network, Exponential Family, Statistical Physics, Exponential Function (+1 more)

Music Emotion Multimodal Dataset with IoT and Physiological Signals

Music Emotion IoT Multimodal Dataset is a collection of data for analyzing emotional responses to music. It likely contains synchronized audio, physiological, and image features gathered from IoT devices. The dataset's author, organization, size, and update history are unknown.

Tags: Audio, Multimodal, Audio Features, Music Emotion, Computer Vision, Physiological Data, IoT (+1 more)

Librispeech Synth 300h: Synthetic Speech with Up to 3 Speakers

Librispeech Synth 300h is a synthetic speech dataset derived from the LibriSpeech corpus, containing up to 300 hours of audio. It is hosted on Kaggle and appears to be a processed version for speech synthesis tasks, likely containing audio generated by text-to-speech systems. The specific creator, generation method, and exact audio characteristics require verification after download.

Tags: Audio, Speech Synthesis, Audio Processing, Speech Recognition (+1 more)

Spc R Segmented: Diarized and Merged German Speech Segments

A processed speech dataset derived from i4ds/spc_r. Each row represents a merged speech segment from a single speaker, created by applying speaker diarization and merging consecutive segments from the same speaker. The dataset was created by i4ds and last updated on Hugging Face in February 2026.

Tags: Text, Audio, Time Series, Parquet, Library: polars, Source Datasets: i4ds/spc_r, Library: dask, German Language, Modality: text, Size Categories: 100K<n<1M, Library: mlcroissant, Library: datasets, License: CC BY 4.0, Speech Segmentation, Region: us, Task Categories: automatic Speech Recognition, Audio Processing, Language: de, Automatic Speech Recognition, Speech Diarization (+1 more)
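
The merging step described for this dataset (consecutive diarized segments from the same speaker become one row) can be illustrated with a minimal sketch. This is not the i4ds pipeline, whose details are not given in the listing; the `Segment` class, the `merge_consecutive` function, and the optional `max_gap` threshold are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Segment:
    """One diarized span (hypothetical structure, not the dataset's schema)."""
    speaker: str
    start: float  # seconds
    end: float    # seconds
    text: str = ""


def merge_consecutive(segments: List[Segment],
                      max_gap: Optional[float] = None) -> List[Segment]:
    """Merge runs of time-adjacent segments that share a speaker label.

    max_gap is an assumed extra rule: if set, two same-speaker segments are
    only merged when the silence between them is at most max_gap seconds.
    """
    merged: List[Segment] = []
    for seg in sorted(segments, key=lambda s: s.start):
        prev = merged[-1] if merged else None
        same_speaker = prev is not None and prev.speaker == seg.speaker
        close_enough = (max_gap is None
                        or (prev is not None and seg.start - prev.end <= max_gap))
        if same_speaker and close_enough:
            # Extend the previous segment and concatenate the transcripts.
            merged[-1] = Segment(prev.speaker, prev.start,
                                 max(prev.end, seg.end),
                                 (prev.text + " " + seg.text).strip())
        else:
            # Copy so the caller's input objects are never mutated.
            merged.append(Segment(seg.speaker, seg.start, seg.end, seg.text))
    return merged


# Toy diarization output: speaker A talks twice in a row, then B, then A again.
diarized = [
    Segment("A", 0.0, 2.0, "Guten"),
    Segment("A", 2.1, 4.0, "Tag"),
    Segment("B", 4.2, 6.0, "Hallo"),
    Segment("A", 6.5, 8.0, "Danke"),
]
merged = merge_consecutive(diarized)
for seg in merged:
    print(seg.speaker, seg.start, seg.end, seg.text)
```

In this toy input the first two segments collapse into a single speaker-A row, while the later speaker-A segment stays separate because speaker B intervenes. Whether the real dataset applies a gap threshold or merges text fields this way would need to be checked against the dataset card.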

Qwen3 TTS Polish Training Data for Speech Synthesis

This dataset provides Polish-language training data for text-to-speech models and is published on Hugging Face. It was uploaded by the user 'agnostic' and last updated on April 3, 2026. Its specific content, size, and structure require verification after download.

Tags: Text, Audio, JSON, Text To Speech, Library: polars, Training Data, Speech Synthesis, Modality: text, Size Categories: 100K<n<1M, Modality: tabular, Library: mlcroissant, Library: datasets, Library: pandas, Region: us, Polish Language (+1 more)

ONEMUSIC: An Open-Source Music Dataset

ONEMUSIC is a free, open-source dataset available on Kaggle. The dataset originates from a GitHub project of the same name. The specific contents, size, and creation details are not provided in the available metadata.

Tags: Audio, Open Source, Audio Processing (+1 more)

Chat-TTS_SM: Text-to-Speech Model Data

Chat-TTS_SM is a dataset published on Kaggle. Its title suggests it contains data related to a text-to-speech model, likely for training or evaluation. The dataset's specific content, size, and origin are not detailed in the provided metadata.

Tags: Text, Audio, Text To Speech, Speech Synthesis, Audio Generation (+1 more)

F5-TTS_Marathi_SD: Marathi Text-to-Speech Audio Samples

A dataset titled 'F5-TTS_Marathi_SD' is hosted on Kaggle. The title suggests it contains audio data for Marathi text-to-speech synthesis. Metadata such as size, row count, columns, and license details are unknown.

Tags: Audio, Text To Speech, Audio Dataset, Marathi, Speech Synthesis (+1 more)

F5-TTS_Urdu_SD: Urdu Speech Synthesis Dataset

F5-TTS_Urdu_SD is a dataset for Urdu text-to-speech synthesis, published on Kaggle. The dataset likely contains audio samples and corresponding text transcripts. Metadata is minimal; specifics on size, format, and collection details require verification after download.

Tags: Audio, Text To Speech, Audio Data, Speech Synthesis, Urdu Language (+1 more)

ChatTTS-02_SM: Speech Synthesis Audio Samples

An audio dataset likely containing samples generated by the ChatTTS text-to-speech model. The dataset is published on Kaggle, but details about its size, creation date, and specific content are not provided in the metadata. The author and organization are unknown.

Tags: Audio, Text To Speech, Speech Synthesis, Audio Generation (+1 more)
