Speech & Audio Datasets | DataSalon

Speech & Audio

Speech recognition, text-to-speech, speaker identification, music classification, audio event detection

2,602 datasets

Latin Music Features and CLAP Embeddings from 30-Second Audio Fragments

30-second audio fragments of Latin music are provided with extracted features. Each fragment includes a 512-dimensional CLAP embedding, 13 MFCCs, and a BPM value. The dataset is hosted on Kaggle, but details about the creator, size, and license are not specified.

Tags: Audio, Latin Music, Music Analysis, Audio Features, CLAP Embeddings, +1 more
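
The per-fragment layout described in this entry (a 512-dimensional CLAP embedding, 13 MFCCs, and a BPM value) could be represented roughly as follows. This is only a sketch: the listing does not document the dataset's actual schema, so the class name, field names, and the assumption that each MFCC is stored as a single value per fragment are all hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence

CLAP_DIM = 512  # embedding size stated in the listing
N_MFCC = 13     # number of MFCCs stated in the listing


@dataclass(frozen=True)
class FragmentFeatures:
    """Features for one 30-second fragment (hypothetical field names)."""
    clap_embedding: Sequence[float]
    mfcc: Sequence[float]  # assumed: one value per coefficient
    bpm: float


def validate(record: FragmentFeatures) -> None:
    """Raise ValueError if a record does not match the described shape."""
    if len(record.clap_embedding) != CLAP_DIM:
        raise ValueError(
            f"expected {CLAP_DIM} CLAP values, got {len(record.clap_embedding)}"
        )
    if len(record.mfcc) != N_MFCC:
        raise ValueError(f"expected {N_MFCC} MFCCs, got {len(record.mfcc)}")
    if record.bpm <= 0:
        raise ValueError("BPM must be positive")
```

A check like this may be useful after download, since the listing notes that size and other details are not specified.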

VoxCeleb: 1 Million+ Audiovisual Clips for Speaker Recognition

VoxCeleb and VoxCeleb2 provide over 1 million audiovisual clips of human speech from celebrities, compiled by researchers at the University of Oxford. This repository aggregates both versions into a single source containing MP4 video and AAC/WAV audio files.

Tags: Size Categories: 100K<n<1M; Task Categories: audio-classification, image-classification, automatic-speech-recognition, video-classification; DOI: 10.57967/hf/0999; License: CC BY 4.0; arXiv: 1706.08612; Region: us; +1 more

Benchmarking Automatic Music Transcription Systems

A dataset for benchmarking automatic music transcription (AMT) systems, likely containing audio samples and corresponding transcription outputs or evaluation metrics. It originates from a Data Visualisation course project (DA332) and was published on Kaggle. The specific content and scale require verification after download.

Tags: Tabular, Audio, Machine Learning, Benchmarking, Music Transcription, Audio Processing, +1 more

KHM-ASR-Cultural-DDD: Cultural Heritage Audio Speech Recognition Data

KHM-ASR-Cultural-DDD is a speech dataset published on Kaggle. The title suggests it contains audio recordings for automatic speech recognition, likely related to cultural heritage. Metadata is minimal; the actual content, scale, and origin require verification after download.

Tags: Audio, Language, Cultural Heritage, Speech Recognition, +1 more

XTTSv2 Patch 9: Text-to-Speech Model Checkpoint

XTTSv2 patch 9 is a model checkpoint for a text-to-speech system, published on Kaggle. The dataset's specific content, such as audio samples or model weights, requires verification after download. No information is provided about the author, organization, or the exact data format.

Tags: Audio, Text To Speech, Machine Learning, Speech Synthesis, +1 more

Music Foundry Vault: A Collection of Audio Samples

Music Foundry Vault is a dataset hosted on Kaggle. Its title suggests it contains audio samples or music production assets. The dataset's specific contents, scale, and origin require verification after download.

Tags: Audio, Music Production, Sound Library, +1 more

Piano Music Emotions Dataset for Emotional Perception Analysis

Kaggle hosts a dataset designed for analyzing emotional perception in piano music. The dataset's creator, size, and specific contents are not detailed in the provided metadata. Its last update date and license information are also unknown.

Tags: Audio, Emotion Analysis, Music Perception, Piano Music, +1 more

Fongbe Speech Dataset: A Tone-Preserved Continuous Speech Corpus

Professor's Fongbe Speech Dataset is a unified, high-quality collection of Fongbe speech data curated to preserve the linguistic integrity of this tonal language. It acts as a complete, unsegmented, and tone-accurate assembly of the Fongbe Continuous Speech Recognition corpora, merging the foundational ALFFA Project data from 2016 with an expanded Zenodo release from 2022. The dataset was last updated on the Hugging Face platform in February 2026.

Tags: Audio, Fongbe, Audio Corpus, Linguistics, Speech Recognition, +1 more

ESC-50: Environmental Sound Classification Dataset

A dataset for Environmental Sound Classification. It likely contains audio recordings of various environmental sounds. The dataset is published on Kaggle.

Tags: Audio, Machine Learning, Environmental Sound, Audio Classification, +1 more

CoRal V3: Danish Conversational and Read-Aloud Speech Dataset

CoRal V3 is an Automatic Speech Recognition dataset designed to capture the diversity of spoken Danish. The dataset, created by the CoRal-project, includes variations across dialects, accents, genders, and age groups. It was last updated on February 24, 2026.

Tags: Audio, Parquet, Audio Classification, Speech Dataset, Danish Language, Automatic Speech Recognition, Dialects; Libraries: polars, dask, mlcroissant, datasets; Language: da; Modality: text; Size Categories: 100K<n<1M; Task Categories: audio-classification, automatic-speech-recognition; License: openrail; Region: us; +1 more

Custom Egyptian Arabic Text-to-Speech Dataset

A text-to-speech dataset for Egyptian Arabic, created by AlaaSamir and hosted on Hugging Face. The dataset was last updated on April 2, 2026. Its specific size, format, and content require verification after download.

Tags: Audio, Text To Speech, Egyptian-Arabic, Speech Synthesis, +1 more

Musicskills 3.5M: Audio Data for Music Skill Analysis

Musicskills 3.5M is a dataset published on Hugging Face by AndreasXi, last updated on 2026-03-31. Its title suggests a collection of data related to musical skills, potentially containing audio recordings or performance metrics. Metadata is minimal, so the dataset's specific content, the 3.5 million item scale implied by its title, and its intended use require verification after download.

Tags: Audio, Skills, Machine Learning, +1 more

Independent Music Venue Economic Impact Across 109 U.S. Zones

This dataset estimates the economic impact of 1,423 independent music venues across 109 U.S. music zones. It provides regional-level estimates of annual economic output and jobs supported, calculated using a venue economic impact calculator. The analysis finds venues contribute approximately $1.4 billion annually and support 11,824 jobs.

Tags: Arts And Humanities, Business and Management, Agglomeration Theory Independent Music Venues Musi, +1 more
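
For a rough sense of scale, the headline figures quoted in this entry can be turned into per-venue averages. The sketch below uses only the numbers stated in the listing; the function and variable names are illustrative, and because the output total is given as approximate, the results are ballpark averages rather than values taken from the dataset.

```python
# Figures quoted in the listing; the output total is approximate.
VENUES = 1_423
ZONES = 109
ANNUAL_OUTPUT_USD = 1.4e9
JOBS_SUPPORTED = 11_824


def per_venue_averages(venues: int, output_usd: float, jobs: int) -> tuple[float, float]:
    """Return (average annual output per venue in USD, average jobs per venue)."""
    return output_usd / venues, jobs / venues


output_per_venue, jobs_per_venue = per_venue_averages(
    VENUES, ANNUAL_OUTPUT_USD, JOBS_SUPPORTED
)
venues_per_zone = VENUES / ZONES

print(f"Average output per venue: ${output_per_venue:,.0f}")
print(f"Average jobs per venue:   {jobs_per_venue:.1f}")
print(f"Average venues per zone:  {venues_per_zone:.1f}")
```

These averages work out to a little under $1 million in annual output and about 8 jobs per venue. They hide regional variation, which the dataset's zone-level estimates are meant to capture.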

Speech Dataset by Author wonderwind271

Speech Dataset is an audio collection uploaded to Hugging Face by author wonderwind271. The dataset was last updated on April 4, 2026. Its specific content, size, and structure require verification after download.

Tags: Audio, Machine Learning, +1 more

XTTSv2 Checkpoint: A Text-to-Speech Model

XTTSv2_checkpoint is a dataset published on Kaggle. The title suggests it contains model weights or training data for a text-to-speech system. The dataset's specific content, size, and origin are not detailed in the available metadata.

Tags: Audio, Text To Speech, Machine Learning, Speech Synthesis, Checkpoint, +1 more

Irodori-TTS Training Data

Kaggle hosts the Irodori-TTS Training Data. The dataset likely contains audio recordings and corresponding text transcripts for training text-to-speech models. Its creator, size, and specific collection date are unknown.

Tags: Text, Audio, Text To Speech, Training Data, Speech Synthesis, +1 more

OtoSpeech: 141 Hours of Processed Full-Duplex Conversational Audio

Otoearth released this 141-hour dataset of processed, two-speaker full-duplex conversational English speech in February 2026. It is a curated subset of the otoSpeech-full-duplex-280h collection, refined through human quality reviews and noise reduction techniques.

Tags: Audio, English; Task Categories: audio-to-audio; Modality: audio; Language: en; Size Categories: n<1K; License: CC BY 4.0; Region: us; +1 more

Librispeech Synth 300h: Synthetic Speech Audio from Up to 10 Speakers

Librispeech Synth 300h is a synthetic speech audio dataset derived from the LibriSpeech corpus. The title suggests it contains up to 300 hours of generated audio, likely from a maximum of 10 distinct speaker profiles. It is hosted on the Kaggle platform, but detailed metadata about its creation and contents is not provided.

Tags: Audio, Machine Learning, Audio Dataset, Speech Synthesis, Speech Recognition, +1 more

Tricky TTS Orpheus: Text-to-Speech Audio Samples

Tricky TTS Orpheus is a dataset authored by Trelis and hosted on Hugging Face. The dataset was last updated on March 31, 2026. Its specific content and scale are not detailed in the available metadata.

Tags: Audio, Text To Speech, Speech Synthesis, Audio Generation, +1 more

Synthetic Speaker Diarization Dataset for Hindi Speech

This synthetic Hindi speech dataset was created by sol9x-sagar and published on Hugging Face. It is designed for speaker diarization, the task of segmenting audio and identifying which speaker is talking in each segment. It was last updated on April 1, 2026.

Tags: Audio, Hindi Speech, Speech Processing, Speaker Diarization, Synthetic, Synthetic Audio, +1 more
