DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.

Speech & Audio Datasets | DataSalon

Speech & Audio

Speech recognition, text-to-speech, speaker identification, music classification, audio event detection

2,602 datasets

HiFi-HARP: High-Fidelity Hybrid Ambisonic Room Impulse Responses

A large-scale collection of First-Order Ambisonic (FOA) Room Impulse Responses (RIRs) generated through high-fidelity hybrid acoustic simulation. This dataset is a specialized version derived from the 7th-order HiFi-HARP source to support spatial audio research in sound localization and dereverberation.

Tags: Ambisonics, Task Categories: audio-to-audio, Modality: audio, Task Categories: audio-classification, License: CC BY 4.0, Room Impulse Response, RIR, 3d-front, Region: US, Spatial Audio, Simulation, arXiv:2510.21257 (+1 more)

Viola Timbre Audio Recordings Across Multiple Genres

High fidelity audio recordings of viola performances across multiple musical genres. The dataset likely contains spectral and temporal features extracted from the audio signals. The author, organization, and specific collection details are unknown.

Tags: Audio, Time Series, Spectral Analysis, Viola, Timbre, Data Analytics, Data Cleaning (+1 more)

Music by Emotion: 1,000 SoundCloud Audio Clips for Emotion Recognition

1,000 music samples are provided as 30-second audio clips sourced from SoundCloud. Each clip is labeled for emotional content based on musical characteristics like tempo and key. The dataset was created by LaurenGurgiolo and last updated on December 16, 2025.

Tags: Audio, SoundCloud, Emotion Recognition (+1 more)

Josh Talk ASR Dataset: Speech Audio for Recognition Models

A speech dataset titled 'josh-talk-ASR-dataset' hosted on Kaggle. The dataset likely contains audio recordings and corresponding transcriptions for training automatic speech recognition systems. Specific details on volume, contributors, and creation date are unavailable in the provided metadata.

Tags: Audio, Speech Data, Audio Processing, Automatic Speech Recognition (+1 more)

Urgent2025 SQA: Noisy and Enhanced Speech with Quality Metrics

The URGENT Speech Enhancement Challenge dataset provides noisy and enhanced speech samples curated for speech quality assessment (SQA) and speech enhancement (SE) research. Each entry includes audio and IDs together with objective and model-predicted quality metrics, plus human Mean Opinion Scores collected from 8 distinct subjects via Amazon Mechanical Turk. The dataset was created by urgent-challenge and last updated on December 10, 2025.

Tags: Audio, Machine Learning, Speech Enhancement, Audio Quality, Human MOS, Speech Quality Assessment (+1 more)

Japanese Multi-Speaker Speech Dataset in LJSpeech Format

60,233 speech utterances from 20 Japanese speakers, totaling approximately 90.6 hours of audio. The dataset is formatted for LJSpeech compatibility and optimized for training TTS models like Piper. Audio samples have a 22,050 Hz sample rate.

Tags: Audio, Japanese, Size Categories: 10K<n<100K, Text To Speech, Task Categories: text-to-speech, License: other, Multi Speaker, VITS, Piper, Region: US (+1 more)

Kashmiri Language Speech Synthesis and Recognition Data

Kashmiri Text-to-Speech | Speech-to-Text is a dataset hosted on Kaggle aimed at enabling speech synthesis and digital accessibility for the Kashmiri language. The dataset's specific size, format, and structure are not detailed in the provided metadata. Its author, organization, and last update date are unknown.

Tags: Text, Audio, Text To Speech, Speech To Text, Kashmiri Language, Natural Language Processing, Automatic Speech Recognition (+1 more)

ViVoice34: Speaker Verification Audio Samples

A dataset designed for the speaker verification task. The dataset's author, size, and specific contents are not detailed in the provided metadata. It is hosted on the Kaggle platform.

Tags: Audio, Voice Biometrics, Speaker Verification, Audio Processing (+1 more)

Vietnamese Lexicons for Text Normalization and TTS

A collection of Vietnamese acronym and transliteration lexicons for text normalization and text-to-speech applications. The dataset is hosted on Kaggle and is associated with platform tags for text and speech processing. Specific details on size, authorship, and update frequency are not provided.

Tags: Text, Text To Speech, Text Pre Processing, Speech To Text, Vietnamese, Natural Language Processing, Text Preprocessing (+1 more)

Nepali ASR Dataset: Speech Recognition Data for Nepali Language

A speech recognition dataset for the Nepali language, published on the Hugging Face platform by Aadarsh17. The dataset was last updated on February 12, 2026. Its specific content, size, and structure require verification after download.

Tags: Text, Audio, OPTIMIZED-PARQUET, Parquet, Size Categories: 10K<n<100K, Library: polars, Audio Data, Modality: text, Library: mlcroissant, Library: datasets, Library: pandas, Region: US, Speech Recognition (+1 more)

Massachusetts County Subdivision Boundaries for 2025 Census

2025 boundaries for county subdivisions in Massachusetts, as reported through the Census Bureau's Boundary and Annexation Survey and Participant Statistical Areas Program. This shapefile extract from the MAF/TIGER System provides geographic and cartographic information for legally-recognized minor civil divisions and statistical census county divisions.

Tags: Massachusetts, MA, Census County Division, Polygon, MCD, State Or Equivalent Entity, County Subdivision, Minor Civil Division, Town, Unorganized Territory, Subdivision, UT, Township, Barrio, CCD (+1 more)

Multimodal Music Genre and Emotion Dataset

Audio features are paired with emotion and genre labels for analysis. The dataset is multimodal, combining audio signal data with categorical annotations. Specific row counts, column details, and creation metadata are unavailable.

Tags: Audio, Multimodal, Audio Features, Music Genre (+1 more)

DeepFakeVoice-Wac2Vec-4Datasets

Segmented deepfake speech audio clips aggregated from 4 public source datasets. The audio is partitioned into 2-second clips with a 1-second overlap to provide consistent input lengths for acoustic feature extraction and temporal analysis.

Tags: Audio, Audio Classification, Artificial Intelligence, Deep Learning (+1 more)
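
The partitioning described for this dataset (2-second clips with a 1-second overlap) can be illustrated with a minimal sketch. This is not the dataset author's code: the function name `segment_with_overlap`, the toy sample rate of 8 Hz, and the choice to drop a trailing partial clip are all assumptions made for illustration.

```python
def segment_with_overlap(samples, sample_rate, clip_seconds=2.0, overlap_seconds=1.0):
    """Split a 1-D sequence of samples into fixed-length, overlapping clips.

    Hypothetical helper: consecutive clips share `overlap_seconds` of audio.
    A trailing remainder shorter than one clip is dropped (an assumption).
    """
    clip_len = int(round(clip_seconds * sample_rate))
    hop_len = clip_len - int(round(overlap_seconds * sample_rate))
    if clip_len <= 0 or hop_len <= 0:
        raise ValueError("clip length must be positive and longer than the overlap")

    clips = []
    start = 0
    # Advance by the hop length so each clip overlaps the previous one.
    while start + clip_len <= len(samples):
        clips.append(samples[start:start + clip_len])
        start += hop_len
    return clips


# Toy signal: 5 seconds at an assumed 8 samples per second.
toy_signal = list(range(40))
clips = segment_with_overlap(toy_signal, sample_rate=8)
```

With these toy values each clip holds 16 samples and starts 8 samples after the previous one, so the second half of one clip equals the first half of the next.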

XLSR 5 Epochs: Telugu Automatic Speech Recognition Model

XLSR 5 Epochs Telugu ASR is a dataset for training or evaluating automatic speech recognition models for the Telugu language. The dataset is hosted on the Kaggle platform, but its specific contents, size, and creation details are not provided in the available metadata. The title suggests it may be related to a cross-lingual speech representation (XLSR) model fine-tuned for five epochs.

Tags: Audio, Telugu Language, Audio Processing, Speech Recognition, Automatic Speech Recognition (+1 more)

Music Metadata Records Dataset

Records is a dataset containing metadata for music releases. The dataset is tagged for Arts and Entertainment and Audio applications. Specific details on record count, features, and provenance are unavailable.

Tags: Audio, Arts And Entertainment (+1 more)

Music Database for Cinematic Video Editing

music_db_cinematic_video_edit is a dataset hosted on Kaggle. The title suggests it contains music or audio-visual material intended for use in video editing, particularly for cinematic projects. The dataset's specific contents, size, and origin are not detailed in the available metadata.

Tags: Audio, Video, Audio Visual, Video Editing, Cinematic (+1 more)

Hakka Speech Recognition Dataset for Taiwanese Languages

Hakka-language audio recordings and transcriptions form a pre-training dataset for the Taiwan-Tongues-ASR-CE project. The dataset is packaged in WebDataset format for direct use with PyTorch and Hugging Face libraries. It was created by the adi-gov-tw organization and last updated in December 2025.

Tags: Audio, Multimodal, Taiwanese Languages, Audio Transcription, Speech Recognition, Hakka Language (+1 more)

Mdf En Emilia Yodas: 616 Hours of English Audio Events

616 hours of English audio extracted from the Emilia-Dataset, licensed under CC BY 4.0. The audio events are classified using Scribe v1, an STT/ASR system from ElevenLabs, and filtered using Facebook audio aesthetics metrics. The dataset is described as a 'v1' version, with further collaboration invited via Discord.

Tags: Audio, Event Classification, English Language, Speech Recognition (+1 more)

Music Genre Audio Data for Pre-Trained Model Training

A Kaggle-hosted dataset for audio classification tasks, likely containing audio files or features for music genre identification. The dataset is intended for training or fine-tuning pre-trained models. Specific details on size, origin, and creation date are not provided in the available metadata.

Tags: Audio, Audio Data, Pre Trained Model, Audio Classification, Music Genre (+1 more)

Urdu Speech Emotional Corpus For Emotion Recognition

Urdu-language speech audio samples annotated for emotional content. The corpus is designed for tasks in emotion recognition and spoken language processing. The specific number of audio files, features, and rows is unknown.

Tags: Text, Languages (+1 more)
