Speech & Audio Datasets | DataSalon

🎤

Speech & Audio

Speech recognition, text-to-speech, speaker identification, music classification, audio event detection

2,602 datasets

Seismic-Reflection Surveys of Outer Cape Cod Nearshore

High-resolution seismic-reflection surveys map the stratigraphy of the nearshore areas from Chatham to Provincetown, Massachusetts. The U.S. Geological Survey Woods Hole Field Center conducted this investigation to correlate geologic units between the nearshore and onshore. The data defines the Quaternary geologic framework of outer Cape Cod.

Geospatial, Geology, Stratigraphy, Cape Cod, Seismic Reflection

Bathymetric Contours for the Gulf of Maine Sea Floor

A 2006 data set provides bathymetric contours for the Gulf of Maine and New England Shelf. The U.S. Geological Survey constructed it for geologic framework studies. It was reprojected into the NAD83 Massachusetts State Plane coordinate system by the Massachusetts Office of Coastal Zone Management.

Geospatial, Sea Floor Geology, Marine Habitat, Gulf Of Maine, Geospatial Contours, Bathymetry

Boston Harbor Water Quality Monitoring for Wastewater Effluent

Monitoring data tracks the environmental effects of secondary-treated sewage effluent discharged from a 9.5-mile outfall tunnel into Massachusetts Bay. The Environmental Quality Department (Enquad) collects this data to ensure compliance with an NPDES discharge permit for 43 communities. The dataset covers water quality in Massachusetts Bay, Boston Harbor, and Cape Cod Bay.

Time Series, Geospatial, Massachusetts Bay, Coastal Water Quality, Environmental Impact, Wastewater Monitoring

Hubline Natural Gas Pipeline Route in Massachusetts Bay

This dataset records the as-built location of the Hubline, a 29.5-mile natural gas pipeline constructed primarily offshore in Massachusetts Bay between Beverly and Weymouth. It was created by SCIOPS and represents the pipeline's surveyed bottom position. The route traverses 11 coastal communities, including Salem, Boston, and Quincy.

Geospatial, Natural Gas Infrastructure, Geospatial Data, Coastal Engineering

Afaan Oromo Text to Speech Synthesis Dataset

A dataset for Afaan Oromo text-to-speech synthesis, published on Kaggle. The dataset likely contains paired text and audio samples for training and evaluating speech synthesis models. Specific details on size, format, and collection methodology are not provided in the available metadata.

Text, Audio, Text To Speech, Speech Synthesis, Afaan Oromo, Low Resource Language

MLAAD English: 5 Audio Samples per TTS Model

MLAAD English provides audio samples for evaluating text-to-speech models. The dataset likely contains five audio clips generated by each of several TTS systems. It is hosted on Kaggle, but the specific creator and update date are unknown.

Audio, Machine Learning, Speech Synthesis, Audio Samples, TTS Evaluation

Penguin Colony Geospatial Data from Antarctic Islands

Aerial photography from 1982 of penguin colonies on islands approximately 12 km northeast of Brattstrand Bluff, Antarctica, was digitized into DXF files and later georeferenced into a shapefile. The dataset includes digitized colony boundaries and four supporting photographs from 2009. Work was contributed by Eric Woehler, John Cox, Tom Velthuis, and Ursula Harris of the Australian Antarctic Data Centre.

Geospatial, Aerial Survey, Antarctic Ecology, Computer Vision, Penguin Colonies, Geospatial Data

Granary: 1 Million Hours of Speech for 25 European Languages

NVIDIA's Granary dataset provides approximately 1 million hours of high-quality speech data across 25 European languages for speech recognition and translation. Released in 2026, it consolidates multiple sources into a unified framework to support low-resource language modeling. The collection is designed for both Automatic Speech Recognition (ASR) and Automatic Speech Translation (AST) tasks.

Language: en, Language: da, Language: el, Language: bg, Language: cs, Task Categories: automatic speech recognition, Language: es, Task Categories: translation, Language: de

PICO-8 Games Multimodal Dataset with 10,967 Cartridges

PICO-8 Games Dataset contains 10,967 game cartridges scraped from the Lexaloffle BBS. Each cartridge is decomposed into Lua source code, pixel-art spritesheets, tile maps, sound effects, music patterns, and metadata. The dataset was created by Fraser and includes label screenshots from the top 48 games by star count.

OPTIMIZED-PARQUET, Parquet, Size Categories: 10K<n<100K, Task Categories: text generation, Library: polars, Games, Task Categories: image-to-text, Retro, Language: en, Lua, Modality: text, Library: mlcroissant, Modality: image, Library: datasets, Library: pandas, Code Generation, Region: us, Pico 8, Pixel Art

LibriVAD: A Large-Scale Noise-Augmented Dataset for Voice Activity Detection

LibriVAD is a large-scale, noise-augmented dataset for Voice Activity Detection (VAD) generated from the LibriSpeech corpus. The dataset was created by LibriVAD and was last updated on March 17, 2026. It is designed for training and evaluating VAD models in noisy environments.

Audio, Audio Dataset, Noise Augmentation, Speech Processing, Large Scale, Natural Language Processing, Voice Activity Detection

Vivoice Relabeled: Vietnamese Speech Recognition Data with Qwen3-ASR

Vivoice Relabeled is a speech dataset derived from the original capleaf/viVoice collection. The dataset has been processed using the Qwen/Qwen3-ASR-1.7B model to update audio-text labels, retaining samples with a Word Error Rate below 15%. It was uploaded by author JayLL13 to Hugging Face in March 2026.

Audio, Parquet, Library: polars, Library: dask, Modality: text, Size Categories: 100K<n<1M, Library: mlcroissant, Library: datasets, Region: us, Relabeled, Vietnamese Language, Audio Processing, Speech Recognition
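
The Vivoice Relabeled entry above describes keeping only samples whose Word Error Rate is below 15%. A minimal sketch of that kind of filter follows. It assumes the WER is computed between the original transcript and the ASR model's output, which the listing does not state; the field names `original_text` and `asr_text` are hypothetical, and whitespace tokenization is a simplification.

```python
from typing import Dict, List


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        # Assumption: an empty reference counts as fully wrong unless
        # the hypothesis is empty too.
        return 0.0 if not hyp else 1.0

    # Levenshtein distance over words, keeping one previous row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


def filter_by_wer(samples: List[Dict[str, str]], threshold: float = 0.15) -> List[Dict[str, str]]:
    """Keep samples whose WER is strictly below the threshold."""
    return [
        s for s in samples
        if word_error_rate(s["original_text"], s["asr_text"]) < threshold
    ]
```

Calling `filter_by_wer(samples)` on a list of dictionaries returns the subset that passes; a production pipeline would more likely normalize case and punctuation first and use an established WER library.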

CMI-Pref: Human Music Preference Comparisons with Alignment Scores

CMI-Pref provides between 1,000 and 10,000 human preference comparisons for multimodal music generation, published by HaiwenXia in 2026. Each record captures a human vote comparing two generated audio samples based on musicality, alignment, and confidence.

JSON, Size Categories: 1K<n<10K, Library: polars, Language: zh, Preference Prediction, Task Categories: audio-to-audio, Modality: audio, Language: en, License: CC BY-NC-SA 4.0, Modality: text, Multimodal Music Generation, Modality: tabular, Library: mlcroissant, Task Categories: text-to-audio, Library: datasets, Library: pandas, Region: us

Naija-ASR-Corpus: Nigerian Pidgin Speech Recognition Dataset

Naija-ASR-Corpus v1.0 (NAC-v1.0) is a foundational speech dataset for Nigerian Pidgin (Naija, PCM). It was created by the NAC Team, who processed long-form recordings from the Universal Dependencies Naija Spoken Corpus into sentence-level audio-text pairs suitable for ASR training. The dataset was last updated on March 16, 2026.

Audio, Pidgin, Natural Language Processing, Nigerian Languages, Speech Recognition, Audio Text Pairs

IndicTTS-p1: Indic Language Text-to-Speech Data

IndicTTS-p1 is a dataset for text-to-speech synthesis, published on Kaggle. The title suggests it contains data for Indic languages, which likely includes audio recordings and corresponding text transcripts. The dataset's specific size, languages, and collection details are not provided in the available metadata.

Text, Audio, Text To Speech, Speech Synthesis, Indic Languages

IndicTTS-p3: Indic Language Text-to-Speech Data

IndicTTS-p3 is a text-to-speech dataset likely containing audio samples and corresponding text transcripts for one or more Indic languages. It is hosted on Kaggle, but the author, organization, and specific data characteristics are not provided. The dataset's size, format, and exact contents require verification after download.

Audio, Text To Speech, Speech Synthesis, Indic Languages

Taiwanese Hokkien Synthetic Speech Dataset from Qwen3-TTS

This dataset contains 33,228 synthetic audio clips for Taiwanese Hokkien text-to-speech, generated using the Qwen3-TTS-1.7B-Base model with voice cloning. It was created by lianghsun and last updated in March 2026.

Tabular, Audio, Speech Synthesis, Voice Cloning, Taiwanese Hokkien

TTS Human Preferences: 2,700 Audio Pairs with 40,500 Annotations

This dataset contains 2,700 pairs of text-to-speech audio renderings, with 15 human preference annotations per pair. Produced by datapointai and updated in March 2026, it provides comparative naturalness ratings for audio generated from identical text prompts. The collection totals 40,500 individual human judgments (2,700 pairs × 15 annotations) to support high-confidence audio quality evaluation.

OPTIMIZED-PARQUET, Parquet, Size Categories: 1K<n<10K, Text To Speech, Library: polars, RLHF, Modality: audio, Language: en, Modality: text, Audio Quality, Library: mlcroissant, Task Categories: audio classification, Library: datasets, Library: pandas, Preference Data, License: CC BY 4.0, Human Preferences, Region: us, DPO

NileTTS: 38 Hours of Egyptian Arabic Text-to-Speech Data

NileTTS provides 38.1 hours of transcribed Egyptian Arabic speech across 9,521 utterances, published by KickItLikeShika in February 2026. The collection is segmented into specific domains, including over 21 hours dedicated to sales and customer service interactions.

Text To Speech, Task Categories: text-to-speech, Language: ar, Egyptian-Arabic, Region: us, arXiv: 2602.15675, License: Apache 2.0, Synthetic Data

Human Preference Data For TTS Audio Quality Evaluation

TTS Human Preferences (Medium) is a dataset for text-to-speech audio quality evaluation. It contains 2,000 rows, each with two TTS audio renderings and 15 human preference annotations, totaling 30,000 annotations. The dataset was created by datapointai and last updated in March 2026.

Audio, Multimodal, OPTIMIZED-PARQUET, Parquet, Size Categories: 1K<n<10K, Text To Speech, Library: polars, RLHF, Modality: audio, Language: en, Modality: text, Audio Quality, Library: mlcroissant, Task Categories: audio classification, Library: datasets, Benchmark, Library: pandas, Preference Data, License: CC BY 4.0, Human Preferences, Region: us, DPO

Ttsmodels: Text-to-Speech Model Artifacts

Ttsmodels is a dataset published on Hugging Face by author phongluong197. The dataset was last updated on 2026-05-07 at 07:59:33. Its specific content and scale are not detailed in the available metadata.

Audio, Text To Speech, Speech Synthesis, Audio Generation
