DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.

Speech & Audio Datasets | DataSalon

Speech & Audio

Speech recognition, text-to-speech, speaker identification, music classification, audio event detection

2,602 datasets

Apple Music and Spotify Hits, 638k Tracks from 2010 to 2023

AppleMusic/Spotify Hits 638k Tracks 2010-2023 is a dataset of music tracks from two major streaming platforms. It contains 638,000 tracks released between 2010 and 2023, sourced from Kaggle. The dataset likely includes audio features and popularity metrics for analysis.

Tags: Tabular, Audio, Apple Music, Spotify, Popularity, Audio Features, Music Tracks

Livekit Wake Word Audio Dataset

An audio dataset for wake word detection, likely associated with the Livekit platform. The dataset was created by the author 'yepher' and was last updated in March 2026. Specific details on the number of samples, audio length, and recording conditions are not provided.

Tags: Audio, Voice Trigger, Audio Classification, Wake Word Detection, Region: us, Speech Commands

Japanese Speech Dataset With 380 Speakers And 1.2 Million Samples

Filtered GOL Dataset is a Japanese text-to-speech resource containing approximately 1.2 million audio samples, totaling 1,880 hours, from 380 speakers. It was filtered by tts-dataset for TTS training using rules on text length, audio duration, and a minimum number of samples per speaker. The audio is in FLAC format at 44.1 kHz and is packaged as a WebDataset.

Tags: WEBDATASET, Task Categories: text-to-speech, License: other, Size Categories: 1M<n<10M, Library: webdataset, Modality: text, Library: mlcroissant, Library: datasets, Region: us, Task Categories: automatic-speech-recognition

Song Feature Dataset with 3.3 Million Tracks

Kaggle hosts a dataset of 3.3 million songs with corresponding musical feature data. The description states all song feature fields are filled out. The author, organization, and last update date are not specified.

Tags: Tabular, Audio, Audio Data, Data Visualization, Music Analysis, Large Scale, Song Features, Data Analytics

XTTSv2 Finetuning Data: Audio and Text Pairs for Speech Synthesis

XTTSv2 Finetuning Data 20260417 is a dataset for training or adapting text-to-speech models, published on Kaggle. The dataset likely contains audio recordings and corresponding text transcripts suitable for fine-tuning the XTTSv2 speech synthesis system. Specific details regarding its size, origin, and collection methodology are not provided in the available metadata.

Tags: Text, Audio, Text To Speech, Audio Data, Speech Synthesis, Finetuning

Japanese Eroge Voice V2: 1-10 Million Audio-Transcription Pairs

This dataset contains between 1 and 10 million audio-transcription pairs extracted from Japanese adult games by NandemoGHS in January 2026. It consists entirely of new audio clips and transcriptions, with no overlap with the original version.

Tags: Audio, Japanese, OPTIMIZED-PARQUET, Parquet, Task Categories: text-to-speech, Library: polars, Library: dask, Modality: audio, Size Categories: 1M<n<10M, Modality: text, Library: mlcroissant, Library: datasets, Anime, Region: us, Task Categories: automatic-speech-recognition, Language: ja, License: mit

Thai_TTS_config: Thai Text-to-Speech Configuration Files

Thai_TTS_config is a dataset hosted on Kaggle. The title suggests it contains configuration files or parameters for Thai language text-to-speech (TTS) systems. The dataset's author, organization, size, and specific content are unknown.

Tags: Audio, Text To Speech, Speech Synthesis, Audio Configuration, Thai Language

ESC-50: Environmental Sound Classification Dataset

This dataset likely covers 50 distinct environmental sound classes, as its name suggests. It is hosted on Kaggle and is intended for machine learning tasks. Metadata is minimal, so the actual content should be verified after download.

Tags: Audio, Machine Learning, Audio Classification, Environmental Sounds

Urdu G2P Dictionary with 478,000+ Word-to-IPA Mappings

Urdu G2P Phoneme Dictionary is a Grapheme-to-Phoneme (G2P) resource for Urdu containing over 478,000 word-to-IPA mappings. The dataset, created by humair025, is described as the largest publicly available Urdu phoneme dictionary. It was last updated on the Hugging Face platform on 2026-01-19.

Tags: Text, Audio, IPA Mappings, Speech Synthesis, Urdu Language, Natural Language Processing, Linguistics, Grapheme To Phoneme

Quran Recitations Paired with Verses by Qaris

A collection of Quranic verses paired with their respective audio recitations by Qaris (reciters). The dataset was created by Zackmortar and was last updated on the Hugging Face platform on 2026-01-19. It is intended as a resource for research and development in Quranic studies and audio processing.

Tags: Audio, Quran, Islamic Studies, Audio Analysis, Speech Recognition

Invoice Dataset for Document Processing

An invoice dataset published on Kaggle. The dataset likely contains structured or semi-structured information related to business transactions. Specific details such as the number of records, columns, and collection methodology are not provided in the available metadata.

Tags: Tabular, Business Documents, Financial Data, Invoice

IISc-MILE Tamil ASR Corpus: Tamil Speech Recognition Data

Tamil language audio data for automatic speech recognition (ASR). The dataset is published on Kaggle and likely contains speech recordings and corresponding transcriptions. The Indian Institute of Science (IISc) MILE lab is inferred as the source, but specific details on size, collection method, and time range are unavailable.

Tags: Audio, Natural Language Processing, Audio Corpus, Speech Recognition

Mile Tamil ASR Corpus: Speech Recognition Data for Tamil Language

Tamil language audio data for automatic speech recognition (ASR) tasks. The dataset is hosted on Kaggle, but details on its size, collection method, and specific content are not provided in the metadata. Further verification is required to confirm the exact scope and characteristics of the corpus.

Tags: Audio, Audio Corpus, Speech Recognition

dataindextts3_song: Text-to-Speech Audio Samples

dataindextts3_song is a dataset published on Kaggle. The title suggests it contains audio data related to text-to-speech synthesis, potentially for song generation. The dataset's specific content, size, and origin are not detailed in the available metadata.

Tags: Audio, Text To Speech, Speech Synthesis, Audio Generation

TeleVRSLUBench: A Multimodal Spoken Language Understanding Benchmark

TeleVRSLUBench is a spoken language understanding benchmark that incorporates visual scene information and explicit reasoning processes for joint intent detection and slot filling. The dataset was proposed by Tele-AI in the paper 'Introducing Visual Scenes and Reasoning: A More Realistic Benchmark for Spoken Language Understanding'. It was last updated on the Hugging Face platform on 2026-01-22.

Tags: Multimodal, Intent Detection, Spoken Language Understanding, Slot Filling, Benchmark, Multimodal Benchmark, Visual Reasoning

Erhu Timbre Audio Dataset with CSV Labels

Erhu Timbre Audio Dataset contains audio-based timbre records for the Chinese string instrument, the erhu. The dataset includes CSV labels, likely for categorizing or annotating the audio samples. It is hosted on Kaggle, but details about its creation, size, and update history are unavailable.

Tags: Audio, Instrument, Erhu, Timbre

Restaurant Locations in Quincy Massachusetts

A sample of restaurant market data for the city of Quincy, Massachusetts, provided by BeamStation. The dataset contains listings for all restaurants in the area, though the exact number of records is unspecified. The original creation date and update frequency are not documented.

Tags: Tabular, Geospatial, Business Listings, Restaurant Locations, Market Data

PERiScoPe: Aligned Piano Scores and Performances for Music AI

PERiScoPe is a large-scale dataset of aligned piano scores and performances developed for the SyMuPe research project. It combines and processes open-source collections like (n)ASAP and ATEPP with curated web-collected MIDI performances. The dataset was created by SyMuPe and was last updated on the platform in January 2026.

Tags: Audio, Multimodal, Music Generation, Score Alignment, MIDI, Symbolic Music, Large Scale, Piano Performance

Oral Transcripts and Linguistic Features for Cognitive Ability Screening

Raw oral transcription texts, preprocessing results, and derived linguistic feature datasets for cognitive ability research. It includes code for classification experiments. The author is chen, xuanshu, and the dataset was last updated in February 2026.

Tags: Arts And Humanities, Computer and Information Science

Invoice Dataset for Financial Document Processing

An invoice dataset published on Kaggle. The dataset likely contains structured information related to business invoices, such as amounts, dates, and vendor details. Its specific content, size, and origin require verification after download.

Tags: Tabular, Business Documents, Financial Data, Invoice
