DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.

Speech & Audio Datasets | DataSalon

🎤

Speech & Audio

Speech recognition, text-to-speech, speaker identification, music classification, audio event detection

2,013 datasets

Speech & Audio

Gloria: Folk Music and Dance Traditions in Olivenza Villages

Descriptive text data on folk music and dance traditions from the Olivenza region, likely documenting cultural practices. The dataset was coordinated by Álvarez Pérez, Xosé Afonso and harvested into the e-cienciaDatos Dataverse platform. It was last updated on May 5, 2024.

Tags: Text, Oral Tradition, Iberian Peninsula, Ethnomusicology, Cultural Heritage, Folk Dance, +1

Speech & Audio

Sara Delgado Oral History Interview on Life in Piedras Albas

An interview with Sara Delgado from Piedras Albas, harvested by e-cienciaDatos. The audio recording captures personal recollections about childhood in the town, local livelihoods, life on the border, and cultural topics like music festivals and contraband. The dataset was last updated on May 5, 2024.

Tags: Audio, Border Life, Oral History, Interview, Cultural Heritage, +1

Speech & Audio

Galician-Portuguese Border Dialect Interviews from Rubiás

An interview with Francisco and Lola in Rubiás, focusing on language similarities and differences across the border. The dataset likely contains discussions on the assessment of Galician spoken on television, dialectal variations in border villages like Montalegre, and comparisons between Galician and Portuguese. It was coordinated by Álvarez Pérez, Xosé Afonso and last updated on May 5, 2024.

Tags: Audio, Galician Language, Interview Data, Border Dialects, Linguistics, +1

Speech & Audio

Voicevox Voice Corpus: 577 Hours of Synthetic Japanese Audio

Contains 445,793 synthetic Japanese voice recordings, totaling over 577 hours of audio, generated with the VOICEVOX engine. Created by ayousanz and updated in May 2024, the data is based on the ITA, Tsukuyomi-chan, and ROHAN text corpora.

Tags: Region: US, Language: ja, +1

Speech & Audio

Ewe Bible V2 TTS: Ewe Language Text-to-Speech Dataset

Audio recordings and text transcripts of the Ewe Bible organized for Text-to-Speech (TTS) development. These linguistic resources support speech synthesis for the Ewe language, a Gbe language spoken primarily in Ghana and Togo.

Tags: Parquet, Size Categories: 1K<n<10K, Task Categories: text-to-speech, Library: polars, Library: dask, Modality: audio, Modality: text, Library: mlcroissant, Task Categories: text-to-audio, Library: datasets, Croissant, Region: US, Language: ee, Language, Task Categories: automatic-speech-recognition, Ewe, Task Categories: translation, License: apache-2.0, +1

Speech & Audio

MELD TTS: Gender-Specific Speaker 3 Audio Samples

A dataset named 'Meld Tts Gender Speaker3' was published on the HuggingFace platform by author TAESOO98 on 2024-05-28. The title suggests it contains audio samples for a specific speaker, likely intended for text-to-speech synthesis tasks. The dataset's specific content, size, and structure require verification after download.

Tags: Audio, Text To Speech, Speech Synthesis, Audio Generation, +1

Speech & Audio

Voice Datasets

95+ open-source datasets across voice and sound computing categories. The index facilitates discovery of specialized audio resources for various machine learning applications.

Tags: Audio, Audio Dataset, Noise, Voice Dataset, Voice Datasets, Voice Recognition, Voice Chat, Voice Conversion, Voice Control, Voice Synthesis, Voice Activity Detection, Voice Computing, Audio Datasets, Voice Commands, Voice Assistant, +1

Speech & Audio

MSDWild: Multimodal Speaker Diarization and Localization Dataset

MSDWild is a dataset designed for testing multimodal analysis in tasks including multimodal speaker diarization, multimodal speaker localization, and audio-visual lip synchronization. The dataset is hosted on Hugging Face by author 'taocode' and was last updated on April 29, 2024. A sample can be viewed on the associated GitHub repository.

Tags: Audio, Multimodal, Multimodal Analysis, Audio Visual, Lip Synchronization, Speaker Diarization, +1

Speech & Audio

Flemish Mozilla Common Voice: 15,000 Male Dutch Flemish Speech Samples

A dataset containing 15,000 audio samples of a male Dutch Flemish voice. It was created by fibleep and ported from the dutch-vl-tts GitHub repository to the Hugging Face platform. The data was last updated on April 16, 2024, and originates from the Mozilla Common Voice project's Dutch language data.

Tags: Audio, Common Voice, Audio Dataset, Speech Synthesis, Dutch Flemish, +1

Speech & Audio

Telugu Text-to-Speech Audio Synthesis Data

Telugu TTS is a dataset for speech synthesis published on HuggingFace by author deboleen6. Platform tags indicate it contains text and audio data for generating Telugu speech. The dataset was last updated on May 27, 2024.

Tags: Text, Audio, Multilingual, Parquet, Size Categories: 1K<n<10K, Text To Speech, Library: polars, Library: dask, Speech Synthesis, Modality: text, Library: mlcroissant, Library: datasets, Telugu, Region: US, Audio Generation, +1

Speech & Audio

Reazon Speech V2 Denoised: Japanese Speech Audio with Background Noise Removed

3,674 denoised audio files from the Reazon Speech v2 dataset, processed using UVR to remove background music and noise. The dataset was cleaned by author Stardust-minus using eight A800 GPUs over approximately 10 days and was mirrored to Hugging Face by litagin in April 2024.

Tags: Audio, Speech Audio, Japanese Speech, Denoised, Audio Processing, +1

Speech & Audio

Unichart Pretrain Data: 6.9 Million Chart Image-Text Pairs

6,898,333 rows of chart images paired with text queries and labels, hosted on Hugging Face by ahmed-masry and last updated in March 2024. The dataset is structured for training multimodal models, with each row containing an image name, an input query, and an output label. Its primary use appears to be pretraining models for chart understanding and generation tasks.

Tags: Multimodal, Parquet, Machine Learning, Library: polars, Multimodal Pretraining, Library: dask, Size Categories: 1M<n<10M, Modality: text, Library: mlcroissant, Vision Language, Chart Images, Library: datasets, Tabular Data, arXiv: 2305.14761, Region: US, +1

Speech & Audio

IrishMAN: 216,284 Irish Tunes in ABC Notation for Music Generation

216,284 Irish tunes in ABC notation, split into 214,122 for training and 2,162 for validation. The Irish Massive ABC Notation (IrishMAN) dataset was compiled from traditional music sources thesession.org and abcnotation.com. It was created by sander-wood and last updated on March 16.

Tags: Text, Audio, JSON, Task Categories: text-generation, Library: polars, Modality: text, Size Categories: 100K<n<1M, Traditional Music, Library: mlcroissant, Music Generation, Library: datasets, MIDI, Library: pandas, Region: US, Large Scale, MusicXML, License: MIT, ABC Notation, +1

Speech & Audio

Jenny TTS 6H: Text-to-Speech Audio Samples

Jenny TTS 6H is a text-to-speech dataset published on HuggingFace by author shacharu. The dataset was last updated on 2024-05-06. The specific content and scale of the audio samples are not detailed in the available metadata.

Tags: Audio, Text To Speech, Speech Synthesis, Audio Generation, +1

Speech & Audio

BanSpeech: Bangladeshi Bangla Broadcast Speech Benchmark for ASR Evaluation

A benchmark containing approximately 6.52 hours of human-annotated broadcast speech, totaling 8,085 utterances across 13 distinct domains. It is designed for evaluating automatic speech recognition performance in challenging conditions. The dataset was created by SUST-CSE-Speech and last updated on March 9, 2024.

Tags: Audio, Broadcast, Benchmark, Multi Domain, Bangla, Speech Recognition, +1

Speech & Audio

Free Spoken Digit Dataset

10 categories of spoken digits (0-9) provided in an audio format. This dataset serves as an acoustic counterpart to the MNIST handwritten digit collection for speech recognition tasks.

Tags: Audio, Machine Learning, MNIST, Speech Recognition, Spoken Digits, Spoken Language, +1

Speech & Audio

Musica: Audio and Text Data for Multimodal Analysis

Musica is a multimodal dataset hosted on HuggingFace by author zaibutcooler, last updated on May 2, 2024. Its platform tags indicate it contains both audio and text data, likely related to music. The specific content, size, and structure require verification after download.

Tags: Audio, Multimodal, Parquet, Text, Size Categories: 1K<n<10K, Library: polars, Library: dask, Modality: audio, Modality: text, Library: mlcroissant, Library: datasets, Region: US, License: MIT, +1

Speech & Audio

Open Large Bengali ASR Data: 5000 Hours of Speech Audio with Quality Filter

A collection of 5000 hours of Bengali speech audio for automatic speech recognition, aggregated from nine public sources including Common Voice and OpenSLR. The dataset, created by SKNahin and last updated in March 2024, includes a filtering column to identify higher-quality audio segments based on word error rate and word-per-second metrics.

Tags: Audio, Multilingual, Bengali, Natural Language Processing, Speech Recognition, +1

Speech & Audio

InsuranceQA V2: Question Answering Dataset for Insurance Domain

A dataset released as part of a 2015 IEEE ASRU workshop paper by Feng, Minwei, et al. titled 'Applying deep learning to answer selection: A study and an open task.' The data was deconstructed from tokens provided in a GitHub repository by the user 'deccan-ai'.

Tags: Text, Audio, Question Answering, Insurance, Natural Language Processing, Deep Learning, +1

Speech & Audio

EdAcc: 40 Hours of English Conversations with Diverse Accents

EdAcc (The Edinburgh International Accents of English Corpus) is an automatic speech recognition dataset composed of 40 hours of English dyadic conversations. It was created by edinburghcstr and includes speakers with a diverse set of first and second-language English accents, along with linguistic background profiles. The dataset was last updated on February 22.

Tags: Audio, Accent Diversity, English Language, Natural Language Processing, Conversational Speech, Speech Recognition, +1
