DataSalon

Discover quality datasets for AI training — aggregated from 40+ platforms, curated by AI.

Speech & Audio Datasets | DataSalon

Speech & Audio

Speech recognition, text-to-speech, speaker identification, music classification, audio event detection

2,013 datasets

Jenny TTS Dataset: 30 Hours of Irish-Accented Speech for Synthesis

A 30-hour voice dataset recorded by an Irish speaker named Jenny. It includes audio of newspaper headlines, YouTube video transcripts, excerpts from the books '1984' and 'Little Women', Wikipedia articles, recipes, Reddit comments, song lyrics, and transcripts from the show 'Friends'. Audio files are in 48 kHz, 16-bit PCM format, and the dataset was last updated on HuggingFace in January 2024.

Tags: Audio, Text To Speech, Voice Dataset, Irish Accent, Speech Training, Audio Synthesis, +1 more
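The Jenny listing states a 48 kHz sample rate and 16-bit PCM samples. A minimal sketch of how a downloaded file could be checked against those figures follows, assuming the audio is stored as WAV (the listing does not say which container is used). The helper names `wav_format`, `matches_listing`, and `write_silence` are hypothetical, and the demo runs on a synthetic silent file rather than on real dataset audio.

```python
import os
import tempfile
import wave


def wav_format(path):
    """Return (sample_rate_hz, bits_per_sample, channels) for a PCM WAV file."""
    with wave.open(path, "rb") as wav:
        return wav.getframerate(), wav.getsampwidth() * 8, wav.getnchannels()


def matches_listing(path, sample_rate_hz=48000, bits_per_sample=16):
    """Check a file against the sample rate and bit depth stated in the listing.

    The channel count is not checked because the listing does not state it.
    """
    rate, bits, _channels = wav_format(path)
    return rate == sample_rate_hz and bits == bits_per_sample


def write_silence(path, sample_rate_hz, seconds=0.1):
    """Write a short mono 16-bit silent WAV file (demo helper only)."""
    with wave.open(path, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)  # 2 bytes per sample = 16-bit
        wav.setframerate(sample_rate_hz)
        wav.writeframes(b"\x00\x00" * int(sample_rate_hz * seconds))


# Demo on synthetic files: one at 48 kHz, one at 22,050 Hz.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "demo.wav")
    write_silence(demo_path, 48000)
    demo_ok = matches_listing(demo_path)
    write_silence(demo_path, 22050)
    demo_mismatch = matches_listing(demo_path)

print(demo_ok, demo_mismatch)
```

In practice, `matches_listing` would be pointed at a file from the downloaded dataset instead of the synthetic demo file.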

Voice Kusanaginene: Project Sekai Nene Kusanagi Vocal Dataset

A collection of isolated dry voice recordings and a corresponding nene_org.txt label file for the character Nene Kusanagi. These vocal stems are voiced by Machico and sourced from the game Project Sekai for use in speech synthesis and voice cloning.

Tags: audiofolder; size category: 1K<n<10K; task category: text-to-speech; task category: audio-to-audio; modality: audio; pjsk; license: gpl-3.0; library: mlcroissant; prsk; library: datasets; task category: other; region: us; language: ja; +1 more

English Speech Dataset with Dialog Act Annotations

This 2024 release from ASAPP contains a subset of the Gridspace-Stanford Harper Valley speech corpus, annotated for dialog act classification. The dataset includes English audio and text data tagged for customer service applications.

Tags: Text, Audio, English, Customer Service, Dialog Act Classification, Speech Recognition, +1 more

SLUE: Spoken Language Understanding Benchmark

The Spoken Language Understanding Evaluation (SLUE) benchmark tracks research progress on multiple SLU tasks. It facilitates the development of pre-trained representations by providing fine-tuning and evaluation sets for a variety of SLU tasks. The benchmark was created by ASAPP and focuses on freely available datasets.

Tags: Text, Audio, Spoken Language Understanding, Benchmark, Speech Processing, Natural Language Processing, +1 more

Paulmooney Medical ASR Data: Speech Recognition for Clinical Context

Paulmooney Medical ASR Data is a dataset for automatic speech recognition in a medical context, published on HuggingFace by yashtiwari. It was last updated on February 16, 2024. The specific content, scale, and collection methodology require verification after download.

Tags: Audio, Clinical Audio, Medical ASR, Healthcare, Speech Recognition, +1 more

MUVideo: 13,203 Music Files for Image-to-Music Generation

13,203 music files with a total playtime of 36.72 hours, generated using the MU-LLaMA and VideoMAE captioning models. The dataset was created by M2UGen to train the M2UGen model and was last updated on 2024-01-02.

Tags: Audio, Multimodal, Multimodal Generation, Image To Music, Music Generation, Computer Vision, Video Captioning, Synthetic, +1 more

Common Voice 16.0: Mozilla Multilingual Speech Corpus

An unofficial conversion of the Mozilla Common Voice Corpus 16.0 that aggregates crowdsourced speech recordings and transcriptions for over 20 listed languages, including Abkhaz, Basaa, and Cantonese. It provides paired audio and text data for multilingual speech technology development.

Tags: languages: cy, cnh, ckb, ar, br, ca, cv, bn, bg, cs, ab, be, az, as, ast, am, bas, af, ba; task category: automatic-speech-recognition; +1 more

Multilingual TTS: Text-to-Speech Audio Samples

MohamedRashad compiled a dataset of text-to-speech samples designed to showcase linguistic diversity. The dataset page was last updated on December 12, 2023. The description suggests the collection likely contains speech samples across multiple languages.

Tags: Audio, Multilingual, Text To Speech, Speech Data, Audio Synthesis, +1 more

LibriSpeech LM: Language Modeling Resources for LibriSpeech ASR

800 million words of normalized text and pre-trained n-gram models derived from 14,500 public domain books. These resources provide the linguistic foundation for the LibriSpeech ASR corpus across multiple model formats.

Tags: source datasets: original; task category: text-generation; language: en; language creators: found; size category: 10M<n<100M; license: cc0-1.0; annotations creators: no-annotation; task ID: language-modeling; region: us; multilinguality: monolingual; +1 more

Chilean Spanish Speech Recordings for Model Training

7 hours of transcribed audio recordings of Chilean Spanish sentences. The dataset was created by author ylacombe from restructured OpenSLR archives and was last updated in November 2023.

Tags: Audio, Text To Speech, Chilean Spanish, Speech Recognition, +1 more

Chilean Spanish Speech Audio with 7 Hours of Transcriptions

Chilean Spanish audio data consisting of 7 hours of transcribed, high-quality sentences recorded by 31 volunteers. The dataset was created by ylacombe and restructured from original OpenSLR archives for easier streaming. It was last updated on November 27, 2023.

Tags: Audio, Transcription, Chilean, +1 more

LibriSpeech Alignments: 1,000 Hours of English Speech with Forced Alignments

LibriSpeech is a 1,000-hour corpus of 16 kHz read English speech derived from audiobooks, designed for automatic speech recognition. This version includes alignments generated by the Montreal Forced Aligner (MFA). The dataset was uploaded to Hugging Face by gilkeyio and last updated on November 22, 2023.

Tags: Audio, Multimodal, Audio Alignments, Audiobooks, Natural Language Processing, English Speech, Speech Recognition, Synthetic, +1 more

Wiki En ASR Adapt: Wikipedia Text for ASR Model Adaptation

English Wikipedia text and ASR error data presented in an ASRU-2023 paper. It contains 4.3 million unique words or phrases from Wikipedia titles occurring in 33.8 million paragraphs, plus 26 million phrase pairs representing ASR recognition errors. The dataset was created by bene-ges and last updated on Hugging Face in December 2023.

Tags: Text, Wikipedia, Large Scale, Natural Language Processing, ASR Adaptation, Speech Recognition, Text Corpus, +1 more

CML-TTS: Multilingual Text-to-Speech Audiobooks from LibriVox

CML-TTS is a multilingual Text-to-Speech dataset developed at the Center of Excellence in Artificial Intelligence (CEIA) of the Federal University of Goias. It comprises audiobooks sourced from public domain books on Project Gutenberg, read by volunteers from the LibriVox project, and includes recordings in languages such as Dutch, German, French, Italian, and Polish. The dataset was last updated on the Hugging Face platform on 2023-11-24.

Tags: Text, Audio, Multilingual, Text To Speech, Speech Synthesis, Audiobooks, Synthetic, +1 more

MASC: Massive Arabic Speech Corpus

1,000 hours of Arabic speech audio sampled at 16 kHz, sourced from over 700 YouTube channels. The collection spans multiple regions, genres, and dialects to support the development of speech recognition technologies.

Tags: language: ar; language creators: crowdsourced; license: cc-by-nc-4.0; annotations creators: crowdsourced; region: us; +1 more

HiFiTTS: Multi-Speaker English Speech Synthesis from Public Domain Audio Books

Hi-Fi TTS is a multi-speaker English text-to-speech dataset derived from LibriVox's public domain audio books and Project Gutenberg texts. The dataset was created by MikhailT and was last updated on Hugging Face on November 30, 2023. Its specific size, row count, and file formats are not detailed in the provided metadata.

Tags: Text, Audio, Text To Speech, Speech Synthesis, Multi Speaker, Audio Books, +1 more

Dummy Optimus Prime TTS: Text-to-Speech Audio Samples

Dummy Optimus Prime TTS is a dataset hosted on HuggingFace by the author ylacombe. It was last updated on December 20, 2023. Judging from its title, the dataset likely contains audio samples or related data for text-to-speech synthesis.

Tags: Audio, Text To Speech, Transformers, Voice Generation, Audio Synthesis, +1 more

GTZAN Musical Genre Audio Dataset with 10 Genres

The GTZAN dataset contains 1,000 audio tracks for musical genre classification, each 30 seconds long. It covers 10 distinct genres, with 100 tracks per genre, all formatted as 22,050 Hz mono 16-bit WAV files.

Tags: region: us; +1 more
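The figures in the GTZAN listing (1,000 tracks, 30 seconds each, 22,050 Hz, mono, 16-bit) are enough to estimate total playtime and uncompressed audio size. The sketch below does that arithmetic. The helper name `pcm_bytes` is hypothetical, the estimate ignores WAV header overhead, and it assumes every track is exactly 30 seconds long.

```python
def pcm_bytes(num_tracks, seconds_per_track, sample_rate_hz, bits_per_sample, channels):
    """Estimate raw PCM payload size in bytes (WAV headers not included)."""
    bytes_per_second = sample_rate_hz * (bits_per_sample // 8) * channels
    return num_tracks * seconds_per_track * bytes_per_second


# Values taken from the listing above.
NUM_TRACKS = 1000
SECONDS_PER_TRACK = 30
SAMPLE_RATE_HZ = 22050
BITS_PER_SAMPLE = 16
CHANNELS = 1  # mono

total_hours = NUM_TRACKS * SECONDS_PER_TRACK / 3600
total_bytes = pcm_bytes(
    NUM_TRACKS, SECONDS_PER_TRACK, SAMPLE_RATE_HZ, BITS_PER_SAMPLE, CHANNELS
)

print(f"{total_hours:.2f} hours, {total_bytes / 1e9:.2f} GB uncompressed")
```

Under these assumptions the collection comes to about 8.3 hours of audio and roughly 1.32 GB of raw samples, which is a quick way to budget disk space before downloading.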

MixologyDB: Music Mixing Data for Intelligent Production

MixologyDB is a dataset created to advance the field of intelligent music production, specifically targeting music mixing in a digital audio workstation. The dataset was created by mclemcrew and was last updated on Hugging Face in November 2023. Its specific size, row count, and column structure are not detailed in the provided metadata.

Tags: Audio, Machine Learning, Digital Audio Workstation, Audio Mixing, Music Production, +1 more

Iban Language Speech Corpus for ASR Training

Audio files and corresponding transcriptions for training Automatic Speech Recognition models for the Iban language. The dataset was created by user 'meisin123' and was last updated in November 2023. It is hosted on the Hugging Face platform.

Tags: Text, Audio, Audio Data, Natural Language Processing, Low Resource Language, Speech Recognition, Iban Language, +1 more
