Speech recognition, text-to-speech, speaker identification, music classification, audio event detection
2,018 datasets
SMAPVEX19-22 Massachusetts Airborne Lidar V001 contains lidar measurements collected by the NSIDC_CPRD organization. The data was gathered in April and August 2022 near Petersham, Massachusetts, as part of a campaign to validate satellite-derived soil moisture estimates. The two acquisition periods were selected to capture differences in forest conditions during "leaf-off" and "leaf-on" seasons.
April and August 2022 ground surface elevations derived from lidar measurements collected near Petersham, Massachusetts. These data were gathered during the SMAPVEX19-22 campaign to validate satellite-derived soil moisture estimates in forested areas. The two acquisition periods characterize differences during 'leaf-off' and 'leaf-on' conditions.
Cv Tts Clean is a speech dataset for text-to-speech applications, created by neongeckocom and uploaded to Hugging Face in September 2022. The name suggests it contains clean audio recordings, likely paired with text transcripts. It is released under the BSD 3-Clause license and carries a US region tag.
Kennslurómur is a collection of audio recordings and corresponding text from instructional lectures recorded in courses at Reykjavík University and the University of Iceland. The dataset is intended for training speech recognition models. The lecturers provided the recordings; the transcripts were produced by a speech recognizer and then proofread by students and a professional proofreader.
A 2021 collection of Polish-language text for punctuation restoration in Automatic Speech Recognition (ASR) workflows. The dataset pairs unpunctuated transcriptions with their punctuated versions for training sequence labeling models.
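A common way to turn such pairs into sequence-labeling data is to tag each word with the punctuation mark that follows it. The minimal sketch below assumes whitespace tokenization and a hypothetical label set (`COMMA`, `PERIOD`, `QUESTION`, `EXCLAMATION`, and `O` for no punctuation); the dataset's actual label scheme and file format may differ. The Polish sentence is an illustrative example, not a sample from the dataset.

```python
import re

# Hypothetical label set: the punctuation mark that follows each word.
PUNCT_LABELS = {",": "COMMA", ".": "PERIOD", "?": "QUESTION", "!": "EXCLAMATION"}


def to_labels(punctuated: str) -> list[tuple[str, str]]:
    """Split punctuated text into (word, label) pairs for sequence labeling."""
    pairs = []
    for token in punctuated.split():
        # Separate the word from at most one trailing punctuation mark.
        match = re.fullmatch(r"(.*?)([,.?!])?", token)
        word, mark = match.group(1), match.group(2)
        if not word:
            # Standalone punctuation tokens are dropped in this simple sketch.
            continue
        pairs.append((word, PUNCT_LABELS.get(mark, "O")))
    return pairs


def unpunctuated(pairs: list[tuple[str, str]]) -> str:
    """Rebuild the model input: the words alone, without punctuation."""
    return " ".join(word for word, _ in pairs)


pairs = to_labels("Dzień dobry, jak się masz?")
print(pairs)
print(unpunctuated(pairs))
```

A model trained on these pairs reads the unpunctuated words and predicts one label per word, which is why the task fits a token-classification setup.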
Solo guitar pieces from the Mutopia Project, converted from MIDI files into text tokens. The collection mainly features Western classical composers such as Sor, Aguado, Carcassi, and Giuliani. The dataset is intended for language modeling and text generation tasks.
A text dataset, according to its modality tag. It carries a US region tag and is listed as compatible with data processing libraries including polars, dask, and datasets. It was last updated on August 29, 2022.
47,723 aviation incident reports sourced from NASA's Aviation Safety Reporting System (ASRS) database. Each entry pairs a detailed narrative account of a safety event with a corresponding summary, making the dataset suitable for text generation tasks such as summarization.
This dataset supports Vietnamese Inverse Text Normalization (ITN), the task of converting spoken-form text into its written form, mainly to make Automatic Speech Recognition (ASR) output more readable. It was created by VietAI and last updated in July 2022.
55,000 full audio tracks annotated with 195 tags spanning genre, instrument, and mood/theme categories. The data is sourced from Jamendo under Creative Commons licenses, and the tags were provided by the original content creators.
12,800 balanced audio samples in WAV format, with transcriptions, from 18 speakers. The collection is assembled from multiple sources, including VCTK, LJSpeech, m-ailabs, and SIWIS, and covers languages such as English, French, German, Luxembourgish, and Portuguese.
1,000 hours of Arabic speech audio sampled at 16 kHz, collected from over 700 YouTube channels. The data spans multiple regions, genres, and dialects to support the development of speech recognition technologies.
Azure Tts Yasmin is a Malay-language text-to-speech dataset created by mesolitica and uploaded to Hugging Face in August 2022. It carries a US region tag. The available metadata does not specify the dataset size, the number of audio samples, or how the data was created.
9.5 hours (1.28 GB) of Vietnamese speech audio paired with text transcripts. The audio was crawled from YouTube audiobooks, and the transcripts were labeled by VinBrain JSC.
The KSS Dataset is a Korean text-to-speech dataset consisting of audio files recorded by a professional female voice actor, with aligned text extracted from books. It was released by the copyright holder, who describes it as, to their knowledge, the first publicly available speech dataset for Korean.
Azure Tts Osman Wikipedia is a text-to-speech dataset created by mesolitica, likely containing synthesized audio of Malay-language Wikipedia articles. It was last updated on July 31, 2022, is hosted on Hugging Face, and carries a text modality tag.
Azure TTS Yasmin Wikipedia contains speech audio generated from Wikipedia text using Microsoft Azure's text-to-speech technology. The dataset was created by the user 'mesolitica' and uploaded to Hugging Face in July 2022. It is categorized as containing over 100,000 rows of data.
Azure Tts Osman is a Malay-language text-to-speech dataset created by mesolitica and uploaded to Hugging Face in July 2022. It carries a US region tag.
A collection of configuration files and training scripts for developing Text-to-Speech models across various public speech datasets. It includes specific parameters for audio preprocessing, model architectures, and training schedules tailored to diverse audio-text corpora.
1,000 audio tracks, each 30 seconds long, spanning 10 music genres with 100 tracks per genre. The dataset provides both the raw WAV files and 8,000 derived mel spectrograms. The audio files are 22,050 Hz, mono, 16-bit WAVs.
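The stated audio format fixes the storage footprint of the raw WAV files. The sketch below is a rough estimate that assumes every clip is exactly 30 s long and uses the plain 44-byte PCM header that Python's standard `wave` module writes; real files may carry extra metadata chunks or differ slightly in length.

```python
import io
import wave

# Format stated in the description: 22,050 Hz, mono, 16-bit PCM, 30 s per clip.
SAMPLE_RATE_HZ = 22050
CHANNELS = 1
SAMPLE_WIDTH_BYTES = 2  # 16 bits per sample
CLIP_SECONDS = 30
NUM_TRACKS = 1000


def pcm_payload_bytes(seconds: int) -> int:
    """Size of the raw sample data for one clip, excluding the WAV header."""
    return seconds * SAMPLE_RATE_HZ * CHANNELS * SAMPLE_WIDTH_BYTES


def wav_file_bytes(seconds: int) -> int:
    """Write a silent clip in memory with the wave module and measure its size."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(CHANNELS)
        wav.setsampwidth(SAMPLE_WIDTH_BYTES)
        wav.setframerate(SAMPLE_RATE_HZ)
        # Zero bytes encode silence; the byte count equals the PCM payload size.
        wav.writeframes(b"\x00" * pcm_payload_bytes(seconds))
    return len(buffer.getvalue())


payload = pcm_payload_bytes(CLIP_SECONDS)  # sample data per clip
per_file = wav_file_bytes(CLIP_SECONDS)    # sample data plus WAV header
total = NUM_TRACKS * per_file              # whole collection, WAV files only
print(payload, per_file, total)
```

This works out to 1,323,000 bytes of sample data per clip (30 × 22,050 × 2) and roughly 1.32 GB for all 1,000 WAV files, before counting the spectrogram images.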