Speech recognition, text-to-speech, speaker identification, music classification, audio event detection
1,910 datasets
Speakers_xtts is a dataset hosted on Kaggle. Its title suggests it contains audio data related to speech synthesis, likely for text-to-speech applications. The dataset's specific content, scale, and origin are not detailed in the available metadata.
speakers_xttsv3 is a dataset hosted on Kaggle. The title suggests it contains audio samples for text-to-speech applications. The dataset's author, organization, and specific content details are unknown.
A Vietnamese text-to-speech dataset containing 1,805 paired audio recordings and text transcriptions for fine-tuning VieNeu-TTS models. The dataset was created by author 'quocs' and last updated on February 10, 2026. Audio files are in WAV format at 24kHz, mono, with 16-bit PCM encoding.
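The stated audio format (24 kHz, mono, 16-bit PCM WAV) can be checked on a downloaded file before fine-tuning. The following is a minimal sketch using only Python's standard `wave` module; the function name `matches_expected_format` and its default parameters are illustrative assumptions, not part of the dataset or of any VieNeu-TTS tooling.

```python
import wave


def matches_expected_format(path, rate=24000, channels=1, sample_width_bytes=2):
    """Return True if the WAV file at `path` has the expected sample rate,
    channel count, and sample width (2 bytes = 16-bit PCM).

    Hypothetical helper: the defaults mirror the format given in the
    dataset description (24 kHz, mono, 16-bit).
    """
    with wave.open(path, "rb") as wav:
        return (
            wav.getframerate() == rate
            and wav.getnchannels() == channels
            and wav.getsampwidth() == sample_width_bytes
        )
```

Running this over every file in the download and listing the ones that return `False` is a cheap way to catch clips that would need resampling or downmixing first.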
An index analyzing the impact of 109 independent music venue zones on 4,190 local businesses across the United States. It was created by Stanislas Renard to measure how these zones reinforce local economic resilience, with approximately 95% of surrounding businesses being locally owned. The data categorizes establishments by type and distinguishes between total business impact and specific local business impact.
Phase 0 UVR5/Demucs vocal + instrumental stems from the Music Foundry project. The dataset likely contains separated audio tracks for music source separation tasks. It is hosted on Kaggle, but details on size, format, and creation date are unspecified.
A speech synthesis dataset for the Hausa language. It was published on Kaggle, but the author, organization, and creation date are unknown. The dataset's size, specific content, and structure are not detailed in the available metadata.
An audio classification dataset published on Kaggle. The dataset likely contains audio samples with associated labels for classification tasks. Specific details on size, source, and creation date are not provided in the available metadata.
bn-bd-tts is a dataset hosted on Kaggle. The title suggests it contains data for Bengali text-to-speech synthesis, likely including audio recordings and corresponding text transcripts. Specific details on volume, creator, and update history are not provided in the available metadata.
EMOPIA is a dataset of 1,087 pop piano music clips from 387 songs, annotated with clip-level emotion labels by four dedicated annotators. It was created by researchers including Hsiao-Tzu Hung from Academia Sinica and presented at ISMIR in 2021. The dataset includes multi-modal data in audio and MIDI formats.
A collection of Khmer speech audio files and corresponding transcripts sourced from the Women's Media Centre of Cambodia (WMC) website. The dataset is prepared for machine learning tasks, with scripts provided to process audio and metadata into Parquet files. It was created by user 'vichetkao' and last updated on February 21, 2026.
A Kazakh speech audio dataset published on the Hugging Face platform by the organization ai4kazakh. The dataset was last updated on March 30, 2026. The specific content, size, and collection methodology are not detailed in the available metadata.
XTTS Checkpoint 3000 is a dataset published on Kaggle. The title suggests it contains a checkpoint for an XTTS (text-to-speech) model, likely used for speech synthesis tasks. The specific content, size, and origin of the checkpoint require verification after download.
Trail map locations for select preserved lands along the Massachusetts coast. The dataset is provided by the organization SCIOPS via the NASA Earthdata platform.
A geologic map characterizes the sea floor of Western Massachusetts Bay. It was constructed by the CEOS_EXTRA organization using sidescan-sonar imagery, photography, and sediment samples. The temporal coverage and specific data volume are not provided.
71 articles and 474 FAQs comprise this text corpus focused on UK live music. Published on Kaggle, the dataset likely contains blog posts and guides related to music events. The raw description indicates a total of 174,000 words across the collection.
Five college students spent two months annotating images of shrimp for use with the YOLO26 object detection model. The dataset is designed for computer vision tasks related to counting and detection. Its specific scale and annotation methodology are detailed in the provided description.
30-second audio fragments of Latin music are provided with extracted features. Each fragment includes a 512-dimensional CLAP embedding, 13 MFCCs, and a BPM value. The dataset is hosted on Kaggle, but details about the creator, size, and license are not specified.
VoxCeleb and VoxCeleb2 provide over 1 million audiovisual clips of human speech from celebrities, compiled by researchers at the University of Oxford. This repository aggregates both versions into a single source containing MP4 video and AAC/WAV audio files.
A dataset for benchmarking automatic music transcription (AMT) systems, likely containing audio samples and corresponding transcription outputs or evaluation metrics. It originates from a Data Visualisation course project (DA332) and was published on Kaggle. The specific content and scale require verification after download.
KHM-ASR-Cultural-DDD is a speech dataset published on Kaggle. The title suggests it contains audio recordings for automatic speech recognition, likely related to cultural heritage. Metadata is minimal; the actual content, scale, and origin require verification after download.