Speech recognition, text-to-speech, speaker identification, music classification, audio event detection
1,926 datasets
A large-scale collection of First-Order Ambisonic (FOA) Room Impulse Responses (RIRs) generated through high-fidelity hybrid acoustic simulation. This dataset is a specialized version derived from the 7th-order HiFi-HARP source to support spatial audio research in sound localization and dereverberation.
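How the first-order subset relates to a 7th-order source can be illustrated with a little channel arithmetic. The sketch below is not the dataset authors' procedure; it assumes the source responses use ACN channel ordering, in which an order-N signal has (N + 1)^2 channels and the lower-order channels come first. The function names are hypothetical.

```python
def ambisonic_channel_count(order):
    """Number of channels in a full-sphere ambisonic signal of the given order."""
    if order < 0:
        raise ValueError("order must be non-negative")
    return (order + 1) ** 2


def truncate_to_order(channels, target_order):
    """Keep only the channels needed for target_order.

    Assumes `channels` is a list of per-channel sample lists in ACN order,
    so the first (target_order + 1)^2 entries form the lower-order signal.
    """
    keep = ambisonic_channel_count(target_order)
    if len(channels) < keep:
        raise ValueError("input has fewer channels than the target order needs")
    return channels[:keep]


# A 7th-order response has 64 channels; its first-order subset has 4.
hoa_rir = [[0.0, 0.0, 0.0] for _ in range(ambisonic_channel_count(7))]
foa_rir = truncate_to_order(hoa_rir, 1)
print(len(hoa_rir), len(foa_rir))
```

If the files use a different channel ordering or need a normalization change, simple truncation like this would not be sufficient.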
High-fidelity audio recordings of viola performances across multiple musical genres. The dataset likely contains spectral and temporal features extracted from the audio signals. The author, organization, and specific collection details are unknown.
A speech dataset titled 'josh-talk-ASR-dataset' hosted on Kaggle. The dataset likely contains audio recordings and corresponding transcriptions for training automatic speech recognition systems. Specific details on volume, contributors, and creation date are unavailable in the provided metadata.
60,233 speech utterances from 20 Japanese speakers, totaling approximately 90.6 hours of audio. The dataset is formatted for LJSpeech compatibility and optimized for training TTS models like Piper. Audio samples have a 22,050 Hz sample rate.
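For readers unfamiliar with the LJSpeech layout, the following is a minimal sketch of reading its pipe-delimited metadata file. It assumes the usual convention of one clip per line in the form `id|text` or `id|text|normalized text`; the actual file name and column layout of this dataset should be checked after download. The function name and sample rows are hypothetical.

```python
import csv
import io


def read_ljspeech_metadata(handle):
    """Return (clip_id, text) pairs from a pipe-delimited metadata stream.

    When a line has three fields, the last one (the normalized
    transcription in the LJSpeech convention) is used as the text.
    """
    rows = []
    # QUOTE_NONE keeps quotation marks in transcripts as literal characters.
    reader = csv.reader(handle, delimiter="|", quoting=csv.QUOTE_NONE)
    for fields in reader:
        if not fields:
            continue  # skip blank lines
        if len(fields) < 2:
            raise ValueError("expected at least 'id|text', got: %r" % fields)
        rows.append((fields[0], fields[-1]))
    return rows


# Hypothetical sample rows, for illustration only.
sample = "utt_0001|konnichiwa\nutt_0002|raw text|normalized text\n"
pairs = read_ljspeech_metadata(io.StringIO(sample))
print(pairs)
```

With a real file, the same function can be given a handle opened with `open(path, encoding="utf-8", newline="")`.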
Kashmiri Text-to-Speech | Speech-to-Text is a dataset hosted on Kaggle aimed at enabling speech synthesis and digital accessibility for the Kashmiri language. The dataset's specific size, format, and structure are not detailed in the provided metadata. Its author, organization, and last update date are unknown.
A dataset designed for the speaker verification task. The dataset's author, size, and specific contents are not detailed in the provided metadata. It is hosted on the Kaggle platform.
A collection of Vietnamese acronym and transliteration lexicons for text normalization and text-to-speech applications. The dataset is hosted on Kaggle and is associated with platform tags for text and speech processing. Specific details on size, authorship, and update frequency are not provided.
A speech recognition dataset for the Nepali language, published on the Hugging Face platform by Aadarsh17. The dataset was last updated on February 12, 2026. Its specific content, size, and structure require verification after download.
2025 boundaries for county subdivisions in Massachusetts, as reported through the Census Bureau's Boundary and Annexation Survey and Participant Statistical Areas Program. This shapefile extract from the MAF/TIGER System provides geographic and cartographic information for legally-recognized minor civil divisions and statistical census county divisions.
Audio features are paired with emotion and genre labels for analysis. The dataset is multimodal, combining audio signal data with categorical annotations. Specific row counts, column details, and creation metadata are unavailable.
Segmented deepfake speech audio clips aggregated from 4 public source datasets. The audio is partitioned into 2-second clips with a 1-second overlap to provide consistent input lengths for acoustic feature extraction and temporal analysis.
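The segmentation described above (2-second clips, 1-second overlap) corresponds to a sliding window that advances by 1 second. The sketch below computes the sample boundaries of each clip; it is an illustration, not the dataset's own preprocessing code. Dropping a trailing partial clip is an assumption, as are the function name and the 16,000 Hz rate in the example.

```python
def segment_bounds(num_samples, sample_rate, clip_seconds=2.0, overlap_seconds=1.0):
    """Return (start, end) sample indices for fixed-length overlapping clips.

    A trailing remainder shorter than one clip is dropped (an assumption;
    the dataset may pad it instead).
    """
    clip = int(round(clip_seconds * sample_rate))
    hop = clip - int(round(overlap_seconds * sample_rate))
    if clip <= 0 or hop <= 0:
        raise ValueError("clip length must be positive and exceed the overlap")
    return [(start, start + clip)
            for start in range(0, num_samples - clip + 1, hop)]


# A 5-second recording at an assumed 16,000 Hz yields four 2-second clips.
bounds = segment_bounds(num_samples=5 * 16000, sample_rate=16000)
print(bounds)
```

Each clip shares its second half with the first half of the next one, so every clip has the same length regardless of the source recording's duration.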
XLSR 5 Epochs Telugu ASR is a dataset for training or evaluating automatic speech recognition models for the Telugu language. The dataset is hosted on the Kaggle platform, but its specific contents, size, and creation details are not provided in the available metadata. The title suggests it may be related to a cross-lingual speech representation (XLSR) model fine-tuned for five epochs.
Records is a dataset containing metadata for music releases. The dataset is tagged for Arts and Entertainment and Audio applications. Specific details on record count, features, and provenance are unavailable.
music_db_cinematic_video_edit is a dataset hosted on Kaggle. The title suggests it contains music or audio-visual material intended for use in video editing, particularly for cinematic projects. The dataset's specific contents, size, and origin are not detailed in the available metadata.
Hakka-language audio recordings and transcriptions form a pre-training dataset for the Taiwan-Tongues-ASR-CE project. The dataset is packaged in WebDataset format for direct use with PyTorch and Hugging Face libraries. It was created by the adi-gov-tw organization and last updated in December 2025.
A Kaggle-hosted dataset for audio classification tasks, likely containing audio files or features for music genre identification. The dataset is intended for training or fine-tuning pre-trained models. Specific details on size, origin, and creation date are not provided in the available metadata.
Urdu-language speech audio samples annotated for emotional content. The dataset is designed for emotion recognition and spoken language processing tasks. The specific number of audio files, features, and rows is unknown.
Tamil-language speech recordings related to the travel industry. The specific number of audio files, their duration, and associated metadata are not provided.
A collection of Spotify tracks across multiple genres paired with their specific audio features. The dataset is updated on a weekly basis to reflect current music trends and library additions.
This dataset aggregates audio samples from 4 public speech sources, processed into 2-second segments with a 1-second overlap. The collection focuses on deepfake voice detection through the application of MFCC (Mel-frequency cepstral coefficients) features extracted from the segmented clips.