Speech recognition, text-to-speech, speaker identification, music classification, audio event detection
2,013 datasets
Naija-Stopwords is a list of collected stopwords from the four most widely spoken languages in Nigeria: Hausa, Igbo, Nigerian-Pidgin, and Yorùbá. It is part of the Naija-Senti project and was authored by HausaNLP. The dataset was last updated on June 18, 2023.
10 hours of speech recordings and transcriptions from the ATCOSIM project for Air Traffic Management. The data captures interactions between controllers and pilots during real-time simulations to support automatic speech recognition research.
6 pre-trained base models for SoVITS 4.0 voice conversion, featuring 768-dimensional vectors and layer 12 configurations. These models were trained on the m4singer and vctk datasets, reaching up to 320,000 training steps with loss values as low as 14.1.
A dataset for Automatic Speech Recognition (ASR) containing Hebrew speech audio files. The dataset was created by author 'imvladikon' and was last updated in May 2023.
Audio files for DCASE 2022 - Task 3, sourced from the AudioSet ontology. The included labels are limited to a subset of sound events, such as female speech, male speech, clapping, and telephone sounds.
City of Pittsburgh Police arrest data for 2023, covering offenses including felonies, parole violations, and failures to appear for trial. Information is reported at the block or intersection level, except for sex crimes, which are aggregated to the police zone level. The dataset excludes incidents handled solely by other police departments operating within the city.
10.4 hours of Khmer speech audio with a mean duration of 2.5 seconds per sample, compiled by author seanghay and last updated in May 2023. It contains audio clips ranging from 0.45 to 19.39 seconds, sampled at 16 kHz. The dataset is hosted on Hugging Face and is associated with text-to-speech and automatic speech recognition tasks.
Allegheny County dog license records, including license dates, breeds, names, and zip codes. This dataset does not contain data for dogs within the City of Pittsburgh. The row count, column count, and temporal coverage are not specified.
A dataset for music emotion recognition and affective computing, sourced from the Hugging Face platform. It was created by author akhmedsakip and last updated in May 2023.
A summary of building permits issued by the City of Pittsburgh's Department of Permits Licenses and Inspections (PLI). The dataset was last updated in May 2023. The specific number of records and features is unknown.
The MusicCaps dataset contains 5,521 music examples. Each example is labeled with an English aspect list and a free-text caption written by musicians.
Telugu_ASR_corpus is a dataset for automatic speech recognition in the Telugu language, authored by eswardivi. The dataset was last updated on Hugging Face on April 10, 2023. Specific details on size, format, and collection methodology are not provided in the available metadata.
Image data categorized into over 34 indoor scene classes including specialized environments like 'studiomusic', 'hospitalroom', and 'inside_bus'. It provides labeled examples for computer vision tasks focused on identifying specific architectural and functional interior spaces.
Bloom-speech is a dataset of text-aligned speech audio sourced from bloomlibrary.org, containing over 50 languages including many low-resource ones. It is intended for training and testing speech-to-text or text-to-speech models. The dataset was created by sil-ai and was last updated in February 2023.
97 hours of parliamentary speeches from Poland. The audio is stored in .wav format and was published on the ClarinPL website.
311 Data contains service requests for the City of Pittsburgh, collected by the 311 Response Center. Requests originate from phone calls, tweets, emails, a city website form, and a mobile application. The dataset was last updated on January 24, 2023.
CORAA v1.1 contains 290.77 hours of Brazilian Portuguese audio with transcriptions, segmented into over 400,000 audio files. The dataset is compiled from five distinct speech projects, including academic recordings and TEDx talks, and is validated for automatic speech recognition research.
Multiple audio datasets and signal transforms categorized for the PyTorch deep learning framework. The resource provides standardized data structures for audio files and preprocessing functions to support acoustic model development.
ESC-50 is a labeled collection of 2,000 environmental audio recordings. It contains 50 distinct sound classes, each with 40 examples, created by K. J. Piczak. The dataset was published for the 23rd ACM Multimedia Conference in 2015.
Medical Asr En is a dataset for automatic speech recognition in a medical context, published on the Hugging Face platform by author jarvisx17. The dataset was last updated on January 30, 2023. Its specific content, size, and structure require verification after download.