Speech recognition, text-to-speech, speaker identification, music classification, audio event detection
1,971 datasets
A financial report for the City Municipal Cultural Institution "Dnipro Children's Music School No. 12" covering the year 2020. The data comes from a public institution in Ukraine and was published on an open data platform in April 2021. The report likely details the institution's income, expenditures, and financial performance.
OpenSLR 32 provides high-quality, multi-speaker transcribed audio for Sesotho, one of the four South African languages covered by the release. The dataset consists of WAV files with corresponding transcriptions in a TSV file. It was uploaded to Hugging Face by voice-biomarkers in August 2024.
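Pairing the WAV files with their TSV transcriptions is usually the first preprocessing step. The sketch below assumes a hypothetical layout that the listing does not confirm: one row per utterance as `<utterance_id><TAB><transcription>`, no header row, and audio stored at `wavs/<utterance_id>.wav`. The function name `load_transcripts` and the sample rows are illustrative only.

```python
import csv
import io
from pathlib import PurePosixPath


def load_transcripts(tsv_text: str, wav_dir: str = "wavs") -> dict:
    """Map each WAV path to its transcription.

    Hypothetical layout (an assumption, not confirmed by the dataset):
    <utterance_id><TAB><transcription> per row, no header, and audio
    stored at <wav_dir>/<utterance_id>.wav.
    """
    pairs = {}
    # QUOTE_NONE keeps any quote characters inside transcriptions verbatim
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) < 2:
            continue  # skip blank or malformed lines
        utt_id, text = row[0].strip(), row[1].strip()
        pairs[str(PurePosixPath(wav_dir) / f"{utt_id}.wav")] = text
    return pairs


# Illustrative placeholder rows, not real dataset content
sample = "utt_0001\tfirst example sentence\n\nutt_0002\tsecond example sentence\n"
for path, text in load_transcripts(sample).items():
    print(path, "->", text)
```

If the real file has a header row or extra columns, skip the first row and index the transcription column accordingly.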
A financial report detailing the consumption of communal resources by the City Municipal Institution of Culture "Dnipro Children's Music School No. 14". The data covers January to August 2021 and was published on Ukraine's state open data portal. The dataset was last updated on September 9, 2021.
4,000 Romanian sentences recorded across 8 sessions by a single speaker in a hemianechoic chamber using a Sennheiser MKH 800 small-diaphragm condenser microphone. Audio was captured at 96 kHz/24-bit and downsampled to 48 kHz.
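The capture and release sample rates translate directly into storage cost. This rough sketch assumes uncompressed mono PCM and that the 24-bit depth is kept after downsampling; the listing states neither, and `pcm_bytes_per_second` is a hypothetical helper.

```python
def pcm_bytes_per_second(sample_rate_hz: int, bit_depth: int, channels: int = 1) -> int:
    """Data rate of uncompressed PCM audio in bytes per second."""
    return sample_rate_hz * (bit_depth // 8) * channels


# Assumptions: mono (channels=1) and 24-bit retained after downsampling.
capture = pcm_bytes_per_second(96_000, 24)  # 288,000 bytes/s
release = pcm_bytes_per_second(48_000, 24)  # 144,000 bytes/s
print(capture / release)  # → 2.0
```

Halving the sample rate halves the data rate. A proper 2:1 downsample also low-pass filters the signal below the new 24 kHz Nyquist limit before discarding samples, so content above that frequency is not retained.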
1,000 hours of Arabic speech audio sampled at 16 kHz, sourced from over 700 YouTube channels. The collection spans multiple regions, genres, and dialects to support the development of speech recognition technologies.
400,213 audio files totaling 998 hours and 41 minutes of validated Ukrainian speech data. The dataset is a subset of the YODAS2 corpus, curated by the user 'speech-uk' and last updated on October 26, 2025. It is hosted on Hugging Face and associated with Ukrainian speech technology communities.
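The file count and total duration together give the average clip length, a useful figure when choosing batch sizes or maximum input lengths for ASR training. The helper name below is illustrative; the inputs come straight from the listing.

```python
def mean_clip_seconds(hours: int, minutes: int, n_files: int) -> float:
    """Average clip length in seconds, given a total duration and a file count."""
    total_seconds = hours * 3600 + minutes * 60
    return total_seconds / n_files


# Figures from the listing: 998 h 41 min across 400,213 files.
avg = mean_clip_seconds(998, 41, 400_213)
print(f"{avg:.2f} s per clip")  # → 8.98 s per clip
```

This is only a mean; individual clip lengths may vary widely, so a duration histogram is still worth checking before fixing a length cutoff.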
Crowdsourced speech recordings and transcriptions for more than 20 listed languages, including Abkhaz, Basaa, and Cantonese. This is an unofficial conversion of the Mozilla Common Voice Corpus 16.0, providing paired audio and text for multilingual speech technology development.
1,000 hours of Arabic speech audio sampled at 16 kHz, collected from over 700 YouTube channels. The data spans multiple regions, genres, and dialects to support the development of speech recognition technologies.
A 2024 release from ASAPP contains a subset of the Gridspace-Stanford Harper Valley speech corpus, annotated for dialog act classification. The dataset includes English audio and text data tagged for customer service applications.
The Los Angeles MIDI Dataset is a collection of MIDI files for music information retrieval and AI purposes, described as a state-of-the-art, kilo-scale resource. It was created by projectlosangeles and last updated in February 2024.
An interactive calculator from the Department of Energy for estimating electricity production and energy value of grid-connected photovoltaic systems. It allows users to input location, design parameters, and system economics to develop performance estimates for potential installations. The underlying data structure and record count are not specified.
Accented English speech data from the Interspeech 2020 competition, with transcriptions manually proofread for high accuracy. It is suitable for automatic speech recognition, machine translation, and voiceprint recognition tasks.
The Spoken Language Understanding Evaluation (SLUE) benchmark tracks research progress on multiple SLU tasks. It facilitates the development of pre-trained representations by providing fine-tuning and evaluation sets for a variety of SLU tasks. The benchmark was created by ASAPP and focuses on freely available datasets.
A census and noise classification of road infrastructure, covering the road network with an average daily traffic of more than 5,000 vehicles. The dataset is published by the Prefect of the Seine-et-Marne department and was last updated in April 2019. The classification references sound levels defined by French interministerial decrees from 1996 and 2013.
The Nuisance — Meurthe-et-Moselle Land Transport Infrastructure Layer of Type B Strategic Noise Maps dataset was created by the Bureau de Recherches Géologiques et Minières (BRGM) and sourced from CEREMA. It contains strategic noise maps produced under the European Directive 2002/49/EC, representing sectors affected by noise for assessment and urban planning purposes. The data was last updated on April 5, 2019.
The European Directive 2002/49/EC mandates a harmonised assessment of environmental noise exposure. This dataset contains strategic noise maps for type B land transport infrastructure in the Meurthe-et-Moselle department of France, produced by CEREMA and aggregated using the QGIS MIZOGEO plugin. The maps were last updated on April 5, 2019.
The Akuapem Multispeaker Audio Transcribed dataset provides speech recordings and transcriptions in Akuapem Twi, a dialect of Akan spoken in Ghana. Created by michsethowusu and last updated on March 15, 2025, it is designed for training and evaluating automatic speech recognition models. According to its description, it is derived from the Financial dataset.
3,000 semi-natural Persian speech utterances totaling 3 hours and 25 minutes of audio extracted from online radio plays. The collection features 87 native speakers expressing five primary emotional states including anger, fear, happiness, and sadness.
The LibriSpeech corpus contains approximately 1,000 hours of read English speech, sampled at 16 kHz. Derived from audiobooks in the LibriVox project, it was prepared by Vassil Panayotov with the assistance of Daniel Povey.
The largest open-source Persian Automatic Speech Recognition (ASR) dataset, collected from various sources. The dataset was created by farsi-asr and was last updated on March 13, 2025. Associated collection code is available in a GitHub repository.