Description
This unified Danish speech recognition dataset combines approximately 3.5 million audio samples from seven distinct sources, totaling roughly 16,000 hours of speech. The collection includes European and Danish Parliament recordings, read-aloud and conversational speech, broadcast media, and crowd-sourced samples. It was created by syvai and was last updated on Hugging Face in April 2026.
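As a rough sanity check on these figures, the two approximate totals imply a mean clip length. The sketch below only does that arithmetic; the constant and function names are illustrative, and the result is no more precise than the rounded inputs.

```python
# Both figures are the approximate totals quoted in the description.
TOTAL_HOURS = 16_000
TOTAL_SAMPLES = 3_500_000


def mean_clip_seconds(total_hours: float, total_samples: int) -> float:
    """Return the mean clip length in seconds implied by the two totals."""
    return total_hours * 3600 / total_samples


avg_seconds = mean_clip_seconds(TOTAL_HOURS, TOTAL_SAMPLES)
print(f"Mean clip length: about {avg_seconds:.1f} s")  # about 16.5 s
```

A mean of about 16.5 seconds says nothing about the spread; individual sources may contain much shorter or much longer clips.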
Use Cases
Train Danish speech recognition models on the large-scale, multi-source audio collection.
Fine-tune models for parliamentary speech recognition using the VoxPopuli and ftspeech sources.
Develop models for conversational speech understanding using the CoRal-v3 conversation subset.
Benchmark ASR performance on read-aloud speech using the CoRal-v3 read_aloud and nst-da sources.
Build broadcast media transcription systems using the nota source.
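Several of these use cases depend on selecting one or two sources from the combined collection. The sketch below shows one way that selection might look, assuming each record carries a field naming its source. The field name `source` and the sample records are hypothetical, since the column layout is not documented; only the source names come from this page.

```python
from typing import Any, Dict, Iterable, Iterator, Set

# Source names as listed on this page for the parliamentary use case.
PARLIAMENTARY_SOURCES = {"VoxPopuli", "ftspeech"}


def filter_by_source(
    records: Iterable[Dict[str, Any]],
    wanted: Set[str],
    source_field: str = "source",  # assumption: the real column name is unknown
) -> Iterator[Dict[str, Any]]:
    """Yield only the records whose source field is in `wanted`."""
    for record in records:
        if record.get(source_field) in wanted:
            yield record


# Made-up records for illustration only; not real dataset rows.
sample_records = [
    {"source": "VoxPopuli", "text": "eksempel 1"},
    {"source": "nota", "text": "eksempel 2"},
    {"source": "ftspeech", "text": "eksempel 3"},
]

parliamentary = list(filter_by_source(sample_records, PARLIAMENTARY_SOURCES))
print([r["source"] for r in parliamentary])
```

Writing the filter as a generator keeps it usable on a streamed dataset, where loading every record into memory at once is not practical.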
Strengths
Large scale with approximately 3.5 million audio samples.
Diverse speech sources covering parliamentary, conversational, read-aloud, broadcast, and crowd-sourced contexts.
Significant total duration of roughly 16,000 hours of Danish speech.
Limitations
Column-level documentation is absent; field semantics must be inferred after download.
File formats, licensing, and detailed sample structure are not stated in the available metadata.
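Because the fields are undocumented, a first step after download is usually to look at one record and note what each field holds. The helper below is a minimal sketch of that inspection in plain Python. The example record and its field names (`audio`, `text`, `duration`) are hypothetical stand-ins, not the dataset's actual schema.

```python
from typing import Any, Dict


def describe_fields(record: Dict[str, Any]) -> Dict[str, str]:
    """Map each field name to its Python type, plus a length for sized values."""
    summary = {}
    for name, value in record.items():
        kind = type(value).__name__
        if isinstance(value, (str, bytes, list, tuple, dict)):
            kind += f" (len {len(value)})"
        summary[name] = kind
    return summary


# Hypothetical record for illustration; the real field names are unknown.
example_record = {
    "audio": {"array": [0.0, 0.1, -0.1], "sampling_rate": 16000},
    "text": "hej verden",
    "duration": 1.5,
}

for name, kind in describe_fields(example_record).items():
    print(f"{name}: {kind}")
```

Running the same helper on a handful of records from each source would show whether all seven sources share one schema or differ in optional fields.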
Provenance
Source
Combined from 7 sources: VoxPopuli, ftspeech, CoRal-v3 read_aloud, nst-da, CoRal-v3 conversation, nota, Common Voice 17.
Collection Method
Likely aggregated and unified from existing public speech datasets.
Freshness
Last updated 2026-04-06 13:10:45 (time zone not stated); verify freshness before use.
Geography
Denmark (Danish language focus).
License information is unknown and must be verified for each constituent source before commercial use.