Description
TalTechNLP provides transcriptions for approximately 40,000 video news stories from Estonian National Broadcasting (ERR), totaling around 4,000 hours of audio. The transcriptions were generated automatically using the Gemini-3-Flash-Preview speech recognition model, with contextual biasing applied using related textual news to improve quality. The dataset was last updated on March 31, 2026.
Use Cases
Training or evaluating automatic speech recognition models based on broadcast news audio.
Analyzing news topics and language use in Estonian media based on the transcribed text.
Improving contextual biasing techniques for ASR systems using the described method of leveraging related text.
Conducting linguistic research on spoken Estonian from a formal news domain.
Strengths
Provides transcriptions for a large collection of approximately 40,000 news stories.
Covers around 4,000 hours of broadcast content.
Reports a relatively low average Word Error Rate (WER) of around 5% for the transcriptions.
Uses contextual biasing with related text, a technique likely to improve ASR accuracy for named entities and domain terms.
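For readers who want to check the reported error rate on their own samples, the following is a minimal sketch of a word-level WER computation: the number of word substitutions, deletions, and insertions needed to turn the reference into the hypothesis, divided by the number of reference words. How the dataset's ~5% figure was actually computed (text normalization, casing, punctuation handling) is not documented here, so this function is only an assumption about the general metric, and the example strings are made up.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are split on whitespace with no normalization;
    the dataset's own WER may have used a different text normalization.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1        # reference word missing in hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Made-up example: 20 reference words, one of them transcribed wrongly.
reference_words = [f"w{i}" for i in range(20)]
hypothesis_words = list(reference_words)
hypothesis_words[7] = "wrong"
example_wer = word_error_rate(" ".join(reference_words), " ".join(hypothesis_words))
```

One wrong word in twenty corresponds to a WER of 5%, which gives a feel for the reported average: roughly one word in twenty is expected to differ from a human transcript.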
Limitations
Column-level documentation is absent; field semantics must be inferred after download.
Transcriptions are generated automatically; manual verification may be required for high-precision applications.
Data comes from a single national broadcaster and may therefore reflect geographic and source bias.
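Because column-level documentation is absent, a first step after download could be a quick field summary. The sketch below assumes the records have already been loaded as a list of dictionaries; the field names `id`, `duration_s`, and `text` are hypothetical placeholders, not the dataset's actual schema.

```python
from collections import defaultdict


def summarize_fields(rows):
    """Collect observed types, non-null counts, and one example value per field."""
    summary = defaultdict(lambda: {"types": set(), "non_null": 0, "example": None})
    for row in rows:
        for name, value in row.items():
            info = summary[name]  # registers the field even if every value is None
            if value is None:
                continue
            info["types"].add(type(value).__name__)
            info["non_null"] += 1
            if info["example"] is None:
                info["example"] = value
    return dict(summary)


# Hypothetical records standing in for the downloaded data.
sample_rows = [
    {"id": "story-001", "duration_s": 92.5, "text": "tere õhtust"},
    {"id": "story-002", "duration_s": None, "text": "head uudised"},
]
summary = summarize_fields(sample_rows)
for field, info in summary.items():
    print(field, sorted(info["types"]), info["non_null"], repr(info["example"]))
```

A summary like this helps with inferring field semantics and also shows which fields have missing values that may need manual review.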
Provenance
Source
Estonian National Broadcasting (ERR)
Collection Method
Automatic transcription using Gemini-3-Flash-Preview speech recognition, with contextual biasing from related textual news.
Freshness
Last updated 2026-03-31 17:04:36; freshness should be verified.
Geography
Estonia
License
Unknown; users should verify terms of use before downloading.