Name: ERR Video News Transcribed: 40,000 Estonian Broadcast Stories
Creator: TalTechNLP
Published: 2026-03-30T09:14:52
Keywords: Estonian Language, News Transcripts, Text, Broadcast Media, Audio, Speech Recognition, Synthetic

Description

TalTechNLP provides transcriptions for approximately 40,000 video news stories from Estonian National Broadcasting (ERR), totaling around 4,000 hours of audio. The transcriptions were generated automatically with the Gemini-3-Flash-Preview speech recognition model, using related textual news as contextual biasing to improve quality. The dataset was last updated on March 31, 2026.

Use Cases

Training or evaluating automatic speech recognition models based on broadcast news audio.
Analyzing news topics and language use in Estonian media based on the transcribed text.
Improving contextual biasing techniques for ASR systems, building on the described method of leveraging related news text.
Conducting linguistic research on spoken Estonian from a formal news domain.

Strengths

Provides transcriptions for approximately 40,000 news stories.
Covers around 4,000 hours of broadcast content.
Reports a low average Word Error Rate (WER) of around 5% for the transcriptions.
Uses contextual biasing with related text, a technique likely to improve ASR accuracy for named entities and domain terms.
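
The WER figure cited above is the word-level edit distance between a reference transcript and the ASR output, divided by the number of reference words. The following is a minimal sketch of that calculation, not the evaluation script used by the dataset authors. The function name and sample sentences are illustrative assumptions, and the sketch tokenizes on whitespace only, without the case or punctuation normalization that a real evaluation would usually apply.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one of three reference words is dropped by the ASR output.
example_wer = word_error_rate("tere hommikust eesti", "tere eesti")
```

Averaging this value over a manually verified sample of stories is one way to check whether the reported error rate holds for a particular subset of the data.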

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
Transcriptions are generated automatically; manual verification may be required for high-precision applications.
Data may reflect geographic and source bias, since it originates from a single national broadcaster.
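
Because column-level documentation is absent, a reasonable first step after download is to tabulate which fields appear and what value types they hold. The following is a minimal sketch, assuming the records have already been loaded as a list of Python dictionaries. The helper name and the field names in the example (id, text, duration) are hypothetical placeholders, not the dataset's actual schema.

```python
from collections import Counter


def summarize_fields(records):
    """Count, for every field name, how often each Python value type occurs."""
    summary = {}
    for record in records:
        for key, value in record.items():
            summary.setdefault(key, Counter())[type(value).__name__] += 1
    # Convert the counters to plain dicts so the result is easy to print or compare.
    return {key: dict(counts) for key, counts in summary.items()}


# Hypothetical records; real field names must be read from the downloaded files.
sample_records = [
    {"id": "a1", "text": "Tere hommikust", "duration": 12.5},
    {"id": "a2", "text": "Uudised", "duration": None},
]
field_summary = summarize_fields(sample_records)
```

Fields with mixed types or many missing values are the ones most worth checking by hand before relying on them.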

Provenance

Source: Estonian National Broadcasting (ERR)
Collection Method: Automatic transcription using Gemini-3-Flash-Preview speech recognition, with contextual biasing from related textual news.
Freshness: Last updated 2026-03-31 17:04:36; verify that this is still current before use.
Geography: Estonia

License is unknown; users should verify terms of use before downloading.
