Deepspeech Balalaika: Russian Speech Corpus for Generative Tasks

Name: Deepspeech Balalaika: Russian Speech Corpus for Generative Tasks
Creator: lab260
Published: 2026-01-22T15:20:01
Keywords: Russian Language, Speech Synthesis, Audio, Natural Language Processing, Audio Corpus, Speech Recognition

Description

A curated Russian speech dataset for generative speech tasks. The corpus was filtered and annotated by the lab260 team at MTUCI using the BALALAIKA pipeline. It includes genres such as podcasts, public speech, YouTube content, audiobooks, phone calls, and TTS.

Use Cases

Training Russian speech recognition models based on the annotated audio corpus.
Developing text-to-speech systems using diverse genres such as audiobooks and podcasts.
Fine-tuning generative speech models on the filtered, high-quality Russian audio data.

Strengths

Dataset is described as high-quality and meticulously filtered.
Covers multiple genres including podcasts, public speech, YouTube, audiobooks, phone calls, and TTS.
Annotated using a specific pipeline (BALALAIKA) by a named team (lab260 at MTUCI).

Limitations

Description metadata is limited; actual data quality requires manual inspection after download.
Column-level documentation is absent; field semantics must be inferred after download.
Row count is unknown, which may limit suitability assessment.
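
Because column documentation and row count are not published, a first inspection after download might look like the sketch below. It assumes the corpus ships some CSV-style manifest; that file format, the function name summarize_manifest, and the sample columns (path, text, genre) are all hypothetical and are not taken from the dataset page.

```python
import csv
import io
from collections import Counter


def summarize_manifest(csv_text: str) -> dict:
    """Count rows and, for each column, how many values are non-empty."""
    reader = csv.DictReader(io.StringIO(csv_text))
    non_empty = Counter()
    rows = 0
    for row in reader:
        rows += 1
        for name, value in row.items():
            # Missing trailing fields are None; blank fields are "".
            if value not in (None, ""):
                non_empty[name] += 1
    return {
        "rows": rows,
        "columns": list(reader.fieldnames or []),
        "non_empty": dict(non_empty),
    }


# Hypothetical two-row manifest; the real column names are unknown.
sample = "path,text,genre\na.wav,привет,podcast\nb.wav,,audiobook\n"
summary = summarize_manifest(sample)
print(summary["rows"], summary["columns"], summary["non_empty"])
```

A summary like this gives the row count and shows which fields are sparsely populated, which is a reasonable starting point before inferring field semantics by listening to samples.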

Provenance

Source: Deepspeech (GitHub link)
Collection Method: Curated, filtered, and annotated by lab260 at MTUCI using the BALALAIKA pipeline.
Freshness: Last updated 2026-01-29 15:40:21; freshness should be verified.

License is referenced as mpl-2.0 but details require checking the full description on the dataset page.

Quality Score

Overall: C41
Description: 42
Source: 44
Reputation: 44
Access: 26

Community

56 downloads

3 likes

0 views

Dataset Info

Author: lab260
Created: Jan 22, 2026
Updated: Jan 29, 2026
Last synced: Jun 29, 2026
