ToneWebinars Balalaika: 248 Hours of Annotated Russian Podcast Speech

Name: ToneWebinars Balalaika: 248 Hours of Annotated Russian Podcast Speech
Creator: MTUCI
Published: 2026-02-15T20:28:03
Keywords: Size Categories: 10K<n<100K, Task Categories: text-to-speech, Library: polars, Modality: text, arXiv: 2507.13563, Modality: tabular, Library: mlcroissant, Library: datasets, Library: pandas, Parquet, Region: us, Task Categories: automatic-speech-recognition, Language: ru, License: Apache 2.0

Description

ToneWebinars Balalaika is a 248.9-hour Russian speech corpus curated from podcasts by the MTUCI lab260 team. Released in February 2026, the dataset was processed with the BALALAIKA pipeline to provide high-quality audio for generative speech tasks. It is a refined version of the original ToneWebinars source, filtered specifically for speech synthesis and recognition.

Use Cases

Training Russian text-to-speech (TTS) models using the BALALAIKA-filtered audio segments
Improving automatic speech recognition (ASR) for conversational Russian podcasts
Researching speech generative tasks using the provided transcriptions and audio pairs

Strengths

248.9 hours of total audio duration
Processed via the BALALAIKA pipeline for quality filtering
Apache 2.0 license allows for open research and commercial use
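
A stated total such as 248.9 hours can be checked against per-segment durations once the data is loaded. The following is a minimal sketch of that arithmetic, not the authors' tooling: the column name `duration_s` and the toy rows are assumptions for illustration, and the actual schema of the Parquet files may differ.

```python
from typing import Iterable, Mapping


def total_hours(rows: Iterable[Mapping[str, float]],
                duration_key: str = "duration_s") -> float:
    """Sum per-segment durations given in seconds and return the total in hours.

    `duration_key` is a hypothetical column name; adjust it to the real schema.
    """
    return sum(float(row[duration_key]) for row in rows) / 3600.0


# Toy rows standing in for dataset records; the values are made up.
sample_rows = [
    {"duration_s": 5400.0},
    {"duration_s": 1800.0},
    {"duration_s": 3600.0},
]

hours = total_hours(sample_rows)
print(f"{hours:.1f} h")
```

With the real data, the rows would come from the dataset's Parquet files (for example, loaded through pandas or polars), and the result could be compared with the advertised duration.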

Limitations

Restricted to the Russian language
Genre-specific bias toward podcast-style conversational speech
Potential for residual background noise inherent in original podcast recordings

Provenance

Source: MTUCI lab260, derived from ToneWebinars
Collection Method: Annotated and filtered using the BALALAIKA pipeline
Freshness: Last updated March 2026.
Geography: Russia

See arXiv paper 2507.13563 for technical details on the BALALAIKA filtering and annotation methodology.

Quality Score

Overall: D39
Description: 39
Source: 44
Reputation: 42
Access: 22

Community

46 downloads
2 likes
0 views

Dataset Info

Author: MTUCI
Created: Feb 15, 2026
Updated: Mar 5, 2026
