Deepspeech Balalaika

Name: Deepspeech Balalaika
Creator: MTUCI
Published: 2026-01-22T15:20:01
Keywords: task categories: text-to-speech, automatic-speech-recognition; language: ru; modality: text, tabular; format: parquet; size category: 100K<n<1M; libraries: datasets, pandas, polars, mlcroissant; arXiv: 2507.13563; license: mpl-2.0; region: us

Description

This Russian speech corpus contains audio recordings across diverse genres including podcasts, public speeches, YouTube content, audiobooks, and phone calls. The dataset was processed using the BALALAIKA pipeline by the MTUCI lab260 team to provide high-quality annotations for generative speech tasks.

Use Cases

Train generative speech models using the high-quality Russian audio samples and BALALAIKA-generated annotations.
Evaluate automatic speech recognition performance across diverse acoustic domains such as phone calls and public speeches.
Develop text-to-speech systems by utilizing the audiobooks and TTS-specific segments within the corpus.
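
As a rough illustration of the ASR evaluation use case, the sketch below computes word error rate (WER) separately for each acoustic domain. It is a minimal example under stated assumptions, not the MTUCI team's evaluation method: the field names "domain", "reference", and "hypothesis" and the sample records are hypothetical and are not taken from the corpus or its schema.

```python
from collections import defaultdict


def word_edit_distance(ref_words, hyp_words):
    """Levenshtein distance between two word lists."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(
                prev[j] + 1,         # deletion (reference word missing from hypothesis)
                curr[j - 1] + 1,     # insertion (extra word in hypothesis)
                prev[j - 1] + cost,  # match or substitution
            ))
        prev = curr
    return prev[-1]


def wer_by_domain(samples):
    """Return {domain: WER}, pooling errors and reference words per domain."""
    errors = defaultdict(int)
    ref_lengths = defaultdict(int)
    for sample in samples:
        ref = sample["reference"].lower().split()
        hyp = sample["hypothesis"].lower().split()
        errors[sample["domain"]] += word_edit_distance(ref, hyp)
        ref_lengths[sample["domain"]] += len(ref)
    # Skip domains with no reference words to avoid division by zero.
    return {d: errors[d] / ref_lengths[d] for d in errors if ref_lengths[d] > 0}


# Made-up records for illustration only; the field names are assumptions.
samples = [
    {"domain": "phone_calls",
     "reference": "добрый день чем могу помочь",
     "hypothesis": "добрый день чем помочь"},
    {"domain": "audiobooks",
     "reference": "он вышел из дома",
     "hypothesis": "он вышел из дома"},
]
scores = wer_by_domain(samples)
```

Pooling errors and reference lengths per domain, rather than averaging per-utterance WER, keeps short utterances from dominating the score; either convention is defensible as long as it is reported.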

Strengths

Covers multiple Russian speech genres including podcasts, public speeches, YouTube, audiobooks, and phone calls.
Processed and filtered using the BALALAIKA pipeline from the MTUCI lab260 team.
Released under the Mozilla Public License 2.0 (mpl-2.0) for open research and development.

Quality Score

Overall: D39
Description: 39
Source: 44
Reputation: 43
Access: 22

Community

37 downloads

3 likes

0 views

Dataset Info

Author: MTUCI
Created: Jan 22, 2026
Updated: Jan 29, 2026
