ToneWebinars Balalaika is a 248.9-hour Russian speech corpus curated from podcasts by the MTUCI lab260 team. Released in early 2026, the dataset was processed using the BALALAIKA pipeline to provide high-quality audio for generative speech tasks. It serves as a refined version of the original ToneWebinars source, specifically filtered for speech synthesis and recognition.
For technical details on the BALALAIKA filtering and annotation methodology, see arXiv paper 2507.13563.