This Russian speech corpus contains audio recordings from several genres: podcasts, public speeches, YouTube content, audiobooks, and phone calls. The MTUCI lab260 team processed the dataset with the BALALAIKA pipeline to provide high-quality annotations for generative speech tasks.
Use Cases
- Train generative speech models using the high-quality Russian audio samples and BALALAIKA-generated annotations.
- Evaluate automatic speech recognition performance across diverse acoustic domains such as phone calls and public speeches.
- Develop text-to-speech systems using the audiobooks and TTS-specific segments in the corpus.
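
For the ASR evaluation use case, one common way to compare acoustic domains is to compute word error rate (WER) separately per genre. The sketch below is a minimal, standard-library illustration and not part of the corpus tooling. The function names, the genre labels, and the `(genre, reference, hypothesis)` tuple layout are assumptions made for the example; the corpus may store transcripts and genre information differently.

```python
from collections import defaultdict


def word_edit_distance(ref, hyp):
    """Levenshtein distance between two lists of words."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer_by_genre(samples):
    """Aggregate WER per genre.

    `samples` is an iterable of (genre, reference, hypothesis) tuples.
    This layout is hypothetical and chosen only for illustration.
    """
    errors = defaultdict(int)
    words = defaultdict(int)
    for genre, reference, hypothesis in samples:
        ref = reference.lower().split()
        hyp = hypothesis.lower().split()
        errors[genre] += word_edit_distance(ref, hyp)
        words[genre] += len(ref)
    # Skip genres with no reference words to avoid division by zero.
    return {g: errors[g] / words[g] for g in words if words[g] > 0}


if __name__ == "__main__":
    # Toy data; genre labels are placeholders, not the corpus's own names.
    demo = [
        ("phone_calls", "привет как дела", "привет как дела"),
        ("phone_calls", "я дома", "я не дома"),
        ("public_speeches", "добрый день коллеги", "добрый день"),
    ]
    for genre, wer in sorted(wer_by_genre(demo).items()):
        print(f"{genre}: WER = {wer:.3f}")
```

Errors are summed across a genre before dividing by the total number of reference words, so long utterances weigh more than short ones. Text normalization here is limited to lowercasing and whitespace splitting; a real evaluation would likely also need consistent handling of punctuation and numerals.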
Strengths
- Covers multiple Russian speech genres including podcasts, public speeches, YouTube, audiobooks, and phone calls.
- Processed and filtered using the BALALAIKA pipeline from the MTUCI lab260 team.
- Released under the Mozilla Public License 2.0 (mpl-2.0) for open research and development.