Description
Russian-language podcast episodes scraped from mybook.ru, packaged as Parquet shards with embedded audio bytes. The dataset, created by Sinoosoida, was last updated on June 4, 2026. Each row contains an audio file and approximately 30 metadata columns including duration, ratings, and genre information.
Use Cases
Training automatic speech recognition (ASR) models on the raw Russian audio; because the data is unlabeled, this means self-supervised pretraining or adding transcripts first.
Analyzing podcast metadata such as ratings and genres to study content popularity.
Developing speaker diarization or audio classification models based on the embedded audio files.
Conducting linguistic or acoustic analysis of informal Russian speech patterns.
Strengths
Includes raw audio data (MP3) embedded within the dataset structure.
Contains approximately 30 metadata fields per episode, including duration, ratings, and genre tags.
Specifically focuses on Russian-language content, filling a potential niche in audio datasets.
Limitations
Dataset is explicitly unlabeled, requiring significant annotation effort for supervised tasks.
Row count and total size are unknown, which may limit suitability assessment.
Column-level documentation is absent; field semantics must be inferred after download.
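Since field semantics have to be inferred after download, a quick per-column profile is a reasonable first step. The sketch below is a minimal illustration using only the standard library: it takes rows as plain dicts and reports the observed value types, null counts, and one example per column. The function name `profile_columns` and the sample column names (`title`, `duration`, `rating`, `audio`) are hypothetical, not taken from the dataset. In practice the rows could come from a Parquet reader, for example `pyarrow.parquet.read_table(path).to_pylist()`, assuming pyarrow is installed.

```python
from collections import Counter


def profile_columns(rows):
    """Summarize each column across a list of row dicts.

    Returns {column: {"types": Counter, "nulls": int, "example": value}}.
    """
    profile = {}
    for row in rows:
        for name, value in row.items():
            info = profile.setdefault(
                name, {"types": Counter(), "nulls": 0, "example": None}
            )
            if value is None:
                info["nulls"] += 1
                continue
            info["types"][type(value).__name__] += 1
            if info["example"] is None:
                # Show only the size of binary payloads (e.g. embedded audio)
                # so the summary stays readable.
                if isinstance(value, (bytes, bytearray)):
                    info["example"] = f"<{len(value)} bytes>"
                else:
                    info["example"] = value
    return profile


# Hypothetical rows; the real column names are undocumented.
sample_rows = [
    {"title": "Episode 1", "duration": 1800, "rating": 4.5, "audio": b"ID3\x03"},
    {"title": "Episode 2", "duration": 2400, "rating": None, "audio": b"ID3\x03"},
]

for name, info in profile_columns(sample_rows).items():
    print(name, dict(info["types"]), "nulls:", info["nulls"], "example:", info["example"])
```

Profiling a small sample of rows first keeps memory use low, since each row carries a full audio file.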
Provenance
Source
mybook.ru
Collection Method
Scraped from the source website.
Freshness
Last updated 2026-06-04 07:01:03; freshness should be verified.
Geography
Likely Russia or Russian-speaking regions, based on content language.
License is unknown; users must verify the terms of use. Audio is stored as embedded bytes in Parquet files, so reading it requires a Parquet reader plus an MP3-capable audio decoder.
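To illustrate what decoding the embedded bytes might involve, here is a minimal sketch that pulls the raw payload out of one audio cell and writes it to a file that any MP3-capable tool can open. The cell layout is an assumption: the code accepts either raw bytes or a dict with a `bytes` key, which is a common convention for audio columns in Parquet but is not confirmed for this dataset. The helper names (`extract_audio_bytes`, `looks_like_mp3`, `save_audio`) are hypothetical, and the MP3 check is only a cheap header heuristic, not a full validation.

```python
import tempfile
from pathlib import Path


def extract_audio_bytes(cell):
    """Return the raw audio payload from one cell of the audio column.

    Assumed layouts: raw bytes, or a dict with a "bytes" key.
    """
    if isinstance(cell, (bytes, bytearray)):
        return bytes(cell)
    if isinstance(cell, dict) and cell.get("bytes") is not None:
        return bytes(cell["bytes"])
    raise ValueError("unrecognized audio cell layout")


def looks_like_mp3(data):
    """Heuristic: an ID3v2 tag or an MPEG frame sync at the start."""
    if data[:3] == b"ID3":
        return True
    return len(data) >= 2 and data[0] == 0xFF and (data[1] & 0xE0) == 0xE0


def save_audio(cell, out_path):
    """Write the embedded audio bytes to out_path and return the path."""
    out_path = Path(out_path)
    out_path.write_bytes(extract_audio_bytes(cell))
    return out_path


# Stand-in cell with fake bytes; a real cell would hold a full MP3 file.
fake_cell = {"bytes": b"ID3\x03\x00\x00\x00\x00\x00\x00", "path": "episode.mp3"}

with tempfile.TemporaryDirectory() as tmp:
    saved = save_audio(fake_cell, Path(tmp) / "episode.mp3")
    print(saved.name, saved.stat().st_size, looks_like_mp3(saved.read_bytes()))
```

Writing the bytes out unchanged avoids committing to a particular decoder; the saved file can then be opened with whichever audio library the downstream task already uses.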