An annotated, speaker-relabelled, and loudness-normalised Shona speech dataset prepared through a reproducible Modal-based data engineering pipeline. This release addresses speaker label contamination in the source data by replacing the original identity columns with acoustically derived speaker assignments. The dataset is authored by manassehzw and was last updated in March 2026.
Use Cases
- Train automatic speech recognition models on the annotated Shona audio data.
- Perform speaker diarisation analysis using the acoustically derived speaker assignments.
- Benchmark audio processing pipelines on loudness-normalised Shona speech samples.
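For the benchmarking use case, it can help to confirm that clips sit at a similar level before comparing pipelines. The sketch below is only an illustration: it measures RMS level in dBFS (relative to a full-scale amplitude of 1.0) for clips that have already been decoded to lists of float samples. The card does not say which loudness measure the pipeline used, so RMS is an assumption here, and the function names `rms_dbfs` and `level_spread_db` are hypothetical.

```python
import math


def rms_dbfs(samples):
    """Return the RMS level of float samples in [-1.0, 1.0], in dBFS.

    Assumption: 0 dBFS corresponds to an RMS of 1.0. This is a simple
    RMS measure, not necessarily the loudness measure used to prepare
    the dataset.
    """
    if not samples:
        raise ValueError("samples must be non-empty")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0.0:
        # Digital silence has no finite level.
        return float("-inf")
    return 10.0 * math.log10(mean_square)


def level_spread_db(clips):
    """Return the gap in dB between the loudest and quietest clip.

    Assumes no clip is pure digital silence.
    """
    levels = [rms_dbfs(clip) for clip in clips]
    return max(levels) - min(levels)


def sine(amplitude, freq_hz=440.0, sample_rate=16000, seconds=1.0):
    """Generate a test tone as a list of float samples."""
    n = int(sample_rate * seconds)
    return [
        amplitude * math.sin(2.0 * math.pi * freq_hz * i / sample_rate)
        for i in range(n)
    ]


if __name__ == "__main__":
    clips = [sine(0.5), sine(0.25)]
    print(round(rms_dbfs(clips[0]), 2))
    print(round(level_spread_db(clips), 2))
```

A small spread across a sample of clips would be consistent with loudness normalisation, although a perceptual measure may give different numbers from plain RMS.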
Strengths
- The dataset addresses speaker label contamination by replacing the original identity columns with acoustically derived assignments.
- Audio data is annotated, speaker-relabelled, and loudness-normalised for consistency.
- Prepared through a reproducible Modal-based data engineering pipeline.
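The card does not describe how the speaker assignments were derived. As a rough illustration of what an acoustically derived assignment can look like, the sketch below groups per-utterance speaker embeddings with a greedy cosine-similarity rule. Everything in it is an assumption: the embeddings are toy vectors, the threshold of 0.8 is arbitrary, and `assign_speakers` is a hypothetical helper, not the pipeline's method.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def assign_speakers(embeddings, threshold=0.8):
    """Greedily cluster embeddings and return one integer label per input.

    Each embedding joins the most similar existing cluster centroid if the
    similarity is at least `threshold`; otherwise it starts a new cluster.
    The threshold is an illustrative value, not one taken from the dataset.
    """
    centroids = []  # running mean embedding per cluster
    counts = []     # number of members per cluster
    labels = []
    for emb in embeddings:
        best_idx, best_sim = -1, -1.0
        for idx, centroid in enumerate(centroids):
            sim = cosine_similarity(emb, centroid)
            if sim > best_sim:
                best_idx, best_sim = idx, sim
        if best_idx >= 0 and best_sim >= threshold:
            n = counts[best_idx]
            # Update the running mean with the new member.
            centroids[best_idx] = [
                (c * n + x) / (n + 1)
                for c, x in zip(centroids[best_idx], emb)
            ]
            counts[best_idx] = n + 1
            labels.append(best_idx)
        else:
            centroids.append(list(emb))
            counts.append(1)
            labels.append(len(centroids) - 1)
    return labels


if __name__ == "__main__":
    toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9], [1.0, 0.05]]
    print(assign_speakers(toy))
```

Greedy assignment depends on input order, so a real relabelling pass would more likely use a batch clustering method over embeddings from a trained speaker model.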
Limitations
- Scale and schema details, such as row count, column names, and file size, are unknown.
- The dataset's temporal and geographic coverage is not specified.
- License information is unavailable, which may restrict usage.
Provenance
- Source: Hugging Face
- Collection Method: Prepared through a reproducible Modal-based data engineering pipeline, with acoustically derived speaker assignments.
- Time Range: Not specified.
- Freshness: Last updated March 2026.
- Geography: Not specified.