Name: Annotated Shona Speech Dataset with Acoustic Speaker Labels
Creator: manassehzw
Published: 2026-03-23T15:44:04
Keywords: Size Categories: 10K<n<100K, Text To Speech, Task Categories: Text To Speech, Library: polars, Library: dask, OPTIMIZED-PARQUET, Modality: text, African Language, Library: mlcroissant, Library: datasets, License: CC BY 4.0, Parquet, Audio, Region: US, Task Categories: Automatic Speech Recognition, Speech Recognition, Shona

Description

An annotated, speaker-relabelled, and loudness-normalised Shona speech dataset prepared through a reproducible Modal-based data engineering pipeline. This release addresses speaker label contamination in the original source labels by replacing identity columns with acoustically-derived speaker assignments. The dataset is authored by manassehzw and was last updated in March 2026.

Use Cases

Train automatic speech recognition models on the annotated Shona audio data.
Perform speaker diarization analysis using the acoustically-derived speaker assignments.
Benchmark audio processing pipelines on loudness-normalised Shona speech samples.
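
When benchmarking a pipeline on loudness-normalised audio, a quick level check can help confirm that clips sit at a consistent level. The sketch below is a minimal illustration that assumes float samples in the range [-1, 1] and uses plain RMS level in dBFS. The source does not say which loudness measure or target the original pipeline used, so `rms_dbfs`, `normalise_rms`, and the -23 dBFS default are hypothetical choices, not the dataset's actual method.

```python
import math
from typing import List, Sequence


def rms_dbfs(samples: Sequence[float]) -> float:
    """Return the RMS level of float samples (range [-1, 1]) in dBFS."""
    if not samples:
        raise ValueError("samples must be non-empty")
    mean_square = sum(s * s for s in samples) / len(samples)
    if mean_square == 0.0:
        return float("-inf")  # digital silence has no finite level
    return 10.0 * math.log10(mean_square)


def normalise_rms(samples: Sequence[float], target_dbfs: float = -23.0) -> List[float]:
    """Scale samples so their RMS level matches target_dbfs.

    Assumption: simple RMS gain, not a perceptual loudness measure.
    Values are clipped to [-1, 1] after scaling.
    """
    current = rms_dbfs(samples)
    if current == float("-inf"):
        return list(samples)  # nothing to scale
    gain = 10.0 ** ((target_dbfs - current) / 20.0)
    return [max(-1.0, min(1.0, s * gain)) for s in samples]


# Example: a one-second 440 Hz tone at a 16 kHz sample rate.
tone = [0.1 * math.sin(2 * math.pi * 440 * n / 16000) for n in range(16000)]
levelled = normalise_rms(tone, target_dbfs=-20.0)
```

Comparing `rms_dbfs` across clips before and after a processing step gives a rough sanity check; a perceptual loudness measure would be a closer match if the pipeline's actual target were known.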

Strengths

Dataset addresses speaker label contamination by replacing original identity columns with acoustically-derived assignments.
Audio data is annotated, speaker-relabelled, and loudness-normalised for consistency.
Prepared through a reproducible Modal-based data engineering pipeline.
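
The source does not describe how the acoustically-derived speaker assignments were produced. Purely as an illustration of the general idea, the sketch below groups per-utterance embedding vectors by cosine similarity with a greedy, single-pass rule. The function names, the 0.8 threshold, and the use of precomputed embeddings are all assumptions made for this example and should not be read as the dataset's pipeline.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def assign_speakers(embeddings: Sequence[Sequence[float]],
                    threshold: float = 0.8) -> List[int]:
    """Greedily assign each embedding to the most similar cluster.

    A new cluster is opened when no existing cluster reaches the
    (hypothetical) similarity threshold. Each cluster is represented by
    the running sum of its members, which has the same direction as
    their mean and so gives the same cosine similarity.
    """
    cluster_sums: List[List[float]] = []
    labels: List[int] = []
    for emb in embeddings:
        best_idx, best_sim = -1, threshold
        for idx, total in enumerate(cluster_sums):
            sim = cosine(emb, total)
            if sim >= best_sim:
                best_idx, best_sim = idx, sim
        if best_idx < 0:
            cluster_sums.append(list(emb))
            labels.append(len(cluster_sums) - 1)
        else:
            cluster_sums[best_idx] = [t + e for t, e in zip(cluster_sums[best_idx], emb)]
            labels.append(best_idx)
    return labels


# Toy 2-D "embeddings": two utterances near each axis.
toy = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.95]]
toy_labels = assign_speakers(toy)
```

Labels derived this way depend only on acoustic similarity rather than on metadata, which is the property that makes such assignments useful when the original identity columns are unreliable.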

Limitations

Exact row count, column names, and file size are not stated; the size-category keyword indicates only that the dataset has between 10K and 100K rows.
The dataset's temporal and geographic coverage are not specified.
The keywords carry a CC BY 4.0 license tag, but no license text or further terms are provided, so usage conditions should be confirmed at the source.

Provenance

Source: Hugging Face
Collection Method: Prepared through a reproducible Modal-based data engineering pipeline, with acoustically-derived speaker assignments.
Time Range: Not specified
Freshness: Last updated March 2026.
Geography: Not specified
