Name: Shona Speech Dataset for ASR and TTS Workflows
Creator: manassehzw
Published: 2026-03-20T17:56:44
Keywords: Size Categories: 10K<n<100K, Text To Speech, Task Categories: Text To Speech, Library: polars, Library: dask, Optimized Parquet, Modality: Text, African Language, Library: mlcroissant, Library: datasets, License: CC BY 4.0, Parquet, Audio, Region: US, Natural Language Processing, Task Categories: Automatic Speech Recognition, Audio Corpus, Shona Speech, Speech Recognition, Automatic Speech Recognition, Shona

Description

A cleaned, metadata-rich Shona speech dataset prepared through a reproducible data engineering pipeline. The dataset is derived from the google/WaxalNLP source, specifically the sna_asr subset, and was last updated on March 20, 2026. It is intended as a general-purpose standard corpus for downstream tasks.

Use Cases

Training automatic speech recognition models on Shona audio.
Developing text-to-speech systems from the provided speech corpus.
Conducting linguistic research on Shona phonetics and speech patterns using the audio samples.
Benchmarking speech processing algorithms for low-resource languages against a standardized corpus.

Strengths

Dataset is described as cleaned and metadata-rich.
Prepared through a reproducible data engineering pipeline.
Intended as a general-purpose standard corpus, avoiding aggressive filtering to allow for task-specific thresholds.
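
Because the corpus avoids aggressive filtering, each downstream task is expected to apply its own thresholds. The sketch below shows one way that could look. It is a minimal illustration only: the field names duration_s and text, the Thresholds class, and every numeric limit are assumptions made for the example, not documented properties of this dataset.

```python
from dataclasses import dataclass
from typing import Any, Iterable, Iterator, Mapping


@dataclass(frozen=True)
class Thresholds:
    """Hypothetical per-task limits; tune these for the model being trained."""
    min_duration_s: float
    max_duration_s: float
    min_chars: int = 1


def filter_rows(
    rows: Iterable[Mapping[str, Any]], limits: Thresholds
) -> Iterator[Mapping[str, Any]]:
    """Yield only rows whose duration and transcript length fall within limits.

    Assumes each row carries a 'duration_s' number and a 'text' string;
    both field names are placeholders, not confirmed column names.
    """
    for row in rows:
        duration = row.get("duration_s")
        text = (row.get("text") or "").strip()
        if duration is None:
            continue  # cannot judge a clip without a duration
        if not (limits.min_duration_s <= duration <= limits.max_duration_s):
            continue
        if len(text) < limits.min_chars:
            continue
        yield row


# Example settings only: ASR often tolerates a wider range of clip lengths,
# while TTS training usually prefers a narrower, cleaner band.
ASR_THRESHOLDS = Thresholds(min_duration_s=1.0, max_duration_s=30.0)
TTS_THRESHOLDS = Thresholds(min_duration_s=2.0, max_duration_s=15.0, min_chars=10)
```

Keeping the thresholds in a small immutable object makes it easy to record which settings produced a given training subset, without modifying the shared corpus itself.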

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
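Exact row count and total size are not stated; the listing's tags indicate only Parquet files and a size category of 10K to 100K rows, which may limit suitability assessment.
The listing's tags indicate a CC BY 4.0 license, but no license text is quoted here, so the terms should be confirmed at the source before use.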
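
Since column-level documentation is absent, a first step after download is usually to profile a sample of records and see which fields exist, what types they hold, and how often they are empty. The helper below is a minimal sketch of that idea using only the standard library. It assumes the sample has already been loaded as a list of dictionaries; the function name profile_fields is illustrative and not part of any library.

```python
from typing import Any, Dict, Iterable, Mapping


def profile_fields(records: Iterable[Mapping[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Summarize each field: value types observed, null count, and occurrences."""
    stats: Dict[str, Dict[str, Any]] = {}
    for record in records:
        for name, value in record.items():
            entry = stats.setdefault(name, {"types": set(), "nulls": 0, "seen": 0})
            entry["seen"] += 1
            if value is None:
                entry["nulls"] += 1
            else:
                entry["types"].add(type(value).__name__)
    # Convert the type sets to sorted lists so the summary prints deterministically.
    return {
        name: {"types": sorted(e["types"]), "nulls": e["nulls"], "seen": e["seen"]}
        for name, e in stats.items()
    }
```

A summary like this does not reveal what a field means, but it narrows the guesswork: a field that is always a short string is probably an identifier or label, while one with many nulls may be optional metadata.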

Provenance

Source: Derived from the google/WaxalNLP source dataset, specifically the sna_asr subset.
Collection Method: Prepared through a reproducible data engineering pipeline; details of specific cleaning steps are not provided.
Time Range: Not specified.
Freshness: Last updated 2026-03-20 19:52:11.
Geography: Not stated. Shona is spoken primarily in Zimbabwe, so the recordings likely originate there.

License: The listing's tags indicate CC BY 4.0; users should verify the terms at the source before use.
