Description
A multimodal dataset from DigitalUmuganda, last updated in 2026. Each data point consists of a JPEG image, a WAV audio file describing the image, and often a transcription of the audio. The description lists over 500 hours of audio per language for six languages (Shona, Lingala, Fulani, Malagasy, Wolof, and Somali), with over 100 transcribed hours in each.
Use Cases
Train automatic speech recognition models based on the transcribed audio files for African languages.
Develop image-to-speech or image captioning systems based on the paired image and audio description data.
Create or fine-tune multimodal AI models based on the alignment of visual content and spoken language.
Conduct linguistic research on the listed African languages based on the audio and transcription data.
Strengths
Includes over 500 hours of audio per language for six distinct African languages.
Provides over 100 hours of transcribed audio per language, which can be used for supervised training.
Each data point is multimodal, pairing an image with a descriptive audio file.
Limitations
Column-level documentation is absent; field semantics must be inferred after download.
Row count is unknown, which makes it harder to assess suitability before downloading.
The dataset page indicates a last updated date of 2026-05-25; freshness should be verified.
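Because column-level documentation is absent and not every data point has a transcription, a first step after download is to summarize what each field contains. The following is a minimal sketch, not the publisher's method: the field names `image`, `audio`, and `transcription` in the sample are hypothetical placeholders, and the real names must be read from the downloaded data. The records could be any iterable of dictionaries, for example the first few rows taken from the Hugging Face `datasets` library in streaming mode.

```python
from collections import Counter, defaultdict
from typing import Any, Iterable, Mapping


def summarize_fields(records: Iterable[Mapping[str, Any]]) -> dict:
    """Report, per field, the observed value types and how many records are non-empty."""
    types = defaultdict(Counter)  # field name -> Counter of Python type names
    filled = Counter()            # field name -> number of non-empty values
    total = 0
    for record in records:
        total += 1
        for name, value in record.items():
            types[name][type(value).__name__] += 1
            # Treat None and zero-length strings/bytes as missing.
            is_empty = value is None or (
                isinstance(value, (str, bytes)) and len(value) == 0
            )
            if not is_empty:
                filled[name] += 1
    return {
        name: {"types": dict(counts), "filled": filled[name], "total": total}
        for name, counts in types.items()
    }


# Hypothetical sample records; real field names and value types may differ.
sample = [
    {"image": b"\xff\xd8", "audio": b"RIFF", "transcription": "example text"},
    {"image": b"\xff\xd8", "audio": b"RIFF", "transcription": None},
]

for field, info in summarize_fields(sample).items():
    print(field, info)
```

The `filled` count for the transcription field gives a rough share of records usable for supervised training, and counting records while iterating also yields the row count that the listing does not provide.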
Provenance
Source
DigitalUmuganda on Hugging Face.
Freshness
Last updated 2026-05-25 13:38:21.
Geography
Likely covers regions where the listed languages (Shona, Lingala, Fulani, Malagasy, Wolof, Somali) are spoken.
License
Unknown; users should verify the terms of use before downloading.