Name: Afrivoice: Multimodal Image and Speech Dataset for Six African Languages
Creator: DigitalUmuganda
Published: 2026-03-02T13:51:02
Keywords: Image, Multimodal African Languages, Computer Vision, Image Captioning, Audio, Audio Transcription, Speech Recognition, Multimodal

Description

A multimodal dataset from DigitalUmuganda, last updated in 2026, in which each data point consists of a JPEG image, a WAV audio recording describing that image, and, for part of the data, a transcription of the audio. The description lists over 500 hours of audio per language across six languages (Shona, Lingala, Fulani, Malagasy, Wolof, and Somali), with over 100 transcribed hours for each.

Use Cases

Train automatic speech recognition models for African languages on the transcribed audio files.
Develop image-to-speech or image captioning systems from the paired image and audio description data.
Create or fine-tune multimodal AI models that align visual content with spoken language.
Conduct linguistic research on the listed African languages using the audio and transcription data.
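
The use cases above draw on different parts of each data point: speech recognition needs the transcribed audio, while captioning and multimodal work need the image and audio pairs. A minimal Python sketch of that split follows. The dataset's real schema is undocumented, so the `DataPoint` class and its field names (`image_path`, `audio_path`, `language`, `transcription`) are hypothetical and would need to be mapped to the actual columns after download.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class DataPoint:
    # Hypothetical record layout; the real column names are not documented.
    image_path: str
    audio_path: str
    language: str
    transcription: Optional[str] = None


def transcribed_subset(points: List[DataPoint], language: str) -> List[DataPoint]:
    """Keep data points in one language that carry a non-empty transcription.

    These are the records usable for supervised speech recognition training.
    """
    return [
        p for p in points
        if p.language == language and p.transcription and p.transcription.strip()
    ]


def image_audio_pairs(points: List[DataPoint]) -> List[Tuple[str, str]]:
    """Return (image, audio) pairs, which exist for every data point."""
    return [(p.image_path, p.audio_path) for p in points]


if __name__ == "__main__":
    # Made-up sample records, for illustration only.
    sample = [
        DataPoint("img_001.jpg", "aud_001.wav", "Wolof", "example transcription"),
        DataPoint("img_002.jpg", "aud_002.wav", "Wolof"),
        DataPoint("img_003.jpg", "aud_003.wav", "Somali", "example transcription"),
    ]
    print(len(transcribed_subset(sample, "Wolof")))
    print(len(image_audio_pairs(sample)))
```

Filtering by transcription first matters because only part of the audio is transcribed; untranscribed records remain usable for the image and audio pairing tasks.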

Strengths

Includes over 500 hours of audio per language for six distinct African languages.
Provides over 100 hours of transcribed audio per language, which can be used for supervised training.
Each data point is multimodal, pairing an image with a descriptive audio file.
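
The per-language figures above imply lower bounds for the dataset as a whole. The short Python calculation below derives them, assuming the stated minimums apply equally to all six languages; because the description only says "over", the true totals are not known and may be higher.

```python
# Languages and per-language minimums as stated in the dataset description.
LANGUAGES = ["Shona", "Lingala", "Fulani", "Malagasy", "Wolof", "Somali"]
MIN_AUDIO_HOURS_PER_LANGUAGE = 500
MIN_TRANSCRIBED_HOURS_PER_LANGUAGE = 100

# Lower bounds only: each language is described as having "over" these amounts.
min_total_audio_hours = len(LANGUAGES) * MIN_AUDIO_HOURS_PER_LANGUAGE
min_total_transcribed_hours = len(LANGUAGES) * MIN_TRANSCRIBED_HOURS_PER_LANGUAGE

print(f"At least {min_total_audio_hours} hours of audio in total")
print(f"At least {min_total_transcribed_hours} hours of transcribed audio in total")
```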

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
The row count is not stated, so the number of data points cannot be assessed before download.
The dataset page indicates a last updated date of 2026-05-25; freshness should be verified.
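
Since column documentation and the row count are missing, a first step after download is to profile a sample of records. The Python sketch below assumes the records have already been loaded as a list of dictionaries; the loading step is omitted, and the field names in the usage example are hypothetical.

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple


def summarize_fields(records: List[Dict[str, Any]]) -> Tuple[int, Dict[str, Dict[str, Any]]]:
    """Return the row count and, per field, the observed value types and
    the number of empty values (None or empty string).

    A field that is absent from a record altogether is not counted as empty.
    """
    summary: Dict[str, Dict[str, Any]] = defaultdict(
        lambda: {"types": set(), "missing": 0}
    )
    for record in records:
        for key, value in record.items():
            if value is None or value == "":
                summary[key]["missing"] += 1
            else:
                summary[key]["types"].add(type(value).__name__)
    return len(records), dict(summary)


if __name__ == "__main__":
    # Made-up records with hypothetical field names, for illustration only.
    sample = [
        {"image": "img_001.jpg", "audio": "aud_001.wav", "text": "example"},
        {"image": "img_002.jpg", "audio": "aud_002.wav", "text": None},
    ]
    row_count, fields = summarize_fields(sample)
    print(row_count)
    for name, info in fields.items():
        print(name, sorted(info["types"]), info["missing"])
```

The count of empty values per field also shows how much of the audio actually has a transcription, which the description only gives in hours.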

Provenance

Source: DigitalUmuganda on Hugging Face.
Freshness: Last updated 2026-05-25 13:38:21.
Geography: Likely covers regions where the listed languages (Shona, Lingala, Fulani, Malagasy, Wolof, Somali) are spoken.

License is unknown; users should verify terms of use before downloading.
