Description
Ng'akarimojong (kdj), an Eastern Nilotic language with approximately 370,000 speakers in Karamoja, Uganda, is the focus of this speech dataset. It was created by Speedykom using GRN recordings segmented via silence detection. Audio files are in WAV format at 16 kHz mono, paired with UTF-8 transcripts auto-generated by the facebook/mms-1b-all model with a Teso adapter.
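Since the stated audio format is WAV at 16 kHz mono, a downloaded file can be checked against it before training. The following is a minimal sketch using only Python's standard `wave` module; the function name `check_wav_format` and the file path passed to it are hypothetical and not part of the dataset.

```python
import wave


def check_wav_format(path, expected_rate=16000, expected_channels=1):
    """Return (ok, details) describing whether a WAV file matches the expected format."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        n_frames = wav.getnframes()
    details = {
        "sample_rate": rate,
        "channels": channels,
        # Duration in seconds, derived from the frame count and sample rate.
        "duration_s": n_frames / rate if rate else 0.0,
    }
    ok = rate == expected_rate and channels == expected_channels
    return ok, details
```

Calling `check_wav_format("segment.wav")` on each file (the path here is a placeholder) would flag any segment that deviates from the documented 16 kHz mono format.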
Use Cases
Train automatic speech recognition (ASR) models for Ng'akarimojong based on the WAV audio and transcript pairs.
Benchmark or fine-tune multilingual speech models on a specific Eastern Nilotic language.
Conduct linguistic analysis of Ng'akarimojong phonetics and speech patterns using the segmented recordings.
Develop text-to-speech (speech synthesis) systems for the Karamojong language community.
Strengths
Audio is in a standard format widely used for ASR (WAV, 16 kHz, mono).
Transcripts are provided, generated using a large-scale multilingual model (facebook/mms-1b-all).
Dataset explicitly targets an underserved language with around 370,000 speakers.
Limitations
Transcripts are auto-generated and have not been manually verified, so they may contain errors.
Row count and total dataset size are unknown, limiting suitability assessment.
Column-level documentation is absent; field semantics must be inferred after download.
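Because the transcripts are auto-generated, one way to gauge their quality is to hand-correct a small sample and compute the character error rate (CER) of the provided transcripts against it. The sketch below is an illustration, not a procedure described by the dataset; the function names are hypothetical, and any reference transcripts would have to be produced by a reviewer who knows the language.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion from ref
                curr[j - 1] + 1,     # insertion into ref
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def character_error_rate(reference, hypothesis):
    """CER = edit distance divided by the length of the reference string."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)
```

Here `reference` would be the hand-corrected text and `hypothesis` the transcript shipped with the dataset; averaging the result over a few dozen segments gives a rough error estimate.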
Provenance
Source
GRN recordings, processed by Speedykom.
Collection Method
Recordings segmented via silence detection; transcripts auto-generated via facebook/mms-1b-all (Teso adapter).
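The exact silence-detection settings are not documented. As an illustration of the general technique only, the following sketch splits a sequence of 16-bit samples into non-silent segments using per-frame RMS energy. The function name, the threshold, and the frame and minimum-silence durations are all assumed values, not those used to build the dataset.

```python
import math


def split_on_silence(samples, rate=16000, frame_ms=20,
                     threshold=500.0, min_silence_ms=300):
    """Return (start, end) sample indices of non-silent segments.

    samples: sequence of int16 amplitudes. The threshold and durations are
    illustrative defaults, not the settings used for this dataset.
    """
    frame_len = rate * frame_ms // 1000
    min_silent_frames = max(1, min_silence_ms // frame_ms)
    segments = []
    start = None             # sample index where the current segment began
    last_voiced_end = None   # end index of the most recent non-silent frame
    silent_run = 0           # consecutive silent frames inside a segment
    for i in range(0, len(samples), frame_len):
        frame = samples[i:i + frame_len]
        rms = math.sqrt(sum(s * s for s in frame) / len(frame))
        if rms >= threshold:
            if start is None:
                start = i
            last_voiced_end = i + len(frame)
            silent_run = 0
        elif start is not None:
            silent_run += 1
            # Close the segment once the pause is long enough.
            if silent_run >= min_silent_frames:
                segments.append((start, last_voiced_end))
                start = None
                silent_run = 0
    if start is not None:
        segments.append((start, last_voiced_end))
    return segments
```

Pauses shorter than `min_silence_ms` stay inside a segment, so brief gaps between words do not split an utterance.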
Time Range
Not specified.
Freshness
Last updated 2026-04-21 13:07:29; freshness should be verified.
Geography
Karamojong language region, Karamoja, Northeastern Uganda.
License is unknown; verify the terms of use before using the dataset.