A multilingual evaluation benchmark for automatic speech recognition covering four under-served languages of the Horn of Africa: Amharic, Oromo, Somali, and Tigrinya. It contains 4,000 utterances totaling 15.44 hours of audio, drawn from spontaneous interview-style speech with transcripts validated by native speakers. The dataset was created by LesanAI and last updated on May 7, 2026.
Use Cases
- Benchmarking ASR model performance on the 1,000 evaluation utterances provided per language
- Evaluating multilingual speech recognition systems across Amharic, Oromo, Somali, and Tigrinya
- Fine-tuning or adapting ASR models on spontaneous interview-style speech, keeping in mind that the dataset is described as an evaluation benchmark
- Studying linguistic features of Horn of Africa languages using native-speaker-validated transcripts
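For the benchmarking use case, the usual metric is word error rate (WER), computed per language. The following is a minimal sketch, not part of the dataset itself. The field names `language`, `reference`, and `hypothesis` are hypothetical, since the dataset's columns are undocumented, and splitting on whitespace with no text normalization is an assumption that may not suit every language or script.

```python
from collections import defaultdict


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    # prev[j] holds the distance between the ref prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            )
        prev = curr
    return prev[-1]


def wer_by_language(rows):
    """Corpus-level WER per language: total word errors / total reference words."""
    errors = defaultdict(int)
    words = defaultdict(int)
    for row in rows:
        # Hypothetical field names; whitespace tokenization is an assumption.
        ref = row["reference"].split()
        hyp = row["hypothesis"].split()
        errors[row["language"]] += edit_distance(ref, hyp)
        words[row["language"]] += len(ref)
    return {lang: errors[lang] / words[lang] for lang in words if words[lang] > 0}


# Toy rows for illustration only; these are not taken from the dataset.
sample_rows = [
    {"language": "som", "reference": "a b c d", "hypothesis": "a x c"},
    {"language": "amh", "reference": "one two", "hypothesis": "one two"},
]
scores = wer_by_language(sample_rows)
print(scores)  # {'som': 0.5, 'amh': 0.0}
```

Pooling errors and reference words before dividing gives a corpus-level WER, which weights long utterances more heavily than averaging per-utterance scores would.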
Strengths
- Contains 4,000 evaluation utterances across four languages (1,000 per language)
- Provides 15.44 hours of audio data
- Transcripts are post-edited and QC-validated by native-speaker annotators
- Audio is sourced from spontaneous interview-style speech
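The stated totals also imply an average utterance length, which helps when estimating decoding time or choosing a maximum input length. The short calculation below uses only the figures given in this description. How the 15.44 hours are distributed across the four languages is not stated, so no per-language duration is derived.

```python
TOTAL_HOURS = 15.44       # total audio, as stated in the description
TOTAL_UTTERANCES = 4000   # total utterances, as stated in the description
LANGUAGES = ["Amharic", "Oromo", "Somali", "Tigrinya"]

# Mean utterance duration in seconds across the whole benchmark.
mean_seconds = round(TOTAL_HOURS * 3600 / TOTAL_UTTERANCES, 3)

# Utterances per language, matching the stated 1,000 per language.
utterances_per_language = TOTAL_UTTERANCES // len(LANGUAGES)

print(mean_seconds)             # 13.896
print(utterances_per_language)  # 1000
```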
Limitations
- Column-level documentation is absent; field semantics must be inferred after download
- Row count is not reported in the dataset metadata; the description states 4,000 utterances, which should be confirmed after download
- Data may reflect geographic or source bias inherent to the interview collection method
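Because the columns are undocumented and the row count is unconfirmed, a quick sanity check after download is worthwhile. The sketch below assumes the data has already been loaded into a list of dictionaries by whatever loader applies. The `language` field name is hypothetical, and the expected counts are the ones stated in this description.

```python
from collections import Counter

EXPECTED_TOTAL = 4000         # utterances stated in the description
EXPECTED_PER_LANGUAGE = 1000  # utterances per language stated in the description


def summarize_fields(rows):
    """Map each field name to a count of the Python value types seen in it."""
    summary = {}
    for row in rows:
        for key, value in row.items():
            summary.setdefault(key, Counter())[type(value).__name__] += 1
    return summary


def check_counts(rows, language_field="language"):
    """Return a list of mismatches against the stated counts (empty if none)."""
    # language_field is a hypothetical column name; adjust after inspecting the data.
    per_language = Counter(row[language_field] for row in rows)
    problems = []
    if len(rows) != EXPECTED_TOTAL:
        problems.append(f"expected {EXPECTED_TOTAL} rows, found {len(rows)}")
    for lang, count in sorted(per_language.items()):
        if count != EXPECTED_PER_LANGUAGE:
            problems.append(
                f"{lang}: expected {EXPECTED_PER_LANGUAGE} rows, found {count}"
            )
    return problems
```

Running `summarize_fields` first shows which fields exist and whether any contain missing values, which makes it easier to pick the right field name to pass to `check_counts`.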
Provenance
- Source: LesanAI
- Collection Method: Utterances drawn from spontaneous interview-style speech, with transcripts post-edited and QC-validated by native-speaker annotators.
- Freshness: Last updated 2026-05-07 11:12:02; freshness should be verified
- Geography: Horn of Africa (Amharic, Oromo, Somali, Tigrinya)