This parallel speech corpus pairs audio recordings with text transcripts for the Shewa dialect of Amharic. The dataset, created by leyu-amharic, is designed for speech technology research and was last updated in February 2026.
Use Cases
- Train automatic speech recognition (ASR) models to transcribe Shewa dialect audio into text.
- Develop text-to-speech (TTS) systems that synthesize speech from text while preserving the dialect-specific phonetic, prosodic, and accent features found in the recordings.
- Analyze dialect-specific phonetic variations and accent patterns by comparing audio features with transcript text.
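For the ASR use case, model output is usually scored against the reference transcripts with word error rate (WER) or character error rate (CER). The following is a minimal sketch of both metrics in plain Python. It is not part of the dataset or its tooling; the function names are illustrative, and the Amharic strings in the usage section are made-up examples, not samples from the corpus.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences of tokens."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # delete a reference token
                curr[j - 1] + 1,     # insert a hypothesis token
                prev[j - 1] + cost,  # substitute, or match at zero cost
            ))
        prev = curr
    return prev[-1]


def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference transcript must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def character_error_rate(reference, hypothesis):
    """Character-level edit distance, ignoring whitespace."""
    ref_chars = [c for c in reference if not c.isspace()]
    hyp_chars = [c for c in hypothesis if not c.isspace()]
    if not ref_chars:
        raise ValueError("reference transcript must contain at least one character")
    return edit_distance(ref_chars, hyp_chars) / len(ref_chars)


if __name__ == "__main__":
    # Illustrative strings only; they are not taken from the dataset.
    reference = "ሰላም እንዴት ነህ"
    hypothesis = "ሰላም እንዴት ነሽ"
    print("WER:", word_error_rate(reference, hypothesis))
    print("CER:", character_error_rate(reference, hypothesis))
```

Reporting CER alongside WER may be useful here, because a single wrong character in a long Amharic word counts as a full word error under WER.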
Strengths
- Curated parallel structure with audio recordings aligned to text transcripts.
- A single-dialect focus captures the phonetic and prosodic variation specific to the Shewa dialect.
Limitations
- Sample size and total recording duration are not documented, which makes it hard to assess suitability for model training.
- Potential geographic bias, as the corpus covers only one Amharic dialect.
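Because the size and duration are undocumented, a first step before training is to measure them directly. The sketch below computes clip count, total duration, and transcript word count from a sequence of records. The record layout (keys `num_samples`, `sampling_rate`, `transcript`) is an assumption made for illustration; the dataset's actual schema is not described here and may differ.

```python
from dataclasses import dataclass


@dataclass
class CorpusStats:
    num_clips: int
    total_seconds: float
    total_words: int

    @property
    def total_hours(self):
        return self.total_seconds / 3600


def summarize(records):
    """Aggregate basic size statistics over audio/transcript records.

    Each record is a dict with hypothetical keys:
      num_samples   - number of audio samples in the clip
      sampling_rate - samples per second (must be positive)
      transcript    - the paired text
    """
    num_clips = 0
    total_seconds = 0.0
    total_words = 0
    for rec in records:
        rate = rec["sampling_rate"]
        if rate <= 0:
            raise ValueError("sampling_rate must be positive")
        num_clips += 1
        total_seconds += rec["num_samples"] / rate
        total_words += len(rec["transcript"].split())
    return CorpusStats(num_clips, total_seconds, total_words)


if __name__ == "__main__":
    # Toy records standing in for real clips.
    demo = [
        {"num_samples": 16000, "sampling_rate": 16000, "transcript": "one two"},
        {"num_samples": 32000, "sampling_rate": 16000, "transcript": "three"},
    ]
    stats = summarize(demo)
    print(stats.num_clips, stats.total_seconds, stats.total_words)
```

Total hours of audio is the figure most often used to judge whether a corpus can support training from scratch or is better suited to fine-tuning an existing model.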
Provenance
- Source: leyu-amharic on Hugging Face.
- Collection Method: Curated collection of audio recordings paired with corresponding text transcripts.
- Time Range: Not specified.
- Freshness: Last updated in February 2026.
- Geography: Region where the Shewa dialect of Amharic is spoken.