The Waxal dataset is a large-scale multilingual speech corpus for African languages. It was created to support research that improves the accuracy and fluency of speech and language technologies across the continent. The dataset supports both automatic speech recognition (ASR) and text-to-speech (TTS) tasks.
Use Cases
- Training automatic speech recognition (ASR) models for African languages
- Developing Text-to-Speech (TTS) systems for under-represented languages
- Linguistic research on African language speech patterns
- Improving the performance of multilingual language models on speech tasks
Strengths
- Large-scale multilingual coverage of diverse African languages
- Supports both automatic speech recognition (ASR) and text-to-speech (TTS) tasks
- Combines original recordings with existing datasets, including DigitalUmuganda/AfriVoice and UGSpeechData
Limitations
- Specific column structures and file formats are not detailed in the provided metadata
- Sample data and row counts are currently unavailable
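
Because the column structure is not documented, a first step when working with the corpus is to inspect a handful of rows and record which fields and value types actually appear. The following is a minimal sketch of that check, assuming rows arrive as dictionary-like records. The function name `summarize_schema` and the columns in `sample_rows` (`audio_path`, `transcript`, `language`) are hypothetical placeholders, not the dataset's real schema.

```python
from collections import Counter
from itertools import islice
from typing import Any, Dict, Iterable, Mapping


def summarize_schema(
    records: Iterable[Mapping[str, Any]], limit: int = 100
) -> Dict[str, Dict[str, int]]:
    """Count, per column, which Python types appear in the first `limit` records."""
    summary: Dict[str, Counter] = {}
    for record in islice(records, limit):
        for column, value in record.items():
            summary.setdefault(column, Counter())[type(value).__name__] += 1
    return {column: dict(counts) for column, counts in summary.items()}


# Hypothetical stand-in rows; the real Waxal columns are not documented here.
sample_rows = [
    {"audio_path": "clip_0001.wav", "transcript": "example text", "language": "xx"},
    {"audio_path": "clip_0002.wav", "transcript": None, "language": "xx"},
]

schema = summarize_schema(sample_rows)
for column, type_counts in schema.items():
    print(column, type_counts)
```

The function accepts any iterable of mappings, so it can be pointed at a streaming loader that yields one row at a time. The `limit` argument caps how many rows are read, which keeps the check cheap on a large corpus.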
Provenance
- Source: google
- Collection Method: Sourced from original data and existing datasets, including DigitalUmuganda/AfriVoice and UGSpeechData.
- Freshness: Last updated on 2026-03-13.
- Geography: Africa