Swivuriso is a large-scale multilingual speech dataset targeting over 3000 hours of audio across 7 South African languages. The dataset is developed by dsfsi-anv to support Automatic Speech Recognition and inclusive speech technologies for low-resource African languages. It was last updated on the platform in February 2026.
Use Cases
- Train Automatic Speech Recognition (ASR) models on the multilingual audio data.
- Develop inclusive speech technologies for the low-resource languages the dataset covers.
- Benchmark speech model performance across languages on both scripted and unscripted speech.
- Study ethical data collection methods, using the community-centered processes described for this dataset as a reference.
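The cross-language benchmarking use case can be sketched as a per-language word error rate (WER) computation. WER is a common ASR metric, but the dataset description does not name any metric, so its use here is an assumption. The language labels and the `(language, reference, hypothesis)` tuple layout are hypothetical and do not reflect the dataset's actual schema.

```python
from collections import defaultdict


def word_edit_distance(ref_tokens, hyp_tokens):
    """Levenshtein distance between two token lists (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp_tokens) + 1))
    for i, ref_tok in enumerate(ref_tokens, start=1):
        curr = [i]
        for j, hyp_tok in enumerate(hyp_tokens, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer_by_language(samples):
    """Aggregate WER per language.

    samples: iterable of (language, reference_text, hypothesis_text) tuples.
    This layout is an illustrative assumption, not the dataset's schema.
    """
    errors = defaultdict(int)
    words = defaultdict(int)
    for language, reference, hypothesis in samples:
        ref_tokens = reference.split()
        errors[language] += word_edit_distance(ref_tokens, hypothesis.split())
        words[language] += len(ref_tokens)
    # Skip languages with no reference words to avoid division by zero.
    return {lang: errors[lang] / words[lang] for lang in words if words[lang] > 0}
```

Summing errors and reference words per language before dividing weights long utterances more heavily than averaging per-utterance WER would. Scripted and unscripted subsets could be scored separately by adding the speech type to the grouping key.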
Strengths
- Targets over 3000 hours of audio data.
- Covers 7 South African languages.
- Combines both scripted and unscripted speech.
- Described as collected through ethical, community-centered processes.
Limitations
- Description metadata is limited; actual data quality requires manual inspection after download.
- Column-level documentation is absent; field semantics must be inferred after download.
- Dataset paper is noted as a work in progress.
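Because column-level documentation is absent, a first inspection pass after download might look like the sketch below. It assumes the records have already been loaded as a list of dictionaries. The field names in the usage example are placeholders, since the actual columns are not documented.

```python
def summarize_fields(records, max_examples=3):
    """Collect observed types, None counts, and a few example values per field.

    records: list of dicts, one per dataset row (hypothetical in-memory form).
    Only explicit None values are counted as missing; keys that are absent
    from a record are not counted.
    """
    summary = {}
    for record in records:
        for name, value in record.items():
            info = summary.setdefault(
                name, {"types": set(), "missing": 0, "examples": []}
            )
            if value is None:
                info["missing"] += 1
                continue
            info["types"].add(type(value).__name__)
            # Keep a small set of distinct example values for manual review.
            if len(info["examples"]) < max_examples and value not in info["examples"]:
                info["examples"].append(value)
    return summary


if __name__ == "__main__":
    # Placeholder rows; real field names must be read from the downloaded data.
    rows = [
        {"text": "example sentence", "duration": 1.5},
        {"text": None, "duration": 2.0},
    ]
    for field, info in summarize_fields(rows).items():
        print(field, sorted(info["types"]), info["missing"], info["examples"])
```

A summary like this only narrows down what each field might mean. Audio quality and transcript accuracy still need manual review.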
Provenance
- Source: dsfsi-anv
- Collection Method: Combines scripted and unscripted speech collected through ethical, community-centered processes.
- Time Range: not specified
- Freshness: Last updated 2026-02-25 11:13:04; freshness should be verified.
- Geography: South Africa
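One simple way to check the freshness noted above is to compute how many days have passed since the recorded update timestamp. The sketch below assumes the timestamp is in UTC, which the source does not state. The function name is illustrative.

```python
from datetime import datetime, timezone


def days_since_update(last_updated, now=None):
    """Return whole days elapsed since a 'YYYY-MM-DD HH:MM:SS' timestamp.

    The timestamp is treated as UTC; this is an assumption, since the
    dataset description gives no timezone.
    """
    updated = datetime.strptime(last_updated, "%Y-%m-%d %H:%M:%S").replace(
        tzinfo=timezone.utc
    )
    if now is None:
        now = datetime.now(timezone.utc)
    return (now - updated).days


if __name__ == "__main__":
    print(days_since_update("2026-02-25 11:13:04"))
```

A large value suggests checking the source for a newer release before relying on the description above.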