Name: Swivuriso: Over 3000 Hours of Speech Across 7 South African Languages
Creator: dsfsi-anv
Published: 2025-07-09T11:39:52
Keywords: Community Centered, library:polars, Audio Data, library:dask, modality:text, size_categories:100K<n<1M, library:mlcroissant, library:datasets, license:cc-by-4.0, language:xho, Parquet, Multilingual, language:zul, language:ven, South Africa, Audio, arxiv:2512.02201, region:us, Large Scale, task_categories:automatic-speech-recognition, language:tso, language:tsn, language:sot, Multilingual Audio, South African Languages, Speech Recognition, Low Resource Languages, language:nde

Description

Swivuriso is a large-scale multilingual speech dataset targeting over 3000 hours of audio across 7 South African languages. The dataset is developed by dsfsi-anv to support Automatic Speech Recognition and inclusive speech technologies for low-resource African languages. It was last updated on the platform in February 2026.

Use Cases

Train Automatic Speech Recognition models on the multilingual audio data.
Develop inclusive speech technologies for the low-resource languages covered.
Benchmark speech model performance across languages on both scripted and unscripted speech.
Study ethical data collection methods, using the community-centered collection process as a reference.
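
For the cross-language benchmarking use case, one common metric is word error rate (WER), computed separately per language. The card does not name a metric, so the following is only a minimal sketch under that assumption. The function names and the (language, reference, hypothesis) input shape are illustrative and not part of the dataset; the sample strings in the usage note are placeholders, not dataset content.

```python
from collections import defaultdict


def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # substitution, or match when cost is 0
            ))
        prev = curr
    return prev[-1]


def wer_by_language(samples):
    """Aggregate WER per language.

    samples: iterable of (language_code, reference, hypothesis) strings.
    Returns {language_code: total_errors / total_reference_words}.
    Tokenization is a plain lowercase whitespace split (an assumption;
    real evaluations often apply language-specific normalization).
    """
    errors = defaultdict(int)
    words = defaultdict(int)
    for lang, ref, hyp in samples:
        ref_tokens = ref.lower().split()
        hyp_tokens = hyp.lower().split()
        errors[lang] += edit_distance(ref_tokens, hyp_tokens)
        words[lang] += len(ref_tokens)
    # Skip languages with no reference words to avoid division by zero.
    return {lang: errors[lang] / words[lang] for lang in words if words[lang] > 0}
```

Usage, with placeholder tokens standing in for real transcripts: `wer_by_language([("zul", "a b c d", "a b x d"), ("xho", "a b", "a")])` returns one WER value per language code, which makes gaps between languages visible at a glance.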

Strengths

Targets over 3000 hours of audio data.
Covers 7 South African languages.
Combines both scripted and unscripted speech.
Described as collected through ethical, community-centered processes.

Limitations

Description metadata is limited; actual data quality requires manual inspection after download.
Column-level documentation is absent; field semantics must be inferred after download.
Dataset paper is noted as a work in progress.
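
Because column-level documentation is absent, a first pass over a handful of downloaded rows can help infer field semantics. The sketch below assumes rows are available as plain Python dicts (for example, records read from the Parquet files); the helper name `summarize_columns` and the column names in the usage note are hypothetical, not taken from the dataset.

```python
def summarize_columns(rows, max_examples=2):
    """Summarize each column seen in an iterable of dict rows.

    For every column name, records the Python type names observed,
    the number of None values, and a few truncated example values.
    """
    summary = {}
    for row in rows:
        for name, value in row.items():
            info = summary.setdefault(
                name, {"types": set(), "nulls": 0, "examples": []}
            )
            if value is None:
                info["nulls"] += 1
                continue
            info["types"].add(type(value).__name__)
            # Store a truncated repr so large values (e.g. audio arrays)
            # stay readable and are never compared element-wise.
            example = repr(value)[:60]
            if len(info["examples"]) < max_examples and example not in info["examples"]:
                info["examples"].append(example)
    return summary
```

Usage, with made-up rows: `summarize_columns([{"text": "hello", "duration": 1.5}, {"text": None, "duration": 2.0}])` reports that the hypothetical `text` column holds strings with one missing value and that `duration` holds floats. Mixed types or many nulls in a column are a signal to inspect it manually.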

Provenance

Source: dsfsi-anv
Collection Method: Combines scripted and unscripted speech collected through ethical, community-centered processes.
Time Range: Not specified
Freshness: Last updated 2026-02-25 11:13:04; confirm against the source before relying on this date.
Geography: South Africa
