Description

The Lwazi Afrikaans ASR corpus provides matched audio recordings and orthographic transcriptions for building speech recognition systems. The audio is telephone-quality: 8 kHz, 16-bit, single-channel. The transcription of each utterance is stored in a separate text file. The dataset was created to support the development of Automatic Speech Recognition (ASR) for the Afrikaans language.
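The documented audio format can be checked mechanically before training. The sketch below is a minimal example using Python's standard `wave` module. It assumes the recordings are WAV files, which the description does not state, and the function name `matches_documented_format` is hypothetical.

```python
import wave

# Expected format taken from the description: 8 kHz, 16-bit, single channel.
EXPECTED_RATE_HZ = 8000
EXPECTED_SAMPLE_WIDTH_BYTES = 2  # 16 bits per sample
EXPECTED_CHANNELS = 1


def matches_documented_format(path):
    """Return True if the WAV file at `path` is 8 kHz, 16-bit, mono.

    Assumption: the corpus audio is stored in WAV containers.
    """
    with wave.open(str(path), "rb") as wav:
        return (
            wav.getframerate() == EXPECTED_RATE_HZ
            and wav.getsampwidth() == EXPECTED_SAMPLE_WIDTH_BYTES
            and wav.getnchannels() == EXPECTED_CHANNELS
        )
```

Files that fail the check may need resampling or may come from a different source, so it is worth listing them rather than silently skipping them.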

Use Cases

Training acoustic models for Afrikaans ASR based on telephone-quality audio.
Benchmarking speech recognition systems using orthographic transcriptions.
Studying phonetic or linguistic patterns in South African Afrikaans speech.
Developing or fine-tuning language models for Afrikaans from transcribed utterances.

Strengths

Includes complete, matched pairs of audio and text transcriptions for each utterance.
Audio specifications (8 kHz, 16-bit, 1 channel) are explicitly documented for telephone speech.
Released under a permissive CC-BY-4.0 license, facilitating open research and reuse.
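Because each utterance has a matching audio file and transcription, the pairs can be assembled by file name. The following is a minimal sketch, assuming a hypothetical layout in which an audio file and its transcription share the same file stem (for example `utt001.wav` and `utt001.txt`). The actual directory structure and file extensions are not documented here, so `pair_utterances` and its default extensions are illustrative only.

```python
from pathlib import Path


def pair_utterances(corpus_dir, audio_ext=".wav", text_ext=".txt"):
    """Match audio and transcription files that share a file stem.

    Assumption: paired files have identical stems and unique names
    across the corpus directory tree.
    Returns (pairs, unmatched): a dict of stem -> (audio_path, text_path)
    and a sorted list of stems that have only one of the two files.
    """
    root = Path(corpus_dir)
    audio = {p.stem: p for p in root.rglob(f"*{audio_ext}")}
    text = {p.stem: p for p in root.rglob(f"*{text_ext}")}

    # Stems present on both sides form the usable training pairs.
    pairs = {stem: (audio[stem], text[stem]) for stem in sorted(audio.keys() & text.keys())}
    # Stems present on only one side indicate a missing file.
    unmatched = sorted(audio.keys() ^ text.keys())
    return pairs, unmatched
```

Reporting the unmatched stems makes it easy to confirm that the corpus really is complete before using it for training or benchmarking.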

Limitations

Key metadata such as row count, dataset size, and specific column structure is unavailable.
The listed last-update date of 2026-03-24 appears to be a future date, indicating a potential metadata error.
Documentation is sparse, lacking details on speaker demographics, recording conditions, or corpus size.

Provenance

Source: Charl van Heerden
Collection Method: Created for the Lwazi speech recognition systems; specific collection method not detailed.
Time Range: Not specified
Freshness: 2026-03-24 12:39:11
Geography: South Africa

The listed 'last updated' date is in the future (2026), which may indicate incorrect metadata. Audio is telephone-quality, which may limit applicability for high-fidelity speech tasks.
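A date sanity check of this kind is easy to automate. The sketch below assumes the freshness timestamp follows the `YYYY-MM-DD HH:MM:SS` layout shown above. The helper name `is_after` is hypothetical.

```python
from datetime import datetime


def is_after(timestamp, reference):
    """Return True if `timestamp` ('YYYY-MM-DD HH:MM:SS') is later than `reference`."""
    return datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S") > reference


# Flag the listed freshness value if it lies after the moment of the check.
suspicious = is_after("2026-03-24 12:39:11", datetime.now())
```

Passing the reference time as a parameter keeps the check reproducible, since the result otherwise depends on when it is run.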

Text, Audio, Transcription, South Africa, Natural Language Processing, Afrikaans, Speech Recognition, Automatic Speech Recognition

Lwazi Afrikaans ASR Corpus: Telephone Speech and Transcriptions
