Name: Pidgin ASR Combined: Nigerian Pidgin Speech-to-Text Dataset with 8.6 Hours of Audio
Creator: michaelodafe
Published: 2026-05-06T15:31:01
Keywords: Whisper Model, Audio Dataset, Pidgin English, Benchmark, Audio, Nigerian Language, Speech Recognition

Description

Pidgin ASR Combined is a unified Nigerian Pidgin English speech-to-text dataset created by michaelodafe. It contains approximately 8.6 hours of audio across 4,278 clips from 10 source speakers, formatted as 16 kHz mono WAV files. The dataset was last updated on 2026-05-13 and was used to train a Whisper model that achieved a 21.37% word error rate.
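
The stated audio format can be checked locally after download. The following is a minimal sketch, assuming the clips are uncompressed PCM WAV files that Python's standard wave module can read. The function name check_wav_format is hypothetical and not part of the dataset.

```python
import wave

EXPECTED_RATE_HZ = 16_000  # sample rate stated in the description
EXPECTED_CHANNELS = 1      # mono, as stated in the description


def check_wav_format(path):
    """Return (matches_expected, duration_seconds) for one WAV file."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        duration = wav.getnframes() / rate
    matches = rate == EXPECTED_RATE_HZ and channels == EXPECTED_CHANNELS
    return matches, duration
```

Summing the returned durations over all clips gives a total that can be compared with the stated figure of approximately 8.6 hours.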

Use Cases

Fine-tuning automatic speech recognition models on the Nigerian Pidgin audio clips.
Benchmarking model performance against the reported 21.37% word error rate on the test split.
Training or evaluating models for low-resource language speech recognition based on the combined Pidgin sources.
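
For the benchmarking use case, word error rate can be computed as the word-level edit distance between reference and hypothesis transcripts, divided by the number of reference words. The sketch below is one common formulation, not necessarily how the reported 21.37% was obtained. The listing does not state the text normalization or scoring tool used, so the whitespace tokenization here is an assumption. The function names and the example sentences are illustrative and not taken from the dataset.

```python
def word_edit_distance(ref_words, hyp_words):
    """Levenshtein distance over words (substitutions, insertions, deletions)."""
    # prev[j] holds the distance between the reference prefix seen so far
    # and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1]


def corpus_wer(pairs):
    """Corpus-level WER: total word errors divided by total reference words.

    pairs is an iterable of (reference_text, hypothesis_text) tuples.
    Assumes simple whitespace tokenization with no further normalization.
    """
    total_errors = 0
    total_words = 0
    for reference, hypothesis in pairs:
        ref_words = reference.split()
        total_errors += word_edit_distance(ref_words, hypothesis.split())
        total_words += len(ref_words)
    if total_words == 0:
        raise ValueError("references must contain at least one word")
    return total_errors / total_words
```

For example, corpus_wer([("i dey come now", "i dey go now")]) returns 0.25, since one of the four reference words is substituted. Scores are only comparable with the reported figure if the same test split and text normalization are used.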

Strengths

Consolidates multiple public Pidgin ASR sources into a single, consistent schema with defined train/validation/test splits.
Contains approximately 8.6 hours of audio across 4,278 clips (roughly 7 seconds per clip on average), enough to fine-tune a pretrained speech recognition model.
The derived Whisper model achieved a 21.37% word error rate, demonstrating the dataset's utility for model development.

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
The listing does not report a row count, so the 4,278-clip figure and the size of each split should be confirmed against the downloaded data.
Last updated 2026-05-13 10:18:51; freshness should be verified.
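
Because column documentation is absent, a first step after download is to list which fields appear and what value types they hold. The following is a minimal sketch, assuming the records have already been loaded as a list of Python dictionaries. The function name summarize_fields and the field names in the usage note are hypothetical, since the actual schema is not documented.

```python
from collections import defaultdict


def summarize_fields(rows):
    """Map each field name to the sorted list of Python type names seen for it."""
    seen = defaultdict(set)
    for row in rows:
        for name, value in row.items():
            seen[name].add(type(value).__name__)
    return {name: sorted(types) for name, types in seen.items()}
```

Calling it on a small sample, for instance rows with hypothetical fields such as audio and text, shows at a glance which fields are always strings and which sometimes hold missing values.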

Provenance

Source: Hugging Face, author michaelodafe
Collection Method: Combination of publicly available Nigerian Pidgin ASR sources.
Freshness: 2026-05-13
Geography: Nigeria (implied by focus on Nigerian Pidgin)

License is unknown; users should verify terms of use before downloading.
