Description
KasaSpeech is a large-scale speech dataset, sourced primarily from Ghana, that features natural code-switching between English and Twi. It contains 49,878 transcribed audio samples split into training, validation, and test sets. The dataset was created by Hugging Face user Kennethdot and last updated on Hugging Face in May 2026.
Use Cases
Train automatic speech recognition models on transcribed English-Twi code-switched speech.
Develop language identification models using segments of mixed English and Twi audio.
Research the linguistic and sociolinguistic patterns of code-switching in natural speech from Ghanaian speakers.
Benchmark speech AI systems on low-resource languages using the provided train, validation, and test splits.
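For the benchmarking use case, word error rate (WER) is a common metric for speech recognition, although the dataset card does not name one. The following is a minimal sketch, assuming whitespace-tokenized transcripts; the function name `word_error_rate` and the placeholder tokens are illustrative and not part of the dataset.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Placeholder tokens stand in for real transcripts.
print(word_error_rate("a b c d", "a x c"))  # one substitution + one deletion
```

Code-switched transcripts may need text normalization (casing, punctuation, spelling variants) before scoring; that step is omitted here because the card does not describe the transcript conventions.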
Strengths
Contains 49,878 total speech samples, providing a substantial corpus for model training.
Provides a predefined split of 48,292 training, 581 validation, and 1,005 test samples.
Focuses on natural code-switching, which the dataset description identifies as key for advancing speech AI for low-resource languages.
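The reported split sizes can be cross-checked against the reported total with simple arithmetic. The sketch below uses only the figures stated in this card; the names `SPLIT_SIZES`, `REPORTED_TOTAL`, and `split_shares` are illustrative.

```python
# Figures taken from the card; names are illustrative.
SPLIT_SIZES = {"train": 48_292, "validation": 581, "test": 1_005}
REPORTED_TOTAL = 49_878


def split_shares(sizes):
    """Return each split's fraction of the total sample count."""
    total = sum(sizes.values())
    return {name: count / total for name, count in sizes.items()}


# The three splits should account for every reported sample.
assert sum(SPLIT_SIZES.values()) == REPORTED_TOTAL

for name, share in split_shares(SPLIT_SIZES).items():
    print(f"{name}: {share:.2%}")
```

The validation and test sets together make up roughly 3% of the corpus, so evaluation numbers on the validation set in particular rest on a small sample.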
Limitations
Column-level documentation is absent; field semantics must be inferred after download.
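Row count for the primary data table is not reported separately, which may limit suitability assessment; the documented split sizes sum to 49,878 samples, and this should be confirmed after download.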
The description metadata is limited; actual audio quality and speaker diversity require manual inspection.
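Because the columns are undocumented, a first step after download could be to summarize which fields appear and what value types they hold. This is a minimal sketch, assuming the rows are available as Python dictionaries; the field names `audio_path`, `transcript`, and `duration_s` are hypothetical placeholders, not the dataset's actual columns.

```python
from collections import defaultdict


def summarize_fields(records):
    """Map each field name to the sorted type names observed for its values."""
    seen = defaultdict(set)
    for record in records:
        for field, value in record.items():
            seen[field].add(type(value).__name__)
    return {field: sorted(types) for field, types in seen.items()}


# Hypothetical rows: the real column names are not documented in the card.
sample_rows = [
    {"audio_path": "clip_0001.wav", "transcript": "placeholder text", "duration_s": 3.2},
    {"audio_path": "clip_0002.wav", "transcript": None, "duration_s": 4},
]

for field, types in summarize_fields(sample_rows).items():
    print(field, types)
```

Fields that show more than one type (for example a transcript that is sometimes missing) are worth inspecting by hand before training.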
Provenance
Source
Hugging Face user Kennethdot.
Collection Method
Not documented in detail. According to the description, the speech recordings were likely collected from diverse speakers.
Freshness
Last updated 2026-05-21 10:41:20; freshness should be verified.
Geography
Primarily Ghana, as stated in the description.
License is unknown; terms of use must be verified before application.