This dataset contains audio clips and transcriptions of Kalenjin speech sourced from the Mozilla Common Voice project. It was created by author kln001 and last updated on July 28, 2025. It is intended for training and evaluating Automatic Speech Recognition (ASR) models.
Use Cases
- Train Automatic Speech Recognition models on Kalenjin audio clips.
- Evaluate ASR model performance on the Kalenjin language using validated test sets.
- Benchmark speech recognition accuracy for underrepresented languages using the provided transcriptions.
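For the evaluation and benchmarking use cases, a common metric is word error rate (WER): the word-level edit distance between a reference transcription and the model output, divided by the number of reference words. The following is a minimal sketch, not something shipped with the dataset; the function name `word_error_rate` is illustrative, and it assumes whitespace tokenization with no text normalization, which may or may not suit Kalenjin transcriptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: words are separated by whitespace and compared exactly
    (no lowercasing or punctuation stripping).
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)


# Placeholder tokens, not real Kalenjin text:
# one substitution and one deletion against four reference words.
print(word_error_rate("w1 w2 w3 w4", "w1 x w3"))  # 0.5
```

Averaging this value per clip weights short and long utterances equally; summing edit distances and reference lengths over the whole test split before dividing is the other common convention, so it is worth stating which one a reported score uses.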
Strengths
- Data is sourced from the Mozilla Common Voice project, a well-known open-source speech data initiative.
- The dataset is structured with splits for training, testing, and validation, which suggests a standard machine-learning workflow.
Limitations
- The total audio duration, row count, and file formats are not documented, which makes it hard to assess suitability before download.
- Column-level documentation is absent; field semantics must be inferred after download.
- Description metadata is limited; actual data quality requires manual inspection after download.
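Because the row count and column names are not documented, a first inspection pass after download might look like the sketch below. It assumes the transcriptions ship as a delimited text file with a header row (tab-separated by default), which the description does not confirm; the function name `summarize_table` and the path `train.tsv` are hypothetical.

```python
import csv
from collections import Counter


def summarize_table(path, delimiter="\t"):
    """Return column names, row count, and per-column empty-cell counts.

    Assumption: the file is UTF-8 text with a header row. Quote characters
    are treated as ordinary text so quoted sentences are not merged.
    """
    empty = Counter()
    rows = 0
    with open(path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle, delimiter=delimiter, quoting=csv.QUOTE_NONE)
        columns = list(reader.fieldnames or [])
        for row in reader:
            rows += 1
            for name in columns:
                # Missing or whitespace-only cells count as empty.
                if not (row.get(name) or "").strip():
                    empty[name] += 1
    return {
        "columns": columns,
        "rows": rows,
        "empty_cells": {name: empty[name] for name in columns},
    }


# Hypothetical usage; replace the path with a file from the download:
# print(summarize_table("train.tsv"))
```

Running this on each split gives the row counts and column names the description leaves out, and the empty-cell counts flag clips that lack a transcription.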
Provenance
- Source: Mozilla Common Voice project
- Collection Method: Likely contributed by volunteers recording and transcribing speech.
- Freshness: Last updated 2025-07-28 11:26:44; freshness should be verified.