A Somali speech dataset of 10K to 100K audio samples with transcriptions, designed for automatic speech recognition (ASR) tasks. The dataset is hosted on Hugging Face by the author 'skydheere' and was last updated on 2025-05-09. It is provided in Parquet format under a CC-BY 4.0 license.
Use Cases
- Train automatic speech recognition models on the paired audio recordings and transcriptions.
- Evaluate the performance of ASR systems on Somali speech using the provided audio-text pairs.
- Fine-tune pre-trained multilingual speech models for the Somali language.
- Develop speech technology applications for Somali speakers.
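For the evaluation use case, a common metric is word error rate (WER). The source does not name a metric, so WER is an assumption here, and the function names and placeholder transcripts below are hypothetical. This is a minimal sketch that compares a reference transcription with a model's output using word-level edit distance:

```python
from typing import Iterable, List, Tuple


def word_edit_distance(ref: List[str], hyp: List[str]) -> int:
    """Levenshtein distance between two word lists (substitutions, insertions, deletions)."""
    # prev[j] holds the distance between the reference prefix seen so far and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion of a reference word
                curr[j - 1] + 1,     # insertion of a hypothesis word
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER for one utterance: word edits divided by the number of reference words."""
    ref = reference.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return word_edit_distance(ref, hypothesis.split()) / len(ref)


def corpus_word_error_rate(pairs: Iterable[Tuple[str, str]]) -> float:
    """Corpus-level WER: total word edits divided by total reference words."""
    total_edits = 0
    total_words = 0
    for reference, hypothesis in pairs:
        ref = reference.split()
        total_edits += word_edit_distance(ref, hypothesis.split())
        total_words += len(ref)
    if total_words == 0:
        raise ValueError("references must contain at least one word")
    return total_edits / total_words


# Placeholder tokens stand in for real transcriptions.
print(word_error_rate("w1 w2 w3 w4", "w1 x w3"))  # one substitution + one deletion over 4 words
```

In practice the transcriptions would usually be normalized (case, punctuation) before scoring; the right normalization for this dataset is not documented and would need to be chosen after inspecting the text.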
Strengths
- Contains between 10,000 and 100,000 samples, providing a substantial corpus for model training.
- Includes both audio and text modalities, which is essential for supervised ASR tasks.
- Released under the permissive CC-BY 4.0 license, which allows open use and redistribution with attribution.
Limitations
- Column-level documentation is absent; field semantics must be inferred after download.
- Exact row count is unknown; only the 10K to 100K size range is reported, which may limit suitability assessment.
- Description metadata is limited; actual data quality requires manual inspection after download.
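Because the columns are undocumented, a first inspection pass after download might simply tally what each field contains. The sketch below assumes the rows have already been read from the Parquet files into a list of dictionaries by whatever reader is used; the function name and the column names in the usage example (`audio`, `text`) are hypothetical and not confirmed by the source.

```python
from typing import Any, Dict, Iterable, List


def summarize_columns(rows: Iterable[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Report, per column, the Python types observed and how many rows lack a value."""
    rows = list(rows)
    # Collect every column name that appears in any row.
    columns = set()
    for row in rows:
        columns.update(row)

    summary: Dict[str, Dict[str, Any]] = {}
    for column in sorted(columns):
        types: set = set()
        missing = 0
        for row in rows:
            value = row.get(column)
            if value is None:
                missing += 1  # absent key or explicit null
            else:
                types.add(type(value).__name__)
        summary[column] = {"types": sorted(types), "missing": missing}
    return summary


# Hypothetical sample rows; real field names must be read from the downloaded files.
sample_rows: List[Dict[str, Any]] = [
    {"audio": b"\x00\x01", "text": "w1 w2"},
    {"audio": b"\x02\x03", "text": None},
]
for name, info in summarize_columns(sample_rows).items():
    print(name, info)
```

A summary like this only shows types and gaps; judging transcription accuracy or audio quality still requires listening to and reading a sample of the data.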
Provenance
- Source: Hugging Face
- Freshness: Last updated 2025-05-09 13:26:42; freshness should be verified.