Tamazight-NLP hosts the Tamazight-Arabic Speech Recognition Dataset containing 20,344 audio segments. The dataset provides approximately 15.5 hours of Tamazight speech in the Tachelhit dialect paired with Arabic transcriptions. It was last updated on March 29, 2025.
Use Cases
- Train automatic speech recognition (ASR) models on Tamazight audio.
- Develop speech-to-text translation systems using the cross-lingual audio-transcription pairs.
- Benchmark ASR model performance on the Tachelhit dialect using the provided test set.
- Study code-switching or language representation in speech models for the Tamazight-Arabic language pair.
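For the benchmarking use case, word error rate (WER) is a common ASR metric. The following is a minimal sketch, not something the dataset prescribes: it assumes whitespace tokenization, which may be too coarse for Arabic-script transcriptions, and the function names are illustrative.

```python
def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, ref_tok in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_tok in enumerate(hyp, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words.

    Assumption: words are separated by whitespace.
    """
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hyp_words) / len(ref_words)
```

In practice the score would be aggregated over the 2,035 test examples, typically by summing edit distances and reference lengths across utterances rather than averaging per-utterance rates.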
Strengths
- Contains 20,344 audio-text pairs.
- Provides approximately 15.5 hours of speech data.
- Includes a predefined split of 18,309 training and 2,035 test examples.
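The figures above can be cross-checked with simple arithmetic. The sketch below uses only the numbers quoted in this summary; because the total duration is approximate, the derived mean segment length is approximate too. The split comes out to roughly 90/10, and the mean segment length to about 2.7 seconds.

```python
# Figures quoted in this summary.
TOTAL_SEGMENTS = 20_344
TRAIN_EXAMPLES = 18_309
TEST_EXAMPLES = 2_035
APPROX_HOURS = 15.5  # approximate total duration


def split_fractions(train: int, test: int) -> tuple[float, float]:
    """Return the train and test shares of the combined total."""
    total = train + test
    return train / total, test / total


def mean_segment_seconds(hours: float, segments: int) -> float:
    """Average segment length in seconds, given total hours of audio."""
    return hours * 3600 / segments


# The two splits should account for every segment.
assert TRAIN_EXAMPLES + TEST_EXAMPLES == TOTAL_SEGMENTS

train_frac, test_frac = split_fractions(TRAIN_EXAMPLES, TEST_EXAMPLES)
print(f"train {train_frac:.1%}, test {test_frac:.1%}")
print(f"mean segment length: {mean_segment_seconds(APPROX_HOURS, TOTAL_SEGMENTS):.2f} s")
```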
Limitations
- Column-level documentation is absent; field semantics must be inferred after download.
- Description metadata is limited; actual data quality requires manual inspection after download.
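Since the columns are undocumented, a first inspection pass might simply tally which fields appear and what value types they hold. The helper below is a hypothetical sketch that works on any iterable of dict-like rows. The sample rows and the field names `audio` and `text` are placeholders, not the dataset's actual schema.

```python
from collections import Counter
from typing import Any, Iterable, Mapping


def summarize_fields(records: Iterable[Mapping[str, Any]]) -> dict[str, Counter]:
    """Map each field name to a count of the Python types observed for it."""
    summary: dict[str, Counter] = {}
    for record in records:
        for name, value in record.items():
            summary.setdefault(name, Counter())[type(value).__name__] += 1
    return summary


# Placeholder rows for illustration only; real field names must be
# read from the downloaded data.
sample = [
    {"audio": b"\x00\x01", "text": "example transcription"},
    {"audio": b"\x02\x03", "text": None},
]

for field, types in summarize_fields(sample).items():
    print(field, dict(types))
```

A field that shows mixed types, or a noticeable share of `NoneType` values, is a reasonable place to start the manual quality check.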
Provenance
- Source: Tamazight-NLP
- Freshness: Last updated 2025-03-29 22:57:49; freshness should be verified.