A synthetic Moroccan Darija dataset for transliteration tasks. Created by Haitam03, it pairs Latin-script input with Arabic-script output and normalized forms. It was last updated on Hugging Face on October 29, 2025.
Use Cases
- Train transliteration models based on Latin-to-Arabic character mapping.
- Develop text normalization tools based on the described normalized forms.
- Benchmark transliteration systems on Latin-script Moroccan Darija input.
- Create educational resources for Darija script conversion.
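The character-mapping use case above can be sketched as a greedy longest-match lookup. This is a minimal illustration, not the dataset's method: the `CHAR_MAP` table and the `transliterate` function are hypothetical, and the mapping pairs follow common informal Arabizi conventions rather than anything documented for this dataset. A trained model would normally replace the fixed table, since Latin-script Darija spelling is not standardized.

```python
# Illustrative mapping only: these pairs follow common Arabizi conventions
# and are NOT taken from the dataset.
CHAR_MAP = {
    "ch": "ش",
    "kh": "خ",
    "gh": "غ",
    "3": "ع",
    "7": "ح",
    "9": "ق",
    "b": "ب",
    "d": "د",
    "k": "ك",
    "l": "ل",
    "m": "م",
    "n": "ن",
    "r": "ر",
    "s": "س",
    "t": "ت",
}

# Longest key length, so digraphs such as "kh" are tried before single letters.
MAX_KEY_LEN = max(len(key) for key in CHAR_MAP)


def transliterate(text: str) -> str:
    """Greedy longest-match transliteration; unmapped characters pass through."""
    text = text.lower()
    out = []
    i = 0
    while i < len(text):
        for size in range(MAX_KEY_LEN, 0, -1):
            chunk = text[i:i + size]
            if chunk in CHAR_MAP:
                out.append(CHAR_MAP[chunk])
                i += len(chunk)
                break
        else:
            # No mapping found: keep the original character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)
```

A rule-based baseline like this is mainly useful as a point of comparison: its output can be scored against the dataset's Arabic-script column to see how much a learned model improves on plain character substitution.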
Strengths
- Focuses on Moroccan Darija, a specific Arabic dialect.
- Provides normalized forms, which may aid in text standardization.
- Includes both Latin input and Arabic output for transliteration tasks.
Limitations
- The dataset description is sparse; data quality can only be assessed by inspecting the data after download.
- Row count is not stated, which makes it hard to judge whether the dataset is large enough for a given task.
- Column-level documentation is absent; field semantics must be inferred after download.
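Because the row count and column semantics are undocumented, a quick profiling pass after download can help with the manual inspection described above. The sketch below assumes the rows are already available as an iterable of dictionaries (for example, as yielded by iterating over a split loaded with the Hugging Face `datasets` library). The `summarize_columns` helper and the column names `latin`, `arabic`, and `normalized` are hypothetical placeholders, not the dataset's actual schema.

```python
from collections import defaultdict


def summarize_columns(rows):
    """Return per-column stats: value types seen, non-empty count, one example."""
    summary = defaultdict(
        lambda: {"types": set(), "non_empty": 0, "example": None}
    )
    for row in rows:
        for name, value in row.items():
            info = summary[name]
            info["types"].add(type(value).__name__)
            # Treat None and the empty string as missing values.
            if value not in (None, ""):
                info["non_empty"] += 1
                if info["example"] is None:
                    info["example"] = value
    return dict(summary)


# Hypothetical rows; the real column names are undocumented.
sample_rows = [
    {"latin": "ktb", "arabic": "كتب", "normalized": "ktb"},
    {"latin": "3lm", "arabic": "", "normalized": None},
]

report = summarize_columns(sample_rows)
print(f"{len(sample_rows)} rows")
for name, info in report.items():
    print(name, sorted(info["types"]), info["non_empty"], repr(info["example"]))
```

Comparing each column's non-empty count with the total row count gives a first indication of missing values, and the example values help in guessing what each field holds.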
Provenance
- Source: Haitam03 on Hugging Face.
- Collection Method: Likely synthetically generated.
- Freshness: Last updated 2025-10-29 18:12:38; freshness should be verified.
- Geography: Morocco (Moroccan Darija).