Description
This dataset contains 8,169 Egyptian Arabic text samples, manually annotated for offensive language and hate speech. It was created by IbrahimAmin, Mostafa Abbas, Rany Hatem, Andrew Ihab, and Mohamed Waleed Fahkr, and was last updated on August 17, 2025.
Use Cases
Fine-tuning transformer models for hate speech detection in Egyptian-dialect text.
Training classifiers for offensive language identification from manually labeled samples.
Benchmarking NLP models on Egyptian Arabic dialect tasks.
Studying linguistic patterns of hate speech in a specific Arabic dialect.
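For the benchmarking use case, macro-averaged F1 is a common metric when offensive and hateful classes are rarer than neutral text, because it weights every class equally. The sketch below is a minimal, standard-library illustration and not the authors' evaluation protocol. The label names ("neutral", "offensive", "hate") are assumptions; the dataset's actual label set is not documented here.

```python
from collections import Counter

def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 over every label seen in gold or pred."""
    if len(gold) != len(pred) or not gold:
        raise ValueError("gold and pred must be non-empty and of equal length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted p, but it was wrong
            fn[g] += 1  # gold label g was missed
    scores = []
    for label in sorted(set(gold) | set(pred)):
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical labels, for illustration only.
gold = ["neutral", "offensive", "hate", "neutral"]
pred = ["neutral", "neutral", "hate", "neutral"]
score = macro_f1(gold, pred)
```

A model that never predicts a minority class scores zero on that class, which lowers the macro average even when overall accuracy looks high.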
Strengths
8,169 text samples provide a substantial corpus for model training.
Manual labeling suggests higher annotation quality than automated or weakly supervised labeling.
Focus on the Egyptian dialect addresses a specific linguistic niche.
Limitations
Column-level documentation is absent; field semantics must be inferred after download.
The row count is known, but other metadata, such as file formats and a confirmed license, is missing.
The single-dialect focus may introduce geographic bias, so models trained on this data may not generalize to other Arabic dialects.
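Because column-level documentation is absent, a quick profile of the downloaded rows can help infer which field holds the text and which holds the label. The sketch below assumes the rows have already been loaded as a list of dictionaries with scalar values. The column names "text" and "label" and the sample rows are hypothetical placeholders, not the dataset's actual schema.

```python
from collections import Counter

def summarize_columns(rows, max_categories=10):
    """Count distinct values per column and flag low-cardinality columns
    as candidate label fields."""
    counts_by_column = {}
    for row in rows:
        for name, value in row.items():
            counts_by_column.setdefault(name, Counter())[value] += 1
    report = {}
    for name, counts in counts_by_column.items():
        distinct = len(counts)
        report[name] = {
            "distinct": distinct,
            # Few distinct values that repeat across rows suggest a label column.
            "likely_label": distinct <= max_categories and distinct < len(rows),
            "top": counts.most_common(3),
        }
    return report

# Hypothetical rows, for illustration only.
rows = [
    {"text": "sample one", "label": "neutral"},
    {"text": "sample two", "label": "offensive"},
    {"text": "sample three", "label": "neutral"},
    {"text": "sample four", "label": "offensive"},
]
report = summarize_columns(rows)
```

The heuristic is only a starting point: inferred field meanings should still be checked against the original dataset page.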
Provenance
Source
Hugging Face
Collection Method
Manually labeled text samples.
Freshness
Last updated 2025-08-17 14:48:39
Geography
Egypt
License
Listed as MIT in the raw description but as 'unknown' in the metadata fields; verify before use.