The French Hate Speech Superset by manueltonneau contains 18,071 posts, each annotated as hateful or not hateful. It merges all publicly available French hate speech datasets identified in a systematic survey conducted in early 2024. The dataset was last updated in October 2024.
Use Cases
- Train binary text classifiers to predict the hateful/not-hateful label from post content.
- Benchmark model performance on a consolidated set of 18,071 annotated French posts.
- Analyze linguistic patterns and vocabulary associated with hateful annotations across merged datasets.
- Fine-tune pre-trained language models like CamemBERT for French hate speech detection tasks.
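For the benchmarking use case, a minimal sketch of scoring binary predictions against gold labels might look like the following. It assumes labels are encoded as 1 (hateful) and 0 (not hateful), which this card does not confirm. The helper name `binary_prf` and the toy label lists are hypothetical and are not drawn from the dataset.

```python
from typing import Dict, Sequence


def binary_prf(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Precision, recall, and F1 for the positive class (assumed: 1 = hateful)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")

    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # true positives
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false positives
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # false negatives

    # Fall back to 0.0 when a denominator is zero (no positives predicted or present).
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Toy labels for illustration only; they are not taken from the dataset.
gold = [1, 0, 1, 1, 0, 0]
pred = [1, 0, 0, 1, 1, 0]
scores = binary_prf(gold, pred)
print(scores)
```

Reporting precision and recall on the hateful class alongside F1 is usually more informative than accuracy alone, because the class balance of this dataset is not documented.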
Strengths
- Consolidates 18,071 annotated posts from multiple sources into a single dataset.
- Built from a systematic 2024 survey of publicly available French hate speech datasets.
Limitations
- Column names, data distributions, and class balance are not documented.
- Merging multiple datasets may introduce inconsistencies, since the source datasets may have followed different annotation guidelines.
- The dataset's temporal coverage and geographic origin of posts are unknown.
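Because class balance and annotation consistency are undocumented, one way to probe them after loading the data is to compare the share of hateful labels across source datasets. The sketch below assumes each row is a dict with the hypothetical fields `label` (1 = hateful, 0 = not hateful) and `source`. The actual column names are not given in this card, and it is an assumption that a source identifier is available at all. The rows shown are placeholders, not real data.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Mapping


def hateful_rate_by_source(rows: Iterable[Mapping]) -> Dict[str, float]:
    """Fraction of rows labeled hateful (label == 1) within each source."""
    counts = defaultdict(Counter)
    for row in rows:
        # "source" and "label" are hypothetical field names.
        counts[row["source"]][row["label"]] += 1
    return {
        source: label_counts[1] / sum(label_counts.values())
        for source, label_counts in counts.items()
    }


# Placeholder rows for illustration only; not taken from the dataset.
rows = [
    {"text": "exemple a", "label": 1, "source": "dataset_a"},
    {"text": "exemple b", "label": 0, "source": "dataset_a"},
    {"text": "exemple c", "label": 0, "source": "dataset_b"},
    {"text": "exemple d", "label": 0, "source": "dataset_b"},
]
rates = hateful_rate_by_source(rows)
print(rates)
```

Large gaps in the hateful rate between sources do not prove the guidelines differed, but they can flag sources worth checking before training on the merged data.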
Provenance
- Source: Merge of multiple publicly available French hate speech datasets.
- Collection Method: Preprocessing and merge of datasets identified via a systematic survey in early 2024.
- Time Range: Not specified.
- Freshness: Last updated October 2024.
- Geography: Not specified.