A corpus of 3,300 human-annotated Thai tweets, of which 2,027 are labeled toxic and 1,273 non-toxic. The corpus includes labels from three annotators, who were guided by a 44-word dictionary. The 506 tweets that are no longer publicly available are marked with a TWEET_NOT_FOUND placeholder in the text field.
Use Cases
- Train a binary classification model to distinguish between toxic and non-toxic content using the tweet_text and human labels
- Evaluate the accuracy of keyword-based moderation systems by comparing the 44-word dictionary against actual toxicity labels
- Research linguistic features of sarcasm and word sense ambiguity in Thai text that lead to annotator disagreement
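The keyword-evaluation use case can be sketched roughly as follows. This is a minimal illustration, not the dataset authors' method: the label column name `is_toxic`, the toy rows, and the single toy keyword are assumptions made for the example, and the real 44-word dictionary is not reproduced here. Substring matching is used because Thai text is not whitespace-delimited, which is also one reason keyword matches can disagree with human labels.

```python
from typing import Dict, Iterable, List

PLACEHOLDER = "TWEET_NOT_FOUND"  # placeholder string described in the dataset summary


def keyword_predict(text: str, keywords: Iterable[str]) -> int:
    """Return 1 if any keyword occurs as a substring of the text, else 0."""
    return int(any(kw in text for kw in keywords))


def evaluate_keyword_baseline(rows: List[dict], keywords: List[str]) -> Dict[str, float]:
    """Compare keyword matches with human labels, skipping unavailable tweets."""
    tp = fp = fn = tn = 0
    for row in rows:
        text = row["tweet_text"]
        if text == PLACEHOLDER:
            continue  # no text to match against
        pred = keyword_predict(text, keywords)
        gold = int(row["is_toxic"])  # assumed label column name
        if pred and gold:
            tp += 1
        elif pred and not gold:
            fp += 1
        elif gold:
            fn += 1
        else:
            tn += 1
    total = tp + fp + fn + tn
    return {
        "tp": tp, "fp": fp, "fn": fn, "tn": tn,
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
        "accuracy": (tp + tn) / total if total else 0.0,
    }


# Toy rows standing in for real data (hypothetical content).
toy_rows = [
    {"tweet_text": "this has badword inside", "is_toxic": 1},
    {"tweet_text": "a harmless sentence", "is_toxic": 0},
    {"tweet_text": "badword used as a joke", "is_toxic": 0},
    {"tweet_text": "insulting without listed words", "is_toxic": 1},
    {"tweet_text": "TWEET_NOT_FOUND", "is_toxic": 1},
]
toy_keywords = ["badword"]

result = evaluate_keyword_baseline(toy_rows, toy_keywords)
print(result)
```

False positives in such a comparison point to keyword uses that annotators judged non-toxic, and false negatives to toxic tweets that contain no dictionary word.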
Strengths
- 3,300 total tweets labeled by three human annotators for binary toxicity
- Contains 2,027 toxic and 1,273 non-toxic samples
- Marks the 506 unavailable tweets with the string TWEET_NOT_FOUND in the tweet_text column
- Annotators were guided by a 44-word dictionary
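Because the placeholder rows carry no usable text, a loader would likely drop them before training. The sketch below assumes the data is available as a CSV with a `tweet_text` column and a hypothetical `is_toxic` label column. The actual file format and label column name are not stated here, so treat both as assumptions.

```python
import csv
import io
from collections import Counter

PLACEHOLDER = "TWEET_NOT_FOUND"  # placeholder string described in the dataset summary


def load_available(csv_file):
    """Read rows from a CSV file object and separate out placeholder rows.

    Returns (kept_rows, dropped_count).
    """
    kept, dropped = [], 0
    for row in csv.DictReader(csv_file):
        if row["tweet_text"].strip() == PLACEHOLDER:
            dropped += 1  # tweet text is no longer available
        else:
            kept.append(row)
    return kept, dropped


# Toy CSV standing in for the real file (hypothetical content and columns).
toy_csv = io.StringIO(
    "tweet_text,is_toxic\n"
    "first toy tweet,1\n"
    "second toy tweet,0\n"
    "TWEET_NOT_FOUND,1\n"
    "third toy tweet,1\n"
)

kept, dropped = load_available(toy_csv)
label_counts = Counter(row["is_toxic"] for row in kept)
print(len(kept), dropped, dict(label_counts))
```

Recomputing the class counts after filtering is worthwhile, since the toxic to non-toxic ratio of the remaining rows may differ from the 2,027 to 1,273 split of the full corpus.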