This dataset contains Roman Urdu text annotated for toxic language and is hosted on Kaggle. It likely consists of text samples with labels indicating the presence of harmful or offensive content. The available metadata does not specify the dataset's volume, author, or collection timeframe.
Use Cases
- Training a classifier to detect toxic language in Roman Urdu text (inferred from domain, verify after download)
- Benchmarking multilingual toxicity detection models (inferred from domain, verify after download)
- Analyzing linguistic patterns of offensive speech in Roman Urdu, a low-resource romanized form of Urdu (inferred from domain, verify after download)
Strengths
- Published on Kaggle, a major platform for sharing ML datasets.
- Focuses on Roman Urdu (Urdu written in the Latin alphabet), a potentially lower-resource script variant.
Limitations
- Metadata is minimal; actual content requires verification after download.
- Row count, column definitions, and license are unknown, limiting suitability assessment.
- Data may reflect biases from its original collection source, which is not specified.
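
Because the row count and column definitions are unknown, a first inspection pass after download can help with the verification mentioned above. The following is a minimal sketch, assuming the dataset ships as a UTF-8 CSV file with a header row; that assumption is not confirmed by the metadata. The function name `summarize_csv` and the `max_distinct` threshold are hypothetical and chosen only for illustration.

```python
import csv
from collections import Counter


def summarize_csv(path, max_distinct=20):
    """Return row count, column names, and value counts for low-cardinality columns.

    Assumes a UTF-8 CSV with a header row (not confirmed by the dataset metadata).
    """
    with open(path, newline="", encoding="utf-8") as f:
        reader = csv.DictReader(f)
        columns = list(reader.fieldnames or [])
        counters = {name: Counter() for name in columns}
        row_count = 0
        for row in reader:
            row_count += 1
            for name in columns:
                counters[name][row.get(name)] += 1

    # Columns with only a few distinct values are candidates for the label column;
    # free-text columns will normally exceed the threshold and be left out.
    label_candidates = {
        name: dict(counts)
        for name, counts in counters.items()
        if 0 < len(counts) <= max_distinct
    }
    return {
        "rows": row_count,
        "columns": columns,
        "label_candidates": label_candidates,
    }
```

Calling `summarize_csv("roman_urdu_toxic.csv")` (a placeholder file name, not the actual one) would report how many rows the file has, which columns it defines, and how values are distributed in any column that looks like a label. The license still has to be checked separately on the dataset's Kaggle page.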