Description

RUEmoCorp is a large-scale emotion classification corpus for Roman Urdu, the informal transliterated writing style dominant in Pakistani digital communication. It was created to address the underrepresentation of Roman Urdu in NLP research and comprises a formally annotated benchmark subset of approximately 28,000 samples plus a larger raw corpus. It was authored by Muhammad Khubaib Ahmad and last updated in May 2026.

Use Cases

Train emotion classification models for Roman Urdu text using the annotated benchmark subset.
Study cross-lingual transfer learning from English to Roman Urdu, drawing on the frequent code-mixed expressions in the corpus.
Analyze spelling variation and non-standard orthography in informal digital communication.
Develop language understanding models for low-resource languages, taking advantage of the corpus's scale and annotation.
Conduct research in affective computing for South Asian social media contexts.
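As a rough illustration of the spelling-variation use case, the sketch below groups tokens that share a crude consonant skeleton. The heuristic (collapse repeated letters, keep the first letter, drop later vowels) and the names skeleton and group_variants are assumptions made for this example, not part of RUEmoCorp or its tooling, and the sample tokens are hand-picked illustrations rather than corpus data.

```python
import re
from collections import defaultdict

VOWELS = set("aeiou")


def skeleton(token: str) -> str:
    """Crude normalization key (an assumption of this sketch, not a corpus rule)."""
    token = token.lower()
    # Collapse runs of the same letter, e.g. "nahiii" -> "nahi".
    token = re.sub(r"(.)\1+", r"\1", token)
    if not token:
        return token
    # Keep the first letter and drop every later vowel.
    return token[0] + "".join(ch for ch in token[1:] if ch not in VOWELS)


def group_variants(tokens):
    """Return only the skeletons that are shared by more than one spelling."""
    groups = defaultdict(set)
    for token in tokens:
        groups[skeleton(token)].add(token.lower())
    return {key: sorted(forms) for key, forms in groups.items() if len(forms) > 1}


# Hand-picked illustrative tokens, not taken from the corpus.
variants = group_variants(["nahi", "nhi", "nahee", "kyun", "kyon", "acha"])
```

A key this coarse will also merge unrelated short words (for example, "hai" and "hi" both reduce to "h"), so it is better suited to surfacing candidate variant groups for manual review than to automatic normalization.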

Strengths

Annotated benchmark subset of approximately 28,000 samples with a label-balanced design.
Substantial inter-annotator agreement with a Fleiss' Kappa of κ = 0.6588.
Data sourced from natural contexts like public social media and anonymized WhatsApp conversations.
Annotation involved four native-speaker annotators from three Pakistani universities following a structured protocol.
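For readers who want to see what the agreement figure measures, here is a minimal sketch of the standard Fleiss' Kappa computation in plain Python. The function name fleiss_kappa and the toy rating table are assumptions for illustration; the table is invented and is not RUEmoCorp annotation data, so the result does not reproduce the reported κ.

```python
from typing import Sequence


def fleiss_kappa(counts: Sequence[Sequence[int]]) -> float:
    """counts[i][j] = number of annotators who assigned label j to item i."""
    n_items = len(counts)
    n_raters = sum(counts[0])
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every item needs the same number of ratings")
    n_labels = len(counts[0])

    # Share of all ratings that went to each label.
    p_label = [
        sum(row[j] for row in counts) / (n_items * n_raters)
        for j in range(n_labels)
    ]
    # Per-item agreement: fraction of annotator pairs that agree on the item.
    p_item = [
        (sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
        for row in counts
    ]
    p_observed = sum(p_item) / n_items
    p_expected = sum(p * p for p in p_label)
    return (p_observed - p_expected) / (1 - p_expected)


# Invented toy table: 4 items, 4 annotators, 3 hypothetical emotion labels.
toy_counts = [
    [4, 0, 0],  # all four annotators agree
    [0, 4, 0],
    [0, 0, 4],
    [2, 2, 0],  # annotators split evenly between two labels
]
toy_kappa = fleiss_kappa(toy_counts)
```

Each row must sum to the number of annotators (four here, matching the annotation setup described above); the function raises an error otherwise.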

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
The total size of the full corpus and the exact row counts of its components are not specified in the provided metadata.
Data may reflect geographic and linguistic bias inherent to its specific collection sources in Pakistani digital spaces.
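Because column-level documentation is absent, a first step after download is usually to profile each column. The following is a minimal sketch using only the standard library; it assumes the data can be read as CSV, and the column names text and label, the label values, and the sample rows are all hypothetical placeholders rather than the dataset's actual schema or content.

```python
import csv
import io
from collections import Counter


def summarize_columns(csv_text: str, top: int = 3) -> dict:
    """Count rows, non-empty cells, and distinct values for every column."""
    reader = csv.DictReader(io.StringIO(csv_text))
    columns = reader.fieldnames or []
    values = {name: Counter() for name in columns}
    n_rows = 0
    for row in reader:
        n_rows += 1
        for name in columns:
            # Missing or empty cells are treated as empty strings.
            cell = (row.get(name) or "").strip()
            if cell:
                values[name][cell] += 1
    return {
        name: {
            "rows": n_rows,
            "non_empty": sum(counter.values()),
            "distinct": len(counter),
            "most_common": counter.most_common(top),
        }
        for name, counter in values.items()
    }


# Hypothetical stand-in for a downloaded file; the real schema may differ.
sample = (
    "text,label\n"
    "bohat khushi hui,joy\n"
    "mujhe gussa aa raha hai,anger\n"
    "bohat acha laga,joy\n"
)
summary = summarize_columns(sample)
```

A column with few distinct values relative to its row count is a likely label field, while one with nearly all-distinct values is more likely the text itself; both guesses should be confirmed against the data before training.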

Provenance

Source: Collected from public Pakistani social media and anonymized WhatsApp group conversations.
Collection Method: Naturally occurring communication was annotated by four native Urdu speakers following a structured protocol.
Time Range: Not specified.
Freshness: Last updated 2026-05-07 18:23:16; freshness should be verified.
Geography: Primarily Pakistan, based on the described data sources.

License information is unknown and should be verified before use.

Text, Time Series, Multilingual, Affective Computing, Benchmark, Emotion Classification, Roman Urdu, Large Scale, Natural Language Processing, Multilingual NLP, Low Resource Language

RUEmoCorp: A Large-Scale Roman Urdu Emotion Corpus