Global COVID-19 Twitter Discourse with Topic and Emotion Labels
by Gupta, Raj / ICPSR Harvested Dataverse · Updated 3mo ago
Description
This dataset contains over 252 million Twitter posts from more than 29 million unique users, collected from January 28, 2020, to June 1, 2022, using keywords related to COVID-19. Each tweet is labeled with seventeen attributes: ten binary topic-relevance flags, five quantitative scores (sentiment intensity plus the intensity of fear, anger, sadness, and happiness), and categorical labels for sentiment and dominant emotion.
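As a rough illustration of how the seventeen attributes might be held in memory, the sketch below groups them into one record per tweet. The class name and all field names are hypothetical, and the split into ten topic flags, five quantitative scores, and two categorical labels is inferred from this description rather than taken from the dataset's published schema.

```python
from dataclasses import dataclass, fields
from typing import Tuple


@dataclass
class TweetLabels:
    """Hypothetical per-tweet record; field names are illustrative only."""

    topic_flags: Tuple[int, ...]  # ten binary topic-relevance flags
    sentiment_intensity: float    # quantitative sentiment score
    fear: float                   # quantitative emotion scores
    anger: float
    sadness: float
    happiness: float
    sentiment_category: str       # categorical sentiment label
    dominant_emotion: str         # categorical dominant-emotion label

    def attribute_count(self) -> int:
        # Every field except topic_flags counts once; each topic flag
        # counts as its own attribute.
        return len(fields(self)) - 1 + len(self.topic_flags)


example = TweetLabels(
    topic_flags=(1, 0, 0, 0, 0, 0, 0, 0, 1, 0),
    sentiment_intensity=0.3,
    fear=0.7,
    anger=0.2,
    sadness=0.4,
    happiness=0.1,
    sentiment_category="negative",
    dominant_emotion="fear",
)
```

The values in `example` are made up for illustration and do not come from the dataset.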
Use Cases
Analyze the temporal distribution of sentiment intensity and dominant emotion attributes across the 2+ year pandemic period.
Model the co-occurrence of the ten binary topic relevance attributes to identify thematic clusters in public discourse.
Investigate geographic representation by correlating country-level data with quantitative emotion attributes like fear or anger intensity.
Train classifiers to predict the categorical sentiment attribute using the five quantitative sentiment and emotion scores as features.
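The first two use cases above can be sketched with the standard library alone. This is a minimal sketch on made-up rows, not the dataset's actual layout: the field names (`date`, `sentiment_intensity`, `topics`) and topic names (`t1`, `t2`, `t3`) are assumptions. It averages sentiment intensity per calendar month and counts how often pairs of topic flags are set on the same tweet.

```python
from collections import Counter, defaultdict
from itertools import combinations

# Made-up rows; field and topic names are illustrative assumptions,
# not the dataset's real column names.
SAMPLE = [
    {"date": "2020-03-01", "sentiment_intensity": 0.2,
     "topics": {"t1": 1, "t2": 1, "t3": 0}},
    {"date": "2020-03-15", "sentiment_intensity": 0.4,
     "topics": {"t1": 1, "t2": 0, "t3": 1}},
    {"date": "2020-04-02", "sentiment_intensity": 0.6,
     "topics": {"t1": 1, "t2": 1, "t3": 0}},
]


def monthly_mean_sentiment(rows):
    """Average the sentiment-intensity score per calendar month (YYYY-MM)."""
    totals = defaultdict(lambda: [0.0, 0])  # month -> [running sum, count]
    for row in rows:
        month = row["date"][:7]
        totals[month][0] += row["sentiment_intensity"]
        totals[month][1] += 1
    return {month: total / n for month, (total, n) in totals.items()}


def topic_cooccurrence(rows):
    """Count how often each pair of topics is flagged on the same tweet."""
    counts = Counter()
    for row in rows:
        active = sorted(name for name, flag in row["topics"].items() if flag)
        counts.update(combinations(active, 2))
    return counts


monthly = monthly_mean_sentiment(SAMPLE)
pairs = topic_cooccurrence(SAMPLE)
```

At the full scale of 252 million tweets, the same aggregation would more likely run in chunks or in a database. The in-memory version only illustrates the logic.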
Strengths
Large scale with over 252 million Twitter posts from more than 29 million unique users.
Each tweet is enriched with seventeen machine-generated attributes, including ten topic labels and five quantitative sentiment and emotion scores.
Covers a significant time range of over two years during the COVID-19 pandemic.
Data is processed and labeled using probabilistic topic modeling and pre-trained emotion recognition algorithms.
Limitations
Data collection is limited to English-language tweets containing four specific keywords, which may not capture the full global discourse.
Labels for topics, sentiments, and emotions are generated by algorithms, which may introduce noise or bias compared to human annotation.
Geographic representation is at the country/region level, lacking finer granularity.
Provenance
Source
Twitter platform, collected via web scraping.
Collection Method
Tweets were collected using the keywords 'corona', 'wuhan', 'nCov', and 'covid' and processed with topic modeling and emotion recognition algorithms.
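A keyword filter of this kind could look like the sketch below. The four keywords come from the collection method above. The matching rule (case-insensitive substring match) and the function name `matches_keywords` are assumptions, since the source does not say how the keywords were applied.

```python
import re

# Keywords as listed in the collection method.
KEYWORDS = ("corona", "wuhan", "nCov", "covid")

# Assumption: a tweet qualifies if any keyword appears anywhere in its
# text, ignoring case. The collectors' actual rule is not documented here.
PATTERN = re.compile("|".join(re.escape(k) for k in KEYWORDS), re.IGNORECASE)


def matches_keywords(text: str) -> bool:
    """Return True if the text contains at least one collection keyword."""
    return PATTERN.search(text) is not None
```

Under this assumed rule, substring matching also accepts longer words such as "coronavirus", which widens coverage and lets in some unrelated uses of "corona".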
Time Range
January 28, 2020, to June 1, 2022.
Freshness
Data collection ended on June 1, 2022; the metadata was last updated in February 2026.
Geography
Global, labeled to the country/region level; some downloads cover 30 representative countries.
The dataset consists of public Twitter posts; users must comply with Twitter's terms of service and relevant data use agreements. No license information is provided.