MediaText Central-Eastern Europe (CEE): A Multilingual, Sentence-Level Database of over 1.
by Sebők, Miklós / Harvard Dataverse · Updated 1mo ago
Description
MediaText CEE is a multilingual corpus of over 1.4 million online news articles from Czechia, Hungary, Poland, and Slovakia. The database, introduced by Miklós Sebők, integrates full-text articles with sentence-level segmentation and spans at least three years (2021-2024), with some coverage extending nearly two decades. Each article includes metadata such as outlet name and political orientation, Comparative Agendas Project topic codes, sentence-level sentiment labels, and named entity recognition results.
Use Cases
Conduct cross-national comparisons of policy attention based on Comparative Agendas Project topic codes.
Analyze media polarization and political orientation based on outlet metadata and article content.
Perform fine-grained discourse-level exploration based on sentence-level segmentation.
Train or validate multilingual NLP models for sentiment analysis based on expert-validated sentence-level labels.
Study the representation of specific entities like persons and organizations based on named entity recognition results.
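As an illustration of the first use case, the sketch below computes each topic's share of a country's articles, so that attention can be compared across countries with different corpus sizes. The field names `country` and `cap_topic` and the sample values are hypothetical, since the dataset's actual column names are not documented here; adjust them after inspecting the download.

```python
from collections import Counter, defaultdict


def topic_shares(rows, country_key="country", topic_key="cap_topic"):
    """Return {country: {topic: share of that country's articles}}.

    The default key names are assumptions, not the dataset's real columns.
    """
    counts = defaultdict(Counter)
    for row in rows:
        counts[row[country_key]][row[topic_key]] += 1

    shares = {}
    for country, topic_counts in counts.items():
        total = sum(topic_counts.values())
        # Normalize by the country's total so corpora of different sizes compare fairly.
        shares[country] = {topic: n / total for topic, n in topic_counts.items()}
    return shares


# Toy records standing in for article-level rows; all values are invented.
sample = [
    {"country": "HU", "cap_topic": "Health"},
    {"country": "HU", "cap_topic": "Health"},
    {"country": "HU", "cap_topic": "Macroeconomics"},
    {"country": "PL", "cap_topic": "Health"},
    {"country": "PL", "cap_topic": "Defense"},
]

shares = topic_shares(sample)
for country, by_topic in sorted(shares.items()):
    print(country, by_topic)
```

Shares rather than raw counts are used because the number of articles per country is not known in advance and is unlikely to be balanced.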
Strengths
Contains over 1.4 million articles from leading national outlets across four countries.
Provides sentence-level annotations including sentiment labels validated through expert coding.
Offers temporal coverage spanning at least 2021-2024, with some data covering almost two decades.
Integrates multiple analysis layers: full text, metadata, topic codes, sentiment, and named entities.
Limitations
Column-level documentation is absent; field semantics must be inferred after download.
Row counts for individual countries and other subsets are unknown, which makes it harder to assess suitability before download.
Data may reflect geographic and temporal bias inherent to the selected outlets and time periods.
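Because column-level documentation is absent, a quick profile of each field can help when inferring what it holds. The following is a minimal sketch that assumes a file (or a sample of one) is available as CSV, which is not confirmed by this listing; the column names in the toy input are invented for illustration.

```python
import csv
import io


def profile_columns(lines, max_examples=3):
    """Summarize each CSV column: non-empty count, distinct count, sample values.

    Intended for a sample of rows, since all distinct values are held in memory.
    """
    reader = csv.DictReader(lines)
    profile = {
        name: {"non_empty": 0, "distinct": set()}
        for name in (reader.fieldnames or [])
    }
    for row in reader:
        for name, value in row.items():
            # Skip overflow fields (key None) and missing or empty cells.
            if name in profile and value not in (None, ""):
                profile[name]["non_empty"] += 1
                profile[name]["distinct"].add(value)
    return {
        name: {
            "non_empty": info["non_empty"],
            "n_distinct": len(info["distinct"]),
            "examples": sorted(info["distinct"])[:max_examples],
        }
        for name, info in profile.items()
    }


# Toy CSV with hypothetical column names; the real schema must be checked after download.
toy_csv = io.StringIO(
    "outlet,orientation,sentiment\n"
    "Outlet A,left,negative\n"
    "Outlet B,right,\n"
    "Outlet A,left,positive\n"
)

report = profile_columns(toy_csv)
for name, info in report.items():
    print(name, info)
```

A column with few distinct values is probably a categorical label such as orientation or sentiment, while one with nearly as many distinct values as rows is more likely an identifier or free text.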
Provenance
Source
Miklós Sebők via Harvard Dataverse.
Collection Method
Likely collected from leading online news outlets in Czechia, Hungary, Poland, and Slovakia.
Time Range
At least 2021-2024, with some coverage extending nearly two decades.
Freshness
Last updated 2026-05-28 09:29:18; freshness should be verified.
Geography
Central-Eastern Europe, specifically Czechia, Hungary, Poland, and Slovakia.
License
License restrictions are unknown and should be checked before use.