Infini News Corpus: Multilingual News Articles from 2021-2025

Name: Infini News Corpus: Multilingual News Articles from 2021-2025
Creator: ruggsea
Published: 2026-01-29T12:04:24
Keywords: Computational Journalism, Media Studies, News Corpus, Text, Multilingual, Large Scale, Natural Language Processing, Multilingual News

Available on 1 platform

Description

The INFINI-NEWS Corpus is a large-scale multilingual collection of news articles extracted from Common Crawl News (CC-NEWS) archives. Created by ruggsea, it covers articles from 2021 to 2025. Only partial size statistics are reported: 242 GB for 2021 and 356 GB for 2022. The dataset was last updated on the platform in February 2026.

Use Cases

Train multilingual language models on the news article text.
Analyze media trends and narratives across languages and years.
Study computational journalism techniques using a large-scale news corpus.
Conduct cross-lingual information retrieval experiments on news content.

Strengths

Large-scale corpus with data volumes of 242 GB for 2021 and 356 GB for 2022.
Multilingual content supporting research across languages.
Multi-year temporal coverage from 2021 to 2025.
Sourced from Common Crawl News (CC-NEWS), a public archive of crawled news sites.

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
Row count is not reported, which makes it harder to assess suitability before download.
Description metadata is limited; actual data quality requires manual inspection after download.
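
Because column-level documentation is absent, one way to infer field semantics after download is to profile a sample of records. The following is a minimal sketch, assuming the records have already been loaded as Python dicts (for example, parsed from JSON Lines). The function name `profile_fields` and the field names in the sample (`text`, `language`, `date`) are hypothetical and are not confirmed columns of this dataset.

```python
from collections import Counter, defaultdict


def profile_fields(records, max_examples=3):
    """Summarize each field seen in a sample of records.

    For every key, report how many records contain it, how many values are
    missing (None or empty string), which Python types appear, and a few
    example values to help guess the field's meaning.
    """
    stats = defaultdict(lambda: {"present": 0, "missing": 0,
                                 "types": Counter(), "examples": []})
    for record in records:
        for key, value in record.items():
            field = stats[key]
            field["present"] += 1
            if value is None or value == "":
                field["missing"] += 1
                continue
            field["types"][type(value).__name__] += 1
            if len(field["examples"]) < max_examples and value not in field["examples"]:
                field["examples"].append(value)
    return dict(stats)


# Hypothetical sample; the real column names are not documented.
sample = [
    {"text": "Example article body.", "language": "en", "date": "2021-03-04"},
    {"text": "Corps de l'article.", "language": "fr", "date": None},
]

for name, info in profile_fields(sample).items():
    print(name, dict(info["types"]), "missing:", info["missing"], info["examples"])
```

Running this on a few thousand sampled rows per year would also give a rough sense of missing-value rates, which partly compensates for the unreported row count and limited description metadata.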

Provenance

Source: Common Crawl News (CC-NEWS) archives.
Collection Method: Extracted from web archives.
Time Range: 2021-2025
Freshness: Last updated 2026-02-10 10:18:03; verify against the source before relying on it.
Geography: Not specified

Quality Score

Overall: C (44)
Description: 51
Source: 41
Reputation: 49
Access: 26

Community

Downloads: 36.8K
Likes: 1
Views: 0

Dataset Info

Author: ruggsea
Created: Jan 29, 2026
Updated: Feb 10, 2026
Last synced: Jul 3, 2026
