INFINI-NEWS Corpus is a large-scale multilingual collection of news articles extracted from Common Crawl News archives. The dataset, created by 'ruggsea', contains articles from 2021 to 2025. Size statistics are only partially available: 242 GB of data for 2021 and 356 GB for 2022. It was last updated on the platform in February 2026.
Use Cases
- Train multilingual language models based on the news article text.
- Analyze media trends and narratives across languages and years.
- Study computational journalism techniques using a large-scale news corpus.
- Conduct cross-lingual information retrieval experiments on news content.
Strengths
- Large-scale corpus with data volumes of 242 GB for 2021 and 356 GB for 2022.
- Multilingual content supporting research across languages.
- Multi-year temporal coverage from 2021 to 2025.
- Sourced from Common Crawl News (CC-NEWS), a publicly available web archive of news sites.
Limitations
- Column-level documentation is absent; field semantics must be inferred after download.
- Row count is unknown, which makes it harder to judge whether the corpus suits a given task.
- Description metadata is limited; data quality can only be assessed by inspecting the data after download.
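Because column-level documentation is absent, a first pass over a small sample can show which fields exist, what types they hold, and how often they are empty. The sketch below is a minimal illustration, not the dataset's documented schema: it assumes the downloaded data can be read as an iterable of dict-like records, and the helper name `summarize_fields` and the sample field names (`text`, `language`, `date`) are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import islice


def summarize_fields(records, sample_size=1000):
    """Summarize field names, value types, and empty values in a sample.

    `records` is assumed to be an iterable of dicts (one per article).
    Returns (summary, rows_sampled).
    """
    type_counts = defaultdict(Counter)  # field name -> Counter of type names
    empty_counts = Counter()            # field name -> count of None or "" values
    rows_sampled = 0
    for record in islice(records, sample_size):
        rows_sampled += 1
        for field, value in record.items():
            type_counts[field][type(value).__name__] += 1
            if value is None or value == "":
                empty_counts[field] += 1
    summary = {
        field: {
            "types": dict(counts),
            "present_in": sum(counts.values()),
            "empty": empty_counts[field],
        }
        for field, counts in type_counts.items()
    }
    return summary, rows_sampled


# Hypothetical records; the real field names are not documented.
sample = [
    {"text": "Example article body", "language": "en", "date": "2021-03-01"},
    {"text": "", "language": "it", "date": None},
]
summary, rows_sampled = summarize_fields(sample)
for field, stats in summary.items():
    print(field, stats)
```

Sampling only the first records keeps the check cheap on a corpus of this size, but the early rows may not be representative of every year or language, so the summary is a starting point rather than a schema.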
Provenance
- Source: Common Crawl News (CC-NEWS) archives.
- Collection Method: Extracted from web archives.
- Time Range: 2021-2025
- Freshness: Last updated 2026-02-10 10:18:03; freshness should be verified.
- Geography: Not specified.