Name: OmniCorpus-CC: 988 Million Image-Text Interleaved Documents
Creator: OpenGVLab
Published: 2024-08-30T06:16:02
Keywords: Library: polars, Task Category: image-to-text, Library: dask, Language: en, Task Category: visual-question-answering, Size Category: 100M<n<1B, Web Crawl, Modality: text, arXiv: 2406.08418, Library: mlcroissant, Vision Language, Image Text, Library: datasets, License: CC BY 4.0, Computer Vision, Parquet, Region: US, Large Scale, Natural Language Processing, Multimodal

Description

OmniCorpus-CC belongs to a unified multimodal corpus on the scale of 10 billion images interleaved with text. It contains 988 million image-text interleaved documents collected from Common Crawl. The dataset was created by OpenGVLab and was last updated on the platform in March 2025.

Use Cases

Train multimodal large language models on documents that preserve the interleaved image-text structure.
Develop visual question answering systems using the images and their accompanying text.
Conduct research on cross-modal representation learning with web-crawled image-text documents.
Fine-tune image captioning models on the large-scale collection of paired image-text data.

Strengths

Contains 988 million image-text interleaved documents.
Part of a larger corpus targeting 10 billion-level images.

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
Row count for this specific subset is not reported in the metadata beyond the 988 million documents named in the title, which may limit suitability assessment.
Description metadata is limited; actual data quality requires manual inspection after download.
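
Because field semantics have to be inferred after download, a first pass usually means tallying which fields appear in a sample of rows and which value types they hold. The sketch below is one minimal way to do that in plain Python. It assumes the rows have already been read into dictionaries by a Parquet reader of your choice (the keywords list polars, dask, and datasets as compatible libraries). The helper name summarize_fields and the sample field names are hypothetical and are not taken from the dataset.

```python
from collections import Counter, defaultdict


def summarize_fields(records):
    """Count, per field name, which Python types appear in a sample of records."""
    summary = defaultdict(Counter)
    for record in records:
        for name, value in record.items():
            summary[name][type(value).__name__] += 1
    # Convert to plain dicts so the result prints and compares cleanly.
    return {name: dict(counts) for name, counts in summary.items()}


# Hypothetical sample rows: these field names are invented for illustration
# and do not describe the real OmniCorpus-CC schema.
sample = [
    {"text": "a caption", "image_url": "https://example.com/a.jpg"},
    {"text": None, "image_url": "https://example.com/b.jpg"},
]

for field, type_counts in summarize_fields(sample).items():
    print(field, type_counts)
```

Fields that mix a value type with NoneType in such a summary are likely optional, which is a useful early signal when no column documentation exists.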

Provenance

Source: OpenGVLab
Collection Method: Collected from Common Crawl.
Freshness: Last updated 2025-03-20 12:32:06; confirm against the dataset page before use.

The dataset page notes that several parquet files were flagged as unsafe by Hugging Face's official scanner, though they were reported safe by ClamAV and VirusTotal. Users should verify file safety.
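
One simple integrity check is to hash each downloaded file and compare the result with a reference digest. The sketch below assumes you have a trusted SHA-256 value for the file, for example one shown on the hosting platform's file listing; the function names are hypothetical. A matching digest only confirms that the file is the one that was published. It does not replace a malware scan.

```python
import hashlib
import hmac
from pathlib import Path


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        # iter() with a sentinel stops when read() returns empty bytes (EOF).
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def matches_expected(path, expected_hex):
    """Compare a file's digest with a reference digest from a trusted source."""
    actual = sha256_of_file(path)
    return hmac.compare_digest(actual, expected_hex.strip().lower())
```

Reading in chunks keeps memory use flat, which matters for large Parquet shards.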
