Name: OmniCorpus CC 210M: 210 Million Image-Text Interleaved Documents
Creator: OpenGVLab
Published: 2024-08-29T01:37:15
Keywords: Library: polars, Task Category: image-to-text, Library: dask, Language: en, Task Category: visual-question-answering, Size Category: 100M-1B, Modality: text, arXiv: 2406.08418, Library: mlcroissant, Vision Language, Image Text, Common Crawl, Library: datasets, License: CC BY 4.0, Computer Vision, Parquet, Region: US, Large Scale, Natural Language Processing, Multimodal

Description

OpenGVLab's OmniCorpus CC 210M dataset contains 210 million image-text interleaved documents filtered from the Common Crawl web corpus. The dataset is designed for large-scale vision-language model training, as described in an ICLR 2025 spotlight paper. It was last updated on the Hugging Face platform in March 2025.

Use Cases

Train multimodal large language models based on interleaved image-text sequences.
Benchmark visual question answering models on image-text pairs drawn from the documents.
Conduct research on web-scale data curation and filtering for multimodal datasets.
Fine-tune image-to-text generation models on a large, diverse corpus.
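
To make "interleaved image-text sequences" concrete, here is a minimal Python sketch of flattening one document into a single training string. The layout is an assumption for illustration only: the field names `texts` and `images`, the parallel-list structure where each position holds either a text chunk or an image reference, and the `<image>` placeholder token are all hypothetical and are not taken from the dataset's documentation.

```python
from typing import Optional, Sequence

# Hypothetical placeholder token; real models define their own image tokens.
IMAGE_TOKEN = "<image>"


def interleave(texts: Sequence[Optional[str]],
               images: Sequence[Optional[str]]) -> str:
    """Flatten parallel text/image slots into one string.

    Assumes (hypothetically) that slot i holds either a text chunk or an
    image reference, never both, and that None marks the empty side.
    """
    if len(texts) != len(images):
        raise ValueError("texts and images must have the same length")
    parts = []
    for text, image in zip(texts, images):
        if text is not None and image is not None:
            raise ValueError("a slot should hold text or an image, not both")
        if image is not None:
            # Keep the image's position in the reading order.
            parts.append(IMAGE_TOKEN)
        elif text is not None:
            parts.append(text.strip())
    return "\n".join(parts)


# Made-up example document, not a real record from the dataset.
doc = {
    "texts": ["A red kite.", None, "It flies high."],
    "images": [None, "https://example.com/kite.jpg", None],
}
sequence = interleave(doc["texts"], doc["images"])
print(sequence)
```

The point of the sketch is that image positions are preserved relative to the surrounding text, which is what distinguishes interleaved documents from isolated image-caption pairs.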

Strengths

210 million documents provide substantial scale for model training.
Sourced from the diverse Common Crawl web corpus, suggesting broad content variety.
Dataset is associated with peer-reviewed research (ICLR 2025 spotlight).

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
The row count is known (210 million documents), but data quality and the exact filtering criteria are not documented here and require inspection.
Data may reflect geographic, cultural, and temporal biases inherent to its Common Crawl source.
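
Because column-level documentation is absent, a first step after download is usually to summarize what each field holds across a small sample. The sketch below is one way to do that in plain Python. The helper `summarize_fields` and the sample rows are hypothetical and exist only for illustration; the real field names must be read from the data itself.

```python
from collections import defaultdict
from itertools import islice
from typing import Any, Dict, Iterable, Mapping


def summarize_fields(rows: Iterable[Mapping[str, Any]],
                     limit: int = 100) -> Dict[str, Dict[str, Any]]:
    """Report observed value types, non-null counts, and one example per field."""
    summary: Dict[str, Dict[str, Any]] = defaultdict(
        lambda: {"types": set(), "non_null": 0, "example": None}
    )
    # Only look at the first `limit` rows so this stays cheap on huge inputs.
    for row in islice(rows, limit):
        for name, value in row.items():
            info = summary[name]
            info["types"].add(type(value).__name__)
            if value is not None:
                info["non_null"] += 1
                if info["example"] is None:
                    info["example"] = value
    return dict(summary)


# Made-up rows standing in for real records; field names are assumptions.
sample_rows = [
    {"texts": ["hello", None], "images": [None, "https://example.com/a.jpg"], "meta": None},
    {"texts": ["world"], "images": [None], "meta": {"lang": "en"}},
]
report = summarize_fields(sample_rows)
for field, info in report.items():
    print(field, sorted(info["types"]), info["non_null"])
```

Any iterable of dict-like rows works as input. With the Hugging Face `datasets` library, for example, `load_dataset(..., streaming=True)` yields such rows without downloading the full corpus. The repository identifier is not given here, so it is left out rather than guessed.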

Provenance

Source: Common Crawl, processed by OpenGVLab.
Collection Method: Filtered from the larger OmniCorpus-CC dataset.
Time Range: Not specified.
Freshness: Last updated 2025-03-20 12:47:06.
Geography: Not specified.

License: Listed as CC BY 4.0 in the platform tags; verify against the official repository before use.
