OpenGVLab's OmniCorpus CC 210M dataset contains 210 million image-text interleaved documents filtered from the Common Crawl web corpus. The dataset is designed for large-scale vision-language model training, as described in an ICLR 2025 spotlight paper. It was last updated on the Hugging Face platform in March 2025.
The license is listed as CC BY 4.0 in the platform tags; confirm it against the official repository before use.