OmniCorpus-CC is a unified multimodal corpus, at the 10-billion-image scale, of images interleaved with text. It contains 988 million image-text interleaved documents collected from Common Crawl. The dataset was created by OpenGVLab and was last updated on the platform in March 2025.
The dataset page notes that several parquet files were flagged as unsafe by Hugging Face's official scanner, though ClamAV and VirusTotal reported them as safe. Users should verify file safety themselves before loading these files.
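As one possible starting point for that verification, the sketch below computes a SHA-256 digest of a downloaded file and checks that it begins and ends with the `PAR1` magic bytes defined by the Parquet format. The function names and the file path are hypothetical, and any reference checksum would have to come from the dataset host. Passing both checks only indicates that the file is intact and looks like Parquet; it does not replace a malware scan.

```python
import hashlib
from pathlib import Path

# Parquet files start and end with these four bytes.
PARQUET_MAGIC = b"PAR1"


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA-256 digest of a file, read in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def has_parquet_magic(path):
    """Check that the file begins and ends with the Parquet magic bytes."""
    size = Path(path).stat().st_size
    # Smallest plausible layout: magic + 4-byte footer length + magic.
    if size < 12:
        return False
    with open(path, "rb") as f:
        head = f.read(4)
        f.seek(-4, 2)  # 2 = seek relative to end of file
        tail = f.read(4)
    return head == PARQUET_MAGIC and tail == PARQUET_MAGIC
```

A typical use would be to call `sha256_of_file("shard.parquet")` (a placeholder path) and compare the result against a checksum published by the host, if one is available, before opening the file with a Parquet reader.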