OpenGVLab's OmniCorpus-YT is a large-scale multimodal dataset containing 10 million image-text interleaved documents collected from YouTube videos. The dataset is part of the broader OmniCorpus project, which encompasses billions of images, and was presented in an ICLR 2025 Spotlight paper. The repository was last updated on March 20, 2025.
Use Cases
- Training large multimodal language models on interleaved image-text documents.
- Researching cross-modal alignment and retrieval using paired visual and textual content from videos.
- Benchmarking video understanding systems on data derived from YouTube.
- Developing image captioning or visual question answering models from the image-text pairs in each document.
Strengths
- Contains 10 million image-text interleaved documents, indicating a substantial scale.
- Part of a broader corpus encompassing billions of images, suggesting extensive source diversity.
- Associated with a peer-reviewed ICLR 2025 Spotlight paper, indicating academic rigor.
Limitations
- Column-level documentation is absent; field semantics must be inferred after download.
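- The row count of the hosted files is not reported, and it is unclear whether it matches the 10 million documents cited above; this may limit suitability assessment for specific tasks.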
- Data may reflect geographic, cultural, or platform-specific bias inherent to its YouTube source.
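Because the column semantics are undocumented, a first step after download is usually to tally which fields appear and what types they hold. The following is a minimal sketch of that inspection, not part of the dataset's tooling. The function name `summarize_fields` and the sample records (`video_id`, `texts`, `images`, `duration`) are hypothetical placeholders; the real field names are not documented and must be read from the data itself.

```python
from collections import Counter, defaultdict
from itertools import islice
from typing import Any, Dict, Iterable, Mapping


def summarize_fields(
    records: Iterable[Mapping[str, Any]], limit: int = 100
) -> Dict[str, Dict[str, int]]:
    """Count the Python types observed per field in the first `limit` records."""
    type_counts: Dict[str, Counter] = defaultdict(Counter)
    for record in islice(records, limit):
        for field, value in record.items():
            type_counts[field][type(value).__name__] += 1
    return {field: dict(counts) for field, counts in type_counts.items()}


# Hypothetical records for illustration only; the actual schema is an
# assumption and should be replaced by rows read from the downloaded files.
sample_records = [
    {"video_id": "abc", "texts": ["hello", None], "images": [None, "frame_0.jpg"]},
    {"video_id": "def", "texts": ["world"], "images": [None], "duration": 12.5},
]

summary = summarize_fields(sample_records)
for field, counts in summary.items():
    print(field, counts)
```

Any iterable of dict-like rows can be passed in place of `sample_records`, for example rows yielded by a streaming loader, so the schema can be checked without reading the full corpus. Fields that appear in only some rows (like `duration` here) show a count lower than the number of rows sampled, which is a quick signal of optional columns.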
Provenance
- Source: OpenGVLab
- Collection Method: Collected from YouTube videos.
- Time Range: Not specified
- Freshness: Last updated 2025-03-20 12:44:21; freshness should be verified.
- Geography: Not specified