OmniCorpus-YT: 10 Million Image-Text Documents from YouTube Videos

Name: OmniCorpus-YT: 10 Million Image-Text Documents from YouTube Videos
Creator: OpenGVLab
Published: 2024-08-30T06:16:15
Keywords: Multimodal Corpus, Computer Vision, Large Scale, Natural Language Processing, Image Text Interleaved, Multimodal

Description

OpenGVLab's OmniCorpus-YT is a large-scale multimodal dataset containing 10 million image-text interleaved documents collected from YouTube videos. The dataset is part of the broader OmniCorpus project, which encompasses billions of images, and was presented in an ICLR 2025 Spotlight paper. The repository was last updated on March 20, 2025.

Use Cases

Training large multimodal language models on interleaved image-text documents.
Researching cross-modal alignment and retrieval using visual and textual content that co-occurs in the same video-derived document.
Benchmarking video understanding systems on data derived from YouTube.
Developing image captioning or visual question answering models from image-text pairs extracted from the interleaved documents.
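As an illustration of the captioning use case, the sketch below turns one interleaved document into (image, text) pairs. The record layout is an assumption, not the documented OmniCorpus-YT schema: it supposes parallel `images` and `texts` lists in which each position holds either an image reference or a text segment, with `None` in the other list. The function name `adjacent_pairs` and the example file names are hypothetical.

```python
from typing import List, Optional, Tuple, TypedDict


class InterleavedDoc(TypedDict):
    # Assumed layout: both lists have the same length, and at each
    # position one entry is set while the other is None.
    images: List[Optional[str]]
    texts: List[Optional[str]]


def adjacent_pairs(doc: InterleavedDoc) -> List[Tuple[str, str]]:
    """Pair every image with the first text segment that follows it."""
    pairs: List[Tuple[str, str]] = []
    pending: List[str] = []  # images seen since the last text segment
    for image, text in zip(doc["images"], doc["texts"]):
        if image is not None:
            pending.append(image)
        elif text is not None:
            pairs.extend((img, text) for img in pending)
            pending = []
    # Images with no following text are dropped rather than guessed at.
    return pairs


example: InterleavedDoc = {
    "images": ["frame_0001.jpg", None, "frame_0002.jpg", "frame_0003.jpg", None],
    "texts": [None, "The host introduces the topic.", None, None, "A chart is shown."],
}

for image, text in adjacent_pairs(example):
    print(image, "->", text)
```

Pairing an image with the text that follows it is only one heuristic; for video-derived documents, pairing with the preceding segment or with a window of nearby text may work better, so the choice should be checked against real records.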

Strengths

Contains 10 million image-text interleaved documents, indicating a substantial scale.
Part of a broader corpus encompassing billions of images, suggesting extensive source diversity.
Associated with a peer-reviewed ICLR 2025 Spotlight paper, indicating academic rigor.

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
The exact row count is not reported beyond the headline figure of 10 million documents, which may limit suitability assessment for specific tasks.
Data may reflect geographic, cultural, or platform-specific bias inherent to its YouTube source.
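Because column-level documentation is absent, a first step after download is usually to sample some records and summarize their fields. The following is a minimal sketch that assumes the records have already been loaded as Python dictionaries; the function name `infer_schema` and the sample field names (`id`, `texts`, `meta`) are hypothetical and do not describe the actual dataset.

```python
from collections import defaultdict
from typing import Any, Dict, List


def infer_schema(records: List[Dict[str, Any]]) -> Dict[str, Dict[str, Any]]:
    """Report, for each key, the value types seen and the share of records containing it."""
    types = defaultdict(set)
    counts = defaultdict(int)
    for record in records:
        for key, value in record.items():
            types[key].add(type(value).__name__)
            counts[key] += 1
    total = len(records)
    return {
        key: {"types": sorted(types[key]), "coverage": counts[key] / total}
        for key in types
    }


# Hypothetical sample; real field names must be read from the downloaded files.
sample = [
    {"id": "a", "texts": ["intro", None]},
    {"id": "b", "texts": [], "meta": None},
]

for field, summary in infer_schema(sample).items():
    print(field, summary)
```

Fields with coverage below 1.0 or with more than one observed type are the ones most worth inspecting by hand before building a training pipeline.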

Provenance

Source: OpenGVLab
Collection Method: Collected from YouTube videos.
Time Range: Not specified
Freshness: Last updated 2025-03-20 12:44:21; freshness should be verified.
Geography: Not specified
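One simple way to act on the freshness note is to compare the last-updated timestamp with a reference date. The sketch below uses the two dates stated on this page (last updated 2025-03-20 12:44:21, last synced May 14, 2026); the 365-day threshold and the names `staleness_days` and `MAX_AGE_DAYS` are illustrative assumptions, not a policy of the dataset or the catalog.

```python
from datetime import date, datetime


def staleness_days(updated: date, reference: date) -> int:
    """Number of whole days between the last update and the reference date."""
    return (reference - updated).days


# Dates taken from this page; the time of day is ignored.
last_updated = datetime.strptime("2025-03-20 12:44:21", "%Y-%m-%d %H:%M:%S").date()
last_synced = date(2026, 5, 14)

MAX_AGE_DAYS = 365  # hypothetical threshold, adjust to the task

age = staleness_days(last_updated, last_synced)
is_stale = age > MAX_AGE_DAYS
print(f"Last update was {age} days before the sync date; stale: {is_stale}")
```

A repository that has not changed is not necessarily out of date for a static corpus, so the flag is a prompt to check the source rather than a verdict.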

Related Topics

Multimodal, Multimodal Corpus, Computer Vision, Large Scale, Natural Language Processing, Image Text Interleaved

Quality Score

Overall: D39
Description: 42
Source: 41
Reputation: 37
Access: 26

Community

Downloads: 536
Likes: 13
Views: 0

Dataset Info

Author: OpenGVLab
Created: Aug 30, 2024
Updated: Mar 20, 2025
Last synced: May 14, 2026
