6.5 million keyframe images are interleaved with 0.8 billion words of ASR text from instructional videos, forming a corpus for vision-language pretraining. The dataset was created by DAMO-NLP-SG for the research project '2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining' and was last updated in March 2025.
The full dataset description and access details are hosted externally at https://huggingface.co/datasets/DAMO-NLP-SG/multimodal_textbook. No license information is stated here; check the dataset page before use.
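As a rough illustration of how the external page might be reached programmatically, the sketch below builds the dataset URL from its repository id and, optionally, fetches only the dataset card rather than the full corpus. The helper names `dataset_url` and `fetch_dataset_card` are hypothetical. The sketch also assumes that the `huggingface_hub` package is installed, that the repository contains a `README.md`, and that the repository is not gated behind authentication. None of this is confirmed by the description above.

```python
REPO_ID = "DAMO-NLP-SG/multimodal_textbook"  # taken from the URL in the description


def dataset_url(repo_id: str) -> str:
    """Build the Hugging Face dataset page URL for a repository id."""
    return f"https://huggingface.co/datasets/{repo_id}"


def fetch_dataset_card(repo_id: str = REPO_ID) -> str:
    """Download only the dataset card and return its local file path.

    Assumptions: network access, `pip install huggingface_hub`, and a
    README.md at the repository root. The corpus itself is not downloaded.
    """
    # Imported lazily so the URL helper works without the package installed.
    from huggingface_hub import hf_hub_download

    return hf_hub_download(
        repo_id=repo_id,
        filename="README.md",  # assumed file name for the dataset card
        repo_type="dataset",
    )


if __name__ == "__main__":
    print(dataset_url(REPO_ID))
```

Fetching the card first is a cautious choice for a corpus of this size, since the file layout and license terms can be read before committing to a full download.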