Name: Multimodal Textbook of 6.5 Million Instructional Video Keyframes
Creator: DAMO-NLP-SG
Published: 2025-01-01T09:18:58
Keywords: task_categories:text-generation, size_categories:1M<n<10M, language:en, task_categories:summarization, interleaved, pretraining, region:us, reasoning, arxiv:2501.00958, license:apache-2.0

Description

This corpus interleaves 6.5 million keyframe images with 0.8 billion words of automatic speech recognition (ASR) text from instructional videos, for use in vision-language pretraining. The dataset was created by DAMO-NLP-SG for the research project '2.5 Years in Class: A Multimodal Textbook for Vision-Language Pretraining' and was last updated in March 2025.

Use Cases

Train vision-language models on interleaved keyframe images and ASR text sequences for multimodal understanding.
Fine-tune models for instructional video captioning using the 0.8 billion words of ASR transcripts as supervision.
Pretrain contrastive learning models using the 6.5 million keyframes paired with their corresponding video transcript text.
Develop models for temporal reasoning in instructional content using sequences of keyframes and their interleaved text.
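
The interleaving idea behind these use cases can be illustrated with a minimal sketch. It assumes each keyframe and each ASR segment carries a timestamp, which this card does not confirm; the `Keyframe` and `AsrSegment` classes, their field names, and the `interleave` function are hypothetical and are not the dataset's actual schema.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Keyframe:
    time_s: float  # hypothetical field: position of the frame in the video
    path: str      # hypothetical field: path to the extracted image


@dataclass
class AsrSegment:
    time_s: float  # hypothetical field: start time of the spoken segment
    text: str      # hypothetical field: transcribed words


def interleave(keyframes: List[Keyframe],
               segments: List[AsrSegment]) -> List[Tuple[str, str]]:
    """Merge keyframes and ASR segments into one time-ordered sequence.

    Returns ("image", path) and ("text", text) pairs. sorted() is stable,
    so a keyframe precedes a text segment that shares its timestamp.
    """
    merged = sorted([*keyframes, *segments], key=lambda item: item.time_s)
    sequence = []
    for item in merged:
        if isinstance(item, Keyframe):
            sequence.append(("image", item.path))
        else:
            sequence.append(("text", item.text))
    return sequence
```

A pretraining pipeline would typically replace each ("image", path) entry with image tokens or an image placeholder before feeding the sequence to a model.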

Strengths

Large-scale corpus containing 6.5 million keyframe images.
Massive text component of 0.8 billion words from ASR transcripts.
Data is structured in an interleaved image-text format specifically for multimodal pretraining.
Focus on instructional videos provides a domain-specific and task-oriented data source.

Limitations

Specific column structure and data schema are not publicly documented.
The source and quality of the original instructional videos are not detailed, so any bias in the underlying content is difficult to assess.
No information on potential class imbalance or distribution across different instructional topics.

Provenance

Source: DAMO-NLP-SG.
Collection Method: Keyframes and ASR text extracted from online instructional videos.
Freshness: Last updated on 2025-03-17.

The full dataset description and access details are hosted externally at https://huggingface.co/datasets/DAMO-NLP-SG/multimodal_textbook. The dataset's tags list the license as Apache 2.0.
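
Because the corpus is large, it may be worth reading the hosted dataset card before downloading any data. The sketch below uses `snapshot_download` from the `huggingface_hub` library to fetch only Markdown files from the repository. The `fetch_dataset_card` helper and the `*.md` filter are illustrative assumptions; the repository's actual file layout is not described here.

```python
REPO_ID = "DAMO-NLP-SG/multimodal_textbook"  # taken from the URL above


def fetch_dataset_card(local_dir: str = "multimodal_textbook_card") -> str:
    """Download only the Markdown files (e.g. the README) of the dataset repo.

    Hypothetical helper. Requires network access and the huggingface_hub
    package, which is imported lazily so this module loads without it.
    """
    from huggingface_hub import snapshot_download

    return snapshot_download(
        repo_id=REPO_ID,
        repo_type="dataset",       # the repository is a dataset, not a model
        allow_patterns=["*.md"],   # assumption: the card is a Markdown file
        local_dir=local_dir,
    )
```

Calling `fetch_dataset_card()` returns the local folder path; the files there can then be inspected for schema and licensing details before deciding which data files to download.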
