Sign in to view source links and access this dataset
Description
CoIN is a multimodal instruction tuning dataset compiled from publicly available vision-language benchmarks. It aggregates data from sources including VQAv2, VizWiz, ScienceQA, TextVQA, GQA, OCR-VQA, ImageNet, RefCOCO, RefCOCO+, and RefCOCOg. The dataset was created by Zacks-Chen and was last updated in March 2026.
Use Cases
Train models for visual question answering using image-question-answer triplets from VQAv2, VizWiz, and GQA.
Fine-tune models for text-in-image question answering using data from TextVQA and OCR-VQA.
Perform instruction tuning for scientific question answering using the ScienceQA subset.
Train models for referring expression comprehension tasks using bounding box and phrase data from RefCOCO, RefCOCO+, and RefCOCOg.
Integrate image classification tasks into instruction tuning pipelines using labels from the ImageNet subset.
Strengths
Aggregates data from 10 established public datasets including VQAv2 and ScienceQA.
Designed for continual learning, supporting sequential task training.
Constructed from publicly available instruction tuning datasets.
Freshness
Last updated March 2026.
Users must visit the Hugging Face dataset page for the full description and access instructions before use. License details are not specified in the provided metadata.