Name: Continual Instruction Tuning Dataset for Vision-Language Models
Creator: Zacks-Chen
Published: 2024-03-08T02:56:59
Keywords: Task Category: Question Answering, Language: English, Vision-Language, Question Answering, License: CC BY 4.0, Continual Learning, Region: US, Instruction Tuning, Multimodal

Description

CoIN is a multimodal instruction tuning dataset compiled from publicly available vision-language benchmarks. It aggregates data from sources including VQAv2, VizWiz, ScienceQA, TextVQA, GQA, OCR-VQA, ImageNet, RefCOCO, RefCOCO+, and RefCOCOg. The dataset was created by Zacks-Chen and was last updated in March 2026.

Use Cases

Train models for visual question answering using image-question-answer triplets from VQAv2, VizWiz, and GQA.
Fine-tune models for text-in-image question answering using data from TextVQA and OCR-VQA.
Perform instruction tuning for scientific question answering using the ScienceQA subset.
Train models for referring expression comprehension tasks using bounding box and phrase data from RefCOCO, RefCOCO+, and RefCOCOg.
Integrate image classification tasks into instruction tuning pipelines using labels from the ImageNet subset.
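
As a rough illustration of how samples from these heterogeneous sources might be cast into one instruction-tuning format, here is a minimal sketch. The record schema, field names, prompt wording, and box format are all assumptions made for the example; this card does not document CoIN's actual columns or templates.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class InstructionSample:
    """One instruction-tuning record (hypothetical schema, not CoIN's documented columns)."""
    image_path: str
    instruction: str
    response: str
    task: str


def from_vqa_triplet(image_path: str, question: str, answer: str, task: str) -> InstructionSample:
    # Wrap an image-question-answer triplet in a short-answer prompt (assumed wording).
    instruction = f"{question.strip()}\nAnswer the question using a single word or phrase."
    return InstructionSample(image_path, instruction, answer.strip(), task)


def from_referring_expression(
    image_path: str, phrase: str, box: Tuple[float, float, float, float], task: str
) -> InstructionSample:
    # box is assumed to be (x_min, y_min, x_max, y_max), normalized to [0, 1].
    instruction = f"Provide the bounding box of the region this phrase describes: {phrase.strip()}"
    response = "[" + ", ".join(f"{v:.2f}" for v in box) + "]"
    return InstructionSample(image_path, instruction, response, task)
```

Keeping a `task` field on every record makes it straightforward to split the pooled data back into per-source subsets later.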

Strengths

Aggregates data from 10 established public datasets including VQAv2 and ScienceQA.
Designed for continual learning, supporting sequential task training.
Incorporates multiple task types: question answering, classification, and referring expression comprehension.
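
Sequential task training is commonly evaluated by re-scoring earlier tasks after each new stage. The sketch below shows that generic continual learning protocol, not a procedure documented on this card: `train_on` and `evaluate_on` are hypothetical callables supplied by the user, and any task order passed in is the caller's choice.

```python
from typing import Callable, Dict, List


def run_sequential_tuning(
    task_order: List[str],
    train_on: Callable[[str], None],
    evaluate_on: Callable[[str], float],
) -> Dict[str, List[float]]:
    """Train on each task in turn; after each stage, score every task seen so far."""
    history: Dict[str, List[float]] = {}
    for stage, task in enumerate(task_order):
        train_on(task)
        for seen in task_order[: stage + 1]:
            history.setdefault(seen, []).append(evaluate_on(seen))
    return history


def forgetting(history: Dict[str, List[float]]) -> Dict[str, float]:
    """Per-task drop from the best observed score to the final score."""
    return {task: max(scores) - scores[-1] for task, scores in history.items()}
```

A task whose forgetting value is well above zero lost accuracy as later tasks were learned, which is the effect continual learning methods aim to reduce.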

Limitations

Specific row counts, column names, and dataset size are not provided.
Potential for class imbalance or label noise inherited from the original source datasets.
Geographic and demographic biases may be present based on the source data collections.

Provenance

Source: Aggregated from VQAv2, VizWiz, ScienceQA, TextVQA, GQA, OCR-VQA, ImageNet, RefCOCO, RefCOCO+, RefCOCOg.
Collection Method: Compiled from publicly available vision-language benchmarks and cast as instruction tuning data.
Freshness: Last updated March 2026.

Consult the Hugging Face dataset page for the full description and access instructions before use. The keyword tags indicate a CC BY 4.0 license, but the provided metadata gives no further license details, so confirm the terms on the dataset page.
