VisualWebInstruct is a large-scale multimodal instruction dataset containing approximately 900,000 question-answer pairs. It consists of 40% visual QA pairs linked to 163,743 unique images and 60% text-only QA pairs, designed to enhance vision-language reasoning. The dataset was created by TIGER-Lab and was last updated on February 1, 2026.
Use Cases
- Fine-tuning multimodal instruction-following models on the dataset's QA pairs.
- Training visual question answering models on the visual QA pairs and their 163,743 unique images.
- Benchmarking model reasoning capabilities on mixed text and visual instruction data.
- Augmenting training data for large language models with multimodal context.
Strengths
- Approximately 900,000 instruction QA pairs provide a substantial volume of training data.
- Dataset includes 163,743 unique images associated with visual QA pairs.
- Mix of 40% visual and 60% text-only QA pairs offers diverse modality coverage.
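The percentages above can be turned into rough absolute counts. The following is a minimal sketch in Python that uses only the figures stated in this description. Because the total is given as "approximately 900,000", every derived number is an estimate, and the function name `estimate_split` is purely illustrative.

```python
# Figures taken from the dataset description. The total is approximate,
# so all derived values are estimates rather than exact counts.
TOTAL_QA_PAIRS = 900_000
VISUAL_FRACTION = 0.40
UNIQUE_IMAGES = 163_743


def estimate_split(total, visual_fraction, unique_images):
    """Estimate visual and text-only QA counts and the average QA pairs per image."""
    visual = round(total * visual_fraction)
    text_only = total - visual
    pairs_per_image = visual / unique_images
    return {
        "visual": visual,
        "text_only": text_only,
        "pairs_per_image": pairs_per_image,
    }


estimate = estimate_split(TOTAL_QA_PAIRS, VISUAL_FRACTION, UNIQUE_IMAGES)
print(f"visual QA pairs:    ~{estimate['visual']:,}")
print(f"text-only QA pairs: ~{estimate['text_only']:,}")
print(f"QA pairs per image: ~{estimate['pairs_per_image']:.1f}")
```

Under these figures, roughly 360,000 QA pairs are visual and 540,000 are text-only, which works out to a little over two QA pairs per unique image on average.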
Limitations
- Column-level documentation is absent; field semantics must be inferred after download.
- Exact row count is not reported; only the approximate figure of 900,000 QA pairs is available, which may limit suitability assessment.
- Last updated 2026-02-01 04:25:09; freshness should be verified.
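Since column-level documentation is absent, one way to infer field semantics after download is to scan a handful of records and summarize which value types and missing values each field contains. The sketch below is a hedged illustration in plain Python: the field names `question`, `answer`, and `image` and the sample records are hypothetical placeholders, not the dataset's actual schema.

```python
from collections import defaultdict


def summarize_fields(records):
    """For each field, collect the value types seen and count missing (None or absent) values."""
    all_fields = set()
    for record in records:
        all_fields.update(record)

    summary = defaultdict(lambda: {"types": set(), "missing": 0})
    for record in records:
        for field in all_fields:
            value = record.get(field)
            if value is None:
                summary[field]["missing"] += 1
            else:
                summary[field]["types"].add(type(value).__name__)
    return dict(summary)


# Hypothetical sample records; the real field names must be checked after download.
sample_records = [
    {"question": "What is 2 + 2?", "answer": "4", "image": None},
    {"question": "What shape is shown?", "answer": "A triangle", "image": "img_0001.png"},
]

summary = summarize_fields(sample_records)
for field in sorted(summary):
    info = summary[field]
    print(f"{field}: types={sorted(info['types'])}, missing={info['missing']}")
```

A field that is frequently missing, such as the hypothetical `image` field here, would be a plausible marker for distinguishing text-only from visual QA pairs, but that interpretation should be confirmed against the downloaded data.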
Provenance
- Source: TIGER-Lab
- Collection Method: Scaling up multimodal instruction data through web search.
- Freshness: 2026-02-01 04:25:09