Built from open-ended questions paired with images, categorized by their requirement for vision, language, and commonsense reasoning. It provides a framework for testing multimodal understanding through tasks that cannot be solved by a single modality alone.
Use Cases
- Train multimodal transformers to predict answers using the question and image features
- Evaluate the reasoning capabilities of vision-language models like LXMERT on open-ended tasks
- Benchmark the alignment of visual and linguistic features using the provided image-question pairs
Strengths
- Includes open-ended questions requiring natural language answers
- Pairs visual image data with corresponding textual queries
- Designed to evaluate commonsense knowledge integration in multimodal models