The dataset pairs 265,016 images from MS COCO with 1,105,904 questions and 11,059,040 ground-truth answers. To reduce language bias, it is organized into balanced pairs: each question is associated with two similar images that yield different answers.
Use Cases
- Train multimodal transformers to predict the multiple_choice_answer from the image (referenced by image_id) and the question text
- Benchmark model bias by evaluating performance on balanced pairs linked by the question_id
- Analyze reasoning capabilities across different linguistic categories using the question_type metadata
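As a rough illustration of the per-category analysis mentioned above, the sketch below groups exact-match accuracy by question_type. It assumes each record is a plain dict carrying the question_id, question_type, and multiple_choice_answer fields named in this card. The function name, the predictions mapping, and the lowercase/strip normalization are illustrative assumptions, not part of the dataset.

```python
from collections import defaultdict


def accuracy_by_question_type(records, predictions):
    """Return exact-match accuracy per question_type.

    records: iterable of dicts with 'question_id', 'question_type',
             and 'multiple_choice_answer' (assumed record layout).
    predictions: dict mapping question_id -> predicted answer string
                 (hypothetical model output format).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        qtype = rec["question_type"]
        total[qtype] += 1
        pred = predictions.get(rec["question_id"])
        target = rec["multiple_choice_answer"].strip().lower()
        # A missing prediction counts as wrong.
        if pred is not None and pred.strip().lower() == target:
            correct[qtype] += 1
    return {qtype: correct[qtype] / total[qtype] for qtype in total}
```

The same grouping works with answer_type as the key if a coarser breakdown ('yes/no', 'number', 'other') is preferred.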
Strengths
- 1,105,904 questions across 265,016 images sourced from MS COCO
- 10 ground-truth answers per question to capture human response variance
- Categorization of entries into 'yes/no', 'number', and 'other' via the answer_type field
- Balanced image-question pairs designed to counteract language-only model shortcuts
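Having 10 ground-truth answers per question makes it possible to score a prediction against human consensus rather than a single label. The sketch below uses the widely used simplified rule min(matches / 3, 1), where a prediction counts as fully correct once at least three annotators gave the same answer. This rule is not stated in this card, and the function name, the plain list-of-strings input, and the lowercase/strip normalization are assumptions made for illustration.

```python
def consensus_accuracy(prediction, human_answers):
    """Score one prediction against a list of human answers.

    Uses the simplified consensus rule min(matches / 3, 1).
    human_answers is assumed to be a list of answer strings
    (10 per question in this dataset).
    """
    norm = prediction.strip().lower()
    matches = sum(1 for ans in human_answers if ans.strip().lower() == norm)
    return min(matches / 3.0, 1.0)


# Example: 7 annotators said "yes", 3 said "no" (made-up answers).
answers = ["yes"] * 7 + ["no"] * 3
score_yes = consensus_accuracy("yes", answers)
score_maybe = consensus_accuracy("maybe", answers)
```

Under this rule a minority answer given by three annotators scores the same as the majority answer, which is one way the metric accommodates the human response variance noted above.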