VQA v2: Visual Question Answering Version 2

Available on 1 platform

Description

265,016 images from MS COCO are paired with 1,105,904 questions and 11,059,040 ground-truth answers (10 per question). Questions are organized into balanced pairs: each question is associated with two similar images that yield different answers, to minimize language bias.
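A minimal sketch of how the balanced-pair structure can be checked. It assumes annotation records shaped like the official VQA v2 annotation JSON (question_id, image_id, multiple_choice_answer) and complementary pairs supplied as two-element lists of question IDs. The function name check_pairs and the toy records are hypothetical, not part of any official toolkit.

```python
from typing import Dict, List, Tuple


def check_pairs(
    pairs: List[Tuple[int, int]], annotations: List[dict]
) -> Tuple[int, int]:
    """Return (balanced, total) for a list of complementary question pairs.

    A pair counts as balanced when its two questions point at different
    images and have different consensus answers (multiple_choice_answer).
    Field names are assumed to match the official annotation JSON.
    """
    # Index annotations by question_id for constant-time lookup.
    by_qid: Dict[int, dict] = {a["question_id"]: a for a in annotations}
    balanced = 0
    for qid_a, qid_b in pairs:
        a, b = by_qid[qid_a], by_qid[qid_b]
        different_image = a["image_id"] != b["image_id"]
        different_answer = a["multiple_choice_answer"] != b["multiple_choice_answer"]
        if different_image and different_answer:
            balanced += 1
    return balanced, len(pairs)


if __name__ == "__main__":
    # Toy records for illustration only, not real VQA v2 data.
    anns = [
        {"question_id": 1, "image_id": 10, "multiple_choice_answer": "yes"},
        {"question_id": 2, "image_id": 11, "multiple_choice_answer": "no"},
        {"question_id": 3, "image_id": 12, "multiple_choice_answer": "2"},
        {"question_id": 4, "image_id": 13, "multiple_choice_answer": "2"},
    ]
    print(check_pairs([(1, 2), (3, 4)], anns))  # (1, 2)
```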

Use Cases

Train multimodal transformers to predict the multiple_choice_answer using the image_id and question text
Benchmark model bias by evaluating performance on balanced pairs linked by the question_id
Analyze reasoning capabilities across different linguistic categories using the question_type metadata
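The evaluation-oriented use cases above can be sketched as two helpers: a per-question_type accuracy breakdown and a pair-level score that credits a complementary pair only when both of its questions are answered correctly. This sketch assumes predictions keyed by question_id and scores by exact match against multiple_choice_answer, which is a simplification of the consensus metric normally used for VQA; the helper names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def accuracy_by_question_type(
    annotations: List[dict], predictions: Dict[int, str]
) -> Dict[str, float]:
    """Exact-match accuracy against multiple_choice_answer, per question_type."""
    hits: Dict[str, int] = defaultdict(int)
    totals: Dict[str, int] = defaultdict(int)
    for ann in annotations:
        qtype = ann["question_type"]
        totals[qtype] += 1
        if predictions.get(ann["question_id"]) == ann["multiple_choice_answer"]:
            hits[qtype] += 1
    return {q: hits[q] / totals[q] for q in totals}


def pair_accuracy(
    pairs: List[Tuple[int, int]],
    annotations: List[dict],
    predictions: Dict[int, str],
) -> float:
    """Fraction of complementary pairs where BOTH questions are answered correctly.

    Both members of a pair share the question text but have different answers,
    so a model that ignores the image answers both the same way and can get
    at most one of them right.
    """
    if not pairs:
        return 0.0
    gold = {a["question_id"]: a["multiple_choice_answer"] for a in annotations}
    both = sum(
        1
        for qa, qb in pairs
        if predictions.get(qa) == gold[qa] and predictions.get(qb) == gold[qb]
    )
    return both / len(pairs)
```

Comparing pair_accuracy with ordinary per-question accuracy gives a rough signal of how much a model leans on the question text alone.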

Strengths

1,105,904 questions across 265,016 images sourced from MS COCO
10 ground-truth answers per question to capture human response variance
Categorization of entries into 'yes/no', 'number', and 'other' via the answer_type field
Balanced image-question pairs designed to counteract language-only model shortcuts
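Because each question carries 10 human answers, VQA predictions are usually scored by consensus rather than exact match. The sketch below follows the commonly used rule that an answer is fully correct if at least 3 annotators gave it, averaged over the 10 leave-one-out subsets of 9 annotators, and then groups scores by answer_type. It assumes each annotation has an answers list of dicts with an answer key; it only lowercases and strips strings and omits the fuller normalization (punctuation, articles, number words) performed by the official evaluation script. Function names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def vqa_accuracy(prediction: str, gt_answers: List[str]) -> float:
    """Consensus accuracy of one prediction against the human answers.

    For each annotator, the prediction is scored against the remaining
    annotators as min(1, matches / 3); the result is the mean of those scores.
    """
    if not gt_answers:
        return 0.0
    pred = prediction.strip().lower()
    gts = [g.strip().lower() for g in gt_answers]
    scores = []
    for i in range(len(gts)):
        others = gts[:i] + gts[i + 1 :]  # leave annotator i out
        matches = sum(1 for g in others if g == pred)
        scores.append(min(1.0, matches / 3))
    return sum(scores) / len(scores)


def accuracy_by_answer_type(
    annotations: List[dict], predictions: Dict[int, str]
) -> Dict[str, float]:
    """Mean consensus accuracy per answer_type ('yes/no', 'number', 'other')."""
    per_type: Dict[str, List[float]] = defaultdict(list)
    for ann in annotations:
        gts = [a["answer"] for a in ann["answers"]]
        pred = predictions.get(ann["question_id"], "")
        per_type[ann["answer_type"]].append(vqa_accuracy(pred, gts))
    return {t: sum(s) / len(s) for t, s in per_type.items()}
```

For example, if 3 of the 10 annotators answered "2" and the other 7 answered "3", predicting "2" scores 0.9 rather than 1.0, reflecting partial agreement.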

English, Computer Vision, Natural Language Processing

Quality Score

D14

Description: 5
Source: 17
Reputation: 18
Access: 22
Community: 0 views
