3,700 question-answer pairs with associated images and a retrieval corpus of 1.5 million Wikipedia passages. The dataset focuses on entity-centric visual question answering: models must identify the visual entities in an image and retrieve external knowledge to answer the question.
Use Cases
- Benchmark multimodal retrieval systems that take the question and image as the query and search the Wikipedia passage corpus
- Train entity-linking models to map visual regions to specific Wikidata identifiers
- Develop end-to-end knowledge-based VQA pipelines that integrate visual recognition with external text evidence
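
For the retrieval benchmarking use case, a minimal sketch of a recall@k evaluation is shown below. It assumes each question has a ranked list of retrieved passage IDs and a set of gold passage IDs. The function name `recall_at_k`, the dictionary layout, and the toy IDs (`q1`, `p3`, and so on) are hypothetical and are not taken from the dataset's own tooling. The dataset's official metric may also differ.

```python
from typing import Dict, List, Set


def recall_at_k(
    ranked: Dict[str, List[str]],
    relevant: Dict[str, Set[str]],
    k: int,
) -> float:
    """Fraction of questions with at least one gold passage in the top k.

    ranked:   question ID -> passage IDs ordered from best to worst (hypothetical layout)
    relevant: question ID -> set of gold passage IDs (hypothetical layout)
    """
    if k <= 0:
        raise ValueError("k must be a positive integer")

    # Only score questions that actually have gold passages.
    scored = [qid for qid, gold in relevant.items() if gold]
    if not scored:
        return 0.0

    hits = 0
    for qid in scored:
        top_k = ranked.get(qid, [])[:k]
        # A question counts as a hit if any top-k passage is a gold passage.
        if relevant[qid].intersection(top_k):
            hits += 1
    return hits / len(scored)


# Toy data for illustration only; these IDs are not real dataset identifiers.
toy_ranked = {"q1": ["p3", "p1", "p7"], "q2": ["p9", "p4", "p2"]}
toy_relevant = {"q1": {"p1"}, "q2": {"p2"}}

for k in (1, 2, 3):
    print(f"recall@{k} = {recall_at_k(toy_ranked, toy_relevant, k):.2f}")
```

This hit-rate form of recall@k is a common choice when each question has only a few gold passages. A per-passage recall or a rank-sensitive metric such as MRR could be swapped in if the benchmark calls for it.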
Strengths
- 3,700 human-annotated question-answer pairs with associated images
- Retrieval corpus containing 1.5 million Wikipedia passages
- Entity-level ground truth annotations linking images to Wikidata identifiers