3,700 question-answer pairs linked to images and a knowledge base of 1.5 million Wikipedia entities. The dataset supports visual entity retrieval, in which each answer is a specific entity rather than a generic object label.
Use Cases
- Train entity-linking models that map visual regions to specific Wikipedia entries
- Benchmark retrieval-augmented generation (RAG) systems using the provided knowledge base and image queries
- Develop multimodal reasoning models that combine visual features with structured external knowledge
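
For the retrieval benchmarking use case, one common way to score a system is recall@k over entity identifiers: a question counts as a hit if its gold entity appears among the top k retrieved entities. The following is a minimal sketch, not the dataset's official evaluation protocol. The function name `recall_at_k`, the dictionary layout, and the identifiers such as `entity_001` are hypothetical placeholders and do not reflect the dataset's actual file format or ID scheme.

```python
from typing import Dict, List


def recall_at_k(ranked: Dict[str, List[str]], gold: Dict[str, str], k: int) -> float:
    """Return the fraction of questions whose gold entity ID is in the top-k results.

    ranked: question ID -> retrieved entity IDs, best first (hypothetical layout).
    gold:   question ID -> single gold entity ID (hypothetical layout).
    """
    if not gold:
        return 0.0
    hits = sum(
        1
        for question_id, gold_id in gold.items()
        # Questions with no retrieval output count as misses.
        if gold_id in ranked.get(question_id, [])[:k]
    )
    return hits / len(gold)


# Toy data with made-up identifiers, for illustration only.
gold = {"q1": "entity_001", "q2": "entity_002", "q3": "entity_003"}
ranked = {
    "q1": ["entity_001", "entity_050"],  # gold at rank 1
    "q2": ["entity_099", "entity_002"],  # gold at rank 2
    "q3": ["entity_007", "entity_008"],  # gold not retrieved
}

r1 = recall_at_k(ranked, gold, k=1)
r2 = recall_at_k(ranked, gold, k=2)
print(f"recall@1 = {r1:.3f}, recall@2 = {r2:.3f}")
```

Because answers resolve to unique entity identifiers, an exact ID match like this avoids the string-normalization issues that free-text answer matching would introduce.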
Strengths
- 3,700 human-annotated questions requiring external knowledge
- Knowledge base containing 1.5 million entities with associated text and images
- Answers mapped to unique Wikipedia entity identifiers