A benchmark of 7 visual reasoning tasks built from geometric primitives and designed to test the basic visual perception of Vision-Language Models. The tasks include line intersections, circle overlaps, and nested shapes, cases where models frequently fail even though humans solve them with ease.
Use Cases
- Benchmark the spatial reasoning capabilities of multimodal models using the geometric task labels and ground truth coordinates.
- Identify systematic perception errors in vision encoders by breaking down model failures across the 'blindness' task categories.
- Develop improved vision-language alignment techniques by training models to recognize basic topological relationships like 'inside' or 'intersecting'.
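For the benchmarking and error-analysis use cases, a per-task accuracy breakdown is a natural first step. The following is a minimal sketch, not the dataset's official evaluation script: the record keys `task`, `label`, and `prediction` are hypothetical names chosen for illustration, and exact string matching after normalization is an assumed scoring rule.

```python
from collections import defaultdict


def per_task_accuracy(records):
    """Return {task_name: accuracy} for an iterable of result records.

    Each record is a dict with the hypothetical keys 'task', 'label',
    and 'prediction'. These names are assumptions, not the dataset schema.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for record in records:
        task = record["task"]
        total[task] += 1
        # Assumed scoring rule: case-insensitive exact match on trimmed strings.
        predicted = str(record["prediction"]).strip().lower()
        expected = str(record["label"]).strip().lower()
        if predicted == expected:
            correct[task] += 1
    return {task: correct[task] / total[task] for task in total}


# Toy records for illustration only; these are not real model outputs.
sample = [
    {"task": "line intersection", "label": "2", "prediction": "2"},
    {"task": "line intersection", "label": "1", "prediction": "0"},
    {"task": "nested squares", "label": "3", "prediction": " 3 "},
]
accuracy_by_task = per_task_accuracy(sample)
```

Reporting accuracy per task rather than as a single aggregate makes it easier to see whether failures cluster in particular categories.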
Strengths
- Includes 7 distinct geometric task categories: line intersection, circle overlap, nested squares, counting, touching circles, overlapping circles, and line length.
- Features 2D geometric renderings that are trivial for human vision but challenging for state-of-the-art multimodal models.
- Provides a benchmark for evaluating topological and spatial relationship recognition independent of linguistic priors.
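To illustrate why these relationships have unambiguous ground truth, the sketch below classifies the relation between two circles from their centers and radii. It is an assumption about how such labels could be derived from coordinates, not a description of how this dataset was generated; the function name `circle_relation`, the label strings, and the tolerance `tol` are all hypothetical.

```python
import math


def circle_relation(center1, radius1, center2, radius2, tol=1e-9):
    """Classify two circles as 'inside', 'separate', 'touching', or 'intersecting'.

    Hypothetical helper: label names and the tolerance are illustrative choices.
    """
    distance = math.dist(center1, center2)
    # One circle lies within the other (internal tangency is grouped here too).
    if distance <= abs(radius1 - radius2) + tol:
        return "inside"
    # Centers are exactly one radius sum apart: the circles touch at one point.
    if abs(distance - (radius1 + radius2)) <= tol:
        return "touching"
    # Centers are farther apart than the radius sum: no shared points.
    if distance > radius1 + radius2:
        return "separate"
    # Otherwise the boundaries cross at two points.
    return "intersecting"


relation_far = circle_relation((0, 0), 1, (3, 0), 1)
relation_tangent = circle_relation((0, 0), 1, (2, 0), 1)
relation_crossing = circle_relation((0, 0), 1, (1, 0), 1)
relation_nested = circle_relation((0, 0), 3, (0.5, 0), 1)
```

Because each label follows from a simple distance comparison, a wrong answer on such an image points to a perception failure rather than ambiguity in the question.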