Description

ZeroBench is a visual reasoning benchmark of fewer than 1,000 image-text pairs, designed to challenge contemporary Large Multimodal Models (LMMs). It was created by Jonathan Roberts and is described in arXiv paper 2502.09696. The dataset was updated in December 2025 to refine its hierarchical question structures. Its tasks were considered nearly unsolvable for multimodal models at the time of release.

Use Cases

Benchmarking LMM performance on high-difficulty visual reasoning, using the question_text field
Analyzing model failure modes through the provided hierarchical subquestions
Testing zero-shot instruction following with specific output constraints like 'kg' unit suffixes
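
The output-constraint use case can be illustrated with a small grading sketch. It assumes, purely for illustration, that a question requires a numeric answer followed by a 'kg' suffix; the function names `parse_with_unit` and `grade` and the three result labels are hypothetical and are not part of the dataset or its official evaluation code.

```python
import math
import re


def parse_with_unit(response: str, unit: str = "kg"):
    """Return the number that directly precedes the unit at the end of the response.

    Returns None when the response does not end with '<number><unit>'.
    The pattern is an illustrative assumption, not the benchmark's own parser.
    """
    pattern = rf"(-?\d+(?:\.\d+)?)\s*{re.escape(unit)}\s*$"
    match = re.search(pattern, response.strip())
    return float(match.group(1)) if match else None


def grade(response: str, expected_value: float, unit: str = "kg") -> str:
    """Classify a response as 'correct', 'wrong_value', or 'format_error'."""
    value = parse_with_unit(response, unit)
    if value is None:
        # The required unit suffix is missing or misplaced.
        return "format_error"
    if math.isclose(value, expected_value):
        return "correct"
    return "wrong_value"
```

Keeping 'format_error' separate from 'wrong_value' makes it possible to report how many failures come from instruction following rather than from the visual reasoning itself.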

Strengths

Expert-curated 'impossible' tasks designed to test the ceiling of LMM performance
Includes hierarchical subquestions for detailed error analysis
Linked to arXiv paper 2502.09696, which documents the methodology
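
As a rough illustration of how hierarchical subquestions can support error analysis, the sketch below summarizes where a model first goes wrong on each failed main question. The record layout (`question_id`, `main_correct`, `sub_correct`) is a hypothetical scoring output invented for this example, not the dataset's actual schema.

```python
def summarize_failures(results):
    """For each failed main question, report subquestion accuracy and the first failed step.

    `results` is a list of dicts with hypothetical keys:
      question_id  - identifier of the main question
      main_correct - whether the main answer was correct
      sub_correct  - list of booleans, one per subquestion, in order
    """
    summary = {}
    for record in results:
        if record["main_correct"]:
            continue  # only failed main questions are analyzed
        subs = record["sub_correct"]
        first_failed = next((i for i, ok in enumerate(subs) if not ok), None)
        summary[record["question_id"]] = {
            "sub_accuracy": sum(subs) / len(subs) if subs else None,
            "first_failed_sub": first_failed,
        }
    return summary


# Made-up scores, for illustration only.
example_results = [
    {"question_id": "q1", "main_correct": True, "sub_correct": [True, True]},
    {"question_id": "q2", "main_correct": False, "sub_correct": [True, False, True, True]},
    {"question_id": "q3", "main_correct": False, "sub_correct": [True, True]},
]
report = summarize_failures(example_results)
```

In this made-up data, q3 fails the main question even though every subquestion is answered correctly, which would point to an error in combining intermediate results rather than in any single step.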

Limitations

Small sample size of fewer than 1,000 records
Extreme difficulty level may result in floor effects for less capable models
Strict formatting requirements in question_text may lead to failures based on syntax rather than reasoning
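
The first two limitations interact: with few records and scores near zero, a point estimate of accuracy says little on its own. One common way to express the uncertainty is a Wilson score interval, sketched below. The helper name `wilson_interval` is illustrative, and the choice of this interval is an assumption of the example rather than something the benchmark prescribes.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    if not 0 <= correct <= total:
        raise ValueError("correct must be between 0 and total")
    p = correct / total
    denom = 1 + z**2 / total
    center = (p + z**2 / (2 * total)) / denom
    half_width = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the boundaries.
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Hypothetical run: 0 correct answers out of 100 attempted questions.
low, high = wilson_interval(0, 100)
```

Unlike the simple normal approximation, this interval does not collapse to zero width when a model scores 0, so two models that both hit the floor can still be reported with an honest upper bound.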

Provenance

Source: Jonathan Roberts (jonathan-roberts1), arXiv 2502.09696
Collection Method: Expert curation and annotation for high-difficulty benchmarking
Freshness: Last updated in December 2025; the v3 changelog (23 December 2025) records the question refinements.

Users should refer to the v3 changelog for the most recent question structures. Models must strictly follow the formatting instructions in question_text to be evaluated correctly.

Tags: Format: parquet; Task category: image-text-to-text; Size category: n<1K; Modalities: image, text; Libraries: datasets, pandas, polars, mlcroissant; Region: US; arXiv: 2502.09696

ZeroBench: High-Difficulty Visual Reasoning Tasks for LMM Evaluation
