Description
A filtered subset of 101,951 training samples from the ULVR_v2_clean dataset. The Qwen2.5-VL-7B-Instruct model answered each of these samples incorrectly when given only the input image, but correctly once the intermediate images were also provided. Correctness was judged by a more capable model (Qwen3-VL-32B-Instruct). The dataset was created by williamium and last updated on June 20, 2026.
Use Cases
Training vision-language models to make better use of intermediate visual reasoning steps, using the provided intermediate_image_* fields.
Benchmarking model performance on complex visual reasoning tasks that require multiple image inputs.
Analyzing failure modes of smaller VLMs where access to intermediate visual information changes outcomes.
Studying the structure of visual reasoning problems across subsets like scene_graph, depth, and segmentation.
Strengths
Contains 101,951 training samples, all selected to expose one specific failure mode: wrong with the input image alone, right with intermediate images.
Correctness was judged by a more capable model (Qwen3-VL-32B-Instruct) than the model under evaluation.
Maintains the same schema and train-split structure as its source dataset, ensuring compatibility.
Limitations
Column-level documentation is absent; field semantics must be inferred after download.
The row count of the full source dataset is not stated, so the proportion of samples retained by the filter cannot be assessed.
Description metadata is limited; actual data quality and the nature of intermediate images require manual inspection.
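Because column-level documentation is absent, a first inspection step is to find which columns hold the intermediate images. The sketch below assumes only the intermediate_image_* prefix mentioned under Use Cases. The other column names in the example are hypothetical placeholders, and the suffix format (numeric or otherwise) is unknown, so both cases are handled.

```python
PREFIX = "intermediate_image_"  # prefix taken from the card; the suffix format is an assumption


def intermediate_image_columns(column_names):
    """Return the column names that start with PREFIX.

    Numeric suffixes sort numerically (2 before 10); any non-numeric
    suffixes sort alphabetically after the numeric ones.
    """
    matches = [name for name in column_names if name.startswith(PREFIX)]

    def sort_key(name):
        suffix = name[len(PREFIX):]
        if suffix.isdigit():
            return (0, int(suffix), "")
        return (1, 0, suffix)

    return sorted(matches, key=sort_key)


# Hypothetical column list for illustration only; the real schema must be
# checked after download.
example_columns = [
    "question",
    "image",
    "intermediate_image_10",
    "intermediate_image_2",
    "answer",
]
found = intermediate_image_columns(example_columns)
```

If the data is loaded with the Hugging Face datasets library, the column_names attribute of the loaded split can be passed to this function directly.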
Provenance
Source
Filtered subset of RuoliuYang/ULVR_v2_clean dataset from Hugging Face.
Collection Method
Samples were retained only when Qwen2.5-VL-7B-Instruct answered incorrectly given the input image alone and correctly once the intermediate images were also provided.
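The filtering criterion can be sketched as a simple predicate over two per-sample judgments. This is a minimal illustration, not the author's actual pipeline: the JudgedSample record and its field names are hypothetical, and the two booleans stand in for the judge model's verdicts.

```python
from dataclasses import dataclass


@dataclass
class JudgedSample:
    """Hypothetical record holding the judge's verdict for both conditions."""
    sample_id: str
    correct_image_only: bool         # verdict with only the input image
    correct_with_intermediate: bool  # verdict with intermediate images added


def keep_sample(sample):
    """Keep a sample only if it was wrong without and right with intermediate images."""
    return (not sample.correct_image_only) and sample.correct_with_intermediate


def filter_samples(samples):
    """Return the samples that satisfy the criterion, preserving input order."""
    return [s for s in samples if keep_sample(s)]


# Made-up verdicts covering all four possible outcomes.
judged = [
    JudgedSample("a", correct_image_only=False, correct_with_intermediate=True),
    JudgedSample("b", correct_image_only=True, correct_with_intermediate=True),
    JudgedSample("c", correct_image_only=False, correct_with_intermediate=False),
    JudgedSample("d", correct_image_only=True, correct_with_intermediate=False),
]
kept_ids = [s.sample_id for s in filter_samples(judged)]
```

Only the first of the four outcomes passes. This means the subset contains no samples the smaller model already solves unaided and none it fails even with help.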
Freshness
Last updated 2026-06-20 01:13:41; freshness should be verified.
License is unknown; users must verify licensing terms before use.