Name: ULVR-filtered: Vision-Language Reasoning Cases Where Intermediate Images Help
Creator: williamium
Published: 2026-06-19T23:56:20
Keywords: Vision Language Reasoning, Model Evaluation, Multimodal Training, Computer Vision, Multimodal

Description

A filtered subset of 101,951 training samples from the ULVR_v2_clean dataset. Qwen2.5-VL-7B-Instruct answered these samples incorrectly when given only the input image, but correctly once intermediate images were also provided. Correctness was judged by a more capable model (Qwen3-VL-32B-Instruct). The dataset was created by williamium and last updated on June 20, 2026.

Use Cases

Training vision-language models to better utilize intermediate visual reasoning steps based on the provided intermediate_image_* data.
Benchmarking model performance on complex visual reasoning tasks that require multiple image inputs.
Analyzing failure modes of smaller VLMs where access to intermediate visual information changes outcomes.
Studying the structure of visual reasoning problems across subsets like scene_graph, depth, and segmentation.
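For the training and analysis use cases above, a first step is usually to gather a sample's intermediate images in order. The sketch below is a minimal illustration, not the dataset's documented schema: it assumes columns are named intermediate_image_ followed by an integer index and that unused slots hold None. The example row and its file names are hypothetical.

```python
import re

# Assumption: intermediate columns look like "intermediate_image_0", "intermediate_image_1", ...
_INTERMEDIATE = re.compile(r"^intermediate_image_(\d+)$")

def intermediate_images(row):
    """Return the non-empty intermediate_image_* values of one row, ordered by index."""
    found = []
    for name, value in row.items():
        match = _INTERMEDIATE.match(name)
        if match and value is not None:
            found.append((int(match.group(1)), value))
    # Sort on the numeric index only, so the values themselves are never compared.
    return [value for _, value in sorted(found, key=lambda pair: pair[0])]

# Hypothetical row; real rows would carry image objects rather than file names.
row = {
    "question": "How many chairs are left of the table?",
    "image": "input.png",
    "intermediate_image_1": "b.png",
    "intermediate_image_0": "a.png",
    "intermediate_image_2": None,
}
ordered = intermediate_images(row)
```

If the real column names differ, only the regular expression needs to change.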

Strengths

Contains 101,951 specifically filtered training samples that highlight a distinct model failure mode.
Correctness was judged by a more capable model (Qwen3-VL-32B-Instruct) than the one being evaluated, although the card reports no details about the judging process.
Maintains the same schema and train-split structure as its source dataset, ensuring compatibility.

Limitations

Column-level documentation is absent; field semantics must be inferred after download.
Row count for the full source dataset (ULVR_v2_clean) is not stated, so the fraction of samples retained by the filter is unknown.
Description metadata is limited; actual data quality and the nature of intermediate images require manual inspection.
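Because column documentation is absent, a quick pass over a few rows can show which fields exist and what kinds of values they hold. The helper below is a minimal sketch that works on any iterable of dict-like rows, however they were loaded. The sample rows and field names are hypothetical.

```python
from collections import defaultdict

def summarize_columns(rows):
    """Map each column name to the sorted Python type names observed in it."""
    seen = defaultdict(set)
    for row in rows:
        for name, value in row.items():
            seen[name].add(type(value).__name__)
    return {name: sorted(types) for name, types in seen.items()}

# Hypothetical rows standing in for a handful of downloaded samples.
sample_rows = [
    {"question": "q1", "intermediate_image_0": None},
    {"question": "q2", "intermediate_image_0": "a.png"},
]
summary = summarize_columns(sample_rows)
```

A column that reports both NoneType and another type is likely optional, which is worth confirming by inspecting the rows themselves.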

Provenance

Source: Filtered subset of RuoliuYang/ULVR_v2_clean dataset from Hugging Face.
Collection Method: Samples were filtered based on the performance difference of a Qwen2.5-VL-7B-Instruct model with and without intermediate images.
Freshness: Last updated 2026-06-20 01:13:41; freshness should be verified.
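The collection method described above amounts to a two-condition filter: keep a sample only when the answer without intermediate images was judged wrong and the answer with them was judged right. The sketch below shows that rule on hypothetical per-sample verdicts. The field names baseline_correct and augmented_correct are assumptions for illustration, not columns of the dataset, and this is not the creator's actual pipeline.

```python
def helped_by_intermediate_images(baseline_correct, augmented_correct):
    """True when the model failed with the input image alone but succeeded with intermediate images."""
    return (not baseline_correct) and augmented_correct

# Hypothetical judge verdicts for four samples.
records = [
    {"id": "a", "baseline_correct": False, "augmented_correct": True},   # kept
    {"id": "b", "baseline_correct": True,  "augmented_correct": True},   # already solved
    {"id": "c", "baseline_correct": False, "augmented_correct": False},  # never solved
    {"id": "d", "baseline_correct": True,  "augmented_correct": False},  # got worse
]
kept_ids = [
    r["id"]
    for r in records
    if helped_by_intermediate_images(r["baseline_correct"], r["augmented_correct"])
]
```

Under this rule, samples the model already solved and samples it never solved are both excluded, so the subset says nothing about how often intermediate images hurt or make no difference.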

License is unknown; users must verify licensing terms before use.
