Description
ULVR_v2_clean is a cleaned visual reasoning dataset with over 1.2 million samples across eight categories. Each sample pairs an input image with a question; the assistant's response contains a visual token, intermediate steps, and a boxed answer. The dataset was created by RuoliuYang and was last updated on HuggingFace in June 2026.
Use Cases
Training visual question answering models based on image-question-answer triplets.
Developing models that generate intermediate visual reasoning steps.
Benchmarking multimodal models on tasks suggested by the subset names, such as object detection, segmentation, and scene graph generation.
Fine-tuning large language models to incorporate visual reasoning tokens and structured outputs.
Strengths
Dataset is organized into eight distinct subsets, including 'text_cot', 'bbox_highlight', and 'segmentation'.
Provides separate train and validation splits for each subset, with the largest training split ('helper_interleaved') containing over 340,000 samples.
The description specifies a structured output format for each sample, including intermediate visual steps.
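As an illustration of working with the structured output, the sketch below pulls the final answer out of a response string. It assumes the boxed answer is written as LaTeX-style \boxed{...}, which the description does not confirm; the function name extract_boxed_answer and the sample strings are hypothetical.

```python
from typing import Optional


def extract_boxed_answer(response: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in `response`, or None.

    Assumption: answers use LaTeX-style \\boxed{...}; the dataset's actual
    marker may differ and should be checked after download.
    """
    marker = "\\boxed{"
    start = response.rfind(marker)
    if start == -1:
        return None

    # Walk forward, tracking brace depth so nested braces are kept intact.
    depth = 1
    chars = []
    for ch in response[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
    return None  # braces never closed


# Hypothetical response text, not taken from the dataset.
sample = "Step 1: locate the object. Step 2: count. \\boxed{3}"
print(extract_boxed_answer(sample))
```

Tracking brace depth rather than using a simple regular expression keeps nested expressions such as \boxed{\frac{1}{2}} in one piece.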
Limitations
Column-level documentation is absent; field semantics must be inferred after download.
The full-dataset row count is not aggregated across subsets, which makes suitability harder to assess.
The 'scene_graph' subset row count is truncated in the description, obscuring its full size.
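Until an aggregated count is published, per-subset split sizes can be summed by hand once they are known. The sketch below is a minimal helper for that; the function name aggregate_rows, the subset names, and every number in the example are placeholders, not real counts from this dataset.

```python
from typing import Dict, Tuple


def aggregate_rows(
    split_sizes: Dict[str, Dict[str, int]]
) -> Tuple[Dict[str, int], int]:
    """Sum row counts per split name and overall.

    `split_sizes` maps subset name -> {split name -> row count}.
    """
    per_split: Dict[str, int] = {}
    for splits in split_sizes.values():
        for split_name, n_rows in splits.items():
            per_split[split_name] = per_split.get(split_name, 0) + n_rows
    grand_total = sum(per_split.values())
    return per_split, grand_total


# Placeholder subsets and counts for illustration only.
example_sizes = {
    "subset_a": {"train": 100, "validation": 10},
    "subset_b": {"train": 200, "validation": 20},
}
per_split, grand_total = aggregate_rows(example_sizes)
print(per_split, grand_total)
```

A subset whose size is truncated or unknown should be left out of the mapping rather than guessed, so the total is read as a lower bound.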
Provenance
Source
HuggingFace dataset repository by RuoliuYang.
Collection Method
The collection method is not specified in the dataset description.
Freshness
Last updated 2026-06-02 11:54:48; freshness should be verified.
License is unknown; users must verify terms of use before downloading.