VSTAT is a video-based benchmark for evaluating the visual state tracking capability of Multimodal Large Language Models (MLLMs). It contains 834 video clips paired with 1,500 questions whose answers cannot be inferred from any single keyframe or short segment. The dataset was created by nyu-visionx and was last updated in June 2026.
The license is unknown; users must verify the terms of use before downloading.