Available on 1 platform
VLM4D is a benchmark of approximately 1,000 real-world and synthetic videos designed to evaluate spatiotemporal reasoning in Vision Language Models. Developed by Shijie Zhou and researchers at UCLA in 2025, the dataset provides curated video-text pairs to test model awareness of motion and time.
Standard evaluation requires the scripts from the VLM4D GitHub repository. Released under the MIT license.