This benchmark for measuring world model progress comprises 1044 samples of text prompts, conditioning images, and binary question-answer pairs. It covers target domains such as autonomous vehicle driving, robotics, smart spaces, physics, humans, and common sense. The dataset was created by NVIDIA and last updated in June 2025.
Use Cases
- Benchmark world model performance on binary question-answering tasks using the provided text prompts and QA pairs.
- Evaluate multimodal reasoning by conditioning world models on the provided images before answering binary questions.
- Test model generalization across physical AI domains like autonomous vehicle scenarios and robotics using the domain-specific prompts.
- Assess common sense reasoning in physical contexts using the human and common sense question sets.
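The scoring step behind these use cases can be sketched as a per-domain accuracy over binary answers. This is a minimal illustration, not the benchmark's official evaluation code: the record fields (`domain`, `question`, `answer`), the helper names, and the toy records are all hypothetical, since the actual schema is not described here.

```python
from collections import defaultdict


def normalize_answer(text):
    """Map a free-form model reply to 'yes', 'no', or None if it starts with neither."""
    words = text.strip().lower().replace(".", " ").replace(",", " ").split()
    if not words:
        return None
    first = words[0]
    return first if first in ("yes", "no") else None


def accuracy_by_domain(records, predictions):
    """Return {domain: fraction of questions answered correctly}."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec, pred in zip(records, predictions):
        total[rec["domain"]] += 1
        # Unparseable replies normalize to None and therefore count as wrong.
        if normalize_answer(pred) == rec["answer"].lower():
            correct[rec["domain"]] += 1
    return {domain: correct[domain] / total[domain] for domain in total}


# Made-up records for illustration only; field names are assumptions.
records = [
    {"domain": "robotics", "question": "Does the arm grasp the cup?", "answer": "Yes"},
    {"domain": "robotics", "question": "Does the cup fall?", "answer": "No"},
    {"domain": "physics", "question": "Does the ball roll downhill?", "answer": "Yes"},
]
predictions = ["Yes.", "yes", "Yes, it does"]

scores = accuracy_by_domain(records, predictions)
print(scores)  # {'robotics': 0.5, 'physics': 1.0}
```

Reporting accuracy per domain rather than as a single number makes it easier to see whether a model generalizes across domains or does well in only a few of them.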
Strengths
- A fixed set of 1044 samples gives the benchmark a clearly defined size.
- Covers 6 distinct physical AI target domains, allowing evaluation across a broad range of settings.
Limitations
- Limited to binary (Yes/No) questions, restricting answer complexity.
- At 1044 samples, the dataset is likely too small for training large models from scratch.
- Potential for bias in the selection of prompts and conditioning images across domains.
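To put the binary format and the sample size in perspective, the sketch below estimates how precisely an accuracy can be measured. It rests on stated assumptions: one question per sample (the real number of questions per sample is not given here), an equal split across the 6 domains, and a normal approximation to the binomial. The function name `accuracy_margin` is hypothetical.

```python
import math


def accuracy_margin(n_questions, accuracy=0.5, z=1.96):
    """Half-width of the normal-approximation confidence interval for an accuracy.

    accuracy=0.5 is the expected score of uniform random guessing on Yes/No
    questions, and also the value that maximizes the margin.
    """
    return z * math.sqrt(accuracy * (1.0 - accuracy) / n_questions)


# Assumption: one binary question per sample.
margin_all = accuracy_margin(1044)

# Assumption: samples are split evenly across the 6 domains (1044 // 6 = 174).
margin_domain = accuracy_margin(1044 // 6)

print(f"overall margin: +/-{margin_all:.3f}")
print(f"per-domain margin: +/-{margin_domain:.3f}")
```

Under these assumptions the overall margin is about 3 percentage points and the per-domain margin about 7, so small differences between models, especially within a single domain, should be read with caution.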
Provenance
- Source: NVIDIA
- Collection Method: Not specified
- Time Range: Not specified
- Freshness: Last updated June 2025
- Geography: Not specified