Description
A benchmark for evaluating embodied spatial understanding in Large Vision-Language Models (LVLMs), created by Phineas476 and last updated on June 23, 2024. It comprises 3,640 question-answer (QA) pairs automatically derived from embodied scenes, covering 294 object categories and 6 spatial relationships from an egocentric perspective. The associated EmbSpatial-SFT dataset provides instruction-tuning data for spatial tasks.
Use Cases
Benchmarking LVLM performance on embodied spatial understanding using the 3,640 QA pairs.
Training or fine-tuning models for egocentric spatial reasoning over the 6 defined relationships.
Analyzing model capabilities across the 294 covered object categories.
Developing instruction-following models for spatial tasks using the associated EmbSpatial-SFT data.
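The benchmarking and per-relationship analysis use cases above can be sketched as a grouped accuracy computation. This is a minimal illustration, not the benchmark's official scoring: the field names `relation` and `answer` are hypothetical placeholders (the actual column names are not documented on this page), and exact-match comparison is an assumed metric.

```python
from collections import defaultdict


def accuracy_by_group(records, predictions, group_key="relation", answer_key="answer"):
    """Return exact-match accuracy per group (e.g. per spatial relationship).

    `group_key` and `answer_key` are hypothetical field names; replace them
    with the real column names once the downloaded data has been inspected.
    """
    if len(records) != len(predictions):
        raise ValueError("records and predictions must have the same length")

    correct = defaultdict(int)
    total = defaultdict(int)
    for record, prediction in zip(records, predictions):
        group = record[group_key]
        total[group] += 1
        if prediction == record[answer_key]:
            correct[group] += 1
    return {group: correct[group] / total[group] for group in total}


# Made-up records for illustration only; these are not taken from the dataset.
sample_records = [
    {"relation": "left", "answer": "A"},
    {"relation": "left", "answer": "B"},
    {"relation": "above", "answer": "C"},
]
sample_predictions = ["A", "C", "C"]
scores = accuracy_by_group(sample_records, sample_predictions)
```

Passing `group_key` as a parameter means the same helper could group by object category instead of spatial relationship, provided the data exposes such a field.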
Strengths
Contains 3,640 QA pairs, providing a substantial evaluation set.
Covers 294 object categories, suggesting diversity in visual concepts.
Focuses on 6 specific spatial relationships, enabling targeted analysis.
Automatically derived from embodied scenes, which may help with scale and consistency.
Limitations
Column-level documentation is absent; field semantics must be inferred after download.
The benchmark's row count is known, but file formats and sample data are unavailable.
Data may reflect bias inherent to the source embodied scenes and generation method.
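Because column-level documentation is absent, a first step after download is usually to summarize which fields exist and what their values look like. The sketch below assumes the data has already been loaded into a list of dictionaries (the on-disk format is unknown, so loading is left out), and the helper name `summarize_fields` and the sample records are illustrative only.

```python
from collections import Counter, defaultdict


def summarize_fields(records, max_examples=3):
    """Collect observed value types and a few distinct example values per field."""
    summary = defaultdict(lambda: {"types": Counter(), "examples": []})
    for record in records:
        for name, value in record.items():
            entry = summary[name]
            entry["types"][type(value).__name__] += 1
            # Keep only a handful of distinct examples per field.
            if len(entry["examples"]) < max_examples and value not in entry["examples"]:
                entry["examples"].append(value)
    return dict(summary)


# Made-up records for illustration only; real field names may differ.
demo_records = [
    {"question": "q1", "answer": 0},
    {"question": "q2", "answer": 1},
]
field_summary = summarize_fields(demo_records)

for field_name, info in field_summary.items():
    print(field_name, dict(info["types"]), info["examples"])
```

Mixed types within one field (for example both `int` and `str` answers) show up directly in the type counts, which is often the quickest hint about how a field is meant to be read.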
Provenance
Source
Hugging Face
Collection Method
Automatically derived from embodied scenes.
Time Range
Not specified
Freshness
Last updated 2024-06-23 17:35:21; freshness should be verified.
Geography
Not specified
License
Unknown; terms of use must be verified before download.