SpatialReasoner: Towards Explicit and Generalizable 3D Spatial Reasoning

Technical Report

Wufei Ma*, Yu-Cheng Chou*, Qihao Liu*, Xingrui Wang,
Celso de Melo, Jieneng Chen, Jianwen Xie°, Alan Yuille

Johns Hopkins University
DEVCOM Army Research Laboratory, °Lambda Inc
*Equal contribution

We introduce SpatialReasoner, a novel large vision-language model (LVLM) that addresses 3D spatial reasoning with explicit 3D representations shared across three stages: 3D perception, computation, and reasoning. Explicit 3D representations provide a coherent interface that supports advanced 3D spatial reasoning and enable us to study the factual errors made by LVLMs.
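As a rough illustration of what an explicit 3D representation shared across stages could look like, here is a minimal Python sketch. It is not the SpatialReasoner implementation: the `Object3D` class, its fields, the coordinate convention, and the helper functions are all hypothetical, and the perception stage is replaced by hand-written values.

```python
import math
from dataclasses import dataclass

Vec3 = tuple[float, float, float]


@dataclass
class Object3D:
    """Hypothetical explicit 3D representation of one object."""
    name: str
    center: Vec3  # assumed: 3D location in camera coordinates
    front: Vec3   # assumed: unit vector for the object's facing direction


def distance(a: Object3D, b: Object3D) -> float:
    """Computation stage: Euclidean distance between object centers."""
    return math.dist(a.center, b.center)


def is_in_front_of(a: Object3D, b: Object3D) -> bool:
    """Computation stage: does b lie in the half-space that a faces?"""
    offset = tuple(q - p for p, q in zip(a.center, b.center))
    return sum(f * o for f, o in zip(a.front, offset)) > 0


def closer_to(anchor: Object3D, a: Object3D, b: Object3D) -> str:
    """Reasoning stage: answer 'which object is closer to the anchor?'."""
    return a.name if distance(anchor, a) <= distance(anchor, b) else b.name


# Perception stage stand-in: hand-written values, not model output.
chair = Object3D("chair", (0.0, 0.0, 2.0), (1.0, 0.0, 0.0))
table = Object3D("table", (3.0, 0.0, 6.0), (0.0, 0.0, 1.0))
lamp = Object3D("lamp", (0.0, 0.0, 3.0), (0.0, 0.0, -1.0))
```

Because every intermediate quantity (locations, facing directions, distances) is explicit in a layout like this, a wrong final answer can be traced to the stage that produced the faulty value, which is the kind of error analysis the abstract refers to.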

Motivation
Figure 1. Comparing 3D spatial reasoning of our SpatialReasoner with previous state-of-the-art models.

Key Findings

Open Source

Coming soon.

SpatialReasoner codebase.

Synthetic 3D data generation pipeline.

SpatialReasoner models.

SpatialReasoner data.

Miscellaneous

License. Our SpatialReasoner and SpatialReasonerDataGen are released under the Creative Commons Attribution 4.0 license. By accessing and using our SpatialReasoner and SpatialReasonerDataGen, you agree to follow the terms of access specified here.

Ethics. We follow the ethics guidelines at Johns Hopkins University and obtained Institutional Review Board (IRB) approval prior to the start of our work. We described potential risks to the annotators and explained the purpose of the study and how the collected data would be used. All annotators agreed to join this project voluntarily and were paid a fair amount as required at our institution.

BibTeX

pending

Notes

This website template is adapted from Image Sculpting.