Understanding 3D Object Interaction from a Single Image

Shengyi Qian
David F. Fouhey

University of Michigan

arXiv 2023


Humans can easily understand a single image as depicting multiple potential objects that permit interaction. We use this skill to plan our interactions with the world and to accelerate understanding of new objects without engaging in interaction. In this paper, we would like to endow machines with a similar ability, so that intelligent agents can better explore 3D scenes or manipulate objects. Our approach is a transformer-based model that predicts the 3D location, physical properties, and affordances of objects. To power this model, we collect a dataset of Internet videos, egocentric videos, and indoor images to train and validate our approach. Our model yields strong performance on our data and generalizes well to robotics data.

3D Object Interaction Dataset



Category     | Links              | Details
Images       | images.tar.gz      | All images from Articulation, EPIC-KITCHENS, and Omnidata.
Annotations  | annotations.tar.gz | Annotations for the train, val, and test splits. The annotations for each split are stored in a .pth file; please see our code for loading the data.
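As a minimal sketch of working with the annotation files: each split ships as a torch-serialized .pth file, which can be read back with `torch.load`. The exact record fields below (`image`, `bbox`, `movable`) are hypothetical placeholders, not the dataset's actual schema; consult the released code for the real keys. The example round-trips a toy record so it is self-contained.

```python
import torch

# Toy annotation record; the field names here are illustrative only,
# not the actual 3DOI annotation schema.
toy = [{"image": "demo.jpg", "bbox": [10, 20, 100, 200], "movable": True}]

# Save and reload in the same .pth (torch-serialized) format the dataset uses.
torch.save(toy, "/tmp/train_demo.pth")
annotations = torch.load("/tmp/train_demo.pth")

print(len(annotations), annotations[0]["image"])
```

In practice you would point `torch.load` at the downloaded split file (e.g. the train split inside annotations.tar.gz) instead of the toy file written here.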


An interactive demo is available on Hugging Face.


This work was supported by the DARPA Machine Common Sense Program. This material is based upon work supported by the National Science Foundation under Grant No. 2142529.