Understanding 3D Object Interaction from a Single Image

Shengyi Qian
David F. Fouhey

University of Michigan

ICCV 2023


[OpenXLab demo (gpu)]
[HuggingFace demo (cpu)]

Humans can easily understand a single image as depicting multiple potential objects that permit interaction. We use this skill to plan our interactions with the world and to accelerate understanding of new objects without engaging in interaction. In this paper, we would like to endow machines with a similar ability, so that intelligent agents can better explore 3D scenes or manipulate objects. Our approach is a transformer-based model that predicts the 3D location, physical properties, and affordances of objects. To power this model, we collect a dataset of Internet videos, egocentric videos, and indoor images to train and validate our approach. Our model yields strong performance on our data and generalizes well to robotics data.

3D Object Interaction Dataset



Category | Link | Details
Images | images.tar.gz | All images from Articulation, Epickitchen, and Omnidata. The usage of Epickitchen and Omnidata is subject to their original licenses (Epickitchen and Omnidata).
Annotations | 3doi_v1.tar.gz | Annotations for the train, val, and test splits. The annotations for each split are stored in a .pth file. Please check our code for how to load the data.
Omnidata Ground Truth | omnidata_filtered.tar.gz | You can download the full ground truth from Omnidata. However, we filter the ground truth to the images included in our data and provide it as a separate download.
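Since the annotations for each split are distributed as .pth files, they can presumably be read with torch.load. The sketch below round-trips a toy split file to illustrate the mechanics; the field names (image, bbox) are hypothetical and the actual schema is defined by the 3DOI codebase, so consult the loading code in the repository for the real keys.

```python
import os
import tempfile

import torch

# Build a toy annotation list mimicking a split file.
# NOTE: this schema is illustrative, not the actual 3DOI format.
toy_split = [{"image": "example.jpg", "bbox": [10, 20, 100, 200]}]

# Save it the way a split file would be saved ...
path = os.path.join(tempfile.mkdtemp(), "train.pth")
torch.save(toy_split, path)

# ... and load it back, which is how a real train/val/test .pth
# annotation file would be read before iterating over its entries.
annotations = torch.load(path)
print(len(annotations), annotations[0]["image"])  # prints: 1 example.jpg
```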


An interactive demo is available on Hugging Face.


This work was supported by the DARPA Machine Common Sense Program. This material is based upon work supported by the National Science Foundation under Grant No. 2142529. We also acknowledge the GPU support from OpenXLab for the demo.