Understanding 3D Object Articulation in Internet Videos

Shengyi Qian
Linyi Jin
Chris Rockwell
Siyi Chen
David F. Fouhey

University of Michigan

CVPR 2022


Given an ordinary video, our system produces a 3D planar representation of the observed articulation. The 3D renderings illustrate how the microwave (in pink) can be articulated in 3D space. The predicted rotation axis is shown as a blue arrow.

We investigate detecting and characterizing the 3D planar articulation of objects from ordinary videos. While seemingly easy for humans, this problem poses many challenges for computers. We approach it by combining a top-down detection system, which finds planes that can be articulated, with an optimization approach that solves for a 3D plane that can explain a sequence of observed articulations. We show that this system can be trained on a combination of videos and 3D scan datasets. When tested on a dataset of challenging Internet videos and the Charades dataset, our approach obtains strong performance.
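To make the idea of a planar articulation concrete, the sketch below rotates the corners of a plane about a hinge axis using Rodrigues' rotation formula. It is only an illustration of the geometry, not the paper's detection or optimization method; the function names, the door-shaped plane, and the choice of axis are all assumptions made for the example.

```python
import math

def rotate_about_axis(point, axis_origin, axis_dir, angle_rad):
    """Rotate a 3D point about an arbitrary axis (Rodrigues' formula)."""
    # Normalize the axis direction so the formula applies.
    norm = math.sqrt(sum(c * c for c in axis_dir))
    k = [c / norm for c in axis_dir]
    # Express the point relative to a point on the axis.
    v = [p - o for p, o in zip(point, axis_origin)]
    cos_a, sin_a = math.cos(angle_rad), math.sin(angle_rad)
    k_cross_v = [
        k[1] * v[2] - k[2] * v[1],
        k[2] * v[0] - k[0] * v[2],
        k[0] * v[1] - k[1] * v[0],
    ]
    k_dot_v = sum(a * b for a, b in zip(k, v))
    rotated = [
        v[i] * cos_a + k_cross_v[i] * sin_a + k[i] * k_dot_v * (1 - cos_a)
        for i in range(3)
    ]
    # Shift back into the original coordinate frame.
    return [r + o for r, o in zip(rotated, axis_origin)]

def articulate_plane(corners, axis_origin, axis_dir, angle_rad):
    """Apply the same rotation to every corner of a planar surface."""
    return [rotate_about_axis(c, axis_origin, axis_dir, angle_rad) for c in corners]

# Hypothetical door: a 1 x 2 rectangle in the z = 0 plane, hinged along the y axis.
door = [[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [1.0, 2.0, 0.0], [0.0, 2.0, 0.0]]
opened = articulate_plane(door, [0.0, 0.0, 0.0], [0.0, 1.0, 0.0], math.pi / 2)
```

Corners lying on the hinge axis stay fixed, while the free edge sweeps out of the original plane. A sequence of such poses at increasing angles is the kind of observation a single articulated 3D plane would need to explain.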


Category    | Link                               | Details
Frames      | articulation_frames_v1.tar.gz      | Pre-extracted key frames for the dataset. Each video clip has 90 frames (30 fps, i.e. 3 seconds), from which we extract 9 key frames.
Annotations | articulation_annotations_v1.tar.gz | Articulation annotations, preprocessed into COCO format. Surface normals are available only in the test split.
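Since the annotations are in COCO format, they can be read with the standard `json` module. The sketch below groups annotations by image file name using only the generic COCO keys (`images`, `annotations`, `id`, `image_id`, `file_name`, `bbox`, `category_id`). The sample record and its file name are made up for illustration; any articulation-specific fields in the released files are not shown here because their names are not stated above.

```python
import json
from collections import defaultdict

def group_annotations_by_image(coco):
    """Map each image file name to the list of its COCO annotations."""
    file_names = {img["id"]: img["file_name"] for img in coco["images"]}
    grouped = defaultdict(list)
    for ann in coco["annotations"]:
        grouped[file_names[ann["image_id"]]].append(ann)
    return dict(grouped)

# Tiny in-memory stand-in for an annotation file (hypothetical values).
sample = json.loads("""
{
  "images": [{"id": 1, "file_name": "clip_0001_frame_00.jpg", "width": 640, "height": 480}],
  "annotations": [
    {"id": 10, "image_id": 1, "category_id": 1, "bbox": [100, 120, 200, 150]},
    {"id": 11, "image_id": 1, "category_id": 1, "bbox": [320, 60, 90, 220]}
  ],
  "categories": [{"id": 1, "name": "articulated_plane"}]
}
""")

grouped = group_annotations_by_image(sample)
```

For a real file, the same function would be applied to the result of `json.load` on an annotation file extracted from the archive.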


This work was supported by the DARPA Machine Common Sense Program and Toyota Research Institute. Toyota Research Institute (“TRI”) provided funds to assist the authors with their research but this article solely reflects the opinions and conclusions of its authors and not TRI or any other Toyota entity.