Understanding 3D Object Articulation in Internet Videos

Shengyi Qian
Linyi Jin
Chris Rockwell
Siyi Chen
David F. Fouhey

University of Michigan

CVPR 2022


Given an ordinary video, our system produces a 3D planar representation of the observed articulation. The 3D renderings illustrate how the microwave (pink) can be articulated in 3D space; the predicted rotation axis is shown as a blue arrow.

We propose to investigate detecting and characterizing the 3D planar articulation of objects from ordinary videos. While seemingly easy for humans, this problem poses many challenges for computers. We approach it by combining a top-down detection system that finds planes that can be articulated with an optimization approach that solves for a 3D plane explaining a sequence of observed articulations. We show that this system can be trained on a combination of videos and 3D scan datasets. When tested on a dataset of challenging Internet videos and the Charades dataset, our approach obtains strong performance.



Internet videos

| Category | Links | Details |
| --- | --- | --- |
| Video Clips | pos_clips.tar.gz | Articulation video clips. Each clip lasts 3 seconds. |
| Negative Clips | neg_clips.tar.gz | For each positive video clip, we try to sample a negative clip from the same scene that contains hand motion but no articulation. These are used for the recognition benchmark. |
| Frames | articulation_frames_v1.tar.gz | Key frames pre-extracted for the dataset. We extract 9 key frames from each video clip, which has 90 frames (30 fps). |
| Annotations | articulation_annotations_v1.tar.gz | Articulation annotations, preprocessed to COCO format. Surface normals are only available in the test split. |
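The dataset extracts 9 key frames from each 90-frame clip. The exact sampling scheme is not specified on this page; a minimal sketch of one plausible choice, evenly spaced indices from the first to the last frame, is:

```python
def key_frame_indices(n_frames=90, n_keys=9):
    """Evenly sample `n_keys` frame indices from a clip of `n_frames` frames.

    Note: this is an illustrative scheme, not necessarily the one used to
    build articulation_frames_v1.tar.gz.
    """
    step = (n_frames - 1) / (n_keys - 1)  # spacing between key frames
    return [round(i * step) for i in range(n_keys)]

# For a 3-second clip at 30 fps (90 frames), this yields 9 indices
# spanning frame 0 through frame 89.
indices = key_frame_indices()
print(indices)
```

The first and last frames are always included, which is useful here since the articulation state differs most between the start and end of a clip.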


ScanNet

| Category | Links | Details |
| --- | --- | --- |
| Annotations | scannet_annotations.tar.gz | ScanNet plane annotations, preprocessed by SparsePlanes. |
| SURREAL images | scannet_surreal_imgs.tar.gz | We render synthetic humans onto around 98k ScanNet images. You can extract these into the ScanNet folder. |
| SURREAL annotations | scannet_surreal_annotations.tar.gz | The same plane annotations, but with image paths changed to the SURREAL images. |


This work was supported by the DARPA Machine Common Sense Program and Toyota Research Institute. Toyota Research Institute (“TRI”) provided funds to assist the authors with their research but this article solely reflects the opinions and conclusions of its authors and not TRI or any other Toyota entity.