
I'm a final-year Ph.D. student in Computer Science and Engineering at the University of Michigan, working with David Fouhey. My primary research interest lies in large-scale vision-language models, especially for 3D scenes and downstream robotics applications.
Before that, I obtained B.S.E. degrees from both the University of Michigan and Shanghai Jiao Tong University, and I worked with Jia Deng in the Vision & Learning Lab.
I'm currently interning at the Nvidia Robotics Lab in Seattle, WA, and I'm actively looking for full-time positions in computer vision. Please feel free to get in touch if you know of any opportunities!
News
- [2023/07] Both "Understanding 3D Object Interaction from a Single Image" and "Sound Localization from Motion" were accepted to ICCV 2023!
- [2023/06] SpotTarget was accepted to the 19th International Workshop on Mining and Learning with Graphs.
- [2023/06] I received a 2023 Rackham Doctoral Intern Fellowship.
- [2023/05] I started my internship at Amazon Web Services in Seattle, WA.
- [2023/05] We have released the demo of "Chat with NeRF: Grounding 3D Objects in Neural Radiance Field through Dialog". Try the demo here!
- [2023/04] "Understanding 3D Object Interaction from a Single Image" will be presented at CVPR 2023 Workshop on 3D Vision and Robotics.
Work Experience

Demo

We explore how to combine LLMs and NeRFs to enable chatting with 3D scenes.
[project page] [demo] [code]
Publications

We detect potential 3D object interactions from a single image and a set of query points. Building on Segment Anything, our model predicts whether an object is movable and rigid, along with its 3D location, affordance, and articulation.
[OpenXLab demo (GPU)] [HF demo (CPU)]

We jointly learn to localize sound sources from audio and to estimate camera rotations from images. Our method is entirely self-supervised.
[project page] [paper] [code] [bibtex]

We address several common pitfalls in training graph neural networks for link prediction.
[paper]

We detect and characterize the 3D planar articulation of objects in ordinary videos.
[project page] [paper] [code] [bibtex] [CVPR talk]

We propose ViewSeg, which takes as input a few RGB images of a new scene and recognizes the scene from novel viewpoints by segmenting it into semantic categories.
[project page] [paper] [code]

We create a planar reconstruction of a scene from two very distant camera viewpoints.
[project page] [paper] [code] [bibtex] [ICCV talk] [ICCV poster]

We present Associative3D, which addresses 3D volumetric reconstruction from two views of a scene with unknown camera poses by simultaneously reconstructing objects and inferring their spatial relationships.

We present Open Annotations of Single Image Surfaces (OASIS), a dataset for single-image 3D in the wild consisting of dense annotations of detailed 3D geometry for Internet images.

We propose a method to automatically generate training data for single-view depth estimation through Structure-from-Motion (SfM) on Internet videos.
Teaching

Instructional Aide (IA) with David Fouhey.
Teaching Assistant (TA) with Weikang Qian and Paul Weng.