I'm a final-year Ph.D. student in Computer Science and Engineering at the University of Michigan, working with David Fouhey and Joyce Chai. My primary research interest lies in large-scale vision-language models, especially for 3D scenes and downstream robotics applications.

Before that, I obtained my B.S.E. from both the University of Michigan and Shanghai Jiao Tong University, and I worked with Jia Deng in the Vision & Learning Lab.

I'm actively looking for full-time positions in computer vision. Please feel free to get in touch if there are any opportunities!


  • [2024/01] "LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent" has been accepted to ICRA 2024!
  • [2023/09] I started my internship at the NVIDIA Robotics Lab in Seattle, WA.
  • [2023/07] Both "Understanding 3D Object Interaction from a Single Image" and "Sound Localization from Motion" have been accepted to ICCV 2023!
  • [2023/06] SpotTarget has been accepted to the 19th International Workshop on Mining and Learning with Graphs.
  • [2023/06] I'm a recipient of the 2023 Rackham Doctoral Intern Fellowship.

Work Experience

NVIDIA Robotics Lab, 09/2023 - current.
Research Intern.

AWS AI, 05/2023 - 08/2023.
Grounding Affordance from Vision Language Models.
Applied Scientist Intern.

Facebook AI Research, 05/2021 - 12/2021.
3D scene recognition from novel viewpoints without 3D supervision.
Research Intern.


AffordanceLLM: Grounding Affordance from Vision Language Models
arXiv 2024.

We aim to improve the generalization of affordance grounding to in-the-wild objects that are unseen during training. To do so, we develop a new approach, AffordanceLLM, which takes advantage of the rich knowledge in large-scale VLMs.

[project page] [paper]

LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent
Jianing (Jed) Yang*, Xuweiyi Chen*, Shengyi Qian, Nikhil Madaan, Madhavan Iyengar, David Fouhey, Joyce Y. Chai.
ICRA 2024.

Adding an LLM agent can be a simple and effective way to improve 3D grounding capabilities for zero-shot open-vocabulary methods, especially when the query is complex.

[project page] [paper] [demo] [code] [video]

The workshop version will be presented at the CoRL 2023 Workshop on Language and Robot Learning.

Understanding 3D Object Interaction from a Single Image.
Shengyi Qian, David Fouhey.
ICCV 2023

We detect potential 3D object interaction from a single image and a set of query points. Building on Segment-Anything, our model predicts whether the object is movable and whether it is rigid, as well as its 3D location, affordance, articulation, and more.

[OpenXLab demo (with gpu)] [HF demo (cpu)]

The paper was also presented at the CVPR 2023 Workshop on 3D Vision and Robotics.

Sound Localization from Motion: Jointly Learning Sound Direction and Camera Rotation.
Ziyang Chen, Shengyi Qian, Andrew Owens.
ICCV 2023

We jointly learn to localize sound sources from audio and to estimate camera rotations from images. Our method is entirely self-supervised.

[project page] [paper] [code] [bibtex]

Pitfalls in Link Prediction with Graph Neural Networks: Understanding the Impact of Target-link Inclusion & Better Practices
WSDM 2024

We address several common pitfalls in training graph neural networks for link prediction.

[paper] [code]

Understanding 3D Object Articulation in Internet Videos.
CVPR 2022

We investigate detecting and characterizing the 3D planar articulation of objects in ordinary videos.

[project page] [paper] [code] [bibtex] [CVPR talk]

Recognizing Scenes from Novel Viewpoints.
arXiv 2021

We propose ViewSeg, which takes as input a few RGB images of a new scene and recognizes the scene from novel viewpoints by segmenting it into semantic categories.

[project page] [paper] [code]

Planar Surface Reconstruction from Sparse Views.
ICCV 2021

We create a planar reconstruction of a scene from two very distant camera viewpoints.

[project page] [paper] [code] [bibtex] [ICCV talk] [ICCV poster]

Associative3D: Volumetric Reconstruction from Sparse Views.
Shengyi Qian*, Linyi Jin*, David Fouhey.
ECCV 2020

We present Associative3D, which addresses 3D volumetric reconstruction from two views of a scene with an unknown camera, by simultaneously reconstructing objects and figuring out their relationship.

[ECCV talk] [slides]

Invited presentation at the ECCV 2020 Workshop on Holistic Scene Structures for 3D Vision.

OASIS: A Large-Scale Dataset for Single-Image 3D in the Wild.
Weifeng Chen, Shengyi Qian, David Fan, Noriyuki Kojima, Max Hamilton, Jia Deng.
CVPR 2020

We present Open Annotations of Single Image Surfaces (OASIS), a dataset for single-image 3D in the wild consisting of dense annotations of detailed 3D geometry for Internet images.

Learning Single-Image Depth from Videos using Quality Assessment Networks.
Weifeng Chen, Shengyi Qian, Jia Deng.
CVPR 2019

We propose a method to automatically generate training data for single-view depth through Structure-from-Motion (SfM) on Internet videos.