I'm a final-year Ph.D. student in Computer Science and Engineering at the University of Michigan, working with David Fouhey and Joyce Chai. My primary research interest lies in large-scale vision-language models, especially for 3D scenes and downstream robotics applications.
Before that, I obtained my B.S.E. degrees from both the University of Michigan and Shanghai Jiao Tong University, where I worked with Jia Deng in the Vision & Learning Lab.
I'm actively looking for full-time positions in computer vision. Please feel free to get in touch if there are any opportunities!
News
- [2024/01] "LLM-Grounder: Open-Vocabulary 3D Visual Grounding with Large Language Model as an Agent" is accepted by ICRA 2024!
- [2023/09] I started my internship at NVIDIA Robotics Lab at Seattle, WA.
- [2023/07] Both "Understanding 3D Object Interaction from a Single Image" and "Sound Localization from Motion" are accepted at ICCV 2023!
- [2023/06] SpotTarget is accepted by 19th International Workshop on Mining and Learning With Graphs.
- [2023/06] I'm a recipient of Rackham Doctoral Intern Fellowship 2023.
Work Experience
Publications
We aim to enhance the generalization of affordance grounding to in-the-wild objects unseen during training by developing AffordanceLLM, a new approach that takes advantage of the rich knowledge in large-scale vision-language models.
[project page] [paper]
Adding an LLM agent can be a simple and effective way to improve 3D grounding capabilities for zero-shot open-vocabulary methods, especially when the query is complex.
[project page] [paper] [demo] [code] [video]
The paper was also presented at the CoRL 2023 Workshop on Language and Robot Learning.
We detect potential 3D object interactions from a single image and a set of query points. Building on Segment Anything, our model predicts whether the object is movable or rigid, along with its 3D location, affordance, and articulation.
[OpenXLab demo (with GPU)] [HF demo (CPU)]
We jointly learn to localize sound sources from audio and to estimate camera rotations from images. Our method is entirely self-supervised.
[project page] [paper] [code] [bibtex]
We address several common pitfalls in training graph neural networks for link prediction.
We investigate detecting and characterizing the 3D planar articulation of objects in ordinary videos.
[project page] [paper] [code] [bibtex] [CVPR talk]
We propose ViewSeg, which takes as input a few RGB images of a new scene and recognizes the scene from novel viewpoints by segmenting it into semantic categories.
[project page] [paper] [code]
We create a planar reconstruction of a scene from two very distant camera viewpoints.
[project page] [paper] [code] [bibtex] [ICCV talk] [ICCV poster]
We present Associative3D, which addresses 3D volumetric reconstruction from two views of a scene with unknown camera poses by simultaneously reconstructing the objects and inferring their spatial relationships.
We present Open Annotations of Single Image Surfaces (OASIS), a dataset for single-image 3D in the wild consisting of dense annotations of detailed 3D geometry for Internet images.
We propose a method to automatically generate training data for single-view depth through Structure-from-Motion (SfM) on Internet videos.
Teaching
IA with David Fouhey.
TA with Weikang Qian and Paul Weng.