AffordanceLLM: Grounding Affordance from Vision Language Models


Shengyi Qian
Weifeng Chen
Min Bai
Xiong Zhou
Zhuowen Tu
Li Erran Li

AWS AI, Amazon

OpenSun3D @ CVPR 2024

[pdf] [benchmark code] [(community) training code]



Affordance grounding refers to the task of finding the area of an object with which one can interact. It is a fundamental but challenging task: a successful solution requires a comprehensive understanding of a scene in multiple aspects, including detection, localization, and recognition of objects and their parts, the geo-spatial configuration and layout of the scene, 3D shapes and physics, as well as the functionality and potential interactions between objects and humans. Much of this knowledge is hidden, lying beyond the image content and the supervised labels of a limited training set. In this paper, we attempt to improve the generalization capability of current affordance grounding by taking advantage of the rich world, abstract, and human-object-interaction knowledge of pretrained large-scale vision language models. On the AGD20K benchmark, our proposed model demonstrates a significant performance gain over competing methods for in-the-wild object affordance grounding. We further demonstrate that it can ground affordances for objects in random Internet images, even when both the objects and the actions are unseen during training.


Benchmark

Category | Link | Details
Hard split | agd20k_hard_split.tar.gz | Our hard split of AGD20K. The original images can be downloaded from AGD20K.
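For reference, the sketch below shows how predicted heatmaps could be scored against AGD20K-style ground-truth maps using the KLD, SIM, and NSS metrics commonly reported on this benchmark. It is a minimal illustration rather than the exact evaluation code shipped with the benchmark: data loading is omitted, and the array shapes and the positive-GT masking used for NSS are placeholder assumptions.

```python
# Minimal sketch of AGD20K-style heatmap evaluation (KLD, SIM, NSS).
# Predicted and ground-truth maps are assumed to be 2D arrays of the same
# spatial size; real code would load them from the benchmark files.
import numpy as np

EPS = 1e-12

def kld(pred, gt):
    """KL divergence between GT and predicted distributions (lower is better)."""
    pred = pred / (pred.sum() + EPS)
    gt = gt / (gt.sum() + EPS)
    return float(np.sum(gt * np.log(EPS + gt / (pred + EPS))))

def sim(pred, gt):
    """Histogram intersection between normalized maps (higher is better)."""
    pred = pred / (pred.sum() + EPS)
    gt = gt / (gt.sum() + EPS)
    return float(np.sum(np.minimum(pred, gt)))

def nss(pred, gt):
    """Normalized scanpath saliency at positive GT locations (higher is better)."""
    pred = (pred - pred.mean()) / (pred.std() + EPS)
    mask = gt > 0  # assumption: any positive GT value counts as a fixation
    return float(pred[mask].mean()) if mask.any() else 0.0

if __name__ == "__main__":
    # Hypothetical example: replace with real predictions and AGD20K GT maps.
    rng = np.random.default_rng(0)
    pred_map = rng.random((224, 224))
    gt_map = rng.random((224, 224))
    print(f"KLD={kld(pred_map, gt_map):.3f}  "
          f"SIM={sim(pred_map, gt_map):.3f}  "
          f"NSS={nss(pred_map, gt_map):.3f}")
```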



Approach


The input is a single image and the corresponding action (e.g., ``hold''). The output is a heatmap that highlights the regions with which one can interact. We aim to enhance the generalization capability of affordance grounding to in-the-wild objects that are unseen during training by developing a new approach, AffordanceLLM, which takes advantage of the rich knowledge in large-scale vision language models beyond the supervision provided by the training images.
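To make the input/output contract concrete, here is an illustrative sketch of the task interface only: an RGB image and an action word go in, and a dense affordance heatmap comes out. `AffordanceModel` is a hypothetical stand-in, not the released AffordanceLLM API; see the training and benchmark code linked above for the actual implementation.

```python
# Illustrative sketch of the affordance grounding interface: image + action
# in, per-pixel heatmap out. `AffordanceModel` is a placeholder class.
import numpy as np
from PIL import Image

class AffordanceModel:
    """Stand-in for any affordance grounding model with this interface."""

    def predict(self, image: Image.Image, action: str) -> np.ndarray:
        # A real model would return per-pixel interaction scores in [0, 1];
        # here we return a uniform map just to keep the sketch runnable.
        w, h = image.size
        return np.full((h, w), 0.5, dtype=np.float32)

if __name__ == "__main__":
    model = AffordanceModel()
    image = Image.new("RGB", (448, 448))    # replace with an in-the-wild photo
    heatmap = model.predict(image, "hold")  # e.g., "hold", "cut", "sit on"
    print(heatmap.shape, float(heatmap.min()), float(heatmap.max()))
```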



Results





Generalization Results





Acknowledgements

This webpage template was borrowed from Nilesh Kulkarni and originally comes from some colorful folks. Thanks to Wonjun for sharing the implementation of AffordanceLLM.