Henghui Ding Song Tang Shuting He Chang Liu Zuxuan Wu Yu-Gang Jiang

Abstract
Multimodal referring segmentation aims to segment target objects in visual scenes, such as images, videos, and 3D scenes, based on referring expressions in text or audio format. This task plays a crucial role in practical applications requiring accurate object perception based on user instructions. Over the past decade, it has gained significant attention in the multimodal community, driven by advances in convolutional neural networks, transformers, and large language models, all of which have substantially improved multimodal perception capabilities. This paper provides a comprehensive survey of multimodal referring segmentation. We begin by introducing this field's background, including problem definitions and commonly used datasets. Next, we summarize a unified meta architecture for referring segmentation and review representative methods across three primary visual scenes, including images, videos, and 3D scenes. We further discuss Generalized Referring Expression (GREx) methods to address the challenges of real-world complexity, along with related tasks and practical applications. Extensive performance comparisons on standard benchmarks are also provided. We continually track related works at https://github.com/henghuiding/Awesome-Multimodal-Referring-Segmentation.