Henghui Ding Song Tang Shuting He Chang Liu Zuxuan Wu Yu-Gang Jiang

Abstract
Multimodal referring segmentation aims to segment target objects in visual scenes, such as images, videos, and 3D scenes, based on referring expressions in text or audio format. This task plays a crucial role in practical applications requiring accurate object perception based on user instructions. Over the past decade, it has gained significant attention in the multimodal community, driven by advances in convolutional neural networks, transformers, and large language models, all of which have substantially improved multimodal perception capabilities. This paper provides a comprehensive survey of multimodal referring segmentation. We begin by introducing this field's background, including problem definitions and commonly used datasets. Next, we summarize a unified meta architecture for referring segmentation and review representative methods across three primary visual scenes, including images, videos, and 3D scenes. We further discuss Generalized Referring Expression (GREx) methods to address the challenges of real-world complexity, along with related tasks and practical applications. Extensive performance comparisons on standard benchmarks are also provided. We continually track related works at https://github.com/henghuiding/Awesome-Multimodal-Referring-Segmentation.