Multimodal Referring Segmentation: A Survey

Henghui Ding, Song Tang, Shuting He, Chang Liu, Zuxuan Wu, Yu-Gang Jiang


Abstract

Multimodal referring segmentation aims to segment target objects in visual scenes, such as images, videos, and 3D scenes, based on referring expressions in text or audio format. This task plays a crucial role in practical applications requiring accurate object perception based on user instructions. Over the past decade, it has gained significant attention in the multimodal community, driven by advances in convolutional neural networks, transformers, and large language models, all of which have substantially improved multimodal perception capabilities. This paper provides a comprehensive survey of multimodal referring segmentation. We begin by introducing this field's background, including problem definitions and commonly used datasets. Next, we summarize a unified meta architecture for referring segmentation and review representative methods across three primary visual scenes, including images, videos, and 3D scenes. We further discuss Generalized Referring Expression (GREx) methods to address the challenges of real-world complexity, along with related tasks and practical applications. Extensive performance comparisons on standard benchmarks are also provided. We continually track related works at https://github.com/henghuiding/Awesome-Multimodal-Referring-Segmentation.
