Cross-Modal Progressive Comprehension for Referring Segmentation

Si Liu, Tianrui Hui, Shaofei Huang, Yunchao Wei, Bo Li, Guanbin Li


Abstract

Given a natural language expression and an image/video, the goal of referring segmentation is to produce pixel-level masks of the entities described by the subject of the expression. Previous approaches tackle this problem through implicit feature interaction and fusion between the visual and linguistic modalities in a one-stage manner. However, humans tend to solve the referring problem progressively based on informative words in the expression, i.e., first roughly locating candidate entities and then distinguishing the target one. In this paper, we propose a Cross-Modal Progressive Comprehension (CMPC) scheme to effectively mimic this behavior and implement it as a CMPC-I (Image) module and a CMPC-V (Video) module to improve referring image and video segmentation models. For image data, our CMPC-I module first employs entity and attribute words to perceive all the related entities that the expression might refer to. Then, relational words are adopted to highlight the target entity and suppress irrelevant ones via spatial graph reasoning. For video data, our CMPC-V module builds on CMPC-I and further exploits action words to highlight the entity matched with the action cues via temporal graph reasoning. In addition to CMPC, we also introduce a simple yet effective Text-Guided Feature Exchange (TGFE) module to integrate the reasoned multimodal features from different levels of the visual backbone under the guidance of textual information. In this way, multi-level features can communicate with each other and be mutually refined based on the textual context. Combining CMPC-I or CMPC-V with TGFE forms our image or video referring segmentation framework, respectively, and our frameworks achieve new state-of-the-art performance on four referring image segmentation benchmarks and three referring video segmentation benchmarks.
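To make the progressive idea concrete, below is a minimal PyTorch-style sketch of how the two stages of CMPC-I could be wired: entity/attribute word embeddings first gate the visual features to perceive candidate entities, and a relation-conditioned spatial graph reasoning step then highlights the referent. This is not the authors' released implementation; the module name `CMPCISketch`, the projection layers, the pooled word embeddings, and all dimensions are illustrative assumptions.

```python
# Illustrative sketch (not the authors' code) of the two CMPC-I stages:
# (1) entity perception via language gating, (2) relation-aware spatial graph reasoning.
import torch
import torch.nn as nn


class CMPCISketch(nn.Module):
    def __init__(self, vis_dim=512, txt_dim=512, hid_dim=512):
        super().__init__()
        self.vis_proj = nn.Conv2d(vis_dim, hid_dim, kernel_size=1)
        self.entity_proj = nn.Linear(txt_dim, hid_dim)
        self.relation_proj = nn.Linear(txt_dim, hid_dim)
        self.graph_proj = nn.Conv2d(hid_dim, hid_dim, kernel_size=1)

    def forward(self, vis_feat, entity_emb, relation_emb):
        # vis_feat: (B, C, H, W); entity_emb / relation_emb: (B, txt_dim),
        # e.g. pooled embeddings of entity/attribute words and relational words.
        B, _, H, W = vis_feat.shape
        v = self.vis_proj(vis_feat)                                    # (B, D, H, W)

        # Stage 1: entity perception -- gate visual features with entity/attribute cues.
        e = torch.tanh(self.entity_proj(entity_emb))[:, :, None, None]
        perceived = v * e                                              # (B, D, H, W)

        # Stage 2: relation-conditioned spatial graph reasoning over all pixels.
        nodes = perceived.flatten(2).transpose(1, 2)                   # (B, HW, D)
        r = self.relation_proj(relation_emb).unsqueeze(1)              # (B, 1, D)
        affinity = torch.softmax(
            (nodes * r) @ nodes.transpose(1, 2) / nodes.shape[-1] ** 0.5,
            dim=-1)                                                    # (B, HW, HW)
        reasoned = (affinity @ nodes).transpose(1, 2).reshape(B, -1, H, W)
        return perceived + self.graph_proj(reasoned)                   # residual fusion


if __name__ == "__main__":
    m = CMPCISketch()
    out = m(torch.randn(2, 512, 26, 26), torch.randn(2, 512), torch.randn(2, 512))
    print(out.shape)  # torch.Size([2, 512, 26, 26])
```

Under this reading, CMPC-V would extend the same pattern by adding action-word conditioning and running the graph reasoning across frames (temporal graph reasoning) rather than only across spatial positions.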


Benchmarks

Benchmark: referring-expression-segmentation-on-a2d
Methodology: CMPC-V (R2D)
Metrics:
  AP: 0.351
  IoU mean: 0.515
  IoU overall: 0.649
  Precision@0.5: 0.590
  Precision@0.6: 0.527
  Precision@0.7: 0.434
  Precision@0.8: 0.284
  Precision@0.9: 0.068

Benchmark: referring-expression-segmentation-on-a2d
Methodology: CMPC-V (I3D)
Metrics:
  AP: 0.404
  IoU mean: 0.573
  IoU overall: 0.653
  Precision@0.5: 0.655
  Precision@0.6: 0.592
  Precision@0.7: 0.506
  Precision@0.8: 0.342
  Precision@0.9: 0.098

Benchmark: referring-expression-segmentation-on-j-hmdb
Methodology: CMPC-V
Metrics:
  AP: 0.342
  IoU mean: 0.617
  IoU overall: 0.616
  Precision@0.5: 0.813
  Precision@0.6: 0.657
  Precision@0.7: 0.371
  Precision@0.8: 0.07
  Precision@0.9: 0.000
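For readers unfamiliar with these numbers, the sketch below shows how the mask metrics in the table are commonly computed in referring segmentation: per-sample IoU averaged over the dataset ("IoU mean"), intersection and union accumulated over all pixels before dividing ("IoU overall"), and Precision@K, the fraction of samples whose IoU exceeds the threshold K. The helper `evaluate_masks` is hypothetical, AP (average precision over IoU thresholds) is omitted, and exact evaluation protocols may differ slightly per benchmark.

```python
# Hedged NumPy sketch of common referring-segmentation mask metrics.
import numpy as np


def evaluate_masks(preds, gts, thresholds=(0.5, 0.6, 0.7, 0.8, 0.9)):
    """preds, gts: lists of boolean masks; each pred/gt pair has identical shape."""
    ious, inter_total, union_total = [], 0, 0
    for p, g in zip(preds, gts):
        inter = np.logical_and(p, g).sum()
        union = np.logical_or(p, g).sum()
        ious.append(inter / union if union > 0 else 1.0)  # per-sample IoU
        inter_total += inter
        union_total += union
    ious = np.array(ious)
    return {
        "IoU mean": ious.mean(),                          # average of per-sample IoUs
        "IoU overall": inter_total / max(union_total, 1), # pixel-level dataset IoU
        **{f"Precision@{t}": (ious > t).mean() for t in thresholds},
    }


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    preds = [rng.random((32, 32)) > 0.5 for _ in range(4)]
    gts = [rng.random((32, 32)) > 0.5 for _ in range(4)]
    print(evaluate_masks(preds, gts))
```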
