LAVT: Language-Aware Vision Transformer for Referring Image Segmentation

Yang Zhao ; Wang Jiaqi ; Tang Yansong ; Chen Kai ; Zhao Hengshuang ; Torr Philip H. S.

Abstract

Referring image segmentation is a fundamental vision-language task that aims to segment out an object referred to by a natural language expression from an image. One of the key challenges behind this task is leveraging the referring expression for highlighting relevant positions in the image. A paradigm for tackling this problem is to leverage a powerful vision-language ("cross-modal") decoder to fuse features independently extracted from a vision encoder and a language encoder. Recent methods have made remarkable advancements in this paradigm by exploiting Transformers as cross-modal decoders, concurrent to the Transformer's overwhelming success in many other vision-language tasks. Adopting a different approach in this work, we show that significantly better cross-modal alignments can be achieved through the early fusion of linguistic and visual features in intermediate layers of a vision Transformer encoder network. By conducting cross-modal feature fusion in the visual feature encoding stage, we can leverage the well-proven correlation modeling power of a Transformer encoder for excavating helpful multi-modal context. This way, accurate segmentation results are readily harvested with a light-weight mask predictor. Without bells and whistles, our method surpasses the previous state-of-the-art methods on RefCOCO, RefCOCO+, and G-Ref by large margins.
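To make the idea of early fusion concrete, the following is a minimal NumPy sketch of one way language features could be injected into visual features between two encoder stages. It is not the authors' implementation. The function name `fuse_language_into_vision`, the single-head cross-attention, the tanh gate, the weight matrices, and all shapes are assumptions chosen for illustration only.

```python
import numpy as np


def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability before exponentiating.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def fuse_language_into_vision(visual, words, w_q, w_k, w_v, w_gate):
    """Hypothetical early-fusion step between two vision encoder stages.

    visual: (num_pixels, dim) visual tokens from an intermediate stage
    words:  (num_words, dim) word features from a language encoder
    Returns the fused visual tokens and the pixel-to-word attention map.
    """
    q = visual @ w_q  # queries come from the visual tokens
    k = words @ w_k   # keys come from the word features
    v = words @ w_v   # values come from the word features
    # Each visual token attends over all words: (num_pixels, num_words).
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    lang_context = attn @ v  # per-pixel language context, (num_pixels, dim)
    # Assumed element-wise gate that scales how much language is mixed in.
    gate = np.tanh(lang_context @ w_gate)
    # Residual addition keeps the original visual signal intact.
    return visual + gate * lang_context, attn


# Toy inputs with illustrative sizes (not taken from the paper).
rng = np.random.default_rng(0)
dim, num_pixels, num_words = 8, 16, 5
visual = rng.standard_normal((num_pixels, dim))
words = rng.standard_normal((num_words, dim))
w_q, w_k, w_v, w_gate = (rng.standard_normal((dim, dim)) * 0.1 for _ in range(4))

fused, attn = fuse_language_into_vision(visual, words, w_q, w_k, w_v, w_gate)
```

In this sketch the fused tokens would be passed on to the next encoder stage, so later layers operate on features that already carry information from the referring expression. That is the contrast with the decoder-based paradigm, where fusion happens only after both encoders have finished.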

Code Repositories

yz93/lavt-ris
Official
pytorch
