LAVT: Language-Aware Vision Transformer for Referring Image Segmentation
Yang Zhao; Wang Jiaqi; Tang Yansong; Chen Kai; Zhao Hengshuang; Torr Philip H. S.

Abstract
Referring image segmentation is a fundamental vision-language task that aims to segment out an object referred to by a natural language expression from an image. One of the key challenges behind this task is leveraging the referring expression for highlighting relevant positions in the image. A paradigm for tackling this problem is to leverage a powerful vision-language ("cross-modal") decoder to fuse features independently extracted from a vision encoder and a language encoder. Recent methods have made remarkable advancements in this paradigm by exploiting Transformers as cross-modal decoders, concurrent to the Transformer's overwhelming success in many other vision-language tasks. Adopting a different approach in this work, we show that significantly better cross-modal alignments can be achieved through the early fusion of linguistic and visual features in intermediate layers of a vision Transformer encoder network. By conducting cross-modal feature fusion in the visual feature encoding stage, we can leverage the well-proven correlation modeling power of a Transformer encoder for excavating helpful multi-modal context. This way, accurate segmentation results are readily harvested with a light-weight mask predictor. Without bells and whistles, our method surpasses the previous state-of-the-art methods on RefCOCO, RefCOCO+, and G-Ref by large margins.
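The general idea of early fusion described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a single cross-attention step with no learned projection weights, and the function name `fuse_language_into_visual` and the scalar `gate` parameter are hypothetical, introduced only for illustration. Each visual token attends over the word features, and the resulting language context is added back to the token before it would be passed to the next encoder stage.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def fuse_language_into_visual(visual_tokens, word_features, gate=0.5):
    """Hypothetical early-fusion step (illustration only).

    visual_tokens: list of feature vectors, one per image position.
    word_features: list of feature vectors, one per word.
    gate: assumed scalar weight on the language context (a real model
          would typically learn this, along with projection matrices).
    """
    dim = len(word_features[0])
    fused = []
    for v in visual_tokens:
        # Scaled dot-product attention: the visual token is the query,
        # the word features act as both keys and values.
        weights = softmax([dot(v, w) / math.sqrt(dim) for w in word_features])
        context = [
            sum(a * w[d] for a, w in zip(weights, word_features))
            for d in range(dim)
        ]
        # Residual update: the visual feature is enriched with language context.
        fused.append([vi + gate * ci for vi, ci in zip(v, context)])
    return fused


# Toy inputs: two image positions, three words, feature dimension 2.
visual_tokens = [[1.0, 0.0], [0.0, 1.0]]
word_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = fuse_language_into_visual(visual_tokens, word_features)
```

In this sketch, setting `gate=0` leaves the visual features unchanged, which makes the contribution of the language signal easy to isolate.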
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| generalized-referring-expression-segmentation | LAVT | cIoU: 57.64, gIoU: 58.40 |
| referring-expression-segmentation-on-refcoco-3 | LAVT | Overall IoU: 62.14 |
| referring-expression-segmentation-on-refcoco-4 | LAVT | Overall IoU: 68.38 |
| referring-expression-segmentation-on-refcoco-5 | LAVT | Overall IoU: 55.1 |
| referring-expression-segmentation-on-refcocog | LAVT | Overall IoU: 61.24 |
| referring-expression-segmentation-on-refcocog-1 | LAVT (Swin-B) | Overall IoU: 62.09 |
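
The metrics in the table can be read with the help of a small sketch. It assumes the common convention in which "Overall IoU" (and cIoU) is the cumulative intersection divided by the cumulative union over all test samples, while gIoU is the average of per-sample IoU values. The exact evaluation protocol behind the numbers above is not stated here, and the handling of empty unions below is an assumption. Masks are represented as flat lists of 0/1 values.

```python
def intersection_and_union(pred, target):
    """Count intersecting and united pixels of two binary masks (flat lists)."""
    inter = sum(1 for p, t in zip(pred, target) if p and t)
    union = sum(1 for p, t in zip(pred, target) if p or t)
    return inter, union


def overall_iou(pairs):
    """Cumulative intersection over cumulative union across all samples."""
    total_inter = 0
    total_union = 0
    for pred, target in pairs:
        inter, union = intersection_and_union(pred, target)
        total_inter += inter
        total_union += union
    # Assumption: return 0.0 when no sample contains any foreground pixel.
    return total_inter / total_union if total_union else 0.0


def mean_iou(pairs):
    """Average of per-sample IoU values."""
    ious = []
    for pred, target in pairs:
        inter, union = intersection_and_union(pred, target)
        # Assumption: an empty prediction on an empty target counts as 1.0.
        ious.append(inter / union if union else 1.0)
    return sum(ious) / len(ious)


# Two toy samples: a small object predicted loosely, a large one predicted exactly.
pairs = [
    ([1, 1, 0, 0], [1, 0, 0, 0]),
    ([1, 1, 1, 1], [1, 1, 1, 1]),
]
overall = overall_iou(pairs)
mean = mean_iou(pairs)
```

Because the cumulative variant pools pixels across samples, large objects weigh more heavily in it than in the per-sample average, which is one reason the two numbers can differ on the same predictions.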