EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model

Yuxuan Zhang, Tianheng Cheng, Rui Hu, Lei Liu, Heng Liu, Longjin Ran, Xiaoxin Chen, Wenyu Liu, Xinggang Wang

Abstract

Segment Anything Model (SAM) has attracted widespread attention for its superior interactive segmentation capabilities with visual prompts while lacking further exploration of text prompts. In this paper, we empirically investigate what text prompt encoders (e.g., CLIP or LLM) are good for adapting SAM for referring expression segmentation and introduce the Early Vision-language Fusion-based SAM (EVF-SAM). EVF-SAM is a simple yet effective referring segmentation method which exploits multimodal prompts (i.e., image and text) and comprises a pre-trained vision-language model to generate referring prompts and a SAM model for segmentation. Surprisingly, we observe that: (1) multimodal prompts and (2) vision-language models with early fusion (e.g., BEIT-3) are beneficial for prompting SAM for accurate referring segmentation. Our experiments show that the proposed EVF-SAM based on BEIT-3 can obtain state-of-the-art performance on RefCOCO/+/g for referring expression segmentation and demonstrate the superiority of prompting SAM with early vision-language fusion. In addition, the proposed EVF-SAM with 1.32B parameters achieves remarkably higher performance while reducing nearly 82% of parameters compared to previous SAM methods based on large multimodal models.
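
The parameter claim at the end of the abstract can be checked with simple arithmetic. The sketch below works out the baseline size implied by the two figures quoted above (1.32B parameters, a reduction of nearly 82%). The helper name implied_baseline_params is hypothetical, and "nearly 82%" is treated as exactly 0.82 for illustration, so the result is only a rough estimate and not a figure reported by the authors.

```python
def implied_baseline_params(model_params_b: float, reduction: float) -> float:
    """Return the baseline size (in billions) implied by a fractional reduction.

    If a model with `model_params_b` parameters is `reduction` (e.g. 0.82)
    smaller than a baseline, then model = baseline * (1 - reduction).
    """
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return model_params_b / (1.0 - reduction)


# Figures quoted in the abstract.
evf_sam_params_b = 1.32  # EVF-SAM size, in billions of parameters
reduction = 0.82         # "nearly 82%", assumed exact here for illustration

baseline_b = implied_baseline_params(evf_sam_params_b, reduction)
print(f"Implied baseline size: about {baseline_b:.1f}B parameters")
```

Under these assumptions the implied baseline is a little over 7B parameters, which matches the scale the abstract attributes to SAM methods built on large multimodal models.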

Code Repositories

hustvl/evf-sam (official, PyTorch)
