8 months ago

Semantic Segmentation

Computer Vision

Yuxuan Zhang, Tianheng Cheng, Rui Hu, Lei Liu, Heng Liu, Longjin Ran, Xiaoxin Chen, Wenyu Liu, Xinggang Wang

Abstract

Segment Anything Model (SAM) has attracted widespread attention for its superior interactive segmentation capabilities with visual prompts while lacking further exploration of text prompts. In this paper, we empirically investigate what text prompt encoders (e.g., CLIP or LLM) are good for adapting SAM for referring expression segmentation and introduce the Early Vision-language Fusion-based SAM (EVF-SAM). EVF-SAM is a simple yet effective referring segmentation method which exploits multimodal prompts (i.e., image and text) and comprises a pre-trained vision-language model to generate referring prompts and a SAM model for segmentation. Surprisingly, we observe that: (1) multimodal prompts and (2) vision-language models with early fusion (e.g., BEIT-3) are beneficial for prompting SAM for accurate referring segmentation. Our experiments show that the proposed EVF-SAM based on BEIT-3 can obtain state-of-the-art performance on RefCOCO/+/g for referring expression segmentation and demonstrate the superiority of prompting SAM with early vision-language fusion. In addition, the proposed EVF-SAM with 1.32B parameters achieves remarkably higher performance while reducing nearly 82% of parameters compared to previous SAM methods based on large multimodal models.
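The parameter claim in the abstract can be turned into a rough size for the methods it is compared against. The sketch below is only back-of-the-envelope arithmetic: it treats "nearly 82%" as exactly 0.82 (an assumption), and the helper name `implied_baseline_params` is hypothetical, not something taken from the paper or its code.

```python
def implied_baseline_params(model_params_b: float, reduction: float) -> float:
    """Return the baseline size (in billions of parameters) implied by a
    fractional reduction, i.e. solve model = baseline * (1 - reduction)."""
    if not 0.0 <= reduction < 1.0:
        raise ValueError("reduction must be in [0, 1)")
    return model_params_b / (1.0 - reduction)


# Figures quoted in the abstract.
evf_sam_params_b = 1.32  # EVF-SAM size, in billions of parameters
reduction = 0.82         # "nearly 82%", treated as exact here (assumption)

baseline_b = implied_baseline_params(evf_sam_params_b, reduction)
print(f"Implied baseline size: about {baseline_b:.2f}B parameters")
```

Under that assumption the implied baseline is roughly 7.3B parameters, which is the scale one would expect for a segmentation method built on a large multimodal model. The abstract does not name the specific baseline, so this number is an estimate rather than a reported figure.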

EVF-SAM: Early Vision-Language Fusion for Text-Prompted Segment Anything Model