In Defense of Lazy Visual Grounding for Open-Vocabulary Semantic Segmentation
Dahyun Kang; Minsu Cho

Abstract
We present lazy visual grounding, a two-stage approach to open-vocabulary semantic segmentation: unsupervised object mask discovery followed by object grounding. Much of the prior art casts this task as pixel-to-text classification without object-level comprehension, leveraging the image-to-text classification capability of pretrained vision-and-language models. We argue that visual objects are distinguishable without prior text information, as segmentation is essentially a vision task. Lazy visual grounding first discovers object masks covering an image with iterative Normalized Cuts and only later assigns text to the discovered objects in a late-interaction manner. Our model requires no additional training yet shows strong performance on five public datasets: Pascal VOC, Pascal Context, COCO-Object, COCO-Stuff, and ADE20K. In particular, the visually appealing segmentation results demonstrate the model's capability to localize objects precisely. Paper homepage: https://cvlab.postech.ac.kr/research/lazygrounding
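To make the two-stage recipe concrete, below is a minimal NumPy/SciPy sketch of the pipeline the abstract describes: iterative Normalized-cut bipartitions over patch features to discover masks, then late-interaction text assignment per mask. This is an illustrative sketch, not the paper's exact procedure; all inputs are random placeholders (hypothetical stand-ins for self-supervised DINO patch features and CLIP patch/text embeddings), and the details such as the affinity threshold `tau`, binarized affinities, always splitting the largest region, and mean-pooled mask embeddings are assumptions made for brevity.

```python
import numpy as np
from scipy.linalg import eigh

def ncut_bipartition(feats, idx, tau=0.2):
    """One Normalized-cut bipartition of the patches indexed by `idx`.

    feats: (N, D) L2-normalized patch features. Returns the two index
    sets of the partition.
    """
    F = feats[idx]
    W = F @ F.T                        # cosine affinity between patches
    W = np.where(W > tau, 1.0, 1e-5)   # binarized affinity (TokenCut-style; an assumption here)
    d = W.sum(axis=1)
    D = np.diag(d)
    # Relaxed Normalized cut: the second-smallest generalized eigenvector
    # of (D - W) y = lambda * D y (the Fiedler vector) defines the split.
    _, vecs = eigh(D - W, D, subset_by_index=[1, 1])
    fiedler = vecs[:, 0]
    mask = fiedler > np.median(fiedler)  # median split keeps both sides non-empty on toy data
    return idx[mask], idx[~mask]

def discover_masks(feats, num_cuts=3):
    """Iteratively bipartition the image into num_cuts + 1 regions,
    always splitting the currently largest region (a simplification)."""
    regions = [np.arange(len(feats))]
    for _ in range(num_cuts):
        regions.sort(key=len, reverse=True)
        a, b = ncut_bipartition(feats, regions.pop(0))
        regions += [a, b]
    return regions

def ground_masks(regions, patch_embs, text_embs):
    """Late-interaction grounding: mean-pool the (CLIP-space) patch
    embeddings inside each discovered mask, then assign the class whose
    text embedding is nearest in cosine similarity."""
    labels = []
    for idx in regions:
        v = patch_embs[idx].mean(axis=0)
        v /= np.linalg.norm(v)
        labels.append(int(np.argmax(text_embs @ v)))
    return labels

# Toy run with random stand-ins (hypothetical) for a 14x14 patch grid.
rng = np.random.default_rng(0)
patch_feats = rng.normal(size=(196, 64))
patch_feats /= np.linalg.norm(patch_feats, axis=1, keepdims=True)
clip_patches = rng.normal(size=(196, 512))
text_embs = rng.normal(size=(5, 512))
text_embs /= np.linalg.norm(text_embs, axis=1, keepdims=True)

masks = discover_masks(patch_feats, num_cuts=3)
print(ground_masks(masks, clip_patches, text_embs))  # one class index per discovered mask
```

The key structural point the sketch mirrors is the "lazy" ordering: masks are discovered purely from visual affinities, and text enters only afterwards, per mask rather than per pixel.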
Benchmarks
| Benchmark | Method | mIoU (%) |
|---|---|---|
| open-vocabulary-semantic-segmentation-on-1 | LaVG | 34.7 |
| open-vocabulary-semantic-segmentation-on-2 | LaVG | 15.8 |
| open-vocabulary-semantic-segmentation-on-5 | LaVG | 82.5 |
| open-vocabulary-semantic-segmentation-on-coco | LaVG | 23.2 |