Learning to Generate Text-grounded Mask for Open-world Semantic Segmentation from Only Image-Text Pairs
Junbum Cha; Jonghwan Mun; Byungseok Roh

Abstract
We tackle open-world semantic segmentation, which aims to learn to segment arbitrary visual concepts in images, using only image-text pairs without dense annotations. Existing open-world segmentation methods have shown impressive advances by employing contrastive learning (CL) to learn diverse visual concepts and transferring the learned image-level understanding to the segmentation task. However, these CL-based methods suffer from a train-test discrepancy: they consider only image-text alignment during training, whereas segmentation requires region-text alignment during testing. In this paper, we propose a novel Text-grounded Contrastive Learning (TCL) framework that enables a model to learn region-text alignment directly. Our method generates a segmentation mask for a given text, extracts a text-grounded image embedding from the masked region, and aligns it with the text embedding via TCL. By learning region-text alignment directly, our framework encourages a model to improve the quality of the generated segmentation masks. In addition, for a rigorous and fair comparison, we present a unified evaluation protocol with eight widely used semantic segmentation datasets. TCL achieves state-of-the-art zero-shot segmentation performance by large margins on all datasets. Code is available at https://github.com/kakaobrain/tcl.
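The three steps named in the abstract (generate a mask for a text, pool the masked region into a text-grounded image embedding, align it with the text embedding) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `text_grounded_contrastive_loss`, the sigmoid mask, the mask-weighted average pooling, the symmetric InfoNCE-style loss, and the values of `mask_scale` and `temperature` are all assumptions chosen for illustration, and the random arrays stand in for real encoder outputs.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Scale vectors to unit length so dot products are cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def text_grounded_contrastive_loss(patches, texts, mask_scale=10.0, temperature=0.07):
    """Hypothetical sketch of region-text alignment.

    patches: (B, N, D) patch embeddings for B images with N patches each.
    texts:   (B, D) text embeddings; texts[i] is paired with image i.
    """
    patches = l2_normalize(patches)
    texts = l2_normalize(texts)
    batch = texts.shape[0]

    # Step 1: soft mask of image i for text j, one value per patch (assumed sigmoid form).
    sim = np.einsum("ind,jd->ijn", patches, texts)           # (B, B, N)
    masks = 1.0 / (1.0 + np.exp(-mask_scale * sim))          # values in (0, 1)

    # Step 2: text-grounded image embedding = mask-weighted average of patches.
    grounded = np.einsum("ijn,ind->ijd", masks, patches)     # (B, B, D)
    grounded = grounded / masks.sum(axis=-1, keepdims=True)
    grounded = l2_normalize(grounded)

    # Step 3: contrastive alignment; matched pairs lie on the diagonal.
    logits = np.einsum("ijd,jd->ij", grounded, texts) / temperature
    diag = np.arange(batch)
    loss_i2t = -log_softmax(logits, axis=1)[diag, diag].mean()
    loss_t2i = -log_softmax(logits, axis=0)[diag, diag].mean()
    return 0.5 * (loss_i2t + loss_t2i), masks

# Random stand-ins for encoder outputs: 4 image-text pairs, 16 patches, 8 dimensions.
rng = np.random.default_rng(0)
loss, masks = text_grounded_contrastive_loss(
    rng.normal(size=(4, 16, 8)), rng.normal(size=(4, 8))
)
```

Because the loss depends on the embedding pooled from the masked region, a gradient-based version of this sketch would update the mask itself, which is the region-level supervision that image-level contrastive learning lacks.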
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| open-vocabulary-semantic-segmentation-on-1 | TCL | mIoU: 33.9 |
| open-vocabulary-semantic-segmentation-on-5 | TCL | mIoU: 83.2 |
| semantic-segmentation-on-cc3m-tagmask | TCL | mIoU: 60.4 |
| unsupervised-semantic-segmentation-with-10 | TCL | mIoU: 31.6 |
| unsupervised-semantic-segmentation-with-11 | TCL | mIoU: 55.0 |
| unsupervised-semantic-segmentation-with-3 | TCL | mIoU: 24.0 |
| unsupervised-semantic-segmentation-with-4 | TCL | Mean IoU (val): 17.1 |
| unsupervised-semantic-segmentation-with-7 | TCL | mIoU: 83.2 |
| unsupervised-semantic-segmentation-with-8 | TCL | mIoU: 33.9 |
| unsupervised-semantic-segmentation-with-9 | TCL | mIoU: 22.4 |
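For reference, the mIoU values in the table are mean intersection-over-union scores. The sketch below shows one common way to compute the metric from a confusion matrix; the helper name `mean_iou` and the toy labels are illustrative assumptions, and the exact protocol behind each benchmark (for example, whether a background class is included) is not stated here.

```python
import numpy as np

def mean_iou(pred, gt, num_classes):
    # Flatten label maps so every pixel is one entry.
    pred = np.asarray(pred).ravel()
    gt = np.asarray(gt).ravel()
    # Confusion matrix: rows are ground-truth classes, columns are predictions.
    conf = np.bincount(gt * num_classes + pred, minlength=num_classes ** 2)
    conf = conf.reshape(num_classes, num_classes)
    inter = np.diag(conf)                           # correctly labeled pixels per class
    union = conf.sum(axis=0) + conf.sum(axis=1) - inter
    valid = union > 0                               # skip classes absent from both maps
    return float((inter[valid] / union[valid]).mean())

# Toy example with two classes and four pixels.
score = mean_iou(pred=[0, 0, 1, 1], gt=[0, 1, 1, 1], num_classes=2)
print(round(100 * score, 1))  # class IoUs are 1/2 and 2/3, so this prints 58.3
```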