Abstract
We present X-Decoder, a generalized decoding model that can predict pixel-level segmentation and language tokens seamlessly. X-Decoder takes as input two types of queries: (i) generic non-semantic queries and (ii) semantic queries induced from text inputs, to decode different pixel-level and token-level outputs in the same semantic space. With such a novel design, X-Decoder is the first work that provides a unified way to support all types of image segmentation and a variety of vision-language (VL) tasks. Further, our design enables seamless interactions across tasks at different granularities and brings mutual benefits by learning a common and rich pixel-level visual-semantic understanding space, without any pseudo-labeling. After pretraining on a mixture of a limited amount of segmentation data and millions of image-text pairs, X-Decoder exhibits strong transferability to a wide range of downstream tasks in both zero-shot and finetuning settings. Notably, it achieves (1) state-of-the-art results on open-vocabulary segmentation and referring segmentation on eight datasets; (2) finetuned performance that is better than or competitive with other generalist and specialist models on segmentation and VL tasks; and (3) flexibility for efficient finetuning and novel task composition (e.g., referring captioning and image editing). Code, demo, video, and visualization are available at https://x-decoder-vl.github.io.
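The idea of decoding pixel-level and semantic outputs from two kinds of queries in one shared embedding space can be illustrated with a toy sketch. This is not the X-Decoder implementation: the function names, the 2-D embeddings, the thresholding rule, and the toy data are all hypothetical and chosen only for illustration. The sketch assumes that a mask is read off by comparing a query with per-pixel features, and that a label is read off by comparing the same query with text embeddings.

```python
from typing import Dict, List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Plain dot product, used here as the similarity in the shared space."""
    return sum(x * y for x, y in zip(a, b))


def decode_mask(query: Vector, pixel_features: List[List[Vector]]) -> List[List[int]]:
    """Pixel-level output: mark a pixel when its feature is positively aligned with the query."""
    return [[1 if dot(query, feat) > 0 else 0 for feat in row] for row in pixel_features]


def decode_label(query: Vector, text_embeddings: Dict[str, Vector]) -> str:
    """Semantic output: pick the concept whose text embedding is most similar to the query."""
    return max(text_embeddings, key=lambda name: dot(query, text_embeddings[name]))


# Hypothetical toy data: a 2x2 "image" with 2-D per-pixel features.
pixel_features = [
    [(1.0, 0.0), (0.0, 1.0)],
    [(1.0, -0.2), (-1.0, 0.0)],
]
# Hypothetical text embeddings living in the same 2-D space.
text_embeddings = {"cat": (1.0, 0.0), "sky": (0.0, 1.0)}

# (i) A generic, non-semantic query: yields a mask and a label for it.
generic_query = (1.0, -0.5)
generic_mask = decode_mask(generic_query, pixel_features)
generic_label = decode_label(generic_query, text_embeddings)

# (ii) A semantic query induced from text: here simply the embedding of "sky",
# used to pick out the referred region.
semantic_query = text_embeddings["sky"]
referring_mask = decode_mask(semantic_query, pixel_features)
```

Because both outputs are computed against the same space, the same query can produce a mask and a label without a task-specific head, which is the property the abstract attributes to the shared design.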
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| instance-segmentation-on-ade20k-val | X-Decoder (L) | AP: 35.8 |
| instance-segmentation-on-ade20k-val | X-Decoder (Davit-d5, Deform, single-scale, 1280x1280) | AP: 38.7, APL: 59.6, APM: 43.3, APS: 18.9 |
| panoptic-segmentation-on-ade20k-val | X-Decoder (Davit-d5, Deform, single-scale, 1280x1280) | AP: 38.7, PQ: 52.4, mIoU: 59.1 |
| panoptic-segmentation-on-ade20k-val | X-Decoder (L) | AP: 35.8, PQ: 49.6, mIoU: 58.1 |
| referring-expression-segmentation-on-refcocog | X-Decoder (Davit-d5) | Overall IoU: 64.6 |
| zero-shot-segmentation-on-segmentation-in-the | SGinW_Team (X-Decoder-L) | Mean AP: 32.2 |
| zero-shot-segmentation-on-segmentation-in-the | SGinW_Team (X-Decoder-T) | Mean AP: 22.6 |
| zero-shot-segmentation-on-segmentation-in-the | SGinW_Team (X-Decoder-B) | Mean AP: 27.7 |
| zero-shot-segmentation-on-segmentation-in-the | SGinW_Team (X-Decoder-L-IN21K) | Mean AP: 26.6 |