Bowen Cheng; Ishan Misra; Alexander G. Schwing; Alexander Kirillov; Rohit Girdhar

Abstract
Image segmentation is about grouping pixels with different semantics, e.g., category or instance membership, where each choice of semantics defines a task. While only the semantics of each task differ, current research focuses on designing specialized architectures for each task. We present Masked-attention Mask Transformer (Mask2Former), a new architecture capable of addressing any image segmentation task (panoptic, instance or semantic). Its key components include masked attention, which extracts localized features by constraining cross-attention within predicted mask regions. In addition to reducing the research effort by at least three times, it outperforms the best specialized architectures by a significant margin on four popular datasets. Most notably, Mask2Former sets a new state-of-the-art for panoptic segmentation (57.8 PQ on COCO), instance segmentation (50.1 AP on COCO) and semantic segmentation (57.7 mIoU on ADE20K).
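The masked attention described above can be sketched in a few lines. Below is a minimal, single-head PyTorch illustration, not the official implementation: the function name, shapes, and the empty-mask fallback are assumptions made here for clarity. In the actual model this operation runs inside every Transformer decoder layer, with multi-head attention over multi-scale features, using the mask predicted by the previous layer binarized at 0.5.

```python
import torch
import torch.nn.functional as F

def masked_cross_attention(queries, keys, values, mask_logits, threshold=0.5):
    """Single-head sketch of masked attention (illustrative, not the official code).

    queries:     (N, d)  one embedding per object query
    keys/values: (HW, d) flattened image features
    mask_logits: (N, HW) mask prediction for each query from the previous layer
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / d ** 0.5          # (N, HW) attention logits

    # Additive attention mask: 0 inside each query's predicted foreground
    # region, -inf elsewhere, so softmax assigns zero weight to background.
    zeros = torch.zeros_like(mask_logits)
    attn_mask = torch.where(mask_logits.sigmoid() > threshold,
                            zeros,
                            torch.full_like(mask_logits, float("-inf")))

    # Assumed guard: if a query's predicted mask is entirely empty, fall
    # back to full (unmasked) attention to avoid an all-(-inf) softmax.
    empty = torch.isinf(attn_mask).all(dim=-1, keepdim=True)
    attn_mask = torch.where(empty, zeros, attn_mask)

    attn = F.softmax(scores + attn_mask, dim=-1)  # (N, HW) attention weights
    return attn @ values                          # (N, d) updated queries
```

Each decoder layer refines the queries with this localized cross-attention and then re-predicts the masks, so the region each query attends to tightens as decoding proceeds.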
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| instance-segmentation-on-ade20k-val | Mask2Former (Swin-L, single-scale) | AP: 34.9 APL: 54.7 APM: 40.0 APS: 16.3 |
| instance-segmentation-on-ade20k-val | Mask2Former (ResNet-50) | AP: 26.4 APL: 43.1 APM: 28.9 APS: 10.4 |
| instance-segmentation-on-ade20k-val | Mask2Former (Swin-L + FaPN) | AP: 33.4 APL: 54.6 APM: 37.6 APS: 14.6 |
| instance-segmentation-on-cityscapes-val | Mask2Former (Swin-L, single-scale) | mask AP: 43.7 |
| instance-segmentation-on-cityscapes-val | Mask2Former (Swin-S) | mask AP: 41.8 |
| instance-segmentation-on-cityscapes-val | Mask2Former (ResNet-101) | mask AP: 38.5 |
| instance-segmentation-on-cityscapes-val | Mask2Former (Swin-B) | mask AP: 42.0 |
| instance-segmentation-on-cityscapes-val | Mask2Former (Swin-T) | mask AP: 39.7 |
| instance-segmentation-on-cityscapes-val | Mask2Former (ResNet-50) | mask AP: 37.4 |
| instance-segmentation-on-coco | Mask2Former (Swin-L, single-scale) | AP50: 74.9 AP75: 54.9 APL: 71.2 APM: 53.8 APS: 29.1 mask AP: 50.5 |
| instance-segmentation-on-coco-minival | Mask2Former (Swin-L) | mask AP: 50.1 |
| instance-segmentation-on-coco-val-panoptic | Mask2Former (Swin-L, single-scale) | AP: 49.1 |
| panoptic-segmentation-on-ade20k-val | Mask2Former (Swin-L) | AP: 34.2 PQ: 48.1 mIoU: 54.5 |
| panoptic-segmentation-on-ade20k-val | Mask2Former (ResNet-50, 640x640) | PQ: 39.7 AP: 26.5 mIoU: 46.1 |
| panoptic-segmentation-on-ade20k-val | Mask2Former (Swin-L + FaPN, 640x640) | AP: 33.2 PQ: 46.2 mIoU: 55.4 |
| panoptic-segmentation-on-ade20k-val | Panoptic-DeepLab (SWideRNet) | PQ: 37.9 mIoU: 50.0 |
| panoptic-segmentation-on-cityscapes-val | Mask2Former (Swin-L) | AP: 43.6 PQ: 66.6 mIoU: 82.9 |
| panoptic-segmentation-on-coco-minival | Mask2Former (Swin-L, single-scale) | AP: 48.6 PQ: 57.8 PQst: 48.1 PQth: 64.2 |
| panoptic-segmentation-on-coco-test-dev | Mask2Former (Swin-L) | PQ: 58.3 PQst: 48.1 PQth: 65.1 |
| semantic-segmentation-on-ade20k | Mask2Former (Swin-L-FaPN, multiscale) | Validation mIoU: 57.7 |
| semantic-segmentation-on-ade20k | Mask2Former (Swin-L-FaPN) | Validation mIoU: 56.4 |
| semantic-segmentation-on-ade20k | Mask2Former (Swin-L) | Validation mIoU: 57.3 |
| semantic-segmentation-on-ade20k | Mask2Former (Swin-B) | Validation mIoU: 55.1 |
| semantic-segmentation-on-ade20k-val | Mask2Former (Swin-L-FaPN, multiscale) | mIoU: 57.7 |
| semantic-segmentation-on-ade20k-val | Mask2Former (Swin-L-FaPN) | mIoU: 56.4 |
| semantic-segmentation-on-cityscapes-val | Mask2Former (Swin-L) | mIoU: 84.3 |
| semantic-segmentation-on-coco-1 | MaskFormer (Swin-L, single-scale) | mIoU: 64.8 |
| semantic-segmentation-on-coco-1 | Mask2Former (Swin-L, single-scale) | mIoU: 67.4 |
| semantic-segmentation-on-mapillary-val | Mask2Former (Swin-L, multiscale) | mIoU: 64.7 |