Bowen Cheng; Ishan Misra; Alexander G. Schwing; Alexander Kirillov; Rohit Girdhar

Abstract
Image segmentation is about grouping pixels with different semantics, e.g., category or instance membership, where each choice of semantics defines a task. While only the semantics of each task differ, current research focuses on designing specialized architectures for each task. We present Masked-attention Mask Transformer (Mask2Former), a new architecture capable of addressing any image segmentation task (panoptic, instance or semantic). Its key components include masked attention, which extracts localized features by constraining cross-attention within predicted mask regions. In addition to reducing the research effort by at least three times, it outperforms the best specialized architectures by a significant margin on four popular datasets. Most notably, Mask2Former sets a new state-of-the-art for panoptic segmentation (57.8 PQ on COCO), instance segmentation (50.1 AP on COCO) and semantic segmentation (57.7 mIoU on ADE20K).
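The masked attention described above can be sketched in a few lines. Below is a minimal, single-head PyTorch illustration, not the official implementation: the function name, shapes, and the empty-mask fallback are assumptions made here for clarity. In the actual model this operation runs inside every Transformer decoder layer, with multi-head attention over multi-scale features, using the mask predicted by the previous layer binarized at 0.5.

```python
import torch
import torch.nn.functional as F

def masked_cross_attention(queries, keys, values, mask_logits, threshold=0.5):
    """Single-head sketch of masked attention (illustrative, not the official code).

    queries:     (N, d)  one embedding per object query
    keys/values: (HW, d) flattened image features
    mask_logits: (N, HW) mask prediction for each query from the previous layer
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / d ** 0.5          # (N, HW) attention logits

    # Additive attention mask: 0 inside each query's predicted foreground
    # region, -inf elsewhere, so softmax assigns zero weight to background.
    zeros = torch.zeros_like(mask_logits)
    attn_mask = torch.where(mask_logits.sigmoid() > threshold,
                            zeros,
                            torch.full_like(mask_logits, float("-inf")))

    # Assumed guard: if a query's predicted mask is entirely empty, fall
    # back to full (unmasked) attention to avoid an all-(-inf) softmax.
    empty = torch.isinf(attn_mask).all(dim=-1, keepdim=True)
    attn_mask = torch.where(empty, zeros, attn_mask)

    attn = F.softmax(scores + attn_mask, dim=-1)  # (N, HW) attention weights
    return attn @ values                          # (N, d) updated queries
```

Each decoder layer refines the queries with this localized cross-attention and then re-predicts the masks, so the region each query attends to tightens as decoding proceeds.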
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| instance-segmentation-on-ade20k-val | Mask2Former (Swin-L, single-scale) | AP: 34.9 APL: 54.7 APM: 40.0 APS: 16.3 |
| instance-segmentation-on-ade20k-val | Mask2Former (ResNet-50) | AP: 26.4 APL: 43.1 APM: 28.9 APS: 10.4 |
| instance-segmentation-on-ade20k-val | Mask2Former (Swin-L + FaPN) | AP: 33.4 APL: 54.6 APM: 37.6 APS: 14.6 |
| instance-segmentation-on-cityscapes-val | Mask2Former (Swin-L, single-scale) | mask AP: 43.7 |
| instance-segmentation-on-cityscapes-val | Mask2Former (Swin-S) | mask AP: 41.8 |
| instance-segmentation-on-cityscapes-val | Mask2Former (ResNet-101) | mask AP: 38.5 |
| instance-segmentation-on-cityscapes-val | Mask2Former (Swin-B) | mask AP: 42.0 |
| instance-segmentation-on-cityscapes-val | Mask2Former (Swin-T) | mask AP: 39.7 |
| instance-segmentation-on-cityscapes-val | Mask2Former (ResNet-50) | mask AP: 37.4 |
| instance-segmentation-on-coco | Mask2Former (Swin-L, single-scale) | AP50: 74.9 AP75: 54.9 APL: 71.2 APM: 53.8 APS: 29.1 mask AP: 50.5 |
| instance-segmentation-on-coco-minival | Mask2Former (Swin-L) | mask AP: 50.1 |
| instance-segmentation-on-coco-val-panoptic | Mask2Former (Swin-L, single-scale) | AP: 49.1 |
| panoptic-segmentation-on-ade20k-val | Mask2Former (Swin-L) | AP: 34.2 PQ: 48.1 mIoU: 54.5 |
| panoptic-segmentation-on-ade20k-val | Mask2Former (ResNet-50, 640x640) | PQ: 39.7 AP: 26.5 mIoU: 46.1 |
| panoptic-segmentation-on-ade20k-val | Mask2Former (Swin-L + FaPN, 640x640) | AP: 33.2 PQ: 46.2 mIoU: 55.4 |
| panoptic-segmentation-on-ade20k-val | Panoptic-DeepLab (SWideRNet) | PQ: 37.9 mIoU: 50.0 |
| panoptic-segmentation-on-cityscapes-val | Mask2Former (Swin-L) | AP: 43.6 PQ: 66.6 mIoU: 82.9 |
| panoptic-segmentation-on-coco-minival | Mask2Former (Swin-L, single-scale) | AP: 48.6 PQ: 57.8 PQst: 48.1 PQth: 64.2 |
| panoptic-segmentation-on-coco-test-dev | Mask2Former (Swin-L) | PQ: 58.3 PQst: 48.1 PQth: 65.1 |
| semantic-segmentation-on-ade20k | Mask2Former (Swin-L-FaPN, multiscale) | Validation mIoU: 57.7 |
| semantic-segmentation-on-ade20k | Mask2Former (Swin-L-FaPN) | Validation mIoU: 56.4 |
| semantic-segmentation-on-ade20k | Mask2Former (Swin-L) | Validation mIoU: 57.3 |
| semantic-segmentation-on-ade20k | Mask2Former (Swin-B) | Validation mIoU: 55.1 |
| semantic-segmentation-on-ade20k-val | Mask2Former (Swin-L-FaPN, multiscale) | mIoU: 57.7 |
| semantic-segmentation-on-ade20k-val | Mask2Former (Swin-L-FaPN) | mIoU: 56.4 |
| semantic-segmentation-on-cityscapes-val | Mask2Former (Swin-L) | mIoU: 84.3 |
| semantic-segmentation-on-coco-1 | MaskFormer (Swin-L, single-scale) | mIoU: 64.8 |
| semantic-segmentation-on-coco-1 | Mask2Former (Swin-L, single-scale) | mIoU: 67.4 |
| semantic-segmentation-on-mapillary-val | Mask2Former (Swin-L, multiscale) | mIoU: 64.7 |