Masked-attention Mask Transformer for Universal Image Segmentation

Bowen Cheng, Ishan Misra, Alexander G. Schwing, Alexander Kirillov, Rohit Girdhar

Abstract

Image segmentation is about grouping pixels with different semantics, e.g., category or instance membership, where each choice of semantics defines a task. While only the semantics of each task differ, current research focuses on designing specialized architectures for each task. We present Masked-attention Mask Transformer (Mask2Former), a new architecture capable of addressing any image segmentation task (panoptic, instance or semantic). Its key components include masked attention, which extracts localized features by constraining cross-attention within predicted mask regions. In addition to reducing the research effort by at least three times, it outperforms the best specialized architectures by a significant margin on four popular datasets. Most notably, Mask2Former sets a new state-of-the-art for panoptic segmentation (57.8 PQ on COCO), instance segmentation (50.1 AP on COCO) and semantic segmentation (57.7 mIoU on ADE20K).
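The masked attention described in the abstract restricts each query's cross-attention to the pixels inside that query's currently predicted mask, rather than the full feature map. Below is a minimal NumPy sketch of this idea; the function name, shapes, and single-head form are illustrative assumptions, not the paper's implementation (which applies this inside a multi-head Transformer decoder with intermediate mask predictions):

```python
import numpy as np

def masked_attention(queries, keys, values, region_mask):
    """Cross-attention restricted to predicted mask regions (simplified sketch).

    queries:     (Q, d) object-query embeddings
    keys/values: (N, d) per-pixel features
    region_mask: (Q, N) boolean, True where pixel n lies inside the
                 mask currently predicted for query q
    """
    d = queries.shape[-1]
    logits = queries @ keys.T / np.sqrt(d)        # (Q, N) scaled dot-product logits
    logits = np.where(region_mask, logits, -1e9)  # block out-of-region pixels
    # numerically stable softmax over the pixel axis
    weights = np.exp(logits - logits.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ values                       # (Q, d) updated query features
```

Because out-of-region logits are pushed to a large negative value before the softmax, their weights vanish, so each query aggregates only localized features from its own mask region, as the abstract describes.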

Code Repositories

facebookresearch/Mask2Former (official; PyTorch)
huggingface/transformers (PyTorch)
DdeGeus/Mask2Former-IBS (PyTorch)
nihalsid/mask2former (PyTorch)

Benchmarks

Benchmark | Methodology | Metrics
instance-segmentation-on-ade20k-val | Mask2Former (Swin-L, single-scale) | AP: 34.9, APL: 54.7, APM: 40.0, APS: 16.3
instance-segmentation-on-ade20k-val | Mask2Former (ResNet-50) | AP: 26.4, APL: 43.1, APM: 28.9, APS: 10.4
instance-segmentation-on-ade20k-val | Mask2Former (Swin-L + FaPN) | AP: 33.4, APL: 54.6, APM: 37.6, APS: 14.6
instance-segmentation-on-cityscapes-val | Mask2Former (Swin-L, single-scale) | mask AP: 43.7
instance-segmentation-on-cityscapes-val | Mask2Former (Swin-S) | mask AP: 41.8
instance-segmentation-on-cityscapes-val | Mask2Former (ResNet-101) | mask AP: 38.5
instance-segmentation-on-cityscapes-val | Mask2Former (Swin-B) | mask AP: 42.0
instance-segmentation-on-cityscapes-val | Mask2Former (Swin-T) | mask AP: 39.7
instance-segmentation-on-cityscapes-val | Mask2Former (ResNet-50) | mask AP: 37.4
instance-segmentation-on-coco | Mask2Former (Swin-L, single-scale) | mask AP: 50.5, AP50: 74.9, AP75: 54.9, APL: 71.2, APM: 53.8, APS: 29.1
instance-segmentation-on-coco-minival | Mask2Former (Swin-L) | mask AP: 50.1
instance-segmentation-on-coco-val-panoptic | Mask2Former (Swin-L, single-scale) | AP: 49.1
panoptic-segmentation-on-ade20k-val | Mask2Former (Swin-L) | PQ: 48.1, AP: 34.2, mIoU: 54.5
panoptic-segmentation-on-ade20k-val | Mask2Former (ResNet-50, 640x640) | PQ: 39.7, AP: 26.5, mIoU: 46.1
panoptic-segmentation-on-ade20k-val | Mask2Former (Swin-L + FaPN, 640x640) | PQ: 46.2, AP: 33.2, mIoU: 55.4
panoptic-segmentation-on-ade20k-val | Panoptic-DeepLab (SwideRNet) | PQ: 37.9, mIoU: 50.0
panoptic-segmentation-on-cityscapes-val | Mask2Former (Swin-L) | PQ: 66.6, AP: 43.6, mIoU: 82.9
panoptic-segmentation-on-coco-minival | Mask2Former (single-scale) | PQ: 57.8, PQth: 64.2, PQst: 48.1, AP: 48.6
panoptic-segmentation-on-coco-test-dev | Mask2Former (Swin-L) | PQ: 58.3, PQth: 65.1, PQst: 48.1
semantic-segmentation-on-ade20k | Mask2Former (Swin-L-FaPN, multiscale) | Validation mIoU: 57.7
semantic-segmentation-on-ade20k | Mask2Former (Swin-L-FaPN) | Validation mIoU: 56.4
semantic-segmentation-on-ade20k | Mask2Former (Swin-L) | Validation mIoU: 57.3
semantic-segmentation-on-ade20k | Mask2Former (Swin-B) | Validation mIoU: 55.1
semantic-segmentation-on-ade20k-val | Mask2Former (Swin-L-FaPN, multiscale) | mIoU: 57.7
semantic-segmentation-on-ade20k-val | Mask2Former (Swin-L-FaPN) | mIoU: 56.4
semantic-segmentation-on-cityscapes-val | Mask2Former (Swin-L) | mIoU: 84.3
semantic-segmentation-on-coco-1 | MaskFormer (Swin-L, single-scale) | mIoU: 64.8
semantic-segmentation-on-coco-1 | Mask2Former (Swin-L, single-scale) | mIoU: 67.4
semantic-segmentation-on-mapillary-val | Mask2Former (Swin-L, multiscale) | mIoU: 64.7
