
Abstract
Image segmentation is about grouping pixels with different semantics, e.g., category or instance membership, where each choice of semantics defines a task. While only the semantics of each task differ, current research focuses on designing specialized architectures for each task. We present Masked-attention Mask Transformer (Mask2Former), a new architecture capable of addressing any image segmentation task (panoptic, instance or semantic). Its key components include masked attention, which extracts localized features by constraining cross-attention within predicted mask regions. In addition to reducing the research effort by at least three times, it outperforms the best specialized architectures by a significant margin on four popular datasets. Most notably, Mask2Former sets a new state-of-the-art for panoptic segmentation (57.8 PQ on COCO), instance segmentation (50.1 AP on COCO) and semantic segmentation (57.7 mIoU on ADE20K).
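The masked-attention operator is simple to state: for each query, the cross-attention logits are left unchanged where that query's mask prediction from the previous layer is foreground, and set to negative infinity elsewhere. Below is a minimal, single-head PyTorch sketch of this idea; it is an illustrative reconstruction based on the abstract's description (the function name, tensor shapes, and the 0.5 foreground threshold are assumptions), not the official facebookresearch/Mask2Former code.

```python
import torch
import torch.nn.functional as F

def masked_cross_attention(queries, keys, values, mask_logits):
    """Single-head sketch of Mask2Former-style masked cross-attention.

    queries:     (B, Nq, C)  object queries
    keys/values: (B, HW, C)  flattened image features
    mask_logits: (B, Nq, HW) mask predictions from the previous decoder layer
    """
    C = queries.shape[-1]
    # Attend only where the previous layer predicted foreground (sigmoid > 0.5);
    # everywhere else the attention logit is pushed to -inf.
    attn_bias = torch.zeros_like(mask_logits)
    attn_bias = attn_bias.masked_fill(mask_logits.sigmoid() < 0.5, float("-inf"))
    # Fall back to full attention for queries whose predicted mask is empty,
    # otherwise the softmax row would be all -inf and produce NaNs.
    empty = torch.isinf(attn_bias).all(dim=-1, keepdim=True)
    attn_bias = attn_bias.masked_fill(empty, 0.0)
    scores = queries @ keys.transpose(-2, -1) / (C ** 0.5)  # (B, Nq, HW)
    attn = F.softmax(scores + attn_bias, dim=-1)
    return attn @ values  # (B, Nq, C)
```

In the full architecture, this operator replaces standard cross-attention in each Transformer decoder layer, with the mask predictions refined from layer to layer.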
Code Repositories

| Repository | Framework | Notes |
|---|---|---|
| open-mmlab/mmdetection | PyTorch | |
| alibaba/EasyCV | PyTorch | |
| huggingface/transformers | PyTorch | Mentioned in GitHub |
| DdeGeus/Mask2Former-IBS | PyTorch | Mentioned in GitHub |
| facebookresearch/Mask2Former | PyTorch | Official; mentioned in GitHub |
| nihalsid/mask2former | PyTorch | Mentioned in GitHub |
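As a quick start with the huggingface/transformers port listed above, the sketch below runs panoptic inference. It is a hedged example: the checkpoint name facebook/mask2former-swin-large-coco-panoptic and the exact post-processing call reflect the transformers integration as of recent versions and may differ in yours.

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, Mask2FormerForUniversalSegmentation

# Checkpoint name is an assumption; other Mask2Former checkpoints
# exist per task (panoptic/instance/semantic) and backbone.
ckpt = "facebook/mask2former-swin-large-coco-panoptic"
processor = AutoImageProcessor.from_pretrained(ckpt)
model = Mask2FormerForUniversalSegmentation.from_pretrained(ckpt).eval()

image = Image.open("example.jpg").convert("RGB")
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Merge per-query class and mask predictions into a panoptic map at the
# original image resolution (PIL size is (W, H), hence the reversal).
result = processor.post_process_panoptic_segmentation(
    outputs, target_sizes=[image.size[::-1]]
)[0]
panoptic_map = result["segmentation"]   # (H, W) tensor of segment ids
segments = result["segments_info"]      # per-segment label/score metadata
```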
Benchmarks

| Benchmark | Method | Metrics |
|---|---|---|
| instance-segmentation-on-ade20k-val | Mask2Former (Swin-L, single-scale) | AP: 34.9 APL: 54.7 APM: 40.0 APS: 16.3 |
| instance-segmentation-on-ade20k-val | Mask2Former (ResNet-50) | AP: 26.4 APL: 43.1 APM: 28.9 APS: 10.4 |
| instance-segmentation-on-ade20k-val | Mask2Former (Swin-L + FAPN) | AP: 33.4 APL: 54.6 APM: 37.6 APS: 14.6 |
| instance-segmentation-on-cityscapes-val | Mask2Former (Swin-L, single-scale) | mask AP: 43.7 |
| instance-segmentation-on-cityscapes-val | Mask2Former (Swin-S) | mask AP: 41.8 |
| instance-segmentation-on-cityscapes-val | Mask2Former (ResNet-101) | mask AP: 38.5 |
| instance-segmentation-on-cityscapes-val | Mask2Former (Swin-B) | mask AP: 42.0 |
| instance-segmentation-on-cityscapes-val | Mask2Former (Swin-T) | mask AP: 39.7 |
| instance-segmentation-on-cityscapes-val | Mask2Former (ResNet-50) | mask AP: 37.4 |
| instance-segmentation-on-coco | Mask2Former (Swin-L, single scale) | AP50: 74.9 AP75: 54.9 APL: 71.2 APM: 53.8 APS: 29.1 mask AP: 50.5 |
| instance-segmentation-on-coco-minival | Mask2Former (Swin-L) | mask AP: 50.1 |
| instance-segmentation-on-coco-val-panoptic | Mask2Former (Swin-L, single-scale) | AP: 49.1 |
| panoptic-segmentation-on-ade20k-val | Mask2Former (Swin-L) | AP: 34.2 PQ: 48.1 mIoU: 54.5 |
| panoptic-segmentation-on-ade20k-val | Mask2Former (ResNet-50, 640x640) | AP: 26.5 PQ: 39.7 mIoU: 46.1 |
| panoptic-segmentation-on-ade20k-val | Mask2Former (Swin-L + FAPN, 640x640) | AP: 33.2 PQ: 46.2 mIoU: 55.4 |
| panoptic-segmentation-on-ade20k-val | Panoptic-DeepLab (SWideRNet) | PQ: 37.9 mIoU: 50.0 |
| panoptic-segmentation-on-cityscapes-val | Mask2Former (Swin-L) | AP: 43.6 PQ: 66.6 mIoU: 82.9 |
| panoptic-segmentation-on-coco-minival | Mask2Former (single-scale) | AP: 48.6 PQ: 57.8 PQst: 48.1 PQth: 64.2 |
| panoptic-segmentation-on-coco-test-dev | Mask2Former (Swin-L) | PQ: 58.3 PQst: 48.1 PQth: 65.1 |
| semantic-segmentation-on-ade20k | Mask2Former (Swin-L-FaPN, multiscale) | Validation mIoU: 57.7 |
| semantic-segmentation-on-ade20k | Mask2Former (Swin-L-FaPN) | Validation mIoU: 56.4 |
| semantic-segmentation-on-ade20k | Mask2Former (Swin-L) | Validation mIoU: 57.3 |
| semantic-segmentation-on-ade20k | Mask2Former (Swin-B) | Validation mIoU: 55.1 |
| semantic-segmentation-on-ade20k-val | Mask2Former (Swin-L-FaPN, multiscale) | mIoU: 57.7 |
| semantic-segmentation-on-ade20k-val | Mask2Former (Swin-L-FaPN) | mIoU: 56.4 |
| semantic-segmentation-on-cityscapes-val | Mask2Former (Swin-L) | mIoU: 84.3 |
| semantic-segmentation-on-coco-1 | MaskFormer (Swin-L, single-scale) | mIoU: 64.8 |
| semantic-segmentation-on-coco-1 | Mask2Former (Swin-L, single-scale) | mIoU: 67.4 |
| semantic-segmentation-on-mapillary-val | Mask2Former (Swin-L, multiscale) | mIoU: 64.7 |