
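Abstract
Fine-tuning a pretrained backbone in the encoder of an image transformer network has been the traditional approach to semantic segmentation. However, this approach leaves out the semantic context that the image itself provides during the encoding stage. This paper argues that incorporating the image's semantic information into pretrained hierarchical Transformer backbones during fine-tuning improves performance considerably. To achieve this, we propose SeMask, a simple and effective framework that integrates semantic information into the encoder through a semantic attention operation. In addition, we use a lightweight semantic decoder during training to supervise the intermediate semantic prior maps at every stage. Our experiments show that incorporating semantic priors improves the performance of established hierarchical encoders at the cost of only a slight increase in floating-point operations (FLOPs). We provide empirical support by integrating SeMask into Swin Transformer and Mix Transformer backbones and pairing them with several different decoders. The framework achieves a new state of the art of 58.25% mIoU on the ADE20K dataset and improves mIoU by more than 3% on the Cityscapes dataset. Code and model weights are publicly available at https://github.com/Picsart-AI-Research/SeMask-Segmentation.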
Code repository
Picsart-AI-Research/SeMask-Segmentation
Official
pytorch
Mentioned on GitHub
Benchmarks
| Benchmark | Method | Metrics |
|---|---|---|
| semantic-segmentation-on-ade20k | SeMask (SeMask Swin-L MSFaPN-Mask2Former) | Validation mIoU: 58.2 |
| semantic-segmentation-on-ade20k | SeMask (SeMask Swin-S FPN) | Params (M): 56; Validation mIoU: 47.63 |
| semantic-segmentation-on-ade20k | SeMask (SeMask Swin-L MSFaPN-Mask2Former, single-scale) | Validation mIoU: 57.0 |
| semantic-segmentation-on-ade20k | SeMask (SeMask Swin-L Mask2Former) | Validation mIoU: 57.5 |
| semantic-segmentation-on-ade20k | SeMask (SeMask Swin-B FPN) | Params (M): 96; Validation mIoU: 50.98 |
| semantic-segmentation-on-ade20k | SeMask (SeMask Swin-L FPN) | Validation mIoU: 53.52 |
| semantic-segmentation-on-ade20k | SeMask (SeMask Swin-T FPN) | Params (M): 35; Validation mIoU: 43.16 |
| semantic-segmentation-on-ade20k | SeMask (SeMask Swin-L MaskFormer) | Validation mIoU: 56.2 |
| semantic-segmentation-on-ade20k | SeMask (SeMask Swin-L FaPN-Mask2Former) | Validation mIoU: 58.2 |
| semantic-segmentation-on-ade20k-val | SeMask (SeMask Swin-L FaPN-Mask2Former) | mIoU: 58.2 |
| semantic-segmentation-on-ade20k-val | SeMask (SeMask Swin-L MSFaPN-Mask2Former, single-scale) | mIoU: 57.0 |
| semantic-segmentation-on-ade20k-val | SeMask (SeMask Swin-L MaskFormer) | mIoU: 56.2 |
| semantic-segmentation-on-ade20k-val | SeMask (SeMask Swin-L FPN) | mIoU: 53.5 |
| semantic-segmentation-on-ade20k-val | SeMask (SeMask Swin-L Mask2Former) | mIoU: 57.5 |
| semantic-segmentation-on-ade20k-val | SeMask (SeMask Swin-L MSFaPN-Mask2Former) | mIoU: 58.2 |
| semantic-segmentation-on-cityscapes-val | SeMask (SeMask Swin-L FPN) | mIoU: 80.39 |
| semantic-segmentation-on-cityscapes-val | SeMask (SeMask Swin-L Mask2Former) | mIoU: 84.98 |
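
The mIoU figures in the table are mean intersection-over-union scores. As a reminder of what the metric measures, here is a minimal sketch in plain Python of how it is commonly computed from a confusion matrix. It is not taken from the SeMask repository: the function names, the toy labels, and the ignore label `255` are illustrative assumptions, and real evaluations accumulate the confusion matrix over every pixel of the validation set.

```python
from typing import List, Sequence


def confusion_matrix(pred: Sequence[int], target: Sequence[int],
                     num_classes: int, ignore_index: int = 255) -> List[List[int]]:
    """Count label pairs; rows are ground-truth classes, columns are predictions.

    `ignore_index=255` is an assumed convention for unlabeled pixels.
    """
    cm = [[0] * num_classes for _ in range(num_classes)]
    for p, t in zip(pred, target):
        if t == ignore_index:
            continue  # unlabeled pixels do not contribute to the score
        cm[t][p] += 1
    return cm


def mean_iou(cm: List[List[int]]) -> float:
    """Average the per-class IoU = TP / (TP + FP + FN) over classes that occur."""
    n = len(cm)
    ious = []
    for c in range(n):
        tp = cm[c][c]
        fn = sum(cm[c]) - tp                        # class c pixels predicted as something else
        fp = sum(cm[r][c] for r in range(n)) - tp   # other pixels predicted as class c
        denom = tp + fp + fn
        if denom == 0:
            continue  # class absent from both prediction and ground truth
        ious.append(tp / denom)
    return sum(ious) / len(ious) if ious else float("nan")


# Toy example with three classes and one ignored pixel (hypothetical data).
pred = [0, 0, 1, 1, 2, 2, 2, 0]
target = [0, 1, 1, 1, 2, 2, 0, 255]
cm = confusion_matrix(pred, target, num_classes=3)
miou = mean_iou(cm)
print(f"mIoU = {100 * miou:.2f}%")
```

In this toy case the per-class IoUs are 1/3, 2/3, and 2/3, so the mean is 5/9. Because every class counts equally regardless of how many pixels it covers, a model can gain mIoU by improving rare classes even when overall pixel accuracy barely changes.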