
Abstract
Universal image segmentation is not a new concept. Attempts to unify image segmentation over the past decades include scene parsing, panoptic segmentation, and, more recently, new panoptic architectures. However, such panoptic architectures do not truly unify image segmentation, because they must be trained individually on semantic, instance, or panoptic segmentation to achieve the best performance. Ideally, a truly universal framework should be trained only once and achieve state-of-the-art (SOTA) performance across all three image segmentation tasks. To that end, we propose OneFormer, a universal image segmentation framework with a multi-task train-once design. First, we propose a task-conditioned joint training strategy that enables training on the ground truths of each domain (semantic, instance, and panoptic segmentation) within a single multi-task training process. Second, we introduce a task token to condition the model on the task at hand, making the model task-dynamic and supporting multi-task training and inference. Finally, we use a query-text contrastive loss during training to establish better inter-task and inter-class distinctions. Notably, our single OneFormer model outperforms specialized Mask2Former models on all three segmentation tasks on ADE20K, Cityscapes, and COCO, even though the latter were trained separately on each task with three times the resources. With the new ConvNeXt and DiNAT backbones, we observe even larger performance gains. We believe OneFormer is a significant step towards making image segmentation more universal and accessible. To support further research, we have open-sourced our code and models at https://github.com/SHI-Labs/OneFormer.
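As a rough illustration of the third contribution, the sketch below implements a symmetric, InfoNCE-style contrastive loss between object-query and text-query embeddings in PyTorch. It is a minimal sketch and not the paper's exact formulation: the 1:1 row pairing, the pooling of queries into one embedding each, and the temperature value are all illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def query_text_contrastive_loss(obj_queries, text_queries, temperature=0.07):
    """Symmetric query-text contrastive loss (illustrative sketch).

    obj_queries:  (N, D) object-query embeddings
    text_queries: (N, D) text embeddings; row i is assumed paired with query i
    """
    obj = F.normalize(obj_queries, dim=-1)
    txt = F.normalize(text_queries, dim=-1)
    logits = obj @ txt.t() / temperature               # (N, N) cosine similarities
    targets = torch.arange(obj.size(0), device=obj.device)
    # Matching pairs sit on the diagonal; contrast each query against
    # every text, and each text against every query.
    loss_q2t = F.cross_entropy(logits, targets)
    loss_t2q = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_q2t + loss_t2q)
```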
Code Repositories
- SHI-Labs/OneFormer (official, PyTorch; mentioned in GitHub)
- huggingface/transformers (PyTorch; mentioned in GitHub)
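Since OneFormer is available in huggingface/transformers, the snippet below sketches single-model, multi-task inference using the documented `OneFormerProcessor` and `OneFormerForUniversalSegmentation` classes. The checkpoint name is one of the publicly released ADE20K models on the Hugging Face Hub; switching `task_inputs` selects the task token without changing the weights.

```python
import requests
import torch
from PIL import Image
from transformers import OneFormerProcessor, OneFormerForUniversalSegmentation

# Publicly released ADE20K checkpoint; larger backbones follow the same
# naming scheme on the Hugging Face Hub.
ckpt = "shi-labs/oneformer_ade20k_swin_tiny"
processor = OneFormerProcessor.from_pretrained(ckpt)
model = OneFormerForUniversalSegmentation.from_pretrained(ckpt)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# The same weights serve all three tasks; only the task token changes.
for task in ["semantic", "instance", "panoptic"]:
    inputs = processor(images=image, task_inputs=[task], return_tensors="pt")
    with torch.no_grad():
        outputs = model(**inputs)

# Task-specific post-processing, shown here for the last (panoptic) output.
panoptic = processor.post_process_panoptic_segmentation(
    outputs, target_sizes=[image.size[::-1]]
)[0]
print(panoptic["segmentation"].shape, len(panoptic["segments_info"]))
```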
Benchmarks
| Benchmark | Method | Metrics |
|---|---|---|
| instance-segmentation-on-ade20k-val | OneFormer (DiNAT-L, single-scale) | AP: 36.0 |
| instance-segmentation-on-ade20k-val | OneFormer (Swin-L, single-scale) | AP: 35.9 |
| instance-segmentation-on-ade20k-val | OneFormer (DiNAT-L, single-scale, 1280x1280, COCO-pretrained) | AP: 40.2 APL: 59.7 APM: 44.4 APS: 19.2 |
| instance-segmentation-on-ade20k-val | OneFormer (InternImage-H, emb_dim=1024, single-scale, 896x896, COCO-pretrained) | AP: 44.2 APL: 64.3 APM: 49.9 APS: 23.7 |
| instance-segmentation-on-cityscapes-val | OneFormer (ConvNeXt-L, single-scale, Mapillary-pretrained) | mask AP: 48.7 |
| instance-segmentation-on-cityscapes-val | OneFormer (Swin-L, single-scale) | mask AP: 45.6 |
| instance-segmentation-on-cityscapes-val | OneFormer (DiNAT-L, single-scale) | mask AP: 45.6 |
| instance-segmentation-on-coco-val-panoptic | OneFormer (Swin-L, single-scale) | AP: 49.0 |
| instance-segmentation-on-coco-val-panoptic | OneFormer (InternImage-H, emb_dim=1024, single-scale) | AP: 52.0 |
| instance-segmentation-on-coco-val-panoptic | OneFormer (DiNAT-L, single-scale) | AP: 49.2 |
| panoptic-segmentation-on-ade20k-val | OneFormer (ConvNeXt-L, single-scale, 640x640) | AP: 36.2 PQ: 50.0 mIoU: 56.6 |
| panoptic-segmentation-on-ade20k-val | OneFormer (DiNAT-L, single-scale, 640x640) | AP: 36.0 PQ: 50.5 mIoU: 58.3 |
| panoptic-segmentation-on-ade20k-val | OneFormer (InternImage-H, emb_dim=256, single-scale, 896x896) | AP: 40.2 PQ: 54.5 mIoU: 60.4 |
| panoptic-segmentation-on-ade20k-val | OneFormer (DiNAT-L, single-scale, 1280x1280, COCO-pretrained) | PQ: 53.4 mIoU: 58.9 |
| panoptic-segmentation-on-ade20k-val | OneFormer (ConvNeXt-XL, single-scale, 640x640) | AP: 36.3 PQ: 50.1 mIoU: 57.4 |
| panoptic-segmentation-on-ade20k-val | OneFormer (DiNAT-L, single-scale, 1280x1280) | AP: 37.1 PQ: 51.5 mIoU: 58.3 |
| panoptic-segmentation-on-ade20k-val | OneFormer (Swin-L, single-scale, 1280x1280) | AP: 37.8 PQ: 51.4 mIoU: 57.0 |
| panoptic-segmentation-on-ade20k-val | OneFormer (Swin-L, single-scale, 640x640) | AP: 35.9 PQ: 49.8 mIoU: 57.0 |
| panoptic-segmentation-on-cityscapes-test | OneFormer (ConvNeXt-L, single-scale, Mapillary Vistas-pretrained) | PQ: 68.0 |
| panoptic-segmentation-on-cityscapes-val | OneFormer (ConvNeXt-XL, single-scale) | AP: 46.7 PQ: 68.4 mIoU: 83.6 |
| panoptic-segmentation-on-cityscapes-val | OneFormer (Swin-L, single-scale) | AP: 45.6 PQ: 67.2 mIoU: 83.0 |
| panoptic-segmentation-on-cityscapes-val | OneFormer (DiNAT-L, single-scale) | AP: 45.6 PQ: 67.6 mIoU: 83.1 |
| panoptic-segmentation-on-cityscapes-val | OneFormer (ConvNeXt-L, single-scale, 512x1024, Mapillary Vistas-pretrained) | AP: 48.7 PQ: 70.1 PQst: 74.1 PQth: 64.6 mIoU: 84.6 |
| panoptic-segmentation-on-cityscapes-val | OneFormer (ConvNeXt-L, single-scale) | AP: 46.5 PQ: 68.51 mIoU: 83.0 |
| panoptic-segmentation-on-coco-minival | OneFormer (InternImage-H, single-scale) | AP: 52.0 PQ: 60.0 PQst: 49.2 PQth: 67.1 mIoU: 68.8 |
| panoptic-segmentation-on-coco-minival | OneFormer (Swin-L, single-scale) | AP: 49.0 PQ: 57.9 PQst: 48.0 PQth: 64.4 mIoU: 67.4 |
| panoptic-segmentation-on-coco-minival | OneFormer (DiNAT-L, single-scale) | AP: 49.2 PQ: 58.0 PQst: 48.4 PQth: 64.3 mIoU: 68.1 |
| panoptic-segmentation-on-mapillary-val | OneFormer (DiNAT-L, single-scale) | PQ: 46.7 PQst: 54.9 PQth: 40.5 mIoU: 61.7 |
| panoptic-segmentation-on-mapillary-val | OneFormer (ConvNeXt-L, single-scale) | PQ: 46.4 PQst: 54.0 PQth: 40.6 mIoU: 61.6 |
| semantic-segmentation-on-ade20k-val | OneFormer (InternImage-H, emb_dim=256, multi-scale, 896x896) | mIoU: 60.8 |
| semantic-segmentation-on-ade20k-val | OneFormer (Swin-L, multi-scale, 640x640) | mIoU: 57.7 |
| semantic-segmentation-on-ade20k-val | OneFormer (DiNAT-L, multi-scale, 896x896) | mIoU: 58.6 |
| semantic-segmentation-on-ade20k-val | OneFormer (Swin-L, multi-scale, 896x896) | mIoU: 58.3 |
| semantic-segmentation-on-ade20k-val | OneFormer (DiNAT-L, multi-scale, 640x640) | mIoU: 58.4 |
| semantic-segmentation-on-cityscapes-val | OneFormer (ConvNeXt-XL, multi-scale) | mIoU: 84.6 |
| semantic-segmentation-on-cityscapes-val | OneFormer (Swin-L, multi-scale) | mIoU: 84.4 |
| semantic-segmentation-on-cityscapes-val | OneFormer (ConvNeXt-XL, Mapillary, multi-scale) | mIoU: 85.8 |
| semantic-segmentation-on-coco-1 | OneFormer (InternImage-H, emb_dim=1024, single-scale) | mIoU: 68.8 |
| semantic-segmentation-on-coco-1 | OneFormer (Swin-L, single-scale) | mIoU: 67.4 |
| semantic-segmentation-on-coco-1 | OneFormer (DiNAT-L, single-scale) | mIoU: 68.1 |
| semantic-segmentation-on-mapillary-val | OneFormer (DiNAT-L, multi-scale) | mIoU: 64.9 |
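For reference when reading the PQ columns above: panoptic quality (Kirillov et al., 2019) factors into segmentation quality (SQ) and recognition quality (RQ). The sketch below shows the metric for a single class, assuming predictions and ground truths have already been uniquely matched at IoU > 0.5; the reported scores average this quantity over classes.

```python
def panoptic_quality(matched_ious, num_fp, num_fn):
    """Panoptic Quality for one class: PQ = SQ * RQ (Kirillov et al., 2019).

    matched_ious: IoU of each matched prediction/ground-truth pair (each > 0.5)
    num_fp:       unmatched predicted segments (false positives)
    num_fn:       unmatched ground-truth segments (false negatives)
    """
    tp = len(matched_ious)
    if tp + num_fp + num_fn == 0:
        return 0.0
    sq = sum(matched_ious) / tp if tp else 0.0       # segmentation quality
    rq = tp / (tp + 0.5 * num_fp + 0.5 * num_fn)     # recognition quality
    return sq * rq

# e.g. three matches with IoUs 0.9, 0.8, 0.7, one FP, one FN:
print(panoptic_quality([0.9, 0.8, 0.7], num_fp=1, num_fn=1))  # 0.8 * 0.75 = 0.6
```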