
Abstract
Transformers are rapidly becoming one of the most widely applied deep learning architectures across modalities, domains, and tasks. In computer vision, alongside ongoing work on plain Transformers, hierarchical Transformers have also drawn considerable attention thanks to their strong performance and easy integration into existing frameworks. These models typically employ localized attention mechanisms, such as the sliding-window Neighborhood Attention (NA) or the Shifted Window Self Attention of the Swin Transformer. While such local attention effectively reduces the quadratic complexity of self attention, it weakens two of self attention's most desirable properties: long-range dependency modeling and the global receptive field. This paper introduces Dilated Neighborhood Attention (DiNA), a natural, flexible, and efficient extension of NA that captures more global context and expands the receptive field exponentially at no additional computational cost. NA's local attention and DiNA's sparse global attention complement each other, so we further propose the Dilated Neighborhood Attention Transformer (DiNAT), a new hierarchical vision Transformer built upon both. DiNAT variants achieve significant improvements over strong baselines such as NAT, Swin, and ConvNeXt. Our large model is faster than its Swin counterpart and outperforms it by 1.6% box AP on COCO object detection, 1.4% mask AP on COCO instance segmentation, and 1.4% mIoU on ADE20K semantic segmentation. Paired with new frameworks, our large variant sets a new state of the art for panoptic segmentation on COCO (58.5 PQ) and ADE20K (49.4 PQ) and for instance segmentation on Cityscapes (45.1 AP) and ADE20K (35.4 AP), without extra data. It also matches state-of-the-art specialized semantic segmentation models on ADE20K (58.1 mIoU) and ranks second on Cityscapes (84.5 mIoU), again without extra training data.
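To make the mechanism concrete, the sketch below illustrates dilated neighborhood attention for a single head on a 2D feature map. This is not the official implementation (the authors' released code builds on their NATTEN library); the function name `dilated_neighborhood_attention`, the clamping-based border handling, the omission of relative positional biases, and the naive per-pixel loop are simplifications made here purely for readability.

```python
# Minimal, illustrative sketch of Dilated Neighborhood Attention (DiNA)
# for one image and one attention head. Not the paper's implementation.
import torch
import torch.nn.functional as F

def dilated_neighborhood_attention(q, k, v, kernel_size=7, dilation=2):
    """q, k, v: (H, W, C) tensors; returns a (H, W, C) output."""
    H, W, C = q.shape
    r = kernel_size // 2                          # neighborhood radius
    scale = C ** -0.5
    offsets = torch.arange(-r, r + 1) * dilation  # dilated offsets per axis
    out = torch.empty_like(q)
    for i in range(H):
        for j in range(W):
            # Clamp dilated neighbor coordinates to the feature map; the
            # paper's exact border handling differs -- illustration only.
            ys = (i + offsets).clamp(0, H - 1)
            xs = (j + offsets).clamp(0, W - 1)
            keys = k[ys][:, xs].reshape(-1, C)    # (kernel_size**2, C)
            vals = v[ys][:, xs].reshape(-1, C)
            attn = F.softmax((keys @ q[i, j]) * scale, dim=0)
            out[i, j] = attn @ vals
    return out

x = torch.randn(14, 14, 32)
y = dilated_neighborhood_attention(x, x, x, kernel_size=7, dilation=2)
print(y.shape)  # torch.Size([14, 14, 32])
```

With dilation = 1 this reduces to plain Neighborhood Attention; a larger dilation attends to the same number of keys (kernel_size²) spread over a wider span, which is how DiNA enlarges the receptive field at no extra cost.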
Code Repositories
Benchmarks
| Benchmark | Method | Metrics |
|---|---|---|
| image-classification-on-imagenet | DiNAT-Base | GFLOPs: 13.7, Number of params: 90M, Top 1 Accuracy: 84.4% |
| image-classification-on-imagenet | DiNAT_s-Large (224x224; Pretrained on ImageNet-22K @ 224x224) | GFLOPs: 34.5, Top 1 Accuracy: 86.5% |
| image-classification-on-imagenet | DiNAT-Mini | GFLOPs: 2.7, Number of params: 20M, Top 1 Accuracy: 81.8% |
| image-classification-on-imagenet | DiNAT-Small | GFLOPs: 7.8, Number of params: 51M, Top 1 Accuracy: 83.8% |
| image-classification-on-imagenet | DiNAT-Large (384x384; Pretrained on ImageNet-22K @ 224x224) | GFLOPs: 89.7, Top 1 Accuracy: 87.4% |
| image-classification-on-imagenet | DiNAT-Large (11x11ks; 384res; Pretrained on IN22K@224) | GFLOPs: 92.4, Number of params: 200M, Top 1 Accuracy: 87.5% |
| image-classification-on-imagenet | DiNAT_s-Large (384res; Pretrained on IN22K@224) | GFLOPs: 101.5, Number of params: 197M, Top 1 Accuracy: 87.4% |
| image-classification-on-imagenet | DiNAT-Tiny | GFLOPs: 4.3, Number of params: 28M, Top 1 Accuracy: 82.7% |
| instance-segmentation-on-ade20k-val | DiNAT-L (Mask2Former, single-scale) | AP: 35.4, APL: 55.5, APM: 39.0, APS: 16.3 |
| instance-segmentation-on-cityscapes-val | DiNAT-L (single-scale, Mask2Former) | AP50: 72.6, mask AP: 45.1 |
| instance-segmentation-on-coco-minival | DiNAT-L (single-scale, Mask2Former) | AP50: 75.0, mask AP: 50.8 |
| panoptic-segmentation-on-ade20k-val | DiNAT-L (Mask2Former, 640x640) | AP: 35.0, PQ: 49.4, mIoU: 56.3 |
| panoptic-segmentation-on-cityscapes-val | DiNAT-L (Mask2Former) | AP: 44.5, PQ: 67.2, mIoU: 83.4 |
| panoptic-segmentation-on-coco-minival | DiNAT-L (single-scale, Mask2Former) | AP: 49.2, PQ: 58.5, PQst: 48.8, PQth: 64.9, mIoU: 68.3 |
| semantic-segmentation-on-ade20k | DiNAT-Base (UperNet) | Validation mIoU: 50.4 |
| semantic-segmentation-on-ade20k | DiNAT-Tiny (UperNet) | Validation mIoU: 48.8 |
| semantic-segmentation-on-ade20k | DiNAT_s-Large (UperNet) | Validation mIoU: 54.6 |
| semantic-segmentation-on-ade20k | DiNAT-Small (UperNet) | Validation mIoU: 49.9 |
| semantic-segmentation-on-ade20k | DiNAT-Large (UperNet) | Validation mIoU: 54.9 |
| semantic-segmentation-on-ade20k | DiNAT-L (Mask2Former) | Validation mIoU: 58.1 |
| semantic-segmentation-on-ade20k | DiNAT-Mini (UperNet) | Validation mIoU: 47.2 |
| semantic-segmentation-on-ade20k-val | DiNAT-L (Mask2Former) | mIoU: 58.1 |
| semantic-segmentation-on-cityscapes-val | DiNAT-L (Mask2Former) | mIoU: 84.5 |