
Abstract
While originally designed for natural language processing tasks, the self-attention mechanism has recently taken various computer vision fields by storm. However, the 2D nature of images brings three challenges for applying self-attention in computer vision: (1) treating images as 1D sequences neglects their inherent 2D structure; (2) the quadratic complexity is too expensive for high-resolution images; (3) it only captures spatial adaptability while ignoring channel adaptability. To address these issues, we propose a novel linear attention mechanism, Large Kernel Attention (LKA), which enables adaptability and long-range dependence while avoiding the above drawbacks. Furthermore, we build a new neural network based on LKA, namely Visual Attention Network (VAN). While extremely simple, VAN significantly surpasses similarly-sized vision transformers (ViTs) and convolutional neural networks (CNNs) across various visual tasks, including image classification, object detection, semantic segmentation, panoptic segmentation, pose estimation, and more. For example, VAN-B6 achieves 87.8% accuracy on the ImageNet benchmark and sets a new state-of-the-art record (58.2 PQ) for panoptic segmentation. In addition, VAN-B2 surpasses Swin-T by 4% mIoU (50.1% vs. 46.1%) for semantic segmentation on ADE20K and by 2.6% AP (48.8% vs. 46.2%) for object detection on COCO. This work provides the community with a novel method and a simple yet strong baseline. Code is available at https://github.com/Visual-Attention-Network.
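The key to LKA's linear complexity is decomposing a large K×K convolution into three cheap pieces: a (2d−1)×(2d−1) depth-wise convolution, a ⌈K/d⌉×⌈K/d⌉ depth-wise dilated convolution with dilation d, and a 1×1 point-wise convolution (the paper uses K=21 and d=3, i.e. 5×5 DW + 7×7 DW-dilated + 1×1). A minimal sketch of the resulting parameter savings, assuming this decomposition rule; the function names here are illustrative, not from the released code:

```python
import math

def lka_param_count(channels: int, K: int = 21, d: int = 3) -> int:
    """Parameter count of the LKA decomposition of a K x K convolution:
    (2d-1) x (2d-1) depth-wise conv + ceil(K/d) x ceil(K/d) depth-wise
    dilated conv (dilation d) + 1 x 1 point-wise conv."""
    dw = (2 * d - 1) ** 2 * channels           # depth-wise conv: one filter per channel
    dw_d = math.ceil(K / d) ** 2 * channels    # depth-wise dilated conv
    pw = channels * channels                   # point-wise (1x1) conv mixes channels
    return dw + dw_d + pw

def dense_param_count(channels: int, K: int = 21) -> int:
    """Parameter count of a standard dense K x K convolution."""
    return K * K * channels * channels

print(lka_param_count(64))    # 8832
print(dense_param_count(64))  # 1806336
```

For 64 channels, the decomposition needs roughly 200× fewer parameters than a dense 21×21 convolution while covering a comparably large receptive field, which is why VAN can afford large-kernel attention at every stage.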
Code Repositories
- Visual-Attention-Network/VAN-Segmentation (pytorch) · Mentioned on GitHub
- Westlake-AI/openmixup (pytorch) · Mentioned on GitHub
- PaddlePaddle/PaddleClas (paddle)
- lucasjinreal/yolov7_d2 (pytorch) · Mentioned on GitHub
- Visual-Attention-Network/VAN-Classification (pytorch) · Official · Mentioned on GitHub
- pwc-1/Paper-9/tree/main/5/van (mindspore)
- open-mmlab/mmclassification (pytorch)
- EMalagoli92/VAN-Classification-TensorFlow (tf) · Mentioned on GitHub
- huggingface/transformers (pytorch) · Mentioned on GitHub
- flytocc/PaddleClas (paddle)
- DarshanDeshpande/jax-models (jax) · Mentioned on GitHub
- chengtan9907/simvpv2 (pytorch) · Mentioned on GitHub
- Jittor-Image-Models/Jittor-Image-Models (pytorch) · Mentioned on GitHub
- sithu31296/semantic-segmentation (pytorch) · Mentioned on GitHub
- facebookresearch/xformers (pytorch) · Mentioned on GitHub
- MindCode-4/code-1/tree/main/van (mindspore)
Benchmarks
| Benchmark | Method | Metrics |
|---|---|---|
| image-classification-on-imagenet | VAN-B6 (22K) | GFLOPs: 38.9 Number of params: 200M Top 1 Accuracy: 86.9% |
| image-classification-on-imagenet | VAN-B4 (22K, 384res) | GFLOPs: 35.9 Number of params: 60M Top 1 Accuracy: 86.6% |
| image-classification-on-imagenet | VAN-B5 (22K, 384res) | GFLOPs: 50.6 Top 1 Accuracy: 87% |
| image-classification-on-imagenet | VAN-B2 | GFLOPs: 5 Number of params: 26.6M Top 1 Accuracy: 82.8% |
| image-classification-on-imagenet | VAN-B5 (22K) | GFLOPs: 17.2 Number of params: 90M Top 1 Accuracy: 86.3% |
| image-classification-on-imagenet | VAN-B1 | GFLOPs: 2.5 Number of params: 13.9M Top 1 Accuracy: 81.1% |
| image-classification-on-imagenet | VAN-B4 (22K) | GFLOPs: 12.2 Top 1 Accuracy: 85.7% |
| image-classification-on-imagenet | VAN-B6 (22K, 384res) | GFLOPs: 114.3 Number of params: 200M Top 1 Accuracy: 87.8% |
| image-classification-on-imagenet | VAN-B0 | GFLOPs: 0.9 Number of params: 4.1M Top 1 Accuracy: 75.4% |
| panoptic-segmentation-on-coco-minival | Visual Attention Network (VAN-B6 + Mask2Former) | PQ: 58.2 PQst: 48.2 PQth: 64.8 |
| panoptic-segmentation-on-coco-panoptic | VAN-B6* | PQ: 58.2 |
| semantic-segmentation-on-ade20k | VAN-Large | Params (M): 49 Validation mIoU: 48.1 |
| semantic-segmentation-on-ade20k | VAN-Tiny | Params (M): 8 Validation mIoU: 38.5 |
| semantic-segmentation-on-ade20k | VAN-Small | Params (M): 18 Validation mIoU: 42.9 |
| semantic-segmentation-on-ade20k | VAN-B6 | Validation mIoU: 54.7 |
| semantic-segmentation-on-ade20k | VAN-Base (Semantic-FPN) | Validation mIoU: 46.7 |
| semantic-segmentation-on-ade20k | VAN-Large (HamNet) | Params (M): 55 Validation mIoU: 50.2 |