
Abstract
Due to the depth degradation effect introduced by residual connections, many efficient vision Transformers that rely on stacking layers for information exchange fail to mix information sufficiently, leading to unnatural visual perception. To address this, we propose Aggregated Attention, a biomimetic token mixer that simulates the foveal vision and continuous eye movement of the biological retina, giving every token on the feature map a global perception capability. We further introduce learnable tokens that interact with the conventional queries and keys, so that the affinity matrix is no longer generated solely from query-key similarity, enriching the expressiveness of the attention mechanism. Because our method does not depend on stacked layers for information exchange, it effectively avoids depth degradation and achieves more natural visual perception. We also propose Convolutional GLU, a new channel mixer that bridges the gap between the GLU and SE (Squeeze-and-Excitation) mechanisms: it lets each token perform channel attention based on its nearest-neighbor image features, markedly strengthening local modeling capability and robustness. Combining Aggregated Attention with Convolutional GLU, we build a new visual backbone, TransNeXt. Extensive experiments show that TransNeXt achieves state-of-the-art performance across multiple model sizes. At $224^2$ resolution, TransNeXt-Tiny attains 84.0% accuracy on ImageNet, outperforming ConvNeXt-B, which has 69% more parameters. At $384^2$ resolution, TransNeXt-Base achieves 86.2% accuracy on ImageNet, 61.6% accuracy on ImageNet-A, 57.1 box mAP on COCO object detection, and 54.7 mIoU on ADE20K semantic segmentation, surpassing existing state-of-the-art models.
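The Convolutional GLU idea described above, a GLU whose gating branch sees a small spatial neighborhood so each token's channel gate depends on its nearest-neighbor features, can be sketched roughly as follows. This is a minimal illustration assuming a 3×3 depthwise convolution in the gate branch; module and parameter names are illustrative, not the official TransNeXt API.

```python
import torch
import torch.nn as nn

class ConvGLU(nn.Module):
    """Sketch of a Convolutional GLU channel mixer (illustrative, not the official code)."""

    def __init__(self, dim: int, hidden: int):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden * 2)  # produces value and gate branches
        # Depthwise 3x3 conv: each channel is mixed only with its spatial neighbors,
        # so the gate acts like a local, per-token channel attention.
        self.dwconv = nn.Conv2d(hidden, hidden, 3, padding=1, groups=hidden)
        self.act = nn.GELU()
        self.fc2 = nn.Linear(hidden, dim)

    def forward(self, x: torch.Tensor, H: int, W: int) -> torch.Tensor:
        # x: (B, N, C) token sequence with N = H * W
        v, g = self.fc1(x).chunk(2, dim=-1)
        B, N, C = g.shape
        g = g.transpose(1, 2).reshape(B, C, H, W)      # tokens -> image layout
        g = self.dwconv(g).flatten(2).transpose(1, 2)  # back to token layout
        return self.fc2(v * self.act(g))               # gated channel mixing

tokens = torch.randn(2, 14 * 14, 64)
out = ConvGLU(64, 128)(tokens, 14, 14)
print(out.shape)  # torch.Size([2, 196, 64])
```

Compared with a plain GLU, the only change is the depthwise convolution on the gate; compared with SE, the gating is computed per token from local features rather than from a global pooled descriptor.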
Code Repositories
- daishiresearch/transnext (official, PyTorch) — mentioned on GitHub
- Westlake-AI/openmixup (PyTorch) — mentioned on GitHub
- chenller/mmseg-extension (PyTorch) — mentioned on GitHub
Benchmarks
| Benchmark | Method | Metrics |
|---|---|---|
| domain-generalization-on-imagenet-a | TransNeXt-Base (IN-1K supervised, 224) | Number of params: 89.7M Top-1 accuracy %: 50.6 |
| domain-generalization-on-imagenet-a | TransNeXt-Small (IN-1K supervised, 224) | Number of params: 49.7M Top-1 accuracy %: 47.1 |
| domain-generalization-on-imagenet-a | TransNeXt-Small (IN-1K supervised, 384) | Number of params: 49.7M Top-1 accuracy %: 58.3 |
| domain-generalization-on-imagenet-a | TransNeXt-Base (IN-1K supervised, 384) | Number of params: 89.7M Top-1 accuracy %: 61.6 |
| image-classification-on-imagenet | TransNeXt-Micro (IN-1K supervised, 224) | GFLOPs: 2.7 Number of params: 12.8M Top 1 Accuracy: 82.5% |
| image-classification-on-imagenet | TransNeXt-Small (IN-1K supervised, 384) | GFLOPs: 32.1 Number of params: 49.7M Top 1 Accuracy: 86.0% |
| image-classification-on-imagenet | TransNeXt-Base (IN-1K supervised, 384) | GFLOPs: 56.3 Number of params: 89.7M Top 1 Accuracy: 86.2% |
| image-classification-on-imagenet | TransNeXt-Small (IN-1K supervised, 224) | GFLOPs: 10.3 Number of params: 49.7M Top 1 Accuracy: 84.7% |
| image-classification-on-imagenet | TransNeXt-Tiny (IN-1K supervised, 224) | GFLOPs: 5.7 Number of params: 28.2M Top 1 Accuracy: 84.0% |
| object-detection-on-coco-minival | TransNeXt-Tiny (IN-1K pretrain, DINO 1x) | box AP: 55.7 |
| object-detection-on-coco-minival | TransNeXt-Base (IN-1K pretrain, DINO 1x) | box AP: 57.1 |
| object-detection-on-coco-minival | TransNeXt-Small (IN-1K pretrain, DINO 1x) | box AP: 56.6 |
| semantic-segmentation-on-ade20k | TransNeXt-Base (IN-1K pretrain, Mask2Former, 512) | Params (M): 109 Validation mIoU: 54.7 |
| semantic-segmentation-on-ade20k | TransNeXt-Tiny (IN-1K pretrain, Mask2Former, 512) | Params (M): 47.5 Validation mIoU: 53.4 |
| semantic-segmentation-on-ade20k | TransNeXt-Small (IN-1K pretrain, Mask2Former, 512) | Params (M): 69 Validation mIoU: 54.1 |