
Abstract
Motivated by the success of the Transformer architecture in natural language processing (NLP), researchers have begun to adapt it to vision tasks, producing a series of representative works such as ViT and DeiT. However, pure Transformer architectures typically require large amounts of training data or extra supervision to match the performance of convolutional neural networks (CNNs). To overcome this limitation, this paper analyzes the potential drawbacks of directly transferring the NLP Transformer to the visual domain and proposes a new Convolution-enhanced image Transformer (CeiT), which combines the advantages of CNNs in extracting low-level features and strengthening locality with the advantages of Transformers in modeling long-range dependencies. Three modifications are made to the original Transformer: 1) instead of tokenizing the raw input image directly, an Image-to-Tokens (I2T) module extracts patches from generated low-level feature maps, enriching the semantics and spatial correlation of the initial representation; 2) the feed-forward network in each encoder block is replaced with a Locally-enhanced Feed-Forward (LeFF) layer that strengthens the correlation among spatially neighboring tokens, improving local feature modeling; 3) a Layer-wise Class token Attention (LCA) mechanism is attached at the top of the Transformer to exploit multi-level feature representations and further strengthen the discriminative power of the classification decision. Experimental results on ImageNet and seven downstream tasks show that CeiT outperforms previous Transformers and state-of-the-art CNNs in both accuracy and generalization, without requiring large-scale training data or distillation from an extra CNN teacher. In addition, CeiT converges better during training, reaching comparable or better performance with only one third of the original training iterations and thus substantially reducing the training cost. (Code and models will be released upon acceptance.)
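To make the first two modifications concrete, below is a minimal PyTorch sketch of the I2T and LeFF ideas described above. The module names, channel counts, kernel sizes, and expansion ratio are illustrative assumptions and do not reproduce the paper's exact configuration; the LCA branch is omitted for brevity.

```python
# Minimal sketch of the I2T and LeFF ideas; hyper-parameters are assumptions.
import torch
import torch.nn as nn


class ImageToTokens(nn.Module):
    """I2T: extract patch tokens from a low-level conv feature map
    instead of slicing the raw image directly."""
    def __init__(self, embed_dim=192, conv_dim=32, patch=4):
        super().__init__()
        # Lightweight conv stem produces a low-level feature map
        # at a quarter of the input resolution (conv stride 2 + maxpool stride 2).
        self.stem = nn.Sequential(
            nn.Conv2d(3, conv_dim, kernel_size=7, stride=2, padding=3),
            nn.BatchNorm2d(conv_dim),
            nn.MaxPool2d(kernel_size=3, stride=2, padding=1),
        )
        # Patch embedding on the feature map (stride = patch size).
        self.proj = nn.Conv2d(conv_dim, embed_dim, kernel_size=patch, stride=patch)

    def forward(self, x):                      # x: (B, 3, H, W)
        x = self.proj(self.stem(x))            # (B, D, H', W')
        return x.flatten(2).transpose(1, 2)    # (B, N, D) patch tokens


class LeFF(nn.Module):
    """Locally-enhanced Feed-Forward: a depth-wise conv over the
    spatially re-arranged patch tokens; the class token is bypassed."""
    def __init__(self, dim=192, expansion=4):
        super().__init__()
        hidden = dim * expansion
        self.expand = nn.Sequential(nn.Linear(dim, hidden), nn.GELU())
        self.dwconv = nn.Sequential(
            nn.Conv2d(hidden, hidden, kernel_size=3, padding=1, groups=hidden),
            nn.GELU(),
        )
        self.project = nn.Linear(hidden, dim)

    def forward(self, tokens):                 # tokens: (B, 1 + N, D), N a square number
        cls, patches = tokens[:, :1], tokens[:, 1:]
        B, N, _ = patches.shape
        side = int(N ** 0.5)
        x = self.expand(patches)               # (B, N, hidden)
        x = x.transpose(1, 2).reshape(B, -1, side, side)
        x = self.dwconv(x)                     # local interaction among neighbouring tokens
        x = x.flatten(2).transpose(1, 2)
        patches = self.project(x)              # (B, N, D)
        return torch.cat([cls, patches], dim=1)


if __name__ == "__main__":
    img = torch.randn(2, 3, 224, 224)
    tokens = ImageToTokens()(img)                       # (2, 196, 192)
    cls = torch.zeros(2, 1, 192)
    out = LeFF()(torch.cat([cls, tokens], dim=1))       # (2, 197, 192)
    print(tokens.shape, out.shape)
```

The design choice mirrored here is that LeFF only re-arranges the patch tokens into a 2D grid before the depth-wise convolution, while the class token bypasses the convolution and is re-attached afterwards.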
Code Repository
Benchmarks
| Benchmark | Method | Metrics |
|---|---|---|
| image-classification-on-cifar-10 | CeiT-S (384 finetune resolution) | Percentage correct: 99.1 |
| image-classification-on-cifar-10 | CeiT-S | Percentage correct: 99 |
| image-classification-on-cifar-10 | CeiT-T | Percentage correct: 98.5 |
| image-classification-on-cifar-100 | CeiT-T | Percentage correct: 89.4 |
| image-classification-on-cifar-100 | CeiT-S (384 finetune resolution) | Percentage correct: 91.8 |
| image-classification-on-cifar-100 | CeiT-T (384 finetune resolution) | Percentage correct: 88 |
| image-classification-on-cifar-100 | CeiT-S | Percentage correct: 91.8 |
| image-classification-on-flowers-102 | CeiT-S (384 finetune resolution) | Accuracy: 98.6 |
| image-classification-on-flowers-102 | CeiT-T | Accuracy: 96.9 |
| image-classification-on-flowers-102 | CeiT-T (384 finetune resolution) | Accuracy: 97.8 |
| image-classification-on-flowers-102 | CeiT-S | Accuracy: 98.2 |
| image-classification-on-imagenet | CeiT-T | GFLOPs: 1.2, Number of params: 6.4M, Top-1 Accuracy: 76.4% |
| image-classification-on-imagenet | CeiT-S | GFLOPs: 4.5, Top-1 Accuracy: 82% |
| image-classification-on-imagenet | CeiT-S (384 finetune resolution) | GFLOPs: 12.9, Number of params: 24.2M, Top-1 Accuracy: 83.3% |
| image-classification-on-imagenet | CeiT-T (384 finetune resolution) | GFLOPs: 3.6, Top-1 Accuracy: 78.8% |
| image-classification-on-imagenet-real | CeiT-T | Accuracy: 83.6% |
| image-classification-on-imagenet-real | CeiT-S (384 finetune resolution) | Accuracy: 88.1% |
| image-classification-on-imagenet-real | CeiT-S | Accuracy: 87.3% |
| image-classification-on-inaturalist-2018 | CeiT-T (384 finetune resolution) | Top-1 Accuracy: 72.2% |
| image-classification-on-inaturalist-2018 | CeiT-S (384 finetune resolution) | Top-1 Accuracy: 79.4% |
| image-classification-on-inaturalist-2018 | CeiT-T | Top-1 Accuracy: 64.3% |
| image-classification-on-inaturalist-2018 | CeiT-S | Top-1 Accuracy: 73.3% |
| image-classification-on-inaturalist-2019 | CeiT-S | Top-1 Accuracy: 78.9 |
| image-classification-on-inaturalist-2019 | CeiT-S (384 finetune resolution) | Top-1 Accuracy: 82.7 |
| image-classification-on-inaturalist-2019 | CeiT-T | Top-1 Accuracy: 72.8 |
| image-classification-on-inaturalist-2019 | CeiT-T (384 finetune resolution) | Top-1 Accuracy: 77.9 |
| image-classification-on-oxford-iiit-pets-1 | CeiT-T (384 finetune resolution) | Accuracy: 94.5 |
| image-classification-on-oxford-iiit-pets-1 | CeiT-T | Accuracy: 93.8 |
| image-classification-on-oxford-iiit-pets-1 | CeiT-S | Accuracy: 94.6 |
| image-classification-on-oxford-iiit-pets-1 | CeiT-S (384 finetune resolution) | Accuracy: 94.9 |
| image-classification-on-stanford-cars | CeiT-S | Accuracy: 93.2 |
| image-classification-on-stanford-cars | CeiT-S (384 finetune resolution) | Accuracy: 94.1 |
| image-classification-on-stanford-cars | CeiT-T | Accuracy: 90.5 |
| image-classification-on-stanford-cars | CeiT-T (384 finetune resolution) | Accuracy: 93 |