
Incorporating Convolution Designs into Visual Transformers


Abstract

Motivated by the success of the Transformer architecture in natural language processing (NLP), researchers have begun adapting it to vision tasks, producing a series of representative works such as ViT and DeiT. However, pure Transformer architectures typically require large amounts of training data or extra supervision to reach performance comparable to convolutional neural networks (CNNs). To overcome this limitation, this paper analyzes the potential drawbacks of directly transplanting NLP Transformer architectures to the vision domain and proposes a new Convolution-enhanced image Transformer (CeiT), which combines the strengths of CNNs in extracting low-level features and strengthening locality with the strengths of Transformers in modeling long-range dependencies. To this end, three key modifications are made to the original Transformer: 1) instead of directly tokenizing the raw input image into patches, an Image-to-Tokens (I2T) module is designed that extracts patches from generated low-level feature maps, enriching the semantics and spatial correlation of the initial representations; 2) the feed-forward network in each encoder block is replaced with a Locally-enhanced Feed-Forward (LeFF) layer, which strengthens the correlation among spatially neighboring tokens and improves local feature modeling; 3) a Layer-wise Class token Attention (LCA) mechanism is attached at the top of the Transformer, which exploits multi-level feature representations to further strengthen the discriminability of the final classification. Experimental results on ImageNet classification and seven downstream vision tasks show that CeiT significantly outperforms previous Transformers and state-of-the-art CNNs in both performance and generalization, without requiring large-scale training data or an extra CNN teacher for knowledge distillation. In addition, CeiT models exhibit better convergence during training, reaching comparable or better performance with only one third of the original training iterations, which substantially reduces the training cost.\footnote{Code and models will be released upon acceptance.}
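The LeFF layer described above hinges on one structural idea: the patch tokens of a Transformer sequence can be temporarily rearranged back into their 2D spatial grid, processed with a local (depthwise-convolution-style) operation, and flattened back, while the class token bypasses the spatial step. The sketch below illustrates only that sequence-to-grid round trip in plain Python; the helper name and the identity placeholder standing in for the depthwise convolution are assumptions for illustration, not the authors' implementation.

```python
def leff_token_roundtrip(tokens, h, w):
    """Sketch of the token <-> feature-map reshaping inside a LeFF layer.

    `tokens` is a list of 1 + h*w embeddings: the class token followed
    by patch tokens in row-major order. Hypothetical helper, not the
    paper's code.
    """
    cls_tok, patch_toks = tokens[0], tokens[1:]
    assert len(patch_toks) == h * w, "patch count must match the h x w grid"

    # Restore the 2D spatial layout: an h x w grid of patch tokens.
    fmap = [patch_toks[r * w:(r + 1) * w] for r in range(h)]

    # A depthwise convolution over `fmap` would go here to strengthen
    # correlations among spatially neighboring tokens; this sketch uses
    # an identity placeholder instead.

    # Flatten back to a token sequence and re-attach the class token,
    # which skips the spatial branch entirely.
    flat = [tok for row in fmap for tok in row]
    return [cls_tok] + flat


# Example: a 14x14 patch grid (ViT-style 224px input with 16px patches).
tokens = [f"t{i}" for i in range(1 + 14 * 14)]
out = leff_token_roundtrip(tokens, 14, 14)
assert out == tokens  # identity placeholder => round trip preserves order
```

Because the grid is rebuilt in row-major order, any convolution applied to `fmap` sees each token's true spatial neighbors, which is exactly what a flat token sequence hides from a plain feed-forward layer.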

Benchmarks

All results below are for image classification.

| Dataset | Method | Metric | Score |
|---|---|---|---|
| CIFAR-10 | CeiT-S (384 finetune resolution) | Percentage correct | 99.1 |
| CIFAR-10 | CeiT-S | Percentage correct | 99.0 |
| CIFAR-10 | CeiT-T | Percentage correct | 98.5 |
| CIFAR-100 | CeiT-S (384 finetune resolution) | Percentage correct | 91.8 |
| CIFAR-100 | CeiT-S | Percentage correct | 91.8 |
| CIFAR-100 | CeiT-T | Percentage correct | 89.4 |
| CIFAR-100 | CeiT-T (384 finetune resolution) | Percentage correct | 88.0 |
| Flowers-102 | CeiT-S (384 finetune resolution) | Accuracy | 98.6 |
| Flowers-102 | CeiT-S | Accuracy | 98.2 |
| Flowers-102 | CeiT-T (384 finetune resolution) | Accuracy | 97.8 |
| Flowers-102 | CeiT-T | Accuracy | 96.9 |
| ImageNet | CeiT-T | GFLOPs | 1.2 |
| ImageNet | CeiT-T | Number of params | 6.4M |
| ImageNet | CeiT-T | Top-1 accuracy | 76.4% |
| ImageNet | CeiT-T (384 finetune resolution) | GFLOPs | 3.6 |
| ImageNet | CeiT-T (384 finetune resolution) | Top-1 accuracy | 78.8% |
| ImageNet | CeiT-S | GFLOPs | 4.5 |
| ImageNet | CeiT-S | Top-1 accuracy | 82.0% |
| ImageNet | CeiT-S (384 finetune resolution) | GFLOPs | 12.9 |
| ImageNet | CeiT-S (384 finetune resolution) | Number of params | 24.2M |
| ImageNet | CeiT-S (384 finetune resolution) | Top-1 accuracy | 83.3% |
| ImageNet ReaL | CeiT-S (384 finetune resolution) | Accuracy | 88.1% |
| ImageNet ReaL | CeiT-S | Accuracy | 87.3% |
| ImageNet ReaL | CeiT-T | Accuracy | 83.6% |
| iNaturalist 2018 | CeiT-S (384 finetune resolution) | Top-1 accuracy | 79.4% |
| iNaturalist 2018 | CeiT-S | Top-1 accuracy | 73.3% |
| iNaturalist 2018 | CeiT-T (384 finetune resolution) | Top-1 accuracy | 72.2% |
| iNaturalist 2018 | CeiT-T | Top-1 accuracy | 64.3% |
| iNaturalist 2019 | CeiT-S (384 finetune resolution) | Top-1 accuracy | 82.7 |
| iNaturalist 2019 | CeiT-S | Top-1 accuracy | 78.9 |
| iNaturalist 2019 | CeiT-T (384 finetune resolution) | Top-1 accuracy | 77.9 |
| iNaturalist 2019 | CeiT-T | Top-1 accuracy | 72.8 |
| Oxford-IIIT Pets | CeiT-S (384 finetune resolution) | Accuracy | 94.9 |
| Oxford-IIIT Pets | CeiT-S | Accuracy | 94.6 |
| Oxford-IIIT Pets | CeiT-T (384 finetune resolution) | Accuracy | 94.5 |
| Oxford-IIIT Pets | CeiT-T | Accuracy | 93.8 |
| Stanford Cars | CeiT-S (384 finetune resolution) | Accuracy | 94.1 |
| Stanford Cars | CeiT-S | Accuracy | 93.2 |
| Stanford Cars | CeiT-T (384 finetune resolution) | Accuracy | 93.0 |
| Stanford Cars | CeiT-T | Accuracy | 90.5 |

