
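Abstract
Transformers have emerged as a powerful tool for visual recognition. Besides showing competitive performance on a broad range of visual benchmarks, recent work also argues that Transformers are considerably more robust than convolutional neural networks (CNNs). Surprisingly, however, we find that these conclusions are drawn from unfair experimental settings, in which Transformers and CNNs are compared at different scales and trained with different frameworks. In this paper, we aim to provide the first fair and in-depth comparison between Transformers and CNNs, focusing on robustness evaluation. With a unified training setup, we first challenge the established belief that Transformers outperform CNNs in adversarial robustness. More surprisingly, we find that CNNs can easily become as robust as Transformers against adversarial attacks once they adopt the Transformers' training recipes. Regarding generalization to out-of-distribution samples, we show that pre-training on (external) large-scale datasets is not a necessary condition for Transformers to outperform CNNs. Further ablations suggest that this stronger generalization stems mainly from the Transformer's self-attention-like architecture itself rather than from other training setups. We hope this work helps the community better understand and benchmark the robustness of Transformers and CNNs. The code and models are publicly available at https://github.com/ytongbai/ViTs-vs-CNNs.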
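To make the idea of a unified training setup concrete, the following is a minimal sketch, not the authors' code. It treats a comparison as controlled only when two runs share the same training recipe and differ in architecture. The `Run` class and both helper functions are hypothetical names introduced for illustration. The optimizer and schedule labels are taken from the benchmark table on this page, and model scale, which the abstract also names as a confound, is left out for brevity.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One trained model: an architecture plus the recipe used to train it (hypothetical type)."""
    architecture: str
    optimizer: str
    schedule: str

    @property
    def recipe(self):
        # The recipe is everything except the architecture.
        return (self.optimizer, self.schedule)


def is_controlled_pair(a, b):
    """True when the two runs differ in architecture but share the same recipe."""
    return a.architecture != b.architecture and a.recipe == b.recipe


def controlled_pairs(runs):
    """List every pair of runs that forms a controlled comparison."""
    return [
        (a, b)
        for i, a in enumerate(runs)
        for b in runs[i + 1:]
        if is_controlled_pair(a, b)
    ]


# Configurations as labeled in the benchmark table on this page.
RUNS = [
    Run("ResNet-50", "SGD", "Cosine"),
    Run("ResNet-50", "AdamW", "Cosine"),
    Run("ResNet-50", "SGD", "Step"),
    Run("DeiT-S", "AdamW", "Cosine"),
]

if __name__ == "__main__":
    for a, b in controlled_pairs(RUNS):
        print(f"{a.architecture} vs {b.architecture} under {a.recipe}")
```

Under these assumptions, only the ResNet-50 (AdamW, Cosine) and DeiT-S (AdamW, Cosine) runs form a controlled pair. The other ResNet-50 runs vary the recipe while keeping the architecture fixed.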
Code repository
ytongbai/ViTs-vs-CNNs (official, PyTorch)
Benchmarks
| Benchmark | Method | Metric |
|---|---|---|
| adversarial-robustness-on-imagenet | ResNet-50 (SGD, Cosine) | Accuracy: 77.4 |
| adversarial-robustness-on-imagenet | ResNet-50 (AdamW, Cosine) | Accuracy: 76.4 |
| adversarial-robustness-on-imagenet | ResNet-50 (SGD, Step) | Accuracy: 76.9 |
| adversarial-robustness-on-imagenet | DeiT-S (AdamW, Cosine) | Accuracy: 76.8 |
| adversarial-robustness-on-imagenet-a | ResNet-50 (AdamW, Cosine) | Accuracy: 3.1 |
| adversarial-robustness-on-imagenet-a | ResNet-50 (SGD, Cosine) | Accuracy: 3.3 |
| adversarial-robustness-on-imagenet-a | ResNet-50 (SGD, Step) | Accuracy: 3.2 |
| adversarial-robustness-on-imagenet-a | DeiT-S (AdamW, Cosine) | Accuracy: 12.2 |
| adversarial-robustness-on-imagenet-c | DeiT-S (AdamW, Cosine) | mean Corruption Error (mCE): 48.0 |
| adversarial-robustness-on-imagenet-c | ResNet-50 (SGD, Step) | mean Corruption Error (mCE): 57.9 |
| adversarial-robustness-on-imagenet-c | ResNet-50 (SGD, Cosine) | mean Corruption Error (mCE): 56.9 |
| adversarial-robustness-on-imagenet-c | ResNet-50 (AdamW, Cosine) | mean Corruption Error (mCE): 59.3 |
| adversarial-robustness-on-stylized-imagenet | ResNet-50 (AdamW, Cosine) | Accuracy: 8.1 |
| adversarial-robustness-on-stylized-imagenet | ResNet-50 (SGD, Cosine) | Accuracy: 8.4 |
| adversarial-robustness-on-stylized-imagenet | ResNet-50 (SGD, Step) | Accuracy: 8.3 |
| adversarial-robustness-on-stylized-imagenet | DeiT-S (AdamW, Cosine) | Accuracy: 13.0 |
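The table mixes two metric directions: accuracy is higher-is-better, while mean Corruption Error (mCE) is lower-is-better. The sketch below shows one way to pick the best entry per benchmark with that distinction in mind. The rows are copied from the table above, and the helper name `best_per_benchmark` and the `LOWER_IS_BETTER` set are assumptions made for illustration.

```python
from collections import defaultdict

# (benchmark, method, metric, value) rows copied from the table above.
ROWS = [
    ("adversarial-robustness-on-imagenet", "ResNet-50 (SGD, Cosine)", "Accuracy", 77.4),
    ("adversarial-robustness-on-imagenet", "ResNet-50 (AdamW, Cosine)", "Accuracy", 76.4),
    ("adversarial-robustness-on-imagenet", "ResNet-50 (SGD, Step)", "Accuracy", 76.9),
    ("adversarial-robustness-on-imagenet", "DeiT-S (AdamW, Cosine)", "Accuracy", 76.8),
    ("adversarial-robustness-on-imagenet-a", "ResNet-50 (AdamW, Cosine)", "Accuracy", 3.1),
    ("adversarial-robustness-on-imagenet-a", "ResNet-50 (SGD, Cosine)", "Accuracy", 3.3),
    ("adversarial-robustness-on-imagenet-a", "ResNet-50 (SGD, Step)", "Accuracy", 3.2),
    ("adversarial-robustness-on-imagenet-a", "DeiT-S (AdamW, Cosine)", "Accuracy", 12.2),
    ("adversarial-robustness-on-imagenet-c", "DeiT-S (AdamW, Cosine)", "mCE", 48.0),
    ("adversarial-robustness-on-imagenet-c", "ResNet-50 (SGD, Step)", "mCE", 57.9),
    ("adversarial-robustness-on-imagenet-c", "ResNet-50 (SGD, Cosine)", "mCE", 56.9),
    ("adversarial-robustness-on-imagenet-c", "ResNet-50 (AdamW, Cosine)", "mCE", 59.3),
    ("adversarial-robustness-on-stylized-imagenet", "ResNet-50 (AdamW, Cosine)", "Accuracy", 8.1),
    ("adversarial-robustness-on-stylized-imagenet", "ResNet-50 (SGD, Cosine)", "Accuracy", 8.4),
    ("adversarial-robustness-on-stylized-imagenet", "ResNet-50 (SGD, Step)", "Accuracy", 8.3),
    ("adversarial-robustness-on-stylized-imagenet", "DeiT-S (AdamW, Cosine)", "Accuracy", 13.0),
]

# Assumption: any metric not listed here is higher-is-better.
LOWER_IS_BETTER = {"mCE"}


def best_per_benchmark(rows):
    """Return {benchmark: (method, value)} for the best entry on each benchmark."""
    grouped = defaultdict(list)
    for benchmark, method, metric, value in rows:
        grouped[benchmark].append((method, metric, value))

    best = {}
    for benchmark, entries in grouped.items():
        metric = entries[0][1]  # each benchmark reports a single metric
        pick = min if metric in LOWER_IS_BETTER else max
        method, _, value = pick(entries, key=lambda entry: entry[2])
        best[benchmark] = (method, value)
    return best


if __name__ == "__main__":
    for benchmark, (method, value) in best_per_benchmark(ROWS).items():
        print(f"{benchmark}: {method} ({value})")
```

On these rows, the helper selects ResNet-50 (SGD, Cosine) on the ImageNet benchmark and DeiT-S (AdamW, Cosine) on ImageNet-A, ImageNet-C, and Stylized-ImageNet.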