
Abstract
The scaling of Transformers has driven breakthrough capabilities for language models. At present, the largest large language models (LLMs) contain upwards of 100 billion parameters. Vision Transformers (ViT) bring the same architecture to image and video modelling, but these models have not yet been scaled to a comparable degree; the largest dense ViT contains 4 billion parameters (Chen et al., 2022). We present a recipe for efficient and stable training of a 22-billion-parameter ViT (ViT-22B) and perform a wide variety of experiments on the resulting model. When evaluated on downstream tasks (often with a lightweight linear model on frozen features), ViT-22B shows increasing performance with scale. We further observe other interesting benefits of scale, including an improved tradeoff between fairness and performance, state-of-the-art alignment with human visual perception in terms of shape/texture bias, and improved robustness. ViT-22B demonstrates the potential for "LLM-like" scaling in vision and provides key steps towards getting there.
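The downstream evaluation mentioned above trains only a lightweight linear classifier on top of frozen backbone features (linear probing). Below is a minimal sketch of that protocol in PyTorch; the backbone, feature dimension, data loader, and hyperparameters are illustrative placeholders, not the paper's actual setup.

```python
# Linear-probe sketch: train a linear head on features from a frozen backbone.
# All names and hyperparameters here are illustrative, not the paper's recipe.
import torch
import torch.nn as nn

def linear_probe(backbone: nn.Module, feat_dim: int, num_classes: int,
                 loader, epochs: int = 10, lr: float = 1e-3, device: str = "cpu"):
    backbone = backbone.to(device).eval()
    for p in backbone.parameters():      # freeze the backbone
        p.requires_grad_(False)

    head = nn.Linear(feat_dim, num_classes).to(device)   # lightweight linear model
    opt = torch.optim.AdamW(head.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()

    for _ in range(epochs):
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            with torch.no_grad():        # features come from the frozen model
                feats = backbone(images)
            loss = loss_fn(head(feats), labels)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return head
```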
Code Repositories
lucidrains/flash-cosine-sim-attention
pytorch
Mentioned in GitHub
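The repository listed above implements cosine-similarity attention, in which queries and keys are L2-normalized before the dot product so attention logits stay bounded; normalizing queries and keys before the softmax is also central to stabilizing training at the ViT-22B scale. The sketch below illustrates the normalized-attention idea only; it is not the library's API nor the exact ViT-22B implementation, and the shapes and scale value are assumptions.

```python
# Conceptual sketch of cosine-similarity (normalized) attention.
# Queries and keys are L2-normalized; a fixed scale replaces 1/sqrt(d).
# Illustrative only; not the flash-cosine-sim-attention API.
import torch
import torch.nn.functional as F

def cosine_sim_attention(q, k, v, scale: float = 10.0):
    # q, k, v: (batch, heads, seq_len, head_dim)
    q = F.normalize(q, dim=-1)                            # unit-norm queries
    k = F.normalize(k, dim=-1)                            # unit-norm keys
    sim = torch.einsum("bhid,bhjd->bhij", q, k) * scale   # bounded attention logits
    attn = sim.softmax(dim=-1)
    return torch.einsum("bhij,bhjd->bhid", attn, v)

# Usage with illustrative shapes.
q = torch.randn(2, 8, 16, 64)
k = torch.randn(2, 8, 16, 64)
v = torch.randn(2, 8, 16, 64)
out = cosine_sim_attention(q, k, v)   # -> (2, 8, 16, 64)
```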
Benchmarks
| Benchmark | Method | Metrics |
|---|---|---|
| action-classification-on-kinetics-400 | ViT-22B | Acc@1: 88.0 |
| image-classification-on-imagenet | ViT-B/16 | Number of params: 86M; Top-1 Accuracy: 88.6% |
| image-classification-on-imagenet | ViT-L/16 (384res, distilled from ViT-22B) | Number of params: 307M; Top-1 Accuracy: 89.6% |
| object-recognition-on-shape-bias | ViT-22B-384 | shape bias: 86.4 |
| object-recognition-on-shape-bias | ViT-22B-560 | shape bias: 83.8 |
| object-recognition-on-shape-bias | ViT-22B-224 | shape bias: 78.0 |
| zero-shot-transfer-image-classification-on-1 | LiT-22B | Accuracy (Private): 85.9 |
| zero-shot-transfer-image-classification-on-3 | LiT-22B | Accuracy (Private): 80.9 |
| zero-shot-transfer-image-classification-on-4 | LiT-22B | Accuracy: 96.0 |
| zero-shot-transfer-image-classification-on-5 | LiT-22B | Accuracy (Private): 90.1 |
| zero-shot-transfer-image-classification-on-6 | LiT-22B | Accuracy (Private): 87.6 |