
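Abstract
The prevailing approach to adapting pre-trained models is to update all parameters of the backbone, i.e., full fine-tuning. This paper introduces Visual Prompt Tuning (VPT) as an efficient and effective alternative to full fine-tuning for large-scale Transformer models in vision. Taking inspiration from recent advances in efficiently tuning large language models, VPT introduces only a small number of trainable parameters (less than 1% of the model's parameters) in the input space while keeping the model backbone frozen. Through extensive experiments on a wide variety of downstream recognition tasks, we show that VPT achieves significant performance gains over other parameter-efficient tuning methods. Most importantly, VPT even outperforms full fine-tuning in many cases across model capacities and training data scales, while substantially reducing per-task storage cost.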
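To make the "less than 1%" figure concrete, the following is a minimal accounting sketch, not the authors' code. It assumes a ViT-B/16-sized backbone (hidden width 768, 12 layers, roughly 86M parameters) and a hypothetical prompt length of 50 tokens. It ignores the task-specific classification head, and the function names are illustrative only. VPT-Shallow is taken to add prompts at the first layer's input, VPT-Deep at every layer's input.

```python
# Illustrative parameter accounting for prompt tuning on a ViT-B/16-sized backbone.
# All constants are assumptions for this sketch, not values taken from the abstract.

HIDDEN_DIM = 768               # token embedding width of ViT-B
NUM_LAYERS = 12                # Transformer encoder layers in ViT-B
BACKBONE_PARAMS = 86_000_000   # approximate ViT-B/16 parameter count (frozen)


def prompt_params(num_prompts: int, deep: bool) -> int:
    """Count trainable prompt parameters (one HIDDEN_DIM vector per prompt token).

    Shallow: prompts are prepended only at the first layer's input.
    Deep: a separate set of prompts is learned for every layer's input.
    The classification head is deliberately left out of this count.
    """
    layers = NUM_LAYERS if deep else 1
    return layers * num_prompts * HIDDEN_DIM


def trainable_fraction(num_prompts: int, deep: bool) -> float:
    """Prompt parameters as a fraction of the frozen backbone's parameters."""
    return prompt_params(num_prompts, deep) / BACKBONE_PARAMS


if __name__ == "__main__":
    for deep in (False, True):
        name = "deep" if deep else "shallow"
        n = prompt_params(50, deep)
        print(f"{name}: {n} prompt parameters, {trainable_fraction(50, deep):.4%} of backbone")
```

Under these assumptions, both variants stay below 1% of the backbone size, which is also why per-task storage is small: only the prompts (and a head) need to be saved per task, while the frozen backbone is shared.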
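Code repositories
| Repository | Framework | Note |
|---|---|---|
| KMnP/vpt | pytorch | Official |
| wgcban/apt | pytorch | Mentioned on GitHub |
| heekhero/DTL | pytorch | Mentioned on GitHub |
| TooTouch/VPT | pytorch | Mentioned on GitHub |
| unites-lab/vpns | pytorch | |
| Yiming-M/CLIP-EBC | pytorch | Mentioned on GitHub |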
Benchmarks
| Benchmark | Method | Metric |
|---|---|---|
| long-tail-learning-on-cifar-100-lt-r-10 | VPT | Error Rate: 10.4 |
| long-tail-learning-on-cifar-100-lt-r-100 | VPT | Error Rate: 19 |
| long-tail-learning-on-cifar-100-lt-r-50 | VPT | Error Rate: 15.2 |
| prompt-engineering-on-imagenet-21k | VPT | Accuracy: 24.8 |
| visual-prompt-tuning-on-fgvc | VPT-Deep (ViT-B/16_MAE_pretrained_ImageNet-1K) | Mean Accuracy: 72.02 |
| visual-prompt-tuning-on-fgvc | VPT-Shallow (ViT-B/16_MAE_pretrained_ImageNet-1K) | Mean Accuracy: 57.84 |
| visual-prompt-tuning-on-fgvc | VPT-Deep (ViT-B/16_MoCo_v3_pretrained_ImageNet-1K) | Mean Accuracy: 83.12 |
| visual-prompt-tuning-on-fgvc | VPT-Shallow (ViT-B/16_MoCo_v3_pretrained_ImageNet-1K) | Mean Accuracy: 79.26 |
| visual-prompt-tuning-on-vtab-1k-natural-7 | VPT-Shallow (ViT-B/16_MoCo_v3_pretrained_ImageNet-1K) | Mean Accuracy: 67.34 |
| visual-prompt-tuning-on-vtab-1k-natural-7 | VPT-Shallow (ViT-B/16_MAE_pretrained_ImageNet-1K) | Mean Accuracy: 39.96 |
| visual-prompt-tuning-on-vtab-1k-natural-7 | VPT-Deep (ViT-B/16_MoCo_v3_pretrained_ImageNet-1K) | Mean Accuracy: 70.27 |
| visual-prompt-tuning-on-vtab-1k-natural-7 | VPT-Deep (ViT-B/16_MAE_pretrained_ImageNet-1K) | Mean Accuracy: 36.02 |
| visual-prompt-tuning-on-vtab-1k-specialized-4 | VPT-Deep (ViT-B/16_MAE_pretrained_ImageNet-1K) | Mean Accuracy: 60.61 |
| visual-prompt-tuning-on-vtab-1k-specialized-4 | VPT-Shallow (ViT-B/16_MAE_pretrained_ImageNet-1K) | Mean Accuracy: 69.65 |
| visual-prompt-tuning-on-vtab-1k-specialized-4 | VPT-Deep (ViT-B/16_MoCo_v3_pretrained_ImageNet-1K) | Mean Accuracy: 83.04 |
| visual-prompt-tuning-on-vtab-1k-specialized-4 | VPT-Shallow (ViT-B/16_MoCo_v3_pretrained_ImageNet-1K) | Mean Accuracy: 82.26 |
| visual-prompt-tuning-on-vtab-1k-structured-8 | VPT-Shallow (ViT-B/16_MoCo_v3_pretrained_ImageNet-1K) | Mean Accuracy: 37.55 |
| visual-prompt-tuning-on-vtab-1k-structured-8 | VPT-Deep (ViT-B/16_MoCo_v3_pretrained_ImageNet-1K) | Mean Accuracy: 42.38 |
| visual-prompt-tuning-on-vtab-1k-structured-8 | VPT-Deep (ViT-B/16_MAE_pretrained_ImageNet-1K) | Mean Accuracy: 26.57 |
| visual-prompt-tuning-on-vtab-1k-structured-8 | VPT-Shallow (ViT-B/16_MAE_pretrained_ImageNet-1K) | Mean Accuracy: 27.50 |