
DINOv2: Learning Robust Visual Features without Supervision

Abstract

The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision. These models could greatly simplify the use of images in any system by producing all-purpose visual features, i.e., features that work across image distributions and tasks without finetuning. This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources. We revisit existing approaches and combine different techniques to scale our pretraining in terms of data and model size. Most of the technical contributions aim at accelerating and stabilizing the training at scale. In terms of data, we propose an automatic pipeline to build a dedicated, diverse, and curated image dataset, instead of the uncurated data typically used in the self-supervised literature. In terms of models, we train a Vision Transformer (ViT, Dosovitskiy et al., 2020) with 1B parameters and distill it into a series of smaller models that surpass the best available all-purpose features, OpenCLIP (Ilharco et al., 2021), on most benchmarks at the image and pixel levels.
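Many of the results below use a frozen backbone evaluated with a linear head ("frozen model, linear eval"). The sketch below is a minimal illustration of that setup: it loads a distilled ViT-S/14 backbone through the torch.hub entry points published in the facebookresearch/dinov2 repository and trains only a linear probe on its global features. The 1000-class head, optimizer settings, and training step are illustrative assumptions, not the paper's exact evaluation protocol.

```python
# Minimal sketch: frozen DINOv2 backbone + linear probe (illustrative, not the paper's exact protocol).
import torch
import torch.nn as nn

# Load a distilled ViT-S/14 backbone from the public DINOv2 torch.hub entry point.
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
backbone.eval()
for p in backbone.parameters():
    p.requires_grad = False  # the backbone stays frozen

# Linear head on the global (CLS) feature; 384 is ViT-S/14's embedding dimension.
head = nn.Linear(384, 1000)  # e.g. 1000 ImageNet classes (assumed here)
optimizer = torch.optim.SGD(head.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()

def train_step(images, labels):
    """images: (B, 3, 224, 224), normalized; labels: (B,)."""
    with torch.no_grad():          # no gradients flow into the frozen backbone
        feats = backbone(images)   # (B, 384) global features
    logits = head(feats)
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Input side lengths should be multiples of the 14-pixel patch size (224 works); only the linear head receives gradient updates.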

Benchmarks

| Benchmark | Method | Metrics |
|---|---|---|
| depth-estimation-on-nyu-depth-v2 | DINOv2 (ViT-g/14 frozen, w/ DPT decoder) | RMS: 0.279 |
| domain-generalization-on-imagenet-c | DINOv2 (ViT-S/14, frozen model, linear eval) | Params: 21M; mean Corruption Error (mCE): 54.4 |
| domain-generalization-on-imagenet-c | DINOv2 (ViT-g/14, frozen model, linear eval) | Params: 1100M; mean Corruption Error (mCE): 28.2 |
| domain-generalization-on-imagenet-c | DINOv2 (ViT-B/14, frozen model, linear eval) | Params: 85M; mean Corruption Error (mCE): 42.7 |
| domain-generalization-on-imagenet-c | DINOv2 (ViT-L/14, frozen model, linear eval) | Params: 307M; mean Corruption Error (mCE): 31.5 |
| fine-grained-image-classification-on-oxford-1 | DINOv2 (ViT-g/14, frozen model, linear eval) | Accuracy: 96.7 |
| image-classification-on-cifar-10 | DINOv2 (ViT-g/14, frozen model, linear eval) | Percentage correct: 99.5 |
| image-retrieval-on-amstertime | DINOv2 distilled (ViT-S/14 frozen) | mAP: 43.5 |
| image-retrieval-on-amstertime | DINOv2 (ViT-g/14 frozen) | mAP: 46.7 |
| image-retrieval-on-amstertime | DINOv2 distilled (ViT-B/14 frozen) | mAP: 45.6 |
| image-retrieval-on-amstertime | DINOv2 distilled (ViT-L/14 frozen) | mAP: 50.0 |
| monocular-depth-estimation-on-kitti-eigen | DINOv2 (ViT-g/14 frozen, w/ DPT decoder) | Delta < 1.25: 0.968; Delta < 1.25^2: 0.997; Delta < 1.25^3: 0.9993; RMSE: 2.1128; RMSE log: 0.0882; Sq Rel: 0.1797; absolute relative error: 0.0652 |
| monocular-depth-estimation-on-nyu-depth-v2 | DINOv2 (ViT-g/14 frozen, w/ DPT decoder) | Delta < 1.25: 0.9497; Delta < 1.25^2: 0.996; Delta < 1.25^3: 0.9994; RMSE: 0.279; absolute relative error: 0.0907; log 10: 0.0371 |
| self-supervised-image-classification-on | DINOv2 distilled (ViT-S/14) | Params: 21M; Top 1 Accuracy: 81.1% |
| self-supervised-image-classification-on | DINOv2 distilled (ViT-B/14) | Params: 85M; Top 1 Accuracy: 84.5% |
| self-supervised-image-classification-on | DINOv2 (ViT-g/14 @448) | Params: 1100M; Top 1 Accuracy: 86.7% |
| self-supervised-image-classification-on | DINOv2 distilled (ViT-L/14) | Params: 307M; Top 1 Accuracy: 86.3% |
| self-supervised-image-classification-on | DINOv2 (ViT-g/14) | Params: 1100M; Top 1 Accuracy: 86.5% |
| self-supervised-image-classification-on-1 | DINOv2 (ViT-g/14, 448) | Params: 1100M; Top 1 Accuracy: 88.9% |
| self-supervised-image-classification-on-1 | DINOv2 (ViT-g/14) | Params: 1100M; Top 1 Accuracy: 88.5% |
| semantic-segmentation-on-ade20k | DINOv2 (ViT-g/14 frozen model, w/ ViT-Adapter + Mask2former) | Params (M): 1080; Validation mIoU: 60.2 |
| visual-place-recognition-on-17-places | DINOv2 | Recall@1: 61.82 |
| visual-place-recognition-on-baidu-mall | DINOv2 | Recall@1: 49.21 |
| visual-place-recognition-on-gardens-point | DINOv2 | Recall@1: 71.50 |
| visual-place-recognition-on-hawkins | DINOv2 | Recall@1: 27.97 |
| visual-place-recognition-on-laurel-caverns | DINOv2 | Recall@1: 40.18 |
| visual-place-recognition-on-mid-atlantic | DINOv2 | Recall@1: 24.75 |
| visual-place-recognition-on-nardo-air | DINOv2 | Recall@1: 73.24 |
| visual-place-recognition-on-nardo-air-r | DINOv2 | Recall@1: 71.83 |
| visual-place-recognition-on-oxford-robotcar-4 | DINOv2 | Recall@1: 39.79 |
| visual-place-recognition-on-pittsburgh-30k | DINOv2 | Recall@1: 78.32 |
| visual-place-recognition-on-st-lucia | DINOv2 | Recall@1: 78.62 |
| visual-place-recognition-on-vp-air | DINOv2 | Recall@1: 45.23 |
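The image-retrieval and visual-place-recognition rows above report nearest-neighbour matching of frozen global features against a reference database. The sketch below shows one way such a Recall@1 number could be computed with cosine similarity; the feature matrices and labels are random placeholders, and the actual benchmarks use their own query/database splits and ground-truth definitions.

```python
# Sketch: Recall@1 via nearest-neighbour search over frozen features (placeholder data).
import torch
import torch.nn.functional as F

def recall_at_1(query_feats, query_labels, db_feats, db_labels):
    q = F.normalize(query_feats, dim=1)            # (Q, D) unit-norm query features
    d = F.normalize(db_feats, dim=1)               # (N, D) unit-norm database features
    sims = q @ d.T                                 # (Q, N) cosine similarities
    nn_idx = sims.argmax(dim=1)                    # index of the closest database entry
    hits = db_labels[nn_idx] == query_labels       # correct if the top match shares the label
    return hits.float().mean().item()

# Purely illustrative call with random features and labels.
Q, N, D = 100, 1000, 768
recall = recall_at_1(torch.randn(Q, D), torch.randint(0, 50, (Q,)),
                     torch.randn(N, D), torch.randint(0, 50, (N,)))
print(f"Recall@1: {recall:.3f}")
```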
