Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick Labatut, Armand Joulin, Piotr Bojanowski

Abstract

The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision. Such models could greatly simplify the use of images in any system by producing all-purpose visual features, i.e., features that work across image distributions and tasks without finetuning. This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources. We revisit existing approaches and combine different techniques to scale our pretraining in terms of data and model size; most of the technical contributions aim at accelerating and stabilizing the training at scale. In terms of data, we propose an automatic pipeline to build a dedicated, diverse, and curated image dataset, instead of the uncurated data typically used in the self-supervised literature. In terms of models, we train a Vision Transformer (ViT, Dosovitskiy et al., 2020) with 1B parameters and distill it into a series of smaller models that surpass the best available all-purpose features, OpenCLIP (Ilharco et al., 2021), on most benchmarks at the image and pixel levels.
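As a concrete illustration of the "frozen, all-purpose features" claim, the sketch below loads a distilled DINOv2 backbone and extracts an image-level embedding without any fine-tuning. It assumes the torch.hub entry points exposed by the official facebookresearch/dinov2 repository (e.g. `dinov2_vits14` for the distilled ViT-S/14); the input tensor is a placeholder.

```python
import torch

# Load a distilled DINOv2 backbone via the official repo's torch.hub entry point
# (assumed name: 'dinov2_vits14', i.e. the distilled ViT-S/14).
model = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
model.eval()

# Placeholder input; DINOv2 uses 14x14 patches, so H and W should be multiples of 14.
x = torch.randn(1, 3, 224, 224)

with torch.no_grad():
    feats = model(x)  # image-level embedding (CLS token); 384-dim for ViT-S/14

print(feats.shape)  # torch.Size([1, 384])
```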
Code Repositories

| Repository | Framework | Notes |
|---|---|---|
| roboflow/rf-detr | pytorch | Mentioned in GitHub |
| PaddlePaddle/PASSL | paddle | |
| beneroth13/dinov2 | pytorch | Mentioned in GitHub |
| pwc-1/Paper-8/tree/main/dinov2 | mindspore | |
| mohammedsb/dinov2formedical | pytorch | Mentioned in GitHub |
| bespontaneous/proteus-pytorch | pytorch | Mentioned in GitHub |
| buyeah1109/finc | pytorch | Mentioned in GitHub |
| gorkaydemir/DINOSAUR | pytorch | Mentioned in GitHub |
| marrlab/dinobloom | pytorch | Mentioned in GitHub |
| fabio-sim/Depth-Anything-ONNX | pytorch | Mentioned in GitHub |
| buyeah1109/KEN | pytorch | Mentioned in GitHub |
| zhu-xlab/softcon | pytorch | Mentioned in GitHub |
| huggingface/transformers | pytorch | Mentioned in GitHub; see the loading sketch after this table |
| facebookresearch/dinov2 | pytorch | Official; mentioned in GitHub |
| ByungKwanLee/Causal-Unsupervised-Segmentation | pytorch | Mentioned in GitHub |
| JHKim-snu/PGA | pytorch | Mentioned in GitHub |
| BurguerJohn/torch-felix | pytorch | |
| open-edge-platform/geti | pytorch | Mentioned in GitHub |
| seatizendoi/dinovdeau | pytorch | Mentioned in GitHub |
| facebookresearch/highrescanopyheight | pytorch | Mentioned in GitHub |
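Since the huggingface/transformers integration listed above is one of the simplest entry points, here is a minimal sketch of loading DINOv2 through it. The checkpoint name `facebook/dinov2-base` and the example image path are assumptions, not taken from this page.

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

# Assumed checkpoint name on the Hugging Face Hub; adjust to the model size you need.
processor = AutoImageProcessor.from_pretrained("facebook/dinov2-base")
model = AutoModel.from_pretrained("facebook/dinov2-base")
model.eval()

image = Image.open("example.jpg").convert("RGB")  # placeholder image path
inputs = processor(images=image, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

cls_embedding = outputs.last_hidden_state[:, 0]   # image-level feature (CLS token)
patch_tokens = outputs.last_hidden_state[:, 1:]   # per-patch features for dense tasks
print(cls_embedding.shape, patch_tokens.shape)
```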
Benchmarks

| Benchmark | Method | Metrics |
|---|---|---|
| depth-estimation-on-nyu-depth-v2 | DINOv2 (ViT-g/14 frozen, w/ DPT decoder) | RMS: 0.279 |
| domain-generalization-on-imagenet-c | DINOv2 (ViT-S/14, frozen model, linear eval) | Number of params: 21M mean Corruption Error (mCE): 54.4 |
| domain-generalization-on-imagenet-c | DINOv2 (ViT-g/14, frozen model, linear eval) | Number of params: 1100M mean Corruption Error (mCE): 28.2 |
| domain-generalization-on-imagenet-c | DINOv2 (ViT-B/14, frozen model, linear eval) | Number of params: 85M mean Corruption Error (mCE): 42.7 |
| domain-generalization-on-imagenet-c | DINOv2 (ViT-L/14, frozen model, linear eval) | Number of params: 307M mean Corruption Error (mCE): 31.5 |
| fine-grained-image-classification-on-oxford-1 | DINOv2 (ViT-g/14, frozen model, linear eval) | Accuracy: 96.7 |
| image-classification-on-cifar-10 | DINOv2 (ViT-g/14, frozen model, linear eval) | Percentage correct: 99.5 |
| image-retrieval-on-amstertime | DINOv2 distilled (ViT-S/14 frozen) | mAP: 43.5 |
| image-retrieval-on-amstertime | DINOv2 (ViT-g/14 frozen) | mAP: 46.7 |
| image-retrieval-on-amstertime | DINOv2 distilled (ViT-B/14 frozen) | mAP: 45.6 |
| image-retrieval-on-amstertime | DINOv2 distilled (ViT-L/14 frozen) | mAP: 50.0 |
| monocular-depth-estimation-on-kitti-eigen | DINOv2 (ViT-g/14 frozen, w/ DPT decoder) | Delta < 1.25: 0.968 Delta < 1.25^2: 0.997 Delta < 1.25^3: 0.9993 RMSE: 2.1128 RMSE log: 0.0882 Sq Rel: 0.1797 absolute relative error: 0.0652 |
| monocular-depth-estimation-on-nyu-depth-v2 | DINOv2 (ViT-g/14 frozen, w/ DPT decoder) | Delta < 1.25: 0.9497 Delta < 1.25^2: 0.996 Delta < 1.25^3: 0.9994 RMSE: 0.279 absolute relative error: 0.0907 log 10: 0.0371 |
| self-supervised-image-classification-on | DINOv2 distilled (ViT-S/14) | Number of Params: 21M Top 1 Accuracy: 81.1% |
| self-supervised-image-classification-on | DINOv2 distilled (ViT-B/14) | Number of Params: 85M Top 1 Accuracy: 84.5% |
| self-supervised-image-classification-on | DINOv2 (ViT-g/14 @448) | Number of Params: 1100M Top 1 Accuracy: 86.7% |
| self-supervised-image-classification-on | DINOv2 distilled (ViT-L/14) | Number of Params: 307M Top 1 Accuracy: 86.3% |
| self-supervised-image-classification-on | DINOv2 (ViT-g/14) | Number of Params: 1100M Top 1 Accuracy: 86.5% |
| self-supervised-image-classification-on-1 | DINOv2 (ViT-g/14, 448) | Number of Params: 1100M Top 1 Accuracy: 88.9% |
| self-supervised-image-classification-on-1 | DINOv2 (ViT-g/14) | Number of Params: 1100M Top 1 Accuracy: 88.5% |
| semantic-segmentation-on-ade20k | DINOv2 (ViT-g/14 frozen model, w/ ViT-Adapter + Mask2former) | Params (M): 1080 Validation mIoU: 60.2 |
| visual-place-recognition-on-17-places | DINOv2 | Recall@1: 61.82 |
| visual-place-recognition-on-baidu-mall | DINOv2 | Recall@1: 49.21 |
| visual-place-recognition-on-gardens-point | DINOv2 | Recall@1: 71.50 |
| visual-place-recognition-on-hawkins | DINOv2 | Recall@1: 27.97 |
| visual-place-recognition-on-laurel-caverns | DINOv2 | Recall@1: 40.18 |
| visual-place-recognition-on-mid-atlantic | DINOv2 | Recall@1: 24.75 |
| visual-place-recognition-on-nardo-air | DINOv2 | Recall@1: 73.24 |
| visual-place-recognition-on-nardo-air-r | DINOv2 | Recall@1: 71.83 |
| visual-place-recognition-on-oxford-robotcar-4 | DINOv2 | Recall@1: 39.79 |
| visual-place-recognition-on-pittsburgh-30k | DINOv2 | Recall@1: 78.32 |
| visual-place-recognition-on-st-lucia | DINOv2 | Recall@1: 78.62 |
| visual-place-recognition-on-vp-air | DINOv2 | Recall@1: 45.23 |
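Many of the classification rows above use a "frozen model, linear eval" protocol: the backbone is kept frozen and only a linear classifier is trained on its output features. The sketch below illustrates that setup under stated assumptions (torch.hub backbone loading, a 1000-class ImageNet-style head, placeholder training data); it is not the paper's exact evaluation code.

```python
import torch
import torch.nn as nn

# Frozen backbone (assumed torch.hub entry point for the distilled ViT-B/14).
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitb14")
backbone.eval()
for p in backbone.parameters():
    p.requires_grad = False

# Only the linear head is trained; 768 is the ViT-B/14 embedding dim, 1000 classes as in ImageNet-1k.
head = nn.Linear(768, 1000)
optimizer = torch.optim.SGD(head.parameters(), lr=1e-3, momentum=0.9)
criterion = nn.CrossEntropyLoss()

def train_step(images: torch.Tensor, labels: torch.Tensor) -> float:
    """One linear-probe step: features from the frozen backbone, gradients only in the head."""
    with torch.no_grad():
        feats = backbone(images)  # (B, 768) image-level embeddings
    logits = head(feats)
    loss = criterion(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```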