
Abstract
The recent breakthroughs in natural language processing for model pretraining on large quantities of data have opened the way for similar foundation models in computer vision. These models could greatly simplify the use of images in any system by producing all-purpose visual features, i.e., features that work across image distributions and tasks without finetuning. This work shows that existing pretraining methods, especially self-supervised methods, can produce such features if trained on enough curated data from diverse sources. We revisit existing approaches and combine different techniques to scale our pretraining in terms of data and model size. Most of the technical contributions aim at accelerating and stabilizing the training at scale. In terms of data, we propose an automatic pipeline to build a dedicated, diverse, and curated image dataset instead of uncurated data, as typically done in the self-supervised literature. In terms of models, we train a ViT model (Dosovitskiy et al., 2020) with 1B parameters and distill it into a series of smaller models that surpass the best available all-purpose features, OpenCLIP (Ilharco et al., 2021), on most of the benchmarks at image and pixel levels.
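Most of the image-level results in the benchmark table below come from a frozen-backbone protocol: the pretrained DINOv2 encoder is kept fixed and only a linear classifier is trained on top of its features. The sketch below illustrates that setup, assuming the models are loaded through the torch.hub entry points published in the facebookresearch/dinov2 repository (dinov2_vits14, dinov2_vitb14, dinov2_vitl14, dinov2_vitg14); the 1000-class head, the dummy batch, and the training step are illustrative, not the paper's exact evaluation recipe.

```python
import torch
import torch.nn as nn

# Load a distilled DINOv2 backbone from torch.hub (entry points published
# in facebookresearch/dinov2). The backbone stays frozen throughout.
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14")
backbone.eval()
for p in backbone.parameters():
    p.requires_grad = False

# Illustrative linear head for an ImageNet-style 1000-class probe.
# embed_dim is 384 for ViT-S/14; read it from the model to stay generic.
head = nn.Linear(backbone.embed_dim, 1000)

# Dummy batch of 224x224 crops (224 is divisible by the 14-pixel patch size).
images = torch.randn(8, 3, 224, 224)
labels = torch.randint(0, 1000, (8,))

with torch.no_grad():
    feats = backbone(images)  # image-level features, shape (8, embed_dim)

logits = head(feats)
loss = nn.functional.cross_entropy(logits, labels)
loss.backward()  # gradients reach only the linear head; the backbone is untouched
```

In this reading, "frozen model, linear eval" in the table means exactly this division of labor: all of the reported capacity (21M to 1100M parameters) sits in the fixed backbone, and the only task-specific learning happens in the linear layer.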
Code Repositories
facebookresearch/dinov2 (https://github.com/facebookresearch/dinov2)
Benchmarks
| Benchmark | Model / Evaluation Setting | Metrics |
|---|---|---|
| depth-estimation-on-nyu-depth-v2 | DINOv2 (ViT-g/14 frozen, w/ DPT decoder) | RMS: 0.279 |
| domain-generalization-on-imagenet-c | DINOv2 (ViT-S/14, frozen model, linear eval) | Number of params: 21M; mean Corruption Error (mCE): 54.4 |
| domain-generalization-on-imagenet-c | DINOv2 (ViT-g/14, frozen model, linear eval) | Number of params: 1100M; mean Corruption Error (mCE): 28.2 |
| domain-generalization-on-imagenet-c | DINOv2 (ViT-B/14, frozen model, linear eval) | Number of params: 85M; mean Corruption Error (mCE): 42.7 |
| domain-generalization-on-imagenet-c | DINOv2 (ViT-L/14, frozen model, linear eval) | Number of params: 307M; mean Corruption Error (mCE): 31.5 |
| fine-grained-image-classification-on-oxford-1 | DINOv2 (ViT-g/14, frozen model, linear eval) | Accuracy: 96.7 |
| image-classification-on-cifar-10 | DINOv2 (ViT-g/14, frozen model, linear eval) | Percentage correct: 99.5 |
| image-retrieval-on-amstertime | DINOv2 distilled (ViT-S/14 frozen) | mAP: 43.5 |
| image-retrieval-on-amstertime | DINOv2 (ViT-g/14 frozen) | mAP: 46.7 |
| image-retrieval-on-amstertime | DINOv2 distilled (ViT-B/14 frozen) | mAP: 45.6 |
| image-retrieval-on-amstertime | DINOv2 distilled (ViT-L/14 frozen) | mAP: 50.0 |
| monocular-depth-estimation-on-kitti-eigen | DINOv2 (ViT-g/14 frozen, w/ DPT decoder) | Delta < 1.25: 0.968; Delta < 1.25^2: 0.997; Delta < 1.25^3: 0.9993; RMSE: 2.1128; RMSE log: 0.0882; Sq Rel: 0.1797; absolute relative error: 0.0652 |
| monocular-depth-estimation-on-nyu-depth-v2 | DINOv2 (ViT-g/14 frozen, w/ DPT decoder) | Delta < 1.25: 0.9497; Delta < 1.25^2: 0.996; Delta < 1.25^3: 0.9994; RMSE: 0.279; absolute relative error: 0.0907; log 10: 0.0371 |
| self-supervised-image-classification-on | DINOv2 distilled (ViT-S/14) | Number of Params: 21M; Top 1 Accuracy: 81.1% |
| self-supervised-image-classification-on | DINOv2 distilled (ViT-B/14) | Number of Params: 85M; Top 1 Accuracy: 84.5% |
| self-supervised-image-classification-on | DINOv2 (ViT-g/14 @448) | Number of Params: 1100M; Top 1 Accuracy: 86.7% |
| self-supervised-image-classification-on | DINOv2 distilled (ViT-L/14) | Number of Params: 307M; Top 1 Accuracy: 86.3% |
| self-supervised-image-classification-on | DINOv2 (ViT-g/14) | Number of Params: 1100M; Top 1 Accuracy: 86.5% |
| self-supervised-image-classification-on-1 | DINOv2 (ViT-g/14, 448) | Number of Params: 1100M; Top 1 Accuracy: 88.9% |
| self-supervised-image-classification-on-1 | DINOv2 (ViT-g/14) | Number of Params: 1100M; Top 1 Accuracy: 88.5% |
| semantic-segmentation-on-ade20k | DINOv2 (ViT-g/14 frozen model, w/ ViT-Adapter + Mask2former) | Params (M): 1080; Validation mIoU: 60.2 |
| visual-place-recognition-on-17-places | DINOv2 | Recall@1: 61.82 |
| visual-place-recognition-on-baidu-mall | DINOv2 | Recall@1: 49.21 |
| visual-place-recognition-on-gardens-point | DINOv2 | Recall@1: 71.50 |
| visual-place-recognition-on-hawkins | DINOv2 | Recall@1: 27.97 |
| visual-place-recognition-on-laurel-caverns | DINOv2 | Recall@1: 40.18 |
| visual-place-recognition-on-mid-atlantic | DINOv2 | Recall@1: 24.75 |
| visual-place-recognition-on-nardo-air | DINOv2 | Recall@1: 73.24 |
| visual-place-recognition-on-nardo-air-r | DINOv2 | Recall@1: 71.83 |
| visual-place-recognition-on-oxford-robotcar-4 | DINOv2 | Recall@1: 39.79 |
| visual-place-recognition-on-pittsburgh-30k | DINOv2 | Recall@1: 78.32 |
| visual-place-recognition-on-st-lucia | DINOv2 | Recall@1: 78.62 |
| visual-place-recognition-on-vp-air | DINOv2 | Recall@1: 45.23 |
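The dense rows in the table (depth estimation with a DPT decoder, ADE20K segmentation with ViT-Adapter + Mask2former) likewise keep the backbone frozen and train only a decoder on its patch features. Below is a minimal sketch of extracting such spatial feature maps, assuming the get_intermediate_layers method exposed by the public dinov2 hub models; the input resolution, the number of layers taken, and the decoder wiring are illustrative assumptions, not the exact configuration behind the numbers above.

```python
import torch

# Frozen backbone; its patch-level features are what dense decoders consume.
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vitl14")
backbone.eval()

# 518 = 37 * 14, so the 14-pixel patches tile the input exactly.
images = torch.randn(2, 3, 518, 518)

with torch.no_grad():
    # Take the last 4 transformer blocks as spatial maps; with reshape=True each
    # tensor has shape (batch, embed_dim, H/14, W/14) = (2, 1024, 37, 37) for ViT-L/14.
    feats = backbone.get_intermediate_layers(images, n=4, reshape=True)

for f in feats:
    print(f.shape)  # torch.Size([2, 1024, 37, 37])
# A decoder such as DPT would fuse these multi-depth maps into a dense prediction.
```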