| DINOv2+reg (ViT-g/14) | 1100M | 87.1% | Vision Transformers Need Registers | |
| DINOv2 distilled (ViT-L/14) | 307M | 86.3% | DINOv2: Learning Robust Visual Features without Supervision | |
| DINOv2 distilled (ViT-B/14) | 85M | 84.5% | DINOv2: Learning Robust Visual Features without Supervision | |
| iBOT (ViT-L/16) (IN22k) | 307M | 82.3% | iBOT: Image BERT Pre-Training with Online Tokenizer | |
| DINOv2 distilled (ViT-S/14) | 21M | 81.1% | DINOv2: Learning Robust Visual Features without Supervision | |