Mahmoud Assran, Mathilde Caron, Ishan Misra, Piotr Bojanowski, Florian Bordes, Pascal Vincent, Armand Joulin, Michael Rabbat, Nicolas Ballas

Abstract
We propose Masked Siamese Networks (MSN), a self-supervised learning framework for learning image representations. Our approach matches the representation of an image view containing randomly masked patches to the representation of the original unmasked image. This self-supervised pre-training strategy is particularly scalable when applied to Vision Transformers since only the unmasked patches are processed by the network. As a result, MSNs improve the scalability of joint-embedding architectures, while producing representations of a high semantic level that perform competitively on low-shot image classification. For instance, on ImageNet-1K, with only 5,000 annotated images, our base MSN model achieves 72.4% top-1 accuracy, and with 1% of ImageNet-1K labels, we achieve 75.7% top-1 accuracy, setting a new state-of-the-art for self-supervised learning on this benchmark. Our code is publicly available.
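To make the abstract's two ingredients concrete, here is a minimal NumPy sketch of (1) random patch masking, where only the kept patches are processed by the anchor encoder, and (2) a prototype-matching objective of the kind MSN uses, where the masked view's soft assignment over prototypes is pulled toward the unmasked view's. All shapes, the keep ratio, and the random "representations" are illustrative stand-ins, not the paper's actual settings or encoder outputs:

```python
import numpy as np

rng = np.random.default_rng(0)

def random_patch_mask(num_patches, keep_ratio):
    """Return the indices of patches kept after random masking."""
    num_keep = max(1, int(num_patches * keep_ratio))
    return rng.permutation(num_patches)[:num_keep]

def soft_assign(z, prototypes, tau=0.1):
    """Softmax over cosine similarities of a representation to prototypes."""
    z = z / np.linalg.norm(z)
    p = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    logits = p @ z / tau
    e = np.exp(logits - logits.max())
    return e / e.sum()

# A 224x224 image with 16x16 patches gives 196 patches (ViT-B/16 layout).
num_patches = (224 // 16) ** 2
kept = random_patch_mask(num_patches, keep_ratio=0.3)
print(len(kept), "of", num_patches, "patches go through the anchor encoder")

# Toy representations: in MSN these come from the ViT encoder; here they
# are random vectors just to show the shape of the matching objective.
prototypes = rng.normal(size=(8, 16))  # a small bank of prototypes
z_target = rng.normal(size=16)         # unmasked (target) view
z_anchor = rng.normal(size=16)         # masked (anchor) view

target = soft_assign(z_target, prototypes)
anchor = soft_assign(z_anchor, prototypes)
# Cross-entropy pulls the masked view's assignment toward the target's.
loss = -(target * np.log(anchor + 1e-12)).sum()
```

Because only the kept indices are embedded and fed to the transformer, the anchor forward pass scales roughly with the keep ratio, which is where the scalability claim for ViTs comes from.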
Benchmarks
| Benchmark | Model | Params | Top-1 Accuracy |
|---|---|---|---|
| Self-Supervised Image Classification on ImageNet | MSN (ViT-L/7) | 306M | 80.7% |
| Semi-Supervised Image Classification on ImageNet (1% labels) | MSN (ViT-B/4) | — | 75.7% |