| MAE (ViT-H/14, 448) | 632M | 87.8% | Masked Autoencoders Are Scalable Vision Learners | |
| MAE + AugSub finetune (ViT-H/14) | 632M | 87.2% | Masking meets Supervision: A Strong Learning Alliance | |
| SimMIM (SwinV2-H, 512) | 658M | 87.1% | SimMIM: A Simple Framework for Masked Image Modeling | |
| TEC_MAE (ViT-L/16, 224) | - | 86.5% | Towards Sustainable Self-supervised Learning | |
| MAE + AugSub finetune (ViT-L/16) | 304M | 86.1% | Masking meets Supervision: A Strong Learning Alliance | |