| Model | FLOPs | Params | | Top-1 Accuracy | Paper | |
| --- | --- | --- | --- | --- | --- | --- |
| Meta Pseudo Labels (EfficientNet-L2) | 95040G | 480M | | 90.2% | Meta Pseudo Labels | |
| InternImage-DCNv3-G (M3I Pre-training) | - | 3000M | - | 90.1% | InternImage: Exploring Large-Scale Vision Foundation Models with Deformable Convolutions | |
| RevCol-H | - | 2158M | - | 90.0% | Reversible Column Networks | |
| Meta Pseudo Labels (EfficientNet-B6-Wide) | - | 390M | - | 90.0% | Meta Pseudo Labels | |
| M3I Pre-training (InternImage-H) | - | - | - | 89.6% | Towards All-in-one Pre-training via Maximizing Multi-modal Mutual Information | |
| ViT-L/16 (384res, distilled from ViT-22B) | - | 307M | - | 89.6% | Scaling Vision Transformers to 22 Billion Parameters | |
| MaxViT-XL (512res, JFT) | - | - | - | 89.53% | MaxViT: Multi-Axis Vision Transformer | |