| CAR-FT (CLIP, ViT-L/14@336px) | 81.5 | Context-Aware Robust Fine-Tuning | |
| CAFormer-B36 (IN-21K, 384) | 79.5 | MetaFormer Baselines for Vision | |
| FAN-Hybrid-L (IN-21K, 384) | 74.5 | Understanding The Robustness in Vision Transformers | |
| ConvFormer-B36 (IN-21K, 384) | 73.5 | MetaFormer Baselines for Vision | |
| CAFormer-B36 (IN-21K) | 69.4 | MetaFormer Baselines for Vision | |
| ConvFormer-B36 (IN-21K) | 63.3 | MetaFormer Baselines for Vision | |
| Pyramid Adversarial Training Improves ViT (IN-21K) | 62.44 | Pyramid Adversarial Training Improves ViT Performance | |
| CAFormer-B36 (384) | 61.9 | MetaFormer Baselines for Vision | |
| TransNeXt-Base (IN-1K supervised, 384) | 61.6 | TransNeXt: Robust Foveal Visual Perception for Vision Transformers | |
| TransNeXt-Small (IN-1K supervised, 384) | 58.3 | TransNeXt: Robust Foveal Visual Perception for Vision Transformers | |
| ConvFormer-B36 (384) | 55.3 | MetaFormer Baselines for Vision | |
| TransNeXt-Base (IN-1K supervised, 224) | 50.6 | TransNeXt: Robust Foveal Visual Perception for Vision Transformers | |