| GRIT (No VL pretraining - base) | 84.2 | 42.4 | 144.2 | 30.6 | 60.7 | 24.3 | GRIT: Faster and Better Image captioning Transformer Using Dual Visual Features | |
| ExpansionNet v2 (No VL pretraining) | 83.5 | 42.7 | 143.7 | 30.6 | 61.1 | 24.7 | Exploiting Multiple Sequence Lengths in Fast End to End Training for Image Captioning | |
| Prompt Tuning | - | 41.81 | 141.4 | 31.51 | - | 24.42 | Prompt Tuning for Generative Multimodal Pretrained Models | |