| IcoCap (ViT-B/16) | 47.0 | 60.2 | 31.1 | 64.9 | IcoCap: Improving Video Captioning by Compounding Images | - |
| IcoCap (ViT-B/32) | 46.1 | 59.1 | 30.3 | 64.3 | IcoCap: Improving Video Captioning by Compounding Images | - |
| CoCap (ViT-L/14) | 44.4 | 57.2 | 30.3 | 63.4 | Accurate and Fast Compressed Video Captioning | |
| VASTA (Vatex-backbone) | 44.21 | 56.08 | 30.24 | 62.9 | Diverse Video Captioning by Adaptive Spatio-temporal Attention | |