Image Captioning On Nocaps Out Of Domain
Evaluation Metrics
CIDEr
SPICE
Evaluation Results
Performance of each model on this benchmark
| Model | CIDEr | SPICE | Paper Title | Repository |
|---|---|---|---|---|
| PaLI | 126.67 | 15.49 | PaLI: A Jointly-Scaled Multilingual Language-Image Model | |
| GIT2, Single Model | 122.27 | 15.62 | GIT: A Generative Image-to-text Transformer for Vision and Language | |
| GIT, Single Model | 122.04 | 15.7 | GIT: A Generative Image-to-text Transformer for Vision and Language | |
| CoCa - Google Brain | 121.69 | 15.13 | - | - | 
| Microsoft Cognitive Services team | 110.14 | 13.74 | VIVO: Visual Vocabulary Pre-Training for Novel Object Captioning | - | 
| Single Model | 109.49 | 13.89 | SimVLM: Simple Visual Language Model Pretraining with Weak Supervision | |
| FudanFVL | 106.55 | 14.21 | - | - | 
| FudanWYZ | 103.75 | 13.75 | - | - | 
| Human | 91.62 | 14.21 | - | - | 
| firethehole | 88.54 | 13.87 | - | - | 
| IEDA-LAB | 87.51 | 12.52 | - | - | 
| icgp2ssi1_coco_si_0.02_5_test | 87.15 | 11.43 | - | - | 
| evertyhing | 85.18 | 11.18 | - | - | 
| vll@mk514 | 78.91 | 12.14 | - | - | 
| VinVL (Microsoft Cognitive Services + MSR) | 78.01 | 11.48 | VinVL: Revisiting Visual Representations in Vision-Language Models | |
| MD | 77.39 | 11.59 | - | - | 
| RCAL | 75.39 | 10.68 | - | - | 
| Oscar | 73.75 | 9.72 | - | - | 
| GRIT (zero-shot, no CBS, no VL pretraining, single model) | 72.6 | 11.1 | GRIT: Faster and Better Image captioning Transformer Using Dual Visual Features | |
| ViTCAP-CIDEr-136.7-ENC-DEC-ViTbfocal10-test-CBS | 72.13 | 11.53 | - | - | 