Image Captioning on COCO Captions

Evaluation Metrics

BLEU-1
BLEU-4
CIDEr
METEOR
ROUGE-L
SPICE
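
As a rough illustration of the simplest metric in this list, the sketch below computes a sentence-level BLEU-1 score: clipped unigram precision multiplied by a brevity penalty. It assumes lowercased whitespace tokenization, and the function name `bleu1` is hypothetical. Leaderboard figures are corpus-level scores, usually scaled to 0-100, and official evaluation tooling may tokenize and aggregate differently, so this is not expected to reproduce the numbers reported on this page.

```python
import math
from collections import Counter
from typing import List


def bleu1(candidate: str, references: List[str]) -> float:
    """Sentence-level BLEU-1 (hypothetical helper, whitespace tokenization assumed)."""
    cand = candidate.lower().split()
    refs = [r.lower().split() for r in references]
    if not cand or not refs:
        return 0.0

    # Clip each candidate word count by its maximum count in any single reference.
    cand_counts = Counter(cand)
    max_ref_counts = Counter()
    for ref in refs:
        for word, count in Counter(ref).items():
            max_ref_counts[word] = max(max_ref_counts[word], count)
    clipped = sum(min(count, max_ref_counts[word])
                  for word, count in cand_counts.items())
    precision = clipped / len(cand)

    # Brevity penalty: compare against the reference length closest to the
    # candidate length (ties resolved toward the shorter reference).
    closest_ref_len = min((len(r) for r in refs),
                          key=lambda n: (abs(n - len(cand)), n))
    if len(cand) > closest_ref_len:
        bp = 1.0
    else:
        bp = math.exp(1 - closest_ref_len / len(cand))
    return bp * precision
```

A caption whose words all appear in the references and whose length matches a reference scores 1.0. Repeated words are clipped, and captions shorter than the closest reference are penalized.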

Results

Performance of each model on this benchmark

| Model | BLEU-1 | BLEU-4 | CIDEr | METEOR | ROUGE-L | SPICE | Paper Title | Repository |
|---|---|---|---|---|---|---|---|---|
| mPLUG | - | 46.5 | 155.1 | 32.0 | - | 26.0 | mPLUG: Effective and Efficient Vision-Language Learning by Cross-modal Skip-connections | |
| OFA | - | 44.9 | 154.9 | 32.5 | - | 26.6 | OFA: Unifying Architectures, Tasks, and Modalities Through a Simple Sequence-to-Sequence Learning Framework | |
| VALOR | - | - | 152.5 | - | - | 25.7 | VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | |
| GIT | - | 44.1 | 151.1 | 32.2 | - | 26.3 | GIT: A Generative Image-to-text Transformer for Vision and Language | |
| VAST | - | - | 149.0 | - | - | 27.0 | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | |
| BLIP-2 ViT-G OPT 2.7B (zero-shot) | - | 43.7 | 145.8 | - | - | - | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | |
| LEMON | - | 42.6 | 145.5 | 31.4 | - | 25.5 | Scaling Up Vision-Language Pre-training for Image Captioning | - |
| BLIP-2 ViT-G OPT 6.7B (zero-shot) | - | 43.5 | 145.2 | - | - | - | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | |
| BLIP-2 ViT-G FlanT5 XL (zero-shot) | - | 42.4 | 144.5 | - | - | - | BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | |
| GRIT (No VL pretraining - base) | 84.2 | 42.4 | 144.2 | 30.6 | 60.7 | 24.3 | GRIT: Faster and Better Image captioning Transformer Using Dual Visual Features | |
| ExpansionNet v2 (No VL pretraining) | 83.5 | 42.7 | 143.7 | 30.6 | 61.1 | 24.7 | Exploiting Multiple Sequence Lengths in Fast End to End Training for Image Captioning | |
| CoCa | - | 40.9 | 143.6 | 33.9 | - | 24.7 | CoCa: Contrastive Captioners are Image-Text Foundation Models | |
| SimVLM | - | 40.6 | 143.3 | 33.4 | - | 25.4 | SimVLM: Simple Visual Language Model Pretraining with Weak Supervision | |
| Xmodal-Ctx + OSCAR | - | 41.3 | 142.2 | - | - | 24.9 | Beyond a Pre-Trained Object Detector: Cross-Modal Textual and Visual Context for Image Captioning | |
| Prompt Tuning | - | 41.81 | 141.4 | 31.51 | - | 24.42 | Prompt Tuning for Generative Multimodal Pretrained Models | |
| VinVL | - | 41.0 | 140.9 | 31.1 | - | 25.2 | VinVL: Revisiting Visual Representations in Vision-Language Models | |
| X-VLM (base) | - | 41.3 | 140.8 | - | - | - | Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts | |
| Oscar | - | 41.7 | 140 | 30.6 | - | 24.5 | Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks | |
| Xmodal-Ctx | 83.4 | 41.4 | 139.9 | 30.4 | 60.4 | 24.0 | Beyond a Pre-Trained Object Detector: Cross-Modal Textual and Visual Context for Image Captioning | |
| Prismer | - | 40.4 | 136.5 | 31.4 | - | 24.4 | Prismer: A Vision-Language Model with Multi-Task Experts | |