Video Captioning on MSR-VTT

Evaluation Metrics

BLEU-4

CIDEr

METEOR

ROUGE-L

Results

Performance of each model on this benchmark.

Model	BLEU-4	CIDEr	METEOR	ROUGE-L	Paper Title	Repository
mPLUG-2	57.8	80.0	34.9	70.1	mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video
VAST	56.7	78.0	-	-	VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset
GIT2	54.8	75.9	33.1	68.2	GIT: A Generative Image-to-text Transformer for Vision and Language
VLAB	54.6	74.9	33.4	68.3	VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending	-
COSA	53.7	74.7	-	-	COSA: Concatenated Sample Pretrained Vision-Language Foundation Model
VALOR	54.4	74.0	32.9	68.0	VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset
MaMMUT (ours)	-	73.6	-	-	MaMMUT: A Simple Architecture for Joint Learning for MultiModal Tasks
VideoCoCa	53.8	73.2	-	68.0	VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners	-
RTQ	49.6	69.3	-	66.1	RTQ: Rethinking Video-language Understanding Based on Image-text Model
HowToCaption	49.8	65.3	32.2	66.3	HowToCaption: Prompting LLMs to Transform Video Annotations at Scale
HiTeA	49.2	65.1	30.7	65.0	HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training	-
Vid2Seq	-	64.6	30.8	-	Vid2Seq: Large-Scale Pretraining of a Visual Language Model for Dense Video Captioning
TextKG	46.6	60.8	30.5	64.8	Text with Knowledge Graph Augmented Transformer for Video Captioning	-
IcoCap (ViT-B/16)	47.0	60.2	31.1	64.9	IcoCap: Improving Video Captioning by Compounding Images	-
MV-GPT	48.9	60.0	38.7	64.0	End-to-end Generative Pretraining for Multimodal Video Captioning	-
IcoCap (ViT-B/32)	46.1	59.1	30.3	64.3	IcoCap: Improving Video Captioning by Compounding Images	-
CLIP-DCD	48.2	58.7	31.3	64.8	CLIP Meets Video Captioning: Concept-Aware Representation Learning Does Matter
VIOLETv2	-	58	-	-	An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling
CoCap (ViT/L14)	44.4	57.2	30.3	63.4	Accurate and Fast Compressed Video Captioning
VASTA (Vatex-backbone)	44.21	56.08	30.24	62.9	Diverse Video Captioning by Adaptive Spatio-temporal Attention
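
The rows above are tab-separated, with "-" marking a score that was not reported. For re-ranking the table by a metric other than CIDEr, the following is a minimal sketch, not part of the original page. It assumes the score columns follow the order of the metric list (BLEU-4, CIDEr, METEOR, ROUGE-L). The helper names `parse_score`, `load_rows`, and `rank_by` are hypothetical, and only a few rows are copied in as sample data.

```python
import csv
import io

# A few rows copied from the table above (tab-separated).
# Column names are an assumption based on the metric list order.
SAMPLE = """\
Model\tBLEU-4\tCIDEr\tMETEOR\tROUGE-L
mPLUG-2\t57.8\t80.0\t34.9\t70.1
VAST\t56.7\t78.0\t-\t-
MV-GPT\t48.9\t60.0\t38.7\t64.0
Vid2Seq\t-\t64.6\t30.8\t-
"""


def parse_score(cell):
    """Convert a score cell to float; '-' or an empty cell means not reported."""
    cell = cell.strip()
    return None if cell in ("", "-") else float(cell)


def load_rows(text):
    """Parse tab-separated leaderboard text into a list of dicts."""
    reader = csv.DictReader(io.StringIO(text), delimiter="\t")
    return [
        {key: (value if key == "Model" else parse_score(value))
         for key, value in row.items()}
        for row in reader
    ]


def rank_by(rows, metric):
    """Return rows that report `metric`, sorted from highest to lowest score."""
    scored = [row for row in rows if row[metric] is not None]
    return sorted(scored, key=lambda row: row[metric], reverse=True)


rows = load_rows(SAMPLE)
for row in rank_by(rows, "METEOR"):
    print(row["Model"], row["METEOR"])
```

Rows without a value for the chosen metric are dropped rather than ranked last, so a model is never compared on a score it did not report.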
