Zero-Shot Video Retrieval on MSVD

Evaluation Metrics

text-to-video R@1
text-to-video R@10
text-to-video R@5
video-to-text R@1
video-to-text R@10
video-to-text R@5
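
R@K (recall at rank K) is the percentage of queries whose correct match appears among the top K retrieved items. As a minimal sketch of how such numbers can be computed, the snippet below assumes a square similarity matrix in which text i is paired with video i. The helper names `recall_at_k` and `transpose` and the toy scores are hypothetical and are not taken from any listed paper. MSVD provides several captions per video, so the actual evaluation protocols differ in how they pair queries with targets.

```python
def recall_at_k(sim, k):
    """Percentage of queries whose matching item is ranked in the top k.

    sim[i][j] is the similarity between query i and candidate j.
    Assumption: the correct candidate for query i is candidate i.
    """
    hits = 0
    for i, row in enumerate(sim):
        # Zero-based rank: how many candidates score strictly higher than the match.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(sim)


def transpose(sim):
    """Swap queries and candidates, turning text-to-video into video-to-text."""
    return [list(col) for col in zip(*sim)]


# Toy scores, rows are texts and columns are videos (illustrative values only).
sim = [
    [0.9, 0.1, 0.3],
    [0.2, 0.4, 0.8],
    [0.5, 0.6, 0.7],
]

t2v_r1 = recall_at_k(sim, 1)             # text-to-video R@1
v2t_r1 = recall_at_k(transpose(sim), 1)  # video-to-text R@1
print(f"text-to-video R@1: {t2v_r1:.1f}")
print(f"video-to-text R@1: {v2t_r1:.1f}")
```

Transposing the matrix reuses the same ranking logic for both retrieval directions, which is why the two metric families above differ only in which side acts as the query.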

Evaluation Results

Performance of each model on this benchmark

| Model | text-to-video R@1 | text-to-video R@10 | text-to-video R@5 | video-to-text R@1 | video-to-text R@10 | video-to-text R@5 | Paper Title |
|---|---|---|---|---|---|---|---|
| InternVideo2-6B | 59.3 | 89.6 | 84.4 | 83.1 | 97.0 | 94.2 | InternVideo2: Scaling Foundation Models for Multimodal Video Understanding |
| InternVideo2-1B | 58.1 | 88.4 | 83.0 | 83.3 | 96.9 | 94.3 | InternVideo2: Scaling Foundation Models for Multimodal Video Understanding |
| VAST, HowToCaption-finetuned | 54.8 | 87.2 | 80.9 | - | - | - | HowToCaption: Prompting LLMs to Transform Video Annotations at Scale |
| LanguageBind(ViT-L/14) | 54.1 | 88.1 | 81.1 | 69.7 | 97.9 | 91.8 | LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment |
| LanguageBind(ViT-H/14) | 53.9 | 87.8 | 80.4 | 72.0 | 96.3 | 91.4 | LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment |
| vid-TLDR (UMT-L) | 50.0 | 85.5 | 77.6 | 75.7 | 95.1 | 90.0 | vid-TLDR: Training Free Token merging for Light-weight Video Transformer |
| UMT-L (ViT-L/16) | 49.0 | 84.7 | 76.9 | 74.5 | 92.8 | 89.7 | Unmasked Teacher: Towards Training-Efficient Video Foundation Models |
| HowToCaption | 44.5 | 82.1 | 73.3 | - | - | - | HowToCaption: Prompting LLMs to Transform Video Annotations at Scale |
| MILES | 44.4 | 87.0 | 76.2 | - | - | - | - |
| Y. Ge et al. | 43.6 | 84.9 | 74.9 | - | - | - | Bridging Video-text Retrieval with Multiple Choice Questions |
| InternVideo | 43.4 | - | - | 67.6 | - | - | InternVideo: General Video Foundation Models via Generative and Discriminative Learning |
| CLIP4Clip | 38.5 | 76.8 | 66.9 | - | - | - | CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval |
| LaT | 36.9 | 81.0 | 68.6 | 34.4 | 79.2 | 69.0 | - |
| SSML | 13.66 | 47.74 | 35.7 | - | - | - | Noise Estimation Using Density Estimation for Self-Supervised Multimodal Learning |