Zero-Shot Video Retrieval on MSR-VTT

Evaluation Metrics

text-to-video Median Rank
text-to-video R@1
text-to-video R@10
text-to-video R@5
video-to-text Median Rank
video-to-text R@1
video-to-text R@10
video-to-text R@5
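The metrics above are all derived from a query-to-candidate similarity matrix: Recall@K is the fraction of queries whose correct match ranks in the top K, and Median Rank is the median position of the correct match. A minimal sketch in NumPy (the function name and the assumption that query i matches candidate i are mine, not from this page):

```python
import numpy as np

def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Compute Recall@K and Median Rank from a similarity matrix.

    sim[i, j] is the similarity between query i and candidate j;
    the ground-truth match for query i is assumed to be candidate i.
    """
    # Sort candidates for each query from most to least similar.
    order = np.argsort(-sim, axis=1)
    # Rank (1 = best) at which the correct candidate appears per query.
    ranks = np.argwhere(order == np.arange(len(sim))[:, None])[:, 1] + 1
    metrics = {f"R@{k}": 100.0 * np.mean(ranks <= k) for k in ks}
    metrics["MdR"] = float(np.median(ranks))
    return metrics
```

For text-to-video retrieval, `sim` would hold text-query rows against video-candidate columns; transposing it gives the video-to-text direction.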

Benchmark Results

Performance of each model on this benchmark (t2v = text-to-video, v2t = video-to-text, MdR = Median Rank):

| Model | t2v MdR | t2v R@1 | t2v R@10 | t2v R@5 | v2t MdR | v2t R@1 | v2t R@10 | v2t R@5 | Paper Title |
|---|---|---|---|---|---|---|---|---|---|
| InternVideo2-6B | - | 55.9 | 85.1 | 78.3 | - | 53.7 | 84.1 | 77.5 | InternVideo2: Scaling Foundation Models for Multimodal Video Understanding |
| GRAM | - | 54.8 | 83.9 | - | - | 52.9 | 82.9 | - | Gramian Multimodal Representation Learning and Alignment |
| InternVideo2-1B | - | 51.9 | 82.5 | 75.3 | - | 50.9 | 81.8 | 73.4 | InternVideo2: Scaling Foundation Models for Multimodal Video Understanding |
| VAST, HowToCaption-finetuned | 1 | 50 | 81.4 | 73.2 | - | - | - | - | HowToCaption: Prompting LLMs to Transform Video Annotations at Scale |
| VAST | - | 49.3 | 73.9 | 68.3 | - | - | - | - | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset |
| mPLUG-2 | - | 47.1 | 79.0 | 69.7 | - | - | - | - | mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video |
| LanguageBind (ViT-H/14) | 2 | 44.8 | 78.7 | 70.0 | 2 | 40.9 | 75.7 | 66.4 | LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment |
| LanguageBind (ViT-L/14) | 2.0 | 42.8 | 76.0 | 67.5 | 3.0 | 38.3 | 77.8 | 65.8 | LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment |
| UMT-L (ViT-L/16) | - | 42.6 | 73.1 | 64.4 | - | 38.6 | 69.6 | 59.8 | Unmasked Teacher: Towards Training-Efficient Video Foundation Models |
| vid-TLDR (UMT-L) | - | 42.1 | 72.4 | 63.9 | - | 37.7 | 69.4 | 59.8 | vid-TLDR: Training Free Token Merging for Light-weight Video Transformer |
| BT-Adapter | - | 40.9 | 73.5 | 64.7 | - | - | - | - | BT-Adapter: Video Conversation is Feasible Without Video Instruction Tuning |
| InternVideo | - | 40.7 | - | - | - | 39.6 | - | - | InternVideo: General Video Foundation Models via Generative and Discriminative Learning |
| HowToCaption | 3 | 37.6 | 73.3 | 62 | - | - | - | - | HowToCaption: Prompting LLMs to Transform Video Annotations at Scale |
| Florence | - | 37.6 | 72.6 | 63.8 | - | - | - | - | Florence: A New Foundation Model for Computer Vision |
| ImageBind | - | 36.8 | 70.0 | 61.8 | - | - | - | - | ImageBind: One Embedding Space To Bind Them All |
| OmniVL | - | 34.6 | 66.6 | 58.4 | - | - | - | - | OmniVL: One Foundation Model for Image-Language and Video-Language Tasks |
| HiTeA-17M | - | 34.4 | 69.9 | 60.0 | - | - | - | - | HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training |
| Singularity-17M | - | 34.0 | 66.7 | 56.7 | - | - | - | - | Revealing Single Frame Bias for Video-and-Language Learning |
| CLIP4Clip | 4 | 32.0 | 66.9 | 57.0 | - | - | - | - | CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval |
| Yatai Ji et al. | - | 30.9 | 65.0 | 54.4 | - | - | - | - | - |