Video Retrieval on MSR-VTT

Evaluation Metrics

text-to-video R@1
text-to-video R@10
text-to-video R@5
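
The page lists these metrics without defining them. Under the standard definition, text-to-video R@K (Recall@K) is the percentage of text queries whose correct video appears among the top K retrieved results. The sketch below is a minimal illustration, not the evaluation code of any listed model. It assumes one caption per video, that the correct video for query i sits at index i, and that ties are resolved in favor of the correct video. The function name `recall_at_k` and the toy scores are hypothetical.

```python
from typing import Sequence


def recall_at_k(similarity: Sequence[Sequence[float]], k: int) -> float:
    """Percentage of text queries whose matching video ranks in the top k.

    Assumption: similarity[i][j] scores text query i against video j, and
    the correct video for query i is at index i (one caption per video).
    """
    hits = 0
    for i, scores in enumerate(similarity):
        # Zero-based rank: number of videos scored strictly higher than
        # the correct one (ties count in favor of the correct video).
        rank = sum(1 for s in scores if s > scores[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy 3x3 similarity matrix with made-up scores.
sim = [
    [0.9, 0.1, 0.3],  # query 0: correct video ranked 1st
    [0.8, 0.5, 0.2],  # query 1: correct video ranked 2nd
    [0.4, 0.6, 0.1],  # query 2: correct video ranked 3rd
]

for k in (1, 2, 3):
    print(f"R@{k} = {recall_at_k(sim, k):.1f}")
```

Because the top-K set grows with K, R@1 can never exceed R@5, and R@5 can never exceed R@10 for the same model and test split.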

Evaluation Results

Performance of each model on this benchmark

| Model | text-to-video R@1 | text-to-video R@10 | text-to-video R@5 | Paper Title | Repository |
|---|---|---|---|---|---|
| GRAM | 64 | 89.3 | - | Gramian Multimodal Representation Learning and Alignment | |
| VAST | 63.9 | 89.6 | 84.3 | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | |
| InternVideo2-6B | 62.8 | - | - | InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | |
| VALOR | 59.9 | 89.6 | 83.5 | VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | |
| UMT-L (ViT-L/16) | 58.8 | 87.1 | 81.0 | Unmasked Teacher: Towards Training-Efficient Video Foundation Models | |
| vid-TLDR (UMT-L) | 58.1 | 81.6 | 81.0 | vid-TLDR: Training Free Token merging for Light-weight Video Transformer | |
| COSA | 57.9 | - | - | COSA: Concatenated Sample Pretrained Vision-Language Foundation Model | |
| InternVideo | 55.2 | - | - | InternVideo: General Video Foundation Models via Generative and Discriminative Learning | |
| VLAB | 55.1 | 87.6 | 78.8 | VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending | - |
| Aurora (ours, r=64) | 52.4 | 82 | 73.9 | - | - |
| TEFAL | 52 | 86.1 | 76.6 | Audio-Enhanced Text-to-Video Retrieval using Text-Conditioned Feature Alignment | - |
| UCoFiA | 49.4 | 83.5 | 72.1 | Unified Coarse-to-Fine Alignment for Video-Text Retrieval | |
| OmniVL | 47.8 | 83.8 | 74.2 | OmniVL: One Foundation Model for Image-Language and Video-Language Tasks | - |
| CLIP4Clip-seqTransf | 44.5 | 81.6 | 71.4 | CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval | |
| All-in-one + MELTR | 38.6 | 84.7 | 74.4 | MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models | |
| VIOLETv2 | 37.2 | 75.8 | 64.8 | An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling | |
| HD-VILA | 35.6 | 78 | 65.3 | Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions | |
| VideoCoCa (zero-shot) | 34.3 | 67.0 | 57.8 | VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners | - |
| MDMMT-2 | 33.7 | 70.8 | 60.5 | MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One More Step Towards Generalization | - |
| VIOLET + MELTR | 33.6 | 77.8 | 63.7 | MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models | |