Video Retrieval on DiDeMo

Evaluation Metrics

text-to-video R@1
text-to-video R@10
text-to-video R@5
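
For readers unfamiliar with the metrics above, text-to-video R@K is the percentage of text queries whose correct video appears among the top K retrieved results. The following is a minimal sketch, not the evaluation code of any listed paper. It assumes a precomputed text-to-video similarity matrix in which query i has exactly one ground-truth video at index i; the function name `recall_at_k` and the toy matrix are hypothetical.

```python
import numpy as np

def recall_at_k(sim, ks=(1, 5, 10)):
    """Text-to-video Recall@K in percent.

    Assumption: sim[i, j] is the similarity between text query i and
    video j, and the ground-truth video for query i is video i.
    """
    sim = np.asarray(sim, dtype=float)
    gt_scores = np.diag(sim)  # score of the correct video for each query
    # 0-based rank of the correct video: how many videos score strictly higher
    ranks = (sim > gt_scores[:, None]).sum(axis=1)
    return {f"R@{k}": 100.0 * float(np.mean(ranks < k)) for k in ks}

# Toy example: 3 text queries against 3 videos (illustrative values only)
toy = [
    [0.9, 0.1, 0.2],  # correct video ranked first
    [0.8, 0.3, 0.1],  # correct video ranked second
    [0.2, 0.4, 0.3],  # correct video ranked second
]
scores = recall_at_k(toy, ks=(1, 2))
print(scores)
```

In this toy case one of three queries retrieves its video at rank 1, so R@1 is about 33.3, while every correct video falls within the top 2, so R@2 is 100.0. Ties are resolved in favor of the correct video here, which is a simplifying choice for the sketch.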

Results

Performance of each model on this benchmark

| Model | text-to-video R@1 | text-to-video R@10 | text-to-video R@5 | Paper Title | Repository |
| --- | --- | --- | --- | --- | --- |
| InternVideo2-6B | 74.2 | - | - | InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | |
| vid-TLDR (UMT-L) | 72.3 | 94.2 | 91.2 | vid-TLDR: Training Free Token merging for Light-weight Video Transformer | |
| VAST | 72.0 | 91.4 | 89.0 | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | |
| COSA | 70.5 | - | - | COSA: Concatenated Sample Pretrained Vision-Language Foundation Model | |
| UMT-L (ViT-L/16) | 70.4 | 93.5 | 90.1 | Unmasked Teacher: Towards Training-Efficient Video Foundation Models | |
| GRAM | 67.3 | 90.1 | - | Gramian Multimodal Representation Learning and Alignment | |
| VALOR | 61.5 | 90.4 | 85.3 | VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | |
| VindLU | 61.2 | 91.0 | 85.8 | VindLU: A Recipe for Effective Video-and-Language Pretraining | |
| TESTA (ViT-B/16) | 61.2 | 91.5 | 87.2 | TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding | |
| InternVideo | 57.9 | - | - | InternVideo: General Video Foundation Models via Generative and Discriminative Learning | |
| RTQ | 57.6 | 89.9 | 84.1 | RTQ: Rethinking Video-language Understanding Based on Image-text Model | |
| VLAB | 56.8 | 88.7 | 81.6 | VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending | - |
| MuLTI | 56.5 | 87.0 | 80.2 | MuLTI: Efficient Video-and-Language Understanding with Text-Guided MultiWay-Sampler and Multiple Choice Modeling | - |
| HiTeA | 56.5 | 89.7 | 81.7 | HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training | - |
| mPLUG-2 | 56.4 | 85.2 | 79.1 | mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | |
| CLIP-ViP | 55.3 | 89.3 | 82 | CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment | |
| STAN | 54.6 | 85.1 | 78.4 | Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring | |
| Singularity | 53.9 | 86.9 | 79.4 | Revealing Single Frame Bias for Video-and-Language Learning | |
| HunYuan_tvr (huge) | 52.7 | 85.2 | 77.8 | Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations | - |
| DMAE (ViT-B/32) | 52.7 | 86.6 | 79.3 | Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning | |