Video Retrieval on LSMDC

Evaluation Metrics

text-to-video Mean Rank
text-to-video R@1
text-to-video R@10
text-to-video R@5
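
As a rough illustration of how these metrics are commonly computed, the sketch below derives R@K and Mean Rank from a text-to-video similarity matrix. It rests on assumptions not stated on this page: each text query has exactly one ground-truth video, stored at the same index (the diagonal), and ties are resolved in favor of the correct video. The function name `retrieval_metrics` is hypothetical, and the exact evaluation protocol behind the leaderboard numbers may differ.

```python
import numpy as np

def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Compute R@K (in percent) and Mean Rank for text-to-video retrieval.

    Assumption: sim[i, j] is the similarity between text query i and video j,
    and the ground-truth video for query i is video i.
    """
    sim = np.asarray(sim, dtype=float)
    # Score of the correct video for each query, shaped for broadcasting.
    correct = np.diag(sim)[:, None]
    # Rank of the correct video = 1 + number of videos scored strictly higher.
    ranks = 1 + (sim > correct).sum(axis=1)
    metrics = {f"R@{k}": float(100.0 * np.mean(ranks <= k)) for k in ks}
    metrics["Mean Rank"] = float(ranks.mean())
    return metrics

# Toy example: 3 queries, 3 videos. The correct videos land at ranks 1, 2, and 3.
toy_sim = np.array([
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.1],
    [0.2, 0.4, 0.1],
])
print(retrieval_metrics(toy_sim))
```

Higher is better for R@K, while lower is better for Mean Rank.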

Results

Performance of each model on this benchmark.

| Model | text-to-video Mean Rank | text-to-video R@1 | text-to-video R@10 | text-to-video R@5 | Paper Title | Repository |
|---|---|---|---|---|---|---|
| InternVideo2-6B | - | 46.4 | - | - | InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | |
| vid-TLDR (UMT-L) | - | 43.1 | 71.4 | 64.5 | vid-TLDR: Training Free Token merging for Light-weight Video Transformer | |
| UMT-L (ViT-L/16) | - | 43.0 | 73.0 | 65.5 | Unmasked Teacher: Towards Training-Efficient Video Foundation Models | |
| HunYuan_tvr (huge) | 3.9 | 40.4 | 92.8 | 80.1 | Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations | - |
| COSA | - | 39.4 | - | - | COSA: Concatenated Sample Pretrained Vision-Language Foundation Model | |
| mPLUG-2 | - | 34.4 | 65.1 | 55.2 | mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | |
| VALOR | - | 34.2 | 64.1 | 56.0 | VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | |
| InternVideo | - | 34.0 | - | - | InternVideo: General Video Foundation Models via Generative and Discriminative Learning | |
| CLIP-ViP | - | 30.7 | 60.6 | 51.4 | CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment | |
| HunYuan_tvr | 56.4 | 29.7 | 55.4 | 46.4 | Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations | - |
| STAN | - | 29.2 | 58.8 | 49.5 | Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring | |
| HiTeA | - | 28.7 | 59.0 | 50.3 | HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training | - |
| MDMMT-2 | 48.0 | 26.9 | 55.9 | 46.7 | MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One More Step Towards Generalization | - |
| X-CLIP | - | 26.1 | - | - | X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval | |
| CAMoE | 54.4 | 25.9 | 53.7 | 46.1 | Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss | |
| EMCL-Net++ | - | 25.9 | - | 46.4 | Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations | |
| X-Pool | 53.2 | 25.2 | 53.5 | 43.7 | X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval | |
| Clover | - | 24.8 | 54.5 | 44 | Clover: Towards A Unified Video-Language Alignment and Fusion Model | |
| DiffusionRet | 40.7 | 24.4 | 54.3 | 43.1 | DiffusionRet: Generative Text-Video Retrieval with Diffusion Model | |
| CenterCLIP (ViT-B/16) | 47.3 | 24.2 | 55.9 | 46.2 | CenterCLIP: Token Clustering for Efficient Text-Video Retrieval | |