Video Retrieval on MSR-VTT-1kA

Evaluation Metrics

text-to-video Median Rank
text-to-video R@1
text-to-video R@10
text-to-video R@5
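
For readers unfamiliar with these metrics, the following is a minimal sketch of how Recall@K (R@K) and Median Rank are commonly computed from a text-to-video similarity matrix. It assumes that the correct video for text query i sits at index i (one caption per test video); the function names and the toy similarity values are hypothetical and are not taken from any official evaluation script for this benchmark.

```python
from statistics import median


def retrieval_ranks(sim):
    """Return the 1-based rank of the ground-truth video for each text query.

    sim[i][j] is the similarity between text query i and video j.
    Assumption: the correct video for query i is video i.
    """
    ranks = []
    for i, row in enumerate(sim):
        target = row[i]
        # Rank = 1 + number of videos scored strictly higher than the target.
        ranks.append(1 + sum(1 for score in row if score > target))
    return ranks


def recall_at_k(ranks, k):
    """Percentage of queries whose correct video appears in the top k."""
    return 100.0 * sum(1 for r in ranks if r <= k) / len(ranks)


def median_rank(ranks):
    """Median position of the correct video (lower is better)."""
    return median(ranks)


# Toy 4x4 similarity matrix with hypothetical values.
sim = [
    [0.9, 0.1, 0.2, 0.3],  # correct video ranked 1st
    [0.8, 0.7, 0.1, 0.2],  # correct video ranked 2nd
    [0.1, 0.2, 0.9, 0.3],  # correct video ranked 1st
    [0.5, 0.6, 0.7, 0.4],  # correct video ranked 4th
]

ranks = retrieval_ranks(sim)
print(ranks)                  # [1, 2, 1, 4]
print(recall_at_k(ranks, 1))  # 50.0
print(median_rank(ranks))     # 1.5
```

Higher R@K is better, while a lower Median Rank is better; a Median Rank of 1 means the correct video is ranked first for at least half of the queries.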

Evaluation Results

Performance results of various models on this benchmark

| Model | text-to-video Median Rank | text-to-video R@1 | text-to-video R@10 | text-to-video R@5 | Paper Title | Repository |
|---|---|---|---|---|---|---|
| HunYuan_tvr (huge) | 1.0 | 62.9 | 90.8 | 84.5 | Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations | - |
| OmniVec | - | - | 89.4 | - | OmniVec: Learning robust representations with cross modal sharing | - |
| CLIP-ViP | 1.0 | 57.7 | 88.2 | 80.5 | CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment | |
| STAN | 1 | 54.1 | 87.8 | 79.5 | Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring | |
| PIDRo | 1.0 | 55.9 | 87.6 | 79.8 | PIDRo: Parallel Isomeric Attention with Dynamic Routing for Text-Video Retrieval | - |
| DRL | 1 | 53.3 | 87.6 | 80.3 | Disentangled Representation Learning for Text-Video Retrieval | |
| TS2-Net | - | 54.0 | 87.4 | 79.3 | TS2-Net: Token Shift and Selection Transformer for Text-Video Retrieval | |
| DMAE (ViT-B/16) | 1.0 | 55.5 | 87.1 | 79.4 | Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning | |
| EERCF | - | 54.1 | 86.9 | 78.8 | Towards Efficient and Effective Text-to-Video Retrieval with Coarse-to-Fine Visual Representation Learning | |
| CLIP2TV | 1 | 52.9 | 86.5 | 78.5 | CLIP2TV: Align, Match and Distill for Video-Text Retrieval | - |
| MuLTI | - | 54.7 | 86.0 | 77.7 | MuLTI: Efficient Video-and-Language Understanding with Text-Guided MultiWay-Sampler and Multiple Choice Modeling | - |
| EMCL-Net++ | - | 51.6 | 85.3 | 78.1 | Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations | |
| CAMoE | 2 | 48.8 | 85.3 | 75.6 | Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss | |
| X-CLIP | 2.0 | 49.3 | 84.8 | 75.8 | X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval | |
| mPLUG-2 | - | 53.1 | 84.7 | 77.6 | mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | |
| RTQ | - | 53.4 | 84.4 | 76.1 | RTQ: Rethinking Video-language Understanding Based on Image-text Model | |
| Side4Video | 1.0 | 52.3 | 84.2 | 75.5 | Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning | |
| X2-VLM (large) | - | 49.6 | 84.2 | 76.7 | X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | |
| X2-VLM (base) | - | 47.6 | 84.2 | 74.1 | X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | |
| Cap4Video | 1 | 51.4 | 83.9 | 75.7 | Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval? | |