Video Retrieval on ActivityNet

Evaluation Metrics

text-to-video Median Rank
text-to-video R@1
text-to-video R@5
text-to-video R@50
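
The page lists these metrics without defining them. The sketch below assumes the usual text-to-video retrieval definitions: R@K is the percentage of text queries whose correct video is ranked in the top K (higher is better), and Median Rank is the median position of the correct video across all queries (lower is better, with 1 as the best possible value). The function names, the similarity-matrix layout, and the toy scores are all hypothetical and do not come from any of the listed papers.

```python
from statistics import median


def ranks_from_similarity(sim):
    """Return the 1-based rank of the correct video for each text query.

    Assumption: sim[i][j] is the similarity between text i and video j,
    and video i is the single correct match for text i.
    """
    ranks = []
    for i, row in enumerate(sim):
        target = row[i]
        # Rank = 1 + number of videos scored strictly higher than the match.
        ranks.append(1 + sum(1 for score in row if score > target))
    return ranks


def recall_at_k(ranks, k):
    """Percentage of queries whose correct video is within the top k."""
    return 100.0 * sum(1 for r in ranks if r <= k) / len(ranks)


def median_rank(ranks):
    """Median position of the correct video over all queries."""
    return median(ranks)


# Toy 4x4 similarity matrix (made-up scores, for illustration only).
sim = [
    [0.9, 0.1, 0.3, 0.2],
    [0.2, 0.4, 0.8, 0.1],
    [0.3, 0.2, 0.7, 0.1],
    [0.6, 0.5, 0.4, 0.3],
]
ranks = ranks_from_similarity(sim)
print("R@1:", recall_at_k(ranks, 1))
print("R@5:", recall_at_k(ranks, 5))
print("Median Rank:", median_rank(ranks))
```

Under these definitions, a Median Rank of 1 means the correct video is ranked first for at least half of the queries, which is why that column tends to saturate for strong models while R@1 still separates them.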

Evaluation Results

Performance of each model on this benchmark

| Model | text-to-video Median Rank | text-to-video R@1 | text-to-video R@5 | text-to-video R@50 | Paper Title | Repository |
|---|---|---|---|---|---|---|
| InternVideo2-6B | - | 74.1 | - | - | InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | |
| VAST | - | 70.5 | 90.9 | - | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | |
| VALOR | - | 70.1 | 90.8 | - | VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | |
| GRAM | - | 69.9 | - | - | Gramian Multimodal Representation Learning and Alignment | |
| COSA | - | 67.3 | - | - | COSA: Concatenated Sample Pretrained Vision-Language Foundation Model | |
| UMT-L (ViT-L/16) | - | 66.8 | 89.1 | - | Unmasked Teacher: Towards Training-Efficient Video Foundation Models | |
| vid-TLDR (UMT-L) | - | 66.7 | 88.6 | - | vid-TLDR: Training Free Token merging for Light-weight Video Transformer | |
| InternVideo | - | 62.2 | - | - | InternVideo: General Video Foundation Models via Generative and Discriminative Learning | |
| CLIP-ViP | 1 | 61.4 | 85.7 | - | CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment | |
| HunYuan_tvr | 1 | 57.3 | 84.8 | - | Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations | - |
| VindLU | - | 55.0 | 81.4 | - | VindLU: A Recipe for Effective Video-and-Language Pretraining | |
| TESTA (ViT-B/16) | - | 54.8 | 80.8 | - | TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding | |
| RTQ | - | 53.5 | 81.4 | - | RTQ: Rethinking Video-language Understanding Based on Image-text Model | |
| DMAE (ViT-B/32) | 1.0 | 53.4 | 80.7 | - | Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning | |
| CAMoE | 1 | 51.0 | 77.7 | - | Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss | |
| EMCL-Net++ | - | 50.6 | 78.7 | 98.1 | Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations | |
| HiTeA | - | 49.7 | 77.1 | - | HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training | - |
| DiffusionRet+QB-Norm | 2.0 | 48.1 | - | - | DiffusionRet: Generative Text-Video Retrieval with Diffusion Model | |
| Singularity | - | 47.1 | 75.5 | - | Revealing Single Frame Bias for Video-and-Language Learning | |
| X-CLIP | - | 46.2 | 75.5 | - | X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval | |