Video Retrieval on DiDeMo
Evaluation Metrics
text-to-video R@1
text-to-video R@10
text-to-video R@5
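As a rough illustration of what the R@K metrics above measure, the following is a minimal sketch of text-to-video Recall@K. It assumes a precomputed similarity matrix in which row i holds the scores of text query i against every candidate video, and that the correct video for query i sits at index i. The function name recall_at_k and the toy scores are hypothetical; this page does not specify how any listed model computes its numbers.

```python
def recall_at_k(similarity, k):
    """Percentage of text queries whose matching video ranks in the top k.

    Assumption: similarity[i][j] scores text query i against video j, and
    the ground-truth video for query i is at index i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Zero-based rank of the correct video: how many videos score strictly higher.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy 3x3 similarity matrix (made-up scores, for illustration only).
toy = [
    [0.9, 0.1, 0.3],  # query 0: correct video ranked 1st
    [0.8, 0.5, 0.2],  # query 1: correct video ranked 2nd
    [0.4, 0.6, 0.1],  # query 2: correct video ranked 3rd
]

for k in (1, 2, 3):
    print(f"R@{k}: {recall_at_k(toy, k):.1f}")
```

Higher is better for every K, and R@1 <= R@5 <= R@10 always holds for a given model, since a hit within the top 1 is also a hit within the top 5 and top 10.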
Evaluation Results
Performance of each model on this benchmark
| Model | text-to-video R@1 | text-to-video R@10 | text-to-video R@5 | Paper Title | Repository |
|---|---|---|---|---|---|
| InternVideo2-6B | 74.2 | - | - | InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | |
| vid-TLDR (UMT-L) | 72.3 | 94.2 | 91.2 | vid-TLDR: Training Free Token merging for Light-weight Video Transformer | |
| VAST | 72.0 | 91.4 | 89.0 | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | |
| COSA | 70.5 | - | - | COSA: Concatenated Sample Pretrained Vision-Language Foundation Model | |
| UMT-L (ViT-L/16) | 70.4 | 93.5 | 90.1 | Unmasked Teacher: Towards Training-Efficient Video Foundation Models | |
| GRAM | 67.3 | 90.1 | - | Gramian Multimodal Representation Learning and Alignment | |
| VALOR | 61.5 | 90.4 | 85.3 | VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | |
| VindLU | 61.2 | 91.0 | 85.8 | VindLU: A Recipe for Effective Video-and-Language Pretraining | |
| TESTA (ViT-B/16) | 61.2 | 91.5 | 87.2 | TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding | |
| InternVideo | 57.9 | - | - | InternVideo: General Video Foundation Models via Generative and Discriminative Learning | |
| RTQ | 57.6 | 89.9 | 84.1 | RTQ: Rethinking Video-language Understanding Based on Image-text Model | |
| VLAB | 56.8 | 88.7 | 81.6 | VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending | - |
| MuLTI | 56.5 | 87.0 | 80.2 | MuLTI: Efficient Video-and-Language Understanding with Text-Guided MultiWay-Sampler and Multiple Choice Modeling | - |
| HiTeA | 56.5 | 89.7 | 81.7 | HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training | - |
| mPLUG-2 | 56.4 | 85.2 | 79.1 | mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | |
| CLIP-ViP | 55.3 | 89.3 | 82 | CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment | |
| STAN | 54.6 | 85.1 | 78.4 | Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring | |
| Singularity | 53.9 | 86.9 | 79.4 | Revealing Single Frame Bias for Video-and-Language Learning | |
| HunYuan_tvr (huge) | 52.7 | 85.2 | 77.8 | Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations | - |
| DMAE (ViT-B/32) | 52.7 | 86.6 | 79.3 | Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning | |