HyperAI
Video Retrieval on LSMDC
Evaluation Metrics
text-to-video Mean Rank
text-to-video R@1
text-to-video R@10
text-to-video R@5
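As a rough illustration of how the metrics listed above are commonly computed, the sketch below derives R@K and Mean Rank from a text-to-video similarity matrix. It assumes one ground-truth video per text query, placed on the diagonal of the matrix; the function name `retrieval_metrics` and the toy scores are hypothetical, and individual leaderboard entries may follow slightly different evaluation protocols.

```python
import numpy as np

def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Compute text-to-video R@K (in percent) and Mean Rank.

    Assumption: sim[i, j] is the similarity between text query i and
    video j, and the ground-truth video for query i is video i.
    """
    sim = np.asarray(sim, dtype=float)
    # Score of the ground-truth video for each query, as a column vector.
    gt_scores = np.diag(sim)[:, None]
    # Rank = 1 + number of videos scoring strictly higher than the ground truth.
    ranks = 1 + (sim > gt_scores).sum(axis=1)
    metrics = {f"R@{k}": 100.0 * float(np.mean(ranks <= k)) for k in ks}
    metrics["Mean Rank"] = float(ranks.mean())
    return metrics

# Toy example with three queries and three videos (made-up scores).
toy_sim = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.5],
    [0.1, 0.2, 0.3],
]
toy_metrics = retrieval_metrics(toy_sim, ks=(1, 2, 3))
```

Higher is better for R@K, while lower is better for Mean Rank, which is why the two kinds of columns in the results move in opposite directions.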
Evaluation Results
Performance of each model on this benchmark
| Model | text-to-video Mean Rank | text-to-video R@1 | text-to-video R@10 | text-to-video R@5 | Paper Title | Repository |
|---|---|---|---|---|---|---|
| InternVideo2-6B | - | 46.4 | - | - | InternVideo2: Scaling Foundation Models for Multimodal Video Understanding | |
| vid-TLDR (UMT-L) | - | 43.1 | 71.4 | 64.5 | vid-TLDR: Training Free Token merging for Light-weight Video Transformer | |
| UMT-L (ViT-L/16) | - | 43.0 | 73.0 | 65.5 | Unmasked Teacher: Towards Training-Efficient Video Foundation Models | |
| HunYuan_tvr (huge) | 3.9 | 40.4 | 92.8 | 80.1 | Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations | - |
| COSA | - | 39.4 | - | - | COSA: Concatenated Sample Pretrained Vision-Language Foundation Model | |
| mPLUG-2 | - | 34.4 | 65.1 | 55.2 | mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | |
| VALOR | - | 34.2 | 64.1 | 56.0 | VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset | |
| InternVideo | - | 34.0 | - | - | InternVideo: General Video Foundation Models via Generative and Discriminative Learning | |
| CLIP-ViP | - | 30.7 | 60.6 | 51.4 | CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment | |
| HunYuan_tvr | 56.4 | 29.7 | 55.4 | 46.4 | Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations | - |
| STAN | - | 29.2 | 58.8 | 49.5 | Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring | |
| HiTeA | - | 28.7 | 59.0 | 50.3 | HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training | - |
| MDMMT-2 | 48.0 | 26.9 | 55.9 | 46.7 | MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One More Step Towards Generalization | - |
| X-CLIP | - | 26.1 | - | - | X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval | |
| CAMoE | 54.4 | 25.9 | 53.7 | 46.1 | Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss | |
| EMCL-Net++ | - | 25.9 | - | 46.4 | Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations | |
| X-Pool | 53.2 | 25.2 | 53.5 | 43.7 | X-Pool: Cross-Modal Language-Video Attention for Text-Video Retrieval | |
| Clover | - | 24.8 | 54.5 | 44 | Clover: Towards A Unified Video-Language Alignment and Fusion Model | |
| DiffusionRet | 40.7 | 24.4 | 54.3 | 43.1 | DiffusionRet: Generative Text-Video Retrieval with Diffusion Model | |
| CenterCLIP (ViT-B/16) | 47.3 | 24.2 | 55.9 | 46.2 | CenterCLIP: Token Clustering for Efficient Text-Video Retrieval | |