Video Retrieval on MSR-VTT-1kA (SOTA | HyperAI)
Evaluation Metrics
- text-to-video Median Rank
- text-to-video R@1
- text-to-video R@5
- text-to-video R@10
Evaluation Results
Performance of each model on this benchmark.
| Model Name | text-to-video Median Rank | text-to-video R@1 | text-to-video R@5 | text-to-video R@10 | Paper Title | Repository |
|---|---|---|---|---|---|---|
| HunYuan_tvr (huge) | 1.0 | 62.9 | 84.5 | 90.8 | Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations | - |
| OmniVec | - | - | - | 89.4 | OmniVec: Learning robust representations with cross modal sharing | - |
| CLIP-ViP | 1.0 | 57.7 | 80.5 | 88.2 | CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment | |
| STAN | 1.0 | 54.1 | 79.5 | 87.8 | Revisiting Temporal Modeling for CLIP-based Image-to-Video Knowledge Transferring | |
| PIDRo | 1.0 | 55.9 | 79.8 | 87.6 | PIDRo: Parallel Isomeric Attention with Dynamic Routing for Text-Video Retrieval | - |
| DRL | 1.0 | 53.3 | 80.3 | 87.6 | Disentangled Representation Learning for Text-Video Retrieval | |
| TS2-Net | - | 54.0 | 79.3 | 87.4 | TS2-Net: Token Shift and Selection Transformer for Text-Video Retrieval | |
| DMAE (ViT-B/16) | 1.0 | 55.5 | 79.4 | 87.1 | Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning | |
| EERCF | - | 54.1 | 78.8 | 86.9 | Towards Efficient and Effective Text-to-Video Retrieval with Coarse-to-Fine Visual Representation Learning | |
| CLIP2TV | 1.0 | 52.9 | 78.5 | 86.5 | CLIP2TV: Align, Match and Distill for Video-Text Retrieval | - |
| MuLTI | - | 54.7 | 77.7 | 86.0 | MuLTI: Efficient Video-and-Language Understanding with Text-Guided MultiWay-Sampler and Multiple Choice Modeling | - |
| EMCL-Net++ | - | 51.6 | 78.1 | 85.3 | Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations | |
| CAMoE | 2.0 | 48.8 | 75.6 | 85.3 | Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss | |
| X-CLIP | 2.0 | 49.3 | 75.8 | 84.8 | X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval | |
| mPLUG-2 | - | 53.1 | 77.6 | 84.7 | mPLUG-2: A Modularized Multi-modal Foundation Model Across Text, Image and Video | |
| RTQ | - | 53.4 | 76.1 | 84.4 | RTQ: Rethinking Video-language Understanding Based on Image-text Model | |
| Side4Video | 1.0 | 52.3 | 75.5 | 84.2 | Side4Video: Spatial-Temporal Side Network for Memory-Efficient Image-to-Video Transfer Learning | |
| X2-VLM (large) | - | 49.6 | 76.7 | 84.2 | X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | |
| X2-VLM (base) | - | 47.6 | 74.1 | 84.2 | X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | |
| Cap4Video | 1.0 | 51.4 | 75.7 | 83.9 | Cap4Video: What Can Auxiliary Captions Do for Text-Video Retrieval? | |
Showing the first 20 of 63 entries.