Video Retrieval on MSR-VTT

MSR-VTT is a widely used text-to-video retrieval benchmark: given a natural-language query, a model is scored on how highly it ranks the matching video clip.
Evaluation Metrics

text-to-video R@1, R@5, and R@10: recall at rank k, i.e. the percentage of text queries for which the ground-truth video appears among the top-k retrieved results.
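For reference, below is a minimal sketch of how these recall metrics are typically computed from a text-video similarity matrix. It assumes the usual paired-index convention for MSR-VTT evaluation (text query i matches video i); recall_at_k is an illustrative helper, not part of any particular library.

```python
import numpy as np

def recall_at_k(sim: np.ndarray, ks=(1, 5, 10)) -> dict:
    """Text-to-video Recall@K from a similarity matrix.

    sim[i, j] is the similarity between text query i and video j.
    The ground-truth video for query i is assumed to be video i
    (the standard paired-index evaluation convention).
    """
    n_queries = sim.shape[0]
    # Similarity of each query to its ground-truth video (the diagonal).
    gt_scores = sim[np.arange(n_queries), np.arange(n_queries)]
    # 0-based rank of the ground truth: how many videos score strictly higher.
    ranks = (sim > gt_scores[:, None]).sum(axis=1)
    # R@k = percentage of queries whose ground truth lands in the top k.
    return {f"R@{k}": 100.0 * float((ranks < k).mean()) for k in ks}

# Toy usage: 5 queries x 5 videos with random similarities.
rng = np.random.default_rng(0)
print(recall_at_k(rng.standard_normal((5, 5))))
```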
Evaluation Results

Performance of each model on this benchmark:
| Model | text-to-video R@1 | text-to-video R@5 | text-to-video R@10 | Paper |
| --- | --- | --- | --- | --- |
| GRAM | 64.0 | - | 89.3 | Gramian Multimodal Representation Learning and Alignment |
| VAST | 63.9 | 84.3 | 89.6 | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset |
| InternVideo2-6B | 62.8 | - | - | InternVideo2: Scaling Foundation Models for Multimodal Video Understanding |
| VALOR | 59.9 | 83.5 | 89.6 | VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset |
| UMT-L (ViT-L/16) | 58.8 | 81.0 | 87.1 | Unmasked Teacher: Towards Training-Efficient Video Foundation Models |
| vid-TLDR (UMT-L) | 58.1 | 81.0 | 81.6 | vid-TLDR: Training Free Token merging for Light-weight Video Transformer |
| COSA | 57.9 | - | - | COSA: Concatenated Sample Pretrained Vision-Language Foundation Model |
| InternVideo | 55.2 | - | - | InternVideo: General Video Foundation Models via Generative and Discriminative Learning |
| VLAB | 55.1 | 78.8 | 87.6 | VLAB: Enhancing Video Language Pre-training by Feature Adapting and Blending |
| Aurora (ours, r=64) | 52.4 | 73.9 | 82.0 | - |
| TEFAL | 52.0 | 76.6 | 86.1 | Audio-Enhanced Text-to-Video Retrieval using Text-Conditioned Feature Alignment |
| UCoFiA | 49.4 | 72.1 | 83.5 | Unified Coarse-to-Fine Alignment for Video-Text Retrieval |
| OmniVL | 47.8 | 74.2 | 83.8 | OmniVL: One Foundation Model for Image-Language and Video-Language Tasks |
| CLIP4Clip-seqTransf | 44.5 | 71.4 | 81.6 | CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval |
| All-in-one + MELTR | 38.6 | 74.4 | 84.7 | MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models |
| VIOLETv2 | 37.2 | 64.8 | 75.8 | An Empirical Study of End-to-End Video-Language Transformers with Masked Visual Modeling |
| HD-VILA | 35.6 | 65.3 | 78.0 | Advancing High-Resolution Video-Language Representation with Large-Scale Video Transcriptions |
| VideoCoCa (zero-shot) | 34.3 | 57.8 | 67.0 | VideoCoCa: Video-Text Modeling with Zero-Shot Transfer from Contrastive Captioners |
| MDMMT-2 | 33.7 | 60.5 | 70.8 | MDMMT-2: Multidomain Multimodal Transformer for Video Retrieval, One More Step Towards Generalization |
| VIOLET + MELTR | 33.6 | 63.7 | 77.8 | MELTR: Meta Loss Transformer for Learning to Fine-tune Video Foundation Models |
The table lists the first 20 of 40 entries on this benchmark.