Video Retrieval On ActivityNet
Evaluation Metrics
text-to-video Median Rank
text-to-video R@1
text-to-video R@5
text-to-video R@50
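A minimal sketch, not HyperAI's or any listed paper's official evaluation code, of how these text-to-video metrics are typically computed from a text-by-video similarity matrix: R@K is the percentage of text queries whose ground-truth video appears in the top K retrieved results, and Median Rank is the median position of the ground-truth video. The function name and the assumption that text query i is paired with video i are illustrative.

```python
import numpy as np

def text_to_video_metrics(sim: np.ndarray) -> dict:
    """sim[i, j] = similarity of text query i to video j; ground truth for query i is video i."""
    num_queries = sim.shape[0]
    order = np.argsort(-sim, axis=1)  # video indices sorted by similarity, best first
    # Rank (1 = best) of the ground-truth video for each query.
    gt_rank = np.argmax(order == np.arange(num_queries)[:, None], axis=1) + 1
    return {
        "text-to-video R@1": float(np.mean(gt_rank <= 1) * 100),
        "text-to-video R@5": float(np.mean(gt_rank <= 5) * 100),
        "text-to-video R@50": float(np.mean(gt_rank <= 50) * 100),
        "text-to-video Median Rank": float(np.median(gt_rank)),
    }

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    sim = rng.standard_normal((200, 200))  # random scores, so metrics land near chance level
    print(text_to_video_metrics(sim))
```

On this benchmark most papers follow the paragraph-to-video protocol, where all captions of a video are concatenated into a single text query, so the similarity matrix is square as assumed above.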
Evaluation Results
Performance of each model on this benchmark; the 20 entries shown on this page (of 31 in the full leaderboard) are reproduced below.
| Model Name | text-to-video Median Rank | text-to-video R@1 | text-to-video R@5 | text-to-video R@50 | Paper Title |
|---|---|---|---|---|---|
| InternVideo2-6B | - | 74.1 | - | - | InternVideo2: Scaling Foundation Models for Multimodal Video Understanding |
| VAST | - | 70.5 | 90.9 | - | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset |
| VALOR | - | 70.1 | 90.8 | - | VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset |
| GRAM | - | 69.9 | - | - | Gramian Multimodal Representation Learning and Alignment |
| COSA | - | 67.3 | - | - | COSA: Concatenated Sample Pretrained Vision-Language Foundation Model |
| UMT-L (ViT-L/16) | - | 66.8 | 89.1 | - | Unmasked Teacher: Towards Training-Efficient Video Foundation Models |
| vid-TLDR (UMT-L) | - | 66.7 | 88.6 | - | vid-TLDR: Training Free Token merging for Light-weight Video Transformer |
| InternVideo | - | 62.2 | - | - | InternVideo: General Video Foundation Models via Generative and Discriminative Learning |
| CLIP-ViP | 1 | 61.4 | 85.7 | - | CLIP-ViP: Adapting Pre-trained Image-Text Model to Video-Language Representation Alignment |
| HunYuan_tvr | 1 | 57.3 | 84.8 | - | Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations |
| VindLU | - | 55.0 | 81.4 | - | VindLU: A Recipe for Effective Video-and-Language Pretraining |
| TESTA (ViT-B/16) | - | 54.8 | 80.8 | - | TESTA: Temporal-Spatial Token Aggregation for Long-form Video-Language Understanding |
| RTQ | - | 53.5 | 81.4 | - | RTQ: Rethinking Video-language Understanding Based on Image-text Model |
| DMAE (ViT-B/32) | 1.0 | 53.4 | 80.7 | - | Dual-Modal Attention-Enhanced Text-Video Retrieval with Triplet Partial Margin Contrastive Learning |
| CAMoE | 1 | 51.0 | 77.7 | - | Improving Video-Text Retrieval by Multi-Stream Corpus Alignment and Dual Softmax Loss |
| EMCL-Net++ | - | 50.6 | 78.7 | 98.1 | Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations |
| HiTeA | - | 49.7 | 77.1 | - | HiTeA: Hierarchical Temporal-Aware Video-Language Pre-training |
| DiffusionRet+QB-Norm | 2.0 | 48.1 | - | - | DiffusionRet: Generative Text-Video Retrieval with Diffusion Model |
| Singularity | - | 47.1 | 75.5 | - | Revealing Single Frame Bias for Video-and-Language Learning |
| X-CLIP | - | 46.2 | 75.5 | - | X-CLIP: End-to-End Multi-grained Contrastive Learning for Video-Text Retrieval |