Cross-Modal Retrieval on COCO 2014 | SOTA | HyperAI
Evaluation Metrics

- Image-to-text R@1
- Image-to-text R@5
- Image-to-text R@10
- Text-to-image R@1
- Text-to-image R@5
- Text-to-image R@10
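R@K (Recall at K) measures the fraction of queries whose ground-truth match appears among the top-K retrieved candidates; image-to-text treats each image as a query over captions, and text-to-image the reverse. A minimal sketch of the computation in pure Python, with hypothetical names, assuming a single ground-truth candidate per query (the actual COCO protocol pools 5 captions per image and counts a hit if any of them is retrieved, which this simplified version omits):

```python
def recall_at_k(similarity, gt, k):
    """Fraction of queries whose ground-truth item is in the top-k results.

    similarity: one row per query; similarity[i][j] is the score
                between query i and candidate j.
    gt:         gt[i] is the index of query i's correct candidate.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # rank candidate indices for query i by descending score
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if gt[i] in ranked[:k]:
            hits += 1
    return hits / len(similarity)

# toy example: 3 queries over 4 candidates
sim = [
    [0.9, 0.1, 0.3, 0.2],   # correct item (index 0) ranked 1st
    [0.2, 0.4, 0.8, 0.1],   # correct item (index 1) ranked 2nd
    [0.1, 0.2, 0.3, 0.05],  # correct item (index 3) ranked 4th
]
gt = [0, 1, 3]
print(recall_at_k(sim, gt, 1))  # 1 of 3 queries hit at rank 1
print(recall_at_k(sim, gt, 2))  # 2 of 3 queries hit within rank 2
```

Since R@K counts hits over a growing candidate window, it is non-decreasing in K, which is why each row in the table below satisfies R@1 ≤ R@5 ≤ R@10.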
Benchmark Results

Performance of each model on this benchmark ("-" marks values not reported):

| Model | Image-to-text R@1 | Image-to-text R@5 | Image-to-text R@10 | Text-to-image R@1 | Text-to-image R@5 | Text-to-image R@10 | Paper Title |
| --- | --- | --- | --- | --- | --- | --- | --- |
| VAST | - | - | - | 68.0 | 87.7 | 92.8 | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset |
| X2-VLM (large) | 84.4 | 96.5 | 98.5 | 67.7 | 87.5 | 92.5 | X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks |
| BEiT-3 | 84.8 | 96.5 | 98.3 | 67.2 | 87.7 | 92.8 | Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks |
| XFM (base) | 84.2 | 96.4 | 98.4 | 67.0 | 87.2 | 92.4 | Toward Building General Foundation Models for Language, Vision, and Vision-Language Understanding Tasks |
| X2-VLM (base) | 83.5 | 96.3 | 98.5 | 66.2 | 87.1 | 92.2 | X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks |
| PTP-BLIP (14M) | 81.5 | 95.9 | 97.9 | 64.9 | 87.4 | 92.2 | Position-guided Text Prompt for Vision-Language Pre-training |
| OmniVL (14M) | 82.1 | 95.9 | 98.1 | 64.8 | 86.1 | 91.6 | OmniVL: One Foundation Model for Image-Language and Video-Language Tasks |
| VSE-Gradient | 81.4 | 95.6 | 97.9 | 63.6 | 86.0 | 91.5 | Dissecting Deep Metric Learning Losses for Image-Text Retrieval |
| X-VLM (base) | 81.2 | 95.6 | 98.2 | 63.4 | 85.8 | 91.5 | Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts |
| Florence | 81.8 | 95.2 | - | 63.2 | 85.7 | - | Florence: A New Foundation Model for Computer Vision |
| VK-OOD | 80.7 | 95.1 | 96.8 | 62.9 | 84.8 | 92.8 | Implicit Differentiable Outlier Detection Enable Robust Deep Multimodal Analysis |
| Aurora (ours, r=128) | 80.7 | 95.3 | 97.8 | 62.8 | 84.8 | 91 | - |
| DSMD | 48.0 | 75.6 | 84.5 | 62.1 | 85.9 | 92.0 | Dynamic Self-adaptive Multiscale Distillation from Pre-trained Multimodal Large Model for Efficient Cross-modal Representation Learning |
| VALOR | - | - | - | 61.4 | 84.4 | 90.9 | VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset |
| ALBEF | 77.6 | 94.3 | 97.2 | 60.7 | 84.3 | 90.5 | Align before Fuse: Vision and Language Representation Learning with Momentum Distillation |
| ALIGN | 77 | 93.5 | 96.9 | 59.9 | 83.3 | 89.8 | Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision |
| ERNIE-ViL 2.0 | 77.4 | 93.6 | 97.1 | 59.5 | 83.4 | 90.1 | ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text Pre-training |
| TCL | 75.6 | 92.8 | 96.7 | 59.0 | 83.2 | 89.9 | Vision-Language Pre-Training with Triple Contrastive Learning |
| Oscar | 73.5 | 92.2 | 96.0 | 57.5 | 82.8 | 89.8 | Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks |
| METER | 76.16 | 93.16 | 96.82 | 57.08 | 82.66 | 90.07 | An Empirical Study of Training End-to-End Vision-and-Language Transformers |