Cross Modal Retrieval On Flickr30K
Evaluation Metrics

Image-to-text R@1
Image-to-text R@5
Image-to-text R@10
Text-to-image R@1
Text-to-image R@5
Text-to-image R@10
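All six metrics are Recall@K: the percentage of queries for which a correct match appears among the top K retrieved items, measured in both directions (retrieving captions for an image, and images for a caption). The sketch below shows how these numbers are typically computed from a model's image-caption similarity matrix, assuming the standard Flickr30K test protocol (1,000 images, 5 captions per image). The function name `recall_at_k` and the matrix layout are illustrative assumptions, not taken from any particular codebase.

```python
import numpy as np

def recall_at_k(sim, k, captions_per_image=5):
    """Illustrative Recall@K for both retrieval directions.

    sim: (num_images, num_captions) similarity matrix, where the
    captions for image i occupy columns i*captions_per_image to
    (i+1)*captions_per_image - 1 (the usual Flickr30K test layout;
    an assumption here, not a fixed standard).
    Returns (image_to_text, text_to_image) recall as percentages.
    """
    n_img, n_cap = sim.shape

    # Image-to-text: a hit if any of the image's own captions
    # ranks among its top-k retrieved captions.
    top_caps = np.argsort(-sim, axis=1)[:, :k]
    i2t = np.mean([(top_caps[i] // captions_per_image == i).any()
                   for i in range(n_img)])

    # Text-to-image: a hit if the caption's source image
    # ranks among its top-k retrieved images.
    top_imgs = np.argsort(-sim.T, axis=1)[:, :k]
    t2i = np.mean([(top_imgs[j] == j // captions_per_image).any()
                   for j in range(n_cap)])

    return 100.0 * i2t, 100.0 * t2i

# Toy check on a random similarity matrix for the standard
# 1,000-image / 5,000-caption Flickr30K test split.
sim = np.random.rand(1000, 5000)
i2t_r1, t2i_r1 = recall_at_k(sim, k=1)
```

For large galleries, a full `argsort` is wasteful; `np.argpartition` can select the top-k candidates in linear time before ranking only those.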
Benchmark Results

Performance of each model on this benchmark. All values are Recall@K in percent; a dash ("-") marks entries not reported on the page.
| Model | Image-to-text R@1 | Image-to-text R@5 | Image-to-text R@10 | Text-to-image R@1 | Text-to-image R@5 | Text-to-image R@10 | Paper Title |
| --- | --- | --- | --- | --- | --- | --- | --- |
| ERNIE-ViL 2.0 | 97.2 | 100.0 | 100.0 | 93.3 | 99.4 | 99.8 | ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text Pre-training |
| X2-VLM (large) | 98.8 | 100 | 100 | 91.8 | 98.6 | 99.5 | X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks |
| VAST | - | - | - | 91.0 | 98.5 | 99.5 | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset |
| X2-VLM (base) | 98.5 | 100 | 100 | 90.4 | 98.2 | 99.3 | X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks |
| BEiT-3 | 98.0 | 100.0 | 100.0 | 90.3 | 98.7 | 99.5 | Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks |
| OmniVL (14M) | 97.3 | 99.9 | 100 | 87.9 | 97.8 | 99.1 | OmniVL: One Foundation Model for Image-Language and Video-Language Tasks |
| X-VLM (base) | 97.1 | 100.0 | 100.0 | 86.9 | 97.3 | 98.7 | Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts |
| Aurora (ours, r=128) | 97.2 | 100 | 100 | 86.8 | 97.6 | 98.9 | - |
| VSE-Gradient | 97.0 | 99.6 | 100 | 86.3 | 97.4 | 99.0 | Dissecting Deep Metric Learning Losses for Image-Text Retrieval |
| ALIGN | 95.3 | 99.8 | 100 | 84.9 | 97.4 | 98.6 | Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision |
| IAIS | 88.3 | 98.4 | 99.4 | 76.86 | 93.3 | 95.72 | Learning Relation Alignment for Calibrated Cross-modal Retrieval |
| ViSTA | 89.5 | 98.4 | 99.6 | 75.8 | 94.2 | 96.9 | ViSTA: Vision and Scene Text Aggregation for Cross-Modal Retrieval |
| 3SHNet | 87.1 | 98.2 | 99.2 | 69.5 | 91.0 | 94.7 | 3SHNet: Boosting Image-Sentence Retrieval via Visual Semantic-Spatial Self-Highlighting |
| DSMD | 82.5 | 95.5 | 97.7 | 68.4 | 90.8 | 94.4 | Dynamic Self-adaptive Multiscale Distillation from Pre-trained Multimodal Large Model for Efficient Cross-modal Representation Learning |
| ViLT-B/32 | 83.5 | 96.7 | 98.6 | 64.4 | 88.7 | 93.8 | ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision |
| RCAR | 82.3 | 96.0 | 98.4 | 62.6 | 85.8 | 91.1 | Plug-and-Play Regulators for Image-Text Matching |
| NAPReg | 79.6 | - | - | 60.0 | - | - | NAPReg: Nouns As Proxies Regularization for Semantically Aware Cross-Modal Embeddings |
| SGRAF | 77.8 | 94.1 | 97.4 | 58.5 | 83.0 | 88.8 | Similarity Reasoning and Filtration for Image-Text Matching |
| GSMN | 76.4 | 94.3 | 97.3 | 57.4 | 82.3 | 89.0 | Graph Structured Network for Image-Text Matching |
| Pearl | 75.3 | 93.4 | 97.3 | 54.98 | 81.3 | 88.26 | - |