Zero-Shot Cross-Modal Retrieval on Flickr30K

This benchmark ranks models on zero-shot image-to-text and text-to-image retrieval over the Flickr30K test set (commonly the 1K-image Karpathy split, with five reference captions per image), with no fine-tuning on Flickr30K.
Evaluation Metrics

Recall@K (R@K) is the percentage of queries for which a correct match appears among the top-K retrieved candidates; it is reported in both retrieval directions (a computation sketch follows this list):

- Image-to-text R@1, R@5, R@10
- Text-to-image R@1, R@5, R@10
Evaluation Results

Performance of each model on this benchmark (all values are percentages; higher is better):
| Model | Image-to-text R@1 | Image-to-text R@5 | Image-to-text R@10 | Text-to-image R@1 | Text-to-image R@5 | Text-to-image R@10 | Paper |
|---|---|---|---|---|---|---|---|
| M2-Encoder | 91.2 | 99.2 | 99.6 | 92.2 | 99.5 | 99.7 | M2-Encoder: Advancing Bilingual Image-Text Understanding by Large-scale Efficient Pretraining |
| VAST | - | - | - | 90.4 | - | - | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset |
| InternVL-G | 95.7 | 99.7 | 99.9 | 85.0 | 97.0 | 98.6 | InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks |
| InternVL-C | 94.7 | 99.6 | 99.9 | 81.7 | 96.0 | 98.2 | InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks |
| BEiT-3 | 94.9 | 99.9 | 100.0 | 81.5 | 95.6 | 97.8 | Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks |
| RO-ViT | 92.1 | 99.4 | 99.7 | 80.7 | 96.1 | 97.7 | Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers |
| CoCa | 92.5 | 99.5 | 99.9 | 80.4 | 95.7 | 97.7 | CoCa: Contrastive Captioners are Image-Text Foundation Models |
| COSMOS ViT-B/16 | 92.9 | 99.4 | 99.9 | 80.3 | 95.3 | 97.6 | COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training |
| Flamingo | 89.3 | 98.8 | 99.7 | 79.5 | 95.3 | 97.9 | Flamingo: a Visual Language Model for Few-Shot Learning |
| ERNIE-ViL 2.0 | 91.2 | 99.1 | 99.8 | 77.4 | 93.8 | 96.4 | ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text Pre-training |
| VK-OOD | 89.0 | 99.2 | 99.8 | 77.2 | 94.3 | 98.2 | Implicit Differentiable Outlier Detection Enable Robust Deep Multimodal Analysis |
| ALBEF | 90.5 | 98.8 | 99.7 | 76.8 | 93.7 | 96.7 | Align before Fuse: Vision and Language Representation Learning with Momentum Distillation |
| Florence | 90.9 | 99.1 | - | 76.7 | 93.6 | - | Florence: A New Foundation Model for Computer Vision |
| COSMOS ViT-B/32 | 89.9 | 98.8 | 99.3 | 76.1 | 92.8 | 96.2 | COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training |
| ALIGN | 88.6 | 98.7 | 99.7 | 75.7 | 93.8 | 96.8 | Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision |
| PTP-BLIP (14M) | 87.1 | 98.4 | 99.3 | 73.1 | 91.0 | 94.8 | Position-guided Text Prompt for Vision-Language Pre-training |
| AltCLIP | 86.0 | 98.0 | 99.1 | 72.5 | 91.6 | 95.4 | AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities |
| CLIP | 88.0 | 98.7 | 99.4 | 68.7 | 90.6 | 95.2 | Learning Transferable Visual Models From Natural Language Supervision |
| UNITER | 80.7 | 95.7 | 98.0 | 66.2 | 88.4 | 92.9 | UNITER: UNiversal Image-TExt Representation Learning |
| ViLT-B/32 | 73.2 | 93.6 | 96.5 | 55.0 | 82.5 | 89.8 | ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision |
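As a reproducibility pointer, the sketch below produces a zero-shot similarity matrix with the public CLIP checkpoint from the table, via the Hugging Face `transformers` API; the file paths and captions are placeholder assumptions, and this is an illustration rather than the exact pipeline behind the reported numbers. The resulting matrix (and its transpose, for text-to-image) can be scored with a Recall@K routine like the one sketched above.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Placeholder inputs -- substitute the real Flickr30K test images and captions.
image_paths = ["flickr30k/example_1.jpg", "flickr30k/example_2.jpg"]
captions = ["Two young men are playing soccer on a field.",
            "A child in a pink dress climbs a set of stairs."]

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

images = [Image.open(p).convert("RGB") for p in image_paths]
with torch.no_grad():
    img_emb = model.get_image_features(**processor(images=images, return_tensors="pt"))
    txt_emb = model.get_text_features(**processor(text=captions, padding=True, return_tensors="pt"))

# Cosine similarity: rows are image queries (image-to-text direction);
# transpose the matrix for text-to-image retrieval.
img_emb = img_emb / img_emb.norm(dim=-1, keepdim=True)
txt_emb = txt_emb / txt_emb.norm(dim=-1, keepdim=True)
sim = (img_emb @ txt_emb.T).numpy()
```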