Zero-Shot Cross-Modal Retrieval on Flickr30K

Evaluation Metrics

Image-to-text R@1
Image-to-text R@10
Image-to-text R@5
Text-to-image R@1
Text-to-image R@10
Text-to-image R@5
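
The page does not define R@K. Assuming it is the usual Recall@K for retrieval (the percentage of queries for which at least one correct item appears among the top K ranked candidates), a minimal sketch of the computation could look like the following. The function name `recall_at_k` and the toy scores are illustrative assumptions, not taken from any of the listed papers.

```python
def recall_at_k(similarity, relevant, k):
    """Percentage of queries with at least one relevant item in the top k.

    similarity[q][c] is the score of candidate c for query q.
    relevant[q] is the set of candidate indices that are correct for query q.
    (Hypothetical helper written for illustration only.)
    """
    hits = 0
    for scores, positives in zip(similarity, relevant):
        # Rank candidate indices by descending score and keep the first k.
        ranked = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        if positives.intersection(ranked[:k]):
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy data (made up): 2 images and 4 captions.
# Captions 0-1 describe image 0, captions 2-3 describe image 1.
image_to_text_scores = [
    [0.9, 0.2, 0.3, 0.1],  # image 0 scored against captions 0..3
    [0.6, 0.1, 0.5, 0.4],  # image 1 scored against captions 0..3
]
image_to_text_truth = [{0, 1}, {2, 3}]

# Text-to-image retrieval uses the transposed score matrix:
# each caption is a query and the images are the candidates.
text_to_image_scores = [list(column) for column in zip(*image_to_text_scores)]
text_to_image_truth = [{0}, {0}, {1}, {1}]

i2t_r1 = recall_at_k(image_to_text_scores, image_to_text_truth, k=1)
i2t_r2 = recall_at_k(image_to_text_scores, image_to_text_truth, k=2)
t2i_r1 = recall_at_k(text_to_image_scores, text_to_image_truth, k=1)

print(i2t_r1, i2t_r2, t2i_r1)  # 50.0 100.0 100.0
```

In this toy case image 1 ranks the wrong caption first, so image-to-text R@1 is 50.0, while every caption ranks its own image first, so text-to-image R@1 is 100.0. Real evaluations on Flickr30K would use model-produced similarity scores over the full test split.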

Evaluation Results

Performance of each model on this benchmark

| Model | Image-to-text R@1 | Image-to-text R@10 | Image-to-text R@5 | Text-to-image R@1 | Text-to-image R@10 | Text-to-image R@5 | Paper Title |
|---|---|---|---|---|---|---|---|
| M2-Encoder | 91.2 | 99.6 | 99.2 | 92.2 | 99.7 | 99.5 | M2-Encoder: Advancing Bilingual Image-Text Understanding by Large-scale Efficient Pretraining |
| VAST | - | - | - | 90.4 | - | - | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset |
| InternVL-G | 95.7 | 99.9 | 99.7 | 85.0 | 98.6 | 97.0 | InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks |
| InternVL-C | 94.7 | 99.9 | 99.6 | 81.7 | 98.2 | 96.0 | InternVL: Scaling up Vision Foundation Models and Aligning for Generic Visual-Linguistic Tasks |
| BEiT-3 | 94.9 | 100.0 | 99.9 | 81.5 | 97.8 | 95.6 | Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks |
| RO-ViT | 92.1 | 99.7 | 99.4 | 80.7 | 97.7 | 96.1 | Region-Aware Pretraining for Open-Vocabulary Object Detection with Vision Transformers |
| CoCa | 92.5 | 99.9 | 99.5 | 80.4 | 97.7 | 95.7 | CoCa: Contrastive Captioners are Image-Text Foundation Models |
| COSMOS ViT-B/16 | 92.9 | 99.9 | 99.4 | 80.3 | 97.6 | 95.3 | COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training |
| Flamingo | 89.3 | 99.7 | 98.8 | 79.5 | 97.9 | 95.3 | Flamingo: a Visual Language Model for Few-Shot Learning |
| ERNIE-ViL 2.0 | 91.2 | 99.8 | 99.1 | 77.4 | 96.4 | 93.8 | ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text Pre-training |
| VK-OOD | 89.0 | 99.8 | 99.2 | 77.2 | 98.2 | 94.3 | Implicit Differentiable Outlier Detection Enable Robust Deep Multimodal Analysis |
| ALBEF | 90.5 | 99.7 | 98.8 | 76.8 | 96.7 | 93.7 | Align before Fuse: Vision and Language Representation Learning with Momentum Distillation |
| Florence | 90.9 | - | 99.1 | 76.7 | - | 93.6 | Florence: A New Foundation Model for Computer Vision |
| COSMOS ViT-B/32 | 89.9 | 99.3 | 98.8 | 76.1 | 96.2 | 92.8 | COSMOS: Cross-Modality Self-Distillation for Vision Language Pre-training |
| ALIGN | 88.6 | 99.7 | 98.7 | 75.7 | 96.8 | 93.8 | Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision |
| PTP-BLIP (14M) | 87.1 | 99.3 | 98.4 | 73.1 | 94.8 | 91.0 | Position-guided Text Prompt for Vision-Language Pre-training |
| AltCLIP | 86 | 99.1 | 98 | 72.5 | 95.4 | 91.6 | AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities |
| CLIP | 88.0 | 99.4 | 98.7 | 68.7 | 95.2 | 90.6 | Learning Transferable Visual Models From Natural Language Supervision |
| UNITER | 80.7 | 98.0 | 95.7 | 66.2 | 92.9 | 88.4 | UNITER: UNiversal Image-TExt Representation Learning |
| ViLT-B/32 | 73.2 | 96.5 | 93.6 | 55 | 89.8 | 82.5 | ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision |