Cross-Modal Retrieval on COCO 2014

Evaluation Metrics

Image-to-text R@1
Image-to-text R@5
Image-to-text R@10
Text-to-image R@1
Text-to-image R@5
Text-to-image R@10
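Recall@K (R@K) is the fraction of queries for which a correct match appears among the top-K retrieved candidates. Below is a minimal sketch of how these metrics can be computed from a similarity matrix, assuming the usual COCO protocol (several ground-truth captions per image, dot-product similarity between L2-normalised embeddings); the names `recall_at_k`, `img_emb`, `txt_emb`, and `txt2img` are illustrative and not taken from any listed model's codebase.

```python
import numpy as np

def recall_at_k(sim, gt, ks=(1, 5, 10)):
    """sim: (num_queries, num_candidates) similarity matrix.
    gt: list (length num_queries) of sets of correct candidate indices.
    Returns {k: fraction of queries with a correct candidate in the top-k}."""
    order = np.argsort(-sim, axis=1)  # candidate indices, best match first
    out = {}
    for k in ks:
        hits = [len(set(order[q, :k]) & gt[q]) > 0 for q in range(sim.shape[0])]
        out[k] = float(np.mean(hits))
    return out

# Toy example: 4 images, 5 captions each (20 texts), random unit embeddings.
rng = np.random.default_rng(0)
img_emb = rng.normal(size=(4, 64))
img_emb /= np.linalg.norm(img_emb, axis=1, keepdims=True)
txt_emb = rng.normal(size=(20, 64))
txt_emb /= np.linalg.norm(txt_emb, axis=1, keepdims=True)
txt2img = np.repeat(np.arange(4), 5)  # caption i belongs to image txt2img[i]

# Image-to-text: query = image, correct candidates = its 5 captions.
i2t = recall_at_k(img_emb @ txt_emb.T,
                  [set(np.where(txt2img == i)[0]) for i in range(4)])
# Text-to-image: query = caption, correct candidate = its source image.
t2i = recall_at_k(txt_emb @ img_emb.T, [{txt2img[t]} for t in range(20)])
print("Image-to-text R@K:", i2t)
print("Text-to-image R@K:", t2i)
```

Published COCO retrieval numbers such as those in the table below are typically computed this way on the 5K-image Karpathy test split.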

Evaluation Results

Performance of each model on this benchmark:

| Model | Image-to-text R@1 | Image-to-text R@5 | Image-to-text R@10 | Text-to-image R@1 | Text-to-image R@5 | Text-to-image R@10 | Paper Title |
|---|---|---|---|---|---|---|---|
| VAST | - | - | - | 68.0 | 87.7 | 92.8 | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset |
| X2-VLM (large) | 84.4 | 96.5 | 98.5 | 67.7 | 87.5 | 92.5 | X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks |
| BEiT-3 | 84.8 | 96.5 | 98.3 | 67.2 | 87.7 | 92.8 | Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks |
| XFM (base) | 84.2 | 96.4 | 98.4 | 67.0 | 87.2 | 92.4 | Toward Building General Foundation Models for Language, Vision, and Vision-Language Understanding Tasks |
| X2-VLM (base) | 83.5 | 96.3 | 98.5 | 66.2 | 87.1 | 92.2 | X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks |
| PTP-BLIP (14M) | 81.5 | 95.9 | 97.9 | 64.9 | 87.4 | 92.2 | Position-guided Text Prompt for Vision-Language Pre-training |
| OmniVL (14M) | 82.1 | 95.9 | 98.1 | 64.8 | 86.1 | 91.6 | OmniVL: One Foundation Model for Image-Language and Video-Language Tasks |
| VSE-Gradient | 81.4 | 95.6 | 97.9 | 63.6 | 86.0 | 91.5 | Dissecting Deep Metric Learning Losses for Image-Text Retrieval |
| X-VLM (base) | 81.2 | 95.6 | 98.2 | 63.4 | 85.8 | 91.5 | Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts |
| Florence | 81.8 | 95.2 | - | 63.2 | 85.7 | - | Florence: A New Foundation Model for Computer Vision |
| VK-OOD | 80.7 | 95.1 | 96.8 | 62.9 | 84.8 | 92.8 | Implicit Differentiable Outlier Detection Enable Robust Deep Multimodal Analysis |
| Aurora (ours, r=128) | 80.7 | 95.3 | 97.8 | 62.8 | 84.8 | 91.0 | - |
| DSMD | 48.0 | 75.6 | 84.5 | 62.1 | 85.9 | 92.0 | Dynamic Self-adaptive Multiscale Distillation from Pre-trained Multimodal Large Model for Efficient Cross-modal Representation Learning |
| VALOR | - | - | - | 61.4 | 84.4 | 90.9 | VALOR: Vision-Audio-Language Omni-Perception Pretraining Model and Dataset |
| ALBEF | 77.6 | 94.3 | 97.2 | 60.7 | 84.3 | 90.5 | Align before Fuse: Vision and Language Representation Learning with Momentum Distillation |
| ALIGN | 77.0 | 93.5 | 96.9 | 59.9 | 83.3 | 89.8 | Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision |
| ERNIE-ViL 2.0 | 77.4 | 93.6 | 97.1 | 59.5 | 83.4 | 90.1 | ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text Pre-training |
| TCL | 75.6 | 92.8 | 96.7 | 59.0 | 83.2 | 89.9 | Vision-Language Pre-Training with Triple Contrastive Learning |
| Oscar | 73.5 | 92.2 | 96.0 | 57.5 | 82.8 | 89.8 | Oscar: Object-Semantics Aligned Pre-training for Vision-Language Tasks |
| METER | 76.16 | 93.16 | 96.82 | 57.08 | 82.66 | 90.07 | An Empirical Study of Training End-to-End Vision-and-Language Transformers |