Cross Modal Retrieval On Flickr30K

Evaluation Metrics

Image-to-text R@1
Image-to-text R@10
Image-to-text R@5
Text-to-image R@1
Text-to-image R@10
Text-to-image R@5
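
R@K (Recall at K) is conventionally the percentage of queries for which at least one correct match appears among the top K retrieved candidates. In Flickr30K each image is paired with five captions, so an image query has several correct captions while a caption query has exactly one correct image. The following is a minimal sketch in plain Python, not the evaluation code of any paper listed here; the names `recall_at_k` and `transpose`, the toy similarity matrix, and the ground-truth sets are all illustrative assumptions.

```python
from typing import Dict, List, Sequence, Set


def recall_at_k(
    scores: Sequence[Sequence[float]],
    relevant: Sequence[Set[int]],
    ks: Sequence[int] = (1, 5, 10),
) -> Dict[int, float]:
    """Return R@K in percent for each K.

    scores[q][c] is the similarity between query q and candidate c.
    relevant[q] is the set of candidate indices that are correct for q.
    A query counts as a hit at K if any relevant candidate is in its top K.
    """
    hits = {k: 0 for k in ks}
    for row, gold in zip(scores, relevant):
        # Candidate indices sorted by descending similarity.
        ranking = sorted(range(len(row)), key=lambda c: row[c], reverse=True)
        # Best (lowest) 0-based rank reached by any relevant candidate.
        best_rank = min(ranking.index(c) for c in gold)
        for k in ks:
            if best_rank < k:
                hits[k] += 1
    return {k: 100.0 * hits[k] / len(scores) for k in ks}


def transpose(matrix: Sequence[Sequence[float]]) -> List[List[float]]:
    """Swap queries and candidates (image-to-text <-> text-to-image)."""
    return [list(col) for col in zip(*matrix)]


# Toy data (hypothetical): 2 images, 4 captions, 2 captions per image.
# sim[i][t] is the similarity between image i and caption t.
sim = [
    [0.9, 0.2, 0.4, 0.1],  # image 0, correct captions are 0 and 1
    [0.3, 0.8, 0.5, 0.6],  # image 1, correct captions are 2 and 3
]

# Image-to-text: each image is a query over all captions.
i2t = recall_at_k(sim, [{0, 1}, {2, 3}], ks=(1, 2))

# Text-to-image: each caption is a query over all images.
t2i = recall_at_k(transpose(sim), [{0}, {0}, {1}, {1}], ks=(1, 2))

print("Image-to-text:", i2t)
print("Text-to-image:", t2i)
```

With real models the similarity matrix would come from the image and text encoders, and K would be 1, 5, and 10 as in the metrics listed above; the toy example uses K = 1 and 2 only because it has so few candidates.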

Evaluation Results

Performance of each model on this benchmark

| Model | Image-to-text R@1 | Image-to-text R@10 | Image-to-text R@5 | Text-to-image R@1 | Text-to-image R@10 | Text-to-image R@5 | Paper Title | Repository |
|---|---|---|---|---|---|---|---|---|
| ERNIE-ViL 2.0 | 97.2 | 100.0 | 100.0 | 93.3 | 99.8 | 99.4 | ERNIE-ViL 2.0: Multi-view Contrastive Learning for Image-Text Pre-training | |
| X2-VLM (large) | 98.8 | 100 | 100 | 91.8 | 99.5 | 98.6 | X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | |
| VAST | - | - | - | 91.0 | 99.5 | 98.5 | VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foundation Model and Dataset | |
| X2-VLM (base) | 98.5 | 100 | 100 | 90.4 | 99.3 | 98.2 | X$^2$-VLM: All-In-One Pre-trained Model For Vision-Language Tasks | |
| BEiT-3 | 98.0 | 100.0 | 100.0 | 90.3 | 99.5 | 98.7 | Image as a Foreign Language: BEiT Pretraining for All Vision and Vision-Language Tasks | |
| OmniVL (14M) | 97.3 | 100 | 99.9 | 87.9 | 99.1 | 97.8 | OmniVL:One Foundation Model for Image-Language and Video-Language Tasks | - |
| X-VLM (base) | 97.1 | 100.0 | 100.0 | 86.9 | 98.7 | 97.3 | Multi-Grained Vision Language Pre-Training: Aligning Texts with Visual Concepts | |
| Aurora (ours, r=128) | 97.2 | 100 | 100 | 86.8 | 98.9 | 97.6 | - | - |
| VSE-Gradient | 97.0 | 100 | 99.6 | 86.3 | 99.0 | 97.4 | Dissecting Deep Metric Learning Losses for Image-Text Retrieval | |
| ALIGN | 95.3 | 100 | 99.8 | 84.9 | 98.6 | 97.4 | Scaling Up Visual and Vision-Language Representation Learning With Noisy Text Supervision | |
| IAIS | 88.3 | 99.4 | 98.4 | 76.86 | 95.72 | 93.3 | Learning Relation Alignment for Calibrated Cross-modal Retrieval | |
| ViSTA | 89.5 | 99.6 | 98.4 | 75.8 | 96.9 | 94.2 | ViSTA: Vision and Scene Text Aggregation for Cross-Modal Retrieval | - |
| 3SHNet | 87.1 | 99.2 | 98.2 | 69.5 | 94.7 | 91.0 | 3SHNet: Boosting Image-Sentence Retrieval via Visual Semantic-Spatial Self-Highlighting | |
| DSMD | 82.5 | 97.7 | 95.5 | 68.4 | 94.4 | 90.8 | Dynamic Self-adaptive Multiscale Distillation from Pre-trained Multimodal Large Model for Efficient Cross-modal Representation Learning | |
| ViLT-B/32 | 83.5 | 98.6 | 96.7 | 64.4 | 93.8 | 88.7 | ViLT: Vision-and-Language Transformer Without Convolution or Region Supervision | |
| RCAR | 82.3 | 98.4 | 96.0 | 62.6 | 91.1 | 85.8 | Plug-and-Play Regulators for Image-Text Matching | |
| NAPReg | 79.6 | - | - | 60.0 | - | - | NAPReg: Nouns As Proxies Regularization for Semantically Aware Cross-Modal Embeddings | - |
| SGRAF | 77.8 | 97.4 | 94.1 | 58.5 | 88.8 | 83.0 | Similarity Reasoning and Filtration for Image-Text Matching | |
| GSMN | 76.4 | 97.3 | 94.3 | 57.4 | 89.0 | 82.3 | Graph Structured Network for Image-Text Matching | |
| Pearl | 75.3 | 97.3 | 93.4 | 54.98 | 88.26 | 81.3 | - | - |