Dual-Path Convolutional Image-Text Embeddings with Instance Loss

Zheng Zhedong ; Zheng Liang ; Garrett Michael ; Yang Yi ; Xu Mingliang ; Shen Yi-Dong

Abstract

Matching images and sentences demands a fine understanding of both modalities. In this paper, we propose a new system to discriminatively embed the image and text to a shared visual-textual space. In this field, most existing works apply the ranking loss to pull the positive image / text pairs close and push the negative pairs apart from each other. However, directly deploying the ranking loss is hard for network learning, since it starts from the two heterogeneous features to build inter-modal relationship. To address this problem, we propose the instance loss which explicitly considers the intra-modal data distribution. It is based on an unsupervised assumption that each image / text group can be viewed as a class. So the network can learn the fine granularity from every image/text group. The experiment shows that the instance loss offers better weight initialization for the ranking loss, so that more discriminative embeddings can be learned. Besides, existing works usually apply the off-the-shelf features, i.e., word2vec and fixed visual feature. So in a minor contribution, this paper constructs an end-to-end dual-path convolutional network to learn the image and text representations. End-to-end learning allows the system to directly learn from the data and fully utilize the supervision. On two generic retrieval datasets (Flickr30k and MSCOCO), experiments demonstrate that our method yields competitive accuracy compared to state-of-the-art methods. Moreover, in language based person retrieval, we improve the state of the art by a large margin. The code has been made publicly available.
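
The two objectives named in the abstract can be illustrated with a small sketch. It is not the authors' implementation. It assumes that the instance loss is a softmax cross-entropy over image/text group labels with one classifier shared by both paths, and that the ranking loss is a bidirectional hinge loss on cosine similarity. The function names, the shared classifier, and the margin of 0.2 are hypothetical choices made here for illustration; the abstract does not specify them.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def cosine(u, v):
    return dot(u, v) / (math.sqrt(dot(u, u)) * math.sqrt(dot(v, v)))

def cross_entropy(logits, label):
    # Numerically stable -log(softmax(logits)[label]).
    m = max(logits)
    log_sum = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum - logits[label]

def instance_loss(img_feat, txt_feat, group_id, shared_weights):
    # Each image/text group is treated as its own class.
    # Assumption: both modalities are classified by the same weight rows.
    img_logits = [dot(w, img_feat) for w in shared_weights]
    txt_logits = [dot(w, txt_feat) for w in shared_weights]
    return cross_entropy(img_logits, group_id) + cross_entropy(txt_logits, group_id)

def ranking_loss(img, txt, neg_img, neg_txt, margin=0.2):
    # (img, txt) is a matching pair; neg_img and neg_txt come from other groups.
    # Assumption: hinge on cosine similarity with a hypothetical margin.
    img_to_txt = max(0.0, margin - cosine(img, txt) + cosine(img, neg_txt))
    txt_to_img = max(0.0, margin - cosine(txt, img) + cosine(txt, neg_img))
    return img_to_txt + txt_to_img

# Toy usage: three groups, two-dimensional embeddings.
weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
loss_correct = instance_loss([2.0, 0.1], [1.5, -0.2], 0, weights)
loss_wrong = instance_loss([2.0, 0.1], [1.5, -0.2], 1, weights)
rank_easy = ranking_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0])
rank_hard = ranking_loss([1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0])
```

In this toy setting the instance loss is lower when the features are scored against their own group, and the ranking loss is zero once negatives are farther away than the margin. The instance loss only needs the group label of each sample, while the ranking loss needs both modalities of a pair plus sampled negatives.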

Code Repositories

pshroff04/Dual_Path_CNN (pytorch)

layumi/Image-Text-Embedding (official, pytorch)

Benchmarks

Benchmark	Methodology	Metrics
cross-modal-retrieval-on-cuhk-pedes	Dual Path	Text-to-image Medr: 2
cross-modal-retrieval-on-flickr30k	Dual-Path (ResNet)	Image-to-text R@1: 55.6; Image-to-text R@5: 81.9; Image-to-text R@10: 89.5; Text-to-image R@1: 39.1; Text-to-image R@5: 69.2; Text-to-image R@10: 80.9
cross-modal-retrieval-on-mscoco-1k	Dual-path CNN	Image-to-text R@1: 41.2; Text-to-image R@1: 25.3
nlp-based-person-retrival-on-cuhk-pedes	Dual Path	R@1: 44.4; R@5: 66.26; R@10: 75.07
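
The metrics in the table are rank-based: R@K is the percentage of queries for which a correct match appears among the top K retrieved items, and Medr is the median rank of the first correct match. A minimal sketch of how such numbers are commonly computed is given below, assuming a precomputed score for every query/candidate pair. The function names and the toy scores are hypothetical and are not taken from the listed repositories.

```python
from statistics import median

def first_match_ranks(similarity, ground_truth):
    """Return the 1-based rank of the best-ranked correct candidate per query.

    similarity[q][c] is the score of candidate c for query q (higher is better);
    ground_truth[q] is the set of candidate indices that are correct for q.
    """
    ranks = []
    for scores, correct in zip(similarity, ground_truth):
        order = sorted(range(len(scores)), key=lambda c: scores[c], reverse=True)
        ranks.append(min(order.index(c) for c in correct) + 1)
    return ranks

def recall_at_k(ranks, k):
    # Percentage of queries whose first correct match is within the top k.
    return 100.0 * sum(r <= k for r in ranks) / len(ranks)

def median_rank(ranks):
    return median(ranks)

# Toy usage: three queries, three candidates each.
similarity = [
    [0.9, 0.1, 0.3],
    [0.2, 0.8, 0.5],
    [0.6, 0.7, 0.1],
]
ground_truth = [{0}, {2}, {2}]
ranks = first_match_ranks(similarity, ground_truth)
```

With several correct candidates per query, as when an image has multiple captions, this sketch counts a query as a hit when any one of them falls inside the top K.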
