Multimodal Convolutional Neural Networks for Matching Image and Sentence

Lin Ma; Zhengdong Lu; Lifeng Shang; Hang Li

Abstract

In this paper, we propose multimodal convolutional neural networks (m-CNNs) for matching image and sentence. Our m-CNN provides an end-to-end framework with convolutional architectures that exploits image representation, word composition, and the matching relations between the two modalities. More specifically, it consists of one image CNN encoding the image content and one matching CNN learning the joint representation of image and sentence. The matching CNN composes words into semantic fragments at different levels and learns the inter-modal relations between the image and the composed fragments, thus fully exploiting the matching relations between image and sentence. Experimental results on benchmark databases for bidirectional image and sentence retrieval demonstrate that the proposed m-CNNs effectively capture the information needed to match images and sentences. In particular, our m-CNNs achieve state-of-the-art performance on bidirectional image and sentence retrieval on the Flickr30K and Microsoft COCO databases.
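The pipeline described in the abstract — an image CNN producing a visual feature, a matching CNN composing word embeddings into fragments, and a joint score over the two — can be sketched as follows. This is an illustrative numpy sketch, not the paper's exact layers: the convolution window size, fragment dimension, and the final linear scoring layer are all assumptions for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def conv1d_valid(x, w):
    """1-D 'valid' convolution over the word axis with ReLU.
    x: (num_words, d_in), w: (window, d_in, d_out) -> (num_words - window + 1, d_out)."""
    n, _ = x.shape
    window, _, d_out = w.shape
    out = np.empty((n - window + 1, d_out))
    for i in range(n - window + 1):
        # Compose a window of word vectors into one fragment vector.
        out[i] = np.maximum(np.einsum("kd,kdo->o", x[i:i + window], w), 0.0)
    return out

def match_score(img_feat, word_embs, conv_w, mlp_w):
    """Hypothetical m-CNN-style matching score: compose words into phrase
    fragments with a conv layer, max-pool over fragments, concatenate with
    the image feature, and score with a linear layer."""
    fragments = conv1d_valid(word_embs, conv_w)   # word composition
    sent_vec = fragments.max(axis=0)              # max-pool over fragments
    joint = np.concatenate([img_feat, sent_vec])  # joint image-sentence representation
    return float(joint @ mlp_w)                   # scalar matching score

# Tiny toy shapes stand in for e.g. a 4096-d image CNN feature and 50-d word embeddings.
d_img, d_word, d_frag, window, n_words = 8, 5, 6, 3, 7
img = rng.standard_normal(d_img)
words = rng.standard_normal((n_words, d_word))
conv_w = rng.standard_normal((window, d_word, d_frag)) * 0.1
mlp_w = rng.standard_normal(d_img + d_frag) * 0.1
print(match_score(img, words, conv_w, mlp_w))
```

In practice such a score would be trained with a ranking objective over matched and mismatched image-sentence pairs, and the matching CNN would apply this composition at several levels (word, phrase, sentence) rather than the single conv layer shown here.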

Benchmarks

Benchmark: Image Retrieval on Flickr30K (1K test)
Methodology: m-CNN
Metrics: R@1: 26.2 | R@5: 56.3 | R@10: 69.6
