8 months ago

Multimodal Representation

Visual Document Retrieval

Zheyuan Liu, Cristian Rodriguez-Opazo, Damien Teney, Stephen Gould

Abstract

We extend the task of composed image retrieval, where an input query consists of an image and a short textual description of how to modify the image. Existing methods have only been applied to non-complex images within narrow domains, such as fashion products, thereby limiting the scope of study on in-depth visual reasoning in rich image and language contexts. To address this issue, we collect the Compose Image Retrieval on Real-life images (CIRR) dataset, which consists of over 36,000 pairs of crowd-sourced, open-domain images with human-generated modifying text. To extend current methods to the open domain, we propose CIRPLANT, a transformer-based model that leverages rich pre-trained vision-and-language (V&L) knowledge for modifying visual features conditioned on natural language. Retrieval is then done by nearest neighbor lookup on the modified features. We demonstrate that with a relatively simple architecture, CIRPLANT outperforms existing methods on open-domain images, while matching state-of-the-art accuracy on the existing narrow datasets, such as fashion. Together with the release of CIRR, we believe this work will inspire further research on composed image retrieval.
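
The retrieval step named in the abstract, nearest neighbor lookup on the modified features, can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not state the similarity measure, so cosine similarity is assumed here, and the function names and the three-dimensional feature vectors are hypothetical stand-ins for real model outputs.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors (assumed metric)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def retrieve(
    query: Sequence[float],
    gallery: Sequence[Sequence[float]],
    k: int = 1,
) -> List[int]:
    """Return indices of the k gallery vectors most similar to the query."""
    scores = [cosine_similarity(query, g) for g in gallery]
    ranked = sorted(range(len(gallery)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]


# Hypothetical features: in a real system, `modified_query` would be the
# text-conditioned image feature and `gallery` the candidate image features.
modified_query = [0.9, 0.1, 0.0]
gallery = [
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.7, 0.7, 0.0],
]
top2 = retrieve(modified_query, gallery, k=2)
print(top2)
```

A brute-force scan like this is only practical for small galleries; larger candidate sets would typically call for a vectorized or approximate nearest neighbor index.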

Image Retrieval on Real-life Images with Pre-trained Vision-and-Language Models