3 months ago

VISTA: Visualized Text Embedding For Universal Multi-Modal Retrieval

Junjie Zhou Zheng Liu Shitao Xiao Bo Zhao Yongping Xiong

Abstract

Multi-modal retrieval is becoming increasingly popular in practice. However, existing retrievers are mostly text-oriented and lack the capability to process visual information. Despite the presence of vision-language models like CLIP, current methods are severely limited in representing text-only and image-only data. In this work, we present VISTA, a new embedding model for universal multi-modal retrieval. Our work makes three technical contributions. First, we introduce a flexible architecture that extends a powerful text encoder with image understanding capability by introducing visual token embeddings. Second, we develop two data generation strategies that produce high-quality composed image-text data to facilitate the training of the embedding model. Third, we introduce a multi-stage training algorithm, which first aligns the visual token embeddings with the text encoder using massive weakly labeled data, and then develops multi-modal representation capability using the generated composed image-text data. In our experiments, VISTA achieves superior performance across a variety of multi-modal retrieval tasks in both zero-shot and supervised settings. Our model, data, and source code are available at https://github.com/FlagOpen/FlagEmbedding.
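The first contribution, feeding visual token embeddings into a text encoder, can be illustrated with a toy sketch. Everything below is an assumption made for illustration and is not taken from the FlagEmbedding code: the function names (`project`, `build_input_sequence`, `toy_encoder`), the linear projection, and the mean-pooling stand-in for the text encoder are all hypothetical. The only idea drawn from the abstract is that image-derived tokens and text tokens share one input sequence, so the same encoder can embed text-only, image-only, and composed image-text inputs.

```python
import math
from typing import List

Vector = List[float]


def project(patch: Vector, weights: List[Vector]) -> Vector:
    """Hypothetical linear map from an image patch feature to the token space."""
    return [sum(w * x for w, x in zip(row, patch)) for row in weights]


def build_input_sequence(
    patch_features: List[Vector],
    text_token_embeddings: List[Vector],
    weights: List[Vector],
) -> List[Vector]:
    """Place visual tokens in front of the text tokens in a single sequence."""
    visual_tokens = [project(p, weights) for p in patch_features]
    return visual_tokens + text_token_embeddings


def toy_encoder(sequence: List[Vector]) -> Vector:
    """Stand-in for the text encoder (assumption): mean-pool, then L2-normalize."""
    dim = len(sequence[0])
    pooled = [sum(tok[i] for tok in sequence) / len(sequence) for i in range(dim)]
    norm = math.sqrt(sum(v * v for v in pooled)) or 1.0
    return [v / norm for v in pooled]


# Toy data: two 3-dimensional patch features, one 2-dimensional text token.
weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
patches = [[1.0, 0.0, 5.0], [0.0, 1.0, 5.0]]
text_tokens = [[1.0, 1.0]]

composed = build_input_sequence(patches, text_tokens, weights)
image_only = build_input_sequence(patches, [], weights)
text_only = build_input_sequence([], text_tokens, weights)

composed_embedding = toy_encoder(composed)
```

Because the three input types all reduce to a token sequence, one encoder produces embeddings in a shared space for each of them; a real model would replace `toy_encoder` with a pretrained Transformer text encoder.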

Code Repositories

flagopen/flagembedding (Official, PyTorch)

Benchmarks

Benchmark | Methodology | Metrics
image-retrieval-on-cirr | VISTA (base) | (Recall@5+Recall_subset@1)/2: 75.9
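The CIRR metric reported above averages two recall scores. The sketch below shows how such an average could be computed. It assumes one target image per query, assumes that Recall_subset@1 is measured on a ranking restricted to a small candidate subset for each query, and uses hypothetical helper names and toy rankings rather than the benchmark's official evaluation code.

```python
from typing import List, Sequence


def recall_at_k(ranked_ids: Sequence[str], target_id: str, k: int) -> float:
    """Return 1.0 if the target appears in the top-k results, else 0.0."""
    return 1.0 if target_id in ranked_ids[:k] else 0.0


def mean_recall_at_k(
    rankings: List[Sequence[str]], targets: List[str], k: int
) -> float:
    """Average Recall@k over all queries, expressed as a percentage."""
    hits = [recall_at_k(r, t, k) for r, t in zip(rankings, targets)]
    return 100.0 * sum(hits) / len(hits)


def cirr_average(recall_at_5: float, recall_subset_at_1: float) -> float:
    """Combine the two scores as (Recall@5 + Recall_subset@1) / 2."""
    return (recall_at_5 + recall_subset_at_1) / 2


# Toy rankings for two queries (illustrative data, not benchmark output).
targets = ["c", "u"]
full_rankings = [
    ["a", "b", "c", "d", "e", "f"],  # target "c" is in the top 5
    ["x", "y", "z", "w", "v", "u"],  # target "u" is ranked 6th
]
subset_rankings = [
    ["c", "a"],  # ranking restricted to the query's candidate subset
    ["u", "x"],
]

r5 = mean_recall_at_k(full_rankings, targets, k=5)
r_subset_1 = mean_recall_at_k(subset_rankings, targets, k=1)
score = cirr_average(r5, r_subset_1)
```

Averaging the two scores rewards a model both for retrieving the target from the full gallery and for picking it out among close alternatives.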
