3 months ago

Fine-grained Late-interaction Multi-modal Retrieval for Retrieval Augmented Visual Question Answering

Weizhe Lin Jinghong Chen Jingbiao Mei Alexandru Coca Bill Byrne

Abstract

Knowledge-based Visual Question Answering (KB-VQA) requires VQA systems to utilize knowledge from external knowledge bases to answer visually-grounded questions. Retrieval-Augmented Visual Question Answering (RA-VQA), a strong framework for tackling KB-VQA, first retrieves related documents with Dense Passage Retrieval (DPR) and then uses them to answer questions. This paper proposes Fine-grained Late-interaction Multi-modal Retrieval (FLMR), which significantly improves knowledge retrieval in RA-VQA. FLMR addresses two major limitations of RA-VQA's retriever: (1) the image representations obtained via image-to-text transforms can be incomplete and inaccurate, and (2) relevance scores between queries and documents are computed with one-dimensional embeddings, which can be insensitive to finer-grained relevance. FLMR overcomes the first limitation by obtaining image representations that complement those from the image-to-text transforms, using a vision model aligned with an existing text-based retriever through a simple alignment network. It overcomes the second by encoding images and questions with multi-dimensional embeddings that capture finer-grained relevance between queries and documents. FLMR improves the original RA-VQA retriever's PRRecall@5 by approximately 8%. Finally, we equipped RA-VQA with two state-of-the-art large multi-modal/language models to achieve a VQA score of about 61% on the OK-VQA dataset.
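The contrast between single-vector scoring and late interaction can be illustrated with a minimal sketch. It assumes a ColBERT-style MaxSim score (each query token embedding is matched to its most similar document token embedding, and the maxima are summed), which is a common form of late interaction; the abstract does not spell out FLMR's exact scoring function, so this is an illustration rather than the authors' implementation. The function names and the toy 2-dimensional embeddings are hypothetical. In an FLMR-like setting the query rows would hold both question-token embeddings and image-derived embeddings.

```python
from typing import List, Sequence

Vector = Sequence[float]


def dot(a: Vector, b: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mean_pool(tokens: Sequence[Vector]) -> List[float]:
    """Collapse a list of token embeddings into one vector (DPR-style stand-in)."""
    return [sum(column) / len(tokens) for column in zip(*tokens)]


def single_vector_score(query_vec: Vector, doc_vec: Vector) -> float:
    """Relevance as one inner product between pooled embeddings."""
    return dot(query_vec, doc_vec)


def late_interaction_score(query_tokens: Sequence[Vector],
                           doc_tokens: Sequence[Vector]) -> float:
    """Assumed MaxSim scoring: sum over query tokens of the best document match."""
    return sum(max(dot(q, d) for d in doc_tokens) for q in query_tokens)


# Toy data (hypothetical): two query token embeddings and two candidate documents.
query_tokens = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0]]   # matches each query token exactly
doc_b = [[0.5, 0.5], [0.5, 0.5]]   # only a blurred average of both

pooled_a = single_vector_score(mean_pool(query_tokens), mean_pool(doc_a))  # 0.5
pooled_b = single_vector_score(mean_pool(query_tokens), mean_pool(doc_b))  # 0.5
late_a = late_interaction_score(query_tokens, doc_a)                       # 2.0
late_b = late_interaction_score(query_tokens, doc_b)                       # 1.0
```

In this toy case the pooled scores tie, while the token-level score ranks `doc_a` above `doc_b`. That is the kind of finer-grained relevance a single embedding per query and document can miss.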

Benchmarks

| Benchmark | Methodology | Metrics |
| --- | --- | --- |
| retrieval-on-ok-vqa | FLMR | Recall@5: 89.32 |
| visual-question-answering-on-ok-vqa | RA-VQA-v2 (BLIP 2) | Accuracy: 62.08; Exact Match (EM): 62.01; Recall@5: 89.32 |
| visual-question-answering-on-ok-vqa | RA-VQA-v2 (T5-large) | Accuracy: 54.85 |
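
For readers unfamiliar with the retrieval metric, the sketch below shows one common way to compute Recall@K as a percentage: a query counts as a hit if at least one of its top-K retrieved documents is relevant. How relevance is judged for the numbers above is not stated on this page, so the `relevant` mapping, the function name, and the toy document IDs are assumptions made for illustration.

```python
from typing import Dict, List, Set


def recall_at_k(ranked: Dict[str, List[str]],
                relevant: Dict[str, Set[str]],
                k: int = 5) -> float:
    """Percentage of queries with at least one relevant document in the top k."""
    if not ranked:
        raise ValueError("ranked must contain at least one query")
    hits = 0
    for query_id, doc_ids in ranked.items():
        top_k = set(doc_ids[:k])
        # A query is a hit if any top-k document is in its relevant set.
        if top_k & relevant.get(query_id, set()):
            hits += 1
    return 100.0 * hits / len(ranked)


# Toy retrieval run (hypothetical IDs): ranked lists per query, best first.
ranked = {
    "q1": ["d3", "d7", "d1"],
    "q2": ["d2", "d9"],
    "q3": ["d5", "d4", "d8"],
    "q4": ["d6"],
}
relevant = {
    "q1": {"d1"},
    "q2": {"d4"},
    "q3": {"d5", "d8"},
    "q4": {"d6"},
}

score_at_5 = recall_at_k(ranked, relevant, k=5)  # 75.0
score_at_2 = recall_at_k(ranked, relevant, k=2)  # 50.0
```

Tightening the cutoff from 5 to 2 drops `q1` from the hits, because its only relevant document sits at rank 3.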
