
Towards Mixed-Modal Retrieval for Universal Retrieval-Augmented Generation

Chenghao Zhang Guanting Dong Xinyu Yang Zhicheng Dou

Abstract

Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm for enhancing large language models (LLMs) by retrieving relevant documents from an external corpus. However, existing RAG systems primarily focus on unimodal text documents, and often fall short in real-world scenarios where both queries and documents may contain mixed modalities (such as text and images). In this paper, we address the challenge of Universal Retrieval-Augmented Generation (URAG), which involves retrieving and reasoning over mixed-modal information to improve vision-language generation. To this end, we propose Nyx, a unified mixed-modal to mixed-modal retriever tailored for URAG scenarios. To mitigate the scarcity of realistic mixed-modal data, we introduce a four-stage automated pipeline for generation and filtering, leveraging web documents to construct NyxQA, a dataset comprising diverse mixed-modal question-answer pairs that better reflect real-world information needs. Building on this high-quality dataset, we adopt a two-stage training framework for Nyx: we first perform pre-training on NyxQA along with a variety of open-source retrieval datasets, followed by supervised fine-tuning using feedback from downstream vision-language models (VLMs) to align retrieval outputs with generative preferences. Experimental results demonstrate that Nyx not only performs competitively on standard text-only RAG benchmarks, but also excels in the more general and realistic URAG setting, significantly improving generation quality in vision-language tasks.
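The abstract does not include implementation details, but the core idea of a mixed-modal to mixed-modal retriever can be illustrated with a toy sketch: queries and documents, each possibly pairing text with an image, are mapped into one shared embedding space and ranked by cosine similarity. In the sketch below, a deterministic bag-of-words hashing embedding stands in for a trained vision-language encoder, and an image is approximated by its caption; all function names and data here are illustrative assumptions, not the actual Nyx method.

```python
import math
import zlib


def embed_text(text, dim=64):
    # Toy bag-of-words hashing embedding, unit-normalized.
    # A real URAG retriever would use a trained VLM encoder here.
    vec = [0.0] * dim
    for token in text.lower().split():
        vec[zlib.crc32(token.encode()) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def embed_mixed(text, image_caption=None, dim=64):
    # Fuse text and image features into one shared-space vector.
    # The image side is approximated by its caption (a stand-in for
    # genuine visual features) and averaged with the text vector.
    t = embed_text(text, dim)
    if image_caption is None:
        return t
    i = embed_text(image_caption, dim)
    fused = [(a + b) / 2.0 for a, b in zip(t, i)]
    norm = math.sqrt(sum(v * v for v in fused)) or 1.0
    return [v / norm for v in fused]


def retrieve(query, corpus, k=2):
    # Rank mixed-modal documents against a mixed-modal query by
    # cosine similarity (dot product of unit vectors).
    q = embed_mixed(*query)
    scored = []
    for idx, doc in enumerate(corpus):
        d = embed_mixed(*doc)
        scored.append((sum(a * b for a, b in zip(q, d)), idx))
    scored.sort(reverse=True)
    return [idx for _, idx in scored[:k]]


# Illustrative corpus: each entry is (text, optional image caption).
corpus = [
    ("red panda habitat and diet", "photo of a red panda in a tree"),
    ("quarterly stock market trends", None),
]
query = ("where do red pandas live", "red panda photo")
print(retrieve(query, corpus))  # doc 0 should rank first
```

The averaging fusion is the simplest possible choice; the paper's two-stage training (pre-training on NyxQA plus supervised fine-tuning on VLM feedback) is precisely what such a naive shared space lacks.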
