Towards Mixed-Modal Retrieval for Universal Retrieval-Augmented Generation
Chenghao Zhang Guanting Dong Xinyu Yang Zhicheng Dou

Abstract
Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm for enhancing large language models (LLMs) by retrieving relevant documents from an external corpus. However, existing RAG systems primarily focus on unimodal text documents, and often fall short in real-world scenarios where both queries and documents may contain mixed modalities (such as text and images). In this paper, we address the challenge of Universal Retrieval-Augmented Generation (URAG), which involves retrieving and reasoning over mixed-modal information to improve vision-language generation. To this end, we propose Nyx, a unified mixed-modal to mixed-modal retriever tailored for URAG scenarios. To mitigate the scarcity of realistic mixed-modal data, we introduce a four-stage automated pipeline for generation and filtering, leveraging web documents to construct NyxQA, a dataset comprising diverse mixed-modal question-answer pairs that better reflect real-world information needs. Building on this high-quality dataset, we adopt a two-stage training framework for Nyx: we first perform pre-training on NyxQA along with a variety of open-source retrieval datasets, followed by supervised fine-tuning using feedback from downstream vision-language models (VLMs) to align retrieval outputs with generative preferences. Experimental results demonstrate that Nyx not only performs competitively on standard text-only RAG benchmarks, but also excels in the more general and realistic URAG setting, significantly improving generation quality in vision-language tasks.
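To make the mixed-modal to mixed-modal retrieval setting concrete, the sketch below ranks documents that may contain both text and images against a query that may also mix modalities. This is an illustrative toy, not the paper's implementation: the placeholder encoders, mean-pooling fusion, and cosine scoring are assumptions standing in for Nyx's trained vision-language backbone and its two-stage training.

```python
# Minimal sketch of mixed-modal retrieval for a URAG-style pipeline.
# All encoders here are hash-seeded placeholders; Nyx would instead use a
# trained vision-language model to embed text and images jointly.
import numpy as np

DIM = 256  # embedding dimensionality (arbitrary for this sketch)

def embed_text(text: str) -> np.ndarray:
    """Placeholder text encoder: deterministic random vector keyed by the text."""
    seed = abs(hash(text)) % (2**32)
    return np.random.default_rng(seed).standard_normal(DIM)

def embed_image(image_id: str) -> np.ndarray:
    """Placeholder image encoder keyed by an image identifier (e.g., a file name)."""
    seed = abs(hash("img:" + image_id)) % (2**32)
    return np.random.default_rng(seed).standard_normal(DIM)

def embed_mixed(text: str, image_ids: list[str]) -> np.ndarray:
    """Fuse text and image features into one L2-normalized vector (mean pooling)."""
    parts = [embed_text(text)] + [embed_image(i) for i in image_ids]
    vec = np.mean(parts, axis=0)
    return vec / np.linalg.norm(vec)

def retrieve(query: dict, corpus: list[dict], k: int = 2) -> list[tuple[float, dict]]:
    """Rank mixed-modal documents by cosine similarity to a mixed-modal query."""
    q = embed_mixed(query["text"], query.get("images", []))
    scored = []
    for doc in corpus:
        d = embed_mixed(doc["text"], doc.get("images", []))
        scored.append((float(q @ d), doc))
    return sorted(scored, key=lambda pair: -pair[0])[:k]

# Usage example: both the query and the documents can carry text and images.
corpus = [
    {"text": "The Eiffel Tower illuminated at night", "images": ["eiffel_night.jpg"]},
    {"text": "A recipe for sourdough bread", "images": []},
]
query = {"text": "What does the Eiffel Tower look like after dark?",
         "images": ["tower_photo.jpg"]}
for score, doc in retrieve(query, corpus):
    print(f"{score:+.3f}  {doc['text']}")
```

In the actual system, the top-ranked mixed-modal documents would then be passed to a downstream VLM, whose feedback drives the supervised fine-tuning stage described in the abstract.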