Jinming Wu, Zihao Deng, Wei Li, Yiding Liu, Bo You, Bo Li, Zejun Ma, Ziwei Liu

Abstract
Robust deployment of large multimodal models (LMMs) in real-world scenarios requires access to external knowledge sources, given the complexity and dynamic nature of real-world information. Existing approaches such as retrieval-augmented generation (RAG) and prompt-engineered search agents rely on rigid pipelines, often leading to inefficient or excessive search behaviors. We present MMSearch-R1, the first end-to-end reinforcement learning framework that enables LMMs to perform on-demand, multi-turn search in real-world Internet environments. Our framework integrates both image and text search tools, allowing the model to reason about when and how to invoke them, guided by an outcome-based reward with a search penalty. To support training, we collect a multimodal search VQA dataset through a semi-automated pipeline that covers diverse visual and textual knowledge needs, and curate a search-balanced subset with both search-required and search-free samples, which proves essential for shaping efficient and on-demand search behavior. Extensive experiments on knowledge-intensive and info-seeking VQA tasks show that our model not only outperforms RAG-based baselines of the same model size, but also matches the performance of a larger RAG-based model while reducing search calls by over 30%. We further analyze key empirical findings to offer actionable insights for advancing research in multimodal search.
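The abstract mentions an outcome-based reward with a search penalty that shapes on-demand search behavior. The sketch below illustrates one way such a reward could be computed per rollout; the specific penalty value, and whether it applies regardless of answer correctness, are assumptions for illustration rather than the authors' exact formulation.

```python
# Illustrative sketch of an outcome-based reward with a search penalty.
# The penalty value and its conditions are assumptions, not the paper's
# exact implementation.

def outcome_reward(answer_correct: bool, used_search: bool,
                   search_penalty: float = 0.1) -> float:
    """Scalar reward for one rollout.

    answer_correct: final answer matches the ground truth.
    used_search:    the model invoked the image or text search tool.
    search_penalty: deduction discouraging unnecessary search calls.
    """
    reward = 1.0 if answer_correct else 0.0
    if used_search:
        # A correct answer found without search scores higher than one that
        # required a search call, nudging the policy to search only when
        # its internal knowledge is insufficient.
        reward -= search_penalty
    return reward
```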