Multimodal Prompt Optimization: Why Not Leverage Multiple Modalities for MLLMs

Yumin Choi, Dongki Kim, Jinheon Baek, Sung Ju Hwang

Abstract

Large Language Models (LLMs) have shown remarkable success, and their multimodal expansions (MLLMs) further unlock capabilities spanning images, videos, and other modalities beyond text. However, despite this shift, prompt optimization approaches, designed to reduce the burden of manual prompt crafting while maximizing performance, remain confined to text, ultimately limiting the full potential of MLLMs. Motivated by this gap, we introduce the new problem of multimodal prompt optimization, which expands the prior definition of prompt optimization to the multimodal space defined by the pairs of textual and non-textual prompts. To tackle this problem, we then propose the Multimodal Prompt Optimizer (MPO), a unified framework that not only performs the joint optimization of multimodal prompts through alignment-preserving updates but also guides the selection process of candidate prompts by leveraging earlier evaluations as priors in a Bayesian-based selection strategy. Through extensive experiments across diverse modalities that go beyond text, such as images, videos, and even molecules, we demonstrate that MPO outperforms leading text-only optimization methods, establishing multimodal prompt optimization as a crucial step to realizing the potential of MLLMs.
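
The abstract only sketches MPO at a high level, so the following Python snippet is a minimal, hedged illustration of one of its two ingredients: a Bayesian-style selection strategy in which earlier evaluations of a parent prompt serve as priors for its candidate children. The MultimodalPrompt class, the Beta-posterior/Thompson-sampling formulation, and the spawn_child and select_candidate helpers are illustrative assumptions, not the paper's actual implementation.

import random
from dataclasses import dataclass


@dataclass
class MultimodalPrompt:
    """One candidate prompt: a textual instruction paired with a non-textual part."""
    text: str
    non_text: object        # e.g., an image, video clip, or molecule representation
    # Beta posterior over the prompt's per-example success rate (assumed parameterization).
    alpha: float = 1.0
    beta: float = 1.0

    def sample_score(self) -> float:
        # Thompson sampling: draw a plausible success rate from the current posterior.
        return random.betavariate(self.alpha, self.beta)

    def update(self, successes: int, failures: int) -> None:
        # Fold new evaluation results into the posterior.
        self.alpha += successes
        self.beta += failures


def spawn_child(parent: MultimodalPrompt, new_text: str, new_non_text: object,
                prior_strength: float = 0.5) -> MultimodalPrompt:
    # Carry a fraction of the parent's evaluation evidence into the child as its prior,
    # so candidates derived from well-performing prompts start with an informed posterior.
    return MultimodalPrompt(
        text=new_text,
        non_text=new_non_text,
        alpha=1.0 + prior_strength * (parent.alpha - 1.0),
        beta=1.0 + prior_strength * (parent.beta - 1.0),
    )


def select_candidate(candidates: list) -> MultimodalPrompt:
    # Evaluate next the candidate whose sampled success rate is highest.
    return max(candidates, key=lambda c: c.sample_score())

Under this sketch, an outer loop would repeatedly call select_candidate, score the chosen prompt on a minibatch of tasks, and feed the resulting counts back through update; how MPO actually parameterizes the prior and performs the alignment-preserving edits to the textual and non-textual prompt components is detailed in the full paper.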
