Multimodal Prompt Optimization: Why Not Leverage Multiple Modalities for MLLMs

Yumin Choi, Dongki Kim, Jinheon Baek, Sung Ju Hwang

Abstract

Large Language Models (LLMs) have shown remarkable success, and their multimodal expansions (MLLMs) further unlock capabilities spanning images, videos, and other modalities beyond text. However, despite this shift, prompt optimization approaches, designed to reduce the burden of manual prompt crafting while maximizing performance, remain confined to text, ultimately limiting the full potential of MLLMs. Motivated by this gap, we introduce the new problem of multimodal prompt optimization, which expands the prior definition of prompt optimization to the multimodal space defined by the pairs of textual and non-textual prompts. To tackle this problem, we then propose the Multimodal Prompt Optimizer (MPO), a unified framework that not only performs the joint optimization of multimodal prompts through alignment-preserving updates but also guides the selection process of candidate prompts by leveraging earlier evaluations as priors in a Bayesian-based selection strategy. Through extensive experiments across diverse modalities that go beyond text, such as images, videos, and even molecules, we demonstrate that MPO outperforms leading text-only optimization methods, establishing multimodal prompt optimization as a crucial step to realizing the potential of MLLMs.
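
The abstract only sketches MPO at a high level, so the following Python snippet is a minimal, hedged illustration of one of its two ingredients: a Bayesian-style selection strategy in which earlier evaluations of a parent prompt serve as priors for its candidate children. The MultimodalPrompt class, the Beta-posterior/Thompson-sampling formulation, and the spawn_child and select_candidate helpers are illustrative assumptions, not the paper's actual implementation.

import random
from dataclasses import dataclass


@dataclass
class MultimodalPrompt:
    """One candidate prompt: a textual instruction paired with a non-textual part."""
    text: str
    non_text: object        # e.g., an image, video clip, or molecule representation
    # Beta posterior over the prompt's per-example success rate (assumed parameterization).
    alpha: float = 1.0
    beta: float = 1.0

    def sample_score(self) -> float:
        # Thompson sampling: draw a plausible success rate from the current posterior.
        return random.betavariate(self.alpha, self.beta)

    def update(self, successes: int, failures: int) -> None:
        # Fold new evaluation results into the posterior.
        self.alpha += successes
        self.beta += failures


def spawn_child(parent: MultimodalPrompt, new_text: str, new_non_text: object,
                prior_strength: float = 0.5) -> MultimodalPrompt:
    # Carry a fraction of the parent's evaluation evidence into the child as its prior,
    # so candidates derived from well-performing prompts start with an informed posterior.
    return MultimodalPrompt(
        text=new_text,
        non_text=new_non_text,
        alpha=1.0 + prior_strength * (parent.alpha - 1.0),
        beta=1.0 + prior_strength * (parent.beta - 1.0),
    )


def select_candidate(candidates: list) -> MultimodalPrompt:
    # Evaluate next the candidate whose sampled success rate is highest.
    return max(candidates, key=lambda c: c.sample_score())

Under this sketch, an outer loop would repeatedly call select_candidate, score the chosen prompt on a minibatch of tasks, and feed the resulting counts back through update; how MPO actually parameterizes the prior and performs the alignment-preserving edits to the textual and non-textual prompt components is detailed in the full paper.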
