MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs

Abstract
Generating natural and meaningful responses to communicate with multi-modal human inputs is a fundamental capability of Large Vision-Language Models (LVLMs). While current open-source LVLMs demonstrate promising performance in simplified scenarios such as single-turn single-image input, they fall short in real-world conversation scenarios such as following instructions in a long context history with multiple turns and multiple images. Existing LVLM benchmarks primarily focus on single-choice questions or short-form responses, which do not adequately assess the capabilities of LVLMs in real-world human-AI interaction applications. Therefore, we introduce MMDU, a comprehensive benchmark, and MMDU-45k, a large-scale instruction tuning dataset, designed to evaluate and improve LVLMs' abilities in multi-turn and multi-image conversations. We employ a clustering algorithm to find the relevant images and textual descriptions from the open-source Wikipedia and construct the question-answer pairs by human annotators with the assistance of the GPT-4o model. MMDU has a maximum of 18k image+text tokens, 20 images, and 27 turns, which is at least 5x longer than previous benchmarks and poses challenges to current LVLMs. Our in-depth analysis of 15 representative LVLMs using MMDU reveals that open-source LVLMs lag behind closed-source counterparts due to limited conversational instruction tuning data. We demonstrate that fine-tuning open-source LVLMs on MMDU-45k significantly addresses this gap, generating longer and more accurate conversations, and improving scores on MMDU and existing benchmarks (MMStar: +1.1%, MathVista: +1.5%, ChartQA: +1.2%). Our contributions pave the way for bridging the gap between current LVLM models and real-world application demands. This project is available at https://github.com/Liuziyu77/MMDU.
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| visual-question-answering-on-mm-vet | InternLM-XC2 + MMDU-45k | GPT-4 score: 38.8 |