8 months ago

Ziyu Liu Tao Chu Yuhang Zang Xilin Wei Xiaoyi Dong Pan Zhang Zijian Liang Yuanjun Xiong Yu Qiao Dahua Lin

Abstract

Generating natural and meaningful responses to communicate with multi-modal human inputs is a fundamental capability of Large Vision-Language Models (LVLMs). While current open-source LVLMs demonstrate promising performance in simplified scenarios such as single-turn, single-image input, they fall short in real-world conversation scenarios such as following instructions in a long context history with multiple turns and multiple images. Existing LVLM benchmarks primarily focus on single-choice questions or short-form responses, which do not adequately assess the capabilities of LVLMs in real-world human-AI interaction applications. Therefore, we introduce MMDU, a comprehensive benchmark, and MMDU-45k, a large-scale instruction tuning dataset, designed to evaluate and improve LVLMs' abilities in multi-turn and multi-image conversations. We employ a clustering algorithm to find relevant images and textual descriptions from the open-source Wikipedia, and human annotators construct the question-answer pairs with the assistance of the GPT-4o model. MMDU has a maximum of 18k image+text tokens, 20 images, and 27 turns, which is at least 5x longer than previous benchmarks and poses challenges to current LVLMs. Our in-depth analysis of 15 representative LVLMs using MMDU reveals that open-source LVLMs lag behind closed-source counterparts due to limited conversational instruction tuning data. We demonstrate that fine-tuning open-source LVLMs on MMDU-45k significantly addresses this gap, generating longer and more accurate conversations and improving scores on MMDU and existing benchmarks (MMStar: +1.1%, MathVista: +1.5%, ChartQA: +1.2%). Our contributions pave the way for bridging the gap between current LVLM models and real-world application demands. This project is available at https://github.com/Liuziyu77/MMDU.
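To make the multi-turn, multi-image setting concrete, here is a minimal sketch of how one such dialog could be represented and checked against the maxima quoted in the abstract (20 images, 27 turns). The class names, field names, and the `<img_N>` placeholder convention are hypothetical illustrations, not the actual MMDU data schema.

```python
from dataclasses import dataclass, field
from typing import List

# Upper bounds quoted in the abstract for a single MMDU dialog.
MAX_IMAGES = 20
MAX_TURNS = 27


@dataclass
class Turn:
    """One question-answer exchange (hypothetical layout, not the MMDU schema)."""
    question: str
    answer: str
    # Identifiers of the images this turn refers to.
    image_ids: List[str] = field(default_factory=list)


@dataclass
class Dialog:
    """A conversation made of several turns that may share images."""
    turns: List[Turn] = field(default_factory=list)

    def num_turns(self) -> int:
        return len(self.turns)

    def num_images(self) -> int:
        # Count each distinct image once, even if several turns refer to it.
        return len({img for turn in self.turns for img in turn.image_ids})

    def within_stated_limits(self) -> bool:
        return self.num_turns() <= MAX_TURNS and self.num_images() <= MAX_IMAGES


# Toy example: two turns referring to three distinct images in total.
dialog = Dialog(turns=[
    Turn("What do <img_0> and <img_1> have in common?", "Both show ...",
         ["img_0", "img_1"]),
    Turn("How does <img_2> differ from <img_0>?", "It ...",
         ["img_2", "img_0"]),
])
print(dialog.num_turns(), dialog.num_images(), dialog.within_stated_limits())
```

The later turn refers back to an image introduced earlier, which is the kind of long-context dependency the benchmark is described as targeting.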

MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs
