mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models

Jiabo Ye, Haiyang Xu, Haowei Liu, Anwen Hu, Ming Yan, Qi Qian, Ji Zhang, Fei Huang, Jingren Zhou


Abstract

Multi-modal Large Language Models (MLLMs) have demonstrated remarkable capabilities in executing instructions for a variety of single-image tasks. Despite this progress, significant challenges remain in modeling long image sequences. In this work, we introduce the versatile multi-modal large language model, mPLUG-Owl3, which enhances the capability for long image-sequence understanding in scenarios that incorporate retrieved image-text knowledge, interleaved image-text, and lengthy videos. Specifically, we propose novel hyper attention blocks to efficiently integrate vision and language into a common language-guided semantic space, thereby facilitating the processing of extended multi-image scenarios. Extensive experimental results suggest that mPLUG-Owl3 achieves state-of-the-art performance among models with a similar size on single-image, multi-image, and video benchmarks. Moreover, we propose a challenging long visual sequence evaluation named Distractor Resistance to assess the ability of models to maintain focus amidst distractions. Finally, with the proposed architecture, mPLUG-Owl3 demonstrates outstanding performance on ultra-long visual sequence inputs. We hope that mPLUG-Owl3 can contribute to the development of more efficient and powerful multimodal large language models.
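The abstract describes the hyper attention blocks only at a high level: visual features are fused into the language stream so that text tokens can attend over image tokens in a shared semantic space. As a rough illustration of that general cross-attention idea (a stdlib-only sketch under our own assumptions, not the paper's actual hyper attention implementation; all function and variable names here are hypothetical, and real models use learned projection matrices rather than the identity projections used below):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention.

    queries: text hidden states, list of n_text vectors of dim d
    keys/values: visual features, lists of n_img vectors of dim d
    Returns one attended (visually fused) vector per text token.
    """
    d = len(queries[0])
    out = []
    for q in queries:
        # Similarity of this text token to every visual token.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        # Weighted sum of visual value vectors.
        attended = [sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(d)]
        out.append(attended)
    return out

# Toy example: 2 text tokens attending over 3 visual tokens, d = 2.
text = [[1.0, 0.0], [0.0, 1.0]]
vis_k = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
vis_v = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
fused = cross_attention(text, vis_k, vis_v)
```

In this toy run, each text token ends up weighted toward the visual token whose key most resembles it; the paper's contribution is where and how such fusion is inserted into the language model, which this sketch does not attempt to reproduce.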

Code Repositories

x-plug/mplug-owl (official, PyTorch)

Benchmarks

Benchmark | Methodology | Metrics
video-question-answering-on-mvbench | mPLUG-Owl3 (7B) | Avg.: 59.5
video-question-answering-on-next-qa | mPLUG-Owl3 (8B) | Accuracy: 78.6
video-question-answering-on-tvbench | mPLUG-Owl3 | Average Accuracy: 42.2
visual-question-answering-on-mm-vet | mPLUG-Owl3 | GPT-4 score: 40.1
visual-question-answering-vqa-on-vlm2-bench | mPLUG-Owl3-7B | Average Score on VLM2-bench (9 subtasks): 37.85; GC-mat: 17.37; GC-trk: 18.26; OC-cnt: 62.97; OC-cpr: 49.17; OC-grp: 31.00; PC-VID: 13.50; PC-cnt: 58.86; PC-cpr: 63.50; PC-grp: 26.00
