mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models
Jiabo Ye, Haiyang Xu, Haowei Liu, Anwen Hu, Ming Yan, Qi Qian, Ji Zhang, Fei Huang, Jingren Zhou

Abstract
Multi-modal Large Language Models (MLLMs) have demonstrated remarkable capabilities in executing instructions for a variety of single-image tasks. Despite this progress, significant challenges remain in modeling long image sequences. In this work, we introduce the versatile multi-modal large language model, mPLUG-Owl3, which enhances the capability for long image-sequence understanding in scenarios that incorporate retrieved image-text knowledge, interleaved image-text, and lengthy videos. Specifically, we propose novel hyper attention blocks to efficiently integrate vision and language into a common language-guided semantic space, thereby facilitating the processing of extended multi-image scenarios. Extensive experimental results suggest that mPLUG-Owl3 achieves state-of-the-art performance among models with a similar size on single-image, multi-image, and video benchmarks. Moreover, we propose a challenging long visual sequence evaluation named Distractor Resistance to assess the ability of models to maintain focus amidst distractions. Finally, with the proposed architecture, mPLUG-Owl3 demonstrates outstanding performance on ultra-long visual sequence inputs. We hope that mPLUG-Owl3 can contribute to the development of more efficient and powerful multimodal large language models.
Benchmarks
| Benchmark | Model | Metrics |
|---|---|---|
| video-question-answering-on-mvbench | mPLUG-Owl3(7B) | Avg.: 59.5 |
| video-question-answering-on-next-qa | mPLUG-Owl3(8B) | Accuracy: 78.6 |
| video-question-answering-on-tvbench | mPLUG-Owl3 | Average Accuracy: 42.2 |
| visual-question-answering-on-mm-vet | mPLUG-Owl3 | GPT-4 score: 40.1 |
| visual-question-answering-vqa-on-vlm2-bench | mPLUG-Owl3-7B | Average Score on VLM2-bench (9 subtasks): 37.85; GC-mat: 17.37; GC-trk: 18.26; OC-cnt: 62.97; OC-cpr: 49.17; OC-grp: 31.00; PC-VID: 13.50; PC-cnt: 58.86; PC-cpr: 63.50; PC-grp: 26.00 |
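
As a quick sanity check on the VLM2-bench row, the reported average can be recomputed from the nine subtask scores listed in the table. The sketch below assumes the average is an unweighted mean over the subtasks, which the table does not state explicitly; the variable names are illustrative only.

```python
# Subtask scores for mPLUG-Owl3-7B on VLM2-bench, copied from the table above.
subtask_scores = {
    "GC-mat": 17.37,
    "GC-trk": 18.26,
    "OC-cnt": 62.97,
    "OC-cpr": 49.17,
    "OC-grp": 31.00,
    "PC-VID": 13.50,
    "PC-cnt": 58.86,
    "PC-cpr": 63.50,
    "PC-grp": 26.00,
}

# Assumption: the reported average is a plain (unweighted) mean of the 9 subtasks.
average = sum(subtask_scores.values()) / len(subtask_scores)

print(f"Recomputed average over {len(subtask_scores)} subtasks: {average:.2f}")
```

Under that assumption, the recomputed value rounds to the 37.85 reported in the table.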