mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models

Jiabo Ye, Haiyang Xu, Haowei Liu, Anwen Hu, Ming Yan, Qi Qian, Ji Zhang, Fei Huang, Jingren Zhou


Abstract

Multi-modal Large Language Models (MLLMs) have demonstrated remarkable capabilities in executing instructions for a variety of single-image tasks. Despite this progress, significant challenges remain in modeling long image sequences. In this work, we introduce the versatile multi-modal large language model, mPLUG-Owl3, which enhances the capability for long image-sequence understanding in scenarios that incorporate retrieved image-text knowledge, interleaved image-text, and lengthy videos. Specifically, we propose novel hyper attention blocks to efficiently integrate vision and language into a common language-guided semantic space, thereby facilitating the processing of extended multi-image scenarios. Extensive experimental results suggest that mPLUG-Owl3 achieves state-of-the-art performance among models with a similar size on single-image, multi-image, and video benchmarks. Moreover, we propose a challenging long visual sequence evaluation named Distractor Resistance to assess the ability of models to maintain focus amidst distractions. Finally, with the proposed architecture, mPLUG-Owl3 demonstrates outstanding performance on ultra-long visual sequence inputs. We hope that mPLUG-Owl3 can contribute to the development of more efficient and powerful multimodal large language models.
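The abstract describes the hyper attention blocks only at a high level: visual features are fused into the language stream so that text tokens can attend over image tokens in a shared semantic space. As a rough illustration of that general cross-attention idea (a stdlib-only sketch under our own assumptions, not the paper's actual hyper attention implementation; all function and variable names here are hypothetical, and real models use learned projection matrices rather than the identity projections used below):

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention.

    queries: text hidden states, list of n_text vectors of dim d
    keys/values: visual features, lists of n_img vectors of dim d
    Returns one attended (visually fused) vector per text token.
    """
    d = len(queries[0])
    out = []
    for q in queries:
        # Similarity of this text token to every visual token.
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d)
                  for k in keys]
        weights = softmax(scores)
        # Weighted sum of visual value vectors.
        attended = [sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(d)]
        out.append(attended)
    return out

# Toy example: 2 text tokens attending over 3 visual tokens, d = 2.
text = [[1.0, 0.0], [0.0, 1.0]]
vis_k = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
vis_v = [[10.0, 0.0], [0.0, 10.0], [5.0, 5.0]]
fused = cross_attention(text, vis_k, vis_v)
```

In this toy run, each text token ends up weighted toward the visual token whose key most resembles it; the paper's contribution is where and how such fusion is inserted into the language model, which this sketch does not attempt to reproduce.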

Code Repositories

x-plug/mplug-owl (official, PyTorch)

Benchmarks

Benchmark | Methodology | Metrics
video-question-answering-on-mvbench | mPLUG-Owl3 (7B) | Avg.: 59.5
video-question-answering-on-next-qa | mPLUG-Owl3 (8B) | Accuracy: 78.6
video-question-answering-on-tvbench | mPLUG-Owl3 | Average Accuracy: 42.2
visual-question-answering-on-mm-vet | mPLUG-Owl3 | GPT-4 score: 40.1
visual-question-answering-vqa-on-vlm2-bench | mPLUG-Owl3-7B | Average Score on VLM2-bench (9 subtasks): 37.85; GC-mat: 17.37; GC-trk: 18.26; OC-cnt: 62.97; OC-cpr: 49.17; OC-grp: 31.00; PC-VID: 13.50; PC-cnt: 58.86; PC-cpr: 63.50; PC-grp: 26.00
