7 months ago

Gen Luo Wenhan Dou Wenhao Li Zhaokai Wang Xue Yang Changyao Tian Hao Li Weiyun Wang Wenhai Wang Xizhou Zhu

Abstract

This paper focuses on monolithic Multimodal Large Language Models (MLLMs), which integrate visual encoding and language decoding into a single model. Existing structures and pre-training strategies for monolithic MLLMs often suffer from unstable optimization and catastrophic forgetting. To address these challenges, our key idea is to embed a new visual parameter space into a pre-trained LLM, enabling stable learning of visual knowledge from noisy data via delta tuning. Based on this principle, we first introduce Mono-InternVL, an advanced monolithic MLLM that incorporates a set of visual experts through a multimodal mixture-of-experts architecture. In addition, we design an Endogenous Visual Pre-training (EViP) strategy for Mono-InternVL to maximize its visual capabilities via progressive learning. Mono-InternVL achieves competitive performance against existing MLLMs, but at a relatively high data cost. Therefore, we further present Mono-InternVL-1.5, a cheaper and stronger monolithic MLLM equipped with an improved EViP (EViP++). EViP++ adds visual attention experts to Mono-InternVL-1.5 and reorganizes the pre-training process to be more efficient. For inference, Mono-InternVL-1.5 includes a fused CUDA kernel that speeds up its MoE operations. With these designs, Mono-InternVL-1.5 significantly reduces training and inference costs while maintaining performance competitive with Mono-InternVL. To evaluate our approach, we conduct extensive experiments across 15 benchmarks. Results demonstrate that Mono-InternVL outperforms existing monolithic MLLMs on 12 out of 15 benchmarks, e.g., a +114-point improvement over Emu3 on OCRBench. Compared to its modular counterpart, i.e., InternVL-1.5, Mono-InternVL-1.5 achieves similar multimodal performance while reducing first-token latency by up to 69%. Code and models are released at https://github.com/OpenGVLab/Mono-InternVL.
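The idea of adding visual experts to a pre-trained LLM and training only those new parameters can be illustrated with a toy sketch. This is not the released implementation: the names `Expert`, `modality_moe`, and `trainable_experts` are hypothetical, the experts are simple affine maps rather than real feed-forward layers, and routing each token by a modality flag is an assumption made here for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence

Vector = List[float]


@dataclass
class Expert:
    """Toy stand-in for a feed-forward expert: y = scale * x + bias (elementwise)."""
    scale: float
    bias: float
    trainable: bool  # False for the frozen pre-trained LLM expert

    def __call__(self, x: Sequence[float]) -> Vector:
        return [self.scale * v + self.bias for v in x]


def modality_moe(
    tokens: Sequence[Sequence[float]],
    is_visual: Sequence[bool],
    text_expert: Expert,
    visual_expert: Expert,
) -> List[Vector]:
    """Send each token to the expert matching its modality (assumed static routing)."""
    if len(tokens) != len(is_visual):
        raise ValueError("tokens and is_visual must have the same length")
    return [
        (visual_expert if vis else text_expert)(tok)
        for tok, vis in zip(tokens, is_visual)
    ]


def trainable_experts(experts: Dict[str, Expert]) -> List[str]:
    """Names of experts that would be updated under delta tuning."""
    return [name for name, expert in experts.items() if expert.trainable]


# Hypothetical setup: the text expert keeps the pre-trained weights frozen,
# while the newly added visual expert is the only trainable part.
text_expert = Expert(scale=1.0, bias=0.0, trainable=False)
visual_expert = Expert(scale=2.0, bias=1.0, trainable=True)

outputs = modality_moe(
    tokens=[[1.0, 2.0], [3.0, 4.0]],
    is_visual=[False, True],
    text_expert=text_expert,
    visual_expert=visual_expert,
)
```

In this sketch the text token passes through the frozen expert unchanged, so existing language behaviour is untouched, while only the visual expert would receive updates during visual pre-training.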

Mono-InternVL-1.5: Towards Cheaper and Faster Monolithic Multimodal Large Language Models
