Visual Representation Alignment for Multimodal Large Language Models

Abstract

Multimodal large language models (MLLMs) trained with visual instruction tuning have achieved strong performance across diverse tasks, yet they remain limited in vision-centric tasks such as object counting or spatial reasoning. We attribute this gap to the prevailing text-only supervision paradigm, which provides only indirect guidance for the visual pathway and often leads MLLMs to discard fine-grained visual details during training. In this paper, we present VIsual Representation ALignment (VIRAL), a simple yet effective regularization strategy that aligns the internal visual representations of MLLMs with those of pre-trained vision foundation models (VFMs). By explicitly enforcing this alignment, VIRAL enables the model not only to retain critical visual details from the input vision encoder but also to complement additional visual knowledge from VFMs, thereby enhancing its ability to reason over complex visual inputs. Our experiments demonstrate consistent improvements across all tasks on widely adopted multimodal benchmarks. Furthermore, we conduct comprehensive ablation studies to validate the key design choices underlying our framework. We believe this simple finding opens up an important direction for the effective integration of visual information in training MLLMs.
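To make the idea concrete, the sketch below shows one way such an alignment regularizer could be implemented in PyTorch: the MLLM's hidden states at visual-token positions are projected into the VFM's feature space and penalized with a cosine-similarity loss against frozen VFM features. The projection layer, loss form, tensor shapes, and the weighting coefficient `lambda_align` are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class VisualAlignmentLoss(nn.Module):
    """Hypothetical sketch of a VIRAL-style alignment regularizer.

    Projects the MLLM's hidden states at visual-token positions into the
    VFM feature space and penalizes dissimilarity with the frozen VFM
    features via (1 - cosine similarity). Details are assumptions, not the
    paper's exact recipe.
    """

    def __init__(self, llm_dim: int, vfm_dim: int):
        super().__init__()
        # Learned projection from the LLM hidden size to the VFM feature size.
        self.proj = nn.Linear(llm_dim, vfm_dim)

    def forward(self, llm_hidden: torch.Tensor, vfm_feats: torch.Tensor) -> torch.Tensor:
        # llm_hidden: (B, N_vis, llm_dim) hidden states at visual-token positions
        # vfm_feats:  (B, N_vis, vfm_dim) patch features from a frozen VFM (e.g. DINOv2)
        pred = F.normalize(self.proj(llm_hidden), dim=-1)
        target = F.normalize(vfm_feats.detach(), dim=-1)  # stop gradient through the VFM
        # 1 - cosine similarity, averaged over visual tokens and the batch
        return (1.0 - (pred * target).sum(dim=-1)).mean()


# Illustrative training objective: the usual next-token loss plus the
# alignment term, weighted by a coefficient lambda_align.
#   total_loss = lm_loss + lambda_align * align_loss
```

In this reading, the alignment term acts purely as an auxiliary regularizer added to the standard language-modeling objective, so it constrains the visual pathway during training without changing the model's inference-time behavior.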
