Visual Representation Alignment for Multimodal Large Language Models

Abstract

Multimodal large language models (MLLMs) trained with visual instruction tuning have achieved strong performance across diverse tasks, yet they remain limited in vision-centric tasks such as object counting or spatial reasoning. We attribute this gap to the prevailing text-only supervision paradigm, which provides only indirect guidance for the visual pathway and often leads MLLMs to discard fine-grained visual details during training. In this paper, we present VIsual Representation ALignment (VIRAL), a simple yet effective regularization strategy that aligns the internal visual representations of MLLMs with those of pre-trained vision foundation models (VFMs). By explicitly enforcing this alignment, VIRAL enables the model not only to retain critical visual details from the input vision encoder but also to complement additional visual knowledge from VFMs, thereby enhancing its ability to reason over complex visual inputs. Our experiments demonstrate consistent improvements across all tasks on widely adopted multimodal benchmarks. Furthermore, we conduct comprehensive ablation studies to validate the key design choices underlying our framework. We believe this simple finding opens up an important direction for the effective integration of visual information in training MLLMs.
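To make the idea concrete, the sketch below shows one way such an alignment regularizer could be implemented in PyTorch: the MLLM's hidden states at visual-token positions are projected into the VFM's feature space and penalized with a cosine-similarity loss against frozen VFM features. The projection layer, loss form, tensor shapes, and the weighting coefficient `lambda_align` are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class VisualAlignmentLoss(nn.Module):
    """Hypothetical sketch of a VIRAL-style alignment regularizer.

    Projects the MLLM's hidden states at visual-token positions into the
    VFM feature space and penalizes dissimilarity with the frozen VFM
    features via (1 - cosine similarity). Details are assumptions, not the
    paper's exact recipe.
    """

    def __init__(self, llm_dim: int, vfm_dim: int):
        super().__init__()
        # Learned projection from the LLM hidden size to the VFM feature size.
        self.proj = nn.Linear(llm_dim, vfm_dim)

    def forward(self, llm_hidden: torch.Tensor, vfm_feats: torch.Tensor) -> torch.Tensor:
        # llm_hidden: (B, N_vis, llm_dim) hidden states at visual-token positions
        # vfm_feats:  (B, N_vis, vfm_dim) patch features from a frozen VFM (e.g. DINOv2)
        pred = F.normalize(self.proj(llm_hidden), dim=-1)
        target = F.normalize(vfm_feats.detach(), dim=-1)  # stop gradient through the VFM
        # 1 - cosine similarity, averaged over visual tokens and the batch
        return (1.0 - (pred * target).sum(dim=-1)).mean()


# Illustrative training objective: the usual next-token loss plus the
# alignment term, weighted by a coefficient lambda_align.
#   total_loss = lm_loss + lambda_align * align_loss
```

In this reading, the alignment term acts purely as an auxiliary regularizer added to the standard language-modeling objective, so it constrains the visual pathway during training without changing the model's inference-time behavior.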
