

MMRL: Multi-Modal Representation Learning for Vision-Language Models

Yuncheng Guo, Xiaodong Gu

Abstract

Large-scale pre-trained Vision-Language Models (VLMs) have become essential for transfer learning across diverse tasks. However, adapting these models with limited few-shot data often leads to overfitting, diminishing their performance on new tasks. To tackle this issue, we propose a novel Multi-Modal Representation Learning (MMRL) framework that introduces a shared, learnable, and modality-agnostic representation space. MMRL projects the space tokens to text and image representation tokens, facilitating more effective multi-modal interactions. Unlike previous approaches that solely optimize class token features, MMRL integrates representation tokens at higher layers of the encoders, where dataset-specific features are more prominent, while preserving generalized knowledge in the lower layers. During training, both representation and class features are optimized, with a trainable projection layer applied to the representation tokens, whereas the class token projection layer remains frozen to retain pre-trained knowledge. Furthermore, a regularization term is introduced to align the class and text features with the zero-shot features from the frozen VLM, thereby safeguarding the model's generalization capacity. During inference, a decoupling strategy is employed: both representation and class features are used for base classes, while only the class features, which retain more generalized knowledge, are used for new tasks. Extensive experiments across 15 datasets demonstrate that MMRL outperforms state-of-the-art methods, achieving a balanced trade-off between task-specific adaptation and generalization. Code is available at https://github.com/yunncheng/MMRL.
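The core idea can be illustrated with a minimal PyTorch sketch. The snippet below is based only on the abstract, not on the official repository: the class name, token count, feature dimensions, and the cosine form of the regularization term are all assumptions made for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MMRLSketch(nn.Module):
    """Illustrative sketch of the MMRL idea (not the official implementation).

    A shared, learnable, modality-agnostic representation space is kept as a
    set of tokens; trainable projections map these tokens into image-side and
    text-side representation tokens, which would be inserted at the higher
    layers of the frozen VLM encoders (the encoders are omitted here).
    """

    def __init__(self, num_tokens=5, rep_dim=512, img_dim=768, txt_dim=512):
        super().__init__()
        # Shared representation space: modality-agnostic learnable tokens.
        self.rep_space = nn.Parameter(torch.randn(num_tokens, rep_dim) * 0.02)
        # Trainable projection layers from the shared space to each modality.
        self.to_image = nn.Linear(rep_dim, img_dim)
        self.to_text = nn.Linear(rep_dim, txt_dim)

    def forward(self):
        # Representation tokens for the image and text encoders.
        img_tokens = self.to_image(self.rep_space)   # (num_tokens, img_dim)
        txt_tokens = self.to_text(self.rep_space)    # (num_tokens, txt_dim)
        return img_tokens, txt_tokens


def alignment_regularizer(class_feat, text_feat, zs_class_feat, zs_text_feat):
    """One plausible form of the regularization term: pull the adapted class
    and text features toward the frozen VLM's zero-shot features via cosine
    similarity. The exact loss used by MMRL is not specified in the abstract."""
    return (1 - F.cosine_similarity(class_feat, zs_class_feat, dim=-1)).mean() + \
           (1 - F.cosine_similarity(text_feat, zs_text_feat, dim=-1)).mean()
```

At inference, following the decoupling strategy described above, base classes would use both the representation and class features, while new classes would rely on the class features alone; the encoders themselves remain those of the frozen VLM.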

Code Repositories

yunncheng/MMRL (official, PyTorch)

Benchmarks

Benchmark | Methodology | Metric
prompt-engineering-on-caltech-101 | MMRL | Harmonic mean: 96.68
prompt-engineering-on-dtd | MMRL | Harmonic mean: 73.82
prompt-engineering-on-eurosat | MMRL | Harmonic mean: 87.21
prompt-engineering-on-fgvc-aircraft | MMRL | Harmonic mean: 41.15
prompt-engineering-on-food-101 | MMRL | Harmonic mean: 91.03
prompt-engineering-on-imagenet | MMRL | Harmonic mean: 74.45
prompt-engineering-on-imagenet-a | MMRL | Top-1 accuracy %: 51.20
prompt-engineering-on-imagenet-r | MMRL | Top-1 accuracy %: 77.53
prompt-engineering-on-imagenet-s | MMRL | Top-1 accuracy %: 49.17
prompt-engineering-on-imagenet-v2 | MMRL | Top-1 accuracy %: 64.47
prompt-engineering-on-oxford-102-flower | MMRL | Harmonic mean: 86.78
prompt-engineering-on-oxford-iiit-pet-dataset | MMRL | Harmonic mean: 96.74
prompt-engineering-on-stanford-cars-1 | MMRL | Harmonic mean: 78.06
prompt-engineering-on-sun397 | MMRL | Harmonic mean: 81.20
prompt-engineering-on-ucf101 | MMRL | Harmonic mean: 83.89
