5 months ago

What matters when building vision-language models?

Hugo Laurençon Léo Tronchon Matthieu Cord Victor Sanh

Abstract

The growing interest in vision-language models (VLMs) has been driven by improvements in large language models and vision transformers. Despite the abundance of literature on this subject, we observe that critical decisions regarding the design of VLMs are often not justified. We argue that these unsupported decisions impede progress in the field by making it difficult to identify which choices improve model performance. To address this issue, we conduct extensive experiments around pre-trained models, architecture choice, data, and training methods. Our consolidation of findings includes the development of Idefics2, an efficient foundational VLM of 8 billion parameters. Idefics2 achieves state-of-the-art performance within its size category across various multimodal benchmarks, and is often on par with models four times its size. We release the model (base, instructed, and chat) along with the datasets created for its training.

Benchmarks

Benchmark: long-context-understanding-on-mmneedle
Methodology: IDEFICS2-8B
Metrics:
1 Image, 2*2 Stitching, Exact Accuracy: 18.9
1 Image, 4*4 Stitching, Exact Accuracy: 7.8
1 Image, 8*8 Stitching, Exact Accuracy: 0.9
10 Images, 1*1 Stitching, Exact Accuracy: 0
10 Images, 2*2 Stitching, Exact Accuracy: 0
10 Images, 4*4 Stitching, Exact Accuracy: 0
10 Images, 8*8 Stitching, Exact Accuracy: 0

Benchmark: mmr-total-on-mrr-benchmark
Methodology: Idefics-2-8B
Metrics:
Total Column Score: 256
