
Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts

Wu Jialin ; Hu Xia ; Wang Yaqing ; Pang Bo ; Soricut Radu

Abstract

Large multi-modal models (LMMs) exhibit remarkable performance across numerous tasks. However, generalist LMMs often suffer from performance degradation when tuned over a large collection of tasks. Recent research suggests that Mixture of Experts (MoE) architectures are useful for instruction tuning, but for LMMs of parameter size around O(50-100B), the prohibitive cost of replicating and storing the expert models severely limits the number of experts we can use. We propose Omni-SMoLA, an architecture that uses the Soft MoE approach to (softly) mix many multimodal low-rank experts, and avoids introducing a significant number of new parameters compared to conventional MoE models. The core intuition here is that the large model provides a foundational backbone, while different lightweight experts residually learn specialized knowledge, either per-modality or multimodally. Extensive experiments demonstrate that the SMoLA approach helps improve the generalist performance across a broad range of generative vision-and-language tasks, achieving new SoTA generalist performance that often matches or outperforms single specialized LMM baselines, as well as new SoTA specialist performance.
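
The core idea in the abstract, a frozen backbone plus softly mixed low-rank experts that add a residual, can be illustrated with a small NumPy sketch. This is not the authors' implementation: the class name `SoftLowRankMixture`, the per-token softmax router, the zero initialization of the up-projection, and all dimensions are assumptions chosen for illustration, and the per-token gating is a simplification of the Soft MoE approach the paper refers to.

```python
import numpy as np


def softmax(z, axis=-1):
    """Numerically stable softmax."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


class SoftLowRankMixture:
    """Hypothetical layer: frozen dense backbone + soft mix of low-rank experts."""

    def __init__(self, d_in, d_out, num_experts, rank, seed=0):
        rng = np.random.default_rng(seed)
        # Backbone weight, standing in for a frozen pretrained projection.
        self.W = rng.normal(0.0, 0.02, (d_in, d_out))
        # Each expert e is a low-rank pair (A[e], B[e]) of shapes
        # (d_in, rank) and (rank, d_out).
        self.A = rng.normal(0.0, 0.02, (num_experts, d_in, rank))
        # Assumption: zero init so the residual starts at exactly zero.
        self.B = np.zeros((num_experts, rank, d_out))
        # Assumption: a linear router producing one logit per expert.
        self.router = rng.normal(0.0, 0.02, (d_in, num_experts))

    def gates(self, x):
        # Soft (dense) mixing weights: every expert gets a nonzero weight.
        return softmax(x @ self.router, axis=-1)        # (tokens, experts)

    def __call__(self, x):
        base = x @ self.W                                # (tokens, d_out)
        g = self.gates(x)                                # (tokens, experts)
        low = np.einsum("td,edr->ter", x, self.A)        # (tokens, experts, rank)
        out = np.einsum("ter,ero->teo", low, self.B)     # (tokens, experts, d_out)
        residual = np.einsum("te,teo->to", g, out)       # (tokens, d_out)
        return base + residual


def low_rank_extra_params(d_in, d_out, num_experts, rank):
    """Parameters added by the low-rank experts plus the router."""
    return num_experts * rank * (d_in + d_out) + d_in * num_experts


def full_expert_params(d_in, d_out, num_experts):
    """Parameters needed if each expert replicated the dense projection."""
    return num_experts * d_in * d_out
```

With a hypothetical projection of size 4096 x 4096, 16 experts, and rank 4, `low_rank_extra_params` gives about 0.59M added parameters, whereas `full_expert_params` gives about 268M, which is the kind of gap the abstract points to when it contrasts low-rank experts with replicating full expert models.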

Benchmarks

Benchmark	Methodology	Metrics
chart-question-answering-on-chartqa	SMoLA-PaLI-X Generalist Model	1:1 Accuracy: 73.8
chart-question-answering-on-chartqa	SMoLA-PaLI-X Specialist Model	1:1 Accuracy: 74.6
object-counting-on-tallyqa-complex	SMoLA-PaLI-X Specialist	Accuracy: 77.1
object-counting-on-tallyqa-complex	SMoLA-PaLI-X Generalist (0 shot)	Accuracy: 70.7
object-counting-on-tallyqa-simple	SMoLA-PaLI-X Generalist (0 shot)	Accuracy: 83.3
object-counting-on-tallyqa-simple	SMoLA-PaLI-X Specialist	Accuracy: 86.3
visual-question-answering-on-a-okvqa	SMoLA-PaLI-X Specialist Model	DA VQA Score: 70.55 MC Accuracy: 83.75
visual-question-answering-on-docvqa-test	SMoLA-PaLI-X Generalist	ANLS: 0.906
visual-question-answering-on-docvqa-test	SMoLA-PaLI-X Specialist	ANLS: 0.908
visual-question-answering-vqa-on	SMoLA-PaLI-X Specialist	ANLS: 66.2
visual-question-answering-vqa-on	SMoLA-PaLI-X Generalist	ANLS: 65.6
visual-question-answering-vqa-on-ai2d	SMoLA-PaLI-X Specialist Model	EM: 82.5
visual-question-answering-vqa-on-ai2d	SMoLA-PaLI-X Generalist Model	EM: 81.4
