Omni-SMoLA: Boosting Generalist Multimodal Models with Soft Mixture of Low-rank Experts
Jialin Wu, Xia Hu, Yaqing Wang, Bo Pang, Radu Soricut

Abstract
Large multi-modal models (LMMs) exhibit remarkable performance across numerous tasks. However, generalist LMMs often suffer from performance degradation when tuned over a large collection of tasks. Recent research suggests that Mixture of Experts (MoE) architectures are useful for instruction tuning, but for LMMs of parameter size around O(50-100B), the prohibitive cost of replicating and storing the expert models severely limits the number of experts we can use. We propose Omni-SMoLA, an architecture that uses the Soft MoE approach to (softly) mix many multimodal low-rank experts, and avoids introducing a significant number of new parameters compared to conventional MoE models. The core intuition here is that the large model provides a foundational backbone, while different lightweight experts residually learn specialized knowledge, either per-modality or multimodally. Extensive experiments demonstrate that the SMoLA approach helps improve the generalist performance across a broad range of generative vision-and-language tasks, achieving new SoTA generalist performance that often matches or outperforms single specialized LMM baselines, as well as new SoTA specialist performance.
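The core intuition in the abstract (a frozen backbone plus softly mixed low-rank residual experts) can be illustrated with a small sketch. This is a simplified, per-token illustration and not the paper's implementation: the function names (`smola_layer`, `low_rank_expert_params`), the softmax router, the zero initialization of the second low-rank factor, and all dimensions are assumptions made here for the example, and the exact routing scheme used by Omni-SMoLA may differ.

```python
import math
import random

def matvec(x, W):
    """Row vector x (length d_in) times matrix W (d_in x d_out, nested lists)."""
    return [sum(xi * row[j] for xi, row in zip(x, W)) for j in range(len(W[0]))]

def softmax(z):
    """Numerically stable softmax over a list of floats."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def rand_matrix(rows, cols, rng, scale=0.1):
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]

def smola_layer(x, W, experts, router):
    """Assumed form: y = x W + sum_i g_i(x) * (x A_i) B_i, with g = softmax(x R).

    W is the (frozen) backbone weight, each expert is a low-rank pair (A_i, B_i),
    and R is a hypothetical router matrix producing one logit per expert.
    """
    gates = softmax(matvec(x, router))        # soft weights: every expert contributes
    y = matvec(x, W)                          # backbone output
    for g, (A, B) in zip(gates, experts):
        delta = matvec(matvec(x, A), B)       # low-rank residual update
        y = [yj + g * dj for yj, dj in zip(y, delta)]
    return y, gates

def full_expert_params(d_in, d_out):
    """Parameters added per expert if the whole weight matrix is replicated."""
    return d_in * d_out

def low_rank_expert_params(d_in, d_out, rank):
    """Parameters added per expert with a rank-`rank` factorization A (d_in x r), B (r x d_out)."""
    return rank * (d_in + d_out)

# Toy dimensions chosen only for illustration.
rng = random.Random(0)
d_in, d_out, rank, n_experts = 8, 8, 2, 4
W = rand_matrix(d_in, d_out, rng)
router = rand_matrix(d_in, n_experts, rng)
# B starts at zero (a common low-rank adapter convention, assumed here),
# so the layer initially reproduces the backbone output exactly.
experts = [
    (rand_matrix(d_in, rank, rng), [[0.0] * d_out for _ in range(rank)])
    for _ in range(n_experts)
]

x = [rng.gauss(0.0, 1.0) for _ in range(d_in)]
y, gates = smola_layer(x, W, experts, router)
```

Under these assumptions, a replicated expert for a 4096 x 4096 projection adds `full_expert_params(4096, 4096)` = 16,777,216 parameters, while a rank-4 expert adds `low_rank_expert_params(4096, 4096, 4)` = 32,768, which is why many low-rank experts can be mixed at a small fraction of the cost of conventional MoE replication.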
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| chart-question-answering-on-chartqa | SMoLA-PaLI-X Generalist Model | 1:1 Accuracy: 73.8 |
| chart-question-answering-on-chartqa | SMoLA-PaLI-X Specialist Model | 1:1 Accuracy: 74.6 |
| object-counting-on-tallyqa-complex | SMoLA-PaLI-X Specialist | Accuracy: 77.1 |
| object-counting-on-tallyqa-complex | SMoLA-PaLI-X Generalist (0 shot) | Accuracy: 70.7 |
| object-counting-on-tallyqa-simple | SMoLA-PaLI-X Generalist (0 shot) | Accuracy: 83.3 |
| object-counting-on-tallyqa-simple | SMoLA-PaLI-X Specialist | Accuracy: 86.3 |
| visual-question-answering-on-a-okvqa | SMoLA-PaLI-X Specialist Model | DA VQA Score: 70.55; MC Accuracy: 83.75 |
| visual-question-answering-on-docvqa-test | SMoLA-PaLI-X Generalist | ANLS: 0.906 |
| visual-question-answering-on-docvqa-test | SMoLA-PaLI-X Specialist | ANLS: 0.908 |
| visual-question-answering-vqa-on | SMoLA-PaLI-X Specialist | ANLS: 66.2 |
| visual-question-answering-vqa-on | SMoLA-PaLI-X Generalist | ANLS: 65.6 |
| visual-question-answering-vqa-on-ai2d | SMoLA-PaLI-X Specialist Model | EM: 82.5 |
| visual-question-answering-vqa-on-ai2d | SMoLA-PaLI-X Generalist Model | EM: 81.4 |