
Abstract
Large multimodal models (LMMs) exhibit remarkable performance across numerous tasks. However, generalist LMMs often suffer from performance degradation when fine-tuned over a large collection of tasks. Recent research suggests that Mixture-of-Experts (MoE) architectures are useful for instruction tuning, but for LMMs at the scale of roughly 50-100 billion parameters (O(50-100B)), the prohibitive cost of replicating and storing expert models severely limits the number of experts that can be used. We propose Omni-SMoLA, an architecture that uses a Soft MoE approach to "softly" mix many low-rank multimodal experts, without introducing a significant number of new parameters compared to conventional MoE models. The core intuition is that the large model provides a foundational backbone, while different lightweight experts residually learn specialized knowledge, either unimodal or multimodal. Extensive experiments demonstrate that the SMoLA approach helps improve generalist performance across a broad range of generative vision-and-language tasks, achieving new state-of-the-art (SoTA) generalist performance that often matches or outperforms single specialized LMM baselines, as well as new SoTA specialist performance.
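The abstract describes mixing many low-rank experts on top of a frozen backbone, with each expert contributing a residual update that a soft router blends per token. The following is a minimal NumPy sketch of that idea; the class name, shapes, and router design here are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over the last axis
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

class SMoLALayer:
    """Sketch: a soft mixture of low-rank experts over a frozen dense layer.

    The frozen backbone weight W provides the base transformation; each
    expert k contributes a low-rank residual B_k @ A_k, and a soft router
    mixes the expert residuals with per-token softmax weights.
    (Hypothetical names/shapes for illustration only.)
    """

    def __init__(self, d_in, d_out, n_experts=4, rank=8, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(0, 0.02, (d_out, d_in))            # frozen backbone weight
        self.A = rng.normal(0, 0.02, (n_experts, rank, d_in))  # per-expert down-projection
        self.B = np.zeros((n_experts, d_out, rank))            # per-expert up-projection
        # B starts at zero, so every expert residual starts at zero and the
        # layer initially reproduces the backbone exactly.
        self.router = rng.normal(0, 0.02, (n_experts, d_in))   # soft routing weights

    def __call__(self, x):
        # x: (tokens, d_in)
        base = x @ self.W.T                              # frozen backbone output
        gates = softmax(x @ self.router.T)               # (tokens, n_experts)
        # residuals[k, t, :] = B_k @ (A_k @ x_t), shape (n_experts, tokens, d_out)
        residuals = np.einsum('kdr,kri,ti->ktd', self.B, self.A, x)
        mix = np.einsum('tk,ktd->td', gates, residuals)  # soft mixture of residuals
        return base + mix
```

Because only the small `A`/`B` factors and the router are new, adding more experts grows the parameter count by `n_experts * rank * (d_in + d_out)` rather than by full expert copies, which is the property the abstract emphasizes.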
Benchmarks
| Benchmark | Method | Metric |
|---|---|---|
| chart-question-answering-on-chartqa | SMoLA-PaLI-X Generalist Model | 1:1 Accuracy: 73.8 |
| chart-question-answering-on-chartqa | SMoLA-PaLI-X Specialist Model | 1:1 Accuracy: 74.6 |
| object-counting-on-tallyqa-complex | SMoLA-PaLI-X Specialist | Accuracy: 77.1 |
| object-counting-on-tallyqa-complex | SMoLA-PaLI-X Generalist (0 shot) | Accuracy: 70.7 |
| object-counting-on-tallyqa-simple | SMoLA-PaLI-X Generalist (0 shot) | Accuracy: 83.3 |
| object-counting-on-tallyqa-simple | SMoLA-PaLI-X Specialist | Accuracy: 86.3 |
| visual-question-answering-on-a-okvqa | SMoLA-PaLI-X Specialist Model | DA VQA Score: 70.55; MC Accuracy: 83.75 |
| visual-question-answering-on-docvqa-test | SMoLA-PaLI-X Generalist | ANLS: 0.906 |
| visual-question-answering-on-docvqa-test | SMoLA-PaLI-X Specialist | ANLS: 0.908 |
| visual-question-answering-vqa-on | SMoLA-PaLI-X Specialist | ANLS: 66.2 |
| visual-question-answering-vqa-on | SMoLA-PaLI-X Generalist | ANLS: 65.6 |
| visual-question-answering-vqa-on-ai2d | SMoLA-PaLI-X Specialist Model | EM: 82.5 |
| visual-question-answering-vqa-on-ai2d | SMoLA-PaLI-X Generalist Model | EM: 81.4 |