Haiyang Xu, Qinghao Ye, Ming Yan, Yaya Shi, Jiabo Ye, Yuanhong Xu, Chenliang Li, Bin Bi, Qi Qian, Wei Wang, Guohai Xu, Ji Zhang, Songfang Huang, Fei Huang, Jingren Zhou

Abstract
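Recent years have seen a marked convergence of language, vision, and multi-modal pre-training. This paper presents mPLUG-2, a new unified paradigm with a modular design that benefits from collaboration between modalities while mitigating the problem of modality entanglement. Unlike the prevailing approaches, which rely solely on sequence-to-sequence generation or on encoder-based instance discrimination, mPLUG-2 introduces a multi-module composition network: it shares common universal modules to enable modality collaboration, and it disentangles the modality-specific modules to address modality entanglement. The architecture is flexible, so different modules can be selected for different understanding and generation tasks across all modalities, including text, image, and video. Empirical results show that mPLUG-2 achieves state-of-the-art or competitive performance on more than 30 downstream tasks, spanning multi-modal understanding and generation tasks (image-text and video-text) as well as uni-modal understanding tasks (text-only, image-only, and video-only). Notably, on the challenging MSRVTT video question answering and video captioning tasks, mPLUG-2 sets new state-of-the-art results of 48.0 top-1 accuracy and 80.3 CIDEr with a far smaller model size and data scale than existing models. The model also demonstrates strong zero-shot transferability on vision-language and video-language tasks. Code and models will be released at https://github.com/alibaba/AliceMind.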
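
The idea of composing task-specific pipelines from shared and modality-specific modules can be illustrated with a toy sketch. This is not the mPLUG-2 implementation: the module names, the task names, and the arithmetic stand-ins for neural network blocks are all hypothetical, chosen only to show that one shared "universal" module is reused in every pipeline while the modality-specific modules stay separate.

```python
from typing import Callable, Dict, List

Features = List[float]
Module = Callable[[Features], Features]

# Toy stand-ins for network blocks (hypothetical names and behavior).
MODULES: Dict[str, Module] = {
    "text_encoder": lambda x: [v + 1.0 for v in x],    # text-specific
    "visual_encoder": lambda x: [v * 2.0 for v in x],  # image/video-specific
    "universal": lambda x: [v - 0.5 for v in x],       # shared by all modalities
    "decoder": lambda x: [-v for v in x],              # used only for generation
}

# Each task selects the modules it needs (hypothetical task list).
TASK_PIPELINES: Dict[str, List[str]] = {
    "text_understanding": ["text_encoder", "universal"],
    "image_understanding": ["visual_encoder", "universal"],
    "video_captioning": ["visual_encoder", "universal", "decoder"],
}


def compose(task: str) -> Module:
    """Build a callable that applies the task's modules in order."""
    stages = [MODULES[name] for name in TASK_PIPELINES[task]]

    def run(x: Features) -> Features:
        for stage in stages:
            x = stage(x)
        return x

    return run


if __name__ == "__main__":
    for task in TASK_PIPELINES:
        print(task, compose(task)([1.0, 2.0]))
```

Because every pipeline looks up the same `"universal"` entry, an update to that module would affect all tasks, whereas changing `"text_encoder"` leaves the visual pipelines untouched. That separation is the point the sketch is meant to convey.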
Code repositories
modelscope/modelscope
pytorch
X-PLUG/mPLUG-2
pytorch
Mentioned in GitHub
alibaba/AliceMind
Official
pytorch
x-plug/mplug-owl
pytorch
Mentioned in GitHub
Benchmarks
| Benchmark | Method | Metrics |
|---|---|---|
| action-classification-on-kinetics-400 | mPLUG-2 | Acc@1: 87.1, Acc@5: 97.7 |
| action-classification-on-kinetics-600 | mPLUG-2 | Top-1 Accuracy: 89.8, Top-5 Accuracy: 98.3 |
| action-classification-on-kinetics-700 | mPLUG-2 | Top-1 Accuracy: 80.4, Top-5 Accuracy: 94.9 |
| video-captioning-on-msr-vtt-1 | mPLUG-2 | BLEU-4: 57.8, CIDEr: 80.0, METEOR: 34.9, ROUGE-L: 70.1 |
| video-captioning-on-msvd-1 | mPLUG-2 | BLEU-4: 70.5, CIDEr: 165.8, METEOR: 48.4, ROUGE-L: 85.3 |
| video-question-answering-on-msrvtt-qa | mPLUG-2 | Accuracy: 48.0 |
| video-retrieval-on-didemo | mPLUG-2 | text-to-video R@1: 56.4, text-to-video R@5: 79.1, text-to-video R@10: 85.2 |
| video-retrieval-on-lsmdc | mPLUG-2 | text-to-video R@1: 34.4, text-to-video R@5: 55.2, text-to-video R@10: 65.1 |
| video-retrieval-on-msr-vtt-1ka | mPLUG-2 | text-to-video R@1: 53.1, text-to-video R@5: 77.6, text-to-video R@10: 84.7 |
| visual-grounding-on-refcoco-test-b | mPLUG-2 | Accuracy (%): 86.05 |
| visual-grounding-on-refcoco-testa | mPLUG-2 | Accuracy (%): 92.8 |
| visual-grounding-on-refcoco-val | mPLUG-2 | Accuracy (%): 90.33 |
| visual-question-answering-on-msrvtt-qa-1 | mPLUG-2 | Accuracy: 0.480 |
| visual-question-answering-on-msvd-qa-1 | mPLUG-2 | Accuracy: 0.581 |
| visual-question-answering-on-vqa-v2-test-dev-1 | mPLUG-2 | Accuracy: 81.11 |
| zero-shot-video-retrieval-on-didemo | mPLUG-2 | text-to-video R@1: 45.7, text-to-video R@5: 71.1, text-to-video R@10: 79.2 |
| zero-shot-video-retrieval-on-lsmdc | mPLUG-2 | text-to-video R@1: 24.1, text-to-video R@5: 43.8, text-to-video R@10: 52.0 |
| zero-shot-video-retrieval-on-msr-vtt | mPLUG-2 | text-to-video R@1: 47.1, text-to-video R@5: 69.7, text-to-video R@10: 79.0 |
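
For readers unfamiliar with the retrieval metrics in the table, text-to-video R@K is the percentage of text queries whose correct video appears among the top K ranked results. The sketch below shows one common way to compute it. It assumes a square similarity matrix in which query i matches video i, and it counts only strictly higher scores against the ground truth; the function name and the toy scores are illustrative, and the evaluation protocol of each benchmark may differ in detail.

```python
from typing import List


def recall_at_k(similarity: List[List[float]], k: int) -> float:
    """Text-to-video Recall@K in percent.

    similarity[i][j] is the score between text query i and video j.
    Assumption: the correct video for query i is video i.
    """
    hits = 0
    for i, row in enumerate(similarity):
        # Zero-based rank: number of videos scored strictly above the truth.
        rank = sum(1 for score in row if score > row[i])
        if rank < k:
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy scores for three queries and three videos (made up for illustration).
sim = [
    [0.9, 0.1, 0.3],  # query 0: correct video ranked first
    [0.8, 0.7, 0.2],  # query 1: correct video ranked second
    [0.1, 0.5, 0.4],  # query 2: correct video ranked second
]

if __name__ == "__main__":
    for k in (1, 2):
        print(f"R@{k}: {recall_at_k(sim, k):.1f}")
```

With these toy scores only one of the three queries ranks its video first, while all three place it within the top two, which is why R@K can only stay equal or grow as K increases, as it does in every retrieval row above.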