Zhe Chen; Jiannan Wu; Wenhai Wang; Weijie Su; Guo Chen; Sen Xing; Muyan Zhong; Qinglong Zhang; Xizhou Zhu; Lewei Lu; Bin Li; Ping Luo; Tong Lu; Yu Qiao; Jifeng Dai

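Abstract
The exponential growth of large language models (LLMs) has opened up many possibilities for multimodal AGI systems. However, progress in vision and vision-language foundation models, which are equally critical components of multimodal AGI, has not kept pace with LLMs. In this work, we design a large-scale vision-language foundation model (InternVL) that scales the vision foundation model up to 6 billion parameters and progressively aligns it with an LLM using web-scale image-text data from various sources. The model can be applied broadly to 32 generic vision-language benchmarks, covering visual perception tasks such as image-level or pixel-level recognition and vision-language tasks such as zero-shot image/video classification and zero-shot image/video-text retrieval, and it can be combined with LLMs to build multimodal dialogue systems. It has strong visual capabilities and can serve as a good alternative to ViT-22B. We hope this research contributes to the development of multimodal large models. Code and models are available at https://github.com/OpenGVLab/InternVL.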
Code repositories
opengvlab/internvl-mmdetseg (PyTorch; mentioned on GitHub)
opengvlab/internvl (official; PyTorch; mentioned on GitHub)
Benchmarks
| Benchmark | Method | Metrics |
|---|---|---|
| image-retrieval-on-flickr30k-cn | InternVL-G-FT | R@1: 85.9; R@5: 97.1; R@10: 98.7 |
| image-retrieval-on-flickr30k-cn | InternVL-C-FT | R@1: 85.2; R@5: 97.0; R@10: 98.5 |
| image-to-text-retrieval-on-flickr30k | InternVL-G-FT (finetuned, w/o ranking) | Recall@1: 97.9; Recall@5: 100; Recall@10: 100 |
| image-to-text-retrieval-on-flickr30k | InternVL-C-FT (finetuned, w/o ranking) | Recall@1: 97.2; Recall@5: 100; Recall@10: 100 |
| mmr-total-on-mrr-benchmark | InternVL2-8B | Total Column Score: 368 |
| mmr-total-on-mrr-benchmark | InternVL2-1B | Total Column Score: 237 |
| visual-question-answering-on-vqa-v2-test-dev | InternVL-C | Accuracy: 81.2 |
| zero-shot-cross-modal-retrieval-on-coco-2014 | InternVL-C | Image-to-text R@1: 70.6; R@5: 89.0; R@10: 93.5. Text-to-image R@1: 54.1; R@5: 77.3; R@10: 84.6 |
| zero-shot-cross-modal-retrieval-on-coco-2014 | InternVL-G | Image-to-text R@1: 74.9; R@5: 91.3; R@10: 95.2. Text-to-image R@1: 58.6; R@5: 81.3; R@10: 88.0 |
| zero-shot-cross-modal-retrieval-on-flickr30k | InternVL-G | Image-to-text R@1: 95.7; R@5: 99.7; R@10: 99.9. Text-to-image R@1: 85.0; R@5: 97.0; R@10: 98.6 |
| zero-shot-cross-modal-retrieval-on-flickr30k | InternVL-C | Image-to-text R@1: 94.7; R@5: 99.6; R@10: 99.9. Text-to-image R@1: 81.7; R@5: 96.0; R@10: 98.2 |
| zero-shot-transfer-image-classification-on-1 | InternVL-C | Accuracy (Private): 83.2 |
| zero-shot-transfer-image-classification-on-17 | InternVL-C | Top 1 Accuracy: 95.3 |
| zero-shot-transfer-image-classification-on-3 | InternVL-C | Accuracy (Private): 77.3 |
| zero-shot-transfer-image-classification-on-5 | InternVL-C | Accuracy (Private): 83.8 |
| zero-shot-transfer-image-classification-on-6 | InternVL-C | Accuracy (Private): 80.6 |
| zero-shot-transfer-image-classification-on-8 | InternVL-C | Accuracy (Private): 73.9 |
| zero-shot-transfer-image-classification-on-cn | InternVL-C | Accuracy (Private): 64.5 |
| zero-shot-video-retrieval-on-msr-vtt-full | InternVL-C | Text-to-video R@1: 44.7; R@5: 68.2; R@10: 78.4. Video-to-text R@1: 40.2; R@5: 63.1; R@10: 74.1 |
| zero-shot-video-retrieval-on-msr-vtt-full | InternVL-G | Text-to-video R@1: 46.3; R@5: 70.5; R@10: 79.6. Video-to-text R@1: 42.4; R@5: 65.9; R@10: 75.4 |
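
Most retrieval rows in the table report Recall@K (R@K): the percentage of queries for which a correct item appears among the K highest-scoring candidates. The following is a minimal sketch of that metric, not the evaluation code used for these benchmarks. It assumes one correct candidate per query (some datasets pair each image with several captions, which needs a small extension), and the function name `recall_at_k` and the toy scores are hypothetical.

```python
from typing import Sequence


def recall_at_k(similarity: Sequence[Sequence[float]],
                ground_truth: Sequence[int],
                k: int) -> float:
    """Percentage of queries whose correct candidate is in the top k.

    similarity[i][j] is the score between query i and candidate j.
    ground_truth[i] is the index of the single correct candidate
    for query i (an assumption of this sketch).
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    hits = 0
    for scores, target in zip(similarity, ground_truth):
        # Rank candidate indices from highest to lowest score.
        ranked = sorted(range(len(scores)),
                        key=lambda j: scores[j],
                        reverse=True)
        if target in ranked[:k]:
            hits += 1
    return 100.0 * hits / len(similarity)


# Toy data (made up): 3 text queries scored against 4 candidate images.
toy_similarity = [
    [0.9, 0.1, 0.3, 0.2],  # correct image 0 is ranked 1st
    [0.2, 0.4, 0.8, 0.5],  # correct image 1 is ranked 3rd
    [0.6, 0.1, 0.5, 0.7],  # correct image 2 is ranked 3rd
]
toy_ground_truth = [0, 1, 2]

recalls = {k: recall_at_k(toy_similarity, toy_ground_truth, k)
           for k in (1, 2, 3)}
```

Because the top-K set only grows with K, R@1 ≤ R@5 ≤ R@10 always holds for a given model and benchmark, which is a quick sanity check when reading rows like those above.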