Weihan Wang; Qingsong Lv; Wenmeng Yu; Wenyi Hong; Ji Qi; Yan Wang; Junhui Ji; Zhuoyi Yang; Lei Zhao; Xixuan Song; Jiazheng Xu; Bin Xu; Juanzi Li; Yuxiao Dong; Ming Ding; Jie Tang

Abstract
We introduce CogVLM, a powerful open-source visual language foundation model. Unlike the popular shallow alignment methods, which map image features into the input space of the language model, CogVLM bridges the gap between the frozen pretrained language model and the image encoder with a trainable visual expert module in the attention and FFN layers. As a result, CogVLM enables deep fusion of vision and language features without sacrificing any performance on NLP tasks. CogVLM-17B achieves state-of-the-art performance on 10 classic cross-modal benchmarks, including NoCaps, Flickr30k captioning, RefCOCO, RefCOCO+, RefCOCOg, Visual7W, GQA, ScienceQA, VizWiz VQA, and TDIUC, and ranks second on VQAv2, OKVQA, TextVQA, and COCO captioning, surpassing or matching PaLI-X 55B. Code and model checkpoints are available at https://github.com/THUDM/CogVLM.
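The abstract's key architectural claim is that a trainable visual expert sits inside each attention and FFN layer, rather than image features merely being projected into the LM's input space. Below is a minimal PyTorch sketch of that token-routing idea; it is our own illustration under stated assumptions, not the released CogVLM code (the class `VisualExpertAttention`, the `img_mask` argument, and all parameter names are hypothetical). Image tokens use trainable expert QKV and output projections, text tokens keep the frozen language-model projections, and attention runs jointly over the mixed sequence; the FFN expert follows the same routing pattern.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class VisualExpertAttention(nn.Module):
    """Illustrative sketch, not the official CogVLM implementation.
    Text tokens use frozen pretrained-LM projections; image tokens use
    trainable visual-expert projections; attention is computed jointly
    over the full mixed sequence."""

    def __init__(self, hidden: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.head_dim = hidden // num_heads
        # Frozen pretrained language-model projections (text path).
        self.qkv_text = nn.Linear(hidden, 3 * hidden)
        self.out_text = nn.Linear(hidden, hidden)
        for p in list(self.qkv_text.parameters()) + list(self.out_text.parameters()):
            p.requires_grad = False
        # Trainable visual-expert projections (image path).
        self.qkv_img = nn.Linear(hidden, 3 * hidden)
        self.out_img = nn.Linear(hidden, hidden)

    def forward(self, x: torch.Tensor, img_mask: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, hidden); img_mask: (batch, seq) bool, True at image tokens.
        B, S, H = x.shape
        # Route each token through its expert's QKV projection. Computing both
        # branches and selecting with torch.where is wasteful but keeps the sketch short.
        qkv = torch.where(img_mask.unsqueeze(-1), self.qkv_img(x), self.qkv_text(x))
        q, k, v = qkv.chunk(3, dim=-1)

        def split_heads(t: torch.Tensor) -> torch.Tensor:
            return t.view(B, S, self.num_heads, self.head_dim).transpose(1, 2)

        # Joint causal attention over image and text tokens together.
        out = F.scaled_dot_product_attention(
            split_heads(q), split_heads(k), split_heads(v), is_causal=True)
        out = out.transpose(1, 2).reshape(B, S, H)
        # Route the output projection through the experts the same way.
        return torch.where(img_mask.unsqueeze(-1), self.out_img(out), self.out_text(out))


if __name__ == "__main__":
    layer = VisualExpertAttention(hidden=64, num_heads=4)
    x = torch.randn(2, 10, 64)
    img_mask = torch.zeros(2, 10, dtype=torch.bool)
    img_mask[:, :4] = True  # pretend the first 4 tokens are image tokens
    print(layer(x, img_mask).shape)  # torch.Size([2, 10, 64])
```

Because the text path is left untouched, a text-only sequence reproduces the frozen LM's computation exactly, which is how the abstract's claim of no NLP-performance loss is preserved.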
Code Repository
https://github.com/THUDM/CogVLM
Benchmarks
| Benchmark | Method | Metrics |
|---|---|---|
| fs-mevqa-on-sme | GLM-4V | #Learning Samples (N): 16; ACC: 34.23; BLEU-4: 14.45; CIDEr: 127.37; Detection: 0.89; METEOR: 17.53; ROUGE-L: 24.28; SPICE: 17.70 |
| long-context-understanding-on-mmneedle | CogVLM2-Llama-3 | 1 Image, 2*2 Stitching, Exact Accuracy: 7.3; 1 Image, 4*4 Stitching, Exact Accuracy: 0.9; 1 Image, 8*8 Stitching, Exact Accuracy: 0.1; 10 Images, 1*1 Stitching, Exact Accuracy: 0; 10 Images, 2*2 Stitching, Exact Accuracy: 0; 10 Images, 4*4 Stitching, Exact Accuracy: 0; 10 Images, 8*8 Stitching, Exact Accuracy: 0 |
| long-context-understanding-on-mmneedle | CogVLM-17B | 1 Image, 2*2 Stitching, Exact Accuracy: 0; 1 Image, 4*4 Stitching, Exact Accuracy: 0.1; 1 Image, 8*8 Stitching, Exact Accuracy: 0.3; 10 Images, 1*1 Stitching, Exact Accuracy: 0; 10 Images, 2*2 Stitching, Exact Accuracy: 0; 10 Images, 4*4 Stitching, Exact Accuracy: 0; 10 Images, 8*8 Stitching, Exact Accuracy: 0 |
| visual-question-answering-on-mm-vet | GLM4 Vision | GPT-4 score: 63.9 |
| visual-question-answering-on-mm-vet | CogVLM (Vicuna-7B) | GPT-4 score: 52.8; Params: 17B |
| visual-question-answering-on-mm-vet-v2 | CogVLM-Chat | GPT-4 score: 45.1±0.2 |
| visual-question-answering-vqa-on-core-mm | CogVLM-Chat | Abductive: 47.88; Analogical: 28.75; Deductive: 36.75; Overall score: 37.16; Params: 17B |