Jinze Bai ; Shuai Bai ; Shusheng Yang ; Shijie Wang ; Sinan Tan ; Peng Wang ; Junyang Lin ; Chang Zhou ; Jingren Zhou

Abstract
In this work, we introduce the Qwen-VL series, a set of large-scale vision-language models (LVLMs) designed to perceive and understand both text and images. Starting from Qwen-LM as the foundation, we endow it with visual capabilities through a carefully designed (i) visual receptor, (ii) input-output interface, (iii) three-stage training pipeline, and (iv) multilingual multimodal cleaned corpus. Beyond conventional image captioning and question answering, we implement the grounding and text-reading abilities of the Qwen-VL models by aligning image-caption-box tuples. The resulting models, Qwen-VL and Qwen-VL-Chat, set new records among generalist models of similar scale on a broad range of vision-centric benchmarks (e.g., image captioning, question answering, visual grounding) and under different settings (e.g., zero-shot, few-shot). Moreover, on real-world dialogue benchmarks, the instruction-tuned Qwen-VL-Chat also demonstrates superiority over existing vision-language chatbots. Code, demo, and models are available at https://github.com/QwenLM/Qwen-VL.
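The grounding ability described above means the model can emit referring expressions paired with bounding boxes in its text output. As a minimal sketch (assuming the `<ref>…</ref><box>(x1,y1),(x2,y2)</box>` span format and 0-1000 normalized coordinates used in the Qwen-VL repository; the helper name and rescaling are illustrative, not part of the release):

```python
import re

# Matches spans like "<ref>the dog</ref><box>(100,200),(500,800)</box>"
BOX_RE = re.compile(
    r"<ref>(.*?)</ref><box>\((\d+),(\d+)\),\((\d+),(\d+)\)</box>"
)

def parse_grounding(text, width, height):
    """Return a list of (label, (x1, y1, x2, y2)) in pixel coordinates,
    rescaling from the assumed 0-1000 normalized grid to the image size."""
    results = []
    for label, x1, y1, x2, y2 in BOX_RE.findall(text):
        box = (
            int(x1) * width // 1000,
            int(y1) * height // 1000,
            int(x2) * width // 1000,
            int(y2) * height // 1000,
        )
        results.append((label.strip(), box))
    return results

demo = "<ref>the dog</ref><box>(100,200),(500,800)</box>"
print(parse_grounding(demo, width=640, height=480))
```

Parsing of this kind is only needed when post-processing raw model text; the official repo also exposes a helper to draw the predicted boxes on the input image.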
Code Repositories
brandon3964/multimodal-task-vector
pytorch
Mentioned in GitHub
qwenlm/qwen-vl
Official
pytorch
Mentioned in GitHub
Benchmarks
| Benchmark | Method | Metrics |
|---|---|---|
| chart-question-answering-on-chartqa | Qwen-VL | 1:1 Accuracy: 65.7 |
| chart-question-answering-on-chartqa | Qwen-VL-Chat | 1:1 Accuracy: 66.3 |
| fs-mevqa-on-sme | Qwen-VL-Max | #Learning Samples (N): 16; ACC: 40.33; BLEU-4: 24.30; CIDEr: 201.47; Detection: 1.05; METEOR: 23.40; ROUGE-L: 34.52; SPICE: 26.13 |
| mmr-total-on-mrr-benchmark | Qwen-VL-Max | Total Column Score: 366 |
| mmr-total-on-mrr-benchmark | Qwen-VL-Plus | Total Column Score: 310 |
| natural-language-visual-grounding-on | Qwen-VL | Accuracy (%): 5.2 |
| spatial-reasoning-on-embspatial-bench | Qwen-VL-Max | Generation: 49.11 |
| visual-question-answering-on-docvqa-test | Qwen-VL | ANLS: 0.651 |
| visual-question-answering-on-docvqa-test | Qwen-VL-Plus | ANLS: 0.9024 |
| visual-question-answering-on-docvqa-test | Qwen-VL-Chat | ANLS: 0.626 |
| visual-question-answering-on-mm-vet | Qwen-VL-Max | GPT-4 score: 66.6±0.5 |
| visual-question-answering-on-mm-vet | Qwen-VL-Plus | GPT-4 score: 61.1±0.2 |
| visual-question-answering-on-mm-vet-v2 | Qwen-VL-Max | GPT-4 score: 55.8±0.2 |
| visual-question-answering-on-vip-bench | Qwen-VL-Chat (Coordinates) | GPT-4 score (bbox): 45.3 |
| visual-question-answering-on-vip-bench | Qwen-VL-Chat (Visual Prompt) | GPT-4 score (bbox): 39.2; GPT-4 score (human): 41.7 |
| visual-question-answering-vqa-on-core-mm | Qwen-VL-Chat | Abductive: 44.39; Analogical: 30.42; Deductive: 37.55; Overall score: 37.39; Params: 16B |