
Abstract
Large Multimodal Models (LMMs) have recently shown encouraging progress with visual instruction tuning. In this paper, we show that the fully-connected vision-language cross-modal connector in LLaVA is surprisingly powerful and data-efficient. With simple modifications to LLaVA, namely using CLIP-ViT-L-336px with an MLP projection and adding academic-task-oriented visual question answering (VQA) data with simple formatting, we establish stronger baselines that achieve state-of-the-art across 11 benchmarks. Our final 13B model uses merely 1.2M publicly available data samples and finishes full training in about one day on a single node with 8 A100 GPUs. We hope this makes state-of-the-art LMM research more accessible. Code and models will be publicly released.
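The "MLP projection" mentioned above is the vision-language connector that maps CLIP-ViT patch features into the LLM's embedding space, replacing LLaVA's original single linear layer. Below is a minimal PyTorch sketch of such a two-layer connector; the hidden sizes (1024 for CLIP-ViT-L-336px features, 5120 for a 13B LLM), the GELU activation, and the class name are illustrative assumptions, not the released implementation.

```python
import torch
import torch.nn as nn

class VisionLanguageProjector(nn.Module):
    """Sketch of an MLP vision-language connector: maps frozen CLIP-ViT
    patch features into the LLM embedding space (assumed dimensions)."""

    def __init__(self, vision_hidden_size: int = 1024, llm_hidden_size: int = 5120):
        super().__init__()
        # Assumed sizes: 1024 for CLIP-ViT-L/14-336px patch features,
        # 5120 for a 13B Vicuna-class LLM; adjust for other backbones.
        self.proj = nn.Sequential(
            nn.Linear(vision_hidden_size, llm_hidden_size),
            nn.GELU(),
            nn.Linear(llm_hidden_size, llm_hidden_size),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_hidden_size)
        # Returns visual tokens aligned to the LLM's embedding dimension.
        return self.proj(patch_features)


# Usage: project 576 patch tokens (a 24x24 grid at 336px input) into the LLM space.
features = torch.randn(1, 576, 1024)
visual_tokens = VisionLanguageProjector()(features)
print(visual_tokens.shape)  # torch.Size([1, 576, 5120])
```

The projected tokens are then concatenated with the text token embeddings and fed to the LLM; only the shapes and layer layout here are meant to convey the idea.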
Code Repositories
albertotestoni/ndq_visual_objects (PyTorch; mentioned in GitHub)
x2fd/lvis-instruct4v (mentioned in GitHub)
haotian-liu/LLaVA (PyTorch; mentioned in GitHub)
sshh12/multi_token (PyTorch; mentioned in GitHub)
huggingface/transformers (PyTorch; mentioned in GitHub)
dinhvietcuong1996/icme25-inova (PyTorch; mentioned in GitHub)
linzhiqiu/clip-flant5 (PyTorch; mentioned in GitHub)
skunkworksai/bakllava (PyTorch; mentioned in GitHub)
LLaVA-VL/LLaVA-NeXT (PyTorch; mentioned in GitHub)
Benchmarks
| Benchmark | Method | Metric |
|---|---|---|
| image-classification-on-coloninst-v1-seen | LLaVA-v1.5 (w/ LoRA, w/o extra data) | Accuracy: 92.97 |
| image-classification-on-coloninst-v1-seen | LLaVA-v1.5 (w/ LoRA, w/ extra data) | Accuracy: 93.33 |
| image-classification-on-coloninst-v1-unseen | LLaVA-v1.5 (w/ LoRA, w/o extra data) | Accuracy: 79.10 |
| image-classification-on-coloninst-v1-unseen | LLaVA-v1.5 (w/ LoRA, w/ extra data) | Accuracy: 80.89 |
| referring-expression-generation-on-coloninst | LLaVA-v1.5 (w/ LoRA, w/ extra data) | Accuracy: 99.32 |
| referring-expression-generation-on-coloninst | LLaVA-v1.5 (w/ LoRA, w/o extra data) | Accuracy: 98.58 |
| referring-expression-generation-on-coloninst-1 | LLaVA-v1.5 (w/ LoRA, w/o extra data) | Accuracy: 70.38 |
| referring-expression-generation-on-coloninst-1 | LLaVA-v1.5 (w/ LoRA, w/ extra data) | Accuracy: 72.88 |
| spatial-reasoning-on-6-dof-spatialbench | LLaVA-1.5 | Orientation-abs: 25.8 Orientation-rel: 28.3 Position-abs: 24.5 Position-rel: 30.9 Total: 27.2 |
| visual-instruction-following-on-llava-bench | LLaVA-v1.5-13B | avg score: 70.7 |
| visual-instruction-following-on-llava-bench | LLaVA-v1.5-7B | avg score: 63.4 |
| visual-question-answering-on-benchlmm | LLaVA-1.5-13B | GPT-3.5 score: 55.53 |
| visual-question-answering-on-mm-vet | LLaVA-1.5-7B | GPT-4 score: 31.1±0.2 Params: 7B |
| visual-question-answering-on-mm-vet | LLaVA-1.5-13B | GPT-4 score: 36.3±0.2 Params: 13B |
| visual-question-answering-on-mm-vet-v2 | LLaVA-v1.5-13B | GPT-4 score: 33.2±0.1 Params: 13B |
| visual-question-answering-on-mm-vet-v2 | LLaVA-v1.5-7B | GPT-4 score: 28.3±0.2 Params: 7B |
| visual-question-answering-on-vip-bench | LLaVA-1.5-13B (Visual Prompt) | GPT-4 score (bbox): 41.8 GPT-4 score (human): 42.9 |
| visual-question-answering-on-vip-bench | LLaVA-1.5-13B (Coordinates) | GPT-4 score (bbox): 47.1 |
| visual-question-answering-vqa-on-5 | LLaVA-1.5 | Overall Accuracy: 44.5 |
| visual-question-answering-vqa-on-core-mm | LLaVA-1.5 | Abductive: 47.91 Analogical: 24.31 Deductive: 30.94 Overall score: 32.62 Params: 13B |