
Abstract
The cost of vision-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models. This paper proposes BLIP-2, a generic and efficient pre-training strategy that bootstraps vision-language pre-training from off-the-shelf frozen pre-trained image encoders and frozen large language models. BLIP-2 bridges the modality gap with a lightweight Querying Transformer, which is pre-trained in two stages: the first stage bootstraps vision-language representation learning from a frozen image encoder, and the second stage bootstraps vision-to-language generative learning from a frozen language model. Despite having significantly fewer trainable parameters than existing methods, BLIP-2 achieves state-of-the-art performance on various vision-language tasks. For example, it outperforms Flamingo80B by 8.7% on zero-shot VQAv2 with 54x fewer trainable parameters. We also demonstrate the model's zero-shot image-to-text generation capabilities, which can follow natural language instructions.
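Because only the Q-Former is trained while both backbones stay frozen, the released checkpoints can be used directly for zero-shot captioning and instructed VQA. Below is a minimal sketch using the BLIP-2 integration in the huggingface/transformers repository listed under Code Repositories; the checkpoint name `Salesforce/blip2-opt-2.7b`, the example image URL, and the "Question: ... Answer:" prompt format are illustrative assumptions taken from the public release, not details stated on this page.

```python
import requests
import torch
from PIL import Image
from transformers import Blip2ForConditionalGeneration, Blip2Processor

# Load a BLIP-2 checkpoint (frozen ViT-g image encoder + frozen OPT-2.7B LLM,
# bridged by the pre-trained Q-Former) via the Hugging Face transformers integration.
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b", torch_dtype=dtype
).to(device)

# Any RGB image works; this COCO validation image is just a convenient example.
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Zero-shot captioning: with no text prompt, the frozen LLM generates a caption
# conditioned only on the Q-Former's query embeddings.
inputs = processor(images=image, return_tensors="pt").to(device, dtype)
ids = model.generate(**inputs, max_new_tokens=30)
print(processor.batch_decode(ids, skip_special_tokens=True)[0].strip())

# Instructed zero-shot VQA: prepend a natural-language prompt to steer generation.
prompt = "Question: how many cats are there? Answer:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(device, dtype)
ids = model.generate(**inputs, max_new_tokens=10)
print(processor.batch_decode(ids, skip_special_tokens=True)[0].strip())
```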
Code Repositories
| Repository | Framework | Notes |
|---|---|---|
| salesforce/lavis | PyTorch | Official; mentioned in GitHub |
| yukw777/videoblip | PyTorch | Mentioned in GitHub |
| rabiulcste/vqazero | PyTorch | Mentioned in GitHub |
| albertotestoni/ndq_visual_objects | PyTorch | Mentioned in GitHub |
| jiwanchung/vlis | PyTorch | Mentioned in GitHub |
| gregor-ge/mblip | PyTorch | Mentioned in GitHub |
| baaivision/eva | PyTorch | Mentioned in GitHub |
| thudm/visualglm-6b | PyTorch | Mentioned in GitHub |
| huggingface/transformers | PyTorch | Mentioned in GitHub |
| linzhiqiu/clip-flant5 | PyTorch | Mentioned in GitHub |
| junshutang/Make-It-3D | PyTorch | Mentioned in GitHub |
| unispac/visual-adversarial-examples-jailbreak-large-language-models | PyTorch | Mentioned in GitHub |
| kdr/videorag-mrr2024 | — | Mentioned in GitHub |
| alibaba/graphtranslator | PyTorch | Mentioned in GitHub |
| facebookresearch/multimodal | PyTorch | Mentioned in GitHub |
Benchmarks
| Benchmark | Method | Metrics |
|---|---|---|
| generative-visual-question-answering-on-pmc | BLIP-2 | BLEU-1: 7.6 |
| image-captioning-on-coco-captions | BLIP-2 ViT-G FlanT5 XL (zero-shot) | BLEU-4: 42.4 CIDEr: 144.5 |
| image-captioning-on-coco-captions | BLIP-2 ViT-G OPT 6.7B (zero-shot) | BLEU-4: 43.5 CIDEr: 145.2 |
| image-captioning-on-coco-captions | BLIP-2 ViT-G OPT 2.7B (zero-shot) | BLEU-4: 43.7 CIDEr: 145.8 |
| image-captioning-on-nocaps-val-in-domain | BLIP-2 ViT-G OPT 6.7B (zero-shot) | CIDEr: 123.7 Pretrain (#images): 1.1B SPICE: 15.8 |
| image-captioning-on-nocaps-val-in-domain | BLIP-2 ViT-G FlanT5 XL (zero-shot) | CIDEr: 123.7 Pretrain (#images): 1.1B SPICE: 16.3 |
| image-captioning-on-nocaps-val-in-domain | BLIP-2 ViT-G OPT 2.7B (zero-shot) | CIDEr: 123.0 Pretrain (#images): 1.1B SPICE: 15.8 |
| image-captioning-on-nocaps-val-near-domain | BLIP-2 ViT-G OPT 6.7B (zero-shot) | CIDEr: 119.2 Pretrain (#images): 1.1B SPICE: 15.3 |
| image-captioning-on-nocaps-val-near-domain | BLIP-2 ViT-G FlanT5 XL (zero-shot) | CIDEr: 120.2 Pretrain (#images): 1.1B SPICE: 15.9 |
| image-captioning-on-nocaps-val-near-domain | BLIP-2 ViT-G OPT 2.7B (zero-shot) | CIDEr: 117.8 Pretrain (#images): 1.1B SPICE: 15.4 |
| image-captioning-on-nocaps-val-out-domain | BLIP-2 ViT-G FlanT5 XL (zero-shot) | CIDEr: 124.8 Pretrain (#images): 1.1B SPICE: 15.1 |
| image-captioning-on-nocaps-val-out-domain | BLIP-2 ViT-G OPT 6.7B (zero-shot) | CIDEr: 124.4 Pretrain (#images): 1.1B SPICE: 14.8 |
| image-captioning-on-nocaps-val-out-domain | BLIP-2 ViT-G OPT 2.7B (zero-shot) | CIDEr: 123.4 Pretrain (#images): 1.1B SPICE: 15.1 |
| image-captioning-on-nocaps-val-overall | BLIP-2 ViT-G FlanT5 XL (zero-shot) | CIDEr: 121.6 Pretrain (#images): 1.1B SPICE: 15.8 |
| image-captioning-on-nocaps-val-overall | BLIP-2 ViT-G OPT 6.7B (zero-shot) | CIDEr: 121.0 Pretrain (#images): 1.1B SPICE: 15.3 |
| image-captioning-on-nocaps-val-overall | BLIP-2 ViT-G OPT 2.7B (zero-shot) | CIDEr: 119.7 Pretrain (#images): 1.1B SPICE: 15.4 |
| image-retrieval-on-coco | BLIP-2 ViT-G (fine-tuned) | Recall@1: 68.3 Recall@5: 87.7 Recall@10: 92.6 |
| image-retrieval-on-coco | BLIP-2 ViT-L (fine-tuned) | Recall@1: 66.3 Recall@5: 86.5 Recall@10: 91.8 |
| image-retrieval-on-flickr30k | BLIP-2 ViT-L (zero-shot, 1K test set) | Recall@1: 88.6 Recall@5: 97.6 Recall@10: 98.9 |
| image-retrieval-on-flickr30k | BLIP-2 ViT-G (zero-shot, 1K test set) | Recall@1: 89.7 Recall@5: 98.1 Recall@10: 98.9 |
| image-to-text-retrieval-on-coco | BLIP-2 (ViT-L, fine-tuned) | Recall@1: 83.5 Recall@5: 96.0 Recall@10: 98.0 |
| image-to-text-retrieval-on-coco | BLIP-2 (ViT-G, fine-tuned) | Recall@1: 85.4 Recall@5: 97.0 Recall@10: 98.5 |
| image-to-text-retrieval-on-flickr30k | BLIP-2 ViT-L (zero-shot, 1K test set) | Recall@1: 96.9 Recall@5: 100 Recall@10: 100 |
| image-to-text-retrieval-on-flickr30k | BLIP-2 ViT-G (zero-shot, 1K test set) | Recall@1: 97.6 Recall@5: 100 Recall@10: 100 |
| open-vocabulary-attribute-detection-on-ovad-1 | BLIP-2 (pretrained) | mean average precision: 25.5 |
| visual-instruction-following-on-llava-bench | BLIP-2 | avg score: 38.1 |
| visual-question-answering-on-gqa-test-dev | BLIP-2 ViT-G OPT 2.7B (zero-shot) | Accuracy: 34.6 |
| visual-question-answering-on-gqa-test-dev | BLIP-2 ViT-G FlanT5 XXL (zero-shot) | Accuracy: 44.7 |
| visual-question-answering-on-gqa-test-dev | BLIP-2 ViT-L FlanT5 XL (zero-shot) | Accuracy: 44.4 |
| visual-question-answering-on-gqa-test-dev | BLIP-2 ViT-G OPT 6.7B (zero-shot) | Accuracy: 36.4 |
| visual-question-answering-on-gqa-test-dev | BLIP-2 ViT-L OPT 2.7B (zero-shot) | Accuracy: 33.9 |
| visual-question-answering-on-gqa-test-dev | BLIP-2 ViT-G FlanT5 XL (zero-shot) | Accuracy: 44.2 |
| visual-question-answering-on-mm-vet | BLIP-2-12B | GPT-4 score: 22.4±0.2 Params: 12B |
| visual-question-answering-on-ok-vqa | BLIP-2 ViT-L FlanT5 XL (zero-shot) | Accuracy: 39.4 |
| visual-question-answering-on-ok-vqa | BLIP-2 ViT-G FlanT5 XXL (zero-shot) | Accuracy: 45.9 |
| visual-question-answering-on-ok-vqa | BLIP-2 ViT-G OPT 2.7B (zero-shot) | Accuracy: 31.7 |
| visual-question-answering-on-ok-vqa | BLIP-2 ViT-G FlanT5 XL (zero-shot) | Accuracy: 40.7 |
| visual-question-answering-on-ok-vqa | BLIP-2 ViT-L OPT 2.7B (zero-shot) | Accuracy: 30.2 |
| visual-question-answering-on-ok-vqa | BLIP-2 ViT-G OPT 6.7B (zero-shot) | Accuracy: 36.4 |
| visual-question-answering-on-vqa-v2-test-dev | BLIP-2 ViT-G OPT 2.7B (zero-shot) | Accuracy: 52.3 |
| visual-question-answering-on-vqa-v2-test-dev | BLIP-2 ViT-L FlanT5 XL (zero-shot) | Accuracy: 62.3 |
| visual-question-answering-on-vqa-v2-test-dev | BLIP-2 ViT-L OPT 2.7B (zero-shot) | Accuracy: 49.7 |
| visual-question-answering-on-vqa-v2-test-dev | BLIP-2 ViT-G OPT 6.7B (zero-shot) | Accuracy: 52.6 |
| visual-question-answering-on-vqa-v2-test-dev | BLIP-2 ViT-G FlanT5 XXL (zero-shot) | Accuracy: 65.0 |
| visual-question-answering-on-vqa-v2-test-dev | BLIP-2 ViT-G FlanT5 XL (zero-shot) | Accuracy: 63.0 |
| visual-question-answering-on-vqa-v2-test-dev-1 | BLIP-2 ViT-G OPT 6.7B (fine-tuned) | Accuracy: 82.30 |
| visual-question-answering-on-vqa-v2-test-dev-1 | BLIP-2 ViT-G OPT 2.7B (fine-tuned) | Accuracy: 81.74 |
| visual-question-answering-on-vqa-v2-test-dev-1 | BLIP-2 ViT-G FlanT5 XL (fine-tuned) | Accuracy: 81.66 |
| visual-question-answering-on-vqa-v2-val | BLIP-2 ViT-G FlanT5 XXL (zero-shot) | Accuracy: 65.2 |
| visual-question-answering-on-vqa-v2-val | BLIP-2 ViT-G OPT 6.7B (zero-shot) | Accuracy: 54.3 |
| visual-question-answering-on-vqa-v2-val | BLIP-2 ViT-G FlanT5 XL (zero-shot) | Accuracy: 63.1 |
| visual-question-answering-on-vqa-v2-val | BLIP-2 ViT-L OPT 2.7B (zero-shot) | Accuracy: 50.1 |
| visual-question-answering-on-vqa-v2-val | BLIP-2 ViT-L FlanT5 XL (zero-shot) | Accuracy: 62.6 |
| visual-question-answering-on-vqa-v2-val | BLIP-2 ViT-G OPT 2.7B (zero-shot) | Accuracy: 53.5 |
| visual-question-answering-on-vqa-v2-val-1 | BLIP-2 ViT-G FlanT5 XL (fine-tuned) | Accuracy: 81.55 |
| visual-question-answering-on-vqa-v2-val-1 | BLIP-2 ViT-G OPT 6.7B (fine-tuned) | Accuracy: 82.19 |
| visual-question-answering-on-vqa-v2-val-1 | BLIP-2 ViT-G OPT 2.7B (fine-tuned) | Accuracy: 81.59 |
| visual-question-answering-vqa-on-core-mm | BLIP-2-OPT2.7B | Abductive: 18.96 Analogical: 7.5 Deductive: 2.76 Overall score: 19.31 Params: 3B |
| visual-question-answering-vqa-on-infoseek | BLIP-2 | Accuracy: 14.6 |
| visual-question-answering-vqa-on-pmc-vqa | BLIP-2 | Accuracy: 24.3 |