
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models

Abstract

The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models. This paper proposes BLIP-2, a generic and efficient pre-training strategy that bootstraps vision-language pre-training from off-the-shelf frozen pre-trained image encoders and frozen large language models. BLIP-2 bridges the modality gap with a lightweight Querying Transformer, which is pre-trained in two stages: the first stage bootstraps vision-language representation learning from the frozen image encoder, and the second stage bootstraps vision-to-language generative learning from the frozen language model. Despite having significantly fewer trainable parameters than existing methods, BLIP-2 achieves state-of-the-art performance on various vision-language tasks. For example, it outperforms Flamingo80B by 8.7% on zero-shot VQAv2 with 54x fewer trainable parameters. We also demonstrate the model's capability for zero-shot image-to-text generation that follows natural language instructions.
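
Because only the Querying Transformer is trained while the image encoder and language model stay frozen, a pre-trained BLIP-2 checkpoint can be used directly for zero-shot, instruction-style image-to-text generation. Below is a minimal sketch using the Hugging Face transformers integration; the checkpoint name, image URL, and prompt are illustrative assumptions, not prescribed by the paper.

```python
import requests
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"

# Checkpoint name is an assumption; any BLIP-2 checkpoint on the Hub should work similarly.
processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained("Salesforce/blip2-opt-2.7b").to(device)

# Example image (URL is illustrative).
url = "http://images.cocodataset.org/val2017/000000039769.jpg"
image = Image.open(requests.get(url, stream=True).raw)

# Zero-shot VQA: the frozen LLM continues the prompt, conditioned on the Q-Former's query tokens.
prompt = "Question: how many cats are there? Answer:"
inputs = processor(images=image, text=prompt, return_tensors="pt").to(device)
out = model.generate(**inputs, max_new_tokens=20)
print(processor.batch_decode(out, skip_special_tokens=True)[0].strip())
```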

Code Repositories

salesforce/lavis (Official, PyTorch, mentioned in GitHub; usage sketch below)
yukw777/videoblip (PyTorch, mentioned in GitHub)
rabiulcste/vqazero (PyTorch, mentioned in GitHub)
albertotestoni/ndq_visual_objects (PyTorch, mentioned in GitHub)
jiwanchung/vlis (PyTorch, mentioned in GitHub)
gregor-ge/mblip (PyTorch, mentioned in GitHub)
baaivision/eva (PyTorch, mentioned in GitHub)
thudm/visualglm-6b (PyTorch, mentioned in GitHub)
huggingface/transformers (PyTorch, mentioned in GitHub)
linzhiqiu/clip-flant5 (PyTorch, mentioned in GitHub)
junshutang/Make-It-3D (PyTorch, mentioned in GitHub)
kdr/videorag-mrr2024 (mentioned in GitHub)
alibaba/graphtranslator (PyTorch, mentioned in GitHub)
facebookresearch/multimodal (PyTorch, mentioned in GitHub)

Benchmarks

Benchmark | Method | Metrics
generative-visual-question-answering-on-pmc | BLIP-2 | BLEU-1: 7.6
image-captioning-on-coco-captions | BLIP-2 ViT-G FlanT5 XL (zero-shot) | BLEU-4: 42.4, CIDEr: 144.5
image-captioning-on-coco-captions | BLIP-2 ViT-G OPT 6.7B (zero-shot) | BLEU-4: 43.5, CIDEr: 145.2
image-captioning-on-coco-captions | BLIP-2 ViT-G OPT 2.7B (zero-shot) | BLEU-4: 43.7, CIDEr: 145.8
image-captioning-on-nocaps-val-in-domain | BLIP-2 ViT-G OPT 6.7B (zero-shot) | CIDEr: 123.7, SPICE: 15.8, Pre-train (#images): 1.1B
image-captioning-on-nocaps-val-in-domain | BLIP-2 ViT-G FlanT5 XL (zero-shot) | CIDEr: 123.7, SPICE: 16.3, Pre-train (#images): 1.1B
image-captioning-on-nocaps-val-in-domain | BLIP-2 ViT-G OPT 2.7B (zero-shot) | CIDEr: 123, SPICE: 15.8, Pre-train (#images): 1.1B
image-captioning-on-nocaps-val-near-domain | BLIP-2 ViT-G OPT 6.7B (zero-shot) | CIDEr: 119.2, SPICE: 15.3, Pre-train (#images): 1.1B
image-captioning-on-nocaps-val-near-domain | BLIP-2 ViT-G FlanT5 XL (zero-shot) | CIDEr: 120.2, SPICE: 15.9, Pre-train (#images): 1.1B
image-captioning-on-nocaps-val-near-domain | BLIP-2 ViT-G OPT 2.7B (zero-shot) | CIDEr: 117.8, SPICE: 15.4, Pre-train (#images): 1.1B
image-captioning-on-nocaps-val-out-domain | BLIP-2 ViT-G FlanT5 XL (zero-shot) | CIDEr: 124.8, SPICE: 15.1, Pre-train (#images): 1.1B
image-captioning-on-nocaps-val-out-domain | BLIP-2 ViT-G OPT 6.7B (zero-shot) | CIDEr: 124.4, SPICE: 14.8, Pre-train (#images): 1.1B
image-captioning-on-nocaps-val-out-domain | BLIP-2 ViT-G OPT 2.7B (zero-shot) | CIDEr: 123.4, SPICE: 15.1, Pre-train (#images): 1.1B
image-captioning-on-nocaps-val-overall | BLIP-2 ViT-G FlanT5 XL (zero-shot) | CIDEr: 121.6, SPICE: 15.8, Pre-train (#images): 1.1B
image-captioning-on-nocaps-val-overall | BLIP-2 ViT-G OPT 6.7B (zero-shot) | CIDEr: 121.0, SPICE: 15.3, Pre-train (#images): 1.1B
image-captioning-on-nocaps-val-overall | BLIP-2 ViT-G OPT 2.7B (zero-shot) | CIDEr: 119.7, SPICE: 15.4, Pre-train (#images): 1.1B
image-retrieval-on-coco | BLIP-2 ViT-G (fine-tuned) | Recall@1: 68.3, Recall@5: 87.7, Recall@10: 92.6
image-retrieval-on-coco | BLIP-2 ViT-L (fine-tuned) | Recall@1: 66.3, Recall@5: 86.5, Recall@10: 91.8
image-retrieval-on-flickr30k | BLIP-2 ViT-L (zero-shot, 1K test set) | Recall@1: 88.6, Recall@5: 97.6, Recall@10: 98.9
image-retrieval-on-flickr30k | BLIP-2 ViT-G (zero-shot, 1K test set) | Recall@1: 89.7, Recall@5: 98.1, Recall@10: 98.9
image-to-text-retrieval-on-coco | BLIP-2 ViT-L (fine-tuned) | Recall@1: 83.5, Recall@5: 96.0, Recall@10: 98.0
image-to-text-retrieval-on-coco | BLIP-2 ViT-G (fine-tuned) | Recall@1: 85.4, Recall@5: 97.0, Recall@10: 98.5
image-to-text-retrieval-on-flickr30k | BLIP-2 ViT-L (zero-shot, 1K test set) | Recall@1: 96.9, Recall@5: 100, Recall@10: 100
image-to-text-retrieval-on-flickr30k | BLIP-2 ViT-G (zero-shot, 1K test set) | Recall@1: 97.6, Recall@5: 100, Recall@10: 100
open-vocabulary-attribute-detection-on-ovad-1 | BLIP-2 (pretrained) | mean average precision: 25.5
visual-instruction-following-on-llava-bench | BLIP-2 | avg score: 38.1
visual-question-answering-on-gqa-test-dev | BLIP-2 ViT-G OPT 2.7B (zero-shot) | Accuracy: 34.6
visual-question-answering-on-gqa-test-dev | BLIP-2 ViT-G FlanT5 XXL (zero-shot) | Accuracy: 44.7
visual-question-answering-on-gqa-test-dev | BLIP-2 ViT-L FlanT5 XL (zero-shot) | Accuracy: 44.4
visual-question-answering-on-gqa-test-dev | BLIP-2 ViT-G OPT 6.7B (zero-shot) | Accuracy: 36.4
visual-question-answering-on-gqa-test-dev | BLIP-2 ViT-L OPT 2.7B (zero-shot) | Accuracy: 33.9
visual-question-answering-on-gqa-test-dev | BLIP-2 ViT-G FlanT5 XL (zero-shot) | Accuracy: 44.2
visual-question-answering-on-mm-vet | BLIP-2-12B | GPT-4 score: 22.4±0.2, Params: 12B
visual-question-answering-on-ok-vqa | BLIP-2 ViT-L FlanT5 XL (zero-shot) | Accuracy: 39.4
visual-question-answering-on-ok-vqa | BLIP-2 ViT-G FlanT5 XXL (zero-shot) | Accuracy: 45.9
visual-question-answering-on-ok-vqa | BLIP-2 ViT-G OPT 2.7B (zero-shot) | Accuracy: 31.7
visual-question-answering-on-ok-vqa | BLIP-2 ViT-G FlanT5 XL (zero-shot) | Accuracy: 40.7
visual-question-answering-on-ok-vqa | BLIP-2 ViT-L OPT 2.7B (zero-shot) | Accuracy: 30.2
visual-question-answering-on-ok-vqa | BLIP-2 ViT-G OPT 6.7B (zero-shot) | Accuracy: 36.4
visual-question-answering-on-vqa-v2-test-dev | BLIP-2 ViT-G OPT 2.7B (zero-shot) | Accuracy: 52.3
visual-question-answering-on-vqa-v2-test-dev | BLIP-2 ViT-L FlanT5 XL (zero-shot) | Accuracy: 62.3
visual-question-answering-on-vqa-v2-test-dev | BLIP-2 ViT-L OPT 2.7B (zero-shot) | Accuracy: 49.7
visual-question-answering-on-vqa-v2-test-dev | BLIP-2 ViT-G OPT 6.7B (zero-shot) | Accuracy: 52.6
visual-question-answering-on-vqa-v2-test-dev | BLIP-2 ViT-G FlanT5 XXL (zero-shot) | Accuracy: 65
visual-question-answering-on-vqa-v2-test-dev | BLIP-2 ViT-G FlanT5 XL (zero-shot) | Accuracy: 63
visual-question-answering-on-vqa-v2-test-dev-1 | BLIP-2 ViT-G OPT 6.7B (fine-tuned) | Accuracy: 82.30
visual-question-answering-on-vqa-v2-test-dev-1 | BLIP-2 ViT-G OPT 2.7B (fine-tuned) | Accuracy: 81.74
visual-question-answering-on-vqa-v2-test-dev-1 | BLIP-2 ViT-G FlanT5 XL (fine-tuned) | Accuracy: 81.66
visual-question-answering-on-vqa-v2-val | BLIP-2 ViT-G FlanT5 XXL (zero-shot) | Accuracy: 65.2
visual-question-answering-on-vqa-v2-val | BLIP-2 ViT-G OPT 6.7B (zero-shot) | Accuracy: 54.3
visual-question-answering-on-vqa-v2-val | BLIP-2 ViT-G FlanT5 XL (zero-shot) | Accuracy: 63.1
visual-question-answering-on-vqa-v2-val | BLIP-2 ViT-L OPT 2.7B (zero-shot) | Accuracy: 50.1
visual-question-answering-on-vqa-v2-val | BLIP-2 ViT-L FlanT5 XL (zero-shot) | Accuracy: 62.6
visual-question-answering-on-vqa-v2-val | BLIP-2 ViT-G OPT 2.7B (zero-shot) | Accuracy: 53.5
visual-question-answering-on-vqa-v2-val-1 | BLIP-2 ViT-G FlanT5 XL (fine-tuned) | Accuracy: 81.55
visual-question-answering-on-vqa-v2-val-1 | BLIP-2 ViT-G OPT 6.7B (fine-tuned) | Accuracy: 82.19
visual-question-answering-on-vqa-v2-val-1 | BLIP-2 ViT-G OPT 2.7B (fine-tuned) | Accuracy: 81.59
visual-question-answering-vqa-on-core-mm | BLIP-2-OPT2.7B | Overall score: 19.31, Abductive: 18.96, Analogical: 7.5, Deductive: 2.76, Params: 3B
visual-question-answering-vqa-on-infoseek | BLIP-2 | Accuracy: 14.6
visual-question-answering-vqa-on-pmc-vqa | BLIP-2 | Accuracy: 24.3
