BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models
Junnan Li, Dongxu Li, Silvio Savarese, Steven Hoi

Abstract
The cost of vision-and-language pre-training has become increasingly prohibitive due to end-to-end training of large-scale models. This paper proposes BLIP-2, a generic and efficient pre-training strategy that bootstraps vision-language pre-training from off-the-shelf frozen pre-trained image encoders and frozen large language models. BLIP-2 bridges the modality gap with a lightweight Querying Transformer, which is pre-trained in two stages. The first stage bootstraps vision-language representation learning from a frozen image encoder. The second stage bootstraps vision-to-language generative learning from a frozen language model. BLIP-2 achieves state-of-the-art performance on various vision-language tasks, despite having significantly fewer trainable parameters than existing methods. For example, our model outperforms Flamingo80B by 8.7% on zero-shot VQAv2 with 54x fewer trainable parameters. We also demonstrate the model's emerging capabilities of zero-shot image-to-text generation that can follow natural language instructions.
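As a rough illustration of the bridging step described above, the sketch below shows how a small set of learnable query tokens can cross-attend to frozen image features and be projected into a frozen LLM's input embedding space. This is a heavily simplified conceptual sketch of the Querying Transformer (Q-Former) idea, not the authors' implementation: the actual Q-Former is a multi-layer BERT-style transformer, and all module names, layer counts, and dimensions here are illustrative.

```python
import torch
import torch.nn as nn

class QFormerBridge(nn.Module):
    """Simplified sketch of the Q-Former's role: learnable queries read out
    frozen image features via cross-attention and are projected into the
    frozen LLM's embedding space. Names and sizes are illustrative."""

    def __init__(self, num_queries=32, qformer_dim=768, vision_dim=1408, llm_dim=2560):
        super().__init__()
        # Learnable query tokens: the only image-conditioned input the LLM will see.
        self.queries = nn.Parameter(torch.randn(1, num_queries, qformer_dim) * 0.02)
        self.vision_proj = nn.Linear(vision_dim, qformer_dim)  # frozen ViT features -> Q-Former width
        self.cross_attn = nn.MultiheadAttention(qformer_dim, num_heads=12, batch_first=True)
        self.llm_proj = nn.Linear(qformer_dim, llm_dim)        # query outputs -> LLM embedding width

    def forward(self, image_features: torch.Tensor) -> torch.Tensor:
        # image_features: (batch, num_patches, vision_dim) from the frozen image encoder
        kv = self.vision_proj(image_features)
        q = self.queries.expand(image_features.size(0), -1, -1)
        attended, _ = self.cross_attn(q, kv, kv)
        # The projected query outputs are prepended to the text embeddings of the
        # frozen LLM, so the LLM conditions on the image through the queries alone.
        return self.llm_proj(attended)
```

Only the queries, the cross-attention, and the projections would be trained; the image encoder and the language model stay frozen, which is what keeps the trainable parameter count small relative to end-to-end models.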
Code Repositories
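The official implementation and pre-trained checkpoints are released in the salesforce/LAVIS repository (https://github.com/salesforce/LAVIS); BLIP-2 checkpoints are also available through the Hugging Face `transformers` integration. Below is a minimal zero-shot inference sketch using the latter; the model identifier, example image URL, and generation settings are illustrative choices, not prescriptions from the paper.

```python
# Minimal zero-shot captioning / VQA sketch via the Hugging Face transformers
# integration of BLIP-2 (model ID, image URL, and prompt are illustrative).
import requests
import torch
from PIL import Image
from transformers import Blip2Processor, Blip2ForConditionalGeneration

device = "cuda" if torch.cuda.is_available() else "cpu"

processor = Blip2Processor.from_pretrained("Salesforce/blip2-opt-2.7b")
model = Blip2ForConditionalGeneration.from_pretrained(
    "Salesforce/blip2-opt-2.7b",
    torch_dtype=torch.float16 if device == "cuda" else torch.float32,
).to(device)

url = "http://images.cocodataset.org/val2017/000000039769.jpg"  # example COCO image
image = Image.open(requests.get(url, stream=True).raw)

# Captioning: no text prompt. For VQA, pass e.g.
# text="Question: how many cats are in the picture? Answer:"
inputs = processor(images=image, return_tensors="pt").to(device, model.dtype)
generated_ids = model.generate(**inputs, max_new_tokens=30)
print(processor.batch_decode(generated_ids, skip_special_tokens=True)[0].strip())
```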
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| generative-visual-question-answering-on-pmc | BLIP-2 | BLEU-1: 7.6 |
| image-captioning-on-coco-captions | BLIP-2 ViT-G FlanT5 XL (zero-shot) | BLEU-4: 42.4 CIDEr: 144.5 |
| image-captioning-on-coco-captions | BLIP-2 ViT-G OPT 6.7B (zero-shot) | BLEU-4: 43.5 CIDEr: 145.2 |
| image-captioning-on-coco-captions | BLIP-2 ViT-G OPT 2.7B (zero-shot) | BLEU-4: 43.7 CIDEr: 145.8 |
| image-captioning-on-nocaps-val-in-domain | BLIP-2 ViT-G OPT 6.7B (zero-shot) | CIDEr: 123.7 Pre-train (#images): 1.1B SPICE: 15.8 |
| image-captioning-on-nocaps-val-in-domain | BLIP-2 ViT-G FlanT5 XL (zero-shot) | CIDEr: 123.7 Pre-train (#images): 1.1B SPICE: 16.3 |
| image-captioning-on-nocaps-val-in-domain | BLIP-2 ViT-G OPT 2.7B (zero-shot) | CIDEr: 123.0 Pre-train (#images): 1.1B SPICE: 15.8 |
| image-captioning-on-nocaps-val-near-domain | BLIP-2 ViT-G OPT 6.7B (zero-shot) | CIDEr: 119.2 Pre-train (#images): 1.1B SPICE: 15.3 |
| image-captioning-on-nocaps-val-near-domain | BLIP-2 ViT-G FlanT5 XL (zero-shot) | CIDEr: 120.2 Pre-train (#images): 1.1B SPICE: 15.9 |
| image-captioning-on-nocaps-val-near-domain | BLIP-2 ViT-G OPT 2.7B (zero-shot) | CIDEr: 117.8 Pre-train (#images): 1.1B SPICE: 15.4 |
| image-captioning-on-nocaps-val-out-domain | BLIP-2 ViT-G FlanT5 XL (zero-shot) | CIDEr: 124.8 Pre-train (#images): 1.1B SPICE: 15.1 |
| image-captioning-on-nocaps-val-out-domain | BLIP-2 ViT-G OPT 6.7B (zero-shot) | CIDEr: 124.4 Pre-train (#images): 1.1B SPICE: 14.8 |
| image-captioning-on-nocaps-val-out-domain | BLIP-2 ViT-G OPT 2.7B (zero-shot) | CIDEr: 123.4 Pre-train (#images): 1.1B SPICE: 15.1 |
| image-captioning-on-nocaps-val-overall | BLIP-2 ViT-G FlanT5 XL (zero-shot) | CIDEr: 121.6 Pre-train (#images): 1.1B SPICE: 15.8 |
| image-captioning-on-nocaps-val-overall | BLIP-2 ViT-G OPT 6.7B (zero-shot) | CIDEr: 121.0 Pre-train (#images): 1.1B SPICE: 15.3 |
| image-captioning-on-nocaps-val-overall | BLIP-2 ViT-G OPT 2.7B (zero-shot) | CIDEr: 119.7 Pre-train (#images): 1.1B SPICE: 15.4 |
| image-retrieval-on-coco | BLIP-2 ViT-G (fine-tuned) | Recall@1: 68.3 Recall@5: 87.7 Recall@10: 92.6 |
| image-retrieval-on-coco | BLIP-2 ViT-L (fine-tuned) | Recall@1: 66.3 Recall@5: 86.5 Recall@10: 91.8 |
| image-retrieval-on-flickr30k | BLIP-2 ViT-L (zero-shot, 1K test set) | Recall@1: 88.6 Recall@5: 97.6 Recall@10: 98.9 |
| image-retrieval-on-flickr30k | BLIP-2 ViT-G (zero-shot, 1K test set) | Recall@1: 89.7 Recall@5: 98.1 Recall@10: 98.9 |
| image-to-text-retrieval-on-coco | BLIP-2 ViT-L (fine-tuned) | Recall@1: 83.5 Recall@5: 96.0 Recall@10: 98.0 |
| image-to-text-retrieval-on-coco | BLIP-2 ViT-G (fine-tuned) | Recall@1: 85.4 Recall@5: 97.0 Recall@10: 98.5 |
| image-to-text-retrieval-on-flickr30k | BLIP-2 ViT-L (zero-shot, 1K test set) | Recall@1: 96.9 Recall@5: 100 Recall@10: 100 |
| image-to-text-retrieval-on-flickr30k | BLIP-2 ViT-G (zero-shot, 1K test set) | Recall@1: 97.6 Recall@5: 100 Recall@10: 100 |
| open-vocabulary-attribute-detection-on-ovad-1 | BLIP-2 (pretrained) | mean average precision: 25.5 |
| visual-instruction-following-on-llava-bench | BLIP-2 | avg score: 38.1 |
| visual-question-answering-on-gqa-test-dev | BLIP-2 ViT-G OPT 2.7B (zero-shot) | Accuracy: 34.6 |
| visual-question-answering-on-gqa-test-dev | BLIP-2 ViT-G FlanT5 XXL (zero-shot) | Accuracy: 44.7 |
| visual-question-answering-on-gqa-test-dev | BLIP-2 ViT-L FlanT5 XL (zero-shot) | Accuracy: 44.4 |
| visual-question-answering-on-gqa-test-dev | BLIP-2 ViT-G OPT 6.7B (zero-shot) | Accuracy: 36.4 |
| visual-question-answering-on-gqa-test-dev | BLIP-2 ViT-L OPT 2.7B (zero-shot) | Accuracy: 33.9 |
| visual-question-answering-on-gqa-test-dev | BLIP-2 ViT-G FlanT5 XL (zero-shot) | Accuracy: 44.2 |
| visual-question-answering-on-mm-vet | BLIP-2-12B | GPT-4 score: 22.4±0.2 Params: 12B |
| visual-question-answering-on-ok-vqa | BLIP-2 ViT-L FlanT5 XL (zero-shot) | Accuracy: 39.4 |
| visual-question-answering-on-ok-vqa | BLIP-2 ViT-G FlanT5 XXL (zero-shot) | Accuracy: 45.9 |
| visual-question-answering-on-ok-vqa | BLIP-2 ViT-G OPT 2.7B (zero-shot) | Accuracy: 31.7 |
| visual-question-answering-on-ok-vqa | BLIP-2 ViT-G FlanT5 XL (zero-shot) | Accuracy: 40.7 |
| visual-question-answering-on-ok-vqa | BLIP-2 ViT-L OPT 2.7B (zero-shot) | Accuracy: 30.2 |
| visual-question-answering-on-ok-vqa | BLIP-2 ViT-G OPT 6.7B (zero-shot) | Accuracy: 36.4 |
| visual-question-answering-on-vqa-v2-test-dev | BLIP-2 ViT-G OPT 2.7B (zero-shot) | Accuracy: 52.3 |
| visual-question-answering-on-vqa-v2-test-dev | BLIP-2 ViT-L FlanT5 XL (zero-shot) | Accuracy: 62.3 |
| visual-question-answering-on-vqa-v2-test-dev | BLIP-2 ViT-L OPT 2.7B (zero-shot) | Accuracy: 49.7 |
| visual-question-answering-on-vqa-v2-test-dev | BLIP-2 ViT-G OPT 6.7B (zero-shot) | Accuracy: 52.6 |
| visual-question-answering-on-vqa-v2-test-dev | BLIP-2 ViT-G FlanT5 XXL (zero-shot) | Accuracy: 65.0 |
| visual-question-answering-on-vqa-v2-test-dev | BLIP-2 ViT-G FlanT5 XL (zero-shot) | Accuracy: 63.0 |
| visual-question-answering-on-vqa-v2-test-dev-1 | BLIP-2 ViT-G OPT 6.7B (fine-tuned) | Accuracy: 82.30 |
| visual-question-answering-on-vqa-v2-test-dev-1 | BLIP-2 ViT-G OPT 2.7B (fine-tuned) | Accuracy: 81.74 |
| visual-question-answering-on-vqa-v2-test-dev-1 | BLIP-2 ViT-G FlanT5 XL (fine-tuned) | Accuracy: 81.66 |
| visual-question-answering-on-vqa-v2-val | BLIP-2 ViT-G FlanT5 XXL (zero-shot) | Accuracy: 65.2 |
| visual-question-answering-on-vqa-v2-val | BLIP-2 ViT-G OPT 6.7B (zero-shot) | Accuracy: 54.3 |
| visual-question-answering-on-vqa-v2-val | BLIP-2 ViT-G FlanT5 XL (zero-shot) | Accuracy: 63.1 |
| visual-question-answering-on-vqa-v2-val | BLIP-2 ViT-L OPT 2.7B (zero-shot) | Accuracy: 50.1 |
| visual-question-answering-on-vqa-v2-val | BLIP-2 ViT-L FlanT5 XL (zero-shot) | Accuracy: 62.6 |
| visual-question-answering-on-vqa-v2-val | BLIP-2 ViT-G OPT 2.7B (zero-shot) | Accuracy: 53.5 |
| visual-question-answering-on-vqa-v2-val-1 | BLIP-2 ViT-G FlanT5 XL (fine-tuned) | Accuracy: 81.55 |
| visual-question-answering-on-vqa-v2-val-1 | BLIP-2 ViT-G OPT 6.7B (fine-tuned) | Accuracy: 82.19 |
| visual-question-answering-on-vqa-v2-val-1 | BLIP-2 ViT-G OPT 2.7B (fine-tuned) | Accuracy: 81.59 |
| visual-question-answering-vqa-on-core-mm | BLIP-2-OPT2.7B | Abductive: 18.96 Analogical: 7.5 Deductive: 2.76 Overall score: 19.31 Params: 3B |
| visual-question-answering-vqa-on-infoseek | BLIP-2 | Accuracy: 14.6 |
| visual-question-answering-vqa-on-pmc-vqa | BLIP-2 | Accuracy: 24.3 |