InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning
Wenliang Dai, Junnan Li, Dongxu Li, Anthony Meng Huat Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale Fung, Steven Hoi

Abstract
Large-scale pre-training and instruction tuning have been successful at creating general-purpose language models with broad competence. However, building general-purpose vision-language models is challenging due to the rich input distributions and task diversity resulting from the additional visual input. Although vision-language pre-training has been widely studied, vision-language instruction tuning remains under-explored. In this paper, we conduct a systematic and comprehensive study of vision-language instruction tuning based on the pretrained BLIP-2 models. We gather 26 publicly available datasets, covering a wide variety of tasks and capabilities, and transform them into an instruction-tuning format. Additionally, we introduce an instruction-aware Query Transformer, which extracts informative features tailored to the given instruction. Trained on 13 held-in datasets, InstructBLIP attains state-of-the-art zero-shot performance across all 13 held-out datasets, substantially outperforming BLIP-2 and the larger Flamingo models. Our models also achieve state-of-the-art performance when finetuned on individual downstream tasks (e.g., 90.7% accuracy on ScienceQA questions with image contexts). Furthermore, we qualitatively demonstrate the advantages of InstructBLIP over concurrent multimodal models. All InstructBLIP models are open-sourced at https://github.com/salesforce/LAVIS/tree/main/projects/instructblip.
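The instruction-aware Query Transformer (Q-Former) is the paper's main architectural addition: the instruction tokens are fed into the Q-Former alongside its learnable query tokens, so the visual features it extracts are conditioned on the task being asked. The PyTorch sketch below illustrates that conditioning step only; the single attention layer, dimensions, and class name are illustrative stand-ins for the full BERT-style Q-Former, not the released implementation.

```python
import torch
import torch.nn as nn

class InstructionAwareQFormerSketch(nn.Module):
    """Minimal sketch of InstructBLIP's instruction-aware Q-Former.

    Hypothetical simplification: one self-attention and one cross-attention
    layer stand in for the multi-layer Q-Former; dimensions are illustrative.
    """

    def __init__(self, dim=768, num_queries=32, num_heads=12, llm_dim=4096):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim))  # learnable query tokens
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj = nn.Linear(dim, llm_dim)  # map query outputs into the LLM embedding space
        self.num_queries = num_queries

    def forward(self, image_feats, instruction_embeds):
        # image_feats: (B, num_patches, dim) from a frozen image encoder
        # instruction_embeds: (B, num_text_tokens, dim) embedded instruction tokens
        B = image_feats.size(0)
        q = self.queries.expand(B, -1, -1)

        # Instruction-aware step: queries and instruction tokens share
        # self-attention, so the queries become conditioned on the instruction.
        x = torch.cat([q, instruction_embeds], dim=1)
        x, _ = self.self_attn(x, x, x)
        q = x[:, : self.num_queries]

        # Queries cross-attend to the frozen image features to pull out
        # instruction-relevant visual information.
        q, _ = self.cross_attn(q, image_feats, image_feats)

        # The projected query outputs are prepended to the frozen LLM's input.
        return self.proj(q)

# Toy shapes: batch of 2, 257 ViT patch embeddings, 16 instruction tokens.
qformer = InstructionAwareQFormerSketch()
llm_inputs = qformer(torch.randn(2, 257, 768), torch.randn(2, 16, 768))
print(llm_inputs.shape)  # torch.Size([2, 32, 4096])
```

The key design choice is that conditioning happens inside the feature extractor: the same image yields different query outputs under different instructions, rather than one fixed visual representation for all tasks.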
Code Repositories
https://github.com/salesforce/LAVIS/tree/main/projects/instructblip
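The released checkpoints can be loaded through the LAVIS library linked above. Below is a minimal inference sketch assuming LAVIS's load_model_and_preprocess entry point and the blip2_vicuna_instruct registry name; the image path and prompt are placeholders, and the exact identifiers for each variant should be checked against the repository.

```python
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = "cuda" if torch.cuda.is_available() else "cpu"

# Load an InstructBLIP checkpoint through LAVIS. The name/model_type strings
# are assumed from the LAVIS model registry; verify them in the repo above.
model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_vicuna_instruct", model_type="vicuna7b",
    is_eval=True, device=device,
)

raw_image = Image.open("example.jpg").convert("RGB")  # placeholder image path
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

# Zero-shot instruction following: the prompt is free-form natural language.
output = model.generate({"image": image, "prompt": "What is unusual about this image?"})
print(output)
```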
Benchmarks
| Benchmark | Model | Metrics |
|---|---|---|
| Long-Context Understanding on MMNeedle | InstructBLIP-Vicuna-13B | Exact Accuracy: 0 on all configurations (1 image at 2×2/4×4/8×8 stitching; 10 images at 1×1/2×2/4×4/8×8 stitching) |
| Long-Context Understanding on MMNeedle | InstructBLIP-Flan-T5-XXL | Exact Accuracy: 3.8 (1 image, 2×2 stitching), 6.2 (1 image, 4×4), 2.2 (1 image, 8×8); 0 on all 10-image configurations (1×1/2×2/4×4/8×8) |
| Video Question Answering on MVBench | InstructBLIP | Avg.: 32.5 |
| Visual Instruction Following on LLaVA-Bench | InstructBLIP-13B | Avg. score: 58.2 |
| Visual Instruction Following on LLaVA-Bench | InstructBLIP-7B | Avg. score: 60.9 |
| Visual Question Answering on BenchLMM | InstructBLIP-7B | GPT-3.5 score: 44.63 |
| Visual Question Answering on BenchLMM | InstructBLIP-13B | GPT-3.5 score: 45.03 |
| Visual Question Answering on ViP-Bench | InstructBLIP-13B (visual prompt) | GPT-4 score (bbox): 35.8; GPT-4 score (human): 35.2 |
| Visual Question Answering (VQA) on Core-MM | InstructBLIP | Overall score: 28.02; Abductive: 37.76; Analogical: 20.56; Deductive: 27.56; Params: 8B |