InstructBLIP: Towards General-purpose Vision-Language Models with Instruction Tuning

Wenliang Dai; Junnan Li; Dongxu Li; Anthony Meng Huat Tiong; Junqi Zhao; Weisheng Wang; Boyang Li; Pascale Fung; Steven Hoi

Abstract

Large-scale pre-training and instruction tuning have been successful at creating general-purpose language models with broad competence. However, building general-purpose vision-language models is challenging due to the rich input distributions and task diversity resulting from the additional visual input. Although vision-language pretraining has been widely studied, vision-language instruction tuning remains under-explored. In this paper, we conduct a systematic and comprehensive study on vision-language instruction tuning based on the pretrained BLIP-2 models. We gather 26 publicly available datasets, covering a wide variety of tasks and capabilities, and transform them into instruction tuning format. Additionally, we introduce an instruction-aware Query Transformer, which extracts informative features tailored to the given instruction. Trained on 13 held-in datasets, InstructBLIP attains state-of-the-art zero-shot performance across all 13 held-out datasets, substantially outperforming BLIP-2 and larger Flamingo models. Our models also lead to state-of-the-art performance when finetuned on individual downstream tasks (e.g., 90.7% accuracy on ScienceQA questions with image contexts). Furthermore, we qualitatively demonstrate the advantages of InstructBLIP over concurrent multimodal models. All InstructBLIP models are open-sourced at https://github.com/salesforce/LAVIS/tree/main/projects/instructblip.
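
The instruction-aware Query Transformer (Q-Former) is the central architectural addition: the textual instruction is fed into the Q-Former alongside its learnable queries, so the visual features handed to the frozen LLM are conditioned on the task being asked. The PyTorch sketch below illustrates this mechanism only; the layer sizes, single attention layers, and module names are illustrative assumptions, not the paper's exact BLIP-2-based implementation.

```python
import torch
import torch.nn as nn

class InstructionAwareQFormerSketch(nn.Module):
    """Conceptual sketch: instruction tokens mix with the learnable queries,
    so cross-attention over frozen image features becomes instruction-aware."""

    def __init__(self, dim=768, num_queries=32, num_heads=12, llm_dim=4096):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(1, num_queries, dim) * 0.02)
        self.self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.proj_to_llm = nn.Linear(dim, llm_dim)  # projects to the frozen LLM's space

    def forward(self, image_feats, instruction_embeds):
        # image_feats:        (B, N_patches, dim) from a frozen vision encoder
        # instruction_embeds: (B, N_text, dim) embedded instruction tokens
        B = image_feats.size(0)
        queries = self.queries.expand(B, -1, -1)

        # Queries and instruction tokens interact through self-attention,
        # making the queries instruction-aware.
        tokens = torch.cat([queries, instruction_embeds], dim=1)
        mixed, _ = self.self_attn(tokens, tokens, tokens)
        queries = mixed[:, : queries.size(1)]

        # Only the queries cross-attend to the frozen image features.
        visual, _ = self.cross_attn(queries, image_feats, image_feats)
        return self.proj_to_llm(visual)  # (B, num_queries, llm_dim)
```

In the actual model, the Q-Former is initialized from pretrained BLIP-2 and is the component updated during instruction tuning, while the vision encoder and the LLM remain frozen.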

Code Repositories

salesforce/lavis (official, PyTorch, mentioned in GitHub)
tabtoyou/kollava (PyTorch, mentioned in GitHub)
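
For reference, below is a minimal zero-shot inference sketch against the salesforce/lavis repository listed above. The model name and checkpoint strings ("blip2_vicuna_instruct", "vicuna7b"), as well as the generate() call, follow LAVIS conventions but should be verified against the InstructBLIP project README; the image path is a placeholder.

```python
import torch
from PIL import Image
from lavis.models import load_model_and_preprocess

device = "cuda" if torch.cuda.is_available() else "cpu"

# Assumed model name and checkpoint variant; check the LAVIS/InstructBLIP README.
model, vis_processors, _ = load_model_and_preprocess(
    name="blip2_vicuna_instruct",
    model_type="vicuna7b",
    is_eval=True,
    device=device,
)

raw_image = Image.open("example.jpg").convert("RGB")  # placeholder image path
image = vis_processors["eval"](raw_image).unsqueeze(0).to(device)

# Zero-shot instruction-following inference.
output = model.generate({"image": image, "prompt": "What is unusual about this image?"})
print(output)
```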

Benchmarks

Benchmark | Methodology | Metrics

long-context-understanding-on-mmneedle | InstructBLIP-Vicuna-13B
  1 Image, 2*2 Stitching, Exact Accuracy: 0
  1 Image, 4*4 Stitching, Exact Accuracy: 0
  1 Image, 8*8 Stitching, Exact Accuracy: 0
  10 Images, 1*1 Stitching, Exact Accuracy: 0
  10 Images, 2*2 Stitching, Exact Accuracy: 0
  10 Images, 4*4 Stitching, Exact Accuracy: 0
  10 Images, 8*8 Stitching, Exact Accuracy: 0

long-context-understanding-on-mmneedle | InstructBLIP-Flan-T5-XXL
  1 Image, 2*2 Stitching, Exact Accuracy: 3.8
  1 Image, 4*4 Stitching, Exact Accuracy: 6.2
  1 Image, 8*8 Stitching, Exact Accuracy: 2.2
  10 Images, 1*1 Stitching, Exact Accuracy: 0
  10 Images, 2*2 Stitching, Exact Accuracy: 0
  10 Images, 4*4 Stitching, Exact Accuracy: 0
  10 Images, 8*8 Stitching, Exact Accuracy: 0

video-question-answering-on-mvbench | InstructBLIP
  Avg.: 32.5

visual-instruction-following-on-llava-bench | InstructBLIP-13B
  avg score: 58.2

visual-instruction-following-on-llava-bench | InstructBLIP-7B
  avg score: 60.9

visual-question-answering-on-benchlmm | InstructBLIP-7B
  GPT-3.5 score: 44.63

visual-question-answering-on-benchlmm | InstructBLIP-13B
  GPT-3.5 score: 45.03

visual-question-answering-on-vip-bench | InstructBLIP-13B (Visual Prompt)
  GPT-4 score (bbox): 35.8
  GPT-4 score (human): 35.2

visual-question-answering-vqa-on-core-mm | InstructBLIP
  Abductive: 37.76
  Analogical: 20.56
  Deductive: 27.56
  Overall score: 28.02
  Params: 8B
