HyperAI

Visual Instruction Tuning

Haotian Liu; Chunyuan Li; Qingyang Wu; Yong Jae Lee

Abstract

Instruction tuning large language models (LLMs) on machine-generated instruction-following data has improved zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field. In this paper, we present the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data. By instruction tuning on such generated data, we introduce LLaVA: Large Language and Vision Assistant, an end-to-end trained large multimodal model that connects a vision encoder and an LLM for general-purpose visual and language understanding. Our early experiments show that LLaVA demonstrates impressive multimodal chat abilities, sometimes exhibiting the behaviors of multimodal GPT-4 on unseen images/instructions, and yields an 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset. When fine-tuned on Science QA, the synergy of LLaVA and GPT-4 achieves a new state-of-the-art accuracy of 92.53%. We make the GPT-4-generated visual instruction tuning data, our model, and our code base publicly available.
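The connection scheme the abstract describes — a vision encoder bridged to an LLM — can be sketched as a single learned projection that maps image features into the LLM's word-embedding space, so projected image tokens and text token embeddings form one input sequence. This is a minimal illustrative sketch: the dimensions, token counts, and random values below are assumptions for demonstration, not the paper's exact configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed shapes for illustration: vision features (e.g. a CLIP-style
# encoder) projected into an LLM's hidden/embedding dimension.
d_vision, d_llm = 1024, 4096   # vision feature dim -> LLM embedding dim
n_patches, n_text = 256, 16    # image patch tokens and text tokens

# Trainable linear projection connecting the two modalities.
W = rng.standard_normal((d_vision, d_llm)) * 0.02

image_feats = rng.standard_normal((n_patches, d_vision))  # from vision encoder
text_embeds = rng.standard_normal((n_text, d_llm))        # from LLM embed table

# Project image features into the LLM embedding space, then prepend
# them to the text embeddings to build one multimodal input sequence.
image_tokens = image_feats @ W                          # (n_patches, d_llm)
llm_input = np.concatenate([image_tokens, text_embeds], axis=0)

print(llm_input.shape)  # (272, 4096)
```

During training, such a projection can be optimized on the generated instruction-following data while the vision encoder stays frozen; the exact training recipe is described in the paper, not in this sketch.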

Code Repositories

llava-annonymous/llava (PyTorch)
ZhangYiqun018/StickerConv (PyTorch)
qiujihao19/artemis (PyTorch)
haotian-liu/LLaVA (Official; PyTorch)
sshh12/multi_token (PyTorch)
huggingface/transformers (PyTorch)
dinhvietcuong1996/icme25-inova (PyTorch)
sunsmarterjie/chatterbox (PyTorch)
skunkworksai/bakllava (PyTorch)
LLaVA-VL/LLaVA-NeXT (PyTorch)
camenduru/llava-colab
tabtoyou/kollava (PyTorch)

Benchmarks

Benchmark | Methodology | Metric
image-classification-on-coloninst-v1-seen | LLaVA-v1 (w/ LoRA, w/ extra data) | Accuracy: 89.61
image-classification-on-coloninst-v1-seen | LLaVA-v1 (w/ LoRA, w/o extra data) | Accuracy: 87.86
image-classification-on-coloninst-v1-unseen | LLaVA-v1 (w/ LoRA, w/ extra data) | Accuracy: 42.17
image-classification-on-coloninst-v1-unseen | LLaVA-v1 (w/ LoRA, w/o extra data) | Accuracy: 72.08
mmr-total-on-mrr-benchmark | LLaVA-NEXT-13B | Total Column Score: 335
mmr-total-on-mrr-benchmark | LLaVA-NEXT-34B | Total Column Score: 412
mmr-total-on-mrr-benchmark | LLaVA-1.5-13B | Total Column Score: 243
referring-expression-generation-on-coloninst | LLaVA-v1 (w/ LoRA, w/o extra data) | Accuracy: 84.55
referring-expression-generation-on-coloninst | LLaVA-v1 (w/ LoRA, w/ extra data) | Accuracy: 86.87
referring-expression-generation-on-coloninst-1 | LLaVA-v1 (w/ LoRA, w/ extra data) | Accuracy: 46.85
referring-expression-generation-on-coloninst-1 | LLaVA-v1 (w/ LoRA, w/o extra data) | Accuracy: 68.11
spatial-reasoning-on-embspatial-bench | LLaVA-1.6 | Generation: 35.19
video-question-answering-on-mvbench | LLaVA | Avg.: 36.0
visual-question-answering-on-benchlmm | LLaVA-1.5-7B | GPT-3.5 score: 46.83
visual-question-answering-on-benchlmm | LLaVA-1-13B | GPT-3.5 score: 43.50
