Visual Instruction Tuning
Haotian Liu, Chunyuan Li, Qingyang Wu, Yong Jae Lee

Abstract
Instruction tuning large language models (LLMs) using machine-generated instruction-following data has improved zero-shot capabilities on new tasks, but the idea is less explored in the multimodal field. In this paper, we present the first attempt to use language-only GPT-4 to generate multimodal language-image instruction-following data. By instruction tuning on such generated data, we introduce LLaVA: Large Language and Vision Assistant, an end-to-end trained large multimodal model that connects a vision encoder and an LLM for general-purpose visual and language understanding. Our early experiments show that LLaVA demonstrates impressive multimodal chat abilities, sometimes exhibiting the behaviors of multimodal GPT-4 on unseen images/instructions, and yields an 85.1% relative score compared with GPT-4 on a synthetic multimodal instruction-following dataset. When fine-tuned on Science QA, the synergy of LLaVA and GPT-4 achieves a new state-of-the-art accuracy of 92.53%. We make the GPT-4-generated visual instruction tuning data, our model, and our code base publicly available.
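The data pipeline the abstract describes is text-only: images are converted into symbolic representations (captions and object bounding boxes) so that language-only GPT-4 can author instruction-following conversations about them. Below is a minimal sketch of that prompt construction, assuming the current OpenAI Python client; the system prompt and box values here are illustrative stand-ins, and the exact prompts and few-shot examples used in the paper are in its appendix and released data.

```python
from openai import OpenAI  # assumption: current OpenAI Python client

client = OpenAI()

# Symbolic image representation: a caption plus object bounding boxes,
# which lets a language-only model reason about the image as text.
caption = "A man is riding a bicycle down a city street."
boxes = [("person", [0.32, 0.18, 0.55, 0.87]),
         ("bicycle", [0.30, 0.45, 0.58, 0.95])]

context = caption + "\n" + "\n".join(f"{name}: {coords}" for name, coords in boxes)

messages = [
    {"role": "system", "content": (
        "You are an AI visual assistant. Based on the caption and object "
        "locations below, write a conversation between a person asking "
        "about the image and an assistant answering as if it can see it."
    )},
    {"role": "user", "content": context},
]

response = client.chat.completions.create(model="gpt-4", messages=messages)
print(response.choices[0].message.content)
```

On the modeling side, LLaVA connects a pretrained vision encoder to an LLM; in the first version the connector is a single linear projection that maps patch features into the LLM's token-embedding space. Here is a minimal PyTorch sketch of that idea, with illustrative dimensions and class names rather than the released implementation:

```python
import torch
import torch.nn as nn

class LLaVAStyleConnector(nn.Module):
    """Sketch: project frozen vision-encoder patch features into the LLM's
    token-embedding space so they can be consumed as a prefix.
    Dimensions are illustrative (CLIP ViT-L/14-like -> a 4096-d LLM)."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        # LLaVA v1 uses a single linear layer as the vision-language connector.
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim) from the vision encoder
        return self.proj(patch_features)  # (batch, num_patches, llm_dim)

# Illustrative usage with random stand-ins for real encoder/tokenizer outputs.
connector = LLaVAStyleConnector()
patches = torch.randn(1, 256, 1024)   # stand-in for vision patch features
text = torch.randn(1, 32, 4096)       # stand-in for embedded prompt tokens
prefix = torch.cat([connector(patches), text], dim=1)
print(prefix.shape)  # torch.Size([1, 288, 4096])
```

Because the projected image tokens are simply prepended to the embedded text prompt, the LLM itself needs no architectural changes; only the connector (and optionally the LLM) is trained on the generated instruction data.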
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| image-classification-on-coloninst-v1-seen | LLaVA-v1 (w/ LoRA, w/ extra data) | Accuracy: 89.61 |
| image-classification-on-coloninst-v1-seen | LLaVA-v1 (w/ LoRA, w/o extra data) | Accuracy: 87.86 |
| image-classification-on-coloninst-v1-unseen | LLaVA-v1 (w/ LoRA, w/ extra data) | Accuracy: 42.17 |
| image-classification-on-coloninst-v1-unseen | LLaVA-v1 (w/ LoRA, w/o extra data) | Accuracy: 72.08 |
| mmr-total-on-mrr-benchmark | LLaVA-NeXT-13B | Total Column Score: 335 |
| mmr-total-on-mrr-benchmark | LLaVA-NeXT-34B | Total Column Score: 412 |
| mmr-total-on-mrr-benchmark | LLaVA-1.5-13B | Total Column Score: 243 |
| referring-expression-generation-on-coloninst | LLaVA-v1 (w/ LoRA, w/o extra data) | Accuracy: 84.55 |
| referring-expression-generation-on-coloninst | LLaVA-v1 (w/ LoRA, w/ extra data) | Accuracy: 86.87 |
| referring-expression-generation-on-coloninst-1 | LLaVA-v1 (w/ LoRA, w/ extra data) | Accuracy: 46.85 |
| referring-expression-generation-on-coloninst-1 | LLaVA-v1 (w/ LoRA, w/o extra data) | Accuracy: 68.11 |
| spatial-reasoning-on-embspatial-bench | LLaVA-1.6 | Generation: 35.19 |
| video-question-answering-on-mvbench | LLaVA | Avg.: 36.0 |
| visual-question-answering-on-benchlmm | LLaVA-1.5-7B | GPT-3.5 score: 46.83 |
| visual-question-answering-on-benchlmm | LLaVA-1-13B | GPT-3.5 score: 43.50 |