Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, Chunyuan Li

Abstract
We present LLaVA-OneVision, a family of open large multimodal models (LMMs) developed by consolidating our insights into data, models, and visual representations in the LLaVA-NeXT blog series. Our experimental results demonstrate that LLaVA-OneVision is the first single model that can simultaneously push the performance boundaries of open LMMs in three important computer vision scenarios: single-image, multi-image, and video. Importantly, the design of LLaVA-OneVision allows strong transfer learning across different modalities/scenarios, yielding new emerging capabilities. In particular, strong video understanding and cross-scenario capabilities are demonstrated through task transfer from images to videos.
Benchmarks
| Benchmark | Model | Metrics |
|---|---|---|
| temporal-relation-extraction-on-vinoground | LLaVA-OneVision-Qwen2-72B | Group Score: 21.8, Text Score: 48.4, Video Score: 35.2 |
| temporal-relation-extraction-on-vinoground | LLaVA-OneVision-Qwen2-7B | Group Score: 14.6, Text Score: 41.6, Video Score: 29.4 |
| video-question-answering-on-next-qa | LLaVA-OV (7B) | Accuracy: 79.4 |
| video-question-answering-on-next-qa | LLaVA-OV (72B) | Accuracy: 80.2 |
| visual-question-answering-on-mm-vet | LLaVA-OneVision-7B | GPT-4 score: 57.5 |
| visual-question-answering-on-mm-vet | LLaVA-OneVision-72B | GPT-4 score: 63.7 |
| visual-question-answering-on-mm-vet | LLaVA-OneVision-0.5B | GPT-4 score: 29.1 |
| visual-question-answering-vqa-on-vlm2-bench | LLaVA-OneVision-7B | Average Score (9 subtasks): 39.35; GC-mat: 16.60, GC-trk: 13.70, OC-cnt: 56.17, OC-cpr: 47.22, OC-grp: 27.50, PC-VID: 47.25, PC-cnt: 46.67, PC-cpr: 62.00, PC-grp: 37.00 |
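As a quick sanity check on the VLM2-bench row above, the reported average can be recovered as the unweighted mean of the nine subtask scores (the unweighted-mean aggregation is an assumption here, but it reproduces the listed 39.35):

```python
# Reported VLM2-bench subtask scores for LLaVA-OneVision-7B, copied from the table above.
subtask_scores = {
    "GC-mat": 16.60, "GC-trk": 13.70,
    "OC-cnt": 56.17, "OC-cpr": 47.22, "OC-grp": 27.50,
    "PC-VID": 47.25, "PC-cnt": 46.67, "PC-cpr": 62.00, "PC-grp": 37.00,
}

# Unweighted mean over the nine subtasks (assumed aggregation scheme).
average = sum(subtask_scores.values()) / len(subtask_scores)
print(round(average, 2))  # 39.35, matching the reported Average Score
```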