Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, Chunyuan Li

Abstract
The development of video large multimodal models (LMMs) has been hindered by the difficulty of curating large amounts of high-quality raw data from the web. To address this, we propose an alternative approach by creating a high-quality synthetic dataset specifically for video instruction-following, namely LLaVA-Video-178K. This dataset includes key tasks such as detailed captioning, open-ended question-answering (QA), and multiple-choice QA. By training on this dataset, in combination with existing visual instruction tuning data, we introduce LLaVA-Video, a new video LMM. Our experiments demonstrate that LLaVA-Video achieves strong performance across various video benchmarks, highlighting the effectiveness of our dataset. We plan to release the dataset, its generation pipeline, and the model checkpoints.
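Since the dataset is slated for public release, here is a minimal sketch of how one might load it with the Hugging Face `datasets` library, assuming it is published on the Hub; the repository id, configuration, and split names below are assumptions, not confirmed by the abstract.

```python
# Minimal sketch: loading LLaVA-Video-178K annotations with the
# Hugging Face `datasets` library. The repository id, config, and
# split names are assumptions; check the released dataset card for
# the actual identifiers.
from datasets import load_dataset

dataset = load_dataset(
    "lmms-lab/LLaVA-Video-178K",  # assumed repository id
    "0_30_s_academic_v0_1",       # assumed config (a clip-length subset)
    split="caption",              # assumed split: detailed captioning
)

# Each record should pair a video reference with an instruction-following
# conversation (detailed caption, open-ended QA, or multiple-choice QA).
print(dataset[0])
```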
Benchmarks
| Benchmark | Model | Metrics |
|---|---|---|
| Video Question Answering on NExT-QA | LLaVA-Video | Accuracy: 83.2 |
| Video Question Answering on TVBench | LLaVA-Video 7B | Average Accuracy: 45.6 |
| Video Question Answering on TVBench | LLaVA-Video 72B | Average Accuracy: 50.0 |
| Visual Question Answering (VQA) on VLM2-Bench | LLaVA-Video-7B | Average over 9 subtasks: 43.32; GC-mat: 18.53; GC-trk: 12.79; OC-cnt: 62.47; OC-cpr: 54.72; OC-grp: 28.50; PC-VID: 59.00; PC-cnt: 66.91; PC-cpr: 62.00; PC-grp: 25.00 (average verified in the sketch below) |
| Zero-Shot Video Question Answering | LLaVA-Video | Accuracy (%): 61.9 |
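As a sanity check on the table above, the VLM2-Bench average can be reproduced from the nine subtask scores, assuming the benchmark takes an unweighted mean over subtasks:

```python
# Reproduce the VLM2-Bench average for LLaVA-Video-7B, assuming an
# unweighted mean over the nine subtask scores reported above.
scores = {
    "GC-mat": 18.53, "GC-trk": 12.79,
    "OC-cnt": 62.47, "OC-cpr": 54.72, "OC-grp": 28.50,
    "PC-VID": 59.00, "PC-cnt": 66.91, "PC-cpr": 62.00, "PC-grp": 25.00,
}
average = sum(scores.values()) / len(scores)
print(f"Average over {len(scores)} subtasks: {average:.2f}")  # 43.32
```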