Yuanhan Zhang, Jinming Wu, Wei Li, Bo Li, Zejun Ma, Ziwei Liu, Chunyuan Li

Abstract
The development of video large multimodal models (LMMs) has been hindered by the difficulty of curating large amounts of high-quality raw data from the web. To address this, we propose an alternative approach by creating a high-quality synthetic dataset specifically for video instruction-following, namely LLaVA-Video-178K. This dataset includes key tasks such as detailed captioning, open-ended question-answering (QA), and multiple-choice QA. By training on this dataset, in combination with existing visual instruction tuning data, we introduce LLaVA-Video, a new video LMM. Our experiments demonstrate that LLaVA-Video achieves strong performance across various video benchmarks, highlighting the effectiveness of our dataset. We plan to release the dataset, its generation pipeline, and the model checkpoints.
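Since the dataset is slated for public release, here is a minimal sketch of how one might load it with the Hugging Face `datasets` library, assuming it is published on the Hub; the repository id, configuration, and split names below are assumptions, not confirmed by the abstract.

```python
# Minimal sketch: loading LLaVA-Video-178K annotations with the
# Hugging Face `datasets` library. The repository id, config, and
# split names are assumptions; check the released dataset card for
# the actual identifiers.
from datasets import load_dataset

dataset = load_dataset(
    "lmms-lab/LLaVA-Video-178K",  # assumed repository id
    "0_30_s_academic_v0_1",       # assumed config (a clip-length subset)
    split="caption",              # assumed split: detailed captioning
)

# Each record should pair a video reference with an instruction-following
# conversation (detailed caption, open-ended QA, or multiple-choice QA).
print(dataset[0])
```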
Benchmarks
| Benchmark | Model | Metrics |
|---|---|---|
| Video Question Answering on NExT-QA | LLaVA-Video | Accuracy: 83.2 |
| Video Question Answering on TVBench | LLaVA-Video 7B | Average Accuracy: 45.6 |
| Video Question Answering on TVBench | LLaVA-Video 72B | Average Accuracy: 50.0 |
| Visual Question Answering (VQA) on VLM2-Bench | LLaVA-Video-7B | Average over 9 subtasks: 43.32; GC-mat: 18.53; GC-trk: 12.79; OC-cnt: 62.47; OC-cpr: 54.72; OC-grp: 28.50; PC-VID: 59.00; PC-cnt: 66.91; PC-cpr: 62.00; PC-grp: 25.00 (average verified in the sketch below) |
| Zero-Shot Video Question Answering | LLaVA-Video | Accuracy (%): 61.9 |
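As a sanity check on the table above, the VLM2-Bench average can be reproduced from the nine subtask scores, assuming the benchmark takes an unweighted mean over subtasks:

```python
# Reproduce the VLM2-Bench average for LLaVA-Video-7B, assuming an
# unweighted mean over the nine subtask scores reported above.
scores = {
    "GC-mat": 18.53, "GC-trk": 12.79,
    "OC-cnt": 62.47, "OC-cpr": 54.72, "OC-grp": 28.50,
    "PC-VID": 59.00, "PC-cnt": 66.91, "PC-cpr": 62.00, "PC-grp": 25.00,
}
average = sum(scores.values()) / len(scores)
print(f"Average over {len(scores)} subtasks: {average:.2f}")  # 43.32
```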