

VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

Abstract

In this paper, we propose VideoLLaMA3, a more advanced multimodal foundation model for image and video understanding. The core design philosophy of VideoLLaMA3 is vision-centric. The meaning of "vision-centric" is two-fold: the vision-centric training paradigm and the vision-centric framework design. The key insight of our vision-centric training paradigm is that high-quality image-text data is crucial for both image and video understanding. Instead of preparing massive video-text datasets, we focus on constructing large-scale, high-quality image-text datasets. VideoLLaMA3 has four training stages: 1) the vision-centric alignment stage, which warms up the vision encoder and projector; 2) the vision-language pretraining stage, which jointly tunes the vision encoder, projector, and LLM with large-scale image-text data covering multiple types (including scene images, documents, and charts) as well as text-only data; 3) the multi-task fine-tuning stage, which incorporates image-text SFT data for downstream tasks and video-text data to establish a foundation for video understanding; 4) the video-centric fine-tuning stage, which further improves the model's capability in video understanding. As for the framework design, to better capture fine-grained details in images, the pretrained vision encoder is adapted to encode images of varying sizes into correspondingly varying numbers of vision tokens, rather than a fixed number. For video inputs, we reduce the number of vision tokens according to their similarity, so that the representation of videos is more precise and compact. Benefiting from these vision-centric designs, VideoLLaMA3 achieves compelling performance on both image and video understanding benchmarks.
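
The similarity-based token reduction described above can be made concrete with a small sketch: tokens in a frame that are nearly identical to the co-located tokens of the previous frame (e.g., static background patches) are dropped, so mostly-static videos compress well. This is a minimal PyTorch sketch under assumed details, not the paper's actual implementation; the function name `prune_similar_video_tokens`, the co-located-token comparison, and the 0.9 threshold are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def prune_similar_video_tokens(frame_tokens: torch.Tensor, threshold: float = 0.9):
    """Hypothetical sketch of similarity-based video token reduction.

    frame_tokens: (T, N, D) tensor - T frames, N patch tokens per frame,
    D feature dimensions. The first frame is kept in full; in every later
    frame, a token is dropped when its cosine similarity to the co-located
    token of the previous frame exceeds `threshold` (the patch barely changed).
    Returns a list of (kept_tokens, kept_indices) pairs, one per frame.
    """
    T, N, _ = frame_tokens.shape
    kept = [(frame_tokens[0], torch.arange(N))]  # frame 0: keep everything
    for t in range(1, T):
        # Cosine similarity between co-located tokens of consecutive frames.
        sim = F.cosine_similarity(frame_tokens[t], frame_tokens[t - 1], dim=-1)
        keep_mask = sim < threshold  # keep only tokens that changed enough
        kept.append((frame_tokens[t][keep_mask],
                     keep_mask.nonzero(as_tuple=True)[0]))
    return kept

# Usage: 16 frames, 196 patch tokens each (a 14x14 grid), 1024-dim features.
tokens = torch.randn(16, 196, 1024)
compressed = prune_similar_video_tokens(tokens)
total_kept = sum(toks.shape[0] for toks, _ in compressed)
print(f"kept {total_kept} of {16 * 196} tokens")
```

For images, the "varying sizes into corresponding numbers of vision tokens" point amounts to letting the token count scale with resolution, roughly ceil(H/p) × ceil(W/p) patches for patch size p, instead of resizing every image to a fixed grid.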

Code Repositories

damo-nlp-sg/videollama3 (official, PyTorch)

Benchmarks

Benchmark: Video Question Answering on NExT-QA
Method: VideoLLaMA3 (7B)
Accuracy: 84.5
