

VideoLLaMA 3: Frontier Multimodal Foundation Models for Image and Video Understanding

Abstract

In this paper, we propose VideoLLaMA3, a more advanced multimodal foundation model for image and video understanding. The core design philosophy of VideoLLaMA3 is vision-centric. The meaning of "vision-centric" is two-fold: the vision-centric training paradigm and the vision-centric framework design. The key insight of our vision-centric training paradigm is that high-quality image-text data is crucial for both image and video understanding. Instead of preparing massive video-text datasets, we focus on constructing large-scale, high-quality image-text datasets. VideoLLaMA3 has four training stages: 1) the vision-centric alignment stage, which warms up the vision encoder and projector; 2) the vision-language pretraining stage, which jointly tunes the vision encoder, projector, and LLM with large-scale image-text data covering multiple types (including scene images, documents, and charts) as well as text-only data; 3) the multi-task fine-tuning stage, which incorporates image-text SFT data for downstream tasks and video-text data to establish a foundation for video understanding; 4) the video-centric fine-tuning stage, which further improves the model's capability in video understanding. As for the framework design, to better capture fine-grained details in images, the pretrained vision encoder is adapted to encode images of varying sizes into correspondingly varying numbers of vision tokens, rather than a fixed number. For video inputs, we reduce the number of vision tokens according to their similarity, so that the representation of videos is more precise and compact. Benefiting from these vision-centric designs, VideoLLaMA3 achieves compelling performance on both image and video understanding benchmarks.
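
The similarity-based token reduction described above can be made concrete with a small sketch: tokens in a frame that are nearly identical to the co-located tokens of the previous frame (e.g., static background patches) are dropped, so mostly-static videos compress well. This is a minimal PyTorch sketch under assumed details, not the paper's actual implementation; the function name `prune_similar_video_tokens`, the co-located-token comparison, and the 0.9 threshold are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def prune_similar_video_tokens(frame_tokens: torch.Tensor, threshold: float = 0.9):
    """Hypothetical sketch of similarity-based video token reduction.

    frame_tokens: (T, N, D) tensor - T frames, N patch tokens per frame,
    D feature dimensions. The first frame is kept in full; in every later
    frame, a token is dropped when its cosine similarity to the co-located
    token of the previous frame exceeds `threshold` (the patch barely changed).
    Returns a list of (kept_tokens, kept_indices) pairs, one per frame.
    """
    T, N, _ = frame_tokens.shape
    kept = [(frame_tokens[0], torch.arange(N))]  # frame 0: keep everything
    for t in range(1, T):
        # Cosine similarity between co-located tokens of consecutive frames.
        sim = F.cosine_similarity(frame_tokens[t], frame_tokens[t - 1], dim=-1)
        keep_mask = sim < threshold  # keep only tokens that changed enough
        kept.append((frame_tokens[t][keep_mask],
                     keep_mask.nonzero(as_tuple=True)[0]))
    return kept

# Usage: 16 frames, 196 patch tokens each (a 14x14 grid), 1024-dim features.
tokens = torch.randn(16, 196, 1024)
compressed = prune_similar_video_tokens(tokens)
total_kept = sum(toks.shape[0] for toks, _ in compressed)
print(f"kept {total_kept} of {16 * 196} tokens")
```

For images, the "varying sizes into corresponding numbers of vision tokens" point amounts to letting the token count scale with resolution, roughly ceil(H/p) × ceil(W/p) patches for patch size p, instead of resizing every image to a fixed grid.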

Code Repositories

damo-nlp-sg/videollama3 (official, PyTorch)

Benchmarks

Benchmark: Video Question Answering on NExT-QA
Method: VideoLLaMA3 (7B)
Accuracy: 84.5
