HyperAI
Video-LLaVA: Learning United Visual Representation by Alignment Before Projection

Bin Lin; Yang Ye; Bin Zhu; Jiaxi Cui; Munan Ning; Peng Jin; Li Yuan

Abstract

Large Vision-Language Models (LVLMs) have enhanced performance on a variety of downstream visual-language understanding tasks. Most existing approaches encode images and videos into separate feature spaces, which are then fed as inputs to a large language model. However, due to the lack of unified tokenization for images and videos, namely misalignment before projection, it becomes challenging for a Large Language Model (LLM) to learn multi-modal interactions from several poor projection layers. In this work, we unify visual representation into the language feature space to advance the foundational LLM towards a unified LVLM. As a result, we establish a simple but robust LVLM baseline, Video-LLaVA, which learns from a mixed dataset of images and videos, each modality mutually enhancing the other. Video-LLaVA achieves superior performance on a broad range of 9 image benchmarks, spanning 5 image question-answering datasets and 4 image benchmark toolkits. Additionally, Video-LLaVA outperforms Video-ChatGPT by 5.8%, 9.9%, 18.6%, and 10.1% on MSRVTT, MSVD, TGIF, and ActivityNet, respectively. Notably, extensive experiments demonstrate that Video-LLaVA mutually benefits images and videos within a unified visual representation, outperforming models designed specifically for images or videos. We aim for this work to provide modest insights into multi-modal inputs for LLMs. Code: https://github.com/PKU-YuanGroup/Video-LLaVA
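The core idea of "alignment before projection" can be illustrated with a toy sketch: both modality encoders emit features in one shared space, so a single projection layer can map either modality into the LLM's token embedding space. All names and dimensions below are illustrative, not the paper's actual architecture or code.

```python
import random

def encode(units, dim=8):
    """Toy stand-in for a modality encoder (e.g., a LanguageBind-style
    encoder): maps each input unit to a vector in a SHARED feature space."""
    return [[random.random() for _ in range(dim)] for _ in units]

def project(features, proj):
    """One shared projection (a plain matrix multiply here) maps the
    pre-aligned features into the language model's embedding space."""
    out_dim = len(proj[0])
    return [
        [sum(f[i] * proj[i][j] for i in range(len(f))) for j in range(out_dim)]
        for f in features
    ]

random.seed(0)
shared_dim, llm_dim = 8, 16

# A single projection shared by BOTH modalities -- viable only because
# image and video features were aligned to the same space beforehand.
proj = [[random.random() for _ in range(llm_dim)] for _ in range(shared_dim)]

image_feats = encode(["img_patch"] * 4, shared_dim)  # 4 image patches
video_feats = encode(["frame"] * 8, shared_dim)      # 8 video frames

image_tokens = project(image_feats, proj)
video_tokens = project(video_feats, proj)

# Both modalities now live in the same llm_dim-dimensional token space
# and can be interleaved with text tokens as LLM input.
print(len(image_tokens), len(video_tokens), len(image_tokens[0]))
```

The contrast with "misalignment before projection" is that separate encoders with incompatible output spaces would each need their own projection layer, forcing the LLM to reconcile the modalities itself.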

Code Repositories

PKU-YuanGroup/Video-LLaVA (official, PyTorch)
pku-yuangroup/video-bench
PKU-YuanGroup/MoE-LLaVA (PyTorch)
qiujihao19/artemis (PyTorch)
pku-yuangroup/languagebind (PyTorch)
PKU-YuanGroup/LLMBind (PyTorch)

Benchmarks

Benchmark | Method | Metrics
temporal-relation-extraction-on-vinoground | Video-LLaVA-7B | Group Score: 6.6; Text Score: 24.8; Video Score: 25.8
video-question-answering-on-activitynet-qa | Video-LLaVA | Accuracy: 45.3; Confidence Score: 3.3
visual-question-answering-on-mm-vet | Video-LLaVA | GPT-4 score: 32.0
zeroshot-video-question-answer-on-activitynet | Video-LLaVA | Accuracy: 45.3; Confidence Score: 3.3
zeroshot-video-question-answer-on-msrvtt-qa | Video-LLaVA-7B | Accuracy: 59.2; Confidence Score: 3.5
zeroshot-video-question-answer-on-msvd-qa | Video-LLaVA-7B | Accuracy: 70.7; Confidence Score: 3.9
zeroshot-video-question-answer-on-tgif-qa | Video-LLaVA-7B | Accuracy: 70.0; Confidence Score: 4.0
