4 months ago

LinVT: Empower Your Image-level Large Language Model to Understand Videos

Lishuai Gao; Yujie Zhong; Yingsen Zeng; Haoxian Tan; Dengjie Li; Zheng Zhao

Abstract

Large Language Models (LLMs) have been widely used in various tasks, motivating us to develop an LLM-based assistant for videos. Instead of training from scratch, we propose a module that transforms arbitrary well-trained image-based LLMs into video-LLMs (after being trained on video data). To better adapt image-LLMs for processing videos, we introduce two design principles: linear transformation, to preserve the original visual-language alignment, and representative information condensation from redundant video content. Guided by these principles, we propose a plug-and-play Linear Video Tokenizer (LinVT), which enables existing image-LLMs to understand videos. We benchmark LinVT with six recent visual LLMs: Aquila, Blip-3, InternVL2, Mipha, Molmo, and Qwen2-VL, showcasing the high compatibility of LinVT. LinVT-based LLMs achieve state-of-the-art performance across various video benchmarks, illustrating the effectiveness of LinVT in multi-modal video understanding.
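
To make the two design principles concrete, the following is a minimal sketch, not the LinVT architecture itself. It assumes each visual token is a plain list of floats and uses uniform average pooling over contiguous groups as a stand-in for whatever weighting the paper actually learns. The function name `condense_tokens` and its parameters are hypothetical. The point it illustrates is that every output token is a fixed weighted sum of input tokens, so the mapping is linear, and that many redundant tokens are condensed into a few.

```python
from typing import List

Vector = List[float]


def condense_tokens(tokens: List[Vector], num_out: int) -> List[Vector]:
    """Condense a token sequence into num_out tokens by group averaging.

    Illustrative assumption: uniform averaging replaces any learned
    weighting. Each output is a convex combination of input tokens,
    so the whole mapping is a linear transformation of the inputs.
    """
    if not tokens:
        raise ValueError("tokens must be non-empty")
    n = len(tokens)
    if not 1 <= num_out <= n:
        raise ValueError("num_out must be between 1 and len(tokens)")

    dim = len(tokens[0])
    condensed: List[Vector] = []
    for k in range(num_out):
        # Integer boundaries split the sequence into num_out non-empty groups.
        start = k * n // num_out
        end = (k + 1) * n // num_out
        group = tokens[start:end]
        # Average the group along each feature dimension.
        condensed.append(
            [sum(tok[d] for tok in group) / len(group) for d in range(dim)]
        )
    return condensed


if __name__ == "__main__":
    # Four 2-dimensional tokens (e.g. from several frames) condensed into two.
    frame_tokens = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
    print(condense_tokens(frame_tokens, num_out=2))
```

Because the outputs stay inside the span of the original image-level tokens, a sketch like this leaves the token space the language model was aligned with unchanged, which is the intuition behind the first principle in the abstract.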

Code Repositories

gls0425/linvt (official, PyTorch)

Benchmarks

Benchmark | Methodology | Metrics
video-question-answering-on-mvbench | LinVT-Qwen2-VL (7B) | Avg.: 69.3
video-question-answering-on-next-qa | LinVT-Qwen2-VL (7B) | Accuracy: 85.5
visual-question-answering-on-mm-vet | LinVT | GPT-4 score: 23.5
zero-shot-video-question-answer-on-egoschema-1 | LinVT-Qwen2-VL (7B) | Accuracy: 69.5
zeroshot-video-question-answer-on-activitynet | LinVT-Qwen2-VL (7B) | Accuracy: 60.1; Confidence Score: 3.6
zeroshot-video-question-answer-on-msrvtt-qa | LinVT-Qwen2-VL (7B) | Accuracy: 66.2; Confidence Score: 4.0
zeroshot-video-question-answer-on-msvd-qa | LinVT-Qwen2-VL (7B) | Accuracy: 80.2; Confidence Score: 4.4
zeroshot-video-question-answer-on-tgif-qa | LinVT-Qwen2-VL (7B) | Accuracy: 81.3; Confidence Score: 4.3
