LinVT: Empower Your Image-level Large Language Model to Understand Videos
Lishuai Gao; Yujie Zhong; Yingsen Zeng; Haoxian Tan; Dengjie Li; Zheng Zhao

Abstract
Large Language Models (LLMs) have been widely used in various tasks, motivating us to develop an LLM-based assistant for videos. Instead of training from scratch, we propose a module that transforms an arbitrary well-trained image-based LLM into a video-LLM once it has been trained on video data. To better adapt image-LLMs for processing videos, we introduce two design principles: a linear transformation that preserves the original visual-language alignment, and condensation of representative information from redundant video content. Guided by these principles, we propose a plug-and-play Linear Video Tokenizer (LinVT), which enables existing image-LLMs to understand videos. We benchmark LinVT with six recent visual LLMs (Aquila, Blip-3, InternVL2, Mipha, Molmo and Qwen2-VL), showcasing the high compatibility of LinVT. LinVT-based LLMs achieve state-of-the-art performance across various video benchmarks, illustrating the effectiveness of LinVT in multi-modal video understanding.
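The two design principles can be illustrated with a toy sketch. It is not the LinVT implementation: the function name `condense_tokens`, the token values, and the fixed weight matrix are all assumptions made for illustration, and how the actual module derives its weights is not described in the abstract. The sketch only shows that a smaller set of output tokens can be formed as weighted sums of the input visual tokens, so each output stays a linear combination of features the image-LLM already understands.

```python
def condense_tokens(tokens, weights):
    """Return one output token per weight row, each a weighted sum of the inputs.

    tokens:  list of N vectors (lists of floats), all of the same dimension.
    weights: list of K rows, each with N weights (hypothetical, fixed here).
    """
    dim = len(tokens[0])
    out = []
    for row in weights:
        if len(row) != len(tokens):
            raise ValueError("each weight row needs one weight per input token")
        # Linear combination: no non-linearity is applied to the token features.
        out.append([sum(w * tok[d] for w, tok in zip(row, tokens)) for d in range(dim)])
    return out


# Four toy "frame tokens" of dimension 2 (made-up values).
frame_tokens = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, 0.0]]

# Two output tokens from four inputs: a 2x condensation.
# Each row sums to 1, so each output is an average of a pair of inputs.
mix = [[0.5, 0.5, 0.0, 0.0],
       [0.0, 0.0, 0.5, 0.5]]

video_tokens = condense_tokens(frame_tokens, mix)
print(video_tokens)
```

In this sketch four input tokens are condensed to two, and because the mapping is linear in the token features, the outputs remain in the span of the original visual embeddings.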
Benchmarks
| Benchmark | Model | Metrics |
|---|---|---|
| video-question-answering-on-mvbench | LinVT-Qwen2-VL (7B) | Avg.: 69.3 |
| video-question-answering-on-next-qa | LinVT-Qwen2-VL (7B) | Accuracy: 85.5 |
| visual-question-answering-on-mm-vet | LinVT | GPT-4 score: 23.5 |
| zero-shot-video-question-answer-on-egoschema-1 | LinVT-Qwen2-VL (7B) | Accuracy: 69.5 |
| zeroshot-video-question-answer-on-activitynet | LinVT-Qwen2-VL (7B) | Accuracy: 60.1; Confidence Score: 3.6 |
| zeroshot-video-question-answer-on-msrvtt-qa | LinVT-Qwen2-VL (7B) | Accuracy: 66.2; Confidence Score: 4.0 |
| zeroshot-video-question-answer-on-msvd-qa | LinVT-Qwen2-VL (7B) | Accuracy: 80.2; Confidence Score: 4.4 |
| zeroshot-video-question-answer-on-tgif-qa | LinVT-Qwen2-VL (7B) | Accuracy: 81.3; Confidence Score: 4.3 |