
Abstract
In this work, we present LLaMA-VID, a novel approach to the token-generation challenge that vision-language models (VLMs) face in video and image understanding. While current VLMs excel at tasks such as image captioning and visual question answering, they incur a heavy computational burden when processing long videos due to the excessive number of visual tokens. LLaMA-VID addresses this by representing each frame with two distinct tokens: a context token and a content token. The context token encodes the overall image context conditioned on the user input, while the content token encapsulates the visual cues within each frame. This dual-token strategy significantly reduces the computational overhead for long videos while preserving the critical information. Overall, LLaMA-VID enables existing frameworks to support hour-long videos and further extends their upper limit with additional context tokens. The method has been shown to outperform previous approaches on most video- and image-based benchmarks. Code is available at https://github.com/dvlab-research/LLaMA-VID.
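To make the dual-token idea concrete, the sketch below shows one plausible way to condense a frame's visual features into a context token (aggregated under the guidance of the user's text query) and a content token (pooled from the frame itself), following the description in the abstract. The class name `DualTokenGenerator`, the projection layers, and the mean-pooling choices are illustrative assumptions for this sketch, not the official LLaMA-VID implementation.

```python
import torch
import torch.nn as nn


class DualTokenGenerator(nn.Module):
    """Minimal sketch of the per-frame dual-token idea (assumed design, not the official code)."""

    def __init__(self, vis_dim: int, txt_dim: int, llm_dim: int):
        super().__init__()
        # Project the text query into the visual feature space for
        # query-conditioned aggregation (assumption for this sketch).
        self.txt_proj = nn.Linear(txt_dim, vis_dim)
        # Project both resulting tokens into the LLM embedding space.
        self.to_llm = nn.Linear(vis_dim, llm_dim)

    def forward(self, vis_feats: torch.Tensor, txt_query: torch.Tensor) -> torch.Tensor:
        """
        vis_feats: (B, N, vis_dim)  patch features of one frame
        txt_query: (B, M, txt_dim)  embedded user instruction
        returns:   (B, 2, llm_dim)  [context token, content token]
        """
        # Context token: attend from the text query to the frame's patches,
        # then average the query-conditioned features into a single token.
        q = self.txt_proj(txt_query)                                    # (B, M, vis_dim)
        scores = q @ vis_feats.transpose(1, 2) / vis_feats.shape[-1] ** 0.5
        attn = torch.softmax(scores, dim=-1)                            # (B, M, N)
        ctx = (attn @ vis_feats).mean(dim=1, keepdim=True)              # (B, 1, vis_dim)

        # Content token: pool the frame's own visual features.
        cnt = vis_feats.mean(dim=1, keepdim=True)                       # (B, 1, vis_dim)

        return self.to_llm(torch.cat([ctx, cnt], dim=1))                # (B, 2, llm_dim)


# Example usage with made-up dimensions: 256 patches per frame, a 32-token query.
gen = DualTokenGenerator(vis_dim=1024, txt_dim=4096, llm_dim=4096)
tokens = gen(torch.randn(1, 256, 1024), torch.randn(1, 32, 4096))      # -> (1, 2, 4096)
```

With only two tokens per frame, an hour-long video sampled at, say, one frame per second contributes roughly 7,200 visual tokens to the LLM context instead of hundreds of thousands of patch tokens, which is what lets existing frameworks handle such videos.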
Code Repositories
dvlab-research/llama-vid
Official
pytorch
Mentioned in GitHub
Benchmarks
| Benchmark | Method | Metrics |
|---|---|---|
| video-based-generative-performance | LLaMA-VID-7B (2 Token) | Consistency: 2.51, Contextual Understanding: 3.53, Correctness of Information: 2.96, Detail Orientation: 3.00, Temporal Understanding: 2.46, Mean: 2.89 |
| video-based-generative-performance | LLaMA-VID-13B (2 Token) | Consistency: 2.63, Contextual Understanding: 3.60, Correctness of Information: 3.07, Detail Orientation: 3.05, Temporal Understanding: 2.58, Mean: 2.99 |
| video-question-answering-on-activitynet-qa | LLaMA-VID-7B (2 Token) | Accuracy: 47.4, Confidence Score: 3.3 |
| video-question-answering-on-activitynet-qa | LLaMA-VID-13B (2 Token) | Accuracy: 47.5, Confidence Score: 3.3 |
| zeroshot-video-question-answer-on-activitynet | LLaMA-VID-13B (2 Token) | Accuracy: 47.5, Confidence Score: 3.3 |
| zeroshot-video-question-answer-on-activitynet | LLaMA-VID-7B (2 Token) | Accuracy: 47.4, Confidence Score: 3.3 |
| zeroshot-video-question-answer-on-msrvtt-qa | LLaMA-VID-7B (2 Token) | Accuracy: 57.7, Confidence Score: 3.2 |
| zeroshot-video-question-answer-on-msrvtt-qa | LLaMA-VID-13B (2 Token) | Accuracy: 58.9, Confidence Score: 3.3 |
| zeroshot-video-question-answer-on-msvd-qa | LLaMA-VID-7B (2 Token) | Accuracy: 69.7, Confidence Score: 3.7 |
| zeroshot-video-question-answer-on-msvd-qa | LLaMA-VID-13B (2 Token) | Accuracy: 70.0, Confidence Score: 3.7 |