TS-LLaVA: Constructing Visual Tokens through Thumbnail-and-Sampling for Training-Free Video Large Language Models

Tingyu Qu; Mingxiao Li; Tinne Tuytelaars; Marie-Francine Moens

Abstract

Recent advances in multimodal Large Language Models (LLMs) have shown great success in understanding multimodal content. For video understanding tasks, training-based video LLMs are difficult to build due to the scarcity of high-quality, curated video-text paired data. In contrast, paired image-text data are much easier to obtain, and there is substantial similarity between images and videos. Consequently, extending image LLMs to video understanding tasks presents an appealing alternative. Developing effective strategies for compressing visual tokens from multiple frames is a promising way to leverage a powerful pre-trained image LLM. In this work, we explore the limitations of existing compression strategies for building a training-free video LLM. The findings lead to our method, TS-LLaVA, which constructs visual tokens through a Thumbnail-and-Sampling strategy. Given a video, we select a few equidistant frames from all input frames to construct a Thumbnail image as a detailed visual cue, complemented by Sampled visual tokens from all input frames. Our method establishes new state-of-the-art performance among training-free video LLMs on various benchmarks. Notably, our 34B model outperforms GPT-4V on the MVBench benchmark, and achieves performance comparable to the 72B training-based video LLM, Video-LLaMA2, on the challenging MLVU benchmark. Code is available at https://github.com/tingyu215/TS-LLaVA.
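
The Thumbnail-and-Sampling idea described in the abstract can be illustrated with a small sketch. This is not the authors' implementation (see the repository linked above for that); it is a minimal pure-Python illustration under stated assumptions. The function names `equidistant_indices`, `make_thumbnail`, and `sample_tokens`, the grid layout of the thumbnail, and the use of uniform spacing for token sampling are all hypothetical choices made here for clarity. Frames are represented as nested lists of pixel values, and visual tokens as opaque list items.

```python
def equidistant_indices(length, k):
    """Return k roughly evenly spaced indices into a sequence of `length` items."""
    if k <= 0 or length <= 0:
        return []
    if k >= length:
        return list(range(length))
    # Take the centre of each of k equal-width segments.
    return [int((i + 0.5) * length / k) for i in range(k)]


def make_thumbnail(frames, rows, cols):
    """Tile rows*cols equidistant frames into a single grid image.

    Assumption: every frame is a list of pixel rows with identical size.
    """
    picked = [frames[i] for i in equidistant_indices(len(frames), rows * cols)]
    if len(picked) != rows * cols:
        raise ValueError("not enough frames to fill the thumbnail grid")
    grid = []
    for r in range(rows):
        row_frames = picked[r * cols:(r + 1) * cols]
        for y in range(len(row_frames[0])):
            line = []
            for frame in row_frames:
                line.extend(frame[y])  # place frames side by side
            grid.append(line)
    return grid


def sample_tokens(tokens_per_frame, budget):
    """Keep `budget` tokens, spaced uniformly over the tokens of all frames.

    Assumption: uniform spacing stands in for whatever sampling the method uses.
    """
    flat = [tok for frame_tokens in tokens_per_frame for tok in frame_tokens]
    return [flat[i] for i in equidistant_indices(len(flat), budget)]


if __name__ == "__main__":
    # Eight toy 1x1 "frames" whose single pixel is the frame index.
    frames = [[[i]] for i in range(8)]
    print(make_thumbnail(frames, rows=2, cols=2))
    # Two frames with four toy tokens each, compressed to four tokens.
    print(sample_tokens([[0, 1, 2, 3], [4, 5, 6, 7]], budget=4))
```

In this sketch the thumbnail would be encoded by the image LLM's vision encoder like any single image, and its tokens would be concatenated with the sampled tokens before being passed to the language model; how the two token sets are ordered and sized is left open here.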

Code Repositories

tingyu215/ts-llava (official, PyTorch)

Benchmarks

Benchmark | Methodology | Metrics
video-based-generative-performance | TS-LLaVA-34B | mean: 3.38
video-based-generative-performance-1 | TS-LLaVA-34B | gpt-score: 3.55
video-based-generative-performance-2 | TS-LLaVA-34B | gpt-score: 3.69
video-based-generative-performance-3 | TS-LLaVA-34B | gpt-score: 3.86
video-based-generative-performance-4 | TS-LLaVA-34B | gpt-score: 3.03
video-based-generative-performance-5 | TS-LLaVA-34B | gpt-score: 2.77
zero-shot-video-question-answer-on-egoschema | TS-LLaVA-34B | Accuracy: 57.8
zero-shot-video-question-answer-on-intentqa | TS-LLaVA-34B | Accuracy: 67.9
zero-shot-video-question-answer-on-mvbench | TS-LLaVA-34B | Accuracy: 52.6
zero-shot-video-question-answer-on-next-qa | TS-LLaVA-34B | Accuracy: 73.6
zeroshot-video-question-answer-on-activitynet | TS-LLaVA-34B | Accuracy: 58.9; Confidence Score: 3.5
zeroshot-video-question-answer-on-msrvtt-qa | TS-LLaVA-34B | Accuracy: 66.2; Confidence Score: 3.6
zeroshot-video-question-answer-on-msvd-qa | TS-LLaVA-34B | Accuracy: 79.4; Confidence Score: 4.1
zeroshot-video-question-answer-on-tgif-qa | TS-LLaVA-34B | Accuracy: 81.0; Confidence Score: 4.2
