5 months ago

SlowFast-LLaVA: A Strong Training-Free Baseline for Video Large Language Models

Mingze Xu, Mingfei Gao, Zhe Gan, Hong-You Chen, Zhengfeng Lai, Haiming Gang, Kai Kang, Afshin Dehghan

Abstract

We propose SlowFast-LLaVA (or SF-LLaVA for short), a training-free video large language model (LLM) that can jointly capture detailed spatial semantics and long-range temporal context without exceeding the token budget of commonly used LLMs. This is realized by using a two-stream SlowFast design of inputs for Video LLMs to aggregate features from sampled video frames in an effective way. Specifically, the Slow pathway extracts features at a low frame rate while keeping as many spatial details as possible (e.g., with 24x24 tokens), and the Fast pathway operates on a high frame rate but uses a larger spatial pooling stride (e.g., downsampling 6x) to focus on the motion cues. As a result, this design allows us to adequately capture both spatial and temporal features that are beneficial for understanding details along the video. Experimental results show that SF-LLaVA outperforms existing training-free methods on a wide range of video tasks. On some benchmarks, it achieves comparable or even better performance compared to state-of-the-art Video LLMs that are fine-tuned on video datasets.
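
To make the token-budget argument concrete, the sketch below counts the visual tokens produced by a two-stream input of the kind the abstract describes. It is an illustration under stated assumptions, not the authors' implementation: the function names `avg_pool_2d` and `slowfast_token_count`, the frame counts (48 sampled frames, 4 of them in the Slow pathway), and the use of non-overlapping average pooling are all hypothetical choices. Only the 24x24 token grid and the 6x downsampling come from the abstract.

```python
def avg_pool_2d(grid, stride):
    """Average-pool a 2D grid of scalars with non-overlapping stride x stride windows.

    Each scalar stands in for one token feature; real features are vectors.
    (Assumption: the abstract does not say which pooling operator is used.)
    """
    h, w = len(grid), len(grid[0])
    assert h % stride == 0 and w % stride == 0, "grid must divide evenly"
    area = stride * stride
    return [
        [
            sum(grid[i + di][j + dj] for di in range(stride) for dj in range(stride)) / area
            for j in range(0, w, stride)
        ]
        for i in range(0, h, stride)
    ]


def slowfast_token_count(num_frames, slow_frames, grid=24, fast_stride=6):
    """Return (slow_tokens, fast_tokens, total) for a hypothetical configuration.

    Slow pathway: few frames, full grid x grid tokens per frame.
    Fast pathway: all sampled frames, each pooled down by fast_stride per side.
    """
    slow_tokens = slow_frames * grid * grid
    fast_side = grid // fast_stride
    fast_tokens = num_frames * fast_side * fast_side
    return slow_tokens, fast_tokens, slow_tokens + fast_tokens


# Hypothetical setup: 48 sampled frames, 4 of them kept at full resolution.
frame = [[1.0] * 24 for _ in range(24)]
pooled = avg_pool_2d(frame, 6)            # one Fast-pathway frame: 4 x 4 tokens
slow, fast, total = slowfast_token_count(num_frames=48, slow_frames=4)
naive = 48 * 24 * 24                      # every frame at full resolution
print(len(pooled), len(pooled[0]))        # 4 4
print(slow, fast, total, naive)           # 2304 768 3072 27648
```

Under these assumed numbers, the two-stream input uses 3,072 tokens, whereas feeding all 48 frames at full resolution would use 27,648. That gap is what lets the Fast pathway cover many frames while the Slow pathway keeps spatial detail.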

Code Repositories

apple/ml-slowfast-llava (Official, PyTorch, mentioned in GitHub)

Benchmarks

Benchmark | Methodology | Metrics
video-based-generative-performance | SlowFast-LLaVA-34B | mean: 3.32
video-based-generative-performance-1 | SlowFast-LLaVA-34B | gpt-score: 3.48
video-based-generative-performance-2 | SlowFast-LLaVA-34B | gpt-score: 3.57
video-based-generative-performance-3 | SlowFast-LLaVA-34B | gpt-score: 3.84
video-based-generative-performance-4 | SlowFast-LLaVA-34B | gpt-score: 2.96
video-based-generative-performance-5 | SlowFast-LLaVA-34B | gpt-score: 2.77
zero-shot-video-question-answer-on-egoschema | SlowFast-LLaVA-34B | Accuracy: 47.2
zero-shot-video-question-answer-on-intentqa | SlowFast-LLaVA-34B | Accuracy: 60.1
zero-shot-video-question-answer-on-next-qa | SlowFast-LLaVA-34B | Accuracy: 64.2
zeroshot-video-question-answer-on-activitynet | SlowFast-LLaVA-34B | Accuracy: 59.2; Confidence Score: 3.5
zeroshot-video-question-answer-on-msrvtt-qa | SlowFast-LLaVA-34B | Accuracy: 67.4; Confidence Score: 3.7
zeroshot-video-question-answer-on-msvd-qa | SlowFast-LLaVA-34B | Accuracy: 79.9; Confidence Score: 4.1
zeroshot-video-question-answer-on-tgif-qa | SlowFast-LLaVA-34B | Accuracy: 80.6; Confidence Score: 4.3
