8 months ago

Visual Question Answering

Method/Architecture

Mingze Xu, Mingfei Gao, Zhe Gan, Hong-You Chen, Zhengfeng Lai, Haiming Gang, Kai Kang, Afshin Dehghan

Abstract

We propose SlowFast-LLaVA (or SF-LLaVA for short), a training-free video large language model (LLM) that can jointly capture detailed spatial semantics and long-range temporal context without exceeding the token budget of commonly used LLMs. This is realized by using a two-stream SlowFast design of inputs for Video LLMs to aggregate features from sampled video frames in an effective way. Specifically, the Slow pathway extracts features at a low frame rate while keeping as many spatial details as possible (e.g., with 24x24 tokens), and the Fast pathway operates on a high frame rate but uses a larger spatial pooling stride (e.g., downsampling 6x) to focus on the motion cues. As a result, this design allows us to adequately capture both spatial and temporal features that are beneficial for understanding details along the video. Experimental results show that SF-LLaVA outperforms existing training-free methods on a wide range of video tasks. On some benchmarks, it achieves comparable or even better performance compared to state-of-the-art Video LLMs that are fine-tuned on video datasets.
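
To make the token-budget argument concrete, the following is a minimal sketch of the arithmetic behind a two-pathway input, not the authors' implementation. The 24x24 token grid and the 6x pooling stride come from the abstract; the frame counts (8 slow, 48 fast), the `Pathway` class, and the `select_slow_indices` helper are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pathway:
    """Hypothetical description of one input pathway."""
    num_frames: int   # frames sampled for this pathway (assumed value)
    grid: int         # tokens per side before pooling, e.g. 24 for 24x24
    pool_stride: int  # spatial pooling stride; 1 keeps every token

    def tokens_per_frame(self) -> int:
        side = self.grid // self.pool_stride
        return side * side

    def total_tokens(self) -> int:
        return self.num_frames * self.tokens_per_frame()


def select_slow_indices(num_fast_frames: int, num_slow_frames: int) -> list[int]:
    """Uniformly pick which of the densely sampled frames feed the Slow pathway."""
    step = num_fast_frames / num_slow_frames
    return [int(i * step) for i in range(num_slow_frames)]


# Assumed frame counts: few frames at full detail, many frames heavily pooled.
slow = Pathway(num_frames=8, grid=24, pool_stride=1)   # 24x24 tokens per frame
fast = Pathway(num_frames=48, grid=24, pool_stride=6)  # 4x4 tokens per frame

combined = slow.total_tokens() + fast.total_tokens()
# Baseline for comparison: every densely sampled frame kept at full resolution.
naive = Pathway(num_frames=48, grid=24, pool_stride=1).total_tokens()

if __name__ == "__main__":
    print("slow tokens:", slow.total_tokens())
    print("fast tokens:", fast.total_tokens())
    print("combined:", combined, "vs. naive:", naive)
    print("slow frame indices:", select_slow_indices(48, 8))
```

Under these assumed frame counts, the Fast pathway covers six times as many frames as the Slow pathway while contributing far fewer tokens, which is the trade-off the abstract describes. The actual frame counts and pooling settings should be taken from the paper or its released code.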

