3 months ago

BIMBA: Selective-Scan Compression for Long-Range Video Question Answering

Md Mohaiminul Islam, Tushar Nagarajan, Huiyu Wang, Gedas Bertasius, Lorenzo Torresani

Abstract

Video Question Answering (VQA) in long videos poses the key challenge of extracting relevant information and modeling long-range dependencies from many redundant frames. The self-attention mechanism provides a general solution for sequence modeling, but it has a prohibitive cost when applied to a massive number of spatiotemporal tokens in long videos. Most prior methods rely on compression strategies to lower the computational cost, such as reducing the input length via sparse frame sampling or compressing the output sequence passed to the large language model (LLM) via space-time pooling. However, these naive approaches over-represent redundant information and often miss salient events or fast-occurring space-time patterns. In this work, we introduce BIMBA, an efficient state-space model to handle long-form videos. Our model leverages the selective scan algorithm to learn to effectively select critical information from high-dimensional video and transform it into a reduced token sequence for efficient LLM processing. Extensive experiments demonstrate that BIMBA achieves state-of-the-art accuracy on multiple long-form VQA benchmarks, including PerceptionTest, NExT-QA, EgoSchema, VNBench, LongVideoBench, and Video-MME. Code and models are publicly available at https://sites.google.com/view/bimba-mllm.
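To make the contrast between pooling and a selective scan concrete, the following is a minimal toy sketch, not BIMBA's actual architecture. It compresses a 1-D sequence of scalar "tokens" in two ways: plain average pooling, and a scalar state-space recurrence whose step size depends on the input. Everything here is an assumption made for illustration: the function names, the hand-set parameters `w` and `b`, the use of `|x|` as a salience signal (a real model would learn this), the fixed choices A = -1 and B = 1 with zero-order-hold discretization, and reading out the state at the end of each chunk.

```python
import math


def softplus(z):
    # Smooth positive map, used here to keep the step size strictly positive.
    return math.log1p(math.exp(z))


def _chunk_size(n_tokens, out_len):
    # Both toy compressors assume the input splits into equal chunks.
    if out_len <= 0 or n_tokens % out_len != 0:
        raise ValueError("out_len must be positive and divide the number of tokens")
    return n_tokens // out_len


def average_pool(tokens, out_len):
    """Naive compression: the mean of each contiguous chunk."""
    chunk = _chunk_size(len(tokens), out_len)
    return [sum(tokens[i * chunk:(i + 1) * chunk]) / chunk for i in range(out_len)]


def selective_scan_compress(tokens, out_len, w=2.0, b=-3.0):
    """Toy scalar selective scan (hypothetical, hand-set parameters).

    Recurrence: h_t = a_t * h_{t-1} + (1 - a_t) * x_t, with a_t = exp(-delta_t).
    The step size delta_t = softplus(w * |x_t| + b) depends on the input, so
    large-magnitude tokens overwrite the state and small ones barely change it.
    """
    chunk = _chunk_size(len(tokens), out_len)
    h = 0.0
    out = []
    for t, x in enumerate(tokens):
        delta = softplus(w * abs(x) + b)   # input-dependent step size
        a_bar = math.exp(-delta)           # discretized state transition
        h = a_bar * h + (1.0 - a_bar) * x  # state update
        if (t + 1) % chunk == 0:           # read out once per chunk
            out.append(h)
    return out


# One brief "event" (the 5.0) hidden among redundant zeros.
tokens = [0.0] * 5 + [5.0] + [0.0] * 10
print(average_pool(tokens, 4))
print(selective_scan_compress(tokens, 4))
```

With this input, average pooling dilutes the single spike across its chunk, while the scan's readout for the same chunk stays close to the spike's value and decays only slowly afterwards. Both functions return the same number of output tokens, so the difference lies only in what each compressed token retains.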

Code Repositories

md-mohaiminul/BIMBA (official, PyTorch)

Benchmarks

Benchmark | Methodology | Metrics
video-question-answering-on-next-qa | BIMBA-LLaVA-Qwen2-7B | Accuracy: 83.73
video-question-answering-on-perception-test | BIMBA-LLaVA-Qwen2-7B | Accuracy (Top-1): 68.51
zero-shot-video-question-answer-on-egoschema-1 | BIMBA-LLaVA-Qwen2-7B | Accuracy: 71.14
zero-shot-video-question-answer-on-video-mme-1 | BIMBA-LLaVA-Qwen2-7B | Accuracy (%): 64.67
