
Abstract

Video Question Answering (VQA) in long videos poses the key challenge of extracting relevant information and modeling long-range dependencies from many redundant frames. The self-attention mechanism provides a general solution for sequence modeling, but it has a prohibitive cost when applied to a massive number of spatiotemporal tokens in long videos. Most prior methods rely on compression strategies to lower the computational cost, such as reducing the input length via sparse frame sampling or compressing the output sequence passed to the large language model (LLM) via space-time pooling. However, these naive approaches over-represent redundant information and often miss salient events or fast-occurring space-time patterns. In this work, we introduce BIMBA, an efficient state-space model to handle long-form videos. Our model leverages the selective scan algorithm to learn to effectively select critical information from high-dimensional video and transform it into a reduced token sequence for efficient LLM processing. Extensive experiments demonstrate that BIMBA achieves state-of-the-art accuracy on multiple long-form VQA benchmarks, including PerceptionTest, NExT-QA, EgoSchema, VNBench, LongVideoBench, and Video-MME. Code and models are publicly available at https://sites.google.com/view/bimba-mllm.
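To make the idea of a selective scan feeding a reduced token sequence concrete, the following is a minimal single-channel sketch in plain Python. It is not the BIMBA implementation: the scalar weights `w_delta`, `w_b`, `w_c`, the decay `a`, and the function names are hypothetical, and the strided readout in `compress_tokens` is an assumption chosen only to illustrate how a long input can be mapped to fewer outputs in linear time.

```python
import math


def softplus(z):
    # Numerically stable softplus; keeps the step size positive.
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def selective_scan(xs, a, w_delta, w_b, w_c):
    """Run a scalar selective scan over the sequence xs.

    The step size, input gate, and output gate all depend on the current
    input, which is what makes the scan "selective". All weights are
    hypothetical scalars; real models use learned projections.
    """
    h = 0.0  # recurrent state carried across the whole sequence
    ys = []
    for x in xs:
        delta = softplus(w_delta * x)  # input-dependent step size
        a_bar = math.exp(delta * a)    # discretized decay (in (0, 1) if a < 0)
        b_t = w_b * x                  # input-dependent write gate
        c_t = w_c * x                  # input-dependent read gate
        h = a_bar * h + delta * b_t * x
        ys.append(c_t * h)
    return ys


def compress_tokens(xs, num_out, a, w_delta, w_b, w_c):
    """Scan all of xs, then keep num_out evenly strided outputs.

    Assumption: a strided readout stands in for whatever selection the
    actual model uses. Each kept output still depends on every earlier
    input through the recurrent state.
    """
    if not 1 <= num_out <= len(xs):
        raise ValueError("num_out must be between 1 and len(xs)")
    ys = selective_scan(xs, a, w_delta, w_b, w_c)
    stride = len(ys) // num_out
    return [ys[(i + 1) * stride - 1] for i in range(num_out)]
```

For example, `compress_tokens([1.0] * 8, num_out=2, a=-1.0, w_delta=1.0, w_b=1.0, w_c=1.0)` returns two values from a single pass over eight inputs. The cost grows linearly with the input length, in contrast to the quadratic cost of self-attention over the same tokens.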
