Qin Hangyu; Xiao Junbin; Yao Angela

Abstract
This paper presents question-answering on dense video events, a novel task that answers and grounds dense-event questions in long videos, thus challenging MLLMs to faithfully comprehend and reason about multiple events over extended periods of time. To facilitate the study, we construct DeVE-QA -- a dataset featuring 78K questions about 26K events on 10.6K long videos. Our benchmarking shows that state-of-the-art MLLMs struggle on DeVE-QA. For improvement, we propose DeVi, a novel training-free MLLM approach that highlights a hierarchical captioning module, a temporal event memory module, and a self-consistency checking module to respectively detect, contextualize and memorize, and ground dense events in long videos for question answering. Extensive experiments show that DeVi is superior at answering dense-event questions and grounding relevant video moments. Compared with existing MLLMs, it achieves a notable increase of 4.8% and 2.1% in G(round)QA accuracy on DeVE-QA and NExT-GQA, respectively. Data and code are available at https://github.com/QHUni/DeVE-QA.
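The abstract names three modules (hierarchical captioning, temporal event memory, self-consistency checking) but does not specify their implementation. Below is a minimal, hypothetical Python sketch of how such a training-free pipeline could be wired together; every identifier here (`caption_clip`, `ask_llm`, the `Event` type, the caption-merge rule, the majority-vote grounding score) is an illustrative stand-in, not the authors' actual method, and a real system would back the stubs with an MLLM such as GPT-4.

```python
# Illustrative sketch of a training-free dense-event QA pipeline in the
# spirit of DeVi's three named modules. All function bodies are stubs.

from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Event:
    start: float          # event start time (seconds)
    end: float            # event end time (seconds)
    caption: str          # natural-language description of the event
    context: str = ""     # neighbourhood summary filled in by the memory module


def caption_clip(clip_id: int) -> str:
    """Hypothetical stand-in for an MLLM captioner run on one short clip."""
    return f"caption for clip {clip_id}"


def hierarchical_captioning(num_clips: int, clip_len: float) -> List[Event]:
    """Module 1 (assumed behaviour): caption short clips, then merge adjacent
    clips with identical captions into coarser events (a toy merge rule)."""
    events: List[Event] = []
    for i in range(num_clips):
        cap = caption_clip(i)
        if events and events[-1].caption == cap:
            events[-1].end = (i + 1) * clip_len   # same event continues
        else:
            events.append(Event(i * clip_len, (i + 1) * clip_len, cap))
    return events


def temporal_event_memory(events: List[Event], window: int = 2) -> None:
    """Module 2 (assumed behaviour): contextualize each event with its
    temporal neighbours so later reasoning can reference them."""
    for i, ev in enumerate(events):
        neighbours = events[max(0, i - window):i] + events[i + 1:i + 1 + window]
        ev.context = " | ".join(n.caption for n in neighbours)


def ask_llm(prompt: str) -> str:
    """Hypothetical LLM call; replace with a real MLLM API in practice."""
    return "stub answer"


def self_consistency_answer(question: str, events: List[Event],
                            n_samples: int = 3) -> Tuple[str, Event]:
    """Module 3 (assumed behaviour): sample several answers, keep the
    majority vote, and ground it in the best-matching event."""
    votes: dict = {}
    for _ in range(n_samples):
        prompt = f"Q: {question}\nEvents: " + "; ".join(
            f"[{e.start:.0f}-{e.end:.0f}s] {e.caption}" for e in events)
        ans = ask_llm(prompt)
        votes[ans] = votes.get(ans, 0) + 1
    best = max(votes, key=votes.get)
    # Toy grounding score: word overlap between answer and event caption.
    grounded = max(events, key=lambda e: len(set(e.caption.split())
                                              & set(best.split())))
    return best, grounded


if __name__ == "__main__":
    evs = hierarchical_captioning(num_clips=6, clip_len=10.0)
    temporal_event_memory(evs)
    answer, moment = self_consistency_answer("What happened last?", evs)
    print(answer, f"(grounded at {moment.start:.0f}-{moment.end:.0f}s)")
```

The point of the sketch is the data flow, not the stub logic: detection produces time-stamped events, the memory step attaches temporal context, and answering samples the LLM several times so agreement between samples selects both the answer and its grounding moment.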
Code Repositories
https://github.com/QHUni/DeVE-QA
Benchmarks
| Benchmark | Method | Metric |
|---|---|---|
| Zero-Shot Video Question Answer on NExT-GQA | DeVi (GPT-4) | Acc@GQA: 28.0 |