
Question-Answering Dense Video Events

Hangyu Qin, Junbin Xiao, Angela Yao

Abstract

This paper presents question-answering on dense video events, a novel task that answers and grounds dense-event questions in long videos, thus challenging MLLMs to faithfully comprehend and reason about multiple events over extended periods of time. To facilitate the study, we construct DeVE-QA -- a dataset featuring 78K questions about 26K events on 10.6K long videos. Our benchmarking shows that state-of-the-art MLLMs struggle on DeVE-QA. For improvement, we propose DeVi, a novel training-free MLLM approach that highlights a hierarchical captioning module, a temporal event memory module, and a self-consistency checking module to respectively detect, contextualize and memorize, and ground dense events in long videos for question answering. Extensive experiments show that DeVi is superior at answering dense-event questions and grounding relevant video moments. Compared with existing MLLMs, it achieves a notable increase of 4.8% and 2.1% in G(round)QA accuracy on DeVE-QA and NExT-GQA, respectively. Data and code are available at https://github.com/QHUni/DeVE-QA.
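The abstract describes a three-stage, training-free pipeline: detect events via hierarchical captioning, contextualize and memorize them in a temporal event memory, then answer and ground questions with a self-consistency check. The following is only a toy sketch of that flow, not the paper's implementation: all names (`Event`, `hierarchical_captions`, `answer_with_grounding`, `self_consistent`) are hypothetical, and keyword overlap stands in for the MLLM captioning and reasoning calls the actual method uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Event:
    """A captioned video segment with a time span in seconds."""
    start: float
    end: float
    caption: str


def hierarchical_captions(clip_events, group_size=2):
    """Toy hierarchical captioning: merge fine-grained clip captions into
    coarser event summaries, level by level, until one root summary remains.
    Returns all levels (finest first) as the 'temporal event memory'."""
    levels = [list(clip_events)]
    while len(levels[-1]) > 1:
        prev = levels[-1]
        merged = [
            Event(chunk[0].start, chunk[-1].end,
                  "; ".join(e.caption for e in chunk))
            for chunk in (prev[i:i + group_size]
                          for i in range(0, len(prev), group_size))
        ]
        levels.append(merged)
    return levels


def answer_with_grounding(memory, question_keywords):
    """Toy QA + grounding: pick the finest-level event whose caption overlaps
    the question keywords; return (answer caption, grounded time span)."""
    best = max(memory[0],
               key=lambda e: sum(k in e.caption for k in question_keywords))
    return best.caption, (best.start, best.end)


def self_consistent(span, memory, question_keywords, trials=3):
    """Toy self-consistency check: re-run grounding several times and accept
    the span only if every trial agrees (trivially stable here, but an MLLM
    backend would be stochastic)."""
    spans = {answer_with_grounding(memory, question_keywords)[1]
             for _ in range(trials)}
    return spans == {span}
```

A usage example under the same toy assumptions: caption four 5-second clips, build the memory, then answer a dense-event question about the dog.

```python
clips = [
    Event(0, 5, "a man opens a door"),
    Event(5, 10, "the man sits down"),
    Event(10, 15, "a dog enters the room"),
    Event(15, 20, "the dog barks"),
]
memory = hierarchical_captions(clips)
answer, span = answer_with_grounding(memory, ["dog", "enters"])
# span is (10, 15) -- the grounded moment for the question
assert self_consistent(span, memory, ["dog", "enters"])
```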

Code Repositories

QHUni/DeVE-QA: https://github.com/QHUni/DeVE-QA

Benchmarks

Benchmark: zero-shot-video-question-answer-on-next-gqa
Methodology: DeVi (GPT-4)
Metrics: Acc@GQA: 28.0

