Qin Hangyu; Xiao Junbin; Yao Angela

Abstract
This paper presents question-answering on dense video events, a novel task that answers and grounds dense-event questions in long videos, thus challenging MLLMs to faithfully comprehend and reason about multiple events over extended periods of time. To facilitate the study, we construct DeVE-QA -- a dataset featuring 78K questions about 26K events on 10.6K long videos. Our benchmarking shows that state-of-the-art MLLMs struggle on DeVE-QA. For improvement, we propose DeVi, a novel training-free MLLM approach that highlights a hierarchical captioning module, a temporal event memory module, and a self-consistency checking module to respectively detect, contextualize and memorize, and ground dense events in long videos for question answering. Extensive experiments show that DeVi is superior at answering dense-event questions and grounding relevant video moments. Compared with existing MLLMs, it achieves a notable increase of 4.8% and 2.1% in G(round)QA accuracy on DeVE-QA and NExT-GQA, respectively. Data and code are available at https://github.com/QHUni/DeVE-QA.
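The abstract names three modules (hierarchical captioning, temporal event memory, self-consistency checking) but does not specify their implementation. Below is a minimal, hypothetical Python sketch of how such a training-free pipeline could be wired together; every identifier here (`caption_clip`, `ask_llm`, the `Event` type, the caption-merge rule, the majority-vote grounding score) is an illustrative stand-in, not the authors' actual method, and a real system would back the stubs with an MLLM such as GPT-4.

```python
# Illustrative sketch of a training-free dense-event QA pipeline in the
# spirit of DeVi's three named modules. All function bodies are stubs.

from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Event:
    start: float          # event start time (seconds)
    end: float            # event end time (seconds)
    caption: str          # natural-language description of the event
    context: str = ""     # neighbourhood summary filled in by the memory module


def caption_clip(clip_id: int) -> str:
    """Hypothetical stand-in for an MLLM captioner run on one short clip."""
    return f"caption for clip {clip_id}"


def hierarchical_captioning(num_clips: int, clip_len: float) -> List[Event]:
    """Module 1 (assumed behaviour): caption short clips, then merge adjacent
    clips with identical captions into coarser events (a toy merge rule)."""
    events: List[Event] = []
    for i in range(num_clips):
        cap = caption_clip(i)
        if events and events[-1].caption == cap:
            events[-1].end = (i + 1) * clip_len   # same event continues
        else:
            events.append(Event(i * clip_len, (i + 1) * clip_len, cap))
    return events


def temporal_event_memory(events: List[Event], window: int = 2) -> None:
    """Module 2 (assumed behaviour): contextualize each event with its
    temporal neighbours so later reasoning can reference them."""
    for i, ev in enumerate(events):
        neighbours = events[max(0, i - window):i] + events[i + 1:i + 1 + window]
        ev.context = " | ".join(n.caption for n in neighbours)


def ask_llm(prompt: str) -> str:
    """Hypothetical LLM call; replace with a real MLLM API in practice."""
    return "stub answer"


def self_consistency_answer(question: str, events: List[Event],
                            n_samples: int = 3) -> Tuple[str, Event]:
    """Module 3 (assumed behaviour): sample several answers, keep the
    majority vote, and ground it in the best-matching event."""
    votes: dict = {}
    for _ in range(n_samples):
        prompt = f"Q: {question}\nEvents: " + "; ".join(
            f"[{e.start:.0f}-{e.end:.0f}s] {e.caption}" for e in events)
        ans = ask_llm(prompt)
        votes[ans] = votes.get(ans, 0) + 1
    best = max(votes, key=votes.get)
    # Toy grounding score: word overlap between answer and event caption.
    grounded = max(events, key=lambda e: len(set(e.caption.split())
                                              & set(best.split())))
    return best, grounded


if __name__ == "__main__":
    evs = hierarchical_captioning(num_clips=6, clip_len=10.0)
    temporal_event_memory(evs)
    answer, moment = self_consistency_answer("What happened last?", evs)
    print(answer, f"(grounded at {moment.start:.0f}-{moment.end:.0f}s)")
```

The point of the sketch is the data flow, not the stub logic: detection produces time-stamped events, the memory step attaches temporal context, and answering samples the LLM several times so agreement between samples selects both the answer and its grounding moment.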
Code Repositories
https://github.com/QHUni/DeVE-QA
Benchmarks
| Benchmark | Method | Metric |
|---|---|---|
| Zero-Shot Video Question Answer on NExT-GQA | DeVi (GPT-4) | Acc@GQA: 28.0 |