MIST: Multi-modal Iterative Spatial-Temporal Transformer for Long-form Video Question Answering

Difei Gao Luowei Zhou Lei Ji Linchao Zhu Yi Yang Mike Zheng Shou

Abstract

To build Video Question Answering (VideoQA) systems capable of assisting humans in daily activities, seeking answers from long-form videos with diverse and complex events is a must. Existing multi-modal VideoQA models achieve promising performance on images or short video clips, especially with the recent success of large-scale multi-modal pre-training. However, when extending these methods to long-form videos, new challenges arise. On the one hand, a dense video sampling strategy is computationally prohibitive. On the other hand, methods relying on sparse sampling struggle in scenarios where multi-event and multi-granularity visual reasoning are required. In this work, we introduce a new model named Multi-modal Iterative Spatial-temporal Transformer (MIST) to better adapt pre-trained models for long-form VideoQA. Specifically, MIST decomposes traditional dense spatial-temporal self-attention into cascaded segment and region selection modules that adaptively select frames and image regions closely relevant to the question itself. Visual concepts at different granularities are then processed efficiently through an attention module. In addition, MIST iteratively conducts selection and attention over multiple layers to support reasoning over multiple events. Experimental results on four VideoQA datasets, including AGQA, NExT-QA, STAR, and Env-QA, show that MIST achieves state-of-the-art performance and is superior in computational efficiency and interpretability.
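The cascaded selection-then-attention idea described above can be sketched in a few lines. The code below is a hypothetical, simplified illustration (not the authors' implementation, which is at showlab/mist): each layer scores video segments against the question, keeps the top-k segments, then scores and keeps the top-k regions inside them, and finally lets the question cross-attend over the surviving region features; stacking layers lets later iterations focus on different events. All shapes, function names, and the k values are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def mist_layer(q, segments, k_seg=2, k_reg=4):
    """One hypothetical MIST-style layer.

    q: (d,) question embedding; segments: (n_seg, n_reg, d) region features.
    """
    # Segment selection: score each segment by the similarity of its
    # pooled feature to the question, keep the top-k_seg segments.
    seg_feats = segments.mean(axis=1)            # (n_seg, d)
    top_seg = np.argsort(seg_feats @ q)[-k_seg:]
    # Region selection within the chosen segments only.
    regions = segments[top_seg].reshape(-1, segments.shape[-1])
    top_reg = np.argsort(regions @ q)[-k_reg:]
    selected = regions[top_reg]                  # (k_reg, d)
    # Attention: the question attends over the selected regions,
    # instead of dense attention over every region of every frame.
    attn = softmax(selected @ q / np.sqrt(q.shape[0]))
    return attn @ selected                       # (d,) update for q

def mist(q, segments, n_layers=2):
    # Iterating selection + attention lets later layers pick
    # different segments conditioned on the refined question state,
    # supporting multi-event reasoning.
    for _ in range(n_layers):
        q = q + mist_layer(q, segments)
    return q
```

Because each layer attends over only k_reg selected regions rather than all n_seg * n_reg tokens, the cost per layer is roughly linear in the selection budget, which is the efficiency argument the abstract makes.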

Code Repositories

showlab/mist (official, PyTorch)

Benchmarks

Benchmark                                       Methodology   Metrics
video-question-answering-on-agqa-2-0-balanced   MIST - AIO    Average Accuracy: 50.96
video-question-answering-on-agqa-2-0-balanced   MIST - CLIP   Average Accuracy: 54.39
video-question-answering-on-next-qa             MIST          Accuracy: 57.2
video-question-answering-on-situated            MIST          Average Accuracy: 51.13
