An Image Grid Can Be Worth a Video: Zero-shot Video Question Answering Using a VLM

Wonkyun Kim; Changin Choi; Wonseok Lee; Wonjong Rhee

Abstract

Stimulated by the sophisticated reasoning capabilities of recent Large Language Models (LLMs), a variety of strategies for bridging the video modality have been devised. A prominent strategy involves Video Language Models (VideoLMs), which train a learnable interface with video data to connect advanced vision encoders with LLMs. Recently, an alternative strategy has surfaced, employing readily available foundation models, such as VideoLMs and LLMs, across multiple stages for modality bridging. In this study, we introduce a simple yet novel strategy where only a single Vision Language Model (VLM) is utilized. Our starting point is the plain insight that a video comprises a series of images, or frames, interwoven with temporal information. The essence of video comprehension lies in adeptly managing the temporal aspects along with the spatial details of each frame. First, we transform a video into a single composite image by arranging multiple frames in a grid layout. The resulting image is termed an image grid. This format, while maintaining the appearance of a single image, effectively retains temporal information within the grid structure. Therefore, the image grid approach enables direct application of a single high-performance VLM without requiring any video-data training. Our extensive experimental analysis across ten zero-shot video question answering benchmarks, including five open-ended and five multiple-choice benchmarks, reveals that the proposed Image Grid Vision Language Model (IG-VLM) surpasses existing methods on nine of the ten benchmarks.
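
The frame-to-grid step described in the abstract can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the abstract does not say how frames are selected, so the evenly spaced sampling in `sample_frame_indices` is an assumption, and the function names, the toy nested-list frame format, and the grid dimensions used below are all hypothetical.

```python
from typing import List, Sequence

# Toy frame format (an assumption for illustration): rows of pixel values.
Frame = List[List[int]]


def sample_frame_indices(total_frames: int, num_samples: int) -> List[int]:
    """Pick indices spread evenly over the video (centre of each segment).

    Uniform sampling is an assumption; the abstract does not specify it.
    """
    if total_frames <= 0 or num_samples <= 0:
        raise ValueError("total_frames and num_samples must be positive")
    num_samples = min(num_samples, total_frames)
    segment = total_frames / num_samples
    return [int(segment * i + segment / 2) for i in range(num_samples)]


def make_image_grid(frames: Sequence[Frame], rows: int, cols: int) -> Frame:
    """Tile equally sized frames into one image, in row-major (reading) order.

    Temporal order is preserved by position: earlier frames sit further
    left and further up in the grid.
    """
    if len(frames) != rows * cols:
        raise ValueError("need exactly rows * cols frames")
    height = len(frames[0])
    width = len(frames[0][0])
    for frame in frames:
        if len(frame) != height or any(len(line) != width for line in frame):
            raise ValueError("all frames must have the same size")

    grid: Frame = []
    for r in range(rows):
        row_frames = frames[r * cols:(r + 1) * cols]
        for y in range(height):
            line: List[int] = []
            for frame in row_frames:
                line.extend(frame[y])  # place frames side by side
            grid.append(line)
    return grid
```

For example, `sample_frame_indices(60, 6)` returns `[5, 15, 25, 35, 45, 55]`, and tiling six 1 x 2 toy frames with `make_image_grid(frames, 2, 3)` yields a 2 x 6 image whose first row holds frames 0 to 2 and whose second row holds frames 3 to 5. The resulting single image is what would be passed, together with the question, to a VLM.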

Code Repositories

imagegridworth/IG-VLM (official, PyTorch)

Benchmarks

| Benchmark | Methodology | Metrics |
| --- | --- | --- |
| video-based-generative-performance | IG-VLM-GPT4v | Consistency: 3.13; Contextual Understanding: 3.61; Correctness of Information: 3.40; Detail Orientation: 2.80; Temporal Understanding: 2.89; mean: 3.17 |
| zero-shot-video-question-answer-on-intentqa | IG-VLM | Accuracy: 65.3 |
| zero-shot-video-question-answer-on-next-qa | IG-VLM (LLaVA v1.6) | Accuracy: 70.9 |
| zero-shot-video-question-answer-on-next-qa | IG-VLM (GPT-4) | Accuracy: 68.6 |
| zero-shot-video-question-answer-on-tvqa | IG-VLM (no speech, GPT-4V) | Accuracy: 57.8 |
| zeroshot-video-question-answer-on-activitynet | IG-VLM | Accuracy: 58.4; Confidence Score: 3.5 |
| zeroshot-video-question-answer-on-msrvtt-qa | IG-VLM | Accuracy: 63.8; Confidence Score: 3.5 |
| zeroshot-video-question-answer-on-msvd-qa | IG-VLM-34B | Accuracy: 79.6; Confidence Score: 4.1 |
| zeroshot-video-question-answer-on-tgif-qa | IG-VLM | Accuracy: 79.1; Confidence Score: 4.2 |
