3 months ago

Zero-Shot Video Question Answering via Frozen Bidirectional Language Models

Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, Cordelia Schmid

Abstract

Video question answering (VideoQA) is a complex task that requires diverse multi-modal data for training. Manual annotation of questions and answers for videos, however, is tedious and prohibits scalability. To tackle this problem, recent methods consider zero-shot settings with no manual annotation of visual question-answer pairs. In particular, a promising approach adapts frozen autoregressive language models pretrained on Web-scale text-only data to multi-modal inputs. In contrast, we here build on frozen bidirectional language models (BiLM) and show that such an approach provides a stronger and cheaper alternative for zero-shot VideoQA. Specifically, (i) we combine visual inputs with the frozen BiLM using light trainable modules, (ii) we train such modules using Web-scraped multi-modal data, and finally (iii) we perform zero-shot VideoQA inference through masked language modeling, where the masked text is the answer to a given question. Our proposed approach, FrozenBiLM, outperforms the state of the art in zero-shot VideoQA by a significant margin on a variety of datasets, including LSMDC-FiB, iVQA, MSRVTT-QA, MSVD-QA, ActivityNet-QA, TGIF-FrameQA, How2QA and TVQA. It also demonstrates competitive performance in the few-shot and fully-supervised settings. Our code and models are publicly available at https://github.com/antoyang/FrozenBiLM.
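
Step (iii) of the abstract, answering by filling in a masked token, can be illustrated with a small sketch. This is not the authors' implementation: the prompt template, the function names (`build_prompt`, `answer_zero_shot`, `toy_scorer`) and the hard-coded scores are assumptions made for illustration, and the toy scorer stands in for a frozen BiLM that would also receive visual inputs.

```python
import math
from typing import Callable, Dict, List

MASK = "[MASK]"  # placeholder mask token; the real token depends on the tokenizer

# A scorer maps (prompt, candidate answers) to one score per candidate,
# standing in for the masked-position logits of a frozen bidirectional LM.
Scorer = Callable[[str, List[str]], Dict[str, float]]


def build_prompt(question: str, subtitles: str = "") -> str:
    """Turn a question into a cloze prompt whose masked slot is the answer.

    The template is illustrative, not taken from the listing above.
    """
    prompt = f"Question: {question} Answer: {MASK}."
    if subtitles:
        prompt += f" Subtitles: {subtitles}"
    return prompt


def softmax(scores: Dict[str, float]) -> Dict[str, float]:
    """Normalize raw scores into a probability distribution."""
    top = max(scores.values())
    exps = {k: math.exp(v - top) for k, v in scores.items()}
    total = sum(exps.values())
    return {k: v / total for k, v in exps.items()}


def answer_zero_shot(question: str, answer_vocab: List[str], score_mask: Scorer) -> str:
    """Pick the candidate answer with the highest probability at the mask."""
    prompt = build_prompt(question)
    probs = softmax(score_mask(prompt, answer_vocab))
    return max(probs, key=probs.get)


def toy_scorer(prompt: str, vocab: List[str]) -> Dict[str, float]:
    """Hypothetical stand-in for a frozen BiLM: fixed scores, no real model."""
    fixed = {"guitar": 2.0, "piano": 0.5, "drums": -1.0}
    return {word: fixed.get(word, 0.0) for word in vocab}


if __name__ == "__main__":
    question = "What instrument is the man playing?"
    print(build_prompt(question))
    print(answer_zero_shot(question, ["piano", "guitar", "drums"], toy_scorer))
```

In a real setting, `toy_scorer` would be replaced by a forward pass of the frozen language model over the text prompt and the video features, reading the logits at the masked position and restricting them to the answer vocabulary.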

Code Repositories

klauscc/dam (PyTorch, mentioned in GitHub)
antoyang/FrozenBiLM (official, PyTorch, mentioned in GitHub)
sts-vlcc/sts-vlcc (PyTorch, mentioned in GitHub)

Benchmarks

| Benchmark | Methodology | Metrics |
| --- | --- | --- |
| video-question-answering-on-activitynet-qa | FrozenBiLM | Accuracy: 43.2 |
| video-question-answering-on-activitynet-qa | FrozenBiLM (0-shot) | Accuracy: 25.9 |
| video-question-answering-on-how2qa | FrozenBiLM | Accuracy: 86.7 |
| video-question-answering-on-how2qa | FrozenBiLM (0-shot) | Accuracy: 58.4 |
| video-question-answering-on-ivqa | FrozenBiLM (0-shot) | Accuracy: 26.8 |
| video-question-answering-on-ivqa | FrozenBiLM | Accuracy: 39.6 |
| video-question-answering-on-msrvtt-qa | FrozenBiLM | Accuracy: 47.0 |
| video-question-answering-on-msrvtt-qa | FrozenBiLM (0-shot) | Accuracy: 16.7 |
| video-question-answering-on-tvqa | FrozenBiLM | Accuracy: 82 |
| visual-question-answering-on-msrvtt-qa-2 | FrozenBiLM | Accuracy: 0.470 |
| visual-question-answering-on-msvd-qa-2 | FrozenBiLM | Accuracy: 0.548 |
| zero-shot-learning-on-ivqa | FrozenBiLM | Accuracy: 0.268 |
| zero-shot-learning-on-lsmdc | FrozenBiLM | Accuracy: 51.5 |
| zero-shot-video-question-answer-on-egoschema-1 | FrozenBiLM | Accuracy: 26.9 |
| zero-shot-video-question-answer-on-tvqa | FrozenBiLM (with speech) | Accuracy: 59.7 |
| zero-shot-video-question-answer-on-tvqa | FrozenBiLM (no speech) | Accuracy: 29.7 |
| zeroshot-video-question-answer-on-activitynet | FrozenBiLM | Accuracy: 24.7; Confidence Score: - |
| zeroshot-video-question-answer-on-msvd-qa | FrozenBiLM | Accuracy: 33.8 |
| zeroshot-video-question-answer-on-tgif-qa | FrozenBiLM | Accuracy: 41.9 |
