Just Ask: Learning to Answer Questions from Millions of Narrated Videos
Antoine Yang, Antoine Miech, Josef Sivic, Ivan Laptev, Cordelia Schmid

Abstract
Recent methods for visual question answering rely on large-scale annotated datasets. Manual annotation of questions and answers for videos, however, is tedious, expensive and prevents scalability. In this work, we propose to avoid manual annotation and generate a large-scale training dataset for video question answering making use of automatic cross-modal supervision. We leverage a question generation transformer trained on text data and use it to generate question-answer pairs from transcribed video narrations. Given narrated videos, we then automatically generate the HowToVQA69M dataset with 69M video-question-answer triplets. To handle the open vocabulary of diverse answers in this dataset, we propose a training procedure based on a contrastive loss between a video-question multi-modal transformer and an answer transformer. We introduce the zero-shot VideoQA task and show excellent results, in particular for rare answers. Furthermore, we demonstrate our method to significantly outperform the state of the art on MSRVTT-QA, MSVD-QA, ActivityNet-QA and How2QA. Finally, for a detailed evaluation we introduce iVQA, a new VideoQA dataset with reduced language biases and high-quality redundant manual annotations. Our code, datasets and trained models are available at https://antoyang.github.io/just-ask.html.
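The contrastive training idea mentioned in the abstract can be illustrated with a small sketch. The code below is not the authors' implementation: it assumes the video-question encoder and the answer encoder have already produced fixed-size embeddings (plain Python lists here), and it uses a generic in-batch softmax contrastive loss in which the answer paired with each video-question embedding is the positive and the other answers in the batch act as negatives. The function names `contrastive_loss` and `predict_answer` are hypothetical.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def contrastive_loss(vq_embeddings, answer_embeddings):
    """Mean softmax cross-entropy over a batch.

    Assumption: vq_embeddings[i] and answer_embeddings[i] form a matching
    pair; every other answer in the batch is treated as a negative.
    """
    losses = []
    for i, vq in enumerate(vq_embeddings):
        scores = [dot(vq, a) for a in answer_embeddings]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(scores)
        log_norm = m + math.log(sum(math.exp(s - m) for s in scores))
        losses.append(log_norm - scores[i])
    return sum(losses) / len(losses)


def predict_answer(vq, answer_embeddings):
    """Return the index of the answer whose embedding scores highest."""
    scores = [dot(vq, a) for a in answer_embeddings]
    return max(range(len(scores)), key=scores.__getitem__)


# Toy batch: two video-question embeddings and two answer embeddings.
vq_batch = [[5.0, 0.0], [0.0, 5.0]]
answers = [[1.0, 0.0], [0.0, 1.0]]
aligned = contrastive_loss(vq_batch, answers)
misaligned = contrastive_loss(vq_batch, answers[::-1])
```

Because answers are scored by a dot product with their embeddings rather than by a fixed classification layer, the same scoring function can in principle rank any candidate answer that the answer encoder can embed, which is one way to read the abstract's point about handling an open answer vocabulary.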
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| video-question-answering-on-activitynet-qa | Just Ask (fine-tune) | Accuracy: 38.9 |
| video-question-answering-on-activitynet-qa | Just Ask (0-shot) | Accuracy: 12.2 |
| video-question-answering-on-how2qa | Just Ask | Accuracy: 84.4 |
| video-question-answering-on-how2qa | Just Ask (0-shot) | Accuracy: 51.1 |
| video-question-answering-on-ivqa | Just Ask (0-shot) | Accuracy: 12.2 |
| video-question-answering-on-ivqa | Just Ask (fine-tune) | Accuracy: 35.4 |
| video-question-answering-on-videoqa | Just Ask (fine-tune) | Accuracy: 15.6 |
| visual-question-answering-on-msrvtt-qa-2 | Just Ask | Accuracy: 41.5 |
| visual-question-answering-on-msvd-qa-2 | Just Ask | Accuracy: 46.3 |