Revealing Single Frame Bias for Video-and-Language Learning

Jie Lei, Tamara L. Berg, Mohit Bansal

Abstract
Training an effective video-and-language model intuitively requires multiple frames as model inputs. However, it is unclear whether using multiple frames is beneficial to downstream tasks, and if yes, whether the performance gain is worth the drastically-increased computation and memory costs resulting from using more frames. In this work, we explore single-frame models for video-and-language learning. On a diverse set of video-and-language tasks (including text-to-video retrieval and video question answering), we show the surprising result that, with large-scale pre-training and a proper frame ensemble strategy at inference time, a single-frame trained model that does not consider temporal information can achieve better performance than existing methods that use multiple frames for training. This result reveals the existence of a strong "static appearance bias" in popular video-and-language datasets. Therefore, to allow for a more comprehensive evaluation of video-and-language models, we propose two new retrieval tasks based on existing fine-grained action recognition datasets that encourage temporal modeling. Our code is available at https://github.com/jayleicn/singularity
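The key inference-time idea in the abstract is a frame ensemble: a model trained on single frames scores several test-time frames independently, and the per-frame scores are aggregated into one video-level score. Below is a minimal sketch of how such an ensemble step could look for retrieval-style scoring, assuming pre-computed, L2-normalized per-frame video embeddings and a text embedding; the function name, tensor shapes, and mean/max aggregation choices are illustrative assumptions, not code from the Singularity repository.

```python
import torch
import torch.nn.functional as F

def frame_ensemble_score(frame_embeds, text_embed, strategy="mean"):
    """Score a (video, text) pair by ensembling per-frame similarities.

    frame_embeds: (num_frames, dim) L2-normalized embeddings, one per sampled
                  test-time frame, each produced by the single-frame encoder.
    text_embed:   (dim,) L2-normalized text embedding.
    strategy:     how to aggregate the per-frame scores ("mean" or "max").
    """
    # Cosine similarity between the text and every sampled frame.
    per_frame_sims = frame_embeds @ text_embed  # (num_frames,)
    if strategy == "mean":
        return per_frame_sims.mean()
    if strategy == "max":
        return per_frame_sims.max()
    raise ValueError(f"unknown strategy: {strategy}")

# Example with illustrative sizes: 4 test-time frames, 256-dim embeddings.
frames = F.normalize(torch.randn(4, 256), dim=-1)
text = F.normalize(torch.randn(256), dim=-1)
score = frame_ensemble_score(frames, text, strategy="mean")
```

The point of the sketch is only the aggregation step: no temporal modeling is applied across frames, yet using more frames at inference still improves the video-level score.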
Code Repositories

https://github.com/jayleicn/singularity
Benchmarks
Video Question Answering (Accuracy)

| Benchmark | Methodology | Accuracy |
|---|---|---|
| ActivityNet-QA | Singularity-temporal | 44.1 |
| ActivityNet-QA | Singularity | 43.1 |
| MSRVTT-MC | Singularity-temporal | 93.7 |
| MSRVTT-MC | Singularity | 92.1 |
| MSRVTT-QA | Singularity-temporal | 43.9 |
| MSRVTT-QA | Singularity | 43.5 |

Text-to-Video Retrieval (Recall@K)

| Benchmark | Methodology | R@1 | R@5 | R@10 |
|---|---|---|---|---|
| ActivityNet | Singularity | 47.1 | 75.5 | 85.5 |
| DiDeMo | Singularity | 53.9 | 79.4 | 86.9 |
| MSR-VTT-1kA | Singularity | 41.5 | 68.7 | 77.0 |
| SSv2-Label | Singularity-temporal | 47.4 | 75.9 | 84.0 |
| SSv2-Template | Singularity-temporal | 77.6 | 96.0 | 98.9 |

Zero-Shot Text-to-Video Retrieval (Recall@K)

| Benchmark | Methodology | R@1 | R@5 | R@10 |
|---|---|---|---|---|
| ActivityNet | Singularity-temporal-17M | 30.6 | 55.6 | 66.9 |
| ActivityNet | Singularity-temporal-5M | 30.8 | 55.9 | 66.3 |
| DiDeMo | Singularity-5M | 36.9 | 61.1 | 69.3 |
| DiDeMo | Singularity-17M | 37.1 | 61.7 | 69.9 |
| MSR-VTT | Singularity-17M | 34.0 | 56.7 | 66.7 |
| MSR-VTT | Singularity-5M | 28.4 | 50.2 | 59.5 |