Shyamal Buch, Cristóbal Eyzaguirre, Adrien Gaidon, Jiajun Wu, Li Fei-Fei, Juan Carlos Niebles

Abstract
What makes a video task uniquely suited for videos, beyond what can be understood from a single image? Building on recent progress in self-supervised image-language models, we revisit this question in the context of video and language tasks. We propose the atemporal probe (ATP), a new model for video-language analysis which provides a stronger bound on the baseline accuracy of multimodal models constrained by image-level understanding. By applying this model to standard discriminative video and language tasks, such as video question answering and text-to-video retrieval, we characterize the limitations and potential of current video-language benchmarks. We find that understanding of event temporality is often not necessary to achieve strong or state-of-the-art performance, even compared with recent large-scale video-language models and in contexts intended to benchmark deeper video-level understanding. We also demonstrate how ATP can improve both video-language dataset and model design. We describe a technique for leveraging ATP to better disentangle dataset subsets with a higher concentration of temporally challenging data, improving benchmarking efficacy for causal and temporal understanding. Further, we show that effectively integrating ATP into full video-level temporal models can improve efficiency and state-of-the-art accuracy.
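The abstract does not spell out how ATP is used to separate temporally challenging data, so the following is only a minimal sketch under a stated assumption: that examples an image-level probe answers incorrectly are treated as candidates for the harder subset. The function name `split_by_probe` and the toy predictions are hypothetical and are not taken from the paper or its code.

```python
from typing import List, Sequence, Tuple


def split_by_probe(
    probe_preds: Sequence[int], labels: Sequence[int]
) -> Tuple[List[int], List[int]]:
    """Partition example indices by whether an image-level probe got them right.

    Assumption (not from the paper): an example the probe answers correctly
    is treated as solvable without temporal reasoning; the rest are kept as
    candidates for a temporally challenging subset.
    """
    if len(probe_preds) != len(labels):
        raise ValueError("probe_preds and labels must have the same length")
    solved: List[int] = []
    unsolved: List[int] = []
    for idx, (pred, gold) in enumerate(zip(probe_preds, labels)):
        if pred == gold:
            solved.append(idx)
        else:
            unsolved.append(idx)
    return solved, unsolved


# Toy data: predicted and gold answer-choice indices for five questions.
probe_preds = [0, 2, 1, 3, 1]
labels = [0, 1, 1, 3, 2]
solved, unsolved = split_by_probe(probe_preds, labels)
print("probe-solved:", solved)
print("probe-unsolved:", unsolved)
```

A full video-level model could then be evaluated separately on the probe-unsolved indices, which under this assumption contain a higher share of questions that single-frame understanding does not resolve.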
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| video-question-answering-on-how2qa | ATP | Accuracy: 65.1 |
| video-question-answering-on-msr-vtt-mc | ATP (1<-16) | Accuracy: 93.2 |
| video-question-answering-on-next-qa | ATP | Accuracy: 54.3 |
| video-question-answering-on-situated | Temp[ATP] | Average Accuracy: 48.37 |