HyperAIHyperAI

Command Palette

Search for a command to run...

5 months ago

SViTT: Temporal Learning of Sparse Video-Text Transformers

Yi Li; Kyle Min; Subarna Tripathi; Nuno Vasconcelos

SViTT: Temporal Learning of Sparse Video-Text Transformers

Abstract

Do video-text transformers learn to model temporal relationships across frames? Despite their immense capacity and the abundance of multimodal training data, recent work has revealed the strong tendency of video-text models towards frame-based spatial representations, while temporal reasoning remains largely unsolved. In this work, we identify several key challenges in temporal learning of video-text transformers: the spatiotemporal trade-off from limited network size; the curse of dimensionality for multi-frame modeling; and the diminishing returns of semantic information by extending clip length. Guided by these findings, we propose SViTT, a sparse video-text architecture that performs multi-frame reasoning with significantly lower cost than naive transformers with dense attention. Analogous to graph-based networks, SViTT employs two forms of sparsity: edge sparsity that limits the query-key communications between tokens in self-attention, and node sparsity that discards uninformative visual tokens. Trained with a curriculum which increases model sparsity with the clip length, SViTT outperforms dense transformer baselines on multiple video-text retrieval and question answering benchmarks, with a fraction of computational cost. Project page: http://svcl.ucsd.edu/projects/svitt.

Code Repositories

jerryyli/svitt
Official
pytorch
Mentioned in GitHub

Benchmarks

BenchmarkMethodologyMetrics
video-question-answering-on-agqa-2-0-balancedSViTT
Average Accuracy: 52.7

Build AI with AI

From idea to launch — accelerate your AI development with free AI co-coding, out-of-the-box environment and best price of GPUs.

AI Co-coding
Ready-to-use GPUs
Best Pricing
Get Started

Hyper Newsletters

Subscribe to our latest updates
We will deliver the latest updates of the week to your inbox at nine o'clock every Monday morning
Powered by MailChimp
SViTT: Temporal Learning of Sparse Video-Text Transformers | Papers | HyperAI