Abstract

In this paper, we efficiently transfer the superior representation power of vision foundation models, such as ViT and Swin, to video understanding with only a few trainable parameters. Previous adaptation methods have handled spatial and temporal modeling simultaneously with a unified learnable module, but they still fall short of fully leveraging the representational capabilities of image transformers. We argue that the popular dual-path (two-stream) architecture in video models can mitigate this problem. We propose a novel DualPath adaptation separated into spatial and temporal adaptation paths, where a lightweight bottleneck adapter is employed in each transformer block. In particular, for temporal dynamic modeling, we incorporate consecutive frames into a grid-like frameset to closely imitate the vision transformer's ability to extrapolate relationships between tokens. In addition, we extensively investigate multiple baselines from a unified perspective in video understanding and compare them with DualPath. Experimental results on four action recognition benchmarks demonstrate that pretrained image transformers with DualPath can be effectively generalized beyond the data domain.
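
To make the two building blocks named above concrete, the following is a minimal NumPy sketch, not the authors' implementation. The names `frames_to_grid` and `BottleneckAdapter` are hypothetical, and the row-major grid layout, the ReLU nonlinearity, the zero-initialized up-projection, and the bottleneck width are all assumptions, since the abstract does not specify them.

```python
import numpy as np


def frames_to_grid(frames: np.ndarray, rows: int, cols: int) -> np.ndarray:
    """Tile T consecutive frames (T, H, W, C) row-major into one image.

    The result has shape (rows * H, cols * W, C), so an image transformer
    can attend across frames as if they were regions of a single picture.
    The row-major ordering is an assumption made for this sketch.
    """
    t, h, w, c = frames.shape
    if t != rows * cols:
        raise ValueError(f"expected {rows * cols} frames, got {t}")
    grid = frames.reshape(rows, cols, h, w, c)
    grid = grid.transpose(0, 2, 1, 3, 4)  # -> (rows, H, cols, W, C)
    return grid.reshape(rows * h, cols * w, c)


class BottleneckAdapter:
    """Residual down-project / nonlinearity / up-project module (sketch)."""

    def __init__(self, dim: int, bottleneck: int, seed: int = 0) -> None:
        rng = np.random.default_rng(seed)
        self.w_down = rng.normal(0.0, 0.02, size=(dim, bottleneck))
        self.b_down = np.zeros(bottleneck)
        # Assumption: a zero-initialized up-projection, so the adapter
        # starts out as an identity map on the frozen backbone's tokens.
        self.w_up = np.zeros((bottleneck, dim))
        self.b_up = np.zeros(dim)

    def __call__(self, tokens: np.ndarray) -> np.ndarray:
        # ReLU is an assumption; the abstract does not name the activation.
        hidden = np.maximum(tokens @ self.w_down + self.b_down, 0.0)
        return tokens + hidden @ self.w_up + self.b_up

    def num_parameters(self) -> int:
        params = (self.w_down, self.b_down, self.w_up, self.b_up)
        return sum(p.size for p in params)


# Usage: four 2x2 single-channel frames tiled into one 4x4 image,
# and an adapter applied to a (tokens, dim) array.
frames = np.arange(4 * 2 * 2 * 1).reshape(4, 2, 2, 1)
grid = frames_to_grid(frames, rows=2, cols=2)
adapter = BottleneckAdapter(dim=8, bottleneck=2)
tokens = np.random.default_rng(1).normal(size=(5, 8))
adapted = adapter(tokens)
```

In this sketch only the adapter weights would be trained while the backbone stays frozen, which is how an adapter of this shape keeps the trainable parameter count small relative to the transformer block it sits in.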
