3 months ago

ActionCLIP: A New Paradigm for Video Action Recognition

Mengmeng Wang Jiazheng Xing Yong Liu

Abstract

The canonical approach to video action recognition dictates a neural model to do a classic and standard 1-of-N majority vote task. They are trained to predict a fixed set of predefined categories, limiting their transferable ability on new datasets with unseen concepts. In this paper, we provide a new perspective on action recognition by attaching importance to the semantic information of label texts rather than simply mapping them into numbers. Specifically, we model this task as a video-text matching problem within a multimodal learning framework, which strengthens the video representation with more semantic language supervision and enables our model to do zero-shot action recognition without any further labeled data or parameters requirements. Moreover, to handle the deficiency of label texts and make use of tremendous web data, we propose a new paradigm based on this multimodal learning framework for action recognition, which we dub "pre-train, prompt and fine-tune". This paradigm first learns powerful representations from pre-training on a large amount of web image-text or video-text data. Then it makes the action recognition task to act more like pre-training problems via prompt engineering. Finally, it end-to-end fine-tunes on target datasets to obtain strong performance. We give an instantiation of the new paradigm, ActionCLIP, which not only has superior and flexible zero-shot/few-shot transfer ability but also reaches a top performance on general action recognition task, achieving 83.8% top-1 accuracy on Kinetics-400 with a ViT-B/16 as the backbone. Code is available at https://github.com/sallymmx/ActionCLIP.git

Code Repositories

sallymmx/actionclip

Official

pytorch

Mentioned in GitHub

towhee-io/towhee

pytorch

Benchmarks

Benchmark	Methodology	Metrics
action-classification-on-charades	ActionCLIP (ViT-B/16)	MAP: 44.3
action-classification-on-kinetics-400	ActionCLIP (CLIP-pretrained)	Acc@1: 83.8 Acc@5: 97.1
action-recognition-in-videos-on-kinetics-400-1	ActionCLIP (ViT-B/16)	Top-1 Accuracy: 83.8

Build AI with AI

From idea to launch — accelerate your AI development with free AI co-coding, out-of-the-box environment and best price of GPUs.

AI Co-coding

Ready-to-use GPUs

Best Pricing

Get Started

Hyper Newsletters

Subscribe to our latest updates

We will deliver the latest updates of the week to your inbox at nine o'clock every Monday morning

Command Palette