MAtch, eXpand and Improve: Unsupervised Finetuning for Zero-Shot Action Recognition with Language Knowledge
Wei Lin; Leonid Karlinsky; Nina Shvetsova; Horst Possegger; Mateusz Kozinski; Rameswar Panda; Rogerio Feris; Hilde Kuehne; Horst Bischof

Abstract
Large scale Vision-Language (VL) models have shown tremendous success in aligning representations between visual and text modalities. This enables remarkable progress in zero-shot recognition, image generation & editing, and many other exciting tasks. However, VL models tend to over-represent objects while paying much less attention to verbs, and require additional tuning on video data for best zero-shot action recognition performance. While previous work relied on large-scale, fully-annotated data, in this work we propose an unsupervised approach. We adapt a VL model for zero-shot and few-shot action recognition using a collection of unlabeled videos and an unpaired action dictionary. Based on that, we leverage Large Language Models and VL models to build a text bag for each unlabeled video via matching, text expansion and captioning. We use those bags in a Multiple Instance Learning setup to adapt an image-text backbone to video data. Although finetuned on unlabeled video data, our resulting models demonstrate high transferability to numerous unseen zero-shot downstream tasks, improving the base VL model performance by up to 14%, and even comparing favorably to fully-supervised baselines in both zero-shot and few-shot video recognition transfer. The code will be released later at https://github.com/wlin-at/MAXI.
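To make the Multiple Instance Learning idea concrete, the sketch below shows one way a video embedding could be trained against a bag of candidate texts. The abstract does not state the exact objective, so the MIL-NCE-style loss here is an assumption, as are the function names, the temperature value, and the toy two-dimensional embeddings. It is an illustration of the general technique, not the authors' implementation.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]


def mil_nce_loss(video_embs, text_bags, temperature=0.07):
    """MIL-NCE-style loss (assumed objective, for illustration only).

    video_embs: one embedding per video.
    text_bags:  one list of text embeddings per video (its "text bag").
    Each video is pulled toward all texts in its own bag and pushed away
    from the texts in every other video's bag.
    """
    videos = [normalize(v) for v in video_embs]
    bags = [[normalize(t) for t in bag] for bag in text_bags]
    total = 0.0
    for i, v in enumerate(videos):
        # Numerator: similarity mass of the video's own bag (the positives).
        positives = sum(math.exp(dot(v, t) / temperature) for t in bags[i])
        # Denominator: similarity mass over the texts of all bags in the batch.
        everything = sum(
            math.exp(dot(v, t) / temperature) for bag in bags for t in bag
        )
        total += -math.log(positives / everything)
    return total / len(videos)


# Toy example: two videos, each with a bag of two texts.
videos = [[1.0, 0.0], [0.0, 1.0]]
matched_bags = [[[1.0, 0.0], [0.9, 0.1]], [[0.0, 1.0], [0.1, 0.9]]]
swapped_bags = [matched_bags[1], matched_bags[0]]

loss_matched = mil_nce_loss(videos, matched_bags)
loss_swapped = mil_nce_loss(videos, swapped_bags)
print(loss_matched < loss_swapped)
```

Because the numerator sums over the whole bag, a single well-matching text is enough to keep the loss low, which is what makes a bag-level objective tolerant of noisy entries produced by automatic matching, expansion, or captioning.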
Benchmarks
| Benchmark (zero-shot action recognition) | Method | Metric |
|---|---|---|
| Charades | MAXI | mAP: 23.8 |
| HMDB51 | MAXI | Top-1 Accuracy: 52.3 |
| Kinetics | MAXI | Top-1 Accuracy: 71.6 |
| UCF101 | MAXI | Top-1 Accuracy: 78.2 |