Multimodal Open-Vocabulary Video Classification via Pre-Trained Vision and Language Models

Rui Qian, Yeqing Li, Zheng Xu, Ming-Hsuan Yang, Serge Belongie, Yin Cui

Abstract

Utilizing vision and language models (VLMs) pre-trained on large-scale image-text pairs is becoming a promising paradigm for open-vocabulary visual recognition. In this work, we extend this paradigm by leveraging motion and audio that naturally exist in video. We present MOV, a simple yet effective method for Multimodal Open-Vocabulary video classification. In MOV, we directly use the vision encoder from pre-trained VLMs with minimal modifications to encode video, optical flow, and audio spectrogram. We design a cross-modal fusion mechanism to aggregate complementary multimodal information. Experiments on Kinetics-700 and VGGSound show that introducing the flow or audio modality brings large performance gains over the pre-trained VLM and existing methods. Specifically, MOV greatly improves the accuracy on base classes while generalizing better on novel classes. MOV achieves state-of-the-art results on the UCF and HMDB zero-shot video classification benchmarks, significantly outperforming both traditional zero-shot methods and recent methods based on VLMs. Code and models will be released.
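The open-vocabulary setup described in the abstract can be illustrated with a minimal sketch: class names are embedded as text, each modality of a clip is embedded by an encoder, and the predicted class is the one whose text embedding is most similar. Everything below is an assumption made for illustration and is not taken from the paper: the toy 3-dimensional embeddings, the function names, and in particular the fusion step, which simply averages per-modality cosine scores as a stand-in for MOV's cross-modal fusion mechanism.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]


def cosine_scores(query, class_embeddings):
    """Cosine similarity between one query embedding and every class text embedding."""
    q = l2_normalize(query)
    scores = {}
    for name, emb in class_embeddings.items():
        e = l2_normalize(emb)
        scores[name] = sum(a * b for a, b in zip(q, e))
    return scores


def classify_open_vocab(modality_embeddings, class_embeddings):
    """Average per-modality scores (hypothetical stand-in for learned fusion)
    and return the best-scoring class name together with all fused scores."""
    fused = {name: 0.0 for name in class_embeddings}
    for emb in modality_embeddings.values():
        for name, score in cosine_scores(emb, class_embeddings).items():
            fused[name] += score / len(modality_embeddings)
    best = max(fused, key=fused.get)
    return best, fused


# Toy data (made up): in practice these would come from a text encoder
# and from a vision encoder applied to video, optical flow, and spectrograms.
text_embeddings = {
    "playing guitar": [1.0, 0.0, 0.0],
    "riding a bike": [0.0, 1.0, 0.0],
    "swimming": [0.0, 0.0, 1.0],
}
clip_embeddings = {
    "video": [0.9, 0.1, 0.0],
    "flow": [0.6, 0.4, 0.0],
    "audio": [0.8, 0.0, 0.2],
}

label, scores = classify_open_vocab(clip_embeddings, text_embeddings)
print(label)
```

Because the class list is just a dictionary of text embeddings, new categories can be added at inference time without retraining, which is what makes the setting "open-vocabulary".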

Benchmarks

Benchmark | Methodology | Metrics
zero-shot-action-recognition-on-hmdb51 | MOV (ViT-B/16) | Top-1 Accuracy: 60.8
zero-shot-action-recognition-on-hmdb51 | MOV (ViT-L/14) | Top-1 Accuracy: 64.7
zero-shot-action-recognition-on-ucf101 | MOV (ViT-B/16) | Top-1 Accuracy: 82.6
zero-shot-action-recognition-on-ucf101 | MOV (ViT-L/14) | Top-1 Accuracy: 87.1
