Clover: Towards A Unified Video-Language Alignment and Fusion Model

Jingjia Huang, Yinan Li, Jiashi Feng, Xinglong Wu, Xiaoshuai Sun, Rongrong Ji

Abstract

Building a universal Video-Language model that solves a range of video understanding tasks (e.g., text-video retrieval, video question answering) is an open challenge in machine learning. Towards this goal, most recent works build the model by stacking uni-modal and cross-modal feature encoders and train it with pair-wise contrastive pretext tasks. Though they offer attractive generality, the resulting models have to compromise between efficiency and performance, and they mostly adopt different architectures for different downstream tasks. We find this is because pair-wise training cannot align and fuse features from different modalities well. We then introduce Clover, a Correlated Video-Language pre-training method, towards a universal Video-Language model for solving multiple video understanding tasks with neither performance nor efficiency compromise. It improves cross-modal feature alignment and fusion via a novel tri-modal alignment pre-training task. Additionally, we propose to enhance the tri-modal alignment by incorporating learning from semantic masked samples and a new pair-wise ranking loss. Clover establishes new state-of-the-art results on multiple downstream tasks, including three retrieval tasks in both zero-shot and fine-tuning settings, and eight video question answering tasks. Code and pre-trained models will be released at https://github.com/LeeYN-43/Clover.
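For readers unfamiliar with the pair-wise contrastive objective that the abstract attributes to prior work, the sketch below shows the generic symmetric form (often called InfoNCE) over a batch of matched video and text embeddings. It is an illustration under stated assumptions, not Clover's tri-modal alignment task or its ranking loss: the function name `symmetric_contrastive_loss`, the temperature of 0.07, and the toy embeddings are all hypothetical choices made here.

```python
import math


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def _normalize(v):
    norm = math.sqrt(_dot(v, v))
    return [x / norm for x in v]


def _mean_cross_entropy(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for a numerically stable log-sum-exp
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def symmetric_contrastive_loss(video_embs, text_embs, temperature=0.07):
    """Generic pair-wise contrastive loss (assumed form, not Clover's objective).

    video_embs[i] and text_embs[i] are assumed to come from the same pair;
    every other combination in the batch is treated as a negative.
    """
    v = [_normalize(x) for x in video_embs]
    t = [_normalize(x) for x in text_embs]
    # Cosine similarities scaled by the temperature.
    logits = [[_dot(vi, tj) / temperature for tj in t] for vi in v]
    transposed = [list(col) for col in zip(*logits)]
    # Average the video-to-text and text-to-video directions.
    return 0.5 * (_mean_cross_entropy(logits) + _mean_cross_entropy(transposed))


# Toy check: matched pairs should score a lower loss than swapped pairs.
videos = [[1.0, 0.0], [0.0, 1.0]]
aligned_loss = symmetric_contrastive_loss(videos, [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = symmetric_contrastive_loss(videos, [[0.0, 1.0], [1.0, 0.0]])
```

The loss only compares one video with one text at a time, which is the pair-wise property the abstract identifies as limiting alignment and fusion.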

Code Repositories

leeyn-43/clover (official, PyTorch)

Benchmarks

| Benchmark | Methodology | Metrics |
| --- | --- | --- |
| video-question-answering-on-lsmdc-fib | Clover | Accuracy: 54.1 |
| video-question-answering-on-lsmdc-mc | Clover | Accuracy: 83.7 |
| video-question-answering-on-msrvtt-mc | Clover | Accuracy: 95.2 |
| video-retrieval-on-didemo | Clover | text-to-video R@1: 50.1; R@5: 76.7; R@10: 85.6; Median Rank: 1 |
| video-retrieval-on-lsmdc | Clover | text-to-video R@1: 24.8; R@5: 44; R@10: 54.5; Median Rank: 8 |
| video-retrieval-on-msr-vtt-1ka | Clover | text-to-video R@1: 40.5; R@5: 69.8; R@10: 79.4; Median Rank: 2 |
| visual-question-answering-on-msrvtt-qa-1 | Clover | Accuracy: 0.441 |
| visual-question-answering-on-msvd-qa-1 | Clover | Accuracy: 0.524 |
| zero-shot-video-retrieval-on-didemo | Clover | text-to-video R@1: 29.5; R@5: 55.2; R@10: 66.3; Median Rank: 4 |
| zero-shot-video-retrieval-on-lsmdc | Clover | text-to-video R@1: 14.7; R@5: 29.2; R@10: 38.2; Median Rank: 24 |
| zero-shot-video-retrieval-on-msr-vtt | Clover | text-to-video R@1: 26.4; R@5: 49.5; R@10: 60; Median Rank: 6 |
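
The retrieval rows report text-to-video R@K (the percentage of text queries whose correct video appears in the top K results) and Median Rank (the median position of the correct video; lower is better). A minimal sketch of how these are conventionally computed from a text-by-video similarity matrix follows. It assumes query i's ground-truth video is video i, and the helper names and toy matrix are hypothetical, not taken from the Clover code.

```python
from statistics import median


def text_to_video_ranks(sim):
    """Return the 1-based rank of the correct video for each text query.

    sim[i][j] is the similarity between text query i and video j; the
    ground-truth video for query i is assumed to be video i.
    """
    ranks = []
    for i, row in enumerate(sim):
        # Rank = 1 + number of videos that score strictly higher than the match.
        ranks.append(1 + sum(1 for score in row if score > row[i]))
    return ranks


def recall_at_k(ranks, k):
    """Percentage of queries whose correct video is ranked within the top k."""
    return 100.0 * sum(1 for r in ranks if r <= k) / len(ranks)


# Toy 4x4 similarity matrix (made-up numbers, for illustration only).
sim = [
    [0.9, 0.1, 0.3, 0.2],  # correct video ranked 1st
    [0.8, 0.5, 0.1, 0.2],  # correct video ranked 2nd
    [0.2, 0.7, 0.6, 0.9],  # correct video ranked 3rd
    [0.1, 0.2, 0.3, 0.4],  # correct video ranked 1st
]
ranks = text_to_video_ranks(sim)
r_at_1 = recall_at_k(ranks, 1)
r_at_5 = recall_at_k(ranks, 5)
med_rank = median(ranks)
```

On this toy matrix the ranks are 1, 2, 3 and 1, so R@1 is 50.0, R@5 is 100.0, and the median rank is 1.5.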
