Expectation-Maximization Contrastive Learning for Compact Video-and-Language Representations

Peng Jin Jinfa Huang Fenglin Liu Xian Wu Shen Ge Guoli Song David A. Clifton Jie Chen

Abstract

Most video-and-language representation learning approaches employ contrastive learning, e.g., CLIP, to project video and text features into a common latent space according to the semantic similarities of text-video pairs. However, the shared latent spaces learned this way are often not optimal, and the modality gap between visual and textual representations cannot be fully eliminated. In this paper, we propose Expectation-Maximization Contrastive Learning (EMCL) to learn compact video-and-language representations. Specifically, we use the Expectation-Maximization algorithm to find a compact set of bases for the latent space, so that the features can be concisely represented as linear combinations of these bases. This decomposition of video-and-language features reduces the rank of the latent space, which increases its representational power for semantics. Extensive experiments on three benchmark text-video retrieval datasets show that EMCL learns more discriminative video-and-language representations than previous methods and significantly outperforms previous state-of-the-art methods across all metrics. More encouragingly, the proposed method can boost the performance of existing approaches either as a jointly trained layer or as an out-of-the-box inference module with no extra training, making it easy to incorporate into existing methods.
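The decomposition described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: it assumes soft assignments computed with a softmax over dot products, randomly initialized bases, and a fixed number of iterations, and the names `em_decompose`, `num_bases`, `num_iters`, and `temperature` are hypothetical. It only shows the general idea of alternating an E-step (assign features to bases) and an M-step (update bases), then rewriting each feature as a linear combination of the bases.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def em_decompose(features, num_bases=2, num_iters=5, temperature=1.0, seed=0):
    """Return (bases, responsibilities, reconstructed) for a list of vectors.

    Illustrative assumptions: softmax responsibilities over dot products and
    Gaussian-initialized bases. The paper's exact formulation may differ.
    """
    rng = random.Random(seed)
    dim = len(features[0])
    bases = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(num_bases)]
    resp = []
    for _ in range(num_iters):
        # E-step: soft-assign every feature to the current bases.
        resp = [softmax([dot(x, mu) / temperature for mu in bases]) for x in features]
        # M-step: each basis becomes the responsibility-weighted mean of the features.
        for k in range(num_bases):
            weight = sum(r[k] for r in resp)
            bases[k] = [
                sum(r[k] * x[d] for r, x in zip(resp, features)) / (weight + 1e-12)
                for d in range(dim)
            ]
    # Reconstruction: every feature is rewritten as a linear combination of the
    # bases, so the reconstructed set has rank at most num_bases.
    reconstructed = [
        [sum(r[k] * bases[k][d] for k in range(num_bases)) for d in range(dim)]
        for r in resp
    ]
    return bases, resp, reconstructed


# Toy "video" and "text" features in one shared 3-dimensional space.
features = [[1.0, 0.0, 0.1], [0.9, 0.1, 0.0], [0.0, 1.0, 0.1], [0.1, 0.9, 0.0]]
bases, resp, reconstructed = em_decompose(features, num_bases=2)
```

In this sketch the reconstructed features would replace the originals before the contrastive similarity is computed; whether that happens during training or only at inference is a design choice the abstract leaves open to both.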

Code Repositories

Repository	Framework	Status
jpthu17/dicosa	pytorch	Mentioned in GitHub
jpthu17/HBI	pytorch	Mentioned in GitHub
jpthu17/emcl	pytorch	Official, mentioned in GitHub
jpthu17/diffusionret	pytorch	Mentioned in GitHub

Benchmarks

Benchmark	Methodology	Metrics
video-captioning-on-msr-vtt-1	EMCL-Net	BLEU-4: 45.3; CIDEr: 54.6; METEOR: 30.2; ROUGE-L: 63.2
video-question-answering-on-msrvtt-qa	EMCL-Net	Accuracy: 45.8
video-retrieval-on-activitynet	EMCL-Net	text-to-video Mean Rank: 2; text-to-video R@1: 41.2; text-to-video R@5: 72.7; video-to-text Mean Rank: 2; video-to-text R@1: 42.7; video-to-text R@5: 74; video-to-text R@50: 98.3
video-retrieval-on-activitynet	EMCL-Net++	text-to-video Mean Rank: 1; text-to-video R@1: 50.6; text-to-video R@5: 78.7; text-to-video R@50: 98.1; video-to-text Mean Rank: 1; video-to-text R@1: 50.6; video-to-text R@5: 78.9; video-to-text R@50: 98.4
video-retrieval-on-lsmdc	EMCL-Net	text-to-video R@1: 23.9; text-to-video R@5: 42.4; text-to-video R@10: 50.9; video-to-text Mean Rank: 12; video-to-text R@1: 22.2; video-to-text R@5: 40.6; video-to-text R@10: 49.2
video-retrieval-on-lsmdc	EMCL-Net (Ours)++ LSMDC Rohrbach et al. (2015)	text-to-video Mean Rank: 8; text-to-video R@10: 53.7
video-retrieval-on-lsmdc	EMCL-Net++	text-to-video R@1: 25.9; text-to-video R@5: 46.4; video-to-text Mean Rank: 8; video-to-text R@1: 26.7; video-to-text R@5: 44.7; video-to-text R@10: 54.4
video-retrieval-on-msr-vtt-1ka	EMCL-Net	text-to-video Mean Rank: 2; text-to-video R@1: 46.8; text-to-video R@5: 73.1; text-to-video R@10: 83.1; video-to-text Mean Rank: 2; video-to-text R@1: 46.5; video-to-text R@5: 73.5; video-to-text R@10: 83.5
video-retrieval-on-msr-vtt-1ka	EMCL-Net++	text-to-video Mean Rank: 1; text-to-video R@1: 51.6; text-to-video R@5: 78.1; text-to-video R@10: 85.3; video-to-text Mean Rank: 1; video-to-text R@1: 51.8; video-to-text R@5: 80.2; video-to-text R@10: 88
visual-question-answering-on-msrvtt-qa-1	EMCL-Net	Accuracy: 0.458
