3 months ago

Video-Text as Game Players: Hierarchical Banzhaf Interaction for Cross-Modal Representation Learning

Peng Jin, Jinfa Huang, Pengfei Xiong, Shangxuan Tian, Chang Liu, Xiangyang Ji, Li Yuan, Jie Chen

Abstract

Contrastive learning-based approaches to video-language representation learning, e.g., CLIP, pursue semantic interaction over pre-defined video-text pairs and have achieved outstanding performance. To move beyond this coarse-grained global interaction, we must tackle the challenging "shell-breaking" interactions needed for fine-grained cross-modal learning. In this paper, we model video and text as game players using multivariate cooperative game theory, in order to handle the uncertainty that arises in fine-grained semantic interaction from diverse granularity, flexible combination, and vague intensity. Concretely, we propose Hierarchical Banzhaf Interaction (HBI) to value possible correspondences between video frames and text words for sensitive and explainable cross-modal contrast. To realize the cooperative game among multiple video frames and multiple text words efficiently, the proposed method clusters the original video frames (and, likewise, the text words) and computes the Banzhaf Interaction between the merged tokens. By stacking token merge modules, we achieve cooperative games at different semantic levels. Extensive experiments on commonly used text-video retrieval and video question answering benchmarks show superior performance and justify the efficacy of HBI. More encouragingly, HBI can also serve as a visualization tool to promote the understanding of cross-modal interaction, which may have a far-reaching impact on the community. The project page is available at https://jpthu17.github.io/HBI/.
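
To make the game-theoretic idea concrete, the sketch below computes the textbook pairwise Banzhaf interaction index by brute force: for players i and j, it averages v(S ∪ {i, j}) - v(S ∪ {i}) - v(S ∪ {j}) + v(S) over every coalition S of the remaining players. This is the standard cooperative-game definition, not necessarily the exact formulation or the efficient approximation used in HBI. The player names (`f1`, `f2`, `w1`) and the value function `toy_value` are hypothetical and chosen only for illustration.

```python
from itertools import combinations


def banzhaf_interaction(players, value, i, j):
    """Pairwise Banzhaf interaction index of players i and j (brute force).

    `value` maps a set of players to a number. The cost is exponential in
    the number of players, so this is only practical for tiny toy games.
    """
    others = [p for p in players if p not in (i, j)]
    total = 0.0
    # Enumerate every coalition S drawn from the remaining players.
    for r in range(len(others) + 1):
        for subset in combinations(others, r):
            s = frozenset(subset)
            total += (value(s | {i, j}) - value(s | {i})
                      - value(s | {j}) + value(s))
    # Average over the 2^(n-2) coalitions that exclude both i and j.
    return total / (2 ** len(others))


def toy_value(coalition):
    """Hypothetical value function: frame f1 and word w1 only pay off together."""
    bonus = 10 if {"f1", "w1"} <= coalition else 0
    return bonus + len(coalition)


players = ["f1", "f2", "w1"]  # two video frames and one text word (assumed)
print(banzhaf_interaction(players, toy_value, "f1", "w1"))
print(banzhaf_interaction(players, toy_value, "f2", "w1"))
```

In this toy game the pair (`f1`, `w1`) gets a positive interaction because the two players create value only jointly, while (`f2`, `w1`) gets zero because their contributions are purely additive. The exponential enumeration also suggests why reducing the number of players, for example by merging tokens, matters for efficiency.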

Code Repositories

jpthu17/dicosa (PyTorch; mentioned in GitHub)
jpthu17/HBI (Official; PyTorch; mentioned in GitHub)
jpthu17/emcl (PyTorch; mentioned in GitHub)
jpthu17/diffusionret (PyTorch; mentioned in GitHub)

Benchmarks

Methodology: HBI (all rows)

Question answering

Benchmark                                  Accuracy
video-question-answering-on-msrvtt-qa      46.2
visual-question-answering-on-msrvtt-qa-1   0.462

Video retrieval

Benchmark                        Direction       R@1    R@5    R@10   Median Rank   Mean Rank
video-retrieval-on-activitynet   text-to-video   42.2   73.0   84.6   2.0           6.6
video-retrieval-on-activitynet   video-to-text   42.4   73.0   86.0   2.0           6.5
video-retrieval-on-didemo        text-to-video   46.9   74.9   82.7   2.0           12.1
video-retrieval-on-didemo        video-to-text   46.2   73.0   82.7   2.0           8.7
video-retrieval-on-msr-vtt-1ka   text-to-video   48.6   74.6   83.4   2.0           12.0
video-retrieval-on-msr-vtt-1ka   video-to-text   46.8   74.3   84.3   2.0           8.9
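
For readers unfamiliar with the retrieval metrics listed above, here is a minimal sketch of how R@K, Median Rank, and Mean Rank are commonly computed from a query-by-candidate similarity matrix. It assumes that the correct candidate for query i sits at index i and that ties are broken in favor of the true match; the function name `retrieval_metrics` and the example scores are hypothetical, and the evaluation code behind the numbers on this page may differ in detail.

```python
from statistics import mean, median


def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Compute R@K (in percent), Median Rank and Mean Rank.

    sim[i][j] is the similarity between query i and candidate j.
    Assumption: the ground-truth candidate for query i is candidate i.
    """
    ranks = []
    for i, row in enumerate(sim):
        # Rank = 1 + number of candidates scored strictly above the true match.
        ranks.append(1 + sum(1 for score in row if score > row[i]))
    metrics = {f"R@{k}": 100.0 * sum(r <= k for r in ranks) / len(ranks)
               for k in ks}
    metrics["Median Rank"] = median(ranks)
    metrics["Mean Rank"] = mean(ranks)
    return metrics


# Hypothetical 3x3 similarity matrix: query 1 ranks its true match second.
example_sim = [
    [0.9, 0.1, 0.2],
    [0.8, 0.7, 0.1],
    [0.3, 0.2, 0.6],
]
print(retrieval_metrics(example_sim))
```

Higher R@K is better, while lower Median Rank and Mean Rank are better. For text-to-video retrieval the queries are captions and the candidates are videos; video-to-text swaps the two roles.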
