3 months ago

Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations

Jie Jiang, Shaobo Min, Weijie Kong, Dihong Gong, Hongfa Wang, Zhifeng Li, Wei Liu

Abstract

Text-Video Retrieval plays an important role in multi-modal understanding and has attracted increasing attention in recent years. Most existing methods focus on constructing contrastive pairs between whole videos and complete caption sentences, while overlooking fine-grained cross-modal relationships, e.g., clip-phrase or frame-word. In this paper, we propose a novel method, named Hierarchical Cross-Modal Interaction (HCMI), to explore multi-level cross-modal relationships among video-sentence, clip-phrase, and frame-word for text-video retrieval. Considering intrinsic semantic frame relations, HCMI performs self-attention to explore frame-level correlations and adaptively clusters correlated frames into clip-level and video-level representations. In this way, HCMI constructs multi-level video representations at frame-clip-video granularities to capture fine-grained video content, and multi-level text representations at word-phrase-sentence granularities for the text modality. With multi-level representations for video and text, hierarchical contrastive learning is designed to explore fine-grained cross-modal relationships, i.e., frame-word, clip-phrase, and video-sentence, which enables HCMI to achieve a comprehensive semantic comparison between video and text modalities. Further boosted by adaptive label denoising and marginal sample enhancement, HCMI achieves new state-of-the-art results on various benchmarks, e.g., Rank@1 of 55.0%, 58.2%, 29.7%, 52.1%, and 57.3% on MSR-VTT, MSVD, LSMDC, DiDeMo, and ActivityNet, respectively.
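As a rough illustration of what contrastive learning over several granularities can look like, the sketch below sums a symmetric InfoNCE loss over three text-video similarity matrices (frame-word, clip-phrase, video-sentence). It is not the authors' implementation: the abstract does not specify the loss form, so the InfoNCE choice, the temperature of 0.05, the equal level weights, and the names `info_nce` and `hierarchical_loss` are all assumptions made for illustration. Each matrix is assumed to be already aggregated to one score per text-video pair.

```python
import math


def info_nce(sim, temperature=0.05):
    """Symmetric InfoNCE over a square similarity matrix.

    sim[i][j] is the similarity between text i and video j; matched
    pairs sit on the diagonal. The temperature value is an assumption.
    """
    n = len(sim)

    def one_direction(rows):
        total = 0.0
        for i, row in enumerate(rows):
            logits = [s / temperature for s in row]
            m = max(logits)  # subtract the max for numerical stability
            log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
            total += log_denom - logits[i]  # negative log-softmax of the match
        return total / n

    columns = [list(col) for col in zip(*sim)]  # video-to-text direction
    return 0.5 * (one_direction(sim) + one_direction(columns))


def hierarchical_loss(level_sims, weights=None):
    """Weighted sum of per-level losses (equal weights are an assumption)."""
    weights = weights or {name: 1.0 for name in level_sims}
    return sum(weights[name] * info_nce(sim) for name, sim in level_sims.items())


# Toy batch of two text-video pairs, one similarity matrix per level.
levels = {
    "frame-word": [[0.9, 0.1], [0.2, 0.8]],
    "clip-phrase": [[0.7, 0.3], [0.1, 0.6]],
    "video-sentence": [[0.8, 0.2], [0.3, 0.9]],
}
loss = hierarchical_loss(levels)
```

The loss shrinks as matched pairs (the diagonal) score higher than mismatched ones at every level, which is the intuition behind comparing video and text at several granularities at once.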

Benchmarks

| Benchmark | Methodology | Direction | R@1 | R@5 | R@10 | Median Rank | Mean Rank |
|---|---|---|---|---|---|---|---|
| video-retrieval-on-activitynet | HunYuan_tvr | text-to-video | 57.3 | 84.8 | 93.1 | 1 | 4.0 |
| video-retrieval-on-activitynet | HunYuan_tvr | video-to-text | 57.7 | 85.7 | 93.9 | 1 | 3.4 |
| video-retrieval-on-didemo | HunYuan_tvr (huge) | text-to-video | 52.7 | 77.8 | 85.2 | 1.0 | 13.7 |
| video-retrieval-on-didemo | HunYuan_tvr (huge) | video-to-text | 54.1 | 78.3 | 86.8 | 1.0 | 9.1 |
| video-retrieval-on-didemo | HunYuan_tvr | text-to-video | 52.1 | 78.2 | 85.7 | 1 | 11.1 |
| video-retrieval-on-didemo | HunYuan_tvr | video-to-text | 54.8 | 79.9 | 87.2 | 1 | 7.1 |
| video-retrieval-on-lsmdc | HunYuan_tvr (huge) | text-to-video | 40.4 | 80.1 | 92.8 | 2.0 | 3.9 |
| video-retrieval-on-lsmdc | HunYuan_tvr (huge) | video-to-text | 34.6 | 71.8 | 91.8 | 2.0 | 4.3 |
| video-retrieval-on-lsmdc | HunYuan_tvr | text-to-video | 29.7 | 46.4 | 55.4 | 7 | 56.4 |
| video-retrieval-on-lsmdc | HunYuan_tvr | video-to-text | 30.1 | 47.5 | 55.7 | 7 | 48.9 |
| video-retrieval-on-msr-vtt-1ka | HunYuan_tvr | text-to-video | 55.0 | - | - | - | - |
| video-retrieval-on-msr-vtt-1ka | HunYuan_tvr | video-to-text | 55.5 | 78.4 | 85.8 | 1.0 | 7.7 |
| video-retrieval-on-msr-vtt-1ka | HunYuan_tvr (huge) | text-to-video | 62.9 | 84.5 | 90.8 | 1.0 | 9.3 |
| video-retrieval-on-msr-vtt-1ka | HunYuan_tvr (huge) | video-to-text | 64.8 | 84.9 | 91.1 | 1.0 | 5.5 |
| video-retrieval-on-msvd | HunYuan_tvr (huge) | text-to-video | 59.0 | 84.0 | 90.3 | 1.0 | 7.6 |
| video-retrieval-on-msvd | HunYuan_tvr (huge) | video-to-text | 73.0 | 94.5 | 96.6 | 1.0 | 7.6 |
| video-retrieval-on-msvd | HunYuan_tvr | text-to-video | 58.2 | 83.5 | 90.1 | 1 | 7.8 |
| video-retrieval-on-msvd | HunYuan_tvr | video-to-text | 69.1 | 91.5 | 95.0 | 1.0 | 3.8 |
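
For readers unfamiliar with the metrics reported above, the following sketch shows how R@K, Median Rank, and Mean Rank are commonly computed from a query-by-candidate similarity matrix. It assumes exactly one correct candidate per query, located at the same index as the query; benchmarks with several captions per video need a different ground-truth mapping. The function name `retrieval_metrics` is hypothetical, and this is not the evaluation code behind the numbers listed here.

```python
from statistics import mean, median


def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Compute recall-at-K and rank statistics.

    sim[i][j] is the similarity of query i to candidate j; the correct
    candidate for query i is assumed to be j == i.
    """
    ranks = []
    for i, row in enumerate(sim):
        # Rank = 1 + number of candidates scoring strictly higher than the match.
        ranks.append(1 + sum(1 for s in row if s > row[i]))
    metrics = {f"R@{k}": 100.0 * sum(r <= k for r in ranks) / len(ranks) for k in ks}
    metrics["Median Rank"] = median(ranks)
    metrics["Mean Rank"] = mean(ranks)
    return metrics


# Toy example: three queries, three candidates.
sim = [
    [0.9, 0.1, 0.2],  # correct candidate ranked 1st
    [0.8, 0.3, 0.1],  # correct candidate ranked 2nd
    [0.2, 0.4, 0.5],  # correct candidate ranked 1st
]
metrics = retrieval_metrics(sim)
```

Higher R@K is better, while lower Median Rank and Mean Rank are better. Mean Rank is sensitive to a few badly ranked queries, which is why it can be large even when Median Rank is 1.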
