Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations
Jie Jiang, Shaobo Min, Weijie Kong, Dihong Gong, Hongfa Wang, Zhifeng Li, Wei Liu

Abstract
Text-Video Retrieval plays an important role in multi-modal understanding and has attracted increasing attention in recent years. Most existing methods focus on constructing contrastive pairs between whole videos and complete caption sentences, while overlooking fine-grained cross-modal relationships, e.g., clip-phrase or frame-word. In this paper, we propose a novel method, named Hierarchical Cross-Modal Interaction (HCMI), to explore multi-level cross-modal relationships among video-sentence, clip-phrase, and frame-word for text-video retrieval. Considering intrinsic semantic frame relations, HCMI performs self-attention to explore frame-level correlations and adaptively clusters correlated frames into clip-level and video-level representations. In this way, HCMI constructs multi-level video representations at frame-clip-video granularities to capture fine-grained video content, and multi-level text representations at word-phrase-sentence granularities for the text modality. With multi-level representations for video and text, hierarchical contrastive learning is designed to explore fine-grained cross-modal relationships, i.e., frame-word, clip-phrase, and video-sentence, which enables HCMI to achieve a comprehensive semantic comparison between video and text modalities. Further boosted by adaptive label denoising and marginal sample enhancement, HCMI achieves new state-of-the-art results on various benchmarks, e.g., Rank@1 of 55.0%, 58.2%, 29.7%, 52.1%, and 57.3% on MSR-VTT, MSVD, LSMDC, DiDeMo, and ActivityNet, respectively.
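To make the idea of hierarchical contrastive learning concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes that each level (frame-word, clip-phrase, video-sentence) has already been pooled into one vector per video and one vector per caption, that matched pairs share the same index, and that the levels are combined by a simple weighted sum of symmetric InfoNCE losses. The function names, the temperature of 0.05, and the equal default weights are all illustrative assumptions; the abstract does not specify any of them.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def l2_normalize(v):
    # Guard against a zero vector so the division never fails.
    norm = math.sqrt(dot(v, v)) or 1.0
    return [x / norm for x in v]


def similarity_matrix(video_feats, text_feats):
    """Cosine similarity; entry [i][j] compares video i with caption j."""
    vids = [l2_normalize(v) for v in video_feats]
    txts = [l2_normalize(t) for t in text_feats]
    return [[dot(v, t) for t in txts] for v in vids]


def _cross_entropy_diag(sim, temperature):
    """Mean softmax cross-entropy with the diagonal as the positive pair."""
    total = 0.0
    for i, row in enumerate(sim):
        logits = [s / temperature for s in row]
        m = max(logits)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]
    return total / len(sim)


def symmetric_info_nce(sim, temperature=0.05):
    """Average of the video-to-text and text-to-video contrastive losses."""
    transposed = [list(col) for col in zip(*sim)]
    v2t = _cross_entropy_diag(sim, temperature)
    t2v = _cross_entropy_diag(transposed, temperature)
    return 0.5 * (v2t + t2v)


def hierarchical_loss(levels, weights=None, temperature=0.05):
    """Weighted sum of per-level losses.

    `levels` maps a level name to a (video_feats, text_feats) pair.
    Equal weights are an assumption made for this sketch.
    """
    weights = weights or {name: 1.0 for name in levels}
    total = 0.0
    for name, (video_feats, text_feats) in levels.items():
        sim = similarity_matrix(video_feats, text_feats)
        total += weights[name] * symmetric_info_nce(sim, temperature)
    return total


# Toy batch of two video-caption pairs with 2-d features at every level.
aligned = {
    "frame-word": ([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]),
    "clip-phrase": ([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]),
    "video-sentence": ([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]]),
}
# Same features, but the captions are swapped so every pair is mismatched.
mismatched = {
    name: (vids, list(reversed(txts))) for name, (vids, txts) in aligned.items()
}

loss_aligned = hierarchical_loss(aligned)
loss_mismatched = hierarchical_loss(mismatched)
```

In this toy batch the aligned pairs give a loss close to zero and the swapped captions give a much larger one, which is the behavior a contrastive objective should have. A real frame-word comparison would involve many frames and many words per pair rather than one pooled vector, so the pooling step here is a deliberate simplification.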
Benchmarks
| Benchmark | Methodology | Direction | R@1 | R@5 | R@10 | Median Rank | Mean Rank |
|---|---|---|---|---|---|---|---|
| video-retrieval-on-activitynet | HunYuan_tvr | text-to-video | 57.3 | 84.8 | 93.1 | 1 | 4.0 |
| video-retrieval-on-activitynet | HunYuan_tvr | video-to-text | 57.7 | 85.7 | 93.9 | 1 | 3.4 |
| video-retrieval-on-didemo | HunYuan_tvr (huge) | text-to-video | 52.7 | 77.8 | 85.2 | 1.0 | 13.7 |
| video-retrieval-on-didemo | HunYuan_tvr (huge) | video-to-text | 54.1 | 78.3 | 86.8 | 1.0 | 9.1 |
| video-retrieval-on-didemo | HunYuan_tvr | text-to-video | 52.1 | 78.2 | 85.7 | 1 | 11.1 |
| video-retrieval-on-didemo | HunYuan_tvr | video-to-text | 54.8 | 79.9 | 87.2 | 1 | 7.1 |
| video-retrieval-on-lsmdc | HunYuan_tvr (huge) | text-to-video | 40.4 | 80.1 | 92.8 | 2.0 | 3.9 |
| video-retrieval-on-lsmdc | HunYuan_tvr (huge) | video-to-text | 34.6 | 71.8 | 91.8 | 2.0 | 4.3 |
| video-retrieval-on-lsmdc | HunYuan_tvr | text-to-video | 29.7 | 46.4 | 55.4 | 7 | 56.4 |
| video-retrieval-on-lsmdc | HunYuan_tvr | video-to-text | 30.1 | 47.5 | 55.7 | 7 | 48.9 |
| video-retrieval-on-msr-vtt-1ka | HunYuan_tvr | text-to-video | 55.0 | n/a | n/a | n/a | n/a |
| video-retrieval-on-msr-vtt-1ka | HunYuan_tvr | video-to-text | 55.5 | 78.4 | 85.8 | 1.0 | 7.7 |
| video-retrieval-on-msr-vtt-1ka | HunYuan_tvr (huge) | text-to-video | 62.9 | 84.5 | 90.8 | 1.0 | 9.3 |
| video-retrieval-on-msr-vtt-1ka | HunYuan_tvr (huge) | video-to-text | 64.8 | 84.9 | 91.1 | 1.0 | 5.5 |
| video-retrieval-on-msvd | HunYuan_tvr (huge) | text-to-video | 59.0 | 84.0 | 90.3 | 1.0 | 7.6 |
| video-retrieval-on-msvd | HunYuan_tvr (huge) | video-to-text | 73.0 | 94.5 | 96.6 | 1.0 | 7.6 |
| video-retrieval-on-msvd | HunYuan_tvr | text-to-video | 58.2 | 83.5 | 90.1 | 1 | 7.8 |
| video-retrieval-on-msvd | HunYuan_tvr | video-to-text | 69.1 | 91.5 | 95.0 | 1.0 | 3.8 |