
Abstract
Text-video retrieval plays an important role in multi-modal understanding and has attracted growing attention in recent years. Most existing methods focus on building contrastive pairs between whole videos and complete sentences, while overlooking fine-grained cross-modal correspondences, such as those between video clips and phrases or between frames and words. This paper proposes a novel method, Hierarchical Cross-Modal Interaction (HCMI), which explores multi-level cross-modal relationships at the video-sentence, clip-phrase, and frame-word levels to improve text-video retrieval. Exploiting the intrinsic semantic relations among video frames, HCMI applies self-attention to mine frame-level correlations and adaptively clusters strongly correlated frames into clip-level and video-level representations. HCMI thereby builds multi-level video representations, from frames to clips to the whole video, capturing fine-grained semantics of the video content; on the text side, it symmetrically builds multi-level representations from words to phrases to the full sentence. On top of these hierarchical video and text representations, HCMI designs a hierarchical contrastive learning strategy that mines fine-grained cross-modal correspondences, including frame-word, clip-phrase, and video-sentence matching, achieving comprehensive and precise semantic alignment between the video and text modalities. Further equipped with adaptive label denoising and marginal sample enhancement, HCMI achieves new state-of-the-art results on multiple benchmarks, with Rank@1 of 55.0%, 58.2%, 29.7%, 52.1%, and 57.3% on MSR-VTT, MSVD, LSMDC, DiDemo, and ActivityNet, respectively.
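The hierarchical alignment described above can be sketched in code. This is an illustrative toy implementation under stated assumptions, not the paper's method: all function names are hypothetical, contiguous splitting stands in for HCMI's adaptive attention-based clustering, and the three levels are fused by a simple average of cosine similarities.

```python
import numpy as np

def l2norm(x, axis=-1):
    # Normalize vectors to unit length so dot products are cosine similarities.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + 1e-8)

def self_attention(frames):
    # Scaled dot-product self-attention over frame embeddings (n_frames, d),
    # mining correlations between frames as described in the abstract.
    d = frames.shape[-1]
    scores = frames @ frames.T / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ frames

def cluster_units(units, n_groups):
    # Stand-in for adaptive clustering: attend over the units, split them into
    # contiguous groups, and mean-pool each group into a mid-level embedding
    # (frames -> clips, words -> phrases).
    groups = np.array_split(self_attention(units), n_groups)
    return np.stack([g.mean(axis=0) for g in groups])

def hierarchical_similarity(frames, words, n_clips=2, n_phrases=2):
    # Average of frame-word, clip-phrase, and video-sentence cosine scores.
    f, w = l2norm(frames), l2norm(words)
    sim_fw = (f @ w.T).max(axis=1).mean()                # frame-word level
    c = l2norm(cluster_units(frames, n_clips))
    p = l2norm(cluster_units(words, n_phrases))
    sim_cp = (c @ p.T).max(axis=1).mean()                # clip-phrase level
    v, s = l2norm(frames.mean(axis=0)), l2norm(words.mean(axis=0))
    sim_vs = float(v @ s)                                # video-sentence level
    return (sim_fw + sim_cp + sim_vs) / 3.0
```

In a retrieval setting, this score would be computed for every video-sentence pair in a batch and fed into a contrastive loss at each of the three levels; here the average merely illustrates how the levels combine.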
Benchmarks
| Benchmark | Method | Metrics |
|---|---|---|
| video-retrieval-on-activitynet | HunYuan_tvr | text-to-video: R@1 57.3, R@5 84.8, R@10 93.1, Median Rank 1, Mean Rank 4.0; video-to-text: R@1 57.7, R@5 85.7, R@10 93.9, Median Rank 1, Mean Rank 3.4 |
| video-retrieval-on-didemo | HunYuan_tvr (huge) | text-to-video: R@1 52.7, R@5 77.8, R@10 85.2, Median Rank 1.0, Mean Rank 13.7; video-to-text: R@1 54.1, R@5 78.3, R@10 86.8, Median Rank 1.0, Mean Rank 9.1 |
| video-retrieval-on-didemo | HunYuan_tvr | text-to-video: R@1 52.1, R@5 78.2, R@10 85.7, Median Rank 1, Mean Rank 11.1; video-to-text: R@1 54.8, R@5 79.9, R@10 87.2, Median Rank 1, Mean Rank 7.1 |
| video-retrieval-on-lsmdc | HunYuan_tvr (huge) | text-to-video: R@1 40.4, R@5 80.1, R@10 92.8, Median Rank 2.0, Mean Rank 3.9; video-to-text: R@1 34.6, R@5 71.8, R@10 91.8, Median Rank 2.0, Mean Rank 4.3 |
| video-retrieval-on-lsmdc | HunYuan_tvr | text-to-video: R@1 29.7, R@5 46.4, R@10 55.4, Median Rank 7, Mean Rank 56.4; video-to-text: R@1 30.1, R@5 47.5, R@10 55.7, Median Rank 7, Mean Rank 48.9 |
| video-retrieval-on-msr-vtt-1ka | HunYuan_tvr | text-to-video: R@1 55.0; video-to-text: R@1 55.5, R@5 78.4, R@10 85.8, Median Rank 1.0, Mean Rank 7.7 |
| video-retrieval-on-msr-vtt-1ka | HunYuan_tvr (huge) | text-to-video: R@1 62.9, R@5 84.5, R@10 90.8, Median Rank 1.0, Mean Rank 9.3; video-to-text: R@1 64.8, R@5 84.9, R@10 91.1, Median Rank 1.0, Mean Rank 5.5 |
| video-retrieval-on-msvd | HunYuan_tvr (huge) | text-to-video: R@1 59.0, R@5 84.0, R@10 90.3, Median Rank 1.0, Mean Rank 7.6; video-to-text: R@1 73.0, R@5 94.5, R@10 96.6, Median Rank 1.0, Mean Rank 7.6 |
| video-retrieval-on-msvd | HunYuan_tvr | text-to-video: R@1 58.2, R@5 83.5, R@10 90.1, Median Rank 1, Mean Rank 7.8; video-to-text: R@1 69.1, R@5 91.5, R@10 95.0, Median Rank 1.0, Mean Rank 3.8 |