3 months ago

Tencent Text-Video Retrieval: Hierarchical Cross-Modal Interactions with Multi-Level Representations

Abstract

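Text-video retrieval plays an important role in multimodal understanding and has attracted increasing attention in recent years. Most existing methods focus on constructing contrastive pairs between whole videos and complete sentences, and overlook fine-grained cross-modal relationships such as those between video clips and phrases or between frames and words. This paper proposes Hierarchical Cross-Modal Interaction (HCMI), a method that explores cross-modal relationships at the video-sentence, clip-phrase, and frame-word levels to improve text-video retrieval. Because video frames are inherently semantically related, HCMI uses self-attention to mine frame-level correlations and adaptively clusters strongly correlated frames into clip-level and video-level representations. This yields a multi-level video representation running from frames to clips to the whole video, which captures the fine-grained semantics of the video content; on the text side, HCMI builds a matching multi-level representation running from words to phrases to the sentence. On top of these representations, HCMI applies a hierarchical contrastive learning strategy that mines fine-grained cross-modal correspondences at the frame-word, clip-phrase, and video-sentence levels, giving comprehensive and precise semantic alignment between the two modalities. With the further addition of adaptive label denoising and marginal sample enhancement, HCMI achieves new state-of-the-art results on several benchmarks, with Rank@1 of 55.0%, 58.2%, 29.7%, 52.1%, and 57.3% on MSR-VTT, MSVD, LSMDC, DiDemo, and ActivityNet, respectively.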
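The hierarchical contrastive strategy described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the dictionary keys, the temperature of 0.05, the equal level weights, and the max-then-mean rule for scoring a pair from unit-level features are all assumptions made for illustration; the abstract does not specify any of them. The sketch only shows the general shape: one symmetric InfoNCE loss per level (frame-word, clip-phrase, video-sentence), summed.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def pair_similarity(video_units, text_units):
    """Score one video/text pair from unit-level features.

    Assumed aggregation (not from the paper): each text unit takes its
    best-matching video unit, and the per-unit maxima are averaged.
    """
    best = [max(cosine(t, v) for v in video_units) for t in text_units]
    return sum(best) / len(best)


def info_nce(sim, temperature=0.05):
    """Symmetric InfoNCE on a square matrix where sim[i][i] is the positive."""
    n = len(sim)

    def one_direction(rows):
        total = 0.0
        for i, row in enumerate(rows):
            logits = [s / temperature for s in row]
            m = max(logits)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(x - m) for x in logits))
            total += log_z - logits[i]
        return total / n

    columns = [list(c) for c in zip(*sim)]
    return 0.5 * (one_direction(sim) + one_direction(columns))


def hierarchical_loss(videos, texts, weights=(1.0, 1.0, 1.0)):
    """Sum of per-level contrastive losses; videos[i] is paired with texts[i].

    Each video is a dict with keys "frame", "clip", "video" and each text a
    dict with keys "word", "phrase", "sentence" (hypothetical layout); every
    value is a list of feature vectors.
    """
    levels = [("frame", "word"), ("clip", "phrase"), ("video", "sentence")]
    total = 0.0
    for weight, (video_key, text_key) in zip(weights, levels):
        # sim[i][j]: similarity of text i to video j at this level
        sim = [[pair_similarity(v[video_key], t[text_key]) for v in videos]
               for t in texts]
        total += weight * info_nce(sim)
    return total
```

With correctly paired inputs the loss is close to zero, and it grows when the pairing is shuffled, which is the behavior a contrastive objective of this kind is expected to show.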

Benchmarks

| Benchmark | Method | Direction | R@1 | R@5 | R@10 | Median Rank | Mean Rank |
| video-retrieval-on-activitynet | HunYuan_tvr | text-to-video | 57.3 | 84.8 | 93.1 | 1 | 4.0 |
| video-retrieval-on-activitynet | HunYuan_tvr | video-to-text | 57.7 | 85.7 | 93.9 | 1 | 3.4 |
| video-retrieval-on-didemo | HunYuan_tvr (huge) | text-to-video | 52.7 | 77.8 | 85.2 | 1.0 | 13.7 |
| video-retrieval-on-didemo | HunYuan_tvr (huge) | video-to-text | 54.1 | 78.3 | 86.8 | 1.0 | 9.1 |
| video-retrieval-on-didemo | HunYuan_tvr | text-to-video | 52.1 | 78.2 | 85.7 | 1 | 11.1 |
| video-retrieval-on-didemo | HunYuan_tvr | video-to-text | 54.8 | 79.9 | 87.2 | 1 | 7.1 |
| video-retrieval-on-lsmdc | HunYuan_tvr (huge) | text-to-video | 40.4 | 80.1 | 92.8 | 2.0 | 3.9 |
| video-retrieval-on-lsmdc | HunYuan_tvr (huge) | video-to-text | 34.6 | 71.8 | 91.8 | 2.0 | 4.3 |
| video-retrieval-on-lsmdc | HunYuan_tvr | text-to-video | 29.7 | 46.4 | 55.4 | 7 | 56.4 |
| video-retrieval-on-lsmdc | HunYuan_tvr | video-to-text | 30.1 | 47.5 | 55.7 | 7 | 48.9 |
| video-retrieval-on-msr-vtt-1ka | HunYuan_tvr | text-to-video | 55.0 | - | - | - | - |
| video-retrieval-on-msr-vtt-1ka | HunYuan_tvr | video-to-text | 55.5 | 78.4 | 85.8 | 1.0 | 7.7 |
| video-retrieval-on-msr-vtt-1ka | HunYuan_tvr (huge) | text-to-video | 62.9 | 84.5 | 90.8 | 1.0 | 9.3 |
| video-retrieval-on-msr-vtt-1ka | HunYuan_tvr (huge) | video-to-text | 64.8 | 84.9 | 91.1 | 1.0 | 5.5 |
| video-retrieval-on-msvd | HunYuan_tvr (huge) | text-to-video | 59.0 | 84.0 | 90.3 | 1.0 | 7.6 |
| video-retrieval-on-msvd | HunYuan_tvr (huge) | video-to-text | 73.0 | 94.5 | 96.6 | 1.0 | 7.6 |
| video-retrieval-on-msvd | HunYuan_tvr | text-to-video | 58.2 | 83.5 | 90.1 | 1 | 7.8 |
| video-retrieval-on-msvd | HunYuan_tvr | video-to-text | 69.1 | 91.5 | 95.0 | 1.0 | 3.8 |

A dash marks a value that is not listed for that entry.
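The benchmark metrics above (R@K, Median Rank, Mean Rank) are standard retrieval measures, and the following sketch shows one common way to compute them from a query-by-candidate similarity matrix. It assumes exactly one ground-truth candidate per query, located at the same index as the query, and it breaks ties in favor of the ground truth. The function name `retrieval_metrics` is hypothetical, and the evaluation protocols actually used for each benchmark may differ, for example when a video has several captions.

```python
import statistics


def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Compute R@K (in percent), median rank, and mean rank.

    sim[i][j] is the similarity of query i to candidate j; the correct
    candidate for query i is assumed to be candidate i.
    """
    ranks = []
    for i, row in enumerate(sim):
        # Rank is 1 plus the number of candidates scored strictly higher
        # than the ground-truth candidate.
        ranks.append(1 + sum(1 for score in row if score > row[i]))
    n = len(ranks)
    metrics = {f"R@{k}": 100.0 * sum(r <= k for r in ranks) / n for k in ks}
    metrics["Median Rank"] = statistics.median(ranks)
    metrics["Mean Rank"] = sum(ranks) / n
    return metrics


# Toy example: three text queries scored against three videos.
toy_sim = [
    [0.9, 0.1, 0.2],  # ground truth ranked 1st
    [0.8, 0.5, 0.3],  # ground truth ranked 2nd
    [0.1, 0.2, 0.7],  # ground truth ranked 1st
]
toy_metrics = retrieval_metrics(toy_sim)
```

Higher R@K is better, while lower median and mean ranks are better. Transposing the matrix gives the opposite retrieval direction under the same one-to-one assumption.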
