
Abstract
Existing text-video retrieval methods are essentially discriminative models whose core objective is to maximize the conditional probability $p(\text{candidates} \mid \text{query})$. While this paradigm is straightforward to implement, it ignores the underlying data distribution of the queries, $p(\text{query})$, which makes it difficult to identify out-of-distribution data. To overcome this limitation, this work takes a generative viewpoint and models the correlation between text and video as their joint probability $p(\text{candidates}, \text{query})$. To this end, we propose DiffusionRet, a diffusion-based text-video retrieval framework that casts retrieval as a process of gradually generating the joint distribution from noise. During training, DiffusionRet is optimized from both the generative and the discriminative perspectives: the generator is optimized with a generation loss, while the feature extractor is trained with a contrastive loss. This design combines the strengths of generative and discriminative models. Extensive experiments on five widely used text-video retrieval benchmarks, including MSRVTT, LSMDC, MSVD, ActivityNet Captions, and DiDeMo, demonstrate the superior performance of the method. More encouragingly, without any modification, DiffusionRet also performs well in out-of-distribution retrieval settings. We believe this work provides valuable insights for the related fields. Code is available at https://github.com/jpthu17/DiffusionRet.
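The dual objective described in the abstract (a contrastive loss on the feature extractor plus a diffusion generation loss on a denoiser that models the text-video joint distribution) can be illustrated with a minimal PyTorch sketch. This is not the repository's implementation: the module names, feature dimensions, and the simplified linear noise schedule below are assumptions made purely for illustration.

```python
# Minimal sketch (assumed structure, not the authors' code): a contrastive loss
# trains the feature extractors, while an epsilon-prediction diffusion loss
# trains a denoiser over the joint text-video representation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Denoiser(nn.Module):
    """Hypothetical denoiser: predicts the noise added to a joint
    text-video representation, conditioned on the diffusion timestep."""
    def __init__(self, dim=512, steps=50):
        super().__init__()
        self.t_embed = nn.Embedding(steps, dim)
        self.net = nn.Sequential(
            nn.Linear(2 * dim + dim, dim), nn.GELU(), nn.Linear(dim, 2 * dim)
        )

    def forward(self, x_t, t):
        return self.net(torch.cat([x_t, self.t_embed(t)], dim=-1))


def info_nce(text_feat, video_feat, temperature=0.05):
    """Symmetric contrastive loss over a batch of paired text/video features."""
    text_feat = F.normalize(text_feat, dim=-1)
    video_feat = F.normalize(video_feat, dim=-1)
    logits = text_feat @ video_feat.t() / temperature
    labels = torch.arange(logits.size(0), device=logits.device)
    return (F.cross_entropy(logits, labels) + F.cross_entropy(logits.t(), labels)) / 2


def diffusion_loss(denoiser, text_feat, video_feat, steps=50):
    """Generation loss: corrupt the joint representation with Gaussian noise at a
    random timestep and train the denoiser to predict that noise
    (an assumed linear alpha schedule keeps the sketch short)."""
    x0 = torch.cat([text_feat, video_feat], dim=-1)       # joint representation
    t = torch.randint(0, steps, (x0.size(0),), device=x0.device)
    alpha = (1.0 - (t.float() + 1) / steps).unsqueeze(-1)  # assumed schedule
    noise = torch.randn_like(x0)
    x_t = alpha.sqrt() * x0 + (1 - alpha).sqrt() * noise   # forward (noising) step
    return F.mse_loss(denoiser(x_t, t), noise)


# Toy training step with random features standing in for encoder outputs.
if __name__ == "__main__":
    batch, dim = 8, 512
    text_feat = torch.randn(batch, dim, requires_grad=True)
    video_feat = torch.randn(batch, dim, requires_grad=True)
    denoiser = Denoiser(dim)
    loss = info_nce(text_feat, video_feat) + diffusion_loss(denoiser, text_feat, video_feat)
    loss.backward()
    print(f"total loss: {loss.item():.4f}")
```

Splitting the objective this way mirrors the abstract's description: the feature extractor benefits from discriminative (contrastive) training, while the denoiser learns to generate the joint distribution from noise.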
Code Repositories
- jpthu17/dicosa (PyTorch, mentioned in GitHub)
- jpthu17/HBI (PyTorch, mentioned in GitHub)
- jpthu17/emcl (PyTorch, mentioned in GitHub)
- jpthu17/diffusionret (official, PyTorch, mentioned in GitHub)
Benchmarks
| Benchmark | Method | Metrics |
|---|---|---|
| video-retrieval-on-activitynet | DiffusionRet | text-to-video Mean Rank: 6.5, text-to-video Median Rank: 2.0, text-to-video R@1: 45.8, text-to-video R@10: 86.3, text-to-video R@5: 75.6, video-to-text Mean Rank: 6.3, video-to-text Median Rank: 2.0, video-to-text R@1: 43.8, video-to-text R@10: 86.7, video-to-text R@5: 75.3 |
| video-retrieval-on-activitynet | DiffusionRet+QB-Norm | text-to-video Mean Rank: 6.8, text-to-video Median Rank: 2.0, text-to-video R@1: 48.1, text-to-video R@10: 85.7, video-to-text Mean Rank: 6.7, video-to-text Median Rank: 2.0, video-to-text R@1: 47.4, video-to-text R@10: 86.7, video-to-text R@5: 76.3 |
| video-retrieval-on-didemo | DiffusionRet+QB-Norm | text-to-video Mean Rank: 14.1, text-to-video Median Rank: 2.0, text-to-video R@1: 48.9, text-to-video R@10: 83.3, text-to-video R@5: 75.5, video-to-text Mean Rank: 10.3, video-to-text Median Rank: 1.0, video-to-text R@1: 50.3, video-to-text R@10: 82.9, video-to-text R@5: 75.1 |
| video-retrieval-on-didemo | DiffusionRet | text-to-video Mean Rank: 14.3, text-to-video Median Rank: 2.0, text-to-video R@1: 46.7, text-to-video R@10: 82.7, text-to-video R@5: 74.7, video-to-text Mean Rank: 10.7, video-to-text Median Rank: 2.0, video-to-text R@1: 46.2, video-to-text R@10: 82.2, video-to-text R@5: 74.3 |
| video-retrieval-on-lsmdc | DiffusionRet | text-to-video Mean Rank: 40.7, text-to-video Median Rank: 8.0, text-to-video R@1: 24.4, text-to-video R@10: 54.3, text-to-video R@5: 43.1, video-to-text Mean Rank: 40.2, video-to-text Median Rank: 9.0, video-to-text R@1: 23.0, video-to-text R@10: 51.5, video-to-text R@5: 43.5 |
| video-retrieval-on-msr-vtt-1ka | DiffusionRet | text-to-video Mean Rank: 12.1, text-to-video Median Rank: 2.0, text-to-video R@1: 49.0, text-to-video R@10: 82.7, text-to-video R@5: 75.2, video-to-text Mean Rank: 8.8, video-to-text Median Rank: 2.0, video-to-text R@1: 47.7, video-to-text R@10: 84.5, video-to-text R@5: 73.8 |
| video-retrieval-on-msr-vtt-1ka | DiffusionRet+QB-Norm | text-to-video Mean Rank: 12.1, text-to-video Median Rank: 2.0, text-to-video R@1: 48.9, text-to-video R@10: 83.1, text-to-video R@5: 75.2, video-to-text Mean Rank: 8.5, video-to-text Median Rank: 2.0, video-to-text R@1: 49.3, video-to-text R@10: 83.8, video-to-text R@5: 74.3 |
| video-retrieval-on-msvd | DiffusionRet+QB-Norm | text-to-video Mean Rank: 15.6, text-to-video R@1: 47.9, text-to-video R@10: 84.8, text-to-video R@5: 77.2, video-to-text Mean Rank: 4.5, video-to-text Median Rank: 1.0, video-to-text R@1: 60.3, video-to-text R@10: 92, video-to-text R@5: 86.4 |
| video-retrieval-on-msvd | DiffusionRet | text-to-video Mean Rank: 15.7, text-to-video Median Rank: 2.0, text-to-video R@1: 46.6, text-to-video R@10: 84.1, text-to-video R@5: 75.9, video-to-text Mean Rank: 4.5, video-to-text Median Rank: 1.0, video-to-text R@1: 61.9, video-to-text R@10: 92.9, video-to-text R@5: 88.3 |