3 months ago

DiffusionRet: Generative Text-Video Retrieval with Diffusion Model

Peng Jin Hao Li Zesen Cheng Kehan Li Xiangyang Ji Chang Liu Li Yuan Jie Chen

Abstract

Existing text-video retrieval solutions are, in essence, discriminative models focused on maximizing the conditional likelihood, i.e., p(candidates|query). While straightforward, this de facto paradigm overlooks the underlying data distribution p(query), which makes it challenging to identify out-of-distribution data. To address this limitation, we tackle the task from a generative viewpoint and model the correlation between the text and the video as their joint probability p(candidates,query). This is accomplished through a diffusion-based text-video retrieval framework (DiffusionRet), which models the retrieval task as a process of gradually generating the joint distribution from noise. During training, DiffusionRet is optimized from both the generation and discrimination perspectives: the generator is optimized with a generation loss, and the feature extractor is trained with a contrastive loss. In this way, DiffusionRet leverages the strengths of both generative and discriminative methods. Extensive experiments on five commonly used text-video retrieval benchmarks (MSRVTT, LSMDC, MSVD, ActivityNet Captions, and DiDeMo) show superior performance and support the efficacy of our method. Without any modification, DiffusionRet also performs well in out-of-domain retrieval settings. We believe this work brings fundamental insights into the related fields. Code is available at https://github.com/jpthu17/DiffusionRet.
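
The abstract says the feature extractor is trained with a contrastive loss but does not give its form. As an illustration only, the sketch below assumes a symmetric InfoNCE-style loss over a text-video similarity matrix, a common choice in retrieval models; the function names and the `temperature` default are hypothetical and are not taken from the DiffusionRet code.

```python
import math


def log_softmax(row):
    """Numerically stable log-softmax of a list of scores."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def symmetric_contrastive_loss(sim, temperature=0.07):
    """Assumed symmetric InfoNCE loss (not confirmed by the source).

    sim[i][j] is the similarity of text i and video j; matched pairs
    are assumed to lie on the diagonal of the square matrix.
    """
    n = len(sim)
    scaled = [[s / temperature for s in row] for row in sim]
    # Text-to-video: each row is scored against all candidate videos.
    t2v = -sum(log_softmax(scaled[i])[i] for i in range(n)) / n
    # Video-to-text: each column is scored against all candidate texts.
    cols = [[scaled[i][j] for i in range(n)] for j in range(n)]
    v2t = -sum(log_softmax(cols[j])[j] for j in range(n)) / n
    return 0.5 * (t2v + v2t)


# Toy similarity matrices: matched pairs aligned vs. misaligned.
aligned = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
misaligned = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]
```

With these toy inputs, `symmetric_contrastive_loss(aligned)` is close to zero while `symmetric_contrastive_loss(misaligned)` is large, which is the behavior a contrastive objective is meant to reward.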

Code Repositories

Repository	Framework	Notes
jpthu17/dicosa	PyTorch	Mentioned in GitHub
jpthu17/HBI	PyTorch	Mentioned in GitHub
jpthu17/emcl	PyTorch	Mentioned in GitHub
jpthu17/diffusionret	PyTorch	Official; Mentioned in GitHub

Benchmarks

Benchmark	Methodology	text-to-video R@1	text-to-video R@5	text-to-video R@10	text-to-video Median Rank	text-to-video Mean Rank	video-to-text R@1	video-to-text R@5	video-to-text R@10	video-to-text Median Rank	video-to-text Mean Rank
video-retrieval-on-activitynet	DiffusionRet	45.8	75.6	86.3	2.0	6.5	43.8	75.3	86.7	2.0	6.3
video-retrieval-on-activitynet	DiffusionRet+QB-Norm	48.1	-	85.7	2.0	6.8	47.4	76.3	86.7	2.0	6.7
video-retrieval-on-didemo	DiffusionRet+QB-Norm	48.9	75.5	83.3	2.0	14.1	50.3	75.1	82.9	1.0	10.3
video-retrieval-on-didemo	DiffusionRet	46.7	74.7	82.7	2.0	14.3	46.2	74.3	82.2	2.0	10.7
video-retrieval-on-lsmdc	DiffusionRet	24.4	43.1	54.3	8.0	40.7	23.0	43.5	51.5	9.0	40.2
video-retrieval-on-msr-vtt-1ka	DiffusionRet	49.0	75.2	82.7	2.0	12.1	47.7	73.8	84.5	2.0	8.8
video-retrieval-on-msr-vtt-1ka	DiffusionRet+QB-Norm	48.9	75.2	83.1	2.0	12.1	49.3	74.3	83.8	2.0	8.5
video-retrieval-on-msvd	DiffusionRet+QB-Norm	47.9	77.2	84.8	-	15.6	60.3	86.4	92	1.0	4.5
video-retrieval-on-msvd	DiffusionRet	46.6	75.9	84.1	2.0	15.7	61.9	88.3	92.9	1.0	4.5
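
The benchmark metrics above (R@1, R@5, R@10, Median Rank, Mean Rank) are standard retrieval measures. As a rough guide to how they are usually derived, the sketch below computes them from a query-by-candidate similarity matrix. It assumes that the correct candidate for query i sits at index i and that ties are resolved in favor of the correct candidate; the function names and the toy matrix are illustrative and are not taken from the DiffusionRet evaluation code.

```python
from statistics import mean, median


def ranks_from_similarity(sim):
    """Rank (1 = best) of the correct candidate for each query.

    Assumes sim[i][j] scores query i against candidate j and that the
    ground-truth candidate for query i is at index i.
    """
    ranks = []
    for i, row in enumerate(sim):
        # Count candidates scored strictly higher than the correct one.
        ranks.append(1 + sum(1 for s in row if s > row[i]))
    return ranks


def retrieval_metrics(sim, ks=(1, 5, 10)):
    """Recall@K in percent, plus median and mean rank."""
    ranks = ranks_from_similarity(sim)
    out = {f"R@{k}": 100.0 * sum(r <= k for r in ranks) / len(ranks) for k in ks}
    out["Median Rank"] = median(ranks)
    out["Mean Rank"] = mean(ranks)
    return out


# Toy text-to-video similarity matrix: 3 text queries, 3 candidate videos.
sim = [
    [0.9, 0.1, 0.2],
    [0.8, 0.3, 0.5],
    [0.1, 0.2, 0.7],
]
t2v = retrieval_metrics(sim)
# Video-to-text uses the transposed matrix (videos become the queries).
v2t = retrieval_metrics([list(col) for col in zip(*sim)])
```

Higher is better for R@K, and lower is better for Median Rank and Mean Rank, which is how the rows in the table should be compared.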
