Xirong Li, Chaoxi Xu, Gang Yang, Zhineng Chen, Jianfeng Dong
Abstract
Ad-hoc video search (AVS) is an important yet challenging problem in multimedia retrieval. Different from previous concept-based methods, we propose an end-to-end deep learning method for query representation learning. The proposed method requires no concept modeling, matching, or selection. The backbone of our method is the proposed W2VV++ model, a super version of Word2VisualVec (W2VV) previously developed for visual-to-text matching. W2VV++ is obtained by tweaking W2VV with a better sentence encoding strategy and an improved triplet ranking loss. With these simple changes, W2VV++ brings a substantial improvement in performance. As our participation in the TRECVID 2018 AVS task and retrospective experiments on the TRECVID 2016 and 2017 data show, our best single model, with an overall inferred average precision (infAP) of 0.157, outperforms the state-of-the-art. The performance can be further boosted by model ensemble using late average fusion, reaching a higher infAP of 0.163. With W2VV++, we establish a new baseline for ad-hoc video search.
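The abstract credits part of the gain to an improved triplet ranking loss. A common "improved" variant for cross-modal matching penalizes only the hardest negative in each mini-batch rather than summing over all negatives. The PyTorch sketch below illustrates that idea; it is not the authors' code, and all function and variable names are illustrative.

```python
# A minimal sketch (assumed, not the authors' implementation) of a
# hardest-negative triplet ranking loss for text-to-video matching.
import torch
import torch.nn.functional as F


def improved_triplet_loss(text_emb, video_emb, margin=0.2):
    """text_emb, video_emb: (batch, dim) tensors; row i of each forms a
    matching text-video pair. Embeddings are L2-normalized so that the
    dot product equals cosine similarity."""
    text_emb = F.normalize(text_emb, dim=1)
    video_emb = F.normalize(video_emb, dim=1)

    sim = text_emb @ video_emb.t()        # (batch, batch) similarity matrix
    pos = sim.diag().view(-1, 1)          # similarities of the true pairs

    # Margin violations for every in-batch negative, in both directions.
    cost_t2v = (margin + sim - pos).clamp(min=0)      # text anchor, wrong video
    cost_v2t = (margin + sim - pos.t()).clamp(min=0)  # video anchor, wrong text

    # Zero out the diagonal so the positives do not penalize themselves.
    mask = torch.eye(sim.size(0), dtype=torch.bool, device=sim.device)
    cost_t2v = cost_t2v.masked_fill(mask, 0)
    cost_v2t = cost_v2t.masked_fill(mask, 0)

    # Keep only the hardest negative per anchor instead of summing all.
    return cost_t2v.max(dim=1)[0].mean() + cost_v2t.max(dim=0)[0].mean()
```

Taking the maximum over negatives concentrates the gradient on the most confusing in-batch example, which is what distinguishes this loss from the classic sum-over-negatives triplet ranking loss.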
Benchmarks
| Benchmark | Model | infAP |
|---|---|---|
| TRECVID AVS 2016 (IACC.3) | W2VV++ | 0.151 |
| TRECVID AVS 2017 (IACC.3) | W2VV++ | 0.220 |
| TRECVID AVS 2018 (IACC.3) | W2VV++ | 0.121 |