8 months ago

Multimodal Representation

Visual Document Retrieval

Jianfeng Dong; Xirong Li, Member, IEEE; Chaoxi Xu; Xun Yang; Gang Yang; Xun Wang, Member, IEEE; Meng Wang, Fellow, IEEE

Abstract

This paper attacks the challenging problem of video retrieval by text. In such a retrieval paradigm, an end user searches for unlabeled videos by ad-hoc queries described exclusively in the form of a natural-language sentence, with no visual example provided. Given videos as sequences of frames and queries as sequences of words, an effective sequence-to-sequence cross-modal matching is crucial. To that end, the two modalities need to be first encoded into real-valued vectors and then projected into a common space. In this paper we achieve this by proposing a dual deep encoding network that encodes videos and queries into powerful dense representations of their own. Our novelty is two-fold. First, different from prior art that resorts to a specific single-level encoder, the proposed network performs multi-level encoding that represents the rich content of both modalities in a coarse-to-fine fashion. Second, different from a conventional common space learning algorithm which is either concept based or latent space based, we introduce hybrid space learning which combines the high performance of the latent space and the good interpretability of the concept space. Dual encoding is conceptually simple, practically effective and end-to-end trained with hybrid space learning. Extensive experiments on four challenging video datasets show the viability of the new method.
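
The abstract's general recipe (encode each modality into a real-valued vector, project both into a common space, then match) can be illustrated with a minimal sketch. This is not the authors' network: the random linear projections `W_text` and `W_video`, the feature dimensions, and the helper `rank_videos` are all hypothetical placeholders for learned encoders, and cosine similarity is assumed as the matching score purely for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale vectors along the last axis to unit length."""
    return x / np.maximum(np.linalg.norm(x, axis=-1, keepdims=True), eps)


def rank_videos(query_vec, video_vecs, W_text, W_video):
    """Rank videos for one text query by cosine similarity in a common space.

    query_vec:  (text_dim,) already-encoded query feature (assumed given)
    video_vecs: (n_videos, video_dim) already-encoded video features (assumed given)
    W_text, W_video: hypothetical linear projections into the common space
    """
    q = l2_normalize(query_vec @ W_text)      # (common_dim,)
    v = l2_normalize(video_vecs @ W_video)    # (n_videos, common_dim)
    scores = v @ q                            # cosine similarity per video
    order = np.argsort(-scores)               # best match first
    return order, scores


# Toy data: random stand-ins for encoder outputs and learned projections.
rng = np.random.default_rng(0)
text_dim, video_dim, common_dim = 8, 16, 4
W_text = rng.standard_normal((text_dim, common_dim))
W_video = rng.standard_normal((video_dim, common_dim))
query_vec = rng.standard_normal(text_dim)
video_vecs = rng.standard_normal((5, video_dim))

order, scores = rank_videos(query_vec, video_vecs, W_text, W_video)
print("ranking (best first):", order.tolist())
```

In a trained system the two projections would be learned jointly so that matching video-query pairs score higher than mismatched ones; here they are random, so the ranking itself carries no meaning.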



Dual Encoding for Video Retrieval by Text