8 months ago

Video Understanding

Object Detection

Multimodal Representation

Computer Vision

Antoine Yang Antoine Miech Josef Sivic Ivan Laptev Cordelia Schmid

Abstract

We consider the problem of localizing a spatio-temporal tube in a video corresponding to a given text query. This is a challenging task that requires the joint and efficient modeling of temporal, spatial and multi-modal interactions. To address this task, we propose TubeDETR, a transformer-based architecture inspired by the recent success of such models for text-conditioned object detection. Our model notably includes: (i) an efficient video and text encoder that models spatial multi-modal interactions over sparsely sampled frames and (ii) a space-time decoder that jointly performs spatio-temporal localization. We demonstrate the advantage of our proposed components through an extensive ablation study. We also evaluate our full approach on the spatio-temporal video grounding task and demonstrate improvements over the state of the art on the challenging VidSTG and HC-STVG benchmarks. Code and trained models are publicly available at https://antoyang.github.io/tubedetr.html.
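To make the two central notions in the abstract concrete, the following is a minimal illustrative sketch, not the TubeDETR implementation. It assumes a spatio-temporal tube can be represented as one bounding box per frame over a contiguous frame interval, and that sparse sampling means taking every `stride`-th frame. The names `Tube`, `Box`, `sample_sparse` and `stride`, and the `(x1, y1, x2, y2)` box convention, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

# Assumed box convention: (x1, y1, x2, y2) corner coordinates.
Box = Tuple[float, float, float, float]


@dataclass
class Tube:
    """Hypothetical tube: one box per frame over frames start..end (inclusive)."""
    start: int
    end: int
    boxes: Dict[int, Box]  # frame index -> box

    def __post_init__(self) -> None:
        if self.end < self.start:
            raise ValueError("end must not precede start")
        # A tube is only well defined if every frame in the interval has a box.
        missing = [t for t in range(self.start, self.end + 1) if t not in self.boxes]
        if missing:
            raise ValueError(f"no box for frames {missing}")

    def num_frames(self) -> int:
        """Temporal extent of the tube, counted in frames."""
        return self.end - self.start + 1


def sample_sparse(num_frames: int, stride: int) -> List[int]:
    """Return every stride-th frame index (a simple, assumed sampling scheme)."""
    if stride < 1:
        raise ValueError("stride must be at least 1")
    return list(range(0, num_frames, stride))


# Example: a tube covering frames 2..4 of a 10-frame video.
tube = Tube(start=2, end=4, boxes={t: (10.0, 20.0, 50.0, 80.0) for t in (2, 3, 4)})
sparse_frames = sample_sparse(num_frames=10, stride=4)
```

Under these assumptions, a grounding model would predict both the interval (`start`, `end`) and the per-frame boxes, while its encoder would only process the frames listed in `sparse_frames`.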
