
Visual-Textual Capsule Routing for Text-Based Video Segmentation

Mubarak Shah, Yogesh S Rawat, Kevin Duarte, Bruce McIntosh

Abstract

Joint understanding of vision and natural language is a challenging problem with a wide range of applications in artificial intelligence. In this work, we focus on the integration of video and text for the task of actor and action video segmentation from a sentence. We propose a capsule-based approach which performs pixel-level localization based on a natural language query describing the actor of interest. We encode both the video and textual input in the form of capsules, which provide a more effective representation than standard convolution-based features. Our novel visual-textual routing mechanism allows for the fusion of video and text capsules to successfully localize the actor and action. Existing works on actor-action localization mainly focus on localization in a single frame instead of the full video. In contrast, we propose to perform the localization on all frames of the video. To validate the potential of the proposed network for actor and action video localization, we extend an existing actor-action dataset (A2D) with annotations for all the frames. The experimental evaluation demonstrates the effectiveness of our capsule network for text-selective actor and action localization in videos. The proposed method also improves upon the performance of the existing state-of-the-art works on single-frame localization.

Benchmarks

Benchmark: referring-expression-segmentation-on-a2d
Methodology: VT-Capsule
Metrics:
AP: 0.303
IoU mean: 0.460
IoU overall: 0.568
Precision@0.5: 0.526
Precision@0.6: 0.450
Precision@0.7: 0.345
Precision@0.8: 0.207
Precision@0.9: 0.036

Benchmark: referring-expression-segmentation-on-j-hmdb
Methodology: VT-Capsule
Metrics:
AP: 0.261
IoU mean: 0.550
IoU overall: 0.535
Precision@0.5: 0.677
Precision@0.6: 0.513
Precision@0.7: 0.283
Precision@0.8: 0.051
Precision@0.9: 0.000
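
For readers unfamiliar with the metric names above, the following is a minimal sketch of how IoU mean, IoU overall, and Precision@K are commonly computed for binary segmentation masks. It is not taken from the paper: the function names, the strict `>` comparison for Precision@K, and the handling of empty masks are assumptions made for illustration, and AP is left out because its exact threshold range is not stated on this page.

```python
from typing import Dict, Sequence, Tuple

# Hypothetical mask type for this sketch: rows of 0/1 pixel values.
Mask = Sequence[Sequence[int]]


def intersection_and_union(pred: Mask, truth: Mask) -> Tuple[int, int]:
    """Count the pixels in the intersection and union of two binary masks."""
    inter = union = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            inter += 1 if (p and t) else 0
            union += 1 if (p or t) else 0
    return inter, union


def segmentation_metrics(
    preds: Sequence[Mask],
    truths: Sequence[Mask],
    thresholds: Sequence[float] = (0.5, 0.6, 0.7, 0.8, 0.9),
) -> Dict[str, float]:
    """Compute mean IoU, overall IoU, and Precision@K over a set of samples."""
    if not preds or len(preds) != len(truths):
        raise ValueError("preds and truths must be non-empty and equal in length")

    per_sample_iou = []
    total_inter = total_union = 0
    for pred, truth in zip(preds, truths):
        inter, union = intersection_and_union(pred, truth)
        total_inter += inter
        total_union += union
        # Assumption: two empty masks count as a perfect match (IoU = 1).
        per_sample_iou.append(inter / union if union else 1.0)

    n = len(per_sample_iou)
    metrics = {
        # Mean IoU: average of the per-sample IoU values.
        "IoU mean": sum(per_sample_iou) / n,
        # Overall IoU: pooled intersection divided by pooled union.
        "IoU overall": total_inter / total_union if total_union else 1.0,
    }
    for k in thresholds:
        # Assumption: a sample counts as correct when its IoU exceeds K.
        metrics[f"Precision@{k}"] = sum(iou > k for iou in per_sample_iou) / n
    return metrics


if __name__ == "__main__":
    preds = [[[1, 1], [0, 0]], [[1, 0], [0, 0]]]
    truths = [[[1, 1], [0, 0]], [[1, 1], [1, 1]]]
    for name, value in segmentation_metrics(preds, truths).items():
        print(f"{name}: {value:.3f}")
```

Because overall IoU pools pixels across samples, large objects weigh more heavily in it than in mean IoU, which is one reason the two numbers can differ on the same benchmark.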
