3 months ago

Tell Me What Happened: Unifying Text-guided Video Completion via Multimodal Masked Video Generation

Tsu-Jui Fu, Licheng Yu, Ning Zhang, Cheng-Yang Fu, Jong-Chyi Su, William Yang Wang, Sean Bell

Abstract

Generating a video from its first few static frames is challenging because the model must anticipate reasonable future frames with temporal coherence. Besides video prediction, the abilities to rewind from the last frame and to infill between the head and tail frames are also crucial, but they have rarely been explored for video completion. Since a few frames alone can hint at many different outcomes, a system that follows natural language instructions to perform video completion may significantly improve controllability. Inspired by this, we introduce a novel task, text-guided video completion (TVC), which asks the model to generate a video from partial frames guided by an instruction. We then propose Multimodal Masked Video Generation (MMVG) to address the TVC task. During training, MMVG discretizes the video frames into visual tokens and masks most of them, so that it learns to complete a video from any time point. At inference time, a single MMVG model can address all three cases of TVC (video prediction, rewind, and infilling) by applying the corresponding masking conditions. We evaluate MMVG in various video scenarios, including egocentric, animation, and gaming. Extensive experimental results indicate that MMVG is effective in generating high-quality visual appearances with text guidance for TVC.
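
The three masking conditions mentioned in the abstract can be illustrated with a minimal sketch. It is a simplification under stated assumptions, not the authors' implementation: the abstract says MMVG masks visual tokens, whereas this sketch masks whole frames, and the function name `tvc_frame_mask`, its parameters, and the even head/tail split for infilling are all hypothetical choices made for illustration.

```python
from typing import List


def tvc_frame_mask(num_frames: int, num_given: int, task: str) -> List[bool]:
    """Return a per-frame mask: True = frame to generate, False = frame given.

    Hypothetical helper for illustration only; frame-level masking is a
    simplification of the token-level masking described in the abstract.
    """
    if not 0 < num_given < num_frames:
        raise ValueError("num_given must be in [1, num_frames - 1]")

    if task == "prediction":
        # The first frames are given; everything after them is generated.
        given = set(range(num_given))
    elif task == "rewind":
        # The last frames are given; everything before them is generated.
        given = set(range(num_frames - num_given, num_frames))
    elif task == "infilling":
        # Head and tail frames are given; the middle is generated.
        if num_given < 2:
            raise ValueError("infilling needs at least one head and one tail frame")
        head = (num_given + 1) // 2  # assumed split: head gets the extra frame
        tail = num_given - head
        given = set(range(head)) | set(range(num_frames - tail, num_frames))
    else:
        raise ValueError(f"unknown task: {task!r}")

    return [i not in given for i in range(num_frames)]


if __name__ == "__main__":
    for task, k in [("prediction", 1), ("rewind", 1), ("infilling", 2)]:
        print(task, tvc_frame_mask(5, k, task))
```

The point of the sketch is that a single generator can serve all three cases, because only the mask passed to it changes between prediction, rewind, and infilling.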

Code Repositories

tsujuifu/pytorch_tvc (official, PyTorch)

Benchmarks

Benchmark | Methodology | Metrics
text-to-video-generation-on-msr-vtt | MMVG | CLIPSIM: 0.2644; FID: 23.4
video-generation-on-ucf-101 | MMVG (128x128, class-conditional) | FVD16: 328; Inception Score: 73.7
video-generation-on-ucf-101 | MMVG (128x128, unconditional) | FVD16: 395; Inception Score: 58.3
video-prediction-on-bair-robot-pushing-1 | MMVG | FVD: 85.2
