Agrim Gupta, Lijun Yu, Kihyuk Sohn, Xiuye Gu, Meera Hahn, Li Fei-Fei, Irfan Essa, Lu Jiang, José Lezama

Abstract
We present W.A.L.T, a transformer-based approach for photorealistic video generation via diffusion modeling. Our approach rests on two key design decisions. First, we use a causal encoder to jointly compress images and videos within a unified latent space, enabling training and generation across modalities. Second, for memory and training efficiency, we use a window attention architecture tailored for joint spatial and spatiotemporal generative modeling. Taken together, these design decisions enable us to achieve state-of-the-art performance on established video (UCF-101 and Kinetics-600) and image (ImageNet) generation benchmarks without using classifier-free guidance. Finally, we also train a cascade of three models for the task of text-to-video generation, consisting of a base latent video diffusion model and two video super-resolution diffusion models, to generate videos of $512 \times 896$ resolution at $8$ frames per second.
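To make the window attention idea concrete, the sketch below shows self-attention restricted to non-overlapping windows over a video latent, alternating between a spatial window (one frame at a time, which also covers image latents) and a spatiotemporal window (all frames within a small spatial patch). This is a minimal illustrative sketch assuming PyTorch; the tensor layout, window sizes, and module names are assumptions for illustration, not the paper's exact implementation.

```python
# Minimal sketch of window-restricted self-attention over video latents,
# assuming a latent tensor of shape (B, T, H, W, C). Window sizes, names,
# and the alternating spatial/spatiotemporal scheme are illustrative only.
import torch
import torch.nn as nn


def window_partition(x, wt, wh, ww):
    """Split (B, T, H, W, C) into non-overlapping (wt, wh, ww) windows."""
    B, T, H, W, C = x.shape
    x = x.view(B, T // wt, wt, H // wh, wh, W // ww, ww, C)
    # -> (num_windows * B, tokens_per_window, C)
    return x.permute(0, 1, 3, 5, 2, 4, 6, 7).reshape(-1, wt * wh * ww, C)


def window_unpartition(windows, wt, wh, ww, B, T, H, W):
    """Inverse of window_partition."""
    C = windows.shape[-1]
    x = windows.view(B, T // wt, H // wh, W // ww, wt, wh, ww, C)
    return x.permute(0, 1, 4, 2, 5, 3, 6, 7).reshape(B, T, H, W, C)


class WindowAttentionBlock(nn.Module):
    """Self-attention restricted to local windows.

    window=(1, wh, ww)  -> spatial attention within a single frame
                           (works for both image and video latents).
    window=(T, wh, ww)  -> spatiotemporal attention across all frames
                           inside a small spatial patch.
    """

    def __init__(self, dim, num_heads, window):
        super().__init__()
        self.window = window
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):  # x: (B, T, H, W, C)
        B, T, H, W, C = x.shape
        wt, wh, ww = self.window
        wt = min(wt, T)  # images (T == 1) fall back to spatial-only windows
        tokens = window_partition(self.norm(x), wt, wh, ww)
        attn_out, _ = self.attn(tokens, tokens, tokens)
        return x + window_unpartition(attn_out, wt, wh, ww, B, T, H, W)


if __name__ == "__main__":
    latents = torch.randn(2, 8, 16, 16, 128)            # (B, T, H, W, C)
    spatial = WindowAttentionBlock(128, 4, window=(1, 8, 8))
    spatiotemporal = WindowAttentionBlock(128, 4, window=(8, 4, 4))
    out = spatiotemporal(spatial(latents))
    print(out.shape)  # torch.Size([2, 8, 16, 16, 128])
```

Because attention never spans more than one window, memory scales with the window size rather than the full spatiotemporal sequence length, which is the efficiency motivation stated in the abstract.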
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| text-to-video-generation-on-ucf-101 | W.A.L.T 3B | FVD16: 258.1 |
| video-generation-on-kinetics-600-12-frames | W.A.L.T-L | FVD: 3.3±0.0 |
| video-generation-on-ucf-101 | W.A.L.T-XL (class-conditional) | FVD16: 36±2 |
| video-generation-on-ucf-101 | W.A.L.T 3B (text-conditional) | FVD16: 258.1, Inception Score: 35.1 |
| video-prediction-on-kinetics-600-12-frames | W.A.L.T-L | FVD: 3.3 |