Agrim Gupta, Lijun Yu, Kihyuk Sohn, Xiuye Gu, Meera Hahn, Li Fei-Fei, Irfan Essa, Lu Jiang, José Lezama

Abstract
We present W.A.L.T, a transformer-based approach for photorealistic video generation via diffusion modeling. Our approach rests on two key design decisions. First, we use a causal encoder to jointly compress images and videos within a unified latent space, enabling training and generation across modalities. Second, for memory and training efficiency, we use a window attention architecture tailored for joint spatial and spatiotemporal generative modeling. Taken together, these design decisions enable us to achieve state-of-the-art performance on established video (UCF-101 and Kinetics-600) and image (ImageNet) generation benchmarks without using classifier-free guidance. Finally, we also train a cascade of three models for the task of text-to-video generation, consisting of a base latent video diffusion model and two video super-resolution diffusion models, to generate videos of $512 \times 896$ resolution at $8$ frames per second.
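To make the window attention idea concrete, the sketch below shows self-attention restricted to non-overlapping windows over a video latent, alternating between a spatial window (one frame at a time, which also covers image latents) and a spatiotemporal window (all frames within a small spatial patch). This is a minimal illustrative sketch assuming PyTorch; the tensor layout, window sizes, and module names are assumptions for illustration, not the paper's exact implementation.

```python
# Minimal sketch of window-restricted self-attention over video latents,
# assuming a latent tensor of shape (B, T, H, W, C). Window sizes, names,
# and the alternating spatial/spatiotemporal scheme are illustrative only.
import torch
import torch.nn as nn


def window_partition(x, wt, wh, ww):
    """Split (B, T, H, W, C) into non-overlapping (wt, wh, ww) windows."""
    B, T, H, W, C = x.shape
    x = x.view(B, T // wt, wt, H // wh, wh, W // ww, ww, C)
    # -> (num_windows * B, tokens_per_window, C)
    return x.permute(0, 1, 3, 5, 2, 4, 6, 7).reshape(-1, wt * wh * ww, C)


def window_unpartition(windows, wt, wh, ww, B, T, H, W):
    """Inverse of window_partition."""
    C = windows.shape[-1]
    x = windows.view(B, T // wt, H // wh, W // ww, wt, wh, ww, C)
    return x.permute(0, 1, 4, 2, 5, 3, 6, 7).reshape(B, T, H, W, C)


class WindowAttentionBlock(nn.Module):
    """Self-attention restricted to local windows.

    window=(1, wh, ww)  -> spatial attention within a single frame
                           (works for both image and video latents).
    window=(T, wh, ww)  -> spatiotemporal attention across all frames
                           inside a small spatial patch.
    """

    def __init__(self, dim, num_heads, window):
        super().__init__()
        self.window = window
        self.norm = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x):  # x: (B, T, H, W, C)
        B, T, H, W, C = x.shape
        wt, wh, ww = self.window
        wt = min(wt, T)  # images (T == 1) fall back to spatial-only windows
        tokens = window_partition(self.norm(x), wt, wh, ww)
        attn_out, _ = self.attn(tokens, tokens, tokens)
        return x + window_unpartition(attn_out, wt, wh, ww, B, T, H, W)


if __name__ == "__main__":
    latents = torch.randn(2, 8, 16, 16, 128)            # (B, T, H, W, C)
    spatial = WindowAttentionBlock(128, 4, window=(1, 8, 8))
    spatiotemporal = WindowAttentionBlock(128, 4, window=(8, 4, 4))
    out = spatiotemporal(spatial(latents))
    print(out.shape)  # torch.Size([2, 8, 16, 16, 128])
```

Because attention never spans more than one window, memory scales with the window size rather than the full spatiotemporal sequence length, which is the efficiency motivation stated in the abstract.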
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| text-to-video-generation-on-ucf-101 | W.A.L.T 3B | FVD16: 258.1 |
| video-generation-on-kinetics-600-12-frames | W.A.L.T-L | FVD: 3.3±0.0 |
| video-generation-on-ucf-101 | W.A.L.T-XL (class-conditional) | FVD16: 36±2 |
| video-generation-on-ucf-101 | W.A.L.T 3B (text-conditional) | FVD16: 258.1, Inception Score: 35.1 |
| video-prediction-on-kinetics-600-12-frames | W.A.L.T-L | FVD: 3.3 |