Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models
Songwei Ge Seungjun Nah Guilin Liu Tyler Poon Andrew Tao Bryan Catanzaro David Jacobs Jia-Bin Huang Ming-Yu Liu Yogesh Balaji

Abstract
Despite tremendous progress in generating high-quality images with diffusion models, synthesizing sequences of animated frames that are both photorealistic and temporally coherent is still in its infancy. While off-the-shelf billion-scale datasets are available for image generation, collecting video data of a similar scale remains challenging. Moreover, training a video diffusion model is computationally much more expensive than its image counterpart. In this work, we explore finetuning a pretrained image diffusion model with video data as a practical solution to the video synthesis task. We find that naively extending the image noise prior to a video noise prior yields sub-optimal performance, whereas our carefully designed video noise prior leads to substantially better results. Extensive experimental validation shows that our model, Preserve Your Own Correlation (PYoCo), attains SOTA zero-shot text-to-video results on the UCF-101 and MSR-VTT benchmarks. It also achieves SOTA video generation quality on the small-scale UCF-101 benchmark with a $10\times$ smaller model, using significantly less computation than the prior art.
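The abstract does not spell out the noise prior's exact formulation, but the key idea it describes is sampling per-frame diffusion noise that is correlated across frames rather than i.i.d. A minimal sketch of one such construction, assuming a "shared plus independent" mixing scheme (the function name, the mixing parameter `alpha`, and the variance split are illustrative assumptions, not the paper's verbatim design):

```python
import numpy as np

def mixed_video_noise(num_frames, shape, alpha=1.0, rng=None):
    """Sample per-frame Gaussian noise with a component shared across
    frames, so consecutive frames are correlated while each frame's
    marginal distribution remains N(0, I).

    NOTE: illustrative sketch only; `alpha` and the variance split are
    assumptions, not taken from the PYoCo paper text above.
    """
    rng = np.random.default_rng() if rng is None else rng
    # Shared component, variance alpha^2 / (1 + alpha^2).
    shared = rng.standard_normal(shape) * np.sqrt(alpha**2 / (1 + alpha**2))
    frames = []
    for _ in range(num_frames):
        # Independent per-frame component, variance 1 / (1 + alpha^2).
        ind = rng.standard_normal(shape) * np.sqrt(1 / (1 + alpha**2))
        # The two variances sum to 1, so each frame is marginally N(0, I).
        frames.append(shared + ind)
    return np.stack(frames)
```

With this split, the correlation between any two frames is `alpha**2 / (1 + alpha**2)` (0.5 at `alpha=1`), while `alpha=0` recovers the naive i.i.d. image prior the abstract reports as sub-optimal.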
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| text-to-video-generation-on-ucf-101 | PYoCo (zero-shot, 64x64) | FVD16: 355.19 |
| video-generation-on-ucf-101 | PYoCo (zero-shot, 64x64, text-conditional) | FVD16: 355.19, Inception Score: 47.76 |
| video-generation-on-ucf-101 | PYoCo (zero-shot, 64x64, unconditional) | FVD16: 310, Inception Score: 60.01 |