| MCVD | 2460 | - | 148 | Latent Video Diffusion Models for High-Fidelity Long Video Generation |  | 
| VDM | 1396 | - | 116 | Latent Video Diffusion Models for High-Fidelity Long Video Generation |  | 
| TGAN-v2 (128x128) | 1209 | - | - | Latent Video Diffusion Models for High-Fidelity Long Video Generation |  | 
| MCVD (64x64) | 1143 | - | - | MCVD: Masked Conditional Video Diffusion for Prediction, Generation, and Interpolation |  | 
| MoCoGAN-HD (256x256, unconditional) | 700 | 33.95 | - | A Good Image Generator Is What You Need for High-Resolution Video Synthesis |  | 
| MagicVideo (256x256, text-conditional) | 699 | - | - | MagicVideo: Efficient Video Generation With Latent Diffusion Models | - | 
| TATS (256x256) | 635 | - | 55 | Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer |  | 
| DIGAN (128x128, unconditional) | 577 | 32.70 | - | Generating Videos with Dynamics-aware Implicit Generative Adversarial Networks |  | 
| LVDM (256x256, unconditional) | 552 | - | 42 | Latent Video Diffusion Models for High-Fidelity Long Video Generation |  | 
| Video LDM (320x512, text-conditional) | 550.61 | 33.45 | - | Align your Latents: High-Resolution Video Synthesis with Latent Diffusion Models |  | 
| LAVIE (320x512, text-conditional) | 526.30 | - | - | LAVIE: High-Quality Video Generation with Cascaded Latent Diffusion Models |  | 
| DIGAN (128x128, class-conditional) | 465 | 59.68 | 39.6 | Generating Videos with Dynamics-aware Implicit Generative Adversarial Networks |  | 
| MeBT (128x128, unconditional) | 438 | 65.93 | - | Towards End-to-End Generative Modeling of Long Videos with Memory-Efficient Bidirectional Transformers |  | 
| TATS (128x128, unconditional) | 420 | 57.63 | - | Long Video Generation with Time-Agnostic VQGAN and Time-Sensitive Transformer |  | 
| MMVG (128x128, unconditional) | 395 | 58.3 | - | Tell Me What Happened: Unifying Text-guided Video Completion via Multimodal Masked Video Generation |  | 
| LVDM (256x256, unconditional) | 372 | - | 27 | Latent Video Diffusion Models for High-Fidelity Long Video Generation |  | 
| Make-A-Video (Zero-shot, 256x256, class-conditional) | 367.23 | 33 | - | Make-A-Video: Text-to-Video Generation without Text-Video Data |  | 
| PYoCo (Zero-shot, 64x64, text-conditional) | 355.19 | 47.76 | - | Preserve Your Own Correlation: A Noise Prior for Video Diffusion Models | - | 
| VideoPoet (text-conditional) | 355 | 38.44 | - | VideoPoet: A Large Language Model for Zero-Shot Video Generation | - | 
| VideoAssembler (Zero-shot, 256x256, class-conditional) | 346.84 | 48.01 | - | MagDiff: Multi-Alignment Diffusion for High-Fidelity Video Generation and Editing |  |