Wilson Yan Yunzhi Zhang Pieter Abbeel Aravind Srinivas

Abstract
We present VideoGPT: a conceptually simple architecture for scaling likelihood-based generative modeling to natural videos. VideoGPT uses a VQ-VAE that learns downsampled discrete latent representations of raw video by employing 3D convolutions and axial self-attention. A simple GPT-like architecture is then used to autoregressively model the discrete latents using spatio-temporal position encodings. Despite the simplicity of the formulation and ease of training, our architecture is able to generate samples competitive with state-of-the-art GAN models for video generation on the BAIR Robot dataset, and to generate high-fidelity natural videos from UCF-101 and the Tumblr GIF dataset (TGIF). We hope our proposed architecture serves as a reproducible reference for a minimalistic implementation of transformer-based video generation models. Samples and code are available at https://wilson1yan.github.io/videogpt/index.html
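The core of the VQ-VAE stage described above is vector quantization: each continuous encoder output is snapped to its nearest entry in a learned codebook, producing the discrete latent indices that the GPT prior later models. The sketch below illustrates that lookup step with NumPy; the function name, shapes, and codebook size are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def quantize(z_e, codebook):
    """Map each encoder output to its nearest codebook entry (illustrative sketch).

    z_e:      (N, D) array of encoder outputs (flattened spatio-temporal grid)
    codebook: (K, D) array of K learned embedding vectors
    returns:  (indices, z_q) -- discrete latent codes and their quantized vectors
    """
    # Squared Euclidean distance between every latent and every codebook entry
    dists = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)  # (N, K)
    indices = dists.argmin(axis=1)   # discrete codes, shape (N,)
    z_q = codebook[indices]          # quantized latents, shape (N, D)
    return indices, z_q

# Toy usage: 16 latent positions, 8-dim embeddings, codebook of 512 entries
rng = np.random.default_rng(0)
codes, z_q = quantize(rng.normal(size=(16, 8)), rng.normal(size=(512, 8)))
```

In the full model, the sequence of `codes` over the downsampled 3D grid is what the GPT-style transformer models autoregressively.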
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| video-generation-on-bair-robot-pushing | VideoGPT | FVD: 103.3 (1 conditioning frame, 15 predicted frames, 15 training frames) |
| video-generation-on-ucf-101-16-frames-128x128 | VideoGPT | Inception Score: 24.69 |