Patrick Esser, Robin Rombach, Björn Ommer

Abstract
Designed to learn long-range interactions on sequential data, transformers continue to show state-of-the-art results on a wide variety of tasks. In contrast to CNNs, they contain no inductive bias that prioritizes local interactions. This makes them expressive, but also computationally infeasible for long sequences, such as high-resolution images. We demonstrate how combining the effectiveness of the inductive bias of CNNs with the expressivity of transformers enables them to model and thereby synthesize high-resolution images. We show how to (i) use CNNs to learn a context-rich vocabulary of image constituents, and in turn (ii) utilize transformers to efficiently model their composition within high-resolution images. Our approach is readily applied to conditional synthesis tasks, where both non-spatial information, such as object classes, and spatial information, such as segmentations, can control the generated image. In particular, we present the first results on semantically-guided synthesis of megapixel images with transformers and obtain the state of the art among autoregressive models on class-conditional ImageNet. Code and pretrained models can be found at https://github.com/CompVis/taming-transformers.
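The PyTorch sketch below illustrates the two-stage idea from the abstract under toy assumptions: a small CNN encoder quantizes an image into discrete codebook indices, and a causal transformer models the index sequence autoregressively. All names and sizes here (`ToyEncoder`, `ToyPrior`, `codebook_size`, layer widths) are illustrative assumptions, not the paper's actual VQGAN/transformer architecture, which additionally uses adversarial and perceptual losses.

```python
# Minimal, self-contained sketch of the two-stage idea: (i) a CNN encoder plus
# vector quantization turns an image into a short sequence of discrete codebook
# indices, and (ii) a causal transformer models that sequence autoregressively.
# All modules, names, and sizes below are illustrative assumptions, not the
# paper's actual VQGAN / GPT architecture.
import torch
import torch.nn as nn

codebook_size, embed_dim = 1024, 256


class ToyEncoder(nn.Module):
    """Stage (i): CNN that compresses an image into a grid of feature vectors."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, embed_dim, 4, stride=2, padding=1),
        )

    def forward(self, x):
        return self.net(x)  # (B, embed_dim, H/4, W/4)


codebook = nn.Embedding(codebook_size, embed_dim)


def quantize(z):
    """Snap each spatial feature to its nearest codebook entry, return indices."""
    b, c, h, w = z.shape
    flat = z.permute(0, 2, 3, 1).reshape(-1, c)        # (B*H*W, C)
    dists = torch.cdist(flat, codebook.weight)         # distances to all K codes
    return dists.argmin(dim=1).view(b, h * w)          # (B, H*W) discrete indices


class ToyPrior(nn.Module):
    """Stage (ii): causal transformer over the sequence of codebook indices."""
    def __init__(self, seq_len):
        super().__init__()
        self.tok = nn.Embedding(codebook_size, embed_dim)
        self.pos = nn.Parameter(torch.zeros(1, seq_len, embed_dim))
        layer = nn.TransformerEncoderLayer(embed_dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(embed_dim, codebook_size)

    def forward(self, idx):
        t = idx.size(1)
        h = self.tok(idx) + self.pos[:, :t]
        causal = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)
        return self.head(self.blocks(h, mask=causal))  # (B, T, codebook_size) logits


if __name__ == "__main__":
    img = torch.randn(2, 3, 32, 32)                 # toy stand-in for an image
    idx = quantize(ToyEncoder()(img))               # image -> 8x8 = 64 code indices
    logits = ToyPrior(seq_len=idx.size(1))(idx)     # autoregressive prior over indices
    loss = nn.functional.cross_entropy(             # teacher-forced next-index loss
        logits[:, :-1].reshape(-1, codebook_size), idx[:, 1:].reshape(-1)
    )
    print(idx.shape, logits.shape, float(loss))
```

The point of the quantization step is that it shortens the sequence the transformer must attend over: instead of modeling pixels directly, the transformer composes a small grid of context-rich codes, which is what makes high-resolution synthesis tractable.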
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| deepfake-detection-on-fakeavceleb-1 | VQGAN | AP: 55.0, ROC AUC: 51.8 |
| image-generation-on-celeba-256x256 | VQGAN | FID: 10.2 |
| image-generation-on-celeba-hq-256x256 | VQGAN+Transformer | FID: 10.2 |
| image-generation-on-ffhq-256-x-256 | VQGAN+Transformer | FID: 9.6 |
| image-generation-on-imagenet-256x256 | VQGAN+Transformer (k=600, p=1.0, a=0.05) | FID: 5.2 |
| image-generation-on-imagenet-256x256 | VQGAN+Transformer (k=mixed, p=1.0, a=0.005) | FID: 6.59 |
| image-outpainting-on-lhqc | Taming | Block-FID (Right Extend): 22.53, Block-FID (Down Extend): 26.38, Block-FID (Left Extend): -, Block-FID (Up Extend): - |
| image-reconstruction-on-imagenet | Taming-VQGAN (16x16) | FID: 3.64, LPIPS: 0.177, PSNR: 19.93, SSIM: 0.542 |
| image-to-image-translation-on-ade20k-labels | VQGAN+Transformer | FID: 35.5 |
| image-to-image-translation-on-coco-stuff | VQGAN+Transformer | FID: 22.4 |
| text-to-image-generation-on-conceptual | VQ-GAN | FID: 28.86 |
| text-to-image-generation-on-lhqc | Taming | Block-FID: 38.89 |