Patrick Esser, Robin Rombach, Björn Ommer

Abstract
Designed to learn long-range interactions on sequential data, transformers continue to show state-of-the-art results on a wide variety of tasks. In contrast to CNNs, they contain no inductive bias that prioritizes local interactions. This makes them expressive, but also computationally infeasible for long sequences, such as high-resolution images. We demonstrate how combining the effectiveness of the inductive bias of CNNs with the expressivity of transformers enables them to model and thereby synthesize high-resolution images. We show how to (i) use CNNs to learn a context-rich vocabulary of image constituents, and in turn (ii) utilize transformers to efficiently model their composition within high-resolution images. Our approach is readily applied to conditional synthesis tasks, where both non-spatial information, such as object classes, and spatial information, such as segmentations, can control the generated image. In particular, we present the first results on semantically-guided synthesis of megapixel images with transformers and obtain the state of the art among autoregressive models on class-conditional ImageNet. Code and pretrained models can be found at https://github.com/CompVis/taming-transformers.
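The PyTorch sketch below illustrates the two-stage idea from the abstract under toy assumptions: a small CNN encoder quantizes an image into discrete codebook indices, and a causal transformer models the index sequence autoregressively. All names and sizes here (`ToyEncoder`, `ToyPrior`, `codebook_size`, layer widths) are illustrative assumptions, not the paper's actual VQGAN/transformer architecture, which additionally uses adversarial and perceptual losses.

```python
# Minimal, self-contained sketch of the two-stage idea: (i) a CNN encoder plus
# vector quantization turns an image into a short sequence of discrete codebook
# indices, and (ii) a causal transformer models that sequence autoregressively.
# All modules, names, and sizes below are illustrative assumptions, not the
# paper's actual VQGAN / GPT architecture.
import torch
import torch.nn as nn

codebook_size, embed_dim = 1024, 256


class ToyEncoder(nn.Module):
    """Stage (i): CNN that compresses an image into a grid of feature vectors."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, embed_dim, 4, stride=2, padding=1),
        )

    def forward(self, x):
        return self.net(x)  # (B, embed_dim, H/4, W/4)


codebook = nn.Embedding(codebook_size, embed_dim)


def quantize(z):
    """Snap each spatial feature to its nearest codebook entry, return indices."""
    b, c, h, w = z.shape
    flat = z.permute(0, 2, 3, 1).reshape(-1, c)        # (B*H*W, C)
    dists = torch.cdist(flat, codebook.weight)         # distances to all K codes
    return dists.argmin(dim=1).view(b, h * w)          # (B, H*W) discrete indices


class ToyPrior(nn.Module):
    """Stage (ii): causal transformer over the sequence of codebook indices."""
    def __init__(self, seq_len):
        super().__init__()
        self.tok = nn.Embedding(codebook_size, embed_dim)
        self.pos = nn.Parameter(torch.zeros(1, seq_len, embed_dim))
        layer = nn.TransformerEncoderLayer(embed_dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(embed_dim, codebook_size)

    def forward(self, idx):
        t = idx.size(1)
        h = self.tok(idx) + self.pos[:, :t]
        causal = torch.triu(torch.full((t, t), float("-inf")), diagonal=1)
        return self.head(self.blocks(h, mask=causal))  # (B, T, codebook_size) logits


if __name__ == "__main__":
    img = torch.randn(2, 3, 32, 32)                 # toy stand-in for an image
    idx = quantize(ToyEncoder()(img))               # image -> 8x8 = 64 code indices
    logits = ToyPrior(seq_len=idx.size(1))(idx)     # autoregressive prior over indices
    loss = nn.functional.cross_entropy(             # teacher-forced next-index loss
        logits[:, :-1].reshape(-1, codebook_size), idx[:, 1:].reshape(-1)
    )
    print(idx.shape, logits.shape, float(loss))
```

The point of the quantization step is that it shortens the sequence the transformer must attend over: instead of modeling pixels directly, the transformer composes a small grid of context-rich codes, which is what makes high-resolution synthesis tractable.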
Benchmarks
| Benchmark | Methodology | Metrics |
|---|---|---|
| deepfake-detection-on-fakeavceleb-1 | VQGAN | AP: 55.0, ROC AUC: 51.8 |
| image-generation-on-celeba-256x256 | VQGAN | FID: 10.2 |
| image-generation-on-celeba-hq-256x256 | VQGAN+Transformer | FID: 10.2 |
| image-generation-on-ffhq-256-x-256 | VQGAN+Transformer | FID: 9.6 |
| image-generation-on-imagenet-256x256 | VQGAN+Transformer (k=600, p=1.0, a=0.05) | FID: 5.2 |
| image-generation-on-imagenet-256x256 | VQGAN+Transformer (k=mixed, p=1.0, a=0.005) | FID: 6.59 |
| image-outpainting-on-lhqc | Taming | Block-FID (Right Extend): 22.53, Block-FID (Down Extend): 26.38, Block-FID (Left Extend): -, Block-FID (Up Extend): - |
| image-reconstruction-on-imagenet | Taming-VQGAN (16x16) | FID: 3.64, LPIPS: 0.177, PSNR: 19.93, SSIM: 0.542 |
| image-to-image-translation-on-ade20k-labels | VQGAN+Transformer | FID: 35.5 |
| image-to-image-translation-on-coco-stuff | VQGAN+Transformer | FID: 22.4 |
| text-to-image-generation-on-conceptual | VQ-GAN | FID: 28.86 |
| text-to-image-generation-on-lhqc | Taming | Block-FID: 38.89 |