Abstract

In this work, we propose TediGAN, a novel framework for multi-modal image generation and manipulation with textual descriptions. The proposed method consists of three components: StyleGAN inversion module, visual-linguistic similarity learning, and instance-level optimization. The inversion module maps real images to the latent space of a well-trained StyleGAN. The visual-linguistic similarity learns the text-image matching by mapping the image and text into a common embedding space. The instance-level optimization is for identity preservation in manipulation. Our model can produce diverse and high-quality images with an unprecedented resolution at 1024. Using a control mechanism based on style-mixing, our TediGAN inherently supports image synthesis with multi-modal inputs, such as sketches or semantic labels, with or without instance guidance. To facilitate text-guided multi-modal synthesis, we propose the Multi-Modal CelebA-HQ, a large-scale dataset consisting of real face images and corresponding semantic segmentation map, sketch, and textual descriptions. Extensive experiments on the introduced dataset demonstrate the superior performance of our proposed method. Code and data are available at https://github.com/weihaox/TediGAN.
