Abstract

Learning to generate natural scenes has always been a challenging task in computer vision. It is even more painstaking when the generation is conditioned on images with drastically different views. This is mainly because understanding, corresponding, and transforming appearance and semantic information across the views is not trivial. In this paper, we attempt to solve the novel problem of cross-view image synthesis, aerial to street-view and vice versa, using conditional generative adversarial networks (cGAN). Two new architectures called Crossview Fork (X-Fork) and Crossview Sequential (X-Seq) are proposed to generate scenes with resolutions of 64x64 and 256x256 pixels. The X-Fork architecture has a single discriminator and a single generator. The generator hallucinates both the image and its semantic segmentation in the target view. The X-Seq architecture utilizes two cGANs. The first one generates the target image, which is subsequently fed to the second cGAN for generating its corresponding semantic segmentation map. The feedback from the second cGAN helps the first cGAN generate sharper images. Both of our proposed architectures learn to generate natural images as well as their semantic segmentation maps. The proposed methods capture and maintain the true semantics of objects in source and target views better than the traditional image-to-image translation method, which considers only the visual appearance of the scene. Extensive qualitative and quantitative evaluations support the effectiveness of our frameworks, compared to two state-of-the-art methods, for natural scene generation across drastically different views.
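
To make the difference between the two generator layouts concrete, the following is a minimal data-flow sketch in plain Python. It is not the authors' implementation: the layer function, the hidden size, the point at which the X-Fork generator splits, and all function names are assumptions chosen for illustration, and the discriminators and adversarial training (including the feedback from the second cGAN to the first in X-Seq) are omitted. Images are represented as short flat lists rather than 64x64 or 256x256 arrays.

```python
import math
import random


def make_layer(n_in, n_out, rng):
    """Random linear map followed by tanh; a toy stand-in for a conv stack."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]

    def layer(x):
        return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in weights]

    return layer


def make_generator(n_in, n_hidden, n_out, rng):
    """Hypothetical two-layer encoder/decoder generator."""
    encode = make_layer(n_in, n_hidden, rng)
    decode = make_layer(n_hidden, n_out, rng)
    return lambda x: decode(encode(x))


def build_x_fork(n_pixels, n_hidden, rng):
    """X-Fork-style layout: one generator whose shared trunk forks into two heads."""
    trunk = make_layer(n_pixels, n_hidden, rng)
    image_head = make_layer(n_hidden, n_pixels, rng)
    seg_head = make_layer(n_hidden, n_pixels, rng)

    def generator(source_view):
        hidden = trunk(source_view)  # shared features (fork point is an assumption)
        return image_head(hidden), seg_head(hidden)

    return generator


def build_x_seq(n_pixels, n_hidden, rng):
    """X-Seq-style layout: G1 produces the target image, G2 maps it to a segmentation."""
    g1 = make_generator(n_pixels, n_hidden, n_pixels, rng)
    g2 = make_generator(n_pixels, n_hidden, n_pixels, rng)

    def pipeline(source_view):
        target_image = g1(source_view)
        target_seg = g2(target_image)  # second stage consumes the first stage's output
        return target_image, target_seg

    return pipeline


rng = random.Random(0)
N_PIXELS = 16  # toy size: a flattened 4x4 "image"
aerial = [rng.uniform(-1.0, 1.0) for _ in range(N_PIXELS)]

x_fork = build_x_fork(N_PIXELS, 8, rng)
x_seq = build_x_seq(N_PIXELS, 8, rng)

fork_image, fork_seg = x_fork(aerial)
seq_image, seq_seg = x_seq(aerial)
print(len(fork_image), len(fork_seg), len(seq_image), len(seq_seg))
```

In this sketch the X-Fork outputs are computed from the same hidden features in one pass, whereas the X-Seq segmentation depends on the generated image, which is the structural property that would let a segmentation loss influence the first generator during training.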
