I. Introduction
Image-to-Image (I2I) translation aims to learn the mapping between a source and a target domain, and began to emerge with the proposal of Generative Adversarial Networks (GANs) [2]. Since then, increasing attention has been paid to this task because several visual tasks can be formulated as I2I translation, such as style transfer [3], [4], super-resolution [5], portrait synthesis [6], [7], [8], label-to-image generation [9], [10] and image inpainting [11]. Moreover, great progress has been made in recent years. For example, CycleGAN [12] enforces cycle consistency on the generators during training. Furthermore, UNIT [3] extends the Coupled GAN [13] based on the assumption of a shared latent space. To meet the demand for diverse and multi-modal outputs, methods such as MUNIT [14] and DRIT [15] recombine disentangled image representations. It is noteworthy that the methods above only transfer styles over the whole image without considering the characteristics of individual instances.
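For concreteness, the cycle-consistency constraint of CycleGAN [12] can be sketched as follows, where $G: X \rightarrow Y$ and $F: Y \rightarrow X$ denote the two generators (notation here is illustrative and follows the original paper rather than this work):
\begin{equation}
\mathcal{L}_{cyc}(G, F) = \mathbb{E}_{x \sim p_{data}(x)}\left[ \left\lVert F(G(x)) - x \right\rVert_1 \right] + \mathbb{E}_{y \sim p_{data}(y)}\left[ \left\lVert G(F(y)) - y \right\rVert_1 \right].
\end{equation}
Intuitively, an image translated to the other domain and back should reconstruct the original, which regularizes the mapping in the absence of paired supervision.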