Unbiased Image Style Transfer

Image style transfer generates an output image in the target style, at a specified strength, for a given pair of content and target style images. Recently, feed-forward neural networks have been employed in this process to quickly decode a linearly interpolated feature in the encoded feature space. However, to date, no studies have analyzed the effectiveness of this style interpolation method. In this article, we provide the missing in-depth analysis of style interpolation and propose a new method that is more effective in controlling the strength of the desired style. The existing methods are biased because the network is trained with one-sided data of full style strength. Therefore, such methods do not guarantee a satisfactory output image at an intermediate style strength. To resolve this problem of a biased network, we propose an unbiased learning technique, which uses unbiased training data and loss to allow a feed-forward network to learn the desired regression of style consistent with a specific interpolation function in the encoded feature space. The experimental results verified that our unbiased method achieved better regression learning between the style control parameter and the output image style, and a more stable style transfer that is insensitive to the weight of the style loss, without adding complexity to the image generation process.


© 2018. The copyright of this document resides with its authors. It may be distributed unchanged freely in print or electronic forms.

Introduction
Recent fast image style transfer methods [1,3,4,8,9] use feed-forward networks to generate a stylized output image from an input content image, or from an input pair of a content image and a target style image. Here, the feed-forward networks are trained to learn how to encode features that represent the content and style of an image (encoder), how to change the style of the image in feature space (transformer), and how to generate an image from the style-changed feature (decoder). These approaches also use linear interpolation to generate images of intermediate style between the content image and the target style image, corresponding to a style control parameter α ∈ [0.0, 1.0]. Although they achieved good results in both processing speed and style quality of the stylized output image, it is not guaranteed that they generate a satisfactory output image at an intermediate style strength. To date, this problem has not been dealt with in depth.
In this paper, we tackle the problem of style interpolation caused by biased network training and propose a method that is superior to the current linear interpolation technique. First, we interpret the task of style strength control as regression learning between the style control parameter α and the style strength of the output image. From this viewpoint, the feed-forward networks of previous methods are strongly biased toward full style strength (α = 1.0) and lack training data for intermediate style strengths (α < 1.0). Therefore, we propose an unbiased learning of the style transfer network that uses additional training data and style loss for α < 1.0 in the training phase. As shown in fig.1(b), we use unbiased training data and the corresponding loss for α = 0.0 so that the trained network reconstructs the input content image when α = 0.0 and generates the target stylized image when α = 1.0. This unbiased training also makes it easier to select an appropriate weight for the style loss, because it reduces the network's sensitivity to that weight. Moreover, with additional anchor data and the corresponding loss for 0.0 < α < 1.0, as shown in fig.1(c), our method allows the network to learn the desired regression between the style control parameter and the output style strength, consistent with a specific style interpolation function in the encoded feature space. Figure 2 shows the whole network architecture and the losses of our method.
In the remainder of this paper, sec.2 describes the details of our unbiased style regression learning method, sec.3 presents the experimental results and analysis that verify our method, and sec.4 concludes this work.

Related works
In the first neural approach to image style transfer, Gatys et al. [2] adopted a part of VGGnet [14], a convolutional neural network pre-trained for image classification, as a feature extractor for the content and style of an image. They generated an output image similar to an input content image in content and to a target style image in style by updating pixel values to minimize the sum of the content difference and the style difference using an online gradient-based optimization technique. As similarity measures, they used the difference in VGG feature space as the content loss and the difference in Gram space of the feature as the style loss. Their method produced a good quality of transferred style, but its image generation was very slow because of the online optimization scheme. Inserting a feed-forward network between the input and output images [6,17,18] solved this speed problem. The feed-forward network was trained to generate a stylized output image from an input content image for a target style. This replaced the online pixel optimization process [2] with offline network training and reduced image generation to a single feed-forward pass through the network.
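To make the two similarity measures concrete, the following is a minimal sketch of a feature-space content loss and a Gram-space style loss in the spirit of [2]. It is an illustration only: the feature map is assumed to be a plain C x N list of lists (C channels, N spatial positions), and the normalization by N and the mean squared difference are assumptions, not the exact constants of [2].

```python
def gram_matrix(features):
    """Gram matrix of a C x N feature map (C channels, N spatial positions).

    Dividing by N is an assumed normalization for this sketch.
    """
    n = len(features[0])
    return [[sum(a * b for a, b in zip(fi, fj)) / n for fj in features]
            for fi in features]


def mean_squared_difference(rows_a, rows_b):
    """Mean of squared element-wise differences between two 2-D lists."""
    diffs = [(a - b) ** 2
             for ra, rb in zip(rows_a, rows_b)
             for a, b in zip(ra, rb)]
    return sum(diffs) / len(diffs)


def content_loss(feat_out, feat_content):
    """Difference measured directly in feature space."""
    return mean_squared_difference(feat_out, feat_content)


def style_loss(feat_out, feat_style):
    """Difference measured in Gram space of the feature."""
    return mean_squared_difference(gram_matrix(feat_out), gram_matrix(feat_style))


features = [[1.0, 2.0], [3.0, 4.0]]
shuffled = [[2.0, 1.0], [4.0, 3.0]]  # same activations, spatial positions swapped
print(content_loss(shuffled, features), style_loss(shuffled, features))
```

In this toy case the spatially shuffled feature map has a non-zero content loss but a zero style loss, which illustrates why the Gram matrix captures channel statistics rather than spatial layout.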
Soon after, modified instance normalization layers [1,3,4,9] allowed the trained network to embed multiple or arbitrary styles and to generate output images of mixed style or intermediate style strength. Dumoulin et al. [1] used learnable affine parameters for multiple styles in their conditional instance normalization (CIN) layer to efficiently switch the style of the output image to a desired style by changing second-order statistics in VGG feature space. In addition, they proposed a simple style interpolation technique that generates an output image of mixed style by linearly interpolating the affine parameters of the embedded styles. Huang and Belongie [4] proposed the alternative adaptive instance normalization (AdaIN) layer for transferring the style of an unseen target image to the content image. Instead of learnable parameters, they used human-designed parameters, namely the mean and standard deviation of the VGG feature, to change the feature statistics. Their method also used linear interpolation of the mean and standard deviation in the AdaIN layer to control the style strength of the output image.
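The AdaIN operation with interpolated statistics can be sketched as follows. This is a simplified illustration rather than the implementation of [4]: features are assumed to be C x N lists of lists, the population standard deviation is used, and the stabilizing constant EPS is an assumed value.

```python
from statistics import fmean, pstdev

EPS = 1e-5  # assumed small constant that avoids division by zero


def adain_channel(content, style, alpha=1.0):
    """Re-normalize one content channel toward the style channel statistics.

    alpha = 0.0 keeps the content statistics; alpha = 1.0 uses the style statistics.
    """
    mu_c, sigma_c = fmean(content), pstdev(content) + EPS
    mu_s, sigma_s = fmean(style), pstdev(style) + EPS
    # Linear interpolation of the mean and standard deviation.
    mu = (1.0 - alpha) * mu_c + alpha * mu_s
    sigma = (1.0 - alpha) * sigma_c + alpha * sigma_s
    return [sigma * (x - mu_c) / sigma_c + mu for x in content]


def adain(content_feat, style_feat, alpha=1.0):
    """Apply the channel-wise operation to a C x N feature map."""
    return [adain_channel(c, s, alpha) for c, s in zip(content_feat, style_feat)]


content = [[0.0, 2.0, 4.0]]
style = [[10.0, 10.0, 16.0]]
for alpha in (0.0, 0.5, 1.0):
    print(alpha, adain(content, style, alpha)[0])
```

Under these definitions, interpolating the statistics gives the same result as blending the content feature with the fully stylized feature, (1 − α) · x + α · AdaIN(x), so either form can be used to express the style control parameter α.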
Li et al. [9] used a correlation-aware feature alignment technique called whitening and coloring transform (WCT), also known as correlation alignment (CORAL) [15,16] in object classification. Here, they used covariance instead of standard deviation to account for inter-channel correlation in the feature statistics. They also used linear interpolation for style strength control, as in [4], but achieved a better quality of the reconstructed content image at zero style strength because they trained the decoder network to minimize both a pixel reconstruction loss and a feature loss. However, they did not address the exact regression between the style control parameter and the style strength of the output image.
Additionally, generative adversarial network (GAN) approaches such as Pix2pix [5], CycleGAN [20], and BicycleGAN [21] also dealt with image style transfer as an application of their image-to-image translation task. These methods relaxed the requirements for a well-defined loss for image style difference and for training image pairs, both of which are necessary for an encoder/decoder network. While these approaches achieved a high quality of generated images by focusing on realistic image generation, they did not focus on style strength control.

Method
Our unbiased image style transfer method consists of two strategies. The first is unbiased network learning, which makes the network generate a zero-style image in the style of the content image. The second is regression specification, which controls the style strength of the output image according to a desired characteristic function between the style control parameter and the output style strength.

Unbiased learning for unbiased and stable style transfer
The output images of the previous feed-forward networks [1,4] with the style control parameter α = 0.0 are not the same as the content image but are images of some biased style, as shown in fig.3. These biased zero-style images occur because the decoder networks of the previous methods were trained only with the biased data of {content image, target style image} pairs, which correspond to full style strength (α = 1.0). As a result, the biased decoder does not guarantee an unbiased output of zero style strength when the style control parameter is zero.
As a simple but effective way to solve this problem of the biased decoder, we add the unbiased data of {content image, content image} pairs in every batch iteration of the network training phase, as shown in fig.1(b), together with the corresponding unbiased loss L_unbiased, as shown in fig.2. This means that the losses corresponding to the unbiased data, i.e., the unbiased content loss L_ucontent, the unbiased style loss L_ustyle, and the unbiased total variation loss L_utv, are added to the biased loss L_biased, which is calculated as a weighted summation of the losses of the biased data, i.e., L_content [1,2], L_style [4], and L_tv [2]. These unbiased losses give an additional constraint to the decoder network to generate an output image in the original content style, while the biased losses encourage the decoder network to generate an output image in the target style. The total loss including our unbiased loss is represented as eq.1.
where we add an additional L1 loss, L_reconstruct (eq.1), between the zero-style image I_u and the content image I_c to the total loss L_unbiased, in order to reconstruct the content image when the style strength is zero. L_reconstruct is consistent with that of [9], where an L2 loss was used to train the decoder network. However, an L2 loss is known to produce blurred reconstructed images [5,11,13,19], and [9] used only the unbiased losses of unbiased data (content images) to train the decoder network. In terms of the loss equation, [9] is a specific case of our unbiased learning scheme, because eliminating the biased losses and style losses reduces our total loss (eq.1) to that of [9].
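The way the biased and unbiased terms combine can be sketched in code. The body of eq.1 is not reproduced in this text, so the composition below is an assumption assembled from the surrounding description: the same weights w_c, w_s, w_t are applied to the biased and unbiased terms, and the reconstruction loss is weighted by w_r. The function names are hypothetical, and the default weights follow the experimental setup of this paper (w_c = 1.0, w_t = 10^-3, w_r = 10^2 · w_s).

```python
def weighted_loss(content, style, tv, w_c, w_s, w_t):
    """Weighted sum of content, style and total variation terms."""
    return w_c * content + w_s * style + w_t * tv


def l1_reconstruction(output_pixels, content_pixels):
    """Mean absolute pixel difference between zero-style image I_u and content image I_c."""
    total = sum(abs(o - c) for o, c in zip(output_pixels, content_pixels))
    return total / len(content_pixels)


def total_loss(biased, unbiased, reconstruct, w_s, w_c=1.0, w_t=1e-3, w_r=None):
    """Assumed form of the total loss: L_biased + L_unbiased.

    biased   -- (L_content, L_style, L_tv) for a {content, style} pair, alpha = 1.0
    unbiased -- (L_ucontent, L_ustyle, L_utv) for a {content, content} pair, alpha = 0.0
    """
    if w_r is None:
        w_r = 1e2 * w_s  # ratio taken from the experimental setup
    l_biased = weighted_loss(*biased, w_c, w_s, w_t)
    l_unbiased = weighted_loss(*unbiased, w_c, w_s, w_t) + w_r * reconstruct
    return l_biased + l_unbiased


l_rec = l1_reconstruction([0.0, 1.0], [1.0, 1.0])
print(total_loss((1.0, 2.0, 3.0), (0.5, 0.1, 1.0), l_rec, w_s=50.0))
```

Setting the biased terms and the style terms to zero in this sketch leaves only content, total variation and reconstruction terms on content images, which mirrors the remark that [9] is a specific case of the scheme.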

Regression specification for style control
As shown in fig.1(a), training a network only with the biased data of {content image, target style image} pairs cannot guarantee that the network learns a linear regression between the style control parameter and the style strength of the output image, which is what the previous image style transfer methods [1,4,9] rely on for style interpolation.
To learn a specific regression between the style control parameter and the style strength of the output image, we need additional anchor data, shown as green dots in fig.1(c), and the corresponding anchor losses L_anchor, shown in fig.2, for intermediate values of the style control parameter α. The anchor loss L_anchor is defined in eq.2 in the same manner as L_biased in eq.1. Here, the anchor-style loss L_astyle is the style distance between the output anchor image I_α and the target anchor-style image I_s(α). However, L_astyle cannot be calculated directly from images because the target anchor-style image I_s(α) is not available. Therefore, as an alternative to I_s(α), we use the linear interpolation of the full style feature f_s(I_s) and the zero-style feature f_s(I_c) as the target anchor-style feature. The anchor-style loss can then be calculated as the L2 distance between the target anchor-style feature and the output anchor-style feature f_s(I_α), as in eq.2.
$$
\begin{aligned}
L_{\mathrm{anchor}}(\alpha) &= w_c \cdot L_{\mathrm{acontent}} + w_s \cdot L_{\mathrm{astyle}} + w_t \cdot L_{\mathrm{atv}}, \\
L_{\mathrm{acontent}} &= L_{\mathrm{content}}(I_\alpha, I_c), \\
L_{\mathrm{atv}} &= L_{\mathrm{tv}}(I_\alpha), \\
L_{\mathrm{astyle}} &= L_{\mathrm{style}}(I_\alpha, I_s(\alpha)) = \left\| f_s(I_\alpha) - \left( \alpha \cdot f_s(I_s) + (1 - \alpha) \cdot f_s(I_c) \right) \right\|_2 .
\end{aligned}
\tag{2}
$$

This anchor loss for a desired value of α is added to the total loss of eq.1 in every iteration of the training phase. Once a network is trained as a linear regressor, we can specify an arbitrary regression by using a desired characteristic function f(α) instead of α in the transformer of the network (fig.2).
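The anchor-style term of eq.2 and the use of a characteristic function can be sketched as follows. This is an illustration under assumptions: each style feature is flattened into a plain list of floats, and the helper names are hypothetical.

```python
import math


def anchor_style_target(f_style, f_content, alpha):
    """Target anchor-style feature: alpha * f_s(I_s) + (1 - alpha) * f_s(I_c)."""
    return [alpha * s + (1.0 - alpha) * c for s, c in zip(f_style, f_content)]


def anchor_style_loss(f_out, f_style, f_content, alpha):
    """L2 distance between the output anchor-style feature and the target."""
    target = anchor_style_target(f_style, f_content, alpha)
    return math.sqrt(sum((o - t) ** 2 for o, t in zip(f_out, target)))


def effective_alpha(alpha, characteristic=None):
    """Value passed to the transformer: f(alpha) instead of alpha.

    With no characteristic function the regression stays linear.
    """
    return alpha if characteristic is None else characteristic(alpha)


f_style, f_content = [3.0, 0.0], [0.0, 0.0]
print(anchor_style_loss([1.0, 4.0], f_style, f_content, alpha=1.0 / 3.0))
print(effective_alpha(0.25, math.sqrt))
```

With a network that has learned the linear regression, passing f(α) = √α to the transformer makes the style strength rise faster for small α, while f(α) = α keeps the linear behaviour.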

Experiments
In this section, we analyze our unbiased learning and regression specification methods experimentally in terms of loss and image quality. We also demonstrate the benefits of our method by comparing it with the previous image style transfer methods.

Experimental Setup
We used the encoder-transformer-decoder architecture of AdaIN [4] as the common network configuration, but with the VGG16 feature extractor as the encoder and its mirrored network as the decoder, as shown in fig.2. The output tensors of the {relu1_2, relu2_2, relu3_3, relu4_3} layers were used as the style features and that of the {relu3_3} layer as the content feature. This follows the layer configuration of [6], which uses the VGG16 feature extractor in loss calculation. We set the loss weights to w_c = 1.0, w_t = 10^-3, and w_r = 10^2 · w_s, and varied the style weight (w_s = 50, 10^2, 10^3, 10^4) to analyze how the learned networks behave as the weight of the style loss increases. For training data, we used the MS COCO train2014 dataset [10] as content images and the training dataset of Painter by Numbers [12] as a large set of target style images. Additionally, we used our own collection of 22 style images as a small set of target style images, both to analyze network performance as the number of embedded styles increases and to compare our method with CIN [1], which can be applied only to a small number of target styles. These images were resized to 256 pixels on the short side and cropped to 240 by 240 pixels for data augmentation while retaining a reasonable amount of image content. With these training images, the networks with a CIN layer or an AdaIN layer, with or without our unbiased learning scheme, were trained by the Adam optimizer [7] with a learning rate of 10^-4 (a smaller learning rate of 10^-6 when w_s = 10^4), a batch size of 4, and 4 epochs on the PyTorch v0.3.1 framework with CUDA v9.0, cuDNN v7.0, and an NVIDIA TITAN-X Pascal. In the test phase, we used the MS COCO test2014 dataset [10] and the test dataset of Painter by Numbers [12] as the content images and the target style images, respectively, and all test images were resized to 256 pixels on the short side without cropping before being fed into the networks.
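The resize-and-crop step for training images can be sketched as plain geometry. This is a minimal sketch under assumptions: the aspect ratio is preserved when resizing, the rounding rule is arbitrary, and the crop position is drawn uniformly at random (the text states only that cropping serves as data augmentation). The function names are hypothetical.

```python
import random


def resize_short_side(width, height, short=256):
    """Target size after scaling so that the shorter side equals `short` pixels."""
    scale = short / min(width, height)
    return round(width * scale), round(height * scale)


def random_crop_box(width, height, crop=240, rng=random):
    """Box (left, top, right, bottom) of a crop x crop window inside the image."""
    left = rng.randint(0, width - crop)   # randint includes both end points
    top = rng.randint(0, height - crop)
    return left, top, left + crop, top + crop


w, h = resize_short_side(640, 480)
print((w, h), random_crop_box(w, h, rng=random.Random(0)))
```

In the test phase described above only the resize step applies, since test images are not cropped.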

Results of unbiased learning
As shown in fig.3, the networks with a CIN layer or an AdaIN layer trained with the previous biased training schemes [1,4] on a small set of style images generated full style images of high style quality but heavily biased zero-style images. In contrast, the networks trained with our unbiased scheme generated unbiased zero-style images while maintaining almost the same quality of full style images.
For a more general performance comparison, we trained several networks with a large set of style images and with varying weights of the style loss. Afterward, we measured the average values of the content losses L_content, the style losses L_style, and the unbiased style losses L_ustyle of eq.1 for test style transfer with 100 pairs of {unseen content image, unseen target style image}. Figure 4 shows the measured average losses and their standard deviations scaled by 0.1. When α = 1.0 (full style transfer), our unbiased learning scheme achieved a smaller average content loss than the original AdaIN (blue lines in fig.4(a)) while maintaining almost the same average style loss and unbiased style loss as the original AdaIN (blue lines in fig.4(b), (c)). This means that the fully stylized images of our method (odd rows of fig.5(b)) have less degradation in content than those of the previous method (odd rows of fig.5(a)) for the same stylization quality. When α = 0.0 (zero-style transfer), our unbiased learning scheme achieved a much smaller average content loss and unbiased style loss than the original AdaIN (red lines in fig.4(a), (c)) while maintaining a higher average style loss (red lines in fig.4(b)). This means that the zero-style images of our method (even rows of fig.5(b)) are almost reconstructed into the original content images, while those of the previous method (even rows of fig.5(a)) are quite different from the original content images.
As the weight of the style loss w_s increases, the content losses at α = 1.0 and α = 0.0 and the unbiased style loss at α = 0.0 also increase, as shown in fig.4(a) and (c). However, the increase is much smaller with our unbiased learning scheme than with the original AdaIN. This means that our method achieved stable stylization performance, maintaining the desired content and style of the output image even with a large style weight. This stability is also visible in fig.5. The full style and zero-style results of the original AdaIN (fig.5(a)) show good output style quality with small style weights but degraded style quality with large style weights. In contrast, the results of our unbiased scheme (fig.5(b)) show a comparable quality of full style and zero-style images that remains stable over a large range of style weights.

Results of specifying style regression
To verify how our regression specification method works, we trained additional style transfer networks with the AdaIN layer for linear style regression learning by adding two anchor losses (eq.2) at α = 1/3, 2/3 for intermediate target styles to the total loss of eq.1 (we could add only two anchors because of the memory limitation of the GPU device). Then, we calculated the average content losses L_content and the average anchor-style losses L_astyle of the output stylized images for varying values of the style control parameter α by feeding 100 test pairs of {content image, target style image} into the trained networks. Figure 6 shows how the losses change as the style control parameter changes. As shown in fig.6(a), the content losses of our unbiased method (dash-dot lines) and 2-anchored method (solid lines) smoothly increase as the style control parameter increases, maintaining much lower values than those of the original AdaIN (dashed lines). This is the expected content-style trade-off in style strength control, but it also shows that adding anchor data of intermediate style did not degrade the stable content-preserving property of our unbiased learning scheme and even gave a more stable result (slightly lower content loss) than the unbiased method as α increases. As shown in fig.6(b), the anchor-style loss of our 2-anchored learning method (solid lines) is lower than that of our unbiased learning method (dash-dot lines) for almost all α values and much lower than that of the original AdaIN (dashed lines) for lower α values. This means that the style regression learned with additional anchor losses is closer to the desired linear regression than those of the original AdaIN and the unbiased AdaIN.
[Figure 6 caption, partial: The anchor-style loss L_astyle (distance from the target anchor-style of eq.2) of our unbiased method with two additional anchor data (solid lines) has a lower value (indicating better regression matching) than that without anchor data (dash-dot lines) and a much lower value than that of AdaIN without our unbiased learning scheme (dashed lines).]
Figure 7 shows the intermediate stylized images of the previous methods and ours. The results of our unbiased scheme with 2 anchor data present unbiased, smooth style transitions for the regressions f(α) = α (second row of fig.7) and f(α) = √α (third row of fig.7) as the style control parameter α changes, while the original AdaIN presents a style transition that starts from the biased zero-style image (first row of fig.7). The result of Universal [9] (bottom row of fig.7) seems to present a good style transition, especially in pixel color, but shows blurred images caused by its L2 reconstruction loss, a lack of stroke patterns, and style saturation for α > 0.6 caused by using only content images in decoder learning. Here, the style transfer networks of Universal were trained as a cascade of 3 networks under the common network configuration described in sec.3.1.

Conclusion
In this paper, we proposed an unbiased learning scheme and a style regression specification technique for fast image style transfer based on feed-forward neural networks. Our unbiased learning scheme used the biased loss and the unbiased loss simultaneously in network training and achieved a style transfer that is stable over a wide range of style loss weights and content-preserving enough to reconstruct the content image when the style control parameter is zero. Moreover, by considering additional anchor data of intermediate styles, our method learned a regression between the output style strength and the style control parameter that is closer to the linear regression. This enables an arbitrary regression of style strength by using a desired regression function of the style control parameter in the transformer of the style transfer network. These achievements were verified experimentally by analyzing the losses and the generated images from a number of networks trained with the state-of-the-art methods and ours, over a wide range of style loss weights and varying style control parameters.