Block Shuffle: A Method for High-resolution Fast Style Transfer with Limited Memory

Fast Style Transfer is a series of Neural Style Transfer algorithms that use feed-forward neural networks to render input images. Because of the high dimension of the output layer, these networks require a large amount of memory for computation. As a result, most mobile devices and personal computers cannot stylize high-resolution images, which greatly limits the application scenarios of Fast Style Transfer. At present, the two existing solutions are purchasing more memory and using the feathering-based method, but the former requires additional cost, and the latter produces poor image quality. To solve this problem, we propose a novel image synthesis method named \emph{block shuffle}, which converts a single task with high memory consumption into multiple subtasks with low memory consumption. This method can act as a plug-in for Fast Style Transfer without any modification to the network architecture. We use the most popular Fast Style Transfer repository on GitHub as the baseline. Experiments show that the quality of high-resolution images generated by our method is better than that of the feathering-based method. Although our method is an order of magnitude slower than the baseline, it can stylize high-resolution images with limited memory, which is impossible with the baseline. The code and models will be made available on \url{https://github.com/czczup/block-shuffle}.


Introduction
Fast Style Transfer [1,2] uses feed-forward neural networks to learn artistic styles from paintings and uses the learned style information to render input images. This technology improves the speed of Gatys et al.'s algorithm [3] and promotes the industrialization of Neural Style Transfer. For example, Prisma [4] is a famous mobile application based on Fast Style Transfer. It has set off a trend of using photos for artistic creation, and more and more people are enthusiastic about using this application to render their photos and share them on social networks. Such a simple application scenario does not require high-resolution images. However, in recent years, people have tried to apply Fast Style Transfer to new scenarios, such as customizing decorative paintings, making video special effects, and synthesizing art posters. Unlike sharing photos on social networks, these new application scenarios need to stylize high-resolution images.
However, due to memory limitations, most ordinary devices are unable to stylize high-resolution images directly. Specifically, Fast Style Transfer includes a feed-forward neural network for image transformation and a pre-trained network for loss calculation. The image transformation network is a fully convolutional neural network, which can process images of arbitrary size. But in practice, the maximum resolution of the input image is determined by the memory of the device, and oversized images will cause out-of-memory (OOM) errors.
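To see why oversized inputs exhaust memory, it helps to estimate the size of a single layer's activations. The sketch below is a rough back-of-the-envelope calculation, not a measurement; the 32-channel figure is an assumption based on typical Johnson-style transform networks, and real memory usage is several times larger because many layers' activations coexist.

```python
def activation_mb(height, width, channels, bytes_per_float=4):
    """Rough size of one float32 feature-map tensor, in megabytes."""
    return height * width * channels * bytes_per_float / 1024 / 1024

# Feature maps of a single (assumed) 32-channel layer at several
# input resolutions; the full network stores many such tensors.
for side in (512, 2000, 10000):
    print(f"{side}x{side}: {activation_mb(side, side, 32):.0f} MB")
# → 512x512: 32 MB
# → 2000x2000: 488 MB
# → 10000x10000: 12207 MB
```

Even this single-layer estimate reaches gigabytes at 10000 × 10000, which is consistent with the OOM behavior described above.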
There are two existing solutions to this problem. One is to buy more memory to meet computing needs, but this approach increases the cost and does not completely solve the problem. The other is to divide the input image into many overlapping sub-images, stylize them respectively, and then use the feathering effect [5,6] to stitch them (hereinafter referred to as the feathering-based method). This method does not require a hardware upgrade, but its output images have obvious seamlines. For example, Painnt [7] is a mobile application that uses the feathering-based method to stylize high-resolution images locally.
To solve the problems in the above two methods, we delve into the characteristics of Fast Style Transfer models and propose a novel method named block shuffle. Its main contributions are as follows:
1. This method converts a single task with high memory consumption into multiple subtasks with low memory consumption. It enables more ordinary devices to support high-resolution style transfer, extending the scope of application of Fast Style Transfer.
2. Compared with the feathering-based method, our method eliminates the seamlines and small noise textures, which significantly improves the quality of generated images.
3. This method is non-invasive, which only adds preprocessing and post-processing steps before and after the image transformation network, and does not need to retrain the model.

Related Work
In 2016, based on their previous work on texture synthesis [9], Gatys et al. first proposed a Neural Style Transfer algorithm [3]. By reconstructing representations from the feature maps of the VGG-19 network [10], they found that it has a strong feature extraction capability: its lower layers capture the content information of the input image, and its upper layers capture the style information. Therefore, they designed a content loss and a style loss based on the VGG-19 network and achieved high stylization quality. However, their method is based on online image optimization, and each style transfer requires several hundred iterations, which takes a long time.
In order to speed up the process of Neural Style Transfer, Johnson et al. [1] and Ulyanov et al. [2] respectively proposed a method of training a feed-forward neural network, which can stylize the input image only through a forward pass. Images generated by their methods are similar to that of Gatys et al.'s method [3], but the speed is three orders of magnitude faster, so these methods are collectively called Fast Style Transfer. For example, using an Nvidia Quadro M6000 GPU to stylize a 512×512 image, the method of Gatys et al. takes 51.19 seconds, while the methods of Johnson et al. and Ulyanov et al. only take 0.045 seconds and 0.047 seconds [11].
In addition to speed, researchers have made many improvements in the quality of style transfer. For example, Ulyanov et al. proposed the instance normalization [12], which applies normalization to every single image rather than a batch of images. Using instance normalization instead of batch normalization [13] can not only promote convergence but also significantly improve the quality of generated images. Besides, Gatys et al. reviewed their previous style transfer algorithm [3] and found that the stroke size is related to the receptive field of the VGG-19 network. For a high-resolution image, the receptive field is much smaller than the image, so this algorithm cannot produce large stylized structures. Therefore, they proposed a coarse-to-fine method that could generate high-resolution images with large brush strokes [14].
At present, the research of high-resolution Fast Style Transfer mainly focuses on the brush strokes, and there is no research to solve the uncomputability problem due to limited memory in practical application. For example, Wang et al. [15], Zhang et al. [16] and Jing et al. [17] all studied the brush strokes of Fast Style Transfer, aiming at producing excellent high-resolution images. These methods enlarge the stroke size, but they cannot process oversized images due to the limitation of device memory. Therefore, we propose a novel method named block shuffle, which not only solves this problem effectively but also can produce higher quality images than the feathering-based method.

Pre-analysis
We use the most popular Fast Style Transfer repository [8] on GitHub as the baseline. In this section, we will briefly introduce the model architecture and loss function of the baseline and analyze the reasons for the poor performance of the feathering-based method.

Model Architecture
Based on previous research, Engstrom implemented Fast Style Transfer and shared the source code on GitHub [8], which has attracted many developers and researchers. As shown in Fig. 2, in this repository, the loss network is a VGG-19 network pre-trained on the ImageNet dataset [18], and the image transformation network is a 16-layer deep residual network.
The architecture of the image transformation network is as follows: the kernel size of the first and last convolutional layers is 9×9, and that of the others is 3×3. The second and third layers are stride-2 convolutions, which are used for downsampling. The second-to-last and third-to-last layers are fractionally-strided convolutions with stride 1/2 (i.e., transposed convolutions with stride 2), which are used for upsampling. The middle ten layers are composed of five residual blocks [19], each containing two convolutional layers. All non-residual convolutional layers are followed by instance normalization and a ReLU activation function.

Loss Function
The loss function in the baseline combines the designs of Gatys et al. [3] and Johnson et al. [1] and consists of three parts: style loss $L_s$, content loss $L_c$, and total variation loss $L_{tv}$. The total loss is expressed as:
$$L = \lambda_s L_s + \lambda_c L_c + \lambda_{tv} L_{tv}$$
where $\lambda_s$, $\lambda_c$, and $\lambda_{tv}$ are the tradeoff parameters for the style loss, content loss, and total variation loss.

Style Loss
The style loss $L_s$ measures the style consistency between the output image $\hat{y}$ and the style image $y_s$. First, input $\hat{y}$ and $y_s$ into the VGG-19 network, then take the feature maps of layers relu1_1, relu2_1, relu3_1, relu4_1, and relu5_1 to compute the Gram matrices, and finally calculate the Euclidean distance between the Gram matrices of the two images:
$$L_s = \sum_{l} \left\| G(F_l(\hat{y})) - G(F_l(y_s)) \right\|^2$$
where $F_l(\cdot)$ represents the feature maps of layer $l$ in the VGG-19 network, and $G(\cdot)$ represents the Gram matrix. When computing the Gram matrix, the feature maps of shape $C \times H \times W$ are reshaped into a matrix of shape $C \times HW$.
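The Gram-matrix computation can be sketched in NumPy as follows. This is a minimal illustration of the reshape-and-inner-product step, not the baseline's TensorFlow code; the normalization constant varies between implementations, so the one used here is an assumption.

```python
import numpy as np

def gram_matrix(fmap):
    """Gram matrix of a feature map of shape (C, H, W).

    The map is reshaped to (C, H*W); the Gram matrix is the C x C
    matrix of inner products between channels, normalized here by
    the number of elements (normalization conventions vary).
    """
    c, h, w = fmap.shape
    f = fmap.reshape(c, h * w)
    return f @ f.T / (c * h * w)

def style_loss(fmaps_out, fmaps_style):
    """Sum of squared distances between Gram matrices over the
    selected layers (relu1_1 ... relu5_1 in the baseline)."""
    return sum(np.sum((gram_matrix(a) - gram_matrix(b)) ** 2)
               for a, b in zip(fmaps_out, fmaps_style))
```

By construction the Gram matrix is symmetric, and identical feature maps give zero style loss.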

Content Loss
The content loss $L_c$ measures the content consistency between the output image $\hat{y}$ and the content image $y_c$. First, input $\hat{y}$ and $y_c$ into the VGG-19 network, and then take the feature maps of layer $l = $ relu3_3 to compute the Euclidean distance:
$$L_c = \left\| F_l(\hat{y}) - F_l(y_c) \right\|^2$$

Total Variation Loss
Total variation loss $L_{tv}$ encourages the model to produce a smooth image, and is defined as:
$$L_{tv} = \sum_{i,j} \left( (x_{i,j+1} - x_{i,j})^2 + (x_{i+1,j} - x_{i,j})^2 \right)$$
where $x_{i,j}$ is a pixel on image $x$, and $i, j$ represent the position of this pixel.
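A direct NumPy sketch of this loss, summing squared differences between horizontally and vertically adjacent pixels (exact variants of $L_{tv}$ differ slightly between implementations):

```python
import numpy as np

def total_variation_loss(x):
    """Sum of squared differences between neighboring pixels.

    x: image array of shape (H, W, C).
    """
    dh = x[1:, :, :] - x[:-1, :, :]   # vertical neighbor differences
    dw = x[:, 1:, :] - x[:, :-1, :]   # horizontal neighbor differences
    return np.sum(dh ** 2) + np.sum(dw ** 2)
```

A constant image has zero total variation, which is why minimizing this term smooths the output.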

Problem Analysis
On devices with limited memory, the image transformation network cannot stylize high-resolution images directly. Therefore, we need to divide the input image into many sub-images for processing (Fig. 3(a)). For example, we can cut the input image x into many non-overlapping sub-images, stylize them respectively, and concatenate them to generate a complete image (Fig. 3(c)). However, this method results in a significant difference between adjacent stylized sub-images, which destroys the visual integrity of the output image. To improve this method, an intuitive idea is to generate overlapping sub-images and use the feathering effect to stitch them (i.e., the feathering-based method), but this way still generates visible seamlines in the output image (Fig. 3(d)). Currently, the mobile application Painnt uses this flawed method to stylize high-resolution images.
Observing the architecture of the image transformation network, we identified two factors that cause this phenomenon: the receptive field of the image transformation network and the instance normalization layer. In convolutional neural networks, the receptive field is the region of the input image that affects a particular value in the feature maps of the network. Specifically, for a Fast Style Transfer model, a pixel on the output image $\hat{y}$ depends on the pixel distribution in the corresponding receptive field of the input image x. In addition, the result of instance normalization also relies on the pixel distribution of the input image x. In summary, the output image $\hat{y}$ is affected by the pixel distribution of the input image x.
Based on the above analysis, we drew the RGB color histograms of the input image x and its sub-images (Fig. 3(e)), from which we can observe that the pixel distribution of the sub-images is quite different from that of the input image x. Therefore, we propose a conjecture: if the pixel distribution of the sub-images matches that of the input image x, then their stylized results will also be similar and can be easily stitched.

Proposed Method
In this section, we design the pixel distribution matching method, which verifies the correctness of our conjecture. Based on it, we propose the block shuffle method.

Pixel Distribution Matching
In order to produce sub-images whose pixel distribution matches that of the input image x, we propose the pixel distribution matching method. First of all, we assume that the input image x is a 3-channel image with width W and height H, and then we process the input image x by the following steps:
1. Cut the input image x into non-overlapping blocks of w×w pixels and number them in sequence (to simplify the discussion, we assume that both W and H are divisible by w).
2. Shuffle the list of image blocks randomly and take out some image blocks every time to generate a sub-image.
This method uses simple random sampling without replacement (SRSWOR) to select image blocks. In the population, each image block has an equal chance of being selected, so the sub-images generated by this method (Fig. 4(a)) can better represent the input image x. As shown in Fig. 4(d), the pixel distribution of these sub-images is similar to that of the input image x. Besides, the above steps are similar to the patch shuffle regularization proposed by Kang et al. [20]. They can be understood as a kind of regularization that ensures each sub-image contains not only local information but also global information of the input image x.
Next, stylize all sub-images. Then, process the stylized sub-images by the following steps:
1. Recut all stylized sub-images into image blocks of w×w pixels.

2. Sort the list of image blocks according to their numbers.
3. Concatenate all image blocks to obtain an output image with width W and height H.
We observed the output image of this method (Fig. 4(c)) and found that the brightness and color of image blocks are slightly different. But overall, this output image is very similar to the result of the baseline (Fig. 4(b)). This phenomenon proves the correctness of the conjecture made in Section 3.3 and provides theoretical support for further research.
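The cut/shuffle/stylize/restore round trip described above can be sketched in NumPy. The sketch below omits the style transfer step (it simply scrambles and restores blocks) to show that the numbering makes the pipeline exactly invertible; the function names are hypothetical, and dimensions are assumed divisible by the block width, as in the text.

```python
import numpy as np

def cut_blocks(img, w):
    """Cut an (H, W, C) image into non-overlapping w x w blocks,
    returned in row-major order."""
    h, wid, _ = img.shape
    return [img[i:i + w, j:j + w]
            for i in range(0, h, w) for j in range(0, wid, w)]

def assemble(blocks, rows, cols):
    """Concatenate a row-major list of equally sized blocks."""
    return np.concatenate(
        [np.concatenate(blocks[r * cols:(r + 1) * cols], axis=1)
         for r in range(rows)], axis=0)

def shuffle_restore_demo(img, w, rng):
    """Shuffle numbered blocks into a scrambled image, then recut,
    sort by number, and concatenate to restore the original."""
    blocks = cut_blocks(img, w)
    order = rng.permutation(len(blocks))       # shuffled block numbers
    rows, cols = img.shape[0] // w, img.shape[1] // w
    scrambled = assemble([blocks[i] for i in order], rows, cols)
    # Post-processing: recut, sort by number, concatenate.
    recut = cut_blocks(scrambled, w)
    restored = [None] * len(blocks)
    for pos, idx in enumerate(order):
        restored[idx] = recut[pos]
    return assemble(restored, rows, cols)
```

Because each block keeps its number through the shuffle, the restore step reproduces the input exactly; in the real method, the blocks are stylized while scrambled, so only brightness and color differ after restoration.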

Block Shuffle
Based on pixel distribution matching, we propose the block shuffle method, which improves the coherence of the output image. In this method, the four steps before the style transfer model are named pre-processing, and the four steps after it are named post-processing (as shown in Fig. 5). The specific process is as follows:

Input parameters. This method requires four input parameters: the style transfer model M, the input image x, the basic width w_basic, and the padding width w_padding. As shown in Fig. 6, each image block is a square consisting of a basic region and a padding region. For two adjacent image blocks, the overlapping part constitutes the "overlap region". The width of an image block is expressed as:
$$w_{block} = w_{basic} + 2 w_{padding}$$

(1) Expand. In order to ensure the integrity of image blocks, we use reflection padding to expand the input image x from W × H to W′ × H′. The expanded image is represented as x′, whose width W′ and height H′ are expressed as:
$$W' = \lceil W / w_{basic} \rceil \times w_{basic} + 2 w_{padding}$$
$$H' = \lceil H / w_{basic} \rceil \times w_{basic} + 2 w_{padding}$$

(2) Cut. First, cut the image x′ into overlapping square blocks with a width of w_block, and then number them in order. Specifically, we use a sliding window to crop the image; the size of the window is w_block × w_block, and the stride of the window is w_basic. After that, the numbers of image blocks in the horizontal and vertical directions are ⌈W / w_basic⌉ and ⌈H / w_basic⌉ respectively, and the total number of blocks is:
$$N_{total} = \lceil W / w_{basic} \rceil \times \lceil H / w_{basic} \rceil$$

(3) Shuffle. Shuffle the list of image blocks.
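Steps (1)–(3) can be sketched in NumPy as follows. This is a hypothetical illustration of the pre-processing arithmetic, not the released implementation; the padding is distributed as one w_padding margin on each side plus the round-up remainder.

```python
import numpy as np

def expand_cut_shuffle(img, w_basic, w_padding, rng):
    """Steps (1)-(3): reflection-pad an (H, W, C) image, cut it with a
    sliding window, and shuffle the numbered blocks."""
    h, w, _ = img.shape
    w_block = w_basic + 2 * w_padding
    # (1) Expand each dimension to ceil(dim / w_basic) * w_basic,
    # plus a w_padding margin on both sides, with reflection padding.
    img = np.pad(img,
                 ((w_padding, w_padding + (-h) % w_basic),
                  (w_padding, w_padding + (-w) % w_basic),
                  (0, 0)), mode='reflect')
    # (2) Cut: window of size w_block x w_block, stride w_basic,
    # numbered in row-major order.
    blocks = [img[i:i + w_block, j:j + w_block]
              for i in range(0, img.shape[0] - w_block + 1, w_basic)
              for j in range(0, img.shape[1] - w_block + 1, w_basic)]
    # (3) Shuffle, remembering each block's original number so the
    # post-processing steps can sort them back.
    order = list(rng.permutation(len(blocks)))
    return [blocks[i] for i in order], order
```

With this window size and stride, the block counts per direction come out to ⌈W/w_basic⌉ and ⌈H/w_basic⌉, matching the formula above.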
(4) Concatenate. Suppose our device can directly stylize an image of at most w_max × w_max pixels, so the size of the sub-images must be less than or equal to this size. The number of blocks in the largest sub-image is expressed as:
$$N_{block} = \lfloor w_{max} / w_{block} \rfloor^2$$
Therefore, every time we take N_block image blocks from the list in sequence and concatenate them into a square sub-image of $(\sqrt{N_{block}} \times w_{block}) \times (\sqrt{N_{block}} \times w_{block})$ pixels. The total number of sub-images is:
$$N_{sub} = \lceil N_{total} / N_{block} \rceil$$
(5) Style transfer. Use the Fast Style Transfer model M to stylize all sub-images.
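The block-counting arithmetic can be made concrete with a small helper. The formulas here are reconstructions from the surrounding text (the original equations were lost in extraction), so treat them as a sketch; the example uses the paper's own settings of w_basic = w_padding = 16 and w_max = 1000.

```python
import math

def subimage_plan(width, height, w_basic, w_padding, w_max):
    """Blocks per sub-image and number of sub-images, as reconstructed
    from the description of steps (2) and (4)."""
    w_block = w_basic + 2 * w_padding
    n_total = math.ceil(width / w_basic) * math.ceil(height / w_basic)
    n_per_side = w_max // w_block          # blocks along a sub-image side
    n_block = n_per_side ** 2              # blocks per sub-image
    n_sub = math.ceil(n_total / n_block)   # sub-images needed
    return n_block, n_sub
```

For a 3000 × 3000 input with these settings, each sub-image holds 20 × 20 = 400 blocks of width 48, and 89 sub-images cover the whole image.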
(6) Recut. First, recut the stylized sub-images into square image blocks with a width of w_block. Then, in order to reduce the boundary effect (i.e., the border of a stylized image block is contaminated by the surrounding image blocks), remove an 8-pixel-wide border around each image block, so the final width of the image blocks is w_block − 16.
(7) Sort. Sort the list of image blocks according to their number.
(8) Restore. First, use the feathering effect [6] to stitch all the image blocks. Then, remove the padding area added in step (1) and restore the image to its original size. Concretely, the feathering effect blends the left and right images by computing a weighted average in the overlap region:
$$p = \frac{d_l \cdot p_l + d_r \cdot p_r}{d_l + d_r}$$
where $p_l$ and $p_r$ are the pixels in the overlap region of the left image and the right image, and $d_l$ and $d_r$ are the distances between the overlapping pixel and the borders of the left and right images.
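A horizontal instance of this feathering blend can be sketched as follows. The distance convention is an assumption (the reconstructed weights fade each image out toward its own border); vertical stitching works symmetrically on rows.

```python
import numpy as np

def feather_stitch(left, right, overlap):
    """Stitch two (H, W, C) images that overlap by `overlap` columns,
    blending the overlap with p = (d_l*p_l + d_r*p_r) / (d_l + d_r)."""
    p_l = left[:, -overlap:]
    p_r = right[:, :overlap]
    k = np.arange(overlap, dtype=float)
    d_l = overlap - k          # distance to the left image's border
    d_r = k + 1.0              # distance to the right image's border
    w = (d_l / (d_l + d_r))[None, :, None]
    blend = w * p_l + (1.0 - w) * p_r
    return np.concatenate(
        [left[:, :-overlap], blend, right[:, overlap:]], axis=1)
```

At the left edge of the overlap the weight of the left image is highest, and it decays linearly to favor the right image, which is what hides a hard seam.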
(9) Smooth. Finally, in order to eliminate the seamlines and small noise textures, we apply bilateral filters to the generated image, which smooth the image while preserving edges. To reduce the time spent, we use four small bilateral filters (sigmaColor=10, sigmaSpace=10) instead of one large bilateral filter (sigmaColor=40, sigmaSpace=40).

Implementation Details
In this paper, we adopt the most popular Fast Style Transfer repository [8] on GitHub as the baseline. At training time, we used the MS-COCO dataset [21] to train the network, and all images were cropped and resized to 512 × 512 pixels. In addition, the Adam optimizer [22] was used during training, with a learning rate of $1 \times 10^{-3}$. The batch size is 4, and the number of iterations is 40,000. The tradeoff parameters $\lambda_s$, $\lambda_c$, and $\lambda_{tv}$ are set to 100, 7.5, and 200, respectively. At test time, we use the baseline, the feathering-based method, and our block shuffle method to stylize high-resolution images. In our method, the maximum resolution w_max × w_max is set to 1000 × 1000.

Hyper-parameters Selection
There are two crucial hyper-parameters in our block shuffle method, the basic width w_basic and the padding width w_padding, which determine the structure of image blocks. According to the description in Section 4.2, the padding region increases the amount of style transfer computation. To estimate the computational complexity of our method relative to the baseline, we define a parameter α as:
$$\alpha = \left( \frac{w_{basic} + 2 w_{padding}}{w_{basic}} \right)^2$$
so the ratio of w_padding to w_basic determines the computational complexity of this method. When this ratio decreases, the amount of computation gradually approaches that of the baseline, but meanwhile the overlap region also shrinks, which reduces the quality of the generated images.
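The formula for α is a reconstruction (the original equation was lost in extraction), motivated as follows: each w_basic-wide basic region is stylized inside a (w_basic + 2·w_padding)-wide block, so the ratio of processed pixels to input pixels is the square of the block-to-basic width ratio. A one-line sketch:

```python
def complexity_ratio(w_basic, w_padding):
    """Approximate ratio of stylized pixels to input pixels:
    alpha = ((w_basic + 2*w_padding) / w_basic) ** 2."""
    return ((w_basic + 2 * w_padding) / w_basic) ** 2

print(complexity_ratio(16, 16))   # → 9.0, matching alpha = 9 in the text
print(complexity_ratio(16, 8))    # → 4.0, less computation, less overlap
```

This reconstruction is consistent with the paper's statement that α = 9 when w_padding equals w_basic.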
To balance the amount of computation and the quality of the generated images, we set w_padding equal to w_basic, which means α = 9. Through experiments, we found that as w_basic and w_padding decrease, the seamlines on the output image gradually disappear. As shown in Fig. 7, the result with w_basic = w_padding = 16 is the best, so we use this value in subsequent experiments. Fig. 8 shows the high-resolution results of our method and the two aforementioned solutions, from which we observe that the results of our method are more similar to the baseline. In contrast, the results of the feathering-based method have obvious seamlines and are quite different from the baseline.

Visual Evaluation
The results of these three methods at different resolutions are shown in Fig. 9. For a high-resolution image, the receptive field is much smaller than the image. Therefore, with the increase of resolution, more and more stylized textures are produced, which reduces the aesthetics of the generated image. However, compared with the other two solutions, our method performs better. More precisely, our method eliminates the noise textures and seamlines, which improves the quality of high-resolution stylized images.

Speed Evaluation
We tested the speed of our method on three devices: a mobile phone, a personal computer, and a GPU server. The information about these devices is as follows:
1. The mobile phone is a Xiaomi Mi 9, which runs Android 10.0, powered by the Qualcomm Snapdragon 855 processor, with the Adreno 640 GPU and 8 GB RAM.
2. The personal computer runs on Windows 10, powered by the Intel Core i7-6700HQ processor, with the NVIDIA GeForce GTX 965M GPU and 4GB video RAM.

Figure 8: Comparison of baseline, baseline+feathering-based method, and baseline+block shuffle (ours). The resolution of the above images is all 3000 × 3000.
3. The GPU server runs on CentOS 7.0, powered by the Intel Xeon E5-2650 v4 processor, with the NVIDIA Tesla K80 GPU and 12 GB video RAM.
On the mobile phone, we used Xiaomi's Mobile AI Compute Engine (MACE) [23] to deploy models. On the personal computer and GPU server, we used Google's TensorFlow [24] to test the speed. In all tests, Fast Style Transfer models were run in GPU mode.
In this experiment, we tested images with resolution ranging from 1000 × 1000 to 10000 × 10000. Tab. 1 shows the average time for the baseline and our method to stylize images of different resolutions on the three devices, where "−" means the image of this resolution cannot be processed due to OOM error.
From the results, we observed that the baseline has an advantage in speed, but it can only process images of low resolution. For example, the mobile phone and personal computer can stylize images up to 2000 × 2000 pixels, and the GPU server can stylize images up to 4000 × 4000 pixels. Compared with the baseline, our method breaks through the limitation of image resolution and can stylize high-resolution images with limited memory, at the cost of being an order of magnitude slower.

Memory Evaluation
We tested our method with the Xiaomi Mi 9 and show the memory usage of stylizing images of different resolutions in Fig. 10. From this figure, we can observe that the memory usage of the Fast Style Transfer model is a constant 0.33 GB. This is because the objects processed by the model are sub-images of the same resolution. Compared to the baseline, our method significantly reduces memory usage. More concretely, the baseline cannot stylize images above 3000 × 3000 pixels due to OOM errors, but our method can stylize even a 10000 × 10000 image without an OOM error. In addition, the memory usage when stylizing an image below 4000 × 4000 pixels is less than 1 GB. This means that our method can enable most mobile devices and personal computers to support high-resolution Fast Style Transfer, which will contribute to the industrialization of Fast Style Transfer.

Conclusion
In this paper, we proposed the block shuffle method for high-resolution Fast Style Transfer with limited memory. Experiments show that the quality of high-resolution images generated by our method is superior to that of the feathering-based method. Besides, although our method is an order of magnitude slower than the baseline, it breaks through the limitation of image resolution, which enables more devices to support high-resolution Fast Style Transfer. In future work, we will further study this subject, improve the image quality and speed, and promote the industrialization of Fast Style Transfer.

Figure 11: Additional comparison of baseline, baseline+feathering-based method, and baseline+block shuffle (ours). The resolution of the above images is all 3000 × 3000.
Appendix A: More Comparisons

Figure 11 and Figure 12 show more comparisons of the baseline, baseline+feathering-based method, and baseline+block shuffle (ours).

Figure 12: Additional comparison of baseline, baseline+feathering-based method, and baseline+block shuffle (ours). The resolution of the above images is all 3000 × 3000.