Flexible Style Image Super-Resolution using Conditional Objective

Recent studies have significantly enhanced the performance of single-image super-resolution (SR) using convolutional neural networks (CNNs). While there can be many high-resolution (HR) solutions for a given input, most existing CNN-based methods do not explore alternative solutions during the inference. A typical approach to obtaining alternative SR results is to train multiple SR models with different loss weightings and exploit the combination of these models. Instead of using multiple models, we present a more efficient method to train a single adjustable SR model on various combinations of losses by taking advantage of multi-task learning. Specifically, we optimize an SR model with a conditional objective during training, where the objective is a weighted sum of multiple perceptual losses at different feature levels. The weights vary according to given conditions, and the set of weights is defined as a style controller. Also, we present an architecture appropriate for this training scheme, which is the Residual-in-Residual Dense Block equipped with spatial feature transformation layers. At the inference phase, our trained model can generate locally different outputs conditioned on the style control map. Extensive experiments show that the proposed SR model produces various desirable reconstructions without artifacts and yields comparable quantitative performance to state-of-the-art SR methods.


Introduction
Finding a high-resolution (HR) counterpart from a given low-resolution (LR) image is referred to as single image super-resolution (SISR). The SISR is an ill-posed problem in that infinitely many HR images correspond to a single LR image. Despite such ill-posedness, recent convolutional neural networks (CNNs) are shown to map an LR to a plausible HR [1]. SRCNN [2,3] first showed the effectiveness of a CNN for SISR, and various CNN architectures have been proposed for better performance afterward [4,5,6,7,8,9,10,11,12,13,14,15,16]. Earlier works used mean square error (MSE) as a loss function to train the network. However, since it tends to produce blurry HR outputs, researchers are finding new loss functions to generate more realistic outputs [17,18]. Specifically, perceptual losses [19] are introduced to optimize the super-resolution (SR) model in the feature space instead of pixel space. Ledig et al. [20] proposed to use adversarial loss [21] in combination with the perceptual loss to encourage the network to favor perceptually superior solutions residing in the manifold of natural images. More recently, Wang et al. [22] investigated class-conditional SR. It employed Spatial Feature Transform (SFT) capable of altering an SR network's behavior conditioned on semantic segmentation probability maps. However, since most of the existing methods calculate perceptual losses on an entire image in the same feature space, the results tend to be monotonous and unnatural. For this reason, Rad et al. [23] optimized SR models with a targeted objective function that penalizes images at different semantics using the corresponding terms. But, since the segmentation label needs to be fed to the SR network to calculate the targeted perceptual loss, the users cannot easily adjust the objective function. 
arXiv:2201.04898v3 [cs.CV] 8 Mar 2022
In summary, most early SR networks provide a designated HR output among many possible ones, not allowing us to explore more plausible outputs at the test phase. To alleviate this problem, Lugmayr et al. [24] proposed the SRFlow using a normalizing flow method capable of learning the conditional distribution of the output given the low-resolution input. As a result, it can learn to predict diverse photo-realistic high-resolution images. Though great strides have been made, the natural and flexible reconstruction of local regions is still challenging. As stated previously, there can be diverse HR solutions for a given LR, meaning that one LR input can be restored to different HR results depending on the context and situation. Particularly because of various shapes and textures in the real world, the one-to-many problem becomes even more serious if the SR network's capacity is not large enough. To solve this problem, first, the SR model should be able to generate more diverse styles of HR reconstruction while keeping consistency with the given LR image. Second, the recovery style needs to be locally controlled. Third, training and storing too many redundant SR models with different parameters should be avoided. Achieving these requirements would enable us to explore various HR solutions for each region effectively. In this respect, some recent methods made it possible to continuously generate and adjust intermediate results between two objective functions, i.e., perception and distortion functions [25,26,27]. However, there can be some improvements in these approaches, as they defined just two objective functions and controlled the entire image, not the local regions needing adjustment.
In this paper, we attempt locally adjustable HR generation by exploring the SR model optimization, focusing on the development of conditional objectives that can generate various reconstruction styles. The proposed objective consists of the weighted sum of several perceptual losses from different feature levels. The weights vary according to the condition, which is the recovery style information in our work. Experiments show that training an SR model with our multi-level perceptual losses generates various recovery styles effectively, which also enables us to finely control the styles of local regions.

Loss Functions for SISR
The choice of the objective function affects the recovery style and reconstruction performance. For instance, adversarial loss [21] encourages an SR network to generate perception-oriented solutions [28,29,30,31]. Perceptual losses [32,19] are proposed to optimize SR models by minimizing the error in the feature space instead of pixel space. Dosovitskiy et al. [18] and Ledig et al. [20] proposed to use adversarial loss in combination with the perceptual loss to encourage the network to favor solutions that look more like natural images. With these loss functions, the overall visual quality of reconstruction is significantly improved [33,34,35]. Recently, some studies [36,37,38] proposed to use GANs with losses based on perceptual quality assessment metrics. Another perceptual loss is proposed in [23], using different levels of features according to semantic segmentation labels such as objects, boundaries, and backgrounds. In these approaches, once an SR model is trained, a fixed HR is produced for the LR input.

Network Conditioning
The feature normalization techniques generally change networks' behavior based on the input properties. The representative normalization methods may be batch normalization (BN) [39] and instance normalization (IN) [40]. The IN normalizes a single image while the BN does a whole batch of images. Conditional Instance Normalization (CIN) has also been introduced in [41], which uses the learned representations to model multiple styles simultaneously. Huang et al. [42] proposed adaptive instance normalization (AdaIN) to adjust features to arbitrary new styles. Perez et al. [43] proposed Feature-wise Linear Modulation, called FiLM, as a general-purpose conditioning method for neural networks. FiLM layers influence neural network computation via a simple, feature-wise affine transformation based on conditioning information. Inspired by these works, Wang et al. [22] proposed a spatial feature transformation (SFT) layer to modulate the features of some intermediate layers in a single network conditioned on semantic segmentation probability maps. Our approach is partially inspired by the above feature normalization methods, which can alter the behavior of deep CNNs to influence the output. In terms of network architecture, we use the Residual-in-Residual Dense Block (RRDB) [44] equipped with SFT layers.

Continuous Imagery Effect Transition
Since the restored image's perceived quality is relatively subjective, and the perception-oriented methods sometimes generate artifacts, users may wish to control the reconstruction result according to their preferences or the image characteristics. In recent years, there have been some tunable models that produce intermediate images between the goals of two different objective functions. These methods start by training several separate models and then interpolate between them, either by directly interpolating the output pixels or network weights [44,26], or by using specialized adaptor blocks in the networks [27]. They considered trade-off relationships between two objectives, such as perception-distortion balance in SR, noise reduction vs. detail preservation in denoising, and style transfer [45,27,25,26]. However, these methods have some limitations: the number of objective functions is two, and they cannot adjust local regions, i.e., the algorithm is equally applied to the entire region of an image. It is also inefficient that they have to train and store multiple separate models. On the other hand, Bahat et al. [46] proposed an explorable SR framework that enables local restoration control. However, users have to manually edit the texture in a few steps through a user interface. For easier and more effective quality control, we propose a controllable SR model that can produce various recovery styles for each region with a simple adjustment method. Besides, we can generate intermediate results between two or more different styles at fine control levels.
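The two interpolation strategies mentioned above can be illustrated with a minimal sketch. It assumes two already-trained models whose outputs and parameters are available as flat lists of floats; the names `lerp`, `interpolate_weights`, and all numeric values are hypothetical and only serve the illustration.

```python
def lerp(a, b, alpha):
    """Linear interpolation between two equal-length lists of floats."""
    return [(1.0 - alpha) * x + alpha * y for x, y in zip(a, b)]


def interpolate_weights(weights_a, weights_b, alpha):
    """Network interpolation: blend two parameter sets layer by layer.

    Both dictionaries are assumed to share the same layer names and shapes.
    """
    return {name: lerp(weights_a[name], weights_b[name], alpha)
            for name in weights_a}


# Hypothetical flattened outputs of a distortion-oriented and a
# perception-oriented model for the same LR input.
out_distortion = [0.20, 0.40, 0.60]
out_perception = [0.10, 0.50, 0.90]

# Pixel interpolation: blend the two output images directly.
pixel_interp = lerp(out_distortion, out_perception, 0.5)

# Network interpolation: blend the parameters, then run the blended network.
net_a = {"conv1": [1.0, 2.0], "conv2": [0.0]}
net_b = {"conv1": [3.0, 2.0], "conv2": [1.0]}
net_interp = interpolate_weights(net_a, net_b, 0.25)
```

In both cases a single scalar controls the whole image, and two full models must be trained and stored, which is the limitation discussed above.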

Multi-task Learning
Learning one task at a time is a typical methodology in machine learning because it is hard to simultaneously optimize multiple objectives due to model capacity limitation or conflicting losses. For this reason, such multi-objective problems are commonly scalarized by a linear combination of the losses, with weights defining the trade-off between the loss terms [47]. On the other hand, Multi-task Learning (MTL) is an inductive transfer mechanism whose goal is to improve generalization performance by leveraging useful domain-specific information contained in multiple related tasks [48]. Specifically, since the MTL networks use shared layers trained in parallel on all the tasks, what is learned for each task can help others to learn better when tasks are closely related [47,49]. Recently, Dosovitskiy et al. [50] proposed loss-conditional training of deep networks for MTL that can improve model efficiency by exploiting the redundancy of multiple related models. They demonstrate style transfer trained in this way and utilize feature-wise linear modulation [43] that affects the whole image style.

Targeted Perceptual Loss
In general, the choice of feature space significantly influences perceptual reconstruction performance and the styles. For example, Figure 1 shows the effect of choosing different feature spaces in computing the perceptual loss. In this paper, four different layers, ReLU 2-2, ReLU 3-4, ReLU 4-4, and ReLU 5-4 of the VGG-19 network [51] are considered, denoted as VGG22, VGG34, VGG44, and VGG54, respectively. As shown in Figure 1, while the low-level feature space VGG22 seems more suitable for reconstructing simple edges with less distortion and over-sharpening, the mid- and high-level feature spaces of VGG44 are more appropriate for recovering complex textures. Therefore, it is difficult to determine a single feature space that works best for the entire image. In our work, we use more than two feature spaces at the same time to train a flexible SR (FxSR) model capable of generating various reconstruction styles. We define two kinds of FxSR models, namely FxSR-PD (perception-distortion) and FxSR-DS (diversity). The FxSR-PD is the main model in our work, which controls the output style between the distortion-oriented and perception-oriented by combining the reconstruction loss (for distortion) and VGG22 feature loss (for perception), along with the adversarial loss. The FxSR-DS uses the same architecture as the FxSR-PD but is trained with different losses, including all the VGG features stated above. Hence, the aim of FxSR-DS is to produce diverse styles of outputs related to different VGG features rather than to control between distortion and perception. Unlike previous works where there is no control data, we adjust the network by applying different objective functions for each local region through a style control map. As a result, we can explore various HR solutions that are generated using multiple objective functions and thus reconstruct an image with the desired style or an image closer to the original HR.

Figure 2: The architecture of our proposed flexible SR network. We use the RRDB equipped with SFT as a basic block (Figure 3(c)). The condition branch takes a style map for reconstruction style as input. This map is used to control the recovery styles of edges and textures for each region through SFT layers.

Figure 3: RRDB with SFT for basic blocks. (a) RB with SFT [22]. (b) Residual-in-Residual Dense Block (RRDB) [44]. (c) The proposed basic block (RRDB equipped with SFT layer).

Proposed SR with Flexible Style
Given a single LR image $I_{LR}$, SISR is to estimate an HR image $\hat{I}_{HR}$ that is as similar as possible to its corresponding HR counterpart $I_{HR}$. Most of the current CNN-based methods use feed-forward networks to directly learn a mapping function $G_\theta$ parameterized by $\theta$ as

$$\hat{I}_{HR} = G_\theta(I_{LR}) \quad (1)$$

To optimize $G_\theta$ on the training samples, we design a specific objective function $O$ as

$$\hat{\theta} = \arg\min_\theta \; \mathbb{E}_{Z \sim P_Z}\left[ O(\hat{I}_{HR}, I_{HR}) \right]$$

where $Z = (I_{LR}, I_{HR})$ is sampled from a given training distribution of pairs $P_Z$. Many recent studies [20,52] use perceptual loss and adversarial loss for designing $O$ to recover realistic textures. Although these losses greatly improve the perceptual quality, the generated textures tend to be monotonous and unnatural [23,22]. To further improve the restoration performance, Wang et al. [22] used semantic segmentation probability maps as the categorical prior $\Psi$ and reformulated (1) as

$$\hat{I}_{HR} = G_\theta(I_{LR} \mid \Psi)$$

However, the perceptual loss was applied to the entire region of images, like in previous works. Specifically, the same level of features was used both on simple edges and complex textures, which has a limitation in restoring images composed of various types of objects. In addition, once model training is completed, there is no way to adjust the SR results without retraining. Hence, instead, we propose a novel method to apply different objectives to each region for reconstructing desired images or images closer to the original. Specifically, the proposed flexible SR model is optimized with a conditional objective, which is a weighted sum of several perceptual losses corresponding to different feature levels, where each weight changes depending on the style map. Formally, our objective is described as:

$$\hat{\theta} = \arg\min_\theta \; \mathbb{E}_{Z \sim P_Z,\, T \sim P_T}\left[ O(\hat{I}_{HR}, I_{HR} \mid T) \right], \qquad \hat{I}_{HR} = G_\theta(I_{LR} \mid T)$$

where $T$ is a map delivering spatially varying style control and $P_T$ denotes the distribution of maps fed during training. That is, the map $T$ is an LR-sized matrix, which is fed to the condition network to change the SR styles.
Since the purpose of training is to let the network learn various styles corresponding to given control parameters, we feed various T randomly to the network during the training. Specifically, we feed a flat map T = t × 1 during the training, where 1 is the matrix with all the elements 1, and t is a variable related to the feature combinations, which will be detailed in the following subsection. For training with various feature combinations, we change t randomly at each epoch. At the inference, if we feed a flat map as defined above, the network will deliver an SR style globally corresponding to the t. If we wish to control the styles locally, we feed a spatially varying map, which will be demonstrated in the experiment.
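As a minimal sketch of the style maps described above, the following builds a flat map T = t × 1 for training and a simple spatially varying map for inference. The function names, the plain nested-list representation, and the two-region layout are assumptions for illustration and are not part of the proposed implementation.

```python
import random


def flat_style_map(height, width, t):
    """Flat map T = t * 1: every LR position carries the same style value t."""
    return [[t] * width for _ in range(height)]


def sample_training_map(height, width, rng=random):
    """Draw t uniformly from [0, 1] and return the flat map together with t."""
    t = rng.uniform(0.0, 1.0)
    return flat_style_map(height, width, t), t


def two_region_map(height, width, split_col, t_left, t_right):
    """Spatially varying map: t_left left of split_col, t_right elsewhere."""
    return [[t_left if col < split_col else t_right for col in range(width)]
            for _ in range(height)]


# Training: a flat map with a randomly drawn style value.
train_map, t = sample_training_map(32, 32)

# Inference: distortion-oriented on the left half, perception-oriented on the right.
local_map = two_region_map(32, 32, split_col=16, t_left=0.0, t_right=1.0)
```

The map has the spatial size of the LR input, so each LR position can carry its own style value at inference time.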

Proposed Network Architecture
An overview of the architecture is shown in Figure 2. The generator network $G_\theta$ consists of two streams, an SR branch and a condition branch. The SR branch is built with basic blocks consisting of RRDB equipped with the SFT layers [22], which take the shared conditions as input and modulate feature maps by applying an affine transformation. This structure is shown in Figure 3(c), where the residual block with SFT [22] and RRDB [44] are also shown in Figures 3(a) and (b) for comparison. The SFT layer learns a mapping function that outputs modulation parameters based on a style condition $T$. This modulation layer allows the SR branch to optimize the changing objective during the training and also to generate SR results with spatially different styles according to the style map. The condition branch is used to produce shared intermediate style conditions that can be broadcast to all the SFT layers for efficiency. As in the study of [22], all the convolution layers in the condition branch are restricted to use 1 × 1 kernels to avoid the interference of different regions. For the discriminator network, we use a VGG network [51] that contains ten convolution layers gradually decreasing the spatial dimensions.
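The role of the 1 × 1 convolutions and the affine modulation can be sketched on a single feature channel. This is a simplified illustration, assuming the modulation takes the form γ ⊙ F + β as in [22]; the scalar weights below are hypothetical stand-ins for learned parameters.

```python
def conv1x1(cond_map, weight, bias):
    """1x1 convolution on a single-channel map: a per-pixel affine function.

    Each output value depends only on the input value at the same position,
    so conditions of different regions do not interfere with each other.
    """
    return [[weight * v + bias for v in row] for row in cond_map]


def sft_modulate(feature, gamma, beta):
    """Spatial feature transform: element-wise gamma * feature + beta."""
    return [[g * f + b for f, g, b in zip(f_row, g_row, b_row)]
            for f_row, g_row, b_row in zip(feature, gamma, beta)]


feature = [[1.0, 2.0], [3.0, 4.0]]      # one feature channel of the SR branch
style_map = [[0.0, 1.0], [0.0, 1.0]]    # t = 0 in the left column, t = 1 in the right

# Hypothetical learned 1x1 convolutions producing the modulation parameters.
gamma = conv1x1(style_map, weight=0.5, bias=1.0)    # 1.0 where t = 0, 1.5 where t = 1
beta = conv1x1(style_map, weight=-1.0, bias=0.0)    # 0.0 where t = 0, -1.0 where t = 1

modulated = sft_modulate(feature, gamma, beta)
```

Features in the left column pass through unchanged, while those in the right column are scaled and shifted, which is how one network can behave differently per region.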
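The conditional objective is a weighted sum of the reconstruction, adversarial, and perceptual losses:

$$O = w_{rec}\, L_{rec} + w_{adv}\, L_{adv} + L_{per}$$

The notations will be explained one by one below. First, the reconstruction loss is calculated as:

$$L_{rec} = \mathbb{E}\left[ \lVert \hat{I}_{HR} - I_{HR} \rVert_1 \right]$$

We use the adversarial loss based on the Relativistic average Discriminator (RaD) [53], which performs better for learning sharper edges and more detailed textures compared to the standard GAN [21]. While the standard version estimates the probability that one input image $I$ is real and natural, the RaD predicts the probability that a real image $I_{HR}$ is relatively more realistic than a fake one $\hat{I}_{HR}$. In addition, for adversarial training, RaD benefits from the gradients from both $\hat{I}_{HR}$ and $I_{HR}$, while only $\hat{I}_{HR}$ takes effect in the standard version. Specifically, the adversarial and the discriminator losses are:

$$L_{adv} = -\mathbb{E}_{I_{HR}}\left[ \log\left(1 - D(I_{HR})\right) \right] - \mathbb{E}_{\hat{I}_{HR}}\left[ \log D(\hat{I}_{HR}) \right]$$

$$L_{D} = -\mathbb{E}_{I_{HR}}\left[ \log D(I_{HR}) \right] - \mathbb{E}_{\hat{I}_{HR}}\left[ \log\left(1 - D(\hat{I}_{HR})\right) \right]$$

where

$$D(I_{HR}) = \mathrm{sigmoid}\left( C(I_{HR}) - \mathbb{E}_{\hat{I}_{HR}}\left[ C(\hat{I}_{HR}) \right] \right) \quad (12)$$

$D(\hat{I}_{HR})$ is defined in the same way with the roles of $I_{HR}$ and $\hat{I}_{HR}$ exchanged, and $C(\cdot)$ represents the output logit of the discriminator. The conditional perceptual loss is a weighted sum of multiple perceptual losses in different levels of feature spaces:

$$L_{per} = \sum_{l} w_l\, L_l$$

where $L_l$ denotes the distance in each feature space, $l \in \{VGG12, VGG22, \cdots, VGG54\}$, and the weights $w_l$ change according to $T$. Precisely, the distance $L_l$ is defined as

$$L_l = \mathbb{E}\left[ \lVert \phi_l(\hat{I}_{HR}) - \phi_l(I_{HR}) \rVert_1 \right]$$

where $\phi_l$ denotes feature maps in the feature space $l$. The weights $w_{rec}$, $w_{adv}$, and $w_l$ are functions of $t$ as described in Figure 4, where $t$ is a random variable having uniform distribution in [0, 1] during the training.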
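To make the relativistic average formulation concrete, the following is a minimal sketch of the two losses computed from raw discriminator logits. It assumes the standard RaD formulation of [53], in which each logit is compared against the batch mean of the opposite class; the function names are illustrative only.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def mean(values):
    return sum(values) / len(values)


def relativistic_average_losses(real_logits, fake_logits):
    """Return (discriminator loss, generator adversarial loss).

    real_logits: C(I_HR) for a batch of real images.
    fake_logits: C(I_HR_hat) for a batch of super-resolved images.
    """
    mean_real = mean(real_logits)
    mean_fake = mean(fake_logits)
    # D(real): probability that a real image is more realistic than the average fake.
    d_real = [sigmoid(c - mean_fake) for c in real_logits]
    # D(fake): probability that a fake image is more realistic than the average real.
    d_fake = [sigmoid(c - mean_real) for c in fake_logits]
    loss_d = (-mean([math.log(p) for p in d_real])
              - mean([math.log(1.0 - p) for p in d_fake]))
    loss_g = (-mean([math.log(1.0 - p) for p in d_real])
              - mean([math.log(p) for p in d_fake]))
    return loss_d, loss_g


# A discriminator that separates real from fake well has a low loss,
# while the generator's adversarial loss is high.
loss_d, loss_g = relativistic_average_losses([2.0, 3.0], [-2.0, -1.0])
```

Note that both real and fake logits appear in the generator loss, which is why the generator receives gradients from both kinds of images.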

Implementation details
This subsection explains how we design the combination of feature losses depending on the change of t. The left column of Figure 4 shows the weight function for FxSR-PD (using only VGG22 for perceptual loss), and the right for FxSR-DS (using more feature spaces for diversity). When t = 0, the figure shows that FxSR-PD corresponds to distortion-oriented SR (perceptual and adversarial losses are zero). When the value of t approaches 1, it becomes perception-oriented (the weight for the reconstruction loss becomes zero, while the weights of the adversarial and perceptual losses grow to 1). In the case of the right column, various feature distances are involved in the perceptual loss, and hence FxSR-DS can deliver diverse styles. Specifically, note that t = 1 corresponds to a perception-oriented SR with VGG54 as the feature space. Also, even when t approaches 0, the FxSR-DS still produces perception-oriented SR results of different styles corresponding to VGG22, unlike the FxSR-PD that is distortion-oriented at t = 0.

Regarding the style control, as stated previously, we use a uniform map T = t × 1 at the training phase. That is, a flat map is fed to the condition branch, with its intensity t randomly changing during the training. Since the SR network is a fully convolutional neural network, it inherits the local connectivity property that the local image and the map region determine the output pixel. Hence, SR models trained with uniform maps can handle spatially varying cases.
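The dependence of the objective on t can be sketched as follows for the FxSR-PD case. The exact weight curves are given only in Figure 4, so the linear ramps below are an assumption that merely matches the stated end points (distortion-oriented at t = 0, perception-oriented at t = 1); the function names and loss values are hypothetical.

```python
def fxsr_pd_weights(t):
    """Assumed linear schedule for FxSR-PD; the actual curves are in Figure 4.

    t = 0: only the reconstruction loss is active (distortion-oriented).
    t = 1: only the adversarial and VGG22 perceptual losses are active.
    """
    return {"rec": 1.0 - t, "adv": t, "VGG22": t}


def conditional_objective(losses, weights):
    """Weighted sum of the individual loss values for the current condition."""
    return sum(weights[name] * losses[name] for name in weights)


# Hypothetical loss values measured on one training batch.
batch_losses = {"rec": 2.0, "adv": 3.0, "VGG22": 4.0}

objective_t0 = conditional_objective(batch_losses, fxsr_pd_weights(0.0))
objective_t05 = conditional_objective(batch_losses, fxsr_pd_weights(0.5))
objective_t1 = conditional_objective(batch_losses, fxsr_pd_weights(1.0))
```

Because the same t is also fed to the condition branch as a flat map, the network learns which objective it is being asked to optimize for each value of t.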

Experiments
In the experiments, we compare our FxSR-PD and FxSR-DS with several state-of-the-art SR methods on benchmark datasets. We start the section with a description of the datasets and evaluation methods. Next, we present the comparison results. We also provide examples of local style control and validate the effectiveness of our approach for compressed images. Finally, we report complexity analysis for the proposed methods.

[Figure caption fragment: ... and (b) BRISQUE [60] for DIV2K according to condition parameters.]

Datasets
For the experiments, we train the FxSR with DIV2K [31] dataset, which contains 800 training images, 100 validation images, and 100 test images. We use BSDS100, General100, and DIV2K 100 validation images as our test datasets. We also use JPEG-compressed images for training and testing FxSR models to show that our proposed method is still effective on the real-world compressed LR images. The scaling factors of 4× and 8× are tried for experiments.

Evaluation Method
[Figure caption fragment: ... [54] vs. LPIPS [56], and (c) SSIM [54] vs. DISTS [57] for DIV2K according to condition parameters.]

Figure 11: Visual comparison with state-of-the-art perception-driven SR methods on DIV2K validation set [31]. The proposed method produces competitive results compared to other modern techniques and can also generate reconstructed images of various styles of LR images.

To evaluate the perceptual distance to the ground truth, we report LPIPS [56] as default [63], and additionally use DISTS [57] as structure and texture similarity in some cases. PSNR and SSIM [54] are reported as fidelity-oriented metrics. Furthermore, we report the no-reference metric NIQE [56]. Since the consistency with the LR image is also an important factor, we report the LR-PSNR, computed as the PSNR between the downsampled SR image and the original LR. To measure the meaningful diversity of SR methods that can actively sample from the space of plausible super-resolutions, we also report the SR-Diversity score, which is used for the evaluation protocol on the Super-Resolution Space Challenge learning track in the NTIRE Challenge 2021 [64,65]. Specifically, we sample 11 images and densely calculate the LPIPS [56] metric between the samples and the ground truth. To obtain the local best score, we select the best score out of the 11 samples at each pixel and take the average over the full image. The global best score is calculated by averaging each sample's score over the whole image and selecting the best one. Then, the diversity score is calculated as follows:

$$\text{Diversity} = \frac{G_{best} - L_{best}}{G_{best}} \times 100 \quad (16)$$
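The local best, global best, and diversity scores described above can be sketched as follows. The sketch assumes that a per-pixel perceptual distance map (lower is better) is already available for each sample, and that the diversity score is the gap between global best and local best normalized by the global best and expressed as a percentage; the function name and the toy maps are hypothetical.

```python
def diversity_score(distance_maps):
    """Return (diversity, local_best, global_best) for a list of 2D distance maps.

    distance_maps[k][r][c] is the perceptual distance of sample k at pixel (r, c).
    """
    height = len(distance_maps[0])
    width = len(distance_maps[0][0])
    n_pixels = height * width
    # Global best: average each sample over the whole image, then keep the best sample.
    global_best = min(sum(sum(row) for row in m) / n_pixels for m in distance_maps)
    # Local best: keep the best sample at every pixel, then average over the image.
    local_best = sum(
        min(m[r][c] for m in distance_maps)
        for r in range(height) for c in range(width)
    ) / n_pixels
    diversity = (global_best - local_best) / global_best * 100.0
    return diversity, local_best, global_best


# Two toy samples that are each better in a different pixel.
sample_a = [[0.2, 0.4]]
sample_b = [[0.4, 0.2]]
diversity, local_best, global_best = diversity_score([sample_a, sample_b])
```

If all samples are identical, the local and global best coincide and the diversity is zero; the score grows when different samples are best in different regions.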

Training Method
For the scaling factor 4×, sub-images are cropped to the size of 320 × 320 with a stride of 160 from the HR training images and to the size of 80 × 80 with a stride of 40 from the LR training images. For the scaling factor 8×, the LR sub-images are cropped to the size of 40 × 40 with a stride of 20. Then, the image pairs in the batch for each training iteration are randomly cropped from these sub-images. The HR patch size is 128 × 128, and the LR patch sizes are 32 × 32 and 16 × 16 for scaling factors of 4× and 8×, respectively.
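The sub-image cropping above can be sketched as a sliding window whose HR and LR boxes stay aligned through the scaling factor. The helper names are hypothetical, and the sketch assumes that the HR image size and the crop offsets are divisible by the scaling factor.

```python
def crop_origins(length, crop, stride):
    """Top-left offsets of windows of size `crop` taken every `stride` pixels."""
    return list(range(0, length - crop + 1, stride))


def paired_crop_boxes(hr_height, hr_width, scale, hr_crop=320, hr_stride=160):
    """Aligned (HR box, LR box) pairs, each box given as (top, left, size)."""
    boxes = []
    for top in crop_origins(hr_height, hr_crop, hr_stride):
        for left in crop_origins(hr_width, hr_crop, hr_stride):
            hr_box = (top, left, hr_crop)
            lr_box = (top // scale, left // scale, hr_crop // scale)
            boxes.append((hr_box, lr_box))
    return boxes


# 4x: 320 x 320 HR windows (stride 160) pair with 80 x 80 LR windows (stride 40).
boxes_x4 = paired_crop_boxes(480, 640, scale=4)

# 8x: the same HR windows pair with 40 x 40 LR windows (stride 20).
boxes_x8 = paired_crop_boxes(480, 640, scale=8)
```

Training patches are then drawn at random positions inside these sub-images, again keeping the HR and LR coordinates aligned by the same factor.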

Evaluation of Flexible SR for Perception-Distortion (FxSR-PD)
By adjusting a single parameter t, the FxSR-PD model can generate various SR results for the trade-offs between distortion and perception objective at the inference phase, as shown in Figure 5. It shows that t = 0 generates blurry outputs as the FxSR objective is distortion-oriented, and t = 1 generates sharp textures as the FxSR becomes perception-oriented. Also, the t between 0 and 1 generates different trade-offs, with less or more distortions, and more or less blurriness.
Since there is a trade-off between the distortion-oriented metrics and the perception-oriented metrics, it is necessary to evaluate the performance of the SR models in a perception-distortion 2D plane [67], as shown in Figure 9. The vertical axis denotes perceptual loss LPIPS [56], and the horizontal axis the PSNR (distortion-oriented measure). Hence, the lower left part is the desired place where both MSE and perceptual loss are low [67], and we can see that our method is comparable to others in this respect. Note that the RRDB [44] and ESRGAN [44] are the results of using distortion-oriented and perception-oriented loss, respectively. Others drawn in solid lines are adjustable methods. Pixel interpolation (Pix-Interp) and network weight interpolation (Net-Interp) methods utilize two differently trained models, i.e., the RRDB and ESRGAN stated above. The number of parameters for each method is also provided for complexity comparison. More details about complexity analysis will be provided in Section IV.F.
Since various metrics examined in Figures 6-8 have different characteristics and performance, we present additional performance comparisons for the perception-distortion plane with these metrics in Figure 10. These comparisons show trends similar to those in Figure 9. Table 1 shows the evaluation of FxSR-PD and other SR methods for the specific t values. The proposed FxSR-PD obtains the best PSNR and SSIM at t = 0 among perception-oriented methods and the best LPIPS values at t = 0.8 for all datasets.

Qualitative Comparison
Visual comparisons between our proposed FxSR-PD and other state-of-the-art methods for 4× and 8× are shown in Figures 11 and 12, respectively. We can see that our FxSR-PD provides stronger edges and finer details than the distortion-oriented method RRDB [44] and the other perception-oriented ones. Also, there are fewer artifacts in our method compared to others.

Diverse Style HR Generation
Unlike the FxSR-PD that attempts flexible trade-offs between perception and distortion, the FxSR-DS aims to generate various styles of HR textures with perceptually high scores for all t values. As shown in Figures 7 and 8, the FxSR-DS scores better overall with a relatively narrow dynamic range regarding the perception-oriented metrics other than VIF [58]. On the other hand, it scores relatively lower for distortion-oriented metrics as in Figure 6. The loss terms and their weights for the conditional objective of the FxSR-DS model are described in Figure 4. Different from FxSR-PD with one perceptual loss term, four perceptual loss terms at different feature levels are used. In Figure 13, we can see that the SR results for different t values have different types of styles that are clearly distinct from each other. While Figure 5 shows the trade-off results between perception and distortion, Figure 13 visualizes our method's scalability to generate various styles of textures by employing more feature spaces in the loss.

Table 2 compares our models with DNI [26] and SRFlow [24] in terms of LR-PSNR (low-resolution PSNR), LPIPS, and diversity metrics, which follow the evaluation protocol of the NTIRE 2021 Challenge [64,65] stated previously. Table 1 is the evaluation of SR results for a specific t value, while Table 2 is the average of all of the SR results for 11 different t values, from 0 to 1, with a step size of 0.1. Specifically, in Table 2, the FxSR-DS generally scores the best mean LPIPS and Local best (L-best) LPIPS, while the FxSR-PD achieves the best Global best (G-best) LPIPS score. This indicates that the perceptually distinct diverse SR results generated by FxSR-DS in Figure 13 are of high quality in terms of perception-oriented metrics. Since Local Best LPIPS is the maximum performance of the SR model in terms of perceptual measurement, the proposed FxSR-DS shows an improvement of about 2.7% compared to the SRFlow.
(a) The conventional method of using multiple SR models, each trained separately for a different objective. (b) The proposed method of using a single FxSR-PD model trained on the training distribution of objectives.
Figure 15: Comparison of the SR results of the conventional method (a), which applies one objective to the entire image, and the FxSR-PD method, which applies different objectives for each area (clothes and letters) through a local map. We can see that the proposed FxSR-PD in (b) can more accurately produce the locally intended and suitable SR results without side effects such as blurry textures and broken characters.

Figure 18: An example of applying a user-created depth map to enhance the perspective feeling with the sharper and richer textured foreground and the background with more reduced camera noise than the ground truth.

Quantitative Comparison
Figure 14 also demonstrates that while the FxSR-PD scores a better G-best LPIPS than the FxSR-DS, the FxSR-DS scores a better L-best LPIPS than the FxSR-PD. Meanwhile, the SRFlow [24], which learns the sample distribution during training, produces the highest diversity, whereas the proposed models are trained to optimize objectives in the training distribution of objectives. However, it is also important to note that the diversity scores are normalized by the G-best LPIPS as in Eqn. 16. This means that the higher the G-best LPIPS, that is, the lower the absolute perceptual quality level, the higher the diversity score.

Per-pixel Style Control
In this section, we demonstrate some examples of applying local style control. First, Figure 15 is an example where the LR image has both text and texture areas. In the conventional methods for the SR of Figure 15(a), multiple SR models are trained with one objective each. Then a model is selected, and the entire image is optimized with the model's objective. If the SR model 0 is selected, which is RRDB [44] representing the distortion-oriented model, the textures of the clothes are blurred while the text edges are restored without artifacts. Conversely, suppose we select the SR model N − 1, which is ESRGAN [44] representing the perception-oriented model. In that case, some characters in the text area are broken while the textures of the clothes are naturally restored. On the other hand, the proposed FxSR-PD in Figure 15(b) can restore both the textures of clothes and characters at the same time by applying different objectives to each area through the locally-manipulated style map.
As the second example, let us consider the structural edges of the building and textures of the tree area in Figure 16. In a typical approach of using multiple SR models in Figure 16(a), when the SR model 0 (RRDB) is selected, the structural edges of the building are restored without artifacts, but the tree textures are blurred. Conversely, if the SR model N-1 (ESRGAN) is chosen, the overshoot side-effect occurs around the edges. As shown in Figure 16(b), similar to the previous example, when a properly adjusted local style map is fed along with the input image, the proposed model FxSR-DS can restore both the tree textures and building edges naturally.
The next is an example of enhancing the perspective feeling when depth information is available, as shown in Figure 17. Precisely, the input image and depth map pairs used in this example are from the Make3D dataset [68,69]. When the depth map is used as T in our FxSR, the foreground region is super-resolved in a perception-oriented way (with emphasized texture), and the background region in a distortion-oriented way (somewhat blurry). Depth information obtained by equipment such as Kinect [70] and Time-of-Flight (ToF) cameras [71,72], or by depth estimation algorithms [73], can be used. It is also possible for users to directly generate a depth map from an input image using image editing software, as shown in Figure 18. This makes the foreground clearer with sharp details and avoids the unnaturalness of the background becoming as sharp as the foreground. In addition, the camera noise in the background can be reduced. As seen in the examples so far, the proposed method can be used in various fields that require different processing for each area for a specific purpose.
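Turning a depth map into a style map can be sketched as a simple normalization. The linear mapping below, which assigns t = 1 to the nearest point and t = 0 to the farthest, is an assumption for illustration; the function name and the toy depth values are hypothetical, and the depth map is assumed to be already resized to the LR resolution.

```python
def depth_to_style_map(depth, t_near=1.0, t_far=0.0):
    """Linearly map depth values to style values.

    The nearest pixel gets t_near (perception-oriented) and the farthest
    gets t_far (distortion-oriented); values in between are interpolated.
    """
    flat = [d for row in depth for d in row]
    d_min, d_max = min(flat), max(flat)
    span = d_max - d_min
    if span == 0:
        # Constant depth: treat the whole image as foreground.
        return [[t_near for _ in row] for row in depth]
    return [[t_near + (t_far - t_near) * (d - d_min) / span for d in row]
            for row in depth]


# Toy LR-sized depth map in arbitrary units (larger means farther away).
depth = [[1.0, 3.0],
         [2.0, 5.0]]
style_map = depth_to_style_map(depth)
```

A user-edited map can be handled in the same way, since the network only sees per-pixel values in [0, 1].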

Compressed LR Image Restoration
Since real-world SR is challenging due to unknown degradation and various noise [34,74,75,76,77,78,79,80], we also validate the effectiveness of our method for compressed inputs in Figure 19. Unlike previous experiments, FxSR and SRGAN [20] are re-trained using LR images compressed with JPEG quality factor 90, called FxSR-CA (compression artifacts) and SRGAN-CA. We can see that while compression artifacts are amplified in the results of SRResNet [20] and SRGAN [20] trained with clean images, the proposed FxSR-CA generates different styles and details according to the change of t. To test the effectiveness of the proposed method for the case of real-world compressed images, two videos, which are filmed, edited, and copyrighted by Milosh Kitchovitch, are used by his courtesy. Details of the videos are provided in Table 3.

Complexity Analysis
We compare the running time, computation costs, and storage size of our methods with other SR methods in Table 4. We measure the complexity of 4× SR processing of one 128 × 128 LR input image on an NVIDIA RTX 3090 GPU. According to Table 4, ESRGAN, with the high-complexity RRDB architecture in Figure 3(b), requires about 10 times as many Mult-Adds and about 10 times the run-time of SRGAN. Compared to ESRGAN, FxSR with the proposed RRDBs with SFT in Figure 3(c) has almost the same number of Mult-Adds and parameter size, but its Forward Pass Size is about 4 times larger, and its run-time also increases by about 4 times due to the additional memory usage of the SFT layers. However, note that we use a single network to generate diverse outputs, whereas the existing methods need at least two networks to produce varying outputs. This is observed in Figure 9, where FxSR requires fewer or a comparable number of parameters compared with the network/image interpolation methods that use multiple ESRGAN models.
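To make the quantities in this comparison concrete, the following sketch counts the parameters and Mult-Adds of a single convolution layer and the storage needed when several models must be kept. It assumes the common convention of one multiply-accumulate per kernel weight per output position; the tool and convention actually used for Table 4 are not stated here, and the layer size (3 × 3 kernel, 64 input and 64 output channels) is a hypothetical example, not a layer taken from any of the compared networks.

```python
def conv2d_cost(c_in, c_out, kernel, h_out, w_out, bias=True):
    """Return (parameters, mult_adds) for one 2-D convolution layer."""
    weights = c_in * c_out * kernel * kernel
    params = weights + (c_out if bias else 0)
    # Assumed convention: one multiply-accumulate per weight per output pixel.
    mult_adds = weights * h_out * w_out
    return params, mult_adds


def storage_params(params_per_model, num_models):
    """Parameters to store when num_models separate networks are kept."""
    return params_per_model * num_models


# Hypothetical layer evaluated on a 128 x 128 feature map (the LR input size).
params, mult_adds = conv2d_cost(c_in=64, c_out=64, kernel=3,
                                h_out=128, w_out=128)
print(params)      # 36928
print(mult_adds)   # 603979776
```

Under the same convention, `storage_params` shows why interpolating between N separately trained models multiplies the stored parameters by N, whereas a single adjustable model stores them once.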

Ablation Study
The goal of classic multi-objective optimization is to find a set of solutions that is as close as possible to the Pareto-optimal front and as diverse as possible [81,82]. To investigate how performance depends on network architecture and complexity, we observe the change in the perception-distortion (PD) curve while training two versions of FxSR-PD, one using 16 RBs with SFT in Figure 3(a) and the other using 23 RRDBs with SFT in Figure 3(b). As the number of training iterations increases, the PD curve of FxSR-PD converges toward the desired region (lower left), and at the same time, the achievable SR range on the curve expands, as shown in Figures 20(a) and (b). However, after a certain number of iterations, the performance does not improve further. Figure 20(c) compares the performance of the two FxSR-PD versions at the 250,000th iteration.
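The notion of a Pareto-optimal front on the PD plane can be illustrated with a short sketch. It assumes each operating point is a (distortion, perception) pair in which lower is better on both axes, consistent with the "lower left" target above. The helper `pareto_front` is hypothetical, and the numbers are arbitrary placeholders chosen only to exercise the code; they are not measurements from any experiment.

```python
def pareto_front(points):
    """Return the non-dominated (distortion, perception) points.

    Lower is better on both axes. A point p dominates q when p is no worse
    on both axes and differs from q.
    """
    def dominates(p, q):
        return p[0] <= q[0] and p[1] <= q[1] and p != q

    front = [q for q in points
             if not any(dominates(p, q) for p in points)]
    return sorted(front)


# Placeholder operating points (NOT results from the paper).
points = [(1.0, 5.0), (2.0, 3.0), (3.0, 4.0), (4.0, 1.0), (5.0, 2.0)]
front = pareto_front(points)
print(front)  # [(1.0, 5.0), (2.0, 3.0), (4.0, 1.0)]
```

Sweeping the control parameter t of one trained model and collecting such points would give one PD curve; a curve whose points are closer to the lower left and spread over a wider range is preferable under the two criteria stated above.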

Benefits of FxSR
A single FxSR model can produce different styles corresponding to the employed feature losses and can also generate intermediate results between these styles. Moreover, we can control local regions differently by feeding a control map to the network. Hence, we can obtain more natural SR outputs by emphasizing the foreground or salient regions over the background, using user-edited or automatically generated segmentation, depth, or saliency maps. We can also remedy unnaturally generated regions by adjusting the control parameters as a post-processing step.
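As a rough sketch of this kind of local control, the code below builds a control map from a binary foreground mask and then lowers t inside a rectangle that a user judged to look unnatural. The helper names, the [0, 1] range of t, and the convention that larger t is more perception-oriented are assumptions made for illustration, not details confirmed by the experiments.

```python
def mask_to_style_map(mask, t_foreground=1.0, t_background=0.0):
    """Assign one style value to foreground pixels and another to the rest."""
    return [[t_foreground if m else t_background for m in row]
            for row in mask]


def override_region(t_map, top, left, height, width, t_value):
    """Return a copy of t_map with one rectangular block set to t_value."""
    out = [list(row) for row in t_map]
    for r in range(top, min(top + height, len(out))):
        for c in range(left, min(left + width, len(out[r]))):
            out[r][c] = t_value
    return out


# Toy 2x3 mask: 1 marks foreground (assumed to prefer the perceptual style).
mask = [[1, 1, 0],
        [1, 0, 0]]
t_map = mask_to_style_map(mask)
# Tone down the top-left 1x2 block where the output looked over-sharpened.
edited = override_region(t_map, top=0, left=0, height=1, width=2, t_value=0.5)
```

Because the edit only changes the control map, the same trained model can be re-run on the same input with `edited` in place of `t_map`; no retraining is involved.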

Limitations of FxSR
As shown in Table 2, our method can generate results comparable or superior to those of the existing methods in terms of perceptual quality. However, it shows a lower diversity score than SRFlow because only flat control maps were tried in this experiment. Hence, further study is needed on effective control map generation, along with other feature spaces and their combinations, to increase diversity.

Future works
We have used a one-dimensional control parameter t for adjusting SR styles in this work. By defining an SR style space of more than one dimension with various style objectives, we could explore n-dimensional SR spaces, possibly producing more diverse styles. We may also consider extending the work to image denoising and deblurring to control the degree of restoration locally. Furthermore, leveraging meta-learning could improve adaptation to new samples and target objectives.

Conclusion
We have presented a novel training method and a network structure for SISR that enable us to explore various region-wise HR outputs. With them, we can flexibly reconstruct images anywhere between perception-oriented and distortion-oriented ones. This is achieved by defining a conditional objective function whose weights are applied to the perceptual losses at various feature space levels. Also, our network is designed to modulate its intermediate features so that its operation changes according to these control inputs. As a result, we can generate an image with a desired restoration style for each area. Experiments show that the proposed FxSR yields state-of-the-art perceptual quality and higher PSNR than other perception-oriented methods. Also, we can find many solutions by controlling a single parameter at the inference phase. We will release our code for further research and comparisons.