Skip-Concatenated Image Super-Resolution Network for Mobile Devices

Single-image super-resolution technology has been widely studied in various applications to improve the quality and resolution of degraded images acquired from noise-sensitive low-resolution sensors. Because most studies on single-image super-resolution have focused on developing deep learning networks that run on high-performance GPUs, this study proposes an efficient and lightweight super-resolution network that achieves real-time performance on mobile devices. To replace the relatively slow element-wise addition layer on mobile devices, we introduce a skip connection layer that directly concatenates a low-resolution input image with an intermediate feature map. In addition, we introduce weight clipping to reduce the quantization errors commonly encountered during float-to-int8 model conversion. Moreover, a reparameterization method is selectively applied without increasing inference time or the number of parameters. Based on these contributions, the proposed network was recognized as the best solution in the Mobile AI & AIM 2022 Real-Time Single-Image Super-Resolution Challenge, with a PSNR of 30.03 dB and an NPU runtime of 19.20 ms.


I. INTRODUCTION
Single-image super-resolution (SISR) technology transforms a low-resolution (LR) image into a high-resolution (HR) image that provides higher pixel density and more textural information than the LR image. In general, super-resolution is utilized in computer vision applications such as remote sensing [1], [2], underwater applications [3], [4], medical image processing [5], [6], and multimedia applications [7]. However, transforming an LR image into an HR image is an ill-posed problem because multiple HR images can correspond to a single LR image. To predict a suitable corresponding HR image, traditional methods such as interpolation-based [8] and representation-based [9], [10] methods have been proposed.
However, the majority of high-performing SISR methods have focused on the reconstruction quality of HR images from LR images, relying on expensive high-performance GPUs.
To enhance reconstruction quality, these methods introduced various techniques and networks, which evolved into SISR models with extensive numbers of parameters and high computational complexity [22]. However, the complex network structures, the variety of deep learning techniques, and the large number of parameters make it challenging to deploy such SISR methods on mobile devices. Owing to the limited computing resources and hardware of mobile devices compared with desktop or cloud resources, an SISR model for mobile devices requires a lightweight network and hardware-friendly deep learning techniques such as INT8 quantization.
To perform the SISR task on mobile devices under deployment constraints, we propose a skip-concatenated image super-resolution network (SCSRN) that transforms LR images into HR images with substantial accuracy and real-time inference speed. Concisely, the major contributions are stated as follows:
1) We propose a highly efficient super-resolution network (SCSRN) that delivers higher accuracy at faster speed than previous mobile SR models. Notably, we excluded the element-wise addition operation, which is a labor-intensive task on mobile devices, and instead introduced a lighter skip-concatenation layer that avoids the memory replications needed to equalize the input dimensions.
2) We propose a training method that is robust to quantization error. With a skip connection, the distribution of the kernel weights tends to become asymmetric during training. This asymmetric distribution causes serious degradation of image quality after quantization because mobile devices support only symmetric quantization for kernels. To this end, the valid range of the weights was constrained during training.
3) We selectively applied a reparameterized convolution (RepConv) layer to improve image quality while maintaining the model size and inference speed. Interestingly, our experiments show that applying the RepConv layer to all layers in SCSRN can compromise the reconstructed image quality.
4) We decompose the clipped rectified linear unit (ReLU), originally introduced to prevent output overflow and underflow during inference, into a min operation and a ReLU, and merge the ReLU into the last convolution layer. Before merging, the standalone clipped ReLU incurs up to 2.5 ms of latency on mobile devices, but we successfully removed this latency from the inference time.
The remainder of the paper is organized as follows. In Section 2, we discuss related work on super-resolution. The proposed method is described in Section 3. The effectiveness of our SCSRN model is validated in Section 4. Finally, the conclusions of this study are summarized in Section 5.

II. RELATED WORK
A. SINGLE IMAGE SUPER-RESOLUTION METHODS
SISR methods can be classified into traditional methods and data-driven deep learning methods. Traditional SISR methods include interpolation-based [8] and representation-based methods [9]. In principle, the interpolation-based method considers the relationship of neighboring pixels, whereas the representation-based method reconstructs the HR image by deriving a mapping function between cropped LR and HR patches. However, both methods exhibit limitations in reconstructing detailed features and patterns. Recently proposed CNN-based methods deliver excellent performance by solving several problems that could not be resolved using traditional methods.
A CNN-based SISR model is trained to reconstruct the target HR image from a given LR image. With the advent of CNN-based SISR networks such as SRCNN [11], this pipeline produced remarkable performance on the SISR task. Early CNN-based SR models stacked more layers to improve performance, but this approach caused a gradient-vanishing problem and limited the achievable image quality. Thereafter, very deep super-resolution [23] and deeply-recursive convolutional [24] networks employed deeply stacked residual blocks to resolve this issue. In addition, the enhanced deep super-resolution network (EDSR) [18] demonstrated that batch normalization (BN), which performs remarkably well in classification tasks, degrades SISR performance by normalizing the features, and therefore removed the BN layers. To improve training stability without BN layers, EDSR uses a residual-scaling method. Consequently, EDSR achieved state-of-the-art results by enhancing the feature representation of the model.
Recently, the residual channel attention network [20] and the second-order attention network (SAN) [25] significantly improved performance by adopting channel attention mechanisms. However, such attention mechanisms require considerable memory during inference owing to their spatial and non-local operations. Moreover, their applicability to low-power devices such as mobile or IoT devices is limited.

B. LIGHTWEIGHT SUPER-RESOLUTION METHODS FOR MOBILE
Although CNN-based SISR models significantly improve the restored image quality, these advancements have increased the extent of computation and memory required in the inference stage. Furthermore, the demand for such applications in low-power devices (e.g., mobile and IoT) has increased in various computer-vision tasks. To satisfy this demand, research is currently being conducted to reduce the computational complexity and design efficient network structures for the SISR network.
Research on lightweight super-resolution networks for mobile devices can be classified into network optimization, which enhances performance within the same network structure, and the design of hardware-friendly architectures, which reduces inference time. The representative methods of network optimization include pruning, quantization, and knowledge distillation.
Pruning is categorized into filter pruning (a.k.a. structured pruning) [26] and weight-level pruning (a.k.a. unstructured pruning) [27]. Unstructured pruning does not save inference time, whereas structured pruning reduces the complexity of inference. Filter pruning for SR networks [26], [28] constitutes a promising approach for achieving a reasonable trade-off between performance and complexity by eliminating filters that do not influence network performance.
The representative quantization methods include mixed precision [29] and quantization-aware training (QAT) [30]. Mixed-precision training can improve performance by searching for the optimal bit width per layer. In contrast, QAT inserts fake quantization into the network training process: the model is still trained in the original floating-point precision, but the forward pass simulates 8-bit integer arithmetic through clamping and rounding approximation. This approach minimizes performance degradation during deployment by reducing the quantization error introduced when the model is converted to a quantized format.
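As an illustration only, QAT in TensorFlow can be set up with the TensorFlow Model Optimization toolkit as sketched below; the model `model`, the dataset `train_ds`, and the hyper-parameters are placeholders rather than the exact setup used in this paper.

```python
# Minimal QAT sketch: wrap a float Keras model with fake-quantization nodes.
# Training still runs in float32, but forward passes simulate int8 rounding
# and clamping. `model` and `train_ds` are assumed to exist already.
import tensorflow as tf
import tensorflow_model_optimization as tfmot

qat_model = tfmot.quantization.keras.quantize_model(model)

qat_model.compile(optimizer=tf.keras.optimizers.Adam(1e-5),
                  loss=tf.keras.losses.MeanAbsoluteError())
qat_model.fit(train_ds, epochs=10)
```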
Knowledge distillation is a method of transferring knowledge from a teacher model to a lightweight student network [31]. In principle, knowledge distillation can be segmented into feature distillation [32], [33] and image domain distillation [34]. For instance, feature distillation trains a feature map in the student network to resemble that of the teacher network. In contrast, image domain distillation trains the output of the student network (the HR image in the case of SISR) to resemble that of the teacher network.
Finally, approaches that design a hardware-friendly network structure seek appropriate structures based on profiling results obtained during inference [35], [36], [37]. Thus, it is a process of searching for a structure that offers a good compromise between performance and speed. Fig. 1 shows mobile-friendly network structures. ABPN [35] consists of seven 3 × 3 convolution layers and utilizes an element-wise add operation between channel-wise duplicated LR images and the output feature maps. NCNET [36] introduces a nearest convolution layer, which operates as nearest-neighbor interpolation of the LR image, instead of channel-wise duplication. XLSR [37] uses channel-split blocks (GBlock) and concatenation operations.
In this paper, we propose an efficient approach to designing a hardware-friendly network (e.g., a reparameterized block and removal of the element-wise operation) and develop an appropriate training strategy for it, such as training that is robust to quantization errors.

III. THE PROPOSED METHOD
The proposed method is detailed in this section. First, we illustrate the proposed network architecture, SCSRN, and explain its novelty with respect to the anchor-based plain net (ABPN) [35] that served as our inspiration. Subsequently, we describe the reparameterized block (RepConv block), an over-parameterization strategy employed to improve network performance. Third, we introduce the weight-constrained QAT method to minimize the quantization error. Lastly, the speed-up method and training strategy for the SCSRN model are described.

A. NETWORK ARCHITECTURE OF SCSRN
The overall structure of the SCSRN is illustrated in Fig. 2; it is an ABPN-inspired structure comprising four components. The first component is a feature extraction layer that extracts features from the LR image. The second component is a backbone comprising four RepConv blocks to learn deeper features. The third component includes two transition layers that provide a residual-learning effect after directly concatenating the feature maps with the LR image. The final component (depth_to_space) performs the pixel re-arrangement for restoring the HR image.
For deeper comprehension, let $I_{LR}$ and $I_{HR}$ denote the input and output of the network. We obtain the features $F_0$ as

$$F_0 = H_{FE}(I_{LR}),$$

where $H_{FE}(\cdot)$ denotes the function that extracts features from an image. Subsequently, we obtain the $i$-th features $F_i$ by

$$F_i = H_{BB}^{i}(F_{i-1}),$$

where $H_{BB}^{i}$ represents the function of the $i$-th deep feature block, which contains high-frequency and texture information. Thereafter, we concatenate the feature map and $I_{LR}$ along the channel axis, expecting a residual effect, and pass the result through the two transition layers to obtain $F_{HR}$:

$$F_{HR} = H_{TR}\big(H_{TR}\big([F_N, I_{LR}]\big)\big),$$

where $H_{TR}$ denotes a transition layer, $F_N$ is the output of the last backbone block, and $[\cdot,\cdot]$ denotes channel-wise concatenation. Using the pixel-rearrangement function $H_{RA}$, we derive $I_{HR}$ from $F_{HR}$ and clip all pixels to the range $[0, 255]$:

$$I_{HR} = \min\big(\mathrm{ReLU}\big(H_{RA}(F_{HR})\big),\, 255\big).$$

The ReLU function in the tail of the network constrains the lower bound (0) of the pixel values, and the min function maintains the upper bound (255).
Unlike ABPN [35] and NCNET [36], the input LR image and the feature maps are concatenated directly in the middle of the network, and the last two transition layers and the depth_to_space layer provide a smooth transformation from the concatenated features to the SR image. The concatenation operation helps to reduce the quantization error because the input LR image, which already contains 8-bit pixel values, is not corrupted by INT8 quantization. Moreover, ABPN duplicates the LR image multiple times to obtain two equal input dimensions for the element-wise addition operation. In contrast, the skip-concatenation operation saves inference time by omitting the multiple memory-copy operations required in ABPN.
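For concreteness, a minimal Keras sketch of this layout is shown below in its inference form, with the RepConv blocks already collapsed to plain 3 × 3 convolutions. The per-layer activations and channel widths are assumptions based on the description above (a 32-channel, seven-convolution network), not the released model.

```python
# Inference-time sketch of the SCSRN layout: feature extraction, a four-block
# backbone, skip-concatenation of the raw LR image, two transition layers,
# pixel re-arrangement, and the upper-bound clip.
import tensorflow as tf

def build_scsrn(scale=3, channels=32):
    lr = tf.keras.Input(shape=(None, None, 3))            # LR input image
    x = tf.keras.layers.Conv2D(channels, 3, padding='same',
                               activation='relu')(lr)      # feature extraction
    for _ in range(4):                                      # backbone (collapsed RepConv)
        x = tf.keras.layers.Conv2D(channels, 3, padding='same',
                                   activation='relu')(x)
    x = tf.keras.layers.Concatenate(axis=-1)([x, lr])       # skip-concatenation of raw LR
    x = tf.keras.layers.Conv2D(channels, 3, padding='same',
                               activation='relu')(x)        # transition 1
    x = tf.keras.layers.Conv2D(3 * scale * scale, 3, padding='same',
                               activation='relu')(x)        # transition 2, ReLU merged here
    hr = tf.nn.depth_to_space(x, scale)                     # pixel re-arrangement
    hr = tf.minimum(hr, 255.0)                              # upper-bound clip (MinClip)
    return tf.keras.Model(lr, hr)
```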

B. RepConv BLOCK
We applied the reparameterization method to improve the reconstructed image quality without any architectural variation in the inference stage. According to [38] and as depicted in Fig. 3, a reparameterized structure can be collapsed back into a single convolution as long as it preserves linearity, even if the convolution layers overlap in various manners. To express this mathematically, consider Fig. 3-(Left): let the weight and bias of Conv_3 × 3 be $W_1$ and $b_1$, those of Conv_1 × 1 be $W_2$ and $b_2$, and let the input and output be $x$ and $y$. The block can then be expressed as Eq. (7):

$$y = W_2 \ast (W_1 \ast x + b_1) + b_2 + x = (W_2 \cdot W_1 + I) \ast x + (W_2 \cdot b_1 + b_2), \tag{7}$$

where $I$ represents the identity matrix introduced by the addition operator. Similarly, Fig. 3-(Right) corresponds to Eq. (8):

$$y = W' \ast x + b', \quad W' = W_2 \cdot W_1 + I, \quad b' = W_2 \cdot b_1 + b_2. \tag{8}$$

Therefore, Fig. 3-(Left) reduces to Fig. 3-(Right) through Eq. (7) and (8). Considering Eq. (7) and (8), the left-hand structure learns richer high-level information in the training step and can be simplified into a single convolution in the inference step according to the above-mentioned operation.
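A minimal NumPy sketch of this collapse is given below, under the assumption that the training-time block is a 3 × 3 convolution followed by a 1 × 1 convolution with an identity skip and matching input/output channel counts; kernel layout follows the TensorFlow/Keras HWIO convention.

```python
# Collapse Conv3x3 -> Conv1x1 plus an identity skip into one 3x3 convolution,
# following Eq. (7)-(8). w1: (3, 3, c_in, c_mid), w2: (1, 1, c_mid, c_out).
import numpy as np

def collapse_repconv(w1, b1, w2, b2):
    w2 = w2[0, 0]                                   # (c_mid, c_out)
    w_merged = np.einsum('hwim,mo->hwio', w1, w2)   # 1x1 applied on top of the 3x3
    b_merged = b1 @ w2 + b2                         # propagate the first bias

    # Add the identity branch: a 3x3 kernel whose centre tap is 1 on the
    # matching input/output channel (the "+ I" term in Eq. (7)).
    c = w_merged.shape[2]
    identity = np.zeros_like(w_merged)
    for i in range(c):
        identity[1, 1, i, i] = 1.0
    return w_merged + identity, b_merged
```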
We experimented with applying RepConv to each convolution layer and confirmed that the most advantageous configuration is the one displayed in Fig. 2. In addition, Xavier initialization proved to be an appropriate weight initialization method, although the initialization setting could be improved further. To reduce the quantization error in the QAT stage, we converted the RepConv training form into its simplified inference form (a single 3 × 3 convolution layer) after the fine-tuning stage. The results obtained with RepConv are described in the experimental results section.

C. WEIGHT-CONSTRAINED QUANTIZATION AWARE TRAINING
To execute deep learning models on mobile devices with low memory or computational power, TensorFlow [39] supports two types of model optimization, namely post-training quantization (PTQ) and QAT. As both techniques perform quantization, some performance degradation is inevitable. In particular, PTQ performs quantization after training is completed, converting the operations from float32 to float16 or int8. In contrast, QAT performs simulated quantization during the training step. Generally, applying PTQ after QAT yields less performance loss than applying PTQ directly to the model. Therefore, we applied the quantization pipeline training (from scratch) → QAT → PTQ in sequence.
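The final PTQ step can be sketched with the TensorFlow Lite converter as below. The trained model `qat_model` (from the QAT sketch earlier) and the calibration patches are placeholders; in practice, real LR patches from the training set should be used for calibration.

```python
# Post-training full-integer (int8) conversion sketch for TensorFlow Lite.
import numpy as np
import tensorflow as tf

# Placeholder calibration data; replace with real LR training patches.
calibration_patches = [np.random.rand(180, 320, 3).astype("float32")
                       for _ in range(8)]

def representative_dataset():
    for patch in calibration_patches:
        yield [patch[None, ...]]

converter = tf.lite.TFLiteConverter.from_keras_model(qat_model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_dataset
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8    # tf.uint8 is also common for images
converter.inference_output_type = tf.int8
tflite_model = converter.convert()
```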
Despite applying this procedure, the quantization error was larger than expected because the weights of the first convolution layer exhibited an asymmetric distribution, as portrayed in Fig. 4-(Left). The 8-bit quantization specification of TensorFlow Lite [40] allows only symmetric quantization for weights, under the assumption that the weight distribution is symmetric. Thus, the quantization error accumulated in the first layer and degraded the overall performance. To mitigate this degradation, the distribution was constrained to be centered around zero by clipping the weights to the range [−2, 2] during training, which altered the distribution as depicted in Fig. 4-(Right). Notably, the clipping range was set to [−3, 3] for the RepConv blocks and [−2, 2] for the other layers.
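One straightforward way to realize such weight clipping during training is a Keras kernel constraint, as sketched below; the class name and its usage are illustrative, not the paper's exact implementation.

```python
# Weight-clipping constraint: kernel values are clipped to a fixed range
# after every optimizer update.
import tensorflow as tf

class ClipWeights(tf.keras.constraints.Constraint):
    def __init__(self, low=-2.0, high=2.0):
        self.low, self.high = low, high

    def __call__(self, w):
        return tf.clip_by_value(w, self.low, self.high)

    def get_config(self):
        return {'low': self.low, 'high': self.high}

# [-3, 3] for RepConv blocks, [-2, 2] elsewhere, as described above.
conv = tf.keras.layers.Conv2D(32, 3, padding='same',
                              kernel_constraint=ClipWeights(-2.0, 2.0))
```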

D. SPEED UP METHOD
To further improve speed, the network was analyzed using an external tool called Netron [41]. Visualizations of ABPN [35] (Left) and SCSRN (Right) are presented in Fig. 5. As depicted in Fig. 5, ABPN [35] omits the ReLU function on the terminating convolution layer of the super-resolution network and instead uses a clipped ReLU function to prevent output overflow and underflow [42]. The clipping operation is performed after pixel rearrangement (i.e., the DepthToSpace operation of TensorFlow) in Fig. 5-(Left).
Although the latency of the convolution combined with ReLU was less than 1 ms, that of a standalone ReLU operation was up to 3∼4 ms according to layer-wise profiling. This is potentially caused by the hardware (HW) architecture, in which Convolution-BatchNormalization-ReLU modules are designed to operate as a single HW unit, so a standalone ReLU is not efficiently implemented on mobile devices.
The clipped ReLU can be decomposed as in Eq. (9):

$$\mathrm{ClippedReLU}(F_{HR}) = \mathrm{ReLU}\big(\min(F_{HR}, 255)\big) = \min\big(\mathrm{ReLU}(F_{HR}), 255\big). \tag{9}$$

Based on Eq. (9), the clipped ReLU function can be expressed in several equivalent forms. Herein, the ReLU function was merged into the terminating convolution layer, and the minimum operation was left as a separate step to maintain the upper bound (255). The DepthToSpace operation performs only pixel rearrangement and therefore does not interfere with merging the convolution and ReLU functions. Consequently, inference time is saved with no visual degradation.
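The re-ordering can be sketched as follows; `x`, `features`, and the layer configuration are placeholders, and only the structure of the tail matters here.

```python
# Tail of the network before and after the re-ordering in Eq. (9).
import tensorflow as tf

def tail_before(x, scale=3):
    # Original tail: DepthToSpace followed by a standalone clipped ReLU,
    # which runs as a separate (slow) op on the NPU.
    y = tf.nn.depth_to_space(x, scale)
    return tf.clip_by_value(y, 0.0, 255.0)

def tail_after(features, scale=3):
    # Re-ordered tail: the lower bound comes for free from the ReLU fused
    # into the last convolution; only the cheap upper-bound min remains.
    x = tf.keras.layers.Conv2D(3 * scale * scale, 3, padding='same',
                               activation='relu')(features)
    y = tf.nn.depth_to_space(x, scale)
    return tf.minimum(y, 255.0)
```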

E. TRAINING STRATEGY
Only the training dataset of DIV2K was used in the training process. We trained our model in three steps: the scratch training step, the fine-tuning step with a different loss function, and the QAT step.

1) SCRATCH TRAIN STEP
In the first step, our model was trained from scratch. LR patches of size 128 × 128 were cropped from the LR images, with a mini-batch size of 16. The Adam optimizer was used with a learning rate of 0.001, together with a cosine warm-up scheduler with a warm-up ratio of 0.1. The total number of epochs was set to 800. We used the $l_1$ loss expressed in Eq. (10):

$$L_{1}(\theta) = \frac{1}{n}\sum_{i=1}^{n}\big\| f(I^{i}_{lr};\theta) - I^{i}_{hr} \big\|_{1}, \tag{10}$$

where $\theta$ represents the trainable parameters of the proposed network, $n$ denotes the number of training patch images, $I^{i}_{lr}$ and $I^{i}_{hr}$ indicate the LR patch images and the corresponding HR patch images, and $f(\cdot)$ denotes the function of the proposed network.
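A minimal sketch of this objective is shown below; `model` and `train_ds` are placeholders, and the cosine warm-up scheduler is omitted for brevity.

```python
# Scratch-training objective of Eq. (10): mean absolute error between the
# predicted and ground-truth HR patches, optimized with Adam.
import tensorflow as tf

def l1_loss(hr_true, hr_pred):
    return tf.reduce_mean(tf.abs(hr_pred - hr_true))

model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),
              loss=l1_loss)
model.fit(train_ds, epochs=800)
```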

2) FINE-TUNING STEP
In the second step, the model was initialized with the weights trained in the first step. To improve the accuracy, we used the $l_2$ loss expressed in Eq. (11). Fine-tuning with the $l_2$ loss improves the peak signal-to-noise ratio (PSNR) by 0.01∼0.02 dB. In this step, the initial learning rate was set to 0.00002, and the Adam optimizer was used along with a step scheduler (i.e., the learning rate was halved every 40 epochs). The total number of epochs was set to 200. Moreover, we applied a channel-shuffle augmentation.
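A possible form of the channel-shuffle augmentation is sketched below, assuming it permutes the RGB channels of each LR/HR pair with the same random order; the function name and its use with tf.data are illustrative.

```python
# Channel-shuffle augmentation: permute RGB channels consistently for a pair.
import tensorflow as tf

def channel_shuffle(lr_patch, hr_patch):
    perm = tf.random.shuffle(tf.range(3))
    return (tf.gather(lr_patch, perm, axis=-1),
            tf.gather(hr_patch, perm, axis=-1))

# Example use in a tf.data pipeline: train_ds = train_ds.map(channel_shuffle)
```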

3) QUANTIZATION-AWARE TRAINING STEP
In the third stage, the same training settings were used as in step two, except that the QAT model was initialized with the weights trained in the second step, the total number of training epochs was set to 300, and the learning rate was set to 0.00001. Furthermore, a discrete cosine transform (DCT) domain $l_1$ loss was applied between the ground-truth HR and predicted HR images, as expressed in Eq. (12):

$$L_{DCT}(\theta) = \frac{1}{n}\sum_{i=1}^{n}\big\| \mathrm{DCT}\big(f(I^{i}_{lr};\theta)\big) - \mathrm{DCT}\big(I^{i}_{hr}\big) \big\|_{1}, \tag{12}$$

where $\mathrm{DCT}(\cdot)$ represents the DCT domain transformation operation.
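A sketch of such a DCT-domain $l_1$ loss in TensorFlow is given below. The 2-D DCT is formed by applying the 1-D DCT-II along the two spatial axes; the NHWC layout and the orthonormal normalization are assumptions, not the paper's exact implementation.

```python
# DCT-domain l1 loss sketch for Eq. (12).
import tensorflow as tf

def dct_2d(x):
    # tf.signal.dct operates on the innermost axis, so move each spatial
    # axis into place, transform, and continue.
    x = tf.signal.dct(tf.transpose(x, [0, 3, 1, 2]), type=2, norm='ortho')  # along W
    x = tf.signal.dct(tf.transpose(x, [0, 1, 3, 2]), type=2, norm='ortho')  # along H
    return x

def dct_l1_loss(hr_true, hr_pred):
    return tf.reduce_mean(tf.abs(dct_2d(hr_pred) - dct_2d(hr_true)))
```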

IV. SIMULATION RESULTS
The simulation setup and the results obtained with the proposed model are described herein. In particular, we elaborate on the performance improvements obtained by applying the concatenation method, weight clipping, and RepConv. Moreover, we demonstrate the reduction in inference time obtained by excluding the clipped ReLU. Thereafter, the proposed model is compared with previous studies, wherein FSRCNN [12], XLSR [37], SESR [38], ABPN [35], and NCNET [36] are tested on five standard datasets. Finally, we compare the proposed model with previous studies on commercial devices for scale ×3: a Samsung Galaxy Z Fold4 with Snapdragon 8+ Gen 1 and a Galaxy Note20 with Snapdragon 865+.
To ensure a fair comparison, all experiments were conducted in the same experimental environment. The training process was executed using RTX A6000 GPUs. As discussed earlier, we used TensorFlow version 2.5.0 for all three training steps and the TFLite generation step. In particular, the DIV2K train dataset (800 images) was used for training.

A. ABLATION STUDY
We analyze the contributions of each module in terms of five keywords: concatenation, weight clipping, DCT domain loss, MinClip, and RepConv. Table 1 compares the performance according to the network structure of SCSRN.

1) CONCATENATION
The baseline (ConvBase) is a stack of seven successive convolution layers with 32 channels. To match the input dimensions, an "Add" variant replicates the input LR image before the transition layers so that its channel count becomes out_channel × scale²; this improved the PSNR by 0.004 dB during the scratch training step. However, the TFLite performance decreased by 0.182 dB, and owing to the replication operations, the inference time increased by 17.7 ms. Instead of the "Add" operation, we applied the concatenation (Concat) layer before the transition layers to preserve the input information. Consequently, the overall image accuracy during the fine-tuning stage improved by 0.089 dB, and by 0.163 dB for TFLite (int8).

2) WEIGHT CLIPPING
We applied weight clipping (WC) to resolve the asymmetric weight distribution, which improved the TFLite (int8) result by 0.033 dB. This result indicates that the loss from converting FP32 to INT8 was reduced by WC. However, when the RepConv block with WC = [−2, 2] was applied to the layers with identical input and output dimensions, the scratch performance deteriorated marginally and the TFLite (int8) performance did not improve. When we modified the RepConv block to use WC = [−3, 3], the final TFLite (int8) performance increased slightly, by 0.002 dB.

3) DCT DOMAIN LOSS
Furthermore, we introduced the DCT domain $l_1$ loss in the QAT step, which improved the PSNR by 0.02 dB for TFLite (int8). On the DIV2K validation dataset, training with the DCT domain $l_1$ loss in the QAT step provided a 0.01 dB better result than training without it. To select the better option, we tested both models on five benchmark datasets. The results are listed in Table 3, wherein the average PSNR of the training method including the DCT domain $l_1$ loss was 0.006 dB higher than that of the other method. Thus, the DCT domain $l_1$ loss was selected for our final SCSRN model.

4) MinClip
Overall, the experimental investigation revealed the significant influence of the clipped ReLU on the inference time. We resolved this issue by replacing the clipped ReLU with the minimum function (MinClip), which reduced the inference time by 2.5 ms. The inference times in Table 1 were measured on a Galaxy Note20 mobile device with the AI Benchmark tool [43].

5) RepConv
We discovered that applying the RepConv block to the entire network degrades the performance by 0.23 dB for TFLite (int8). Thus, the RepConv block was applied only to the backbone and transition layers, which have identical input and output dimensions. These layers carry an identity connection that aids information propagation, prevents vanishing gradients in deep networks [38], [44], and improves the PSNR performance. The results are summarized in Table 4.

B. COMPARISONS WITH THE STATE-OF-THE-ARTS
We compare the proposed algorithm with conventional algorithms: FSRCNN [12], XLSR [37], SESR [38], ABPN [35], and NCNET [36]. The proposed model and the previous works are tested on five benchmark datasets: Set5, Set14, B100, Urban100, and the DIV2K validation set. For a fair comparison, we measure the PSNR and SSIM of each algorithm in the RGB domain. ABPN and NCNET officially provide source code that operates in the RGB domain, so we obtained their results using this code. FSRCNN [12] and SESR [38] were modified to support RGB channels instead of using only the Y channel; accordingly, the number of channels in their first and last convolution layers was adjusted from one to three. There is no officially released code for XLSR [37], so we reimplemented it. Three different scales are tested: ×2, ×3, and ×4. Qualitative results are shown in Fig. 6. SCSRN successfully reconstructs edges in the HR images and produces better-shaped SR outputs than the previous works. We note that FSRCNN [12] was designed without considering quantization; thus, its image quality is significantly degraded by INT8 quantization.

C. THE INFERENCE TIME AT THE MOBILE DEVICE
The model inference times were measured on commercial mobile devices: a Galaxy Note20 with Snapdragon 865+ and a Galaxy Z Fold4 with Snapdragon 8+ Gen 1. To measure the inference time, we used AI Benchmark [43] to obtain the NNAPI execution times, as listed in Table 5. The inference time was averaged over 100 iterations. Although the proposed model involves a larger number of parameters and operations than the other methods, the measured inference times demonstrate that the proposed SCSRN model is faster than all of these existing models. This demonstrates the superior efficiency of the proposed skip-concatenation network for mobile super-resolution tasks.

D. THE MAI2022 REAL-TIME SUPER-RESOLUTION CHALLENGE
This research was conducted for participation in the MAI 2022 Real-Time Super-Resolution Challenge. The final results of MAI 2022 are displayed in Table 6. The PSNR of MAI 2022 was obtained on the DIV2K test dataset, and the inference time was evaluated on the Synaptics Dolphin smart TV platform with a dedicated NPU (VS680). The score of each final submission was evaluated according to Eq. (13), where C denotes a constant normalization factor.
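For reference, the Mobile AI challenge series scores submissions by trading PSNR against runtime; to the best of our understanding the scoring function takes the form below, and the exact expression of Eq. (13) should be taken from the challenge report.

$$\mathrm{Score}(\mathrm{PSNR}, \mathrm{runtime}) = \frac{2^{\,2\cdot\mathrm{PSNR}}}{C \cdot \mathrm{runtime}}$$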

V. CONCLUSION
This study proposed an efficient and lightweight super-resolution network that directly concatenates an input LR image with an intermediate feature map in the middle of the network. To reduce the quantization error, we introduced weight clipping. Moreover, a reparameterization method was selectively applied and improved the super-resolution image quality without increasing the inference time. Based on these contributions, the proposed network achieved a PSNR of 30.03 dB and an NPU runtime of 19.20 ms in the Mobile AI & AIM 2022 Real-Time Single-Image Super-Resolution Challenge.

ACKNOWLEDGMENT
This work is primarily motivated by previous challenge entries from the Mobile AI Workshop 2021, especially two pioneering studies [35], [37]. Moreover, the authors deeply thank the challenge organizers of the Mobile AI and AIM 2022 Workshop [46] for the opportunity to participate in this challenge.