Residual Forward-Subtracted U-Shaped Network for Dynamic and Static Image Restoration

Advanced high-resolution image sensors are now being developed for special-purpose electro-optical systems, with research focused on robust image quality in terms of super resolution and noise removal under various environmental conditions. Recently, machine-learning and deep-learning methods have been studied as practical restoration techniques for improving the deteriorated image quality of sensors. However, these methods show limitations and side effects such as image non-uniformity. In this paper, we analyze and randomly generate additive white Gaussian noise (AWGN), non-uniform line noise, and dark saturation as representative image degradations. We then propose an advanced U-net model based on global and local residual learning in order to restore images with complex degradation. The proposed method compares favorably with alternative models and previous studies. In particular, the various complex noise components are reduced with uniform quality, so that variation between sequential images is minimized. These findings are supported by mutually corroborating quantitative and qualitative evaluation metrics. In the future, the proposed model is expected to contribute to image quality enhancement in a wide range of field applications such as defense, surveillance, and video media.


I. INTRODUCTION
As image system technology advances, image restoration for high-resolution and special-purpose images has become one of the key technologies in fields such as multimedia, intelligent vehicles, defense, surveillance, and reconnaissance. Image restoration is the task of repairing image damage such as noise, motion blur, focus error, excessive light, and insufficient light. Its success is determined by how completely and efficiently the clean image (prediction image or restored image) is recovered from the degraded image, in comparison to the ground truth (GT) image. Denoising, one of the most substantial tasks in image restoration, is the task of finding the noise components in the input image and restoring a clean image that comes as close to the original as possible. As an example of such work, Fig. 1 shows (a) the noise component corresponding to Fig. 1 (b), which is analyzed in the generation of a clean image as shown in Fig. 1 (c).
The associate editor coordinating the review of this manuscript and approving it for publication was Qiangqiang Yuan.
In order to perform the image denoising task, it is necessary to analyze the causes and types of image degradation. These types can be classified as follows:
1) Signal noise of the photon detector and non-uniformity from dead or bad pixels of image sensors
2) Vertical line noise from a sensor that forms 2D images by vertical scanning with a high-performance 1D sensor (line scan camera)
3) Dynamic noise whose characteristics change from frame to frame in successive images
4) Black and white low-dynamic-range images due to lack of sensor signal strength or environmental influence
As examples of these types, Fig. 2 shows image degradation due to complex noise in a thermal infrared (TIR) image or a 1D scan-based hyper-spectral (HS) image of a special design structure [1]. Image deterioration due to degradation and noise is inconvenient for users and causes the loss of important image information. In particular, techniques used in application systems such as object detection, tracking, and segmentation can show various performance limitations when applied to deteriorated images. The background registration-based adaptive noise filtering (BRANF) algorithm has been studied to address both static and dynamic noise, and general machine learning-based algorithms and the cross fusion-based adaptive contrast (CFACE) method have also been studied to solve these problems, but they show limitations in removing noise effectively with high uniformity and speed [2]-[5].
Recent advances in convolutional neural networks (CNNs), software, and hardware have shown that deep learning [6]-based algorithms (VDSR [7], DnCNN [8], FFDNet [9], MWCNN [10], Noise2Noise [11]) rank high among the state-of-the-art (SOTA) results listed on Papers with Code ([Online]. Available: http://paperswithcode.com/sota). However, side effects of these techniques include image distortion, and limitations are found in the removal of complex noise components, as well as in overall speed. In the present paper, we improve image restoration performance by constructing a network based on global residual learning (GRL) and local residual learning (LRL).
Also, as discussed in MWCNN [10], we consider the trade-off between computing cost and the performance gained from a sufficiently large receptive field. As the receptive field grows, the consumption of computing resources increases, leading to a trade-off between efficiency and performance. However, we also argue that a sufficient receptive field is helpful for image reconstruction, since the noise is unpredictable and appears throughout the image. We therefore adopt a receptive field of sufficient size (192 × 192 pixels) within this trade-off, and regard the resource cost as a problem that will be eased by the development of software and hardware. Although there is a risk of information loss, the pooling layer is used in this paper to cope effectively with the distribution of noise across image scales. As a result, our method shows excellent noise removal and low variation in performance compared to other algorithms.

II. RELATED WORK
Image denoising is a representative task in the field of image restoration research. As traditional methods for removing noise, basic image processing algorithms such as the average filter, median filter, and Gaussian filter have been studied [12]. Image restoration has also developed into machine learning-based algorithms such as Block-Matching and 3D filtering (BM3D) [3] and Weighted Nuclear Norm Minimization (WNNM) [4]. However, BM3D [3] and WNNM [4] are too slow for real-time application and show poor image restoration performance. Recently, with advances in software and hardware, CNN-based algorithms (VDSR [7], DnCNN [8], FFDNet [9], MWCNN [10], Noise2Noise [11], etc.) have developed dramatically and shown outstanding performance and high speed. Among them, Multi-level Wavelet-CNN (MWCNN) [10], one of the best-performing algorithms, applies discrete wavelet transform (DWT) and inverse wavelet transform (IWT) operations within the network to enlarge the receptive field. Compensating for the somewhat blurry results that deep learning methods tend to produce, it recovers textural detail and sharp structures well. In addition, most existing denoising methods are supervised and require both a noisy image and a clean image, but Noise2Noise [11] performed restoration with only noisy images and no clean data, suggesting a new paradigm for denoising tasks. Research on video denoising based on the Recurrent Neural Network (RNN) algorithm is also actively underway [13].
However, most denoising algorithms aim to remove only AWGN, one type of noise, and perform poorly when applied to the complex and varied noise generated in software and hardware across a wide range of multimedia applications. Research is also being conducted on a scene-based non-uniform correction (SBNUC) method [1] to remove the line noise that is often observed in HS systems. Several CNN-based methods have also been studied to solve these problems [14], [15], [16]. Among them, the two-stream wavelet enhanced U-net (TSWEU) [16] analyzed noise with various line patterns and has shown notable performance.
Image enhancement, another field of image restoration, corrects pixel values when undesirable brightness is obtained in an image because of errors or malfunctions related to lighting, exposure time, aperture value, the sensor, and so on. For example, in Fig. 2 (c), the defect is not correctly found because of abnormal lighting, and Fig. 3 (a) shows that detection is not performed properly in the dark image. Fig. 3 (b) shows successful detection by YOLOv3-tiny [17] after restoration with the method proposed below. One of the most common and popular methods for resolving these issues is general histogram equalization (GHE). However, for images with dynamic brightness, GHE shows poor reconstruction performance when the restored image is compared with the ground truth. Engineers and researchers have attempted to solve this problem using vision-based algorithms such as contrast-limited adaptive histogram equalization (CLAHE) as well as machine learning-based algorithms (SVD-DWT [18], AGCWD [19], CegaHe [20], etc.). Restoration using deep learning with the generative adversarial network (GAN) algorithm [21], [22] is also under study. In the future, this technique may be applied to technologies such as high dynamic range (HDR).
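As a point of reference for the GHE baseline discussed above, global histogram equalization remaps each gray level through the normalized cumulative histogram of the image. The following is a minimal pure-Python sketch for 8-bit grayscale images stored as nested lists; the function name `equalize_histogram` and the toy image are illustrative assumptions and do not come from the paper.

```python
def equalize_histogram(image, levels=256):
    """Global histogram equalization of a grayscale image (list of rows).

    Illustrative sketch only; `equalize_histogram` is a hypothetical helper.
    """
    pixels = [v for row in image for v in row]
    n = len(pixels)
    if n == 0:
        return [list(row) for row in image]

    # Histogram and cumulative distribution of the gray levels.
    hist = [0] * levels
    for v in pixels:
        hist[v] += 1
    cdf, running = [], 0
    for count in hist:
        running += count
        cdf.append(running)

    cdf_min = next(c for c in cdf if c > 0)
    if n == cdf_min:
        # Constant image: there is no contrast to stretch.
        return [list(row) for row in image]

    # Look-up table that makes the output CDF approximately linear.
    scale = levels - 1
    lut = [max(0, round((c - cdf_min) * scale / (n - cdf_min))) for c in cdf]
    return [[lut[v] for v in row] for row in image]


# A dark toy image: all values sit in the lowest part of the 0..255 range.
dark = [[10, 10, 20], [30, 40, 40]]
equalized = equalize_histogram(dark)
print(equalized)
```

Because the mapping is computed from a single global histogram, every region of the image is stretched in the same way, which is consistent with the weakness noted above for images whose brightness varies locally or over time.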
Yet another restoration task is the improvement of old or degraded images and videos through technologies such as single-image super resolution, JPEG artifact removal, deblurring, and defocus correction [23]-[26], and research on this task is actively underway. Image restoration tasks such as image denoising and enhancement are essential for viewing outdated or deteriorated images or videos, as well as for other applications. Reflecting this, ETH Zurich's computer vision laboratory in Switzerland has been leading the field of image restoration by holding the NTIRE (New Trends in Image Restoration and Enhancement) workshop challenge every year since 2016.
In the restoration task proposed in this paper, practicality is considered so that the method can be used in various applications. Unlike the conventional approach of removing only AWGN (a single noise type), we propose a method that simultaneously corrects multi-type noise based on Global Residual Learning (GRL) and Local Residual Learning (LRL). GRL is a process of subtracting noise features and inputs, which are the final
Since the existing single-type noise image denoising task generates noise based on the sigma level in the input image, it is of limited use for the various types of noise generated in application systems. We focus on multi-type noise image denoising, which removes complex noise, by generating AWGN together with line noise, which is one of the noise types that appear frequently in IR and HS systems.
However, it is difficult to obtain a real dataset of the non-uniform noise observed in IR and HS systems. In particular, IR systems provide functions such as fixed pattern noise (FPN) correction for removing uniform noise, but it is difficult to remove non-uniform noise completely [27]. It is therefore not easy to build a dataset, because a CNN requires the input (noise image) and the label (GT image) at the same time. To overcome these limitations, this study analyzes the non-uniformity of real IR systems and generates noise at a level equivalent to real IR noise.
The dataset used for training and validation is DIV2K [28] (training images: 800, validation images: 100). Using this dataset, noise components are generated to be as similar as possible to real noise, as shown in Fig. 5.
First, white and black layers (WL, BL) of the same size as the input were randomly generated for the prepared dataset from the following quantities: NL, the number of lines; Int, the intensity of the generated lines; and rand(*), a function that randomly generates * values from 0 to 1. Next, a Gaussian filter, one of the smoothing operations, was applied to create a line noise layer that looked as similar as possible to real line noise:
$$G(p_i, p_j) = \frac{1}{2\pi\sigma^2} \exp\left(-\frac{p_i^2 + p_j^2}{2\sigma^2}\right) \tag{2}$$
where σ is the scale parameter that determines the amount of blurring, and p_i, p_j are the positions of each pixel along the width and height.
Adding each dynamic noise layer to the input image produced a GT + line noise image as shown in Fig. 4 (c). Finally, by adding AWGN (σ = 25), a dataset with multi-type noise was created as shown in Fig. 4 (d).
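The generation steps above can be sketched as follows. This is a simplified illustration under stated assumptions rather than the authors' generator: each line layer is reduced to a per-column offset profile, the smoothing is a 1D Gaussian across columns instead of a 2D filter, and the helper names and the default values for the number of lines, the line intensity, and the blur width are hypothetical. Only the AWGN level (σ = 25) is taken from the text.

```python
import math
import random


def gaussian_kernel_1d(sigma, radius):
    """Normalized 1D Gaussian weights for offsets -radius..radius."""
    weights = [math.exp(-(d * d) / (2.0 * sigma * sigma))
               for d in range(-radius, radius + 1)]
    total = sum(weights)
    return [w / total for w in weights]


def make_line_layer(width, num_lines, intensity, rng):
    """Column profile in which `num_lines` random columns get a random strength."""
    layer = [0.0] * width
    for _ in range(num_lines):
        col = rng.randrange(width)
        layer[col] += intensity * rng.random()  # Int * rand(), rand in [0, 1)
    return layer


def blur_columns(layer, sigma=1.0, radius=2):
    """Smooth the column profile so that the lines have soft edges."""
    kernel = gaussian_kernel_1d(sigma, radius)
    width = len(layer)
    blurred = []
    for c in range(width):
        acc = 0.0
        for k, w in enumerate(kernel):
            j = min(max(c + k - radius, 0), width - 1)  # replicate the borders
            acc += w * layer[j]
        blurred.append(acc)
    return blurred


def degrade(image, num_lines=8, intensity=40.0, awgn_sigma=25.0, seed=0):
    """Add white and black vertical line noise, then AWGN, and clip to [0, 255]."""
    rng = random.Random(seed)
    width = len(image[0])
    white = blur_columns(make_line_layer(width, num_lines, intensity, rng))
    black = blur_columns(make_line_layer(width, num_lines, intensity, rng))
    noisy = []
    for row in image:
        noisy.append([
            min(255.0, max(0.0, v + white[c] - black[c] + rng.gauss(0.0, awgn_sigma)))
            for c, v in enumerate(row)
        ])
    return noisy


clean = [[128.0] * 32 for _ in range(16)]  # flat gray stand-in for a GT patch
noisy = degrade(clean)
```

Calling `degrade` with a different `seed` for every frame would give noise that changes from frame to frame, in the spirit of the dynamic noise described in this section.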
Then, in the augmentation process, the image was divided into patches, and rotation (180°) and flips (up/down, left/right) were randomly applied together with a scale change (×0.6 to 1.2). The key points of this process were to emphasize the scale change, reflecting noise characteristics that appear randomly across the whole image, and to exclude the 90° and 270° rotations, because the purpose is to remove vertical line noise. Through this process, a dataset for network training was generated with augmented noise characteristics equivalent to those of real systems.
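The reason for excluding the 90° and 270° rotations can be illustrated with a small sketch. The helper names below are hypothetical, the scale change is omitted for brevity, and the 0.5 probabilities are an assumption; the point is only that flips and a 180° rotation keep vertical line structure vertical.

```python
import random


def flip_left_right(patch):
    return [row[::-1] for row in patch]


def flip_up_down(patch):
    return [list(row) for row in patch[::-1]]


def rotate_180(patch):
    return [row[::-1] for row in patch[::-1]]


def augment(patch, rng):
    """Randomly apply flips and a 180-degree rotation (never 90 or 270)."""
    out = [list(row) for row in patch]
    if rng.random() < 0.5:
        out = flip_left_right(out)
    if rng.random() < 0.5:
        out = flip_up_down(out)
    if rng.random() < 0.5:
        out = rotate_180(out)
    return out


def has_vertical_lines_only(patch):
    """True when every row is identical, i.e. the pattern varies only by column."""
    return all(row == patch[0] for row in patch)


# A patch that contains only vertical line structure (columns 1 and 3 are bright).
striped = [[0, 9, 0, 5] for _ in range(3)]
rng = random.Random(0)
augmented = [augment(striped, rng) for _ in range(20)]
print(all(has_vertical_lines_only(p) for p in augmented))  # True
```

A 90° rotation of the same patch would turn the columns into rows, so the network would be trained on horizontal lines that the target sensors do not produce.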

B. NETWORK ARCHITECTURE DESIGN
This section discusses the overall network architecture design. We explain the motivation behind the design and the reasons for choosing this particular architecture.

1) NETWORK ARCHITECTURE
The relationship between the GT image (x), the noise image (y), and the noise component (n) can generally be defined as x = y − n. Most algorithms predict the x component directly from y. In our network, however, the clean image (x) is restored by predicting the noise feature (n) from the residual learning perspective. Residual learning was originally introduced because deeper networks are likely to run into vanishing or exploding gradients and cannot be trained well, for example showing increased training error [29]. However, following the method used in DnCNN [8], which approaches residual learning from a different perspective, we applied a modified residual learning method throughout the network. This residual learning method removes the clean image components from the noisy image and then calculates the residual features (n), so that the final prediction is derived as
$$\hat{x} = y - R(y) \tag{3}$$
where y is the input (noise image), R(y) is the residual feature produced by the network, and $\hat{x}$ is the final prediction image (clean image). The biggest difference from previous research is the application of Local Residual Learning (LRL) and Global Residual Learning (GRL) in the U-shaped [30] network structure. Residual operations are applied to both global and local parts to calculate the residual features effectively. In this way, the noise image is ultimately restored to a clean image (prediction image).
First, in this network, LRL is a residual operation applied to each layer immediately after the max-pooling or interpolation-based resize up-sampling. This operation is added to correspond to the scale of the input before obtaining the final feature. In fact, pooling operations tend to lose information, so researchers and engineers prefer not to use them. However, we added this process to deal with noise of various distributions at various scales. As will be mentioned again in the next section, checkerboard artifacts that occur when using the transposed convolution operation degrade the image restoration. Therefore, to prevent the artifact, we use an interpolation-based up-sampling operation instead. As shown in Fig. 6 (b), LRL is a combination of convolution layer + batch normalization [31] + ReLU [32], and four sets are connected to each group. Finally, the residual operation is performed through a subtract operation in groups 1 and 4.
GRL is applied at the network output: the final clean image is obtained by calculating the difference between the final residual image and the input. These two distinct forms of residual learning (GRL, LRL) are the key point of this network: since noise is additional information not wanted by users, the unwanted features that emerge from the network can be removed by difference calculations. The overall network is designed as shown in Fig. 6 (a).
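The global residual step can be sketched in a few lines. In this illustration the trained network R(y) is replaced by a hypothetical stand-in (`oracle_residual`) that simply returns the known noise of a toy example, so the sketch shows only the subtraction at the output, not the U-shaped network or the LRL blocks.

```python
def restore(noisy, predict_residual):
    """Global residual step: subtract the predicted noise from the input."""
    residual = predict_residual(noisy)
    return [[y - r for y, r in zip(y_row, r_row)]
            for y_row, r_row in zip(noisy, residual)]


# Toy data: a clean image plus a known noise pattern, so that y = x + n.
clean = [[10.0, 20.0], [30.0, 40.0]]
noise = [[1.0, -2.0], [0.5, 3.0]]
noisy = [[x + n for x, n in zip(x_row, n_row)]
         for x_row, n_row in zip(clean, noise)]


def oracle_residual(_noisy):
    """Stand-in for the trained network R(y); here it returns the true noise."""
    return noise


restored = restore(noisy, oracle_residual)
print(restored)
```

With a perfect residual prediction the subtraction returns the clean image exactly; in practice the quality of the restored image depends entirely on how well the network estimates the residual.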
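To validate and analyze the effectiveness of residual learning in the network, we conducted an ablation study with both quantitative and qualitative evaluations. The measures used for quantitative evaluation are peak signal-to-noise ratio (PSNR) and structural similarity (SSIM) [33]:
$$\mathrm{PSNR} = 10 \log_{10}\left(\frac{I_{max}^2}{\mathrm{MSE}}\right), \qquad \mathrm{MSE} = \frac{1}{K}\sum_{p=1}^{K}\left(I_x(p) - I_{\hat{x}}(p)\right)^2 \tag{4}$$
where $I_x$ is the GT image, $I_{\hat{x}}$ is the predicted image, and K is the number of pixels. $I_{max}$ is the maximum dynamic range of the input and output images (255 for an 8-bit image). The MSE is the mean square error of the output image in comparison with the original image.
$$\mathrm{SSIM}(x, \hat{x}) = \frac{(2\mu_x\mu_{\hat{x}} + c_1)(2\sigma_{x\hat{x}} + c_2)}{(\mu_x^2 + \mu_{\hat{x}}^2 + c_1)(\sigma_x^2 + \sigma_{\hat{x}}^2 + c_2)} \tag{5}$$
where $\mu_x$ and $\mu_{\hat{x}}$ are the respective average values of the input and output images, and $\sigma_x^2$ and $\sigma_{\hat{x}}^2$ are their respective variances. Following [33], $\sigma_{x\hat{x}}$ is the covariance of the two images, and $c_1$, $c_2$ are small constants that stabilize the division.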
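For readers who want to reproduce the two measures, a minimal pure-Python sketch follows. It assumes 8-bit grayscale images stored as nested lists. The SSIM here is computed once from whole-image statistics, which is a simplification: the formulation in [33] averages SSIM over local windows. The constants K1 = 0.01 and K2 = 0.03 are the defaults from [33]; the function names are illustrative.

```python
import math


def mse(reference, predicted):
    ref = [v for row in reference for v in row]
    pred = [v for row in predicted for v in row]
    return sum((a - b) ** 2 for a, b in zip(ref, pred)) / len(ref)


def psnr(reference, predicted, max_value=255.0):
    error = mse(reference, predicted)
    if error == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / error)


def global_ssim(reference, predicted, max_value=255.0):
    """SSIM from whole-image statistics (no sliding window; a simplification)."""
    x = [v for row in reference for v in row]
    y = [v for row in predicted for v in row]
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    c1, c2 = (0.01 * max_value) ** 2, (0.03 * max_value) ** 2
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))


gt = [[50.0, 100.0], [150.0, 200.0]]
shifted = [[v + 1.0 for v in row] for row in gt]  # every pixel off by one level
print(psnr(gt, shifted), global_ssim(gt, shifted))
```

An error of one gray level on every pixel gives MSE = 1, so the PSNR equals 20·log10(255), roughly 48.13 dB, which is a convenient sanity check for an implementation.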
As shown in Fig. 7, in the validation process during training, applying both LRL and GRL improved performance compared with the network without residual learning. In inference, the quantitative evaluation in Table 1 also showed a performance difference. In addition, a qualitative evaluation was performed based on the restored images: most of the participants in the experiment responded that the results were good when LRL and GRL were used together.

2) INTERPOLATION-BASED UP-SAMPLING
Both the transposed convolution (deconvolution) operation and the up-sampling operation are mainly used to enlarge an image that has been reduced by operations such as pooling. The difference between the two (based on the Keras API; [Online]. Available: https://keras.io) is that up-sampling is an interpolation-based image resize, while transposed convolution enlarges the image using learned filters. As mentioned in Distill's blog [34], when transposed convolution is used, uneven overlap may occur depending on the filter size and stride, causing checkerboard artifacts.
Similarly, artifacts such as those in Fig. 8 were often observed in this study when transposed convolution was used. Therefore, we modified the network to use interpolation-based up-sampling in order to reduce the effects of incorrect restoration. The number of hyper-parameters in the network is slightly increased, but this does not significantly affect performance.
FIGURE 8. Comparison of interpolation-based resizing with transposed convolution. The artifacts seen with transposed convolution (a, b) disappear when up-sampling is used (c, d).
VOLUME 8, 2020
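The uneven overlap behind checkerboard artifacts can be made concrete by counting, in one dimension, how many input positions contribute to each output position of a transposed convolution. The sketch below is illustrative only (no padding, hypothetical function names) and does not use Keras.

```python
def transposed_conv_overlap(input_len, kernel, stride):
    """Count how many input positions write to each output position (1D)."""
    output_len = (input_len - 1) * stride + kernel
    counts = [0] * output_len
    for i in range(input_len):
        for k in range(kernel):
            counts[i * stride + k] += 1
    return counts


def nearest_upsample_overlap(input_len, factor):
    """With nearest-neighbor resizing each output copies exactly one input."""
    return [1] * (input_len * factor)


print(transposed_conv_overlap(4, kernel=3, stride=2))  # [1, 1, 2, 1, 2, 1, 2, 1, 1]
print(transposed_conv_overlap(4, kernel=4, stride=2))  # [1, 1, 2, 2, 2, 2, 2, 2, 1, 1]
print(nearest_upsample_overlap(4, factor=2))           # [1, 1, 1, 1, 1, 1, 1, 1]
```

When the kernel size is not divisible by the stride (3 and 2 here), the interior counts alternate between 1 and 2, which in two dimensions appears as a checkerboard pattern. Interpolation-based resizing followed by an ordinary convolution avoids this alternation by construction.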

C. NETWORK TRAINING
For training, we used the ADAM [35] optimizer (learning rate = 0.001, β_1 = 0.9, β_2 = 0.999, epsilon = None) with randomly shuffled mini-batches of the maximum size that the graphics processing unit could accommodate (batch size: 128). To select the loss function, we referred to [36], which analyzed the field of image restoration in detail, and [37], which analyzed which loss function should be used in image restoration tasks. Through various experiments, we selected a loss function (L) based on the mean absolute error (MAE):
$$L_{MAE}(h) = \frac{1}{N}\sum_{i=1}^{N}\left\| x_i - \hat{x}_i \right\|_1, \qquad \hat{x}_i = y_i - R(y_i; h) \tag{6}$$
where h denotes the network parameters learned by the proposed network, and N is the number of clean-noise image pairs (x, y) (patches) used during training. This choice was made by comparison with two control groups. The first control was a mean square error (MSE)-based loss function:
$$L_{MSE}(h) = \frac{1}{N}\sum_{i=1}^{N}\left\| x_i - \hat{x}_i \right\|_2^2 \tag{7}$$
The second control was a loss function combining MAE and SSIM [33]. The SSIM term can be redefined as in (8), reflecting the characteristics of the CNN:
$$L_{SSIM}(h) = \frac{1}{N}\sum_{i=1}^{N}\left(1 - \mathrm{SSIM}(x_i, \hat{x}_i)\right) \tag{8}$$
Finally, this control is combined as
$$L_{MAE+SSIM}(h) = \alpha L_{MAE}(h) + \beta L_{SSIM}(h) \tag{9}$$
where α, β are hyperparameters that set the ratio between the loss terms. As shown in Fig. 9 and Table 2, the control groups showed good performance during training with their respective loss functions. However, in the evaluation, the MAE-based loss gave better quantitative and qualitative results. We confirmed that the control groups also performed well throughout the experiment. In fact, we found it best to combine the necessary loss functions depending on the task.
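The practical difference between the MAE loss and the MSE control can be seen on a toy example. The sketch below works on flat lists of pixel values and assumes the common combined form α·MAE + β·(1 − SSIM) for the second control; the weights 0.5 and 0.5 are placeholders, since the values of α and β are not stated in the text.

```python
def mae_loss(target, predicted):
    return sum(abs(t - p) for t, p in zip(target, predicted)) / len(target)


def mse_loss(target, predicted):
    return sum((t - p) ** 2 for t, p in zip(target, predicted)) / len(target)


def combined_loss(mae_value, ssim_value, alpha=0.5, beta=0.5):
    """alpha * MAE + beta * (1 - SSIM); the weights here are placeholders."""
    return alpha * mae_value + beta * (1.0 - ssim_value)


target = [0.0, 0.0, 0.0, 0.0]
spread_error = [1.0, 1.0, 1.0, 1.0]    # a small error on every pixel
single_outlier = [0.0, 0.0, 0.0, 4.0]  # one large error on a single pixel

print(mae_loss(target, spread_error), mae_loss(target, single_outlier))  # 1.0 1.0
print(mse_loss(target, spread_error), mse_loss(target, single_outlier))  # 1.0 4.0
```

MAE treats the two error patterns as equally bad, while MSE penalizes the single large error four times as heavily. This sensitivity to large deviations is one commonly cited reason why MSE-trained restoration networks tend toward smooth, averaged outputs.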
In addition, information may be lost during learning because of the relatively great depth and relatively large receptive field (192 × 192) of this network. Therefore, we inspected the activation images produced after the convolution layer of each layer. From this, we could see that the network was working correctly, as shown in Fig. 10. The software used in this paper was Keras 2.2.4 (TensorFlow 1.13 backend) with CUDA 10 and cuDNN 7.5. The server used for network training had an Intel(R) Core(TM) i9-7900X 3.3 GHz CPU, 64 GB of RAM, and a two-way NVIDIA Titan V system. The computer used for evaluation had an Intel Core i9-9900KF 3.6 GHz CPU, 32 GB of RAM, and an NVIDIA GeForce RTX 2080 Ti. With these hardware specifications, we could quickly perform training with large batches, inference, and evaluation.

IV. EXPERIMENTS
A. DATASET FOR EVALUATION
For the evaluation of the multi-type noise image denoising task, we used 100 randomly selected videos from pixabay.com, a website that provides copyright-free photos and videos. These videos were converted into an evaluation dataset using the data generation process of section III.A. They were edited to 30 fps and a maximum length of 10 seconds to limit the variation in the number of frames. Video is used to evaluate the multi-type noise image denoising task because it is composed of frames, which makes it possible to evaluate how effectively dynamic noise that differs in each frame is removed. In addition, for the evaluation of the image enhancement task, 155 images randomly selected from pixabay.com were darkened, and evaluation was performed on a total of 2635 images. The degraded dataset for evaluation was created dynamically for each frame.

B. QUANTITATIVE MEASUREMENT
Machine learning-based algorithms such as BM3D [3] and WNNM [4] did not need to be trained to evaluate the multi-type noise image denoising task, but the deep learning-based algorithms needed to be trained anew. We implemented and trained each algorithm according to the paper in which it was proposed. Using the same dataset (DIV2K [28] + random multi-type noise in each frame), training and validation were conducted in the same way as for the proposed method. After applying each algorithm to the evaluation dataset, we evaluated the results using six quantitative indicators: MAE, frame-to-frame variation, root-mean-square error (RMSE), PSNR, SSIM [33], and multi-scale SSIM (MS-SSIM) [38], as respectively quantified in (10), (11), (12), (13), (4), (5), and (14). MAE is defined as
$$\mathrm{MAE} = \frac{1}{K}\sum_{p=1}^{K}\left| x(p) - \hat{x}(p) \right| \tag{10}$$
where x is the GT image, $\hat{x}$ is the clean image (prediction image or restored image), p indexes the pixel values, and K is the number of pixels.
For the frame-to-frame variation, AE is the absolute error between pixels and f_n is the n-th frame; the variation is calculated as the difference between the (n − 1)-th frame and the n-th frame.
RMSE is the square root of the mean square error (MSE), $\mathrm{RMSE} = \sqrt{\mathrm{MSE}}$.
MS-SSIM [38] is based on (5). It calculates the SSIM score at multiple scales (M) and then weights the scores to obtain a final score. Here l is luminance, c is contrast, and s is structure; these are the basic building blocks of SSIM [33]. For convenience, we used the default weighting values A = B = 1 for j = {1, . . . , M}. These formulas were used to evaluate how well the noise was removed in successive frames and how completely the image was restored in comparison with the GT image. The supplementary data include some videos and graphs with quantitative indicator results, presented frame by frame. Table 3 lists the average values of the quantitative indicators for 103 videos. In terms of numerical values, the proposed method showed worse RMSE, PSNR, and SSIM [33] values than MWCNN [10] and Noise2Noise [11], while it performed well in MAE, variation, and MS-SSIM. In some images, the features in the video were slightly blurry, but whether viewed as video footage or frame by frame, AWGN and line noise were clearly removed, demonstrating excellent image restoration quality. As shown in Fig. 11, three algorithms showed high performance: MWCNN [10], Noise2Noise [11], and the proposed method.
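The per-frame indicators can be sketched as follows for frames stored as nested lists. The exact form of the variation measure is not recoverable from the text, so the sketch assumes one plausible reading: the absolute change in per-frame MAE between consecutive frames. The function names are illustrative.

```python
import math


def _pixel_pairs(reference, predicted):
    return [(a, b) for r_row, p_row in zip(reference, predicted)
            for a, b in zip(r_row, p_row)]


def mae(reference, predicted):
    pairs = _pixel_pairs(reference, predicted)
    return sum(abs(a - b) for a, b in pairs) / len(pairs)


def rmse(reference, predicted):
    pairs = _pixel_pairs(reference, predicted)
    return math.sqrt(sum((a - b) ** 2 for a, b in pairs) / len(pairs))


def frame_variation(reference_frames, predicted_frames):
    """Assumed definition: |MAE(f_n) - MAE(f_(n-1))| for consecutive frames."""
    errors = [mae(r, p) for r, p in zip(reference_frames, predicted_frames)]
    return [abs(curr - prev) for prev, curr in zip(errors, errors[1:])]


# Three GT frames and three restored frames whose error jumps in the last frame.
gt_frame = [[100.0, 100.0], [100.0, 100.0]]
gt_frames = [gt_frame, gt_frame, gt_frame]
restored_frames = [[[v + offset for v in row] for row in gt_frame]
                   for offset in (1.0, 1.0, 3.0)]

print(frame_variation(gt_frames, restored_frames))  # [0.0, 2.0]
```

Under this reading, a method can have a good average error and still score poorly on variation if its restoration quality fluctuates between frames, which is the behavior the variation indicator is meant to expose.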
We observed that networks with relatively large receptive fields showed excellent image restoration performance. When algorithms other than the proposed method were applied, distortions appeared in the videos and line noise was not properly removed, remaining faintly visible. Also, as shown in Fig. 12, the remaining non-uniform noise is clearly confirmed when histogram equalization is applied to the restored image.
As mentioned in the SRGAN [26] paper, high PSNR and SSIM values do not necessarily mean that the video is visually appealing or that the noise is removed well; rather, they reflect the characteristics of deep learning optimization based on the loss function. As (4) indicates, the use of an MSE loss would produce high values in the evaluation metric (PSNR), but at the potential cost of blurry output or improper noise removal. Therefore, in accordance with [39], which analyzes the distortion measure and the human perceptual measure, we also adopted the mean opinion score (MOS) method as an additional indicator. Table 4 provides the average quantitative indices of the image enhancement task based on 2635 images (dark degraded images). The indicators used for quantitative evaluation were MAE, PSNR, SSIM [33], MS-SSIM [38], and additionally the Natural Image Quality Evaluator (NIQE) [40], which does not require a reference (GT image) to evaluate image quality.
FIGURE 11. Quantitative evaluation of the multi-type noise denoising task by various algorithms (PSNR, SSIM, MAE). In the proposed method, multi-type noises are effectively removed. However, from a numerical point of view, it can be observed that there is a difference in image restoration quality.
As shown in Fig. 13, three algorithms performed properly: AGCWD [19], CegaHe [20], and the proposed method. Numerically in particular, the proposed method produced the restoration closest to the GT images by improving dark saturation. However, among the other well-performing algorithms (AGCWD [19], CegaHe [20]), there were cases in which the quantitative evaluation was poor even though the quality of the restored image was good. This suggests that evaluation using qualitative indicators is necessary.

C. QUALITATIVE MEASUREMENT
As mentioned above, we conducted the MOS test because there appeared to be a limit to the usefulness of quantitative indicators in the evaluation of algorithm performance. For MOS testing, we asked human evaluators to assign to each reconstructed image a point value from 1 (the worst quality) to 5 (excellent quality). Table 5 provides a qualitative index of the MOS for the multi-type noise image denoising task. We showed 10 selected videos to 21 different study subjects. For each video, we asked in a questionnaire survey whether the noise was removed well enough that there was no inconvenience in viewing the images, i.e., whether the videos were smooth and natural. Most of the participants who conducted the evaluation presented the following common opinions.
1) Quality was considered worse when noise remained in successive scenes or when the transition between frames was unnatural.
2) Quality was considered better when the transition between frames was natural and clear.
3) The usefulness of quantitative algorithm evaluation metrics was considered quite limited (based on subjects' comparisons of their qualitative scores with quantitative evaluation metrics provided to them afterward).
MWCNN [10], Noise2Noise [11], and the proposed method received good qualitative evaluations. According to the opinions of the evaluators, the proposed method not only removed multi-type noise effectively, but also minimized the variation among frames, resulting in a natural image reconstruction with uniform quality. In the denoising task, the restoration of features and backgrounds needed to be harmonious and natural. Although numerical evaluation is important for quantitative measurement, the qualitative evaluation of the proposed method demonstrates that the perceptual measure is just as important [39]. Table 6 shows the MOS of the image enhancement task. We showed 35 selected images to 21 different study subjects. For each image, we asked in a questionnaire survey how closely it was restored to the GT, i.e., how natural the image was. Through this qualitative measurement, three methods received good evaluations: AGCWD [19], CegaHe [20], and the proposed method. In the image enhancement task, the restoration of dynamic brightness needed to be harmonious and natural. Among these three methods, the proposed method was evaluated to give the most natural and harmonious restoration. Table 5 shows that the proposed method also provided the best performance for this task in terms of quantitative metrics.

D. TOTAL MEASUREMENT
In this section, the quantitative and qualitative evaluations are combined and assessed together. Fig. 14 uses an integrated indicator to show that the performance of the proposed method for the multi-type noise image denoising task is excellent. The performance of the proposed method is also good for the image enhancement task.
Our study once again emphasizes that both quantitative and qualitative indicators should be used to determine how well image restoration has been performed.

E. REAL EXPERIMENT
In this section, actual datasets are created and evaluated to demonstrate the versatility of the proposed method.
In IR systems, various types of noise caused by scene changes or heat generation can be observed frequently. For this reason, the non-uniformity correction (NUC) method is essential for obtaining high-quality images. The NUC method addresses common multi-type noise (AWGN, line noise) and performs temperature compensation.
However, as mentioned by FLIR [41], it takes about 20 minutes to warm up for accurate temperature measurements. During preheating, the NUC method continues to work for 20 minutes, and no image can be obtained during operation.
Because of these limitations, the proposed method is used to replace the NUC method, and the result is shown in Fig. 15.
As shown in Fig. 15, it can be observed that the NUC method is not always perfect. Because of the characteristics of the noise, it is not easy to obtain an image (GT image) from which the noise is perfectly and harmoniously removed. The proposed method not only gives results equivalent to images processed with the NUC method, but also maintains a consistently high level of restoration quality without deviation between images, and gives better results in some specific scenes.

V. CONCLUSION
This paper presented RFSUNET, a residual learning-based architecture for image restoration that consists of GRL and LRL. For the multi-type noise image denoising task, actual multi-type noise in TIR and HS images was analyzed and used to generate equivalent evaluation datasets. The performance of RFSUNET in restoration tasks was evaluated through quantitative and qualitative means. In particular, the proposed method was found to remove various types of noise effectively in comparison with conventional algorithms and networks. In addition, the results show low variation and uniform quality in both dynamic and static images. We also performed comparative experiments on how effectively the NUC method and the proposed method remove the multi-type noise that is frequently observed in real IR systems. These experiments further confirmed the effectiveness of the proposed method.
However, most CNN-based algorithms, including the proposed method, produce slightly blurry results due to the deep learning structure. This flaw leads to admittedly imperfect restoration results compared to the GT image.
In future work, we will create a network that sharpens and deblurs the somewhat blurry restored image once again. With this two-stage network, we aim to design a CNN whose restoration is as close as possible to the GT image.