Mathematical Analysis of DCN-Based Super-Resolution

Although DCN-based super-resolution (DCN-SR) techniques have shown impressive performance, their working mechanism has not been completely understood and DCN-SR methods still produce some artefacts. In this paper, we analyze the working mechanisms of DCN-SR methods. We derive mathematical formulations of the DCN-SR methods and provide some experimental analyses, which show that the effective receptive fields of the DCN-SR methods are considerably smaller than the theoretical receptive fields. Based on the mathematical formulations, experiments were performed. The results indicate that current DCN-SR methods may have some fundamental problems and that new types of DCN structures are needed for reliable super-resolution performance.


I. INTRODUCTION
Recently, deep convolutional networks (DCN) have been successfully applied in many signal processing areas [1]–[5], such as super-resolution [11], noise removal [13], and de-mosaicking [12]. In particular, a number of researchers have studied DCN-based super-resolution (DCN-SR) methods, which have provided noticeably better performance than traditional super-resolution and interpolation methods [14]–[30]. However, the working mechanism of DCN-SR methods has not always been well understood. Some authors have studied the working models of DCN methods [6]–[10]. In [6], the authors provided some insight into the intermediate layers with visualization techniques. Also, it was shown that first-layer features may not be specific to a particular task and can be transferable to other tasks [7]. Visualization tools were proposed for DCNs [8], which provided some insight into the DCN working mechanism. In [9], the author presented some analysis results of DCN operations. In [10], the authors found some interesting properties of neural networks and showed that imperceptible perturbations may produce errors in neural networks. In [31], the authors investigated the mathematical model of deep learning frameworks for inverse problems.
The associate editor coordinating the review of this manuscript and approving it for publication was Yong Yang.

II. MATHEMATICAL FORMULATIONS OF DCN-SR METHODS
In general, DCN-SR methods have a number of convolution layers, ReLU layers, etc. They may further include other types of layers (e.g., channel attention layers, sigmoid functions, etc. [16]). In most DCN-SR methods, networks are trained using a number of image patches (K×K images). These patches can be expressed as vectors (N×1) with N = K². For example, the first convolution layer followed by a ReLU layer can be expressed by the following matrix operation:

$$X^{1}_{64N\times1} = \mathrm{ReLU}\!\left(A^{0}_{64N\times N}\,X^{0}_{N\times1} + b^{0}_{64N\times1}\right) \qquad (1)$$

where the superscript represents the layer index. $A^{0}_{64N\times N}$ is a filter matrix that represents the convolution operations with 64 filters, whose $k$th block $A^{0,k}_{N\times N}$ is a single filter matrix (N×N) of the 0th layer for the $k$th filter, and $b^{0}_{64N\times1}$ is a bias vector. $X^{0}_{N\times1}$ is an input patch and $X^{1}_{64N\times1}$ is the output of the first layer (convolution layer + ReLU layer).
In (1), if an element of the vector is negative, the ReLU operator sets it to zero. Thus, we can express (1) as follows:

$$X^{1}_{64N\times1} = {}_{R}A^{0}_{64N\times N}\,X^{0}_{N\times1} + {}_{R}b^{0}_{64N\times1} \qquad (2)$$

where the left subscript ($R$) represents the ReLU operation.
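The masking view of (2) can be checked numerically: applying ReLU to $A^{0}X^{0}+b^{0}$ is identical to zeroing the rows of $A^{0}$ and the elements of $b^{0}$ whose pre-activations are negative. A minimal numpy sketch (the random matrices and toy sizes are illustrative only, not taken from any trained network):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 9                                  # a 3x3 patch flattened to N = K^2
A = rng.standard_normal((4 * N, N))    # toy "filter matrix" (4 filters)
b = rng.standard_normal(4 * N)         # toy bias vector
x = rng.standard_normal(N)             # toy input patch

# Direct ReLU of the affine map.
y_relu = np.maximum(A @ x + b, 0.0)

# Equivalent masked form: zero the rows/elements whose pre-activation
# is negative (the left-subscript-R matrices of (2)).
mask = (A @ x + b) > 0
A_R = A * mask[:, None]
b_R = b * mask
y_masked = A_R @ x + b_R

print(np.allclose(y_relu, y_masked))   # True
```

The mask depends on the input patch, which is what makes the resulting linear model "dynamic".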
In other words, a row of the matrix ${}_{R}A^{0}_{64N\times N}$ and the corresponding element of ${}_{R}b^{0}_{64N\times1}$ are set to zero whenever the pre-activation is negative. We refer to ${}_{R}A^{0}_{64N\times N}$ as a layer matrix, which applies convolution and ReLU operations. For the VDSR [17], there are 20 layers and the final output image can be expressed as follows:

$$X^{20}_{N\times1} = A^{19}_{N\times64N}\left({}_{R}A^{18}_{64N\times64N}\cdots\left({}_{R}A^{0}_{64N\times N}X^{0}_{N\times1} + {}_{R}b^{0}_{64N\times1}\right)\cdots + {}_{R}b^{18}_{64N\times1}\right) + b^{19}_{N\times1}.$$
Consider the center pixel of the output image (Fig. 1), which can be expressed as follows:

$$y = \phi^{center,19}_{1\times64N}\left({}_{R}A^{18}_{64N\times64N}\cdots\left({}_{R}A^{0}_{64N\times N}X^{0}_{N\times1} + {}_{R}b^{0}_{64N\times1}\right)\cdots + {}_{R}b^{18}_{64N\times1}\right) + b^{19}_{center}$$

where $\phi^{center,19}_{1\times64N}$ is the row of $A^{19}_{N\times64N}$ corresponding to the center pixel. Using (2), the center pixel of the output image is given by

$$y = W_{1\times N}\,X^{0}_{N\times1} + b^{total}_{1\times1} \qquad (3)$$

where

$$W_{1\times N} = \phi^{center,19}_{1\times64N}\;{}_{R}A^{18}_{64N\times64N}\cdots{}_{R}A^{0}_{64N\times N}.$$

In other words, an output pixel is obtained by taking the inner product between the weight mask ($W_{1\times N}$) and the input patch ($X_{N\times1}$) and adding the total bias ($b^{total}_{1\times1}$). Also, the total bias term ($b^{total}_{1\times1}$) can be expressed as

$$b^{total}_{1\times1} = \sum_{k}\sum_{j=1}^{n^{k}_{bias}} \alpha^{k}_{j}\,b^{k}_{j} = \widetilde{W}_{1\times N_b}\,B_{N_b\times1}$$

where $n^{k}_{bias}$ is the number of bias terms (number of filters) of the $k$th layer and $N_b$ is the total number of biases. $\widetilde{W}_{1\times N_b}$ is a bias weight vector and $B_{N_b\times1}$ is a bias vector. Thus, if a DCN-SR method consists of convolution and ReLU layers, the output can be expressed as follows:

$$y = W_{1\times N}\,X^{0}_{N\times1} + \widetilde{W}_{1\times N_b}\,B_{N_b\times1}. \qquad (4)$$

It is noted that $W_{1\times N}$ and $\widetilde{W}_{1\times N_b}$ are functions of the input patch. In other words, the two terms can be expressed as follows:

$$y = W_{1\times N}(X^{0})\,X^{0}_{N\times1} + \widetilde{W}_{1\times N_b}(X^{0})\,B_{N_b\times1}.$$

Some DCN-SR methods (e.g., EDSR [15]) use residual blocks (Fig. 2). Similarly, we can express the residual block as follows:

$$X^{l+1}_{64N\times1} = V^{l}_{64N\times64N}\left({}_{R}U^{l}_{64N\times64N}X^{l}_{64N\times1} + {}_{R}c^{l}_{64N\times1}\right) + d^{l}_{64N\times1} + X^{l}_{64N\times1} \qquad (5)$$

where $d^{l}_{64N\times1}$ is the bias vector of the second convolution operation ($V^{l}_{64N\times64N}$). In other words, any output pixel value can be expressed as a linear function. Thus, it can be shown that the output of the EDSR (enhanced deep super resolution network) can be expressed in a way similar to (3) and (4).
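The collapse of a convolution + ReLU network into a single inner product, as in (3), can be verified on a toy two-layer network: freezing the ReLU pattern and multiplying the masked layer matrices reproduces the output pixel exactly. A numpy sketch (random weights and toy dimensions, for illustration only):

```python
import numpy as np

rng = np.random.default_rng(1)
N = 9
# Toy 2-layer network: one ReLU layer followed by a linear output layer.
A0, b0 = rng.standard_normal((4 * N, N)), rng.standard_normal(4 * N)
A1, b1 = rng.standard_normal((N, 4 * N)), rng.standard_normal(N)
x = rng.standard_normal(N)

# Forward pass.
h = np.maximum(A0 @ x + b0, 0.0)
y = A1 @ h + b1
center = N // 2                        # index of the center pixel

# Dynamic linear model: freeze the ReLU pattern and collapse the
# layers into one weight mask W (1xN) and one total bias b_total.
mask = (A0 @ x + b0) > 0
phi = A1[center]                       # row for the center pixel
W = phi @ (A0 * mask[:, None])         # W_{1xN}
b_total = phi @ (b0 * mask) + b1[center]

print(np.isclose(y[center], W @ x + b_total))  # True
```

The same collapse applies layer by layer to deeper stacks such as the 20-layer VDSR, since each ReLU only contributes another input-dependent 0/1 mask.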
The RRDB (residual in residual dense block) [17] has a much larger number of layers, and its theoretical receptive field can cover the entire image. Nevertheless, the linear model of (4) is still valid for the RRDB since it only uses convolution and ReLU layers. The operation that combines several images into a higher-resolution image can also be expressed as a matrix operation and is still a linear operator. Consequently, the dynamic linear model of (4) can be used to model the output pixel values of the RRDB.
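The claim that the combining step is linear can be illustrated directly: the usual depth-to-space (pixel shuffle) rearrangement is pure reindexing, so it commutes with linear combinations. A small numpy sketch (the function name and toy sizes are ours, for illustration):

```python
import numpy as np

def pixel_shuffle(x, r):
    """Depth-to-space: (r*r*C, H, W) -> (C, H*r, W*r) by reindexing only."""
    c2, h, w = x.shape
    c = c2 // (r * r)
    x = x.reshape(c, r, r, h, w)
    x = x.transpose(0, 3, 1, 4, 2)     # (c, h, r, w, r)
    return x.reshape(c, h * r, w * r)

rng = np.random.default_rng(2)
a = rng.standard_normal((4, 3, 3))
b = rng.standard_normal((4, 3, 3))

# Linearity check: shuffle(2a + 3b) == 2*shuffle(a) + 3*shuffle(b).
lhs = pixel_shuffle(2 * a + 3 * b, 2)
rhs = 2 * pixel_shuffle(a, 2) + 3 * pixel_shuffle(b, 2)
print(np.allclose(lhs, rhs))  # True
```

Because the rearrangement is a permutation matrix in vector form, composing it with the masked layer matrices leaves the model of (4) intact.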
For the RCAN (residual channel attention networks), the linear model of (4) may not be valid due to the multiplication operations and sigmoid functions of the channel attention layer (Figs. 3-4) [18]. Assuming 64 filters are used, the residual channel attention block (RCAB) can be expressed as follows:

$$X^{l+1}_{64N\times1} = C^{l}_{64\times1} \otimes \left(V^{l}_{64N\times64N}\left({}_{R}U^{l}_{64N\times64N}X^{l}_{64N\times1} + {}_{R}c^{l}_{64N\times1}\right) + d^{l}_{64N\times1}\right) + X^{l}_{64N\times1} \qquad (6)$$

where $\otimes$ denotes the element-wise (channel-wise) product and $C^{l}_{64\times1}$ is the channel attention vector. The first term of (6) is not a linear operation since $C^{l}_{64\times1}$ is computed by a series of operations that include average operations and sigmoid functions. Thus, for the RCAN, the output pixel value needs to be expressed as follows:

$$y = f\!\left(X^{0}_{N\times1}, \{b^{k}_{j}\}\right) \qquad (7)$$

where $f$ is a non-linear function. Thus, the gradients of $C^{l}_{64\times1}$ with respect to the input image and the bias terms cannot be expressed as linear functions. However, due to the average operation, the contribution of each pixel would be much smaller in the non-linear function $f$. In other words, if the image size is $M \times N$, $C^{l}_{64\times1}$ always includes a factor $(1/MN)^{m}$ with $m > 0$. Since $MN$ is very large (on the order of $10^{5}\sim10^{6}$), the gradients of $f(X^{0}_{N\times1}, \{b^{k}_{j}\})$ with respect to $X^{0}_{N\times1}$ and $\{b^{k}_{j}\}$ tend to be very small compared to the gradients of $W_{1\times N}X^{0}_{N\times1}$ and $\widetilde{W}_{1\times N_b}B_{N_b\times1}$. Thus, even for the RCAN, we may approximate the output value using (4):

$$\tilde{y} = W_{1\times N}\,X^{0}_{N\times1} + \widetilde{W}_{1\times N_b}\,B_{N_b\times1}. \qquad (8)$$

Fig. 5 shows the difference histogram between the actual output ($y$ of (7)) and the linear approximation ($\tilde{y}$ of (8)). The maximum difference was $7.02\times10^{-4}$ for 8-bit images. Thus, the dynamic linear model of (4) is still valid for the RCAN.
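The $(1/MN)^{m}$ attenuation argument can be illustrated with a toy attention weight built from global average pooling and a sigmoid (our simplified stand-in for the $C^{l}$ computation, not the exact RCAN layer):

```python
import numpy as np

def attention_weight(x):
    # Toy channel attention: global average pooling followed by a sigmoid
    # (a simplified stand-in for C^l, assumed for illustration).
    return 1.0 / (1.0 + np.exp(-x.mean()))

rng = np.random.default_rng(3)
grads = {}
for mn in (16, 1024):                  # number of image pixels M*N
    x = rng.standard_normal(mn)
    eps = 1e-6                         # finite-difference step
    x_pert = x.copy()
    x_pert[0] += eps                   # perturb a single pixel
    grads[mn] = (attention_weight(x_pert) - attention_weight(x)) / eps

# The single-pixel gradient shrinks roughly like 1/(M*N).
print(grads)
```

With $MN$ in the $10^{5}\sim10^{6}$ range, this per-pixel influence on the attention weight becomes negligible next to the direct linear terms, which is the basis for the approximation in (8).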

III. EFFECTIVE RECEPTIVE FIELDS
As can be seen in (4), the output pixel value consists of two terms: the pixel contribution $W_{1\times N}X^{0}_{N\times1}$ and the bias contribution $\widetilde{W}_{1\times N_b}B_{N_b\times1}$. Fig. 6 shows the log ratios $\log(W_{1\times N}X^{0}_{N\times1} / \widetilde{W}_{1\times N_b}B_{N_b\times1})$, which were computed using 1600 training images that were randomly selected from the DIV2K database. It appears that $W_{1\times N}X^{0}_{N\times1}$ is about 30-280 times larger than $\widetilde{W}_{1\times N_b}B_{N_b\times1}$. Fig. 7 shows the contributions of the pixel and bias values when the RRDB DCN-SR methods were used for the baboon image. The other three methods (VDSR, EDSR, RCAN) have very small bias values, so their bias contributions are invisible. The bias contribution mainly appeared around the edges. Also, $W_{1\times N}X^{0}_{N\times1}$ and $\widetilde{W}_{1\times N_b}B_{N_b\times1}$ can have negative values, though their sum is rarely negative. The perceptual RRDB method showed considerably larger bias contributions than the other methods.
Since $W_{1\times N}$ and $\widetilde{W}_{1\times N_b}$ are functions of the input patch, the weight mask ($W_{1\times N}$) of (4) will be different for each output pixel and may have negative values, as can be seen in Fig. 7. Fig. 8 shows the average energy of $W_{1\times N}$ that was computed using the 1600 training images for the five methods (VDSR, EDSR, RCAN, RRDB (PSNR), RRDB (perceptual)). Table 1 shows the theoretical receptive fields of the four methods, even though the training patch size was 41×41 for the VDSR and 48×48 for the other methods. Since the RRDB and RCAN have a large number of layers, their theoretical receptive fields can cover the entire image, even though the networks were trained using 48×48 training patches. As can be seen in the figure, the effective receptive field is much smaller. Fig. 9 shows the color-coded energy distribution of the weight mask ($W_{1\times N}$).
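Because the network is piecewise linear, the weight mask $W_{1\times N}$ equals the gradient of the output pixel with respect to the input patch, which is one practical way to extract it. A numpy sketch on a toy two-layer network (random weights, illustrative only; real networks would use automatic differentiation instead of finite differences):

```python
import numpy as np

rng = np.random.default_rng(4)
N = 9
A0, b0 = rng.standard_normal((4 * N, N)), rng.standard_normal(4 * N)
A1, b1 = rng.standard_normal((N, 4 * N)), rng.standard_normal(N)

def net_center(x):
    """Toy conv+ReLU net; returns the center output pixel."""
    h = np.maximum(A0 @ x + b0, 0.0)
    return (A1 @ h + b1)[N // 2]

x = rng.standard_normal(N)

# Closed-form weight mask from the frozen ReLU pattern.
mask = (A0 @ x + b0) > 0
W = A1[N // 2] @ (A0 * mask[:, None])

# The same mask recovered as the gradient of the output pixel
# (finite differences; exact up to rounding because the net is
# piecewise linear around x).
eps = 1e-6
grad = np.array([(net_center(x + eps * np.eye(N)[i]) - net_center(x)) / eps
                 for i in range(N)])

print(np.allclose(W, grad, atol=1e-4))  # True
```

The average energy maps of Fig. 8 then follow by squaring and averaging such gradient masks over many patches.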
It appears that the weight mask reflects image characteristics. For example, for the baboon image, the weight mask had a circular shape (Fig. 10). On the other hand, for the coastguard image, the weight mask had an elliptic shape reflecting the horizontal river waves.

IV. EFFECTS OF BOUNDARY PIXELS
In (4), $W_{1\times N}$ and $\alpha^{k}_{j}$ are functions of the input image. Thus, we can rewrite (4) as follows:

$$y = W_{1\times N}(X)\,X^{0}_{N\times1} + \sum_{k=0}^{L-1}\sum_{j=1}^{n^{k}_{bias}} \alpha^{k}_{j}(X)\,b^{k}_{j}$$

where $L$ represents the number of layers that include bias terms. Although the effective receptive field is much smaller than the theoretical receptive field, it is still possible that boundary pixels may affect $W_{1\times N}$ and $\alpha^{k}_{j}$. Consequently, the boundary pixels may play a certain role in determining the output images. In order to investigate how the boundary pixels might affect $W_{1\times N}(X)$ and $\alpha^{k}_{j}(X)$, we extracted a patch from an image. The patch size was the same as that used in the training procedure (41×41 for the VDSR and 48×48 for the others). Then, we set the boundary pixels of the input patches to zero (Fig. 12) and applied a DCN-SR method. Next, we computed the gradient of the center pixel and kept only the center pixel as the output. We repeated this procedure for the 1600 patches. We increased the zero padding from 1 to 20 pixels (Fig. 12). We also compared the weight masks ($W_{1\times N}$) of the original patches and the zero-padded patches, and computed the angles between the two vectors (masks) as follows:

$$\theta = \cos^{-1}\!\left(\frac{W_{1\times N}\cdot W^{zp}_{1\times N}}{\lVert W_{1\times N}\rVert\,\lVert W^{zp}_{1\times N}\rVert}\right)$$

where $W^{zp}_{1\times N}$ represents the weight mask for the zero-padded patch, $\cdot$ represents the dot product, and $\lVert\cdot\rVert$ represents the norm of the vector. Fig. 13 shows the output pixel value differences between the original patches and the zero-padded patches (averages of the 1600 samples). It can be seen that zero padding of up to 10 pixels produced very small differences except with the RRDB (perceptual). These results indicate that the effective receptive fields of DCN-SR methods may be much smaller than expected.
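The mask-angle comparison is a plain normalized dot product; a small numpy sketch (the function name is ours):

```python
import numpy as np

def mask_angle(w, w_zp):
    """Angle (degrees) between the original and zero-padded weight masks."""
    cos = np.dot(w, w_zp) / (np.linalg.norm(w) * np.linalg.norm(w_zp))
    # Clip guards against rounding pushing |cos| slightly above 1.
    return np.degrees(np.arccos(np.clip(cos, -1.0, 1.0)))

w = np.array([1.0, 2.0, 3.0])
print(mask_angle(w, w))        # 0.0  (identical masks)
print(mask_angle(w, 2.0 * w))  # 0.0  (scaling does not change the angle)
print(mask_angle(np.array([1.0, 0.0]), np.array([0.0, 1.0])))  # orthogonal masks
```

A small angle means zero padding barely rotated the weight mask, i.e., the boundary pixels had little influence on the dynamic linear model.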
Next, we conducted another experiment. We extracted a block from a target image (Fig. 14(a)) and put it into a background image (Fig. 14(b)). The background image with embedded blocks was reduced (Fig. 14(c)) and then enlarged using a DCN-SR method (Fig. 14(d)). Finally, only the pixels of the center block (4×4) were retained (Fig. 14(e)). We repeated this procedure so that the entire target image was processed using the DCN-SR method. We computed the PSNR between the target image enlarged through block padding and the original high-resolution image. We compared this PSNR with conventional PSNR values that were computed without block padding. We varied the block size from 4×4 (B16) to 10×10 (B40) in the low-resolution images. Table 2 shows the PSNR comparison (averages of seven images). A PSNR decrease of 0.4∼0.5 dB was observed for the EDSR, RCAN and RRDB when the block size was 7 to 10 in the low-resolution images. For the VDSR, when the block size was 5 to 10, the PSNR performance slightly improved. These results indicate that DCN-SR methods may not always effectively use information from large receptive fields.
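The PSNR values in Table 2 follow the standard definition for 8-bit images; for reference, a minimal implementation:

```python
import numpy as np

def psnr(ref, test, peak=255.0):
    """Peak signal-to-noise ratio in dB between two 8-bit images."""
    mse = np.mean((ref.astype(np.float64) - test.astype(np.float64)) ** 2)
    return float("inf") if mse == 0 else 10.0 * np.log10(peak ** 2 / mse)

ref = np.full((8, 8), 128, dtype=np.uint8)
noisy = ref.copy()
noisy[0, 0] = 138                       # one pixel off by 10 gray levels
print(round(psnr(ref, noisy), 2))       # 46.19
```

Casting to float64 before subtraction avoids uint8 wrap-around, a common pitfall when computing image differences.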
In traditional SR performance evaluation, a high-resolution image is reduced and a super-resolution method is applied. Finally, the PSNR between the original high-resolution image and the enlarged image is computed. However, this process has a serious problem. When an image is reduced, low-pass filtering effects are introduced, which may affect performance. So, in the next experiments, we applied the DCN-SR methods to the original images without first reducing them. Fig. 15 shows the enlarged images along with the image enlarged using nearest neighbor interpolation (NN). As can be seen, the EDSR, RCAN and RRDB produced annoying artefacts. The bilinear and bi-cubic interpolation methods produced no such artefacts, even though they produced blurred images (not shown). These results indicate that current DCN-SR methods may not use valid features to perform enlargement operations.

In general, DCN-SR methods are not linear systems. Nevertheless, we investigated the DCN-SR responses to some basic patterns. First, we applied the DCN-SR methods to the unit pulse image (a single white dot at the center of an RGB image). Fig. 16 shows the unit pulse responses of the five DCN-SR methods along with the image enlarged using nearest neighbor interpolation (NN). Except for the VDSR, the four methods showed somewhat sinc-function-like artefacts, even though the side lobes were irregular. Next, we made a diagonal line and applied the five DCN-SR methods. Fig. 17 shows the results. Also in this example, except for the VDSR, the four methods (EDSR, RCAN, RRDB (PSNR), RRDB (perceptual)) showed somewhat sinc-function-like artefacts. The four methods appeared to be non-symmetric in the three color channels since they produced chromatic images for an achromatic input image (white line). In Fig. 18, we applied the five methods to a vertical bar image. Again, except for the VDSR, the four methods showed somewhat sinc-function-like artefacts by having additional bar patterns.
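The basic test patterns are straightforward to construct; for example, the unit pulse image used for Fig. 16 is a single white dot at the center of a black RGB image (the sizes here are arbitrary, for illustration):

```python
import numpy as np

def unit_pulse(h, w):
    """RGB test image: a single white dot at the center of a black image."""
    img = np.zeros((h, w, 3), dtype=np.uint8)
    img[h // 2, w // 2] = 255
    return img

img = unit_pulse(64, 64)
print(img.shape, int(img.sum()))  # (64, 64, 3) 765
```

Feeding such patterns to a trained DCN-SR model and inspecting the output is what exposes the sinc-like and chromatic artefacts discussed above.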
Furthermore, the four methods produced color artefacts. Finally, we applied the five methods to a diamond pattern (Fig. 19). The VDSR showed an expected output image, even though the eight dots became connected. The EDSR produced large artefacts (color distortions, false stripe patterns). Also, the EDSR showed different directional responses since the stripe patterns occurred in only one diagonal direction. The RCAN produced some color artefacts with additional sinc-function-like artefacts. The RRDB (PSNR) showed similar artefacts, though the color artefacts were mostly green. The RRDB (perceptual) also showed large colorful artefacts. Also, it showed different directional responses since the output images were not symmetric even though the input image was symmetric in the vertical and horizontal directions. All these results indicate that current DCN-SR methods may have some fundamental flaws. It appears that new DCN structures are needed to produce reliable super-resolution performance.

V. CONCLUSIONS
In this paper, we formulated the working mechanism of DCN-based super-resolution methods and showed that DCN-SR methods can be modelled as dynamic linear operations, which take an input image (patch) and bias terms. Based on the formulation, we analyzed the effective receptive fields of several DCN-based super-resolution methods, which were considerably smaller than the theoretical receptive fields. These results indicate that significant complexity reduction may be possible without sacrificing performance. Based on the mathematical formulation, a series of experiments were conducted, which indicate that current DCN-SR methods may have some fundamental flaws and that new types of DCN structures are needed to produce reliable super-resolution performance.