One-to-One Mapping-Like Properties of DCN-Based Super-Resolution and its Applicability to Real-World Images

Although super-resolution techniques based on deep neural networks (SRDNN) have drawn significant interest and numerous algorithms have been proposed, they still have reliability problems and produce artefacts when applied to new datasets. In this paper, the working mechanisms of SRDNN techniques are analyzed in terms of data mapping. Since most SRDNN techniques can be viewed as dynamic linear projections, we analyzed a large number of projection vectors (over 70 million) and found that the SRDNN method performs one-to-one mapping-like operations and may be vulnerable to unknown data patterns. Then, we applied several SRDNN techniques to real-world images and analyzed the output images. The current SRDNN methods failed to distinguish the blurred edges/lines due to low resolutions from coding artefacts and enhanced both, even though the SRDNN methods were trained using compressed low-resolution (LR) images. These analyses and results indicate that current SRDNN methods may not be able to provide robust performance and new structures may be necessary for reliable super-resolution performance.


I. INTRODUCTION
Recently, super-resolution based on deep neural networks (SRDNN) has drawn significant interest and numerous algorithms have been proposed [1]-[15]. Although it has been reported that these SRDNN algorithms perform much better than traditional interpolation methods such as bi-cubic interpolation, SRDNN methods still tend to produce unexpected artefacts in some cases and this reliability issue can restrict the use of SRDNN methods.
In typical SRDNN methods, a reduced image is enlarged by an integer factor (e.g., 2, 3 or 4). Recently, the perceptual quality of SRDNN methods has been studied [16]-[19]. These SRDNN methods aim to improve perceptual image quality instead of the conventional PSNR. Furthermore, applying SRDNN methods to real-world images has been investigated by assuming that paired HR (high-resolution) and LR (low-resolution) images are unavailable [21]-[23]. Most existing SR (super-resolution) methods use degradation models that are not related to real images. Typically, bicubic down-sampling is used to generate low-resolution images to train the model. Using bicubic down-sampling is similar to applying a lowpass filter, which reduces high-frequency components in low-resolution images. Consequently, performance may be degraded when applied to real images. To address this problem, Ji et al. [25] proposed a degradation method using an estimation kernel and noise injection. Zhang et al. [26] proposed a degradation model consisting of randomly blended blur, down-sampling, and noise.

The associate editor coordinating the review of this manuscript and approving it for publication was Wei Wei.
Super-resolution deals with an extremely ill-posed problem. Once the resolution of an image is reduced, some information can be permanently lost and never recovered. In 4x super-resolution, a pixel in the reduced image can be viewed as an average of 16 pixels (as a 4 × 4 block). For example, the images shown in Figure 1 will be identical when they are reduced by 1/4 assuming the average of 16 pixels is used.
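The information loss described above can be illustrated with a minimal sketch. It assumes, as in the text, that each LR pixel is the plain average of a 4 × 4 block; the block contents and the helper name `block_average` are hypothetical and chosen only for illustration.

```python
def block_average(block):
    """Return the rounded mean of a 4x4 block of 8-bit pixel values."""
    pixels = [p for row in block for p in row]
    return round(sum(pixels) / len(pixels))

# Four visibly different 4x4 patterns (hypothetical example values).
flat = [[128] * 4 for _ in range(4)]                          # uniform grey
stripes = [[1, 255, 1, 255] for _ in range(4)]                # vertical stripes
checker = [[1, 255, 1, 255] if r % 2 == 0 else [255, 1, 255, 1]
           for r in range(4)]                                 # checkerboard
edge = [[56] * 4, [56] * 4, [200] * 4, [200] * 4]             # horizontal edge

blocks = {"flat": flat, "stripes": stripes, "checker": checker, "edge": edge}
lr_values = {name: block_average(b) for name, b in blocks.items()}
print(lr_values)  # every block averages to 128
```

Once the four blocks have collapsed to the same LR value, nothing inside that single pixel distinguishes them; any recovery has to rely on neighbouring pixels.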
When using 8-bit images, a 4 × 4 block can take a very large number of combinations, which are mapped into the same 8-bit value (0-255). Although this is an irreversible process, it is conjectured that SRDNN methods can utilize contextual information to recover the lost detail information. However, since most SRDNN methods produce unexpected artefacts, this conjecture should be examined and tested.

VOLUME 9, 2021. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
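As a rough count under the same averaging assumption (an illustration, not a figure taken from the paper), the number of distinct 8-bit 4 × 4 blocks can be compared with the number of possible LR pixel values:

```latex
\underbrace{256^{16} = 2^{128} \approx 3.4 \times 10^{38}}_{\text{distinct } 4 \times 4 \text{ blocks}}
\quad \longrightarrow \quad
\underbrace{2^{8} = 256}_{\text{LR pixel values}},
\qquad
\frac{2^{128}}{2^{8}} = 2^{120} \approx 1.3 \times 10^{36}
\ \text{blocks per LR value on average.}
```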
In this paper, we examined the working mechanism of SRDNN methods and showed that SRDNN methods show one-to-one mapping-like properties as they can be viewed as a dynamic linear transformation. Then, we analyzed the images enlarged by some SRDNN techniques as applied to compressed real-world images.

II. ONE-TO-ONE MAPPING-LIKE PROPERTIES OF SRDNN

A. DYNAMIC LINEAR TRANSFORMATION
A basic building block for SRDNN is a convolution layer followed by a ReLU layer, though some SRDNN methods use residual blocks, sigmoid functions, channel attention layers, etc. [6]. The depth of neural networks determines the receptive field size. For example, the receptive field of the VDSR [2] is 41 × 41. For some deep SRDNN methods, the receptive field can be the entire image. In other words, a pixel value in the output layer can be theoretically affected by the entire image. However, most SRDNN methods are trained using a large number of patches (K × K images). The typical value of K is 41 to 48.
In [20], it was shown that a convolution layer followed by a ReLU layer can be expressed by matrix operations. The patch was expressed as a vector ($N \times 1$) with $N = K^2$. If there are 64 filters in the convolution layer, the convolution layer followed by the ReLU layer was expressed as follows [20]:

$$X^{j+1}_{64N \times 1} = \mathrm{ReLU}\left(A^{j}_{64N \times N}\, X^{j}_{N \times 1} + b^{j}_{64N \times 1}\right) \tag{1}$$

where the superscript represents the layer index, $A^{j}_{64N \times N}$ is a filter matrix of the j-th layer and $b^{j}_{64N \times 1}$ is a bias vector of the j-th layer. As shown in [20], the output ($X^{j+1}_{64N \times 1}$) can also be expressed as a vector ($64N \times 1$). The ReLU operator replaces negative elements with zeros, which is equivalent to setting the corresponding row of $A^{j}_{64N \times N}$ and the corresponding element of $b^{j}_{64N \times 1}$ to zero. This operation makes SRDNN with ReLU a non-linear function. Thus, (1) can be rewritten as follows:

$$X^{j+1}_{64N \times 1} = {}^{R}A^{j}_{64N \times N}\, X^{j}_{N \times 1} + {}^{R}b^{j}_{64N \times 1}$$

where ${}^{R}A^{j}_{64N \times N}$ is a matrix that reflects the ReLU operations and ${}^{R}b^{j}_{64N \times 1}$ is a vector that reflects the ReLU operations. In other words, a convolution layer followed by a ReLU layer can be modelled as a dynamic linear transformation.
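The equivalence between a ReLU layer and an input-dependent linear map can be checked numerically. The following is a minimal sketch with a tiny hypothetical layer (3 outputs, 2 inputs) rather than a real convolution; the names `relu_layer` and `masked_layer` and all coefficient values are assumptions made for illustration.

```python
def matvec(A, x):
    """Multiply matrix A (list of rows) by vector x."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]

def relu_layer(A, b, x):
    """Standard form: ReLU(A x + b)."""
    return [max(0.0, z + bi) for z, bi in zip(matvec(A, x), b)]

def masked_layer(A, b, x):
    """Equivalent form: zero the rows of A and the entries of b whose
    pre-activation is not positive, then apply a plain affine map."""
    pre = [z + bi for z, bi in zip(matvec(A, x), b)]
    keep = [1.0 if p > 0 else 0.0 for p in pre]
    RA = [[k * a for a in row] for k, row in zip(keep, A)]
    Rb = [k * bi for k, bi in zip(keep, b)]
    out = [z + bi for z, bi in zip(matvec(RA, x), Rb)]
    return out, keep

# Hypothetical filter matrix, bias vector, and two inputs.
A = [[1.0, -2.0], [0.5, 1.0], [-1.0, 0.5]]
b = [0.5, -1.0, 0.25]
x1 = [2.0, 1.0]
x2 = [-1.0, 3.0]

out1, keep1 = masked_layer(A, b, x1)
out2, keep2 = masked_layer(A, b, x2)
print(out1 == relu_layer(A, b, x1), keep1)
print(out2 == relu_layer(A, b, x2), keep2)
```

The two inputs switch off different rows, so each one is processed by a different masked matrix and bias even though the trained coefficients `A` and `b` never change.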
After the DNN is trained, all the filter and bias coefficients are fixed. Without the ReLU operations, the transformation matrices would be identical for all input images. However, the ReLU operator produces a different transformation matrix depending on the input image. In particular, the matrix is determined by the signs of the elements of the vector ($A^{j}_{64N \times N} X^{j}_{N \times 1} + b^{j}_{64N \times 1}$) before the ReLU operation. If all the element signs are identical, the transformation matrix and the bias vector will be the same. In VDSR [2], there are 20 layers and the filter size is 3 × 3. Thus the receptive field is 41 × 41. In [20], it was shown that an output pixel can be expressed as a linear function of the input patch plus bias terms, where $X^{0}_{N \times 1}$ represents an input patch (41 × 41 image) and $W_{1 \times N}$, together with a second weight vector associated with the bias terms, represents the weights. Theoretically, 1681 pixels of the input image affect the output pixel (Figure 2).
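The receptive field quoted above follows from a standard relation for a stack of stride-1 convolutions without pooling. This is a general property of such stacks, stated here as background rather than taken from the paper; $L$ denotes the number of layers and $k$ the filter size.

```latex
r = 1 + L\,(k - 1) = 1 + 20 \times (3 - 1) = 41,
\qquad
N = r^{2} = 41^{2} = 1681 .
```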
On the other hand, the first layer (convolution and ReLU) will produce 64 images and 97344 (39 × 39 × 64) pixels of the 64 images of the first layer affect the output pixel. Before the ReLU operator, some of the pixel values may be negative and the ReLU operator will set the values to zero, which in turn will produce a different transformation matrix (${}^{R}A^{j}_{64N \times N}$) and a bias vector (${}^{R}b^{j}_{64N \times 1}$). Therefore, a total of 682177 pixels of layer output images can affect an output pixel in the VDSR method (Table 1). In particular, the signs of the 682177 pixels determine the final linear transformation matrix. In other words, the output pixel can be modelled as follows:

$$y = W_{1 \times N}(P)\, X^{0}_{N \times 1} + b_{1 \times 1}(P)$$

where $y$ denotes the output pixel and $P$ is the set of the 682177 pixels included in the pyramid, as illustrated in Figure 2. In this paper, $P$ is defined as a pyramid pixel set, which is a set of pixels of the output images of the layers. Each layer in VDSR outputs 64 images except for the last layer, whereas the input is a single-channel image. If the signs of the 682177 pixels before the ReLU operation are identical, the linear transformation ($W_{1 \times N}(P)$ and $b_{1 \times 1}(P)$) will be identical. This property can be applied to any SRDNN method that uses the ReLU function. Using this paradigm, most SRDNN methods can be modelled as a dynamic linear transformation. The input patch ($K \times K$) and the filters determine the pyramid pixel set ($P$), which in turn determines $W_{1 \times N}(P)$ and $b_{1 \times 1}(P)$. Although $W_{1 \times N}(P)$ is a function of $P$, $P$ is also a function of the input patch ($X^{0}_{N \times 1}$). Thus, $W_{1 \times N}$ and $b_{1 \times 1}$ can be viewed as functions of $X^{0}_{N \times 1}$:

$$y = W_{1 \times N}\!\left(X^{0}_{N \times 1}\right) X^{0}_{N \times 1} + b_{1 \times 1}\!\left(X^{0}_{N \times 1}\right) \tag{6}$$

Figure 3 illustrates this mapping procedure. Thus, the SRDNN method can be understood as first generating the weight vector and bias term ($W_{1 \times N}$ and $b_{1 \times 1}$) and then applying a linear transformation. In the case of VDSR, the network generates $W_{1 \times N}$ and $b_{1 \times 1}$ from $X^{0}_{N \times 1}$, and then uses (6) to compute the output pixel, though these operations are simultaneously performed in the VDSR network.
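The pixel counts quoted above can be reproduced with a short calculation. This sketch assumes stride-1 3 × 3 convolutions, 64 feature maps in every layer except the last, and a single-channel output, and it counts layer outputs only (the 1681 input pixels are excluded); the function name `pyramid_counts` is hypothetical.

```python
def pyramid_counts(num_layers=20, kernel=3, channels=64):
    """Count the layer-output pixels that can influence one output pixel.

    Walks backwards from the single output pixel; the spatial window
    grows by (kernel - 1) per layer. Returns the input window side and
    a list of (layer index, window side, pixel count) per layer.
    """
    counts = []
    side = 1  # the last layer contributes the single output pixel
    for layer in range(num_layers, 0, -1):
        maps = 1 if layer == num_layers else channels
        counts.append((layer, side, side * side * maps))
        side += kernel - 1
    return side, list(reversed(counts))

input_side, counts = pyramid_counts()
total = sum(n for _, _, n in counts)

print(input_side, input_side ** 2)  # 41 1681
print(counts[0])                    # (1, 39, 97344)
print(total)                        # 682177
```

Under these assumptions the total is 64 times the sum of the odd squares from 3² to 39², plus one for the output pixel itself, which matches the 682177 stated in the text.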

B. ONE-TO-ONE MAPPING-LIKE PROPERTIES
Recently, numerous SRDNN methods have been proposed and they have shown impressive performance. Figure 4 shows some example outputs of an SRDNN method (RRDB [12]). The method produced impressive enlarged images (4x) from a reduced image. From what appears as blurred lines at the center of the LR image, the method successfully reconstructed the two fine line structures. Also, the SRDNN method impressively reconstructed the detailed structures of the beam on the right. Since SRDNN methods can be modelled as shown in Figure 3, one may claim that SRDNN methods effectively utilize surrounding structures to successfully restore lost detailed information. However, erratic behaviors for new input images also suggest that SRDNN methods may fail to use relevant information and they might suffer from reliability problems.

In this paper, we examined the signs of the pyramid pixel set ($P$) for over 70 million output pixels when the VDSR was used. From each pyramid pixel set, we generated a sign sequence of 682177 numbers. If two output pixels have the same sign sequence, then the two output pixels will use the same linear transformation of (6). Another interpretation can be made from the viewpoint of input space division. The convolution filters will divide the input space (1681 dimensions) into a very large number of hyper-polygons. Then, all the pixels within the same hyper-polygon will use the identical linear transformation ($W_{1 \times N}(P)$ and $b_{1 \times 1}(P)$).

Figures 5-6 show some output pixels with identical sign sequences. The left image is an SR image and the right image marks the output pixels with the same sign sequence as red pixels. It can be seen that all those output pixels correspond to almost constant regions (either white or black). These results indicate that almost every input patch produced a different linear transformation of (6).
Almost every hyper-polygon was occupied by a single pixel and the number of hyper-polygons may significantly exceed the number of training patches.
In other words, the weight-vector-generating mapping function shown in Figure 3 behaves like a one-to-one mapping operator. Since this one-to-one mapping operator is designed from finite training samples, the mapping may not reflect a valid use of the relevant image information. More complex SRDNN methods (e.g., RCAN, RRDB, SAN, etc.), which have a larger number of parameters than VDSR, might suffer from the same problem since they may divide the input space into a much larger number of hyper-polygons.
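The sign-sequence analysis can be imitated on a toy model. The sketch below uses small, randomly initialised, fully connected ReLU networks (not VDSR, and not the experiment of the paper); the network sizes, the sample count, and the helper names are all assumptions. It shows two things: inputs that share a sign sequence share one affine map, and the number of distinct sign sequences can be compared with the number of inputs.

```python
import random

def make_net(sizes, rng):
    """Random fully connected ReLU network (a hypothetical toy stand-in)."""
    return [([[rng.gauss(0, 1) for _ in range(n_in)] for _ in range(n_out)],
             [rng.gauss(0, 1) for _ in range(n_out)])
            for n_in, n_out in zip(sizes[:-1], sizes[1:])]

def forward(net, x):
    """Return (scalar output, sign sequence of all pre-ReLU values)."""
    signs = []
    for i, (W, b) in enumerate(net):
        z = [sum(w * v for w, v in zip(row, x)) + bi for row, bi in zip(W, b)]
        if i < len(net) - 1:            # hidden layers: ReLU
            signs.extend(v > 0 for v in z)
            x = [max(0.0, v) for v in z]
        else:                           # output layer: linear
            x = z
    return x[0], tuple(signs)

def group_by_signs(net, inputs):
    """Map each sign sequence to the inputs that produced it."""
    groups = {}
    for x in inputs:
        groups.setdefault(forward(net, x)[1], []).append(x)
    return groups

rng = random.Random(0)
n_samples = 500

# Small network: 8 hidden units, so at most 2**8 sign sequences exist
# and some of the 500 inputs must share one.
small = make_net([2, 4, 4, 1], rng)
small_inputs = [[rng.uniform(-1, 1) for _ in range(2)] for _ in range(n_samples)]
small_groups = group_by_signs(small, small_inputs)

# Inputs that share a sign sequence share one affine map, so the output
# at their midpoint is the average of their outputs.
xa, xb = next(g for g in small_groups.values() if len(g) > 1)[:2]
mid = [(p + q) / 2 for p, q in zip(xa, xb)]
ya, yb, ym = forward(small, xa)[0], forward(small, xb)[0], forward(small, mid)[0]
print(len(small_groups), abs(ym - (ya + yb) / 2))

# Larger network: far more sign sequences are possible than samples.
large = make_net([16, 32, 32, 1], rng)
large_inputs = [[rng.uniform(-1, 1) for _ in range(16)] for _ in range(n_samples)]
large_groups = group_by_signs(large, large_inputs)
print(len(large_groups), "distinct sign sequences for", n_samples, "inputs")
```

Comparing the two printed counts gives a feel for how quickly the number of occupied regions grows with network size; the exact numbers depend on the random seed and say nothing quantitative about VDSR itself.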
Consequently, SRDNN methods may have a fundamental reliability problem due to this one-to-one mapping property. In other words, when SRDNN methods are applied to real-world images that are not used for training, they may show erratic behaviors. SRDNN has been reported to produce clear and sharp high-resolution images from low-resolution images with blurred edges and lines. However, since most SRDNN methods produce one-to-one mapping solutions, they may not be able to distinguish blurred edges and lines from the compression artefacts that can be produced by coding. To investigate this vulnerability, we applied various SRDNN methods to some real-world images in the next section.

III. APPLICATIONS TO REAL-WORLD IMAGES

A. APPLICATION TO COMPRESSED IMAGES
In general, SRDNN performance is evaluated using standard databases. Usually, low resolution (LR) images are generated by down-sampling the HR images.

In the next experiment, we applied several SRDNN methods (VDSR [2], EDSR [5], RRDB [12], ESRGAN [12], RCAN [6], SAN [7], CAR [27], HAN [28]) to the compressed LR images. The pre-trained models were downloaded from the authors' sites [29]. After the HR images were reduced to LR images, we compressed them using JPEG coding at various quality levels. When creating the compressed images, we used a JPEG function (built-in function in Ubuntu, ver. 18.04). Figure 7 shows the compressed LR images at various quality levels. Then, we used the SRDNN methods to enlarge the compressed LR images to the original high resolution. Tables 2-4 show the PSNR performance of the SRDNN methods along with the bi-cubic interpolation. The PSNR of the SRDNN methods decreased considerably for the compressed LR images.

Figure 8 shows the enlarged images when uncompressed LR images were used. Compared to the bi-cubic method, the SRDNN methods produced much better quality. Figure 9 shows the enlarged images when the LR image was compressed (JPEG 90). Although the bi-cubic method produced an output similar to that for the uncompressed LR image, the SRDNN methods showed more artefacts. In particular, ESRGAN, which aims to maximize perceptual image quality, showed severe artefacts. It can be seen that the SRDNN methods produced artefacts as soon as the LR images were compressed, even at a high quality level (JPEG 90). The VDSR generated the fewest artefacts for the compressed LR images, though its enlarged images were not as good as the outputs of the other SRDNN methods for the uncompressed LR images. Figure 9 shows that the SRDNN methods produced artefacts resembling 2D cosine transform patterns. For example, RCAN and SAN showed diagonal artefact patterns on the chin. Along strong horizontal edges such as eyebrows, a horizontal artefact pattern appeared.
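For reference, the PSNR metric reported in Tables 2-4 can be computed as in the following minimal sketch. It uses the common definition for 8-bit images (peak value 255); the paper does not state details such as colour channel or border handling, so this is only the generic formula, and the pixel values below are hypothetical.

```python
import math

def psnr(reference, test, peak=255.0):
    """PSNR in dB between two equally sized images given as nested lists."""
    ref = [p for row in reference for p in row]
    tst = [p for row in test for p in row]
    if len(ref) != len(tst):
        raise ValueError("images must have the same number of pixels")
    mse = sum((r - t) ** 2 for r, t in zip(ref, tst)) / len(ref)
    return math.inf if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)

# Hypothetical 4x4 HR patch and a reconstruction that is off by 2 everywhere.
hr = [[52, 55, 61, 66], [70, 61, 64, 73], [63, 59, 55, 90], [67, 61, 68, 104]]
sr = [[p + 2 for p in row] for row in hr]

print(round(psnr(hr, sr), 2))  # 42.11 (MSE = 4)
```

Because PSNR depends only on the mean squared error, it does not distinguish a uniformly blurred output from one with structured artefacts of the same energy, which is one reason the visual comparisons in the figures matter alongside the tables.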
Figure 10 shows the enlarged images when the LR image was compressed using JPEG (quality level: 80). The artefact patterns appeared similar to 2D cosine transforms of lower frequency. This issue will be discussed in detail later.
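As background on the "2D cosine transform" patterns mentioned above: baseline JPEG codes each 8 × 8 block with a 2-D DCT, whose basis pattern for horizontal and vertical frequency indices $u$ and $v$ can be written (up to a normalisation constant) as follows. This is general JPEG background rather than material from the paper.

```latex
B_{u,v}(x, y) = \cos\!\left[\frac{(2x + 1)\,u\,\pi}{16}\right]
                \cos\!\left[\frac{(2y + 1)\,v\,\pi}{16}\right],
\qquad x, y, u, v \in \{0, 1, \dots, 7\}.
```

Lower quality settings quantize the DCT coefficients more coarsely, which tends to remove the high-frequency terms first; one plausible reading is that this is why the enhanced patterns at quality level 80 resemble lower-frequency basis functions.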
Although the SRDNN methods successfully restored fine details from the blurred LR images in Figure 4, they also enhanced the coding artefacts of the compressed LR images and produced poor perceptual image quality (Figure 11(a)). These results indicate that, owing to their one-to-one mapping-like properties, SRDNN methods may not be able to use relevant information that reflects the true nature of the target images.
To solve this reliability problem of SRDNN methods, new structures may need to be developed.

B. TRAINING USING COMPRESSED LR IMAGES
In the next experiments, we trained the EDSR and RCAN methods using uncompressed and compressed LR images. In other words, in addition to uncompressed LR images (LR_NC: LR no compression), we also used compressed LR images coded by JPEG for training. We encoded the training set using JPEG (quality level: 90) and decoded the JPEG LR images to produce the compressed LR training data. We used the DIV2K dataset for training and validation. In general, the 800 DIV2K images have been used for training and the 100 DIV2K images have been used for validation. To train the EDSR model, 32 residual blocks and 256 features were used. To train the RCAN model, we used 10 residual groups, 20 residual blocks, and 64 features. The models were trained for 300 epochs. When the EDSR was trained using only the uncompressed LR images, it produced annoying artefacts when it was applied to the compressed LR images (Figure 11(a)).
In contrast, the EDSR trained using both uncompressed and compressed LR images did not produce these artefacts for the compressed LR images (Figure 11(b)), although the PSNR decreased (Figure 13). Similar performance patterns were observed for the RCAN (Figure 14). Figure 12 shows enlarged images of a compressed LR image (JPEG 90). Although the bi-cubic interpolation method produced an image with some coding artefacts, it retained some naturalness (Figure 12(a)). The EDSR trained using only uncompressed LR images enhanced the coding artefacts (Figure 12(b)). Although the EDSR trained using both uncompressed and compressed LR images produced a clean image (Figure 12(c)), the output image was overly smooth and fine details were lost. For example, the gradation within the music notes was lost and the text lines look unnatural.
Next, we encoded the LR images with higher compression (JPEG 80). With larger coding impairments, the EDSR and RCAN enhanced the coding artefacts (Figure 15), though they were trained using uncompressed and compressed LR images (JPEG 90 & 80).
It appears that the SRDNN methods failed to distinguish the lost details from the coding impairments when those impairments were larger than a threshold. Figure 16 shows the SR images produced by the EDSR when the LR images were compressed. When the EDSR trained using only uncompressed LR images was applied to uncompressed LR images, it produced good outputs (Figure 16(b)). However, it produced severe artefacts when applied to compressed LR images (Figure 16(c)). Even when the EDSR was trained using uncompressed and compressed LR images, it still produced some artefacts (Figure 16(d)). The corresponding LR images are shown at the lower right.

Furthermore, when the images were highly compressed, the SRDNN methods enhanced the coding artefacts even when compressed training data of the same level was used. For example, when the EDSR trained using the uncompressed LR images (NC LR: no compression low-resolution images) was applied to compressed LR images (JPEG 90), it produced coding artefacts (Figure 18) whereas it produced a good image when applied to the uncompressed LR image (Figure 17). When trained using both uncompressed and compressed LR images, the EDSR reduced the coding artefacts (Figure 19). However, it still produced coding artefacts (Figures 20 & 21) when the image was more compressed (JPEG 80), though compressed images (JPEG 80) were also used for training. Also, when the EDSR was trained using both uncompressed and compressed LR images, it produced overly smooth images (Figures 22 & 23). Restoring the detailed information without amplifying the coding artefacts and noise is still an unsolved challenge for SRDNN methods.

C. ENHANCING CODING ARTEFACTS
In the next experiment, we applied the SRDNN methods (VDSR, EDSR, RRDB, ESRGAN, RCAN, and SAN) to the images of the MICC logo database [21]. Figure 24 shows an original JPEG image, which was enlarged (4x) by using the SRDNN methods. Figure 25 shows a sub-image (enlarged box area of Figure 24) of the enlarged image produced by the bi-cubic interpolation. One can clearly see the DCT patterns. Figure 26 shows a sub-image of the enlarged image produced by the EDSR, which noticeably enhanced the coding artefacts (DCT patterns). Figure 27 shows a sub-image enlarged by ESRGAN, which produced severe coding artefact enhancement. Figures 28-32 show more examples of enhanced coding artefacts (e.g., mosquito noise due to encoding) produced by the SRDNN methods. These kinds of enhanced coding artefacts were observed in many of the MICC logo images and other JPEG images. In particular, the ESRGAN method, which is a perceptual model, produced the most severe artefacts.

FIGURE 32. Sub-image enlarged by HAN [28]. The enhanced coding artefacts (e.g., mosquito noise) are clearly visible.

D. ALIASING AND NATURALNESS
It is well known that aliasing can occur when images are reduced. Many SRDNN methods have produced such aliasing artefacts (Figure 33). Depending on the training data, SRDNN methods might produce different aliasing artefacts. In Figure 33, the EDSR trained with different training datasets produced different aliasing patterns.
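The origin of aliasing under image reduction can be seen in one dimension. The following minimal sketch keeps every fourth sample of a cosine without any low-pass filtering; the frequencies are hypothetical values chosen so that the tone lies above the reduced signal's Nyquist limit, and `decimate` is an illustrative helper, not a function from any SR code base.

```python
import math

def decimate(signal, factor):
    """Keep every `factor`-th sample with no low-pass filtering."""
    return signal[::factor]

n = 40
f_high = 0.3  # cycles per HR sample; the LR Nyquist limit is 0.125 after 4x reduction
x = [math.cos(2 * math.pi * f_high * k) for k in range(n)]
x_lr = decimate(x, 4)

# At the LR rate the tone sits at 4 * 0.3 = 1.2 cycles per LR sample, which is
# indistinguishable from 0.2 cycles per LR sample.
f_alias = 0.2
alias = [math.cos(2 * math.pi * f_alias * m) for m in range(len(x_lr))]

max_diff = max(abs(a - b) for a, b in zip(x_lr, alias))
print(len(x_lr), max_diff < 1e-9)  # 10 True
```

The reduced signal is numerically identical to a much slower tone, so no method working from the LR samples alone can tell which of the two was present in the original.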
In some cases, the SRDNN methods tend to produce overly smooth output images that look unnatural compared to conventional methods (Figure 34). Figures 37-42 show more of these examples. This unnaturalness was more easily perceived for human faces and characters. Figures 35-36 show some examples of enlarged text. It appears that the bi-cubic method provided the best naturalness and readability whereas the SRDNN methods produced unrecognizable/disturbing characters along with enhanced coding artefacts. Figure 43 shows enlarged tree branches. The SRDNN methods (RCAN, ESRGAN, CAR, HAN) produced very unnatural tree branch patterns. Also, they created visible vertical stripes at the left side ((c)-(f)).
Super-resolution is an ill-posed problem. Although it is claimed that SRDNN aims to restore lost detail information using surrounding structures, it is a challenging task and the current SRDNN methods may not be able to successfully handle all types of impairments. SRDNN methods appear to restore high-frequency information mainly by enhancing blurred edges/lines. However, blurred edge/line-like structures can also occur when images are compressed. The current SRDNN methods failed to distinguish the blurred edges/lines due to low resolutions from the blurred edge/line-like structures produced by encoding, even though the SRDNN methods were trained with compressed LR images. In some cases, when the SRDNN methods were trained with compressed LR images, they produced overly smooth unnatural images. It appears that the current SRDNN methods may not be able to provide global solutions to all super-resolution problems, though they may be a good solution for specific applications.

FIGURE 43. Enlarged tree branches, (a) nearest neighbor, (b) bilinear, (c) RCAN, (d) ESRGAN, (e) CAR [27], (f) HAN [28]. The tree branch patterns of the SRDNN methods are very unnatural. All of them created visible vertical stripes at the left side.

IV. CONCLUSION
In this paper, we analyzed the working mechanisms of SRDNN techniques and investigated the one-to-one mapping nature. After we modelled SRDNN methods in terms of weight vector generating and dynamic linear transformation, we analyzed a large number of projection vectors (over 70 million) and found that the SRDNN method showed one-to-one mapping-like properties. After further analyses of real-world images, it appears that current SRDNN techniques are vulnerable to unknown data patterns since the one-to-one mapping function is designed with a limited number of training samples, though the mapping space is enormous. Although it is desirable to restore the blurred edges/lines due to low resolutions without enhancing the blurred edge/line-like structures (e.g., coding artefacts, mosquito noise, etc.) produced by encoding, the current SRDNN methods failed to distinguish them and enhanced both, thereby generating undesirable artefacts. Using compressed LR images as training data failed to completely solve the problem and produced overly smooth unnatural images in some cases. Super-resolution is an ill-posed problem, though recent SRDNN methods have shown promising results. However, to provide robust performance for real-world compressed images, new structures may be necessary for successful SRDNN applications.