Deep Region Adaptive Denoising for Texture Enhancement

Image denoising is a highly challenging yet important task in image processing. Recently, many CNN-based denoising methods have achieved great performance, but they commonly denoise texture and non-texture regions blindly together. This frequently leads to excessive texture smoothing and detail loss. To address this issue, we propose a novel region adaptive denoising network that adjusts the denoising strength according to region textureness. The proposed network conducts denoising for texture and non-texture areas independently to improve the visual quality of the resulting image. To this end, we first generate a texture map that separates the image into texture and non-texture regions. Because the difference between texture and non-texture is more evident in the frequency domain than in the spatial domain, the classification is performed through the discrete cosine transform (DCT). Second, guided by the texture map, denoising is performed independently in two subnets, corresponding to the texture and non-texture regions. This allows the texture subnet to avoid excessive smoothing of high-frequency details, and the non-texture subnet to maximize noise reduction in flat regions. Finally, a cross fusion that takes into account the intra- and inter-relationships between the two resulting features is proposed. The cross fusion highlights the discriminative features from the two subnets without degradation when combining their outputs, and thus helps enhance the performance of region adaptive denoising. The superiority of the proposed method is validated on both synthetic and real-world images. We demonstrate that our method outperforms existing methods in both objective scores and subjective image quality, showing particularly outstanding results in the restoration of visually sensitive textures.
Furthermore, an ablation study shows that our network can adaptively control the noise removal strength by manually manipulating the texture map, and that the details of the texture region can be further improved. This can also simplify the cumbersome noise tuning process when deploying deep neural network (DNN) architectures in products.


I. INTRODUCTION
Image denoising is a fundamental and indispensable task in image processing. The common denoising approach reconstructs an original image by separating noise from its degraded version, typically modelled by y = x + n. Since the noise term n and the original image term x are not known, denoising is an ill-posed problem, which makes it highly challenging and requires strong priors. Recently, CNN models [1], [2], [3], [4], [5], [6], [7] have achieved great denoising performance by learning strong priors from large-scale data. Although many CNN-based denoising methods perform well, they still tend to excessively smooth the high-frequency components of texture regions. A texture region contains a larger amount of high-frequency components than a non-texture region. However, most existing methods blindly denoise texture and non-texture regions together. This commonly leads to excessive texture denoising and, consequently, to high-frequency detail loss. One solution to this over-smoothing problem is to denoise texture and non-texture regions separately. This first requires classifying texture regions in an image by quantifying their textureness. Then, the noise removal strength can be adaptively controlled according to the texture quantification. Based on this idea, we propose a novel region adaptive denoising method that independently conducts denoising for texture and non-texture regions, as illustrated in Fig. 1. The proposed method consists of the following three steps.
First, a ground-truth texture map indicating the textureness of the image is obtained by training a texture classification network. The difference between texture and non-texture regions is more distinct in the frequency domain. Therefore, image patches are first transformed to the frequency domain using the discrete cosine transform (DCT) and then fed to the network to obtain the ground-truth texture probability of each patch.
Next, region adaptive denoising is performed based on the texture probability. To this end, we propose a novel deep network with two subnets for texture and non-texture denoising. In the proposed network, we first estimate the texture map for the input image using the ground-truth map from the first step. The estimated texture map serves to classify the input image spatially into texture and non-texture in the feature domain, and the classified regions are sent to separate subnets. This ensures that the texture subnet avoids excessive smoothing of high-frequency details, and the non-texture subnet maximizes noise reduction in flat regions.
Finally, the denoised features from two subnets are combined using the cross-fusion module. Existing methods for combining features commonly utilize concatenation or element-wise multiplication of multiple features [8], [9].
However, they may lose important features that make a decisive contribution to the classification or regression, and they sometimes suffer from an unbalanced fusion biased toward one dominant output. To combine features more effectively, [10], [11], [12] proposed the attention module and its variants, which consist of spatial and channel attention blocks, to improve fusion performance. However, these methods still cannot effectively exploit the correlation between different features. The proposed cross-fusion module permutes the results of the self- and cross-attention to allow inter- and intra-feature fusion. This enables us to effectively learn the correlation between the output features of the two subnets and preserve discriminative information for adaptive denoising.
Furthermore, the proposed method can control the degree of noise removal by simply adjusting the texture map, without an iterative training process to visually find the optimal parameters. This is a great advantage in real-world use cases: data-driven neural network approaches are difficult to tune, and different providers have different optimal criteria for their imaging systems. Unlike existing denoising methods, in the proposed system the texture map not only divides the image into textured and non-textured regions but also controls the strength of denoising in textured regions. Therefore, it is easy to apply different levels of denoising by simply replacing the texture map. A detailed experiment is given in Section IV-D. In summary, our contributions are as follows:
• A region adaptive denoising network is proposed to independently perform denoising for texture and non-texture regions using two subnets.
• A texture map based on DCT classification is proposed to indicate the texture probability of the region, which enables the two denoising subnets to exclusively focus on their specific domain. The texture map is versatile for controlling the noise removal strength of a region, making it easy to find optimal tuning for deployment.
• A cross-fusion module is proposed to effectively fuse the outputs of both subnets by exploiting the intra- and inter-relationships between the textured and non-textured features.
• We conduct comprehensive denoising experiments and ablation studies to demonstrate the superiority of the proposed method over existing methods.
We review related works in Section II and introduce our proposed network in Section III. The experimental results and analysis are provided in Section IV. Ablation studies are provided in Section V. Finally, we present the conclusions in Section VI.

II. RELATED WORKS
Recently, as deep learning models have achieved great performance in denoising, many CNN-based denoising networks have emerged. Jain et al. [3] first proposed applying a convolutional neural network (CNN) to the denoising task.

[Figure 2: Comparison of the high-frequency component distribution between texture and non-texture regions. The shape of the high-frequency distribution is clearly different for texture and non-texture regions.]
Burger et al. [13] achieved higher performance than BM3D using a plain multi-layer perceptron (MLP). Based on the success of CNN architectures, Zhang et al. [1] demonstrated that residual learning and batch normalization make the training of deeper CNNs faster and more reliable. To further improve residual learning, [14] proposed DC-ResBlock to obtain a large receptive field. [15], [16], [17] proposed diverse methods to incorporate non-local operations into DNNs for image denoising. Although these methods have exploited both local and non-local correlations, it was still challenging to fully exploit both. To overcome this problem, [18] performed sparsity modeling for local and non-local correlations separately, which led to optimal sparsity and thus improved the denoising performance. Also, [19] proposed a noise basis learning method through subspace projection to maintain the non-local structure of the input. However, these methods assume that the noise is spatially invariant. To deal with spatially variant noise, [2] proposed FFDNet, which utilizes a non-uniform noise level map. Inspired by FFDNet, [20] proposed a deep plug-and-play IR to handle a wide range of noise levels by taking advantage of the ResNet and U-Net architectures. [21] used a hybrid orthogonal projection and estimation framework to improve the generalization capability of the network in the context of spatially varying and unknown noise. TSDNET [22] proposed a multi-scale attention module that utilizes both spatial and frequency features to reconstruct dehazed images; the heterogeneous features contribute to better performance.
Recently, as transformer models have shown great performance in natural language tasks [23], diverse denoising methods have adopted transformer-based models instead of CNN-based ones [24], [25]. The vision transformer [26] divides an image into local patches and provides the sequence of local patches as input to learn their correlation. Inspired by the transformer architecture, [24] proposed IPT, a pre-trained transformer-based model for low-level vision tasks. However, since the computational complexity of self-attention in the transformer increases quadratically with the size of the image, it is limited in denoising tasks that take high-resolution images as input. To overcome this problem, [25] applies self-attention across the channel dimension instead of the spatial dimension, resulting in linear rather than quadratic complexity.
Although CNN- and transformer-based denoising methods have shown tremendous success, they still struggle to preserve weak textures and high-frequency details. Generally, noise and texture are very similar to each other in that both consist of high frequencies; in other words, it is significantly challenging to distinguish noise from texture. On the other hand, a non-texture region composed of low-frequency components can be easily distinguished from noise. However, regardless of the different frequency components of different regions, many denoising methods perform image denoising blindly, with a loss designed to restore the original pixel level on average; this makes it difficult to preserve edge and texture details without degradation. As a result, the denoising process frequently smooths texture details along with the noise, making it difficult to keep the original texture information. To overcome this problem, texture-preserving denoising methods such as GradNet [27] and DCEF [28] have been proposed. Their common approach is to exploit an edge map extracted by an edge detector in deep networks. However, these methods merely feed the edge map into the network as a prior to preserve edges during denoising; the map itself does not significantly affect the denoising task.
Unlike the previous works mentioned above, the texture map in the proposed method represents the textureness score, or probability, for all regions of the image. This map divides the extracted features by region and sends them to individual subnets for adaptive denoising. As a similar approach, we previously proposed NRIBP [29], which super-resolves a noisy image with texture-adaptive back-projection, but it is a non-deep-learning method for image super-resolution. To our knowledge, there is no prior work that adaptively performs deep-learning-based denoising depending on textureness.
Attention in neural networks is widely used to concentrate on important local regions, allowing the network to extract discriminative features. SENet [30] proposed channel attention to improve the representational capacity of networks in image classification. CBAM [31] used dual attention modules consisting of channel and spatial sub-modules to refine intermediate features. Attention has shown great performance in many applications, including image classification [32], action recognition [33], segmentation [34], [35], image captioning [36], [37], etc. Recently, attention methods have also been used for diverse feature fusion tasks [10], [12], [38], [39], [40]. In [12], attention was used for fusing low dynamic range images into a high dynamic range image. [10] proposed a cross attention network that combines contextual features and spatial information. [39] proposed an HSI-LiDAR fusion based on cross attention, where the attention highlights the inter-modal features between LiDAR and HSI. Based on the success of transformers in computer vision, various methods utilize transformer units to execute the attention mechanism. Wei et al. [41] designed a transformer-based cross-attention model for image-sentence matching by jointly exploiting the intra- and inter-modal relationships of image regions and sentence words. [42] proposed a dual-branch transformer to fuse image patches of different scales for image classification. Motivated by these studies, we propose a CF module that effectively combines texture and non-texture features. The proposed method extracts noise features distinguished between texture and non-texture regions in separate self-attention modules and incorporates cross-attention to fuse the inter-relationship between the texture and non-texture features.

III. THE PROPOSED METHOD
Existing denoising methods optimize a loss function over the entire image region without identifying texture and non-texture areas, which have different frequency distributions. This smooths textures and leaves background noise. The proposed method solves this problem by denoising independently in two subnets and fusing their outputs with a novel combining method. The texture map adaptively steers the denoising process.

A. MOTIVATION FOR THE USE OF TEXTURE MAP
In general, texture and non-texture regions have different spectral distributions. Specifically, texture contains a larger amount of high frequencies than non-texture. Noise is typically characterized by a high-frequency distribution. If it is mixed with a texture region, it is considerably difficult to separate the noise from the textures, which overlap with noise in the frequency domain. This makes texture denoising highly challenging. On the other hand, noise in the non-texture region can be easily identified. Thus, both regions should be denoised adaptively according to the spectral distribution of their signals. Especially in the texture region, we need to compromise the extent of noise removal to preserve the original high-frequency components. However, if we blindly estimate noise through one loss function without separating texture and non-texture, the signals of all regions are trained toward the ground truth on average, essentially leaving background noise and smoothed texture. To overcome this problem, we propose a deep-learning-based network that independently carries out denoising for texture and non-texture and can adaptively control the removal strength through the texture map. Prior to the detailed description of our proposed network, we first verify whether texture and non-texture have different spectral distributions, as follows. We compare the high-frequency magnitude between texture and non-texture. To obtain patches for texture and non-texture, we used Sobel edge detection to extract the gradient components of images in the Div2K dataset [43], then defined the 100 × 100 region with the highest gradient value in each image as texture and, conversely, the 100 × 100 region with the lowest gradient value as non-texture. To analyze the high-frequency components of each local region, we built texture and non-texture local patches by randomly cropping these regions into 8 × 8 patches. Fig. 2 plots the high-frequency distribution acquired by DCT for 3000 pairs of texture and non-texture local patches. It can be observed that the high-frequency distributions are clearly different for texture and non-texture regions. This observation is effectively exploited for texture classification.
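This observation can be reproduced with a small numpy sketch. The DCT matrix implementation, and the use of random noise and a constant patch as stand-ins for textured and flat 8 × 8 patches, are assumptions for illustration:

```python
import numpy as np

def dct2(patch):
    """2-D DCT-II of a square patch via the orthonormal DCT matrix."""
    n = patch.shape[0]
    k, i = np.meshgrid(np.arange(n), np.arange(n), indexing="ij")
    c = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    c[0, :] = np.sqrt(1.0 / n)  # DC row has a different normalization
    return c @ patch @ c.T

def high_freq_energy(patch, cutoff=2):
    """Sum of absolute DCT coefficients outside the low-frequency corner."""
    coeffs = np.abs(dct2(patch))
    coeffs[:cutoff, :cutoff] = 0.0  # zero out DC and low frequencies
    return coeffs.sum()

rng = np.random.default_rng(0)
texture_patch = rng.standard_normal((8, 8))  # stand-in for a textured patch
flat_patch = np.full((8, 8), 0.5)            # stand-in for a non-texture patch
```

For the flat patch, virtually all DCT energy sits in the DC coefficient, so its high-frequency energy is near zero, while the textured stand-in spreads energy across the whole spectrum.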

B. THE GENERATION OF GROUND-TRUTH TEXTURE MAP
In this paper, we first classify texture regions probabilistically in the image; the classification output is named the texture map. To obtain a texture map, texture is classified with a deep network as shown in Fig. 3 (a). The classification is done in the DCT domain because the difference between texture and non-texture is more evident in the frequency domain than in the spatial domain. An image is partitioned into 8 × 8 patches, and each patch is transformed through the DCT. Note that the 8 × 8 patch size was found experimentally to be quite effective for texture classification. The transformed patch is fed into the texture classification network, which is trained in a supervised manner. We built a dataset of 3000 pairs of texture and non-texture patches with texture labels. The network is trained to generate a texture probability value between 0 and 1 through a sigmoid activation function for texture and non-texture patches, respectively.
The trained network is used to generate a ground-truth texture map of the target image, as illustrated in Fig. 3 (b). We first extract patches of size 8 × 8 from the target image and feed them into the texture classification network after the DCT. Since the textureness should be determined by reflecting the characteristics of the surrounding region, the texture probability value from the network output is mapped uniformly over the local 8 × 8 window (denoted P_i) and embedded into the texture map. To estimate the textureness of a pixel together with its surrounding pixels, we calculate the average of the map values of all windows P_i overlapping that pixel, and this average is defined as the textureness probability of the pixel. We can see in Fig. 3 (b) that, as expected, the region of sunflowers is classified as texture (close to 1) while the regions of sky and leaves are classified as non-texture (close to 0).
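The overlapping-window averaging above can be sketched as follows. The `classify` callable is a hypothetical stand-in for the trained DCT classification network; here a toy variance threshold plays its role:

```python
import numpy as np

def build_texture_map(image, classify, patch_size=8):
    """Per-pixel textureness: average the probabilities of every
    patch_size x patch_size window that overlaps the pixel."""
    h, w = image.shape
    prob_sum = np.zeros((h, w))
    count = np.zeros((h, w))
    for y in range(h - patch_size + 1):
        for x in range(w - patch_size + 1):
            p = classify(image[y:y + patch_size, x:x + patch_size])
            prob_sum[y:y + patch_size, x:x + patch_size] += p
            count[y:y + patch_size, x:x + patch_size] += 1
    return prob_sum / count

# toy stand-in for the trained DCT classification network
toy_classifier = lambda patch: float(patch.std() > 0.1)

img = np.zeros((16, 16))
img[:, 8:] = np.random.default_rng(1).standard_normal((16, 8))  # "textured" right half
tmap = build_texture_map(img, toy_classifier)
```

Pixels deep inside the flat half average to 0, pixels deep inside the textured half average to 1, and pixels near the boundary get intermediate probabilities because their covering windows disagree.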

C. REGION ADAPTIVE IMAGE DENOISING
Using the texture maps obtained in the previous subsection, the proposed denoising network first generates a texture map for a noisy input and then probabilistically decomposes the input image into texture and non-texture components based on this map. They are independently denoised through separate subnets, and the outputs of both subnets are combined to obtain the final denoised image. As illustrated in Fig. 4, the proposed region adaptive denoising network consists of four stages: the map generation module, the texture and non-texture denoising subnets, the cross-fusion module, and the reconstruction module.

1) MAP GENERATION MODULE
The map-generation module of the proposed network generates a clean texture map from the noisy input image as shown in Fig. 4. The detailed description is as follows. First, we generate the ground-truth texture map as illustrated in Fig. 3 (b) by putting the ground-truth image into the texture classification network. Second, the map-generation module in Fig. 4 is trained to generate the texture map from the noisy image using the ground-truth texture map. Let I_n ∈ R^{H×W×C} be a noisy image, M_t ∈ R^{H×W} a texture map, and G : R^{H×W×C} → R^{H×W} the map generation process. The map generation module can be written as

M_t = G(I_n).

The module is trained by optimizing the loss L_map, the prediction error of the texture map, expressed by

L_map = ||G(I_n) − M_gt||_1,

where M_gt is the ground-truth texture map obtained by the DCT-based texture classification in Fig. 3.
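As a rough sketch, the map-prediction error can be computed as below. The choice of the L1 (mean absolute) form is an assumption of this sketch, and the two maps are hypothetical examples:

```python
import numpy as np

def map_generation_loss(pred_map, gt_map):
    """Mean absolute prediction error between the estimated and
    ground-truth texture maps."""
    return np.abs(pred_map - gt_map).mean()

gt = np.array([[1.0, 0.0], [1.0, 0.0]])    # hypothetical ground-truth map M_gt
pred = np.array([[0.9, 0.1], [1.0, 0.0]])  # hypothetical prediction G(I_n)
loss = map_generation_loss(pred, gt)
```

The loss is zero exactly when the predicted map matches the ground-truth map everywhere.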

2) TEXTURE AND NON-TEXTURE DENOISING SUBNETS
As illustrated in Fig. 4, we first extract the features of the noisy input, F_m ∈ R^{H×W×K}, through a feature extraction layer composed of a 3 × 3 convolution layer. By multiplying the input features with the resulting texture map, the texture features are partitioned as

F_t = F_m ⊙ M_t,

where ⊙ denotes element-wise multiplication. Similarly, we obtain a non-texture map M_nt by inverting the texture map, M_nt = 1 − M_t, and the non-texture features are separated by

F_nt = F_m ⊙ M_nt.

Then, the two separated feature sets are fed into the non-texture subnet D_nt and the texture subnet D_t, and the outputs of the subnets are expressed by

I_t = D_t(F_t), I_nt = D_nt(F_nt),

where I_nt and I_t are the outputs of the non-texture and texture subnets, respectively, and describe the residuals (noise) estimated by the proposed subnets. To adaptively control the intensity of denoising depending on the textureness of each region, the losses of both subnets are given by the prediction errors of the respective residuals, denoted L_t and L_nt.
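The partitioning step can be illustrated with a minimal numpy sketch; the array shapes are illustrative, and the network applies this multiplication in the feature domain:

```python
import numpy as np

rng = np.random.default_rng(0)
f_m = rng.standard_normal((4, 4, 8))  # extracted features F_m, shape (H, W, K)
m_t = rng.uniform(size=(4, 4))        # estimated texture map M_t, shape (H, W)

m_nt = 1.0 - m_t                      # non-texture map obtained by inverting M_t
f_t = f_m * m_t[..., None]            # texture features (element-wise product)
f_nt = f_m * m_nt[..., None]          # non-texture features
```

By construction the two partitions are complementary: adding them back recovers the original features, so no feature energy is lost by the split.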

3) CROSS FUSION (CF) MODULE
D_t and D_nt extract the relevant noise information, I_t and I_nt, from each subnet. I_t lacks high-frequency components because D_t preserves the texture as much as possible, while I_nt, extracted from D_nt, contains a large portion of high frequencies corresponding to noise so that flat areas are smoothed. To effectively fuse I_t and I_nt, which have inherently different noise characteristics, the CF module, consisting of self- and cross-attention, is utilized to consider the intra- and inter-relationships, motivated by [25]. The two self-attention blocks, as illustrated in Fig. 5, enhance the region of interest within each intermediate feature, F_t and F_nt. In contrast, the cross-attention relates F_t and F_nt to obtain mutually exchangeable information in the context of texture-aware denoising. Note that F_t, F_nt ∈ R^{N×K}. The self-attention module consists of two sub-layers, a multi-head attention layer and a feed-forward layer, as shown in Fig. 5 (c). In the multi-head attention layers, the attention is obtained by projecting the queries, keys, and values. Specifically, given a layer-normalized tensor X ∈ R^{H×W×C}, we calculate the query, key, and value for the input:

Q = W_Q(X), K = W_K(X), V = W_V(X),

where Q ∈ R^{HW×C/n}, K ∈ R^{C/n×HW}, V ∈ R^{HW×C/n}, n is the number of heads, and W_(*) denotes convolution layers consisting of a 1 × 1 point-wise convolution and a 3 × 3 depth-wise convolution [25]. Next, we generate the attention map with the dot product of key and query, and the attention function is performed as

Attention(Q, K, V) = V · Softmax(K · Q / α),

where α is a learnable parameter that decides the weight of the attention map. After that, we concatenate the multi-head attention maps together and transform the features using feed-forward layers to improve the representation learning. Similar to the self-attention module, the cross-attention module is built on the same transformer unit, as shown in Fig. 5 (b).
Although the self-attention module extracts discriminative information from each domain feature, the inter-relationship (e.g., the relationship between texture and non-texture features) is not captured. To achieve a robust multi-domain fusion, the two domain feature maps have to share information about each other during fusion. Therefore, we propose a 'cross-attention' mechanism to extract the inter-relationship. The detailed formulation is as follows. First, we feed the two feature maps into separate multi-head attention layers and generate the query, key, and value:

Q_X = W_Q(X), K_Y = W_K(Y), V_Y = W_V(Y),

where X and Y are the texture and non-texture feature maps. Next, the attention map is generated by the dot product of the key and query from different domains (texture and non-texture), and the attention function is carried out as

CrossAttention(Q_X, K_Y, V_Y) = V_Y · Softmax(K_Y · Q_X / α).

By doing this, the feature maps can learn the correlation between different domain features. The updated feature maps are then fed into the feed-forward layers, and finally we obtain the inter-domain relationship by fusing these feature maps through a convolution layer. The overall architecture of our proposed CF module is shown in Fig. 5 (a). To effectively reconstruct the final outputs by fusing both texture and non-texture features, we concatenate the self- and cross-attention outputs and feed them into a final self-attention module. Therefore, the CF module synthetically exploits the intra-relationship within each domain and the inter-relationship between the texture and non-texture features.
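A minimal numpy sketch of channel-wise (transposed) cross-attention follows; the projection layers W_(*), the multi-head splitting, and the feed-forward sub-layer are omitted, which is an assumption of this sketch:

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def channel_cross_attention(x_feat, y_feat, alpha=1.0):
    """Channel-wise cross attention: the query comes from one domain and
    the key/value from the other, so the C x C attention map encodes
    inter-domain channel correlation."""
    q = x_feat        # (HW, C) query from the texture branch
    k = y_feat.T      # (C, HW) key from the non-texture branch
    v = y_feat        # (HW, C) value from the non-texture branch
    attn = softmax(k @ q / alpha, axis=0)  # (C, C) channel attention map
    return v @ attn   # (HW, C) updated feature map

rng = np.random.default_rng(0)
f_t = rng.standard_normal((16, 4))   # flattened texture features (HW=16, C=4)
f_nt = rng.standard_normal((16, 4))  # flattened non-texture features
fused = channel_cross_attention(f_t, f_nt)
```

Because the attention map is C × C rather than HW × HW, the cost grows linearly with the number of pixels, which is the motivation for channel-wise attention in [25].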

4) RECONSTRUCTION FOR FINAL OUTPUT
The aggregated multi-channel feature map F from the CF module is converted to the noise output Î_fn by the reconstruction layers as follows:

Î_fn = R(F),

where R represents the reconstruction layer. Finally, the denoised image is obtained by Î_f = I_n − Î_fn. The loss for the reconstruction module is given by the prediction error between the denoised image Î_f and the ground-truth clean image, L_rec = ||Î_f − I_gt||_1.

D. DETAILED NETWORK ARCHITECTURE AND LOSS FUNCTION
Our texture classification network consists of two 3 × 3 convolution layers and three fully-connected layers as shown in Fig. 3 (a). Each convolution layer is followed by a ReLU activation function and a max pooling layer. The network is trained to generate a texture probability value between 0 and 1 through a sigmoid activation function for texture and non-texture patches, respectively. As illustrated in Fig. 4, our proposed network consists of a map generation subnet (G) and two denoising subnets (D_nt, D_t), and these subnets have the same structure based on UNet [54]. Each subnet has four encoder and decoder stages with skip connections, where feature maps are downsampled by a factor of 1/2 with a 4 × 4 stride-2 convolution at the end of each encoder stage and upsampled by a factor of 2 with a 2 × 2 deconvolution before each decoder stage. Skip connections pass low-level feature maps, which contain detailed raw information, from each encoder stage to its corresponding decoder stage. The basic convolution blocks in the map-generation subnet and the two denoising subnets follow the same residual convolution architecture, composed of two 3 × 3 convolution layers, each followed by a LeakyReLU activation function. The reconstruction layer consists of one encoder and one decoder stage, each consisting of the residual convolution block.

[Figure: Denoising examples from CBSD68 [56]. Compared to the other models, our results more effectively remove noise while preserving the texture region details.]
To classify the regions of the image and remove noise according to the characteristics of the regions in an end-to-end manner, we design the overall objective of our network as

L = L_rec + α (L_t + L_nt) + β L_map,

where α and β denote the hyperparameters for the denoising subnet losses and the map generation loss. α = 0.2 and β = 0.01 were found to be experimentally appropriate and are used throughout the paper unless otherwise specified.
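Under the assumption that the overall objective is the reconstruction loss plus the weighted subnet and map-generation losses, the weighting can be sketched as:

```python
def total_loss(l_rec, l_t, l_nt, l_map, alpha=0.2, beta=0.01):
    """Weighted sum of the reconstruction, texture-subnet, non-texture-subnet,
    and map-generation losses (alpha and beta follow the paper's settings)."""
    return l_rec + alpha * (l_t + l_nt) + beta * l_map

loss = total_loss(l_rec=1.0, l_t=0.5, l_nt=0.5, l_map=2.0)
```

With the default weights, the reconstruction term dominates while the subnet and map terms act as mild regularizers.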

IV. EXPERIMENT
A. TRAINING SETTINGS
Throughout the study, the DIV2K [43] and Color BSD (CBSD) [56] datasets are used to evaluate denoising performance on synthetic noise images. DIV2K consists of 1000 natural images that are approximately 2000 × 1400 pixels in size. The CBSD dataset is part of the Berkeley Segmentation Dataset and Benchmark, consisting of 500 images. Among them, 800 training images from DIV2K and 432 training images from CBSD are used in the training process. Also, we used the 200 validation images of DIV2K in the validation process.

[Figure: Denoising examples from Kodak24 [59]. The image restoration quality of our results is more faithful to the ground-truth than that of other results.]

VOLUME 10, 2022
The Adam optimizer [57] is used with momentums of (0.9, 0.999), and the images are randomly cropped to 128 × 128 patches as network input. The learning rate is initialized to 10^-4 and decays by a factor of 2 × 10^-1 every 500 epochs.
In this paper, we evaluate the proposed network on noisy images corrupted by spatially invariant AWGN. To evaluate our network for Gaussian denoising, we synthetically generate training datasets [43], [56] with different Gaussian noise levels (σ = 15, 25, 50) and train our network for each noise level.
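The synthetic degradation y = x + n can be sketched as follows; the constant toy image and the 0-255 intensity scale are assumptions for illustration:

```python
import numpy as np

def add_awgn(image, sigma, rng=None):
    """Return y = x + n with spatially invariant additive white
    Gaussian noise of standard deviation sigma."""
    rng = rng or np.random.default_rng()
    return image + rng.normal(0.0, sigma, size=image.shape)

clean = np.full((128, 128), 128.0)  # toy constant image on a 0-255 scale
noisy_sets = {s: add_awgn(clean, s, np.random.default_rng(0)) for s in (15, 25, 50)}
```

One noisy training set is generated per noise level, matching the per-level training described above.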

B. DENOISING PERFORMANCE
To quantitatively evaluate the denoising performance of the proposed method, we use five datasets: BSD68 [56], Set5 [61], LIVE1 [55], Kodak24 [59], and McMaster [58], with different Gaussian noise levels (σ = 15, 25, 50). Table 1 summarizes the PSNR results for each Gaussian noise level compared to the CNN-based denoising methods. The proposed method outperforms all compared methods. GradNet utilizes a gradient map to better preserve texture details and thus can be directly compared to the proposed method; however, the proposed method achieves an average 0.22 dB higher PSNR than GradNet. Compared to the recently proposed NBNet, our method shows better results at σ = 15 and 25, and similar or slightly higher results at σ = 50. The proposed method not only reconstructs a clean image but also trains the internal subnets to handle texture separately. Even in the case of heavy noise, where high-frequency signals are dominant, as illustrated in Figs. 6 and 7, the proposed method tries to reconstruct the original texture from the messy and unrecognizable image using the texture map. Therefore, when the noise becomes strong (σ = 50), PSNR is slightly degraded due to the unique property of prioritizing texture reconstruction. Even though the quantitative PSNR improvement of the proposed method is negligible compared to NBNet at σ = 50, the subjective visual quality is compelling, especially in texture regions, as illustrated in Figs. 6 and 7. It can be observed that our results are superior to the existing methods in preserving weak-texture details. Table 2 shows a quantitative comparison with the previous methods; in particular, we compare with the transformer-based model IPT. Compared to the existing methods, our proposed method reports competitive quantitative results. Even though IPT achieves a higher PSNR than our method for noise level 50 on the McMaster dataset, the subjective visual quality of our method is superior in preserving fine-scale textured regions and high-frequency components, as illustrated in Fig. 8. Also, as shown in Fig. 9, our method performs better than other approaches in restoring the head and features of a person, which are important texture details of the image. It is worth noting that, while other methods focus only on quantitative evaluation criteria such as PSNR, our proposed method improves quantitative and subjective image quality simultaneously by allowing the texture and non-texture regions to be processed independently.

[Figure 11: Denoising results depending on the non-texture map manipulation. M_nt(x) indicates that the non-texture map is uniformly filled with x. The results show that the texture denoising subnet preserves high-frequency details and the non-texture subnet maximizes noise reduction; hence, our network can adaptively adjust the strength of noise removal through the non-texture map.]

C. DENOISING REAL-WORLD NOISY IMAGES
In the real world, noise arises not only from the sensor but also from various image processing steps of the camera pipeline, such as gamma correction, quantization, white balancing, and JPEG compression. Therefore, it is difficult to model the noise distribution as uniform or Gaussian. To evaluate the practicality of the proposed method in real use cases, we use the RNI15 dataset [60], which contains 15 real noisy images. Since the proposed method is a non-blind denoising model, we approximated the noise level as σ = 50 for the evaluation and compared with existing methods. Because the RNI15 dataset has no ground-truth images, we show some of the resulting images in Fig. 10. It can be seen that, unlike the other methods, which smooth texture details and noise simultaneously, the proposed method preserves more textures and structures, in line with the evaluation results on synthetic data.

[Figure 12: Denoising results by non-texture map manipulation. (a) The proposed method using the original non-texture map, (b) the proposed method using the non-texture map (×0.8), (c) the proposed method using the non-texture map (×0.2), (d) the denoising result of Restormer (Restormer does not use non-texture maps for denoising). In the proposed method, texture quality and noise in the texture region can be traded off by simply adjusting the texture map on the fly. This is a great advantage, facilitating fine-tuning suitable for imaging software. The texture-enhanced results also show significantly improved texture quality over Restormer.]

D. BOOSTING TEXTURE REGION WITH ADJUSTING NON-TEXTURE MAP
In general, CNN-based denoising methods conduct blind denoising, finding optimal parameters by learning strong priors from large-scale data. Thus, if the intensity of noise removal needs to be adjusted, these methods must control the denoising strength through an iterative retraining process. Unlike these methods, our network can simply control the degree of noise removal by adjusting the non-texture map. To be specific, the non-texture map acts as an indicator that informs the network of the textureness of the input image, hence our proposed network can control the denoising strength with the probability values of the non-texture map.
To prove this, we evaluate non-texture maps manipulated with arbitrary values, as illustrated in Fig. 11. When the non-texture map is set to 1, all regions are assumed to be non-texture and are strongly denoised, leading to over-smoothing as shown in Fig. 11 (a). Conversely, when the non-texture map is filled with 0 (i.e., all regions are assumed to be texture), the denoising strength is relatively weak and texture details are kept, as shown in Fig. 11 (c). If the non-texture map is uniformly 0.5, the region is neither texture nor non-texture, so it is denoised with medium strength as shown in Fig. 11 (b). It can be seen that the texture denoising subnet preserves high-frequency details and the non-texture subnet maximizes noise reduction; hence, our network can adaptively adjust the strength of noise removal through the non-texture map. Therefore, if subtle adjustments in denoising level are required, this can be done by simply adjusting the non-texture map, without the redundant process of preparing noisy images with different target sigmas and training the entire network from scratch. Fig. 12 presents qualitative comparisons of texture regions obtained by adjusting the non-texture map, and we also compare the subjective visual quality with Restormer. To demonstrate the efficiency of our network, we only change the non-texture map, without any additional update of the network, for the other results. Fig. 12 (a) is the output of our proposed network with the original non-texture map obtained from learning. Figs. 12 (b) and (c) are the results of multiplying the map in the texture regions by 0.8 and 0.2, respectively, to increase the texture probability (i.e., push values closer to 0).

FIGURE 13. Visual comparison with and without the non-texture map. Note that the non-texture map plays a decisive role in reconstructing the fine pattern of the Buddha image by allowing texture and non-texture to be processed independently.
In particular, to prevent the non-texture regions from being affected, we set a threshold value on the non-texture map and only multiplied the parameters in the texture regions. As shown in Fig. 12 (b), with the increase of the texture probability value, texture details are better preserved and SSIM is also improved, while PSNR is slightly degraded compared to Fig. 12 (a). In addition, a further increase of the texture probability makes the texture details more distinct, as shown in Fig. 12 (c). Furthermore, as illustrated in Fig. 12 (d), our results show significantly improved texture quality over Restormer by simply adjusting the non-texture map. It can be seen that as the texture probability increases, the high-frequency components are better preserved. Although there is a slight reduction in PSNR, the detail of the texture region is clearly improved. Therefore, our proposed network can control the noise removal strength by adjusting the non-texture map, and the detail of the texture region can be further improved as needed with some quantitative compromise.
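The map-manipulation step above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes the non-texture map is a probability array in [0, 1] where values near 0 indicate texture, and the 0.5 threshold and the function name `boost_texture` are hypothetical choices for the example.

```python
import numpy as np

def boost_texture(non_texture_map: np.ndarray, factor: float = 0.8,
                  threshold: float = 0.5) -> np.ndarray:
    """Scale down non-texture probabilities in texture regions only.

    Pixels whose probability is below `threshold` are treated as texture;
    multiplying them by `factor` < 1 pushes them closer to 0, which tells
    the network to denoise those regions more gently and preserve detail.
    Pixels at or above the threshold (flat regions) are left untouched.
    """
    boosted = non_texture_map.copy()
    texture_mask = non_texture_map < threshold   # texture ~ low probability
    boosted[texture_mask] *= factor
    return boosted

# Toy example: a 2x2 map with two texture pixels (0.2 and 0.1).
m = np.array([[0.2, 0.9],
              [0.6, 0.1]])
out = boost_texture(m, factor=0.2)
# Only the entries below 0.5 are scaled: 0.2 -> 0.04, 0.1 -> 0.02.
```

Because only the map is modified, no retraining is needed to move along the texture-preservation vs. noise-removal trade-off.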

V. ABLATION STUDY
To demonstrate the effectiveness of our proposed network architecture, we evaluate the two main ideas of our model. First, we examine the effectiveness of the two-branch network structure using the texture map. Second, an ablation study is performed to determine how the proposed cross-fusion module affects the overall reconstruction performance.

A. EFFECT OF TWO DENOISING SUBNET STRUCTURE
To validate how the individual subnet architecture and the non-texture map are involved in improving denoising performance, we design a baseline without the non-texture map (i.e., a single denoising subnet) and compare it with the proposed architecture, as illustrated in Fig. 13. The baseline without the non-texture map excessively smooths texture regions such as the forehead and hair, as shown in Fig. 13 (d), while the proposed subnet architecture reconstructs the subtle patterns of the Buddha's surface by preventing the textures from being smoothed, as shown in Fig. 13 (e). In the quantitative comparison shown in Table 3, the use of individual subnets yields a performance improvement of more than 0.16 dB over the baseline without a non-texture map. It can be seen that by separating the image into texture and non-texture regions through the non-texture map, our proposed network can adjust the denoising strength according to the characteristics of each area and enhance the denoising performance, especially in the texture regions.

B. CROSS FUSION MODULE
As described in Section III.C.3, we proposed the CF module, which extracts discriminant information and simultaneously learns the inter-relationships between texture and non-texture features. To evaluate the effectiveness of the CF module, we compare the proposed CF with simple concatenation. Fig. 14 (d) illustrates the resulting images when the CF is substituted with concatenation. Compared to the proposed CF in Fig. 14 (e), the performance is degraded, especially in reconstructing texture patterns. Furthermore, the proposed method with the CF module (individual subnets + w/ map + cross fusion) achieves about 0.08 dB higher PSNR than simple concatenation for fusion (individual subnets + w/ map + concat), as shown in Table 3.
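The exact CF design is given in Section III.C.3; the sketch below is only a hypothetical illustration of what this ablation compares: a concatenation baseline versus a cross fusion in which each branch is re-weighted by the other branch's global channel statistics before the same projection. The function names and the sigmoid gating scheme are assumptions for the example, not the paper's implementation.

```python
import numpy as np

def concat_fusion(f_tex: np.ndarray, f_flat: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Baseline: channel-wise concatenation followed by a 1x1 projection.

    f_tex, f_flat: (C, H, W) feature maps; w: (C_out, 2C) projection weights.
    """
    stacked = np.concatenate([f_tex, f_flat], axis=0)          # (2C, H, W)
    c2, h, wd = stacked.shape
    return (w @ stacked.reshape(c2, -1)).reshape(-1, h, wd)    # (C_out, H, W)

def cross_fusion(f_tex: np.ndarray, f_flat: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Hypothetical cross fusion: each branch is gated by per-channel
    statistics of the *other* branch (the inter-relationship), so channels
    that are discriminant across branches are emphasized before merging."""
    def gate(f: np.ndarray) -> np.ndarray:
        s = f.mean(axis=(1, 2))                 # squeeze to per-channel stats
        return 1.0 / (1.0 + np.exp(-s))         # weights in (0, 1)
    f_tex_w = f_tex * gate(f_flat)[:, None, None]
    f_flat_w = f_flat * gate(f_tex)[:, None, None]
    return concat_fusion(f_tex_w, f_flat_w, w)
```

In this toy form, the ablation in Table 3 corresponds to swapping `cross_fusion` for `concat_fusion` while keeping the two subnets and the map fixed.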

C. SUBSTITUTING D_t AND D_nt WITH DnCNN
As in the previous subsections, we analyze the effect of the proposed network architecture using DnCNN as the baseline. As shown in Table 3, DnCNN (individual subnets + w/ map + concat) achieves about 0.37 dB higher PSNR than plain DnCNN, and DnCNN (individual subnets + w/ map + cross fusion) achieves about 0.42 dB higher than plain DnCNN. It can be seen that our proposed architecture generalizes well, delivering strong performance with other backbone networks as well as with our proposed network.

D. GRAY VS. BINARY SCALE NON-TEXTURE MAP
The significance of the non-texture map is examined by comparing a binary map with a gray-scale one. In the binary map, every pixel is classified as either texture or non-texture, while in the gray-scale map the textureness of each pixel is expressed probabilistically. As shown in Fig. 15 (b), using a binary map results in excessive smoothing of weak-texture regions, such as line shapes with low textureness, because those regions are classified as non-texture. In contrast, those weak-texture regions are better enhanced with a gray-scale map. We can see that the denoising strength can be controlled more precisely over all regions with a finer non-texture map.
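The difference can be made concrete with a toy example (illustrative only; the 0.5 decision threshold is an assumption). Binarizing a soft map collapses weak-texture pixels near the boundary to a hard label, whereas the gray-scale map retains their intermediate denoising strength:

```python
import numpy as np

def binarize_map(soft_map: np.ndarray, threshold: float = 0.5) -> np.ndarray:
    """Hard decision: each pixel becomes fully texture (0) or non-texture (1)."""
    return (soft_map >= threshold).astype(soft_map.dtype)

# Weak textures yield probabilities near the decision boundary.
soft = np.array([0.10, 0.45, 0.55, 0.90])
hard = binarize_map(soft)   # -> [0., 0., 1., 1.]
```

The pixel at 0.55, only weakly non-texture, is pushed to full-strength denoising (1.0) by the binary map, which matches the over-smoothing of weak-texture regions observed in Fig. 15 (b); the gray-scale map would denoise it at roughly half strength instead.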

E. NOISE SENSITIVITY ANALYSIS
Most denoising methods work on the assumption that the noise level is perfectly predicted. In reality, however, the true noise level is neither known nor perfectly estimated, so noise mismatch inevitably occurs when removing real noise. If the estimated noise level is higher than the real noise level (over-estimation), image details are removed along with the noise; in the opposite case (under-estimation), the noise is not completely removed. Therefore, we evaluate the noise sensitivity of our proposed network under noise mismatch in comparison with FFDNet, DnCNN, DRNet, and IPT. Since our proposed network, unlike FFDNet, does not take a noise level map as input, we used the model trained on σ = 50 to evaluate BSD68 images with noise levels ranging from 0 to 60. The noise level sensitivity curves are shown in Fig. 17. Our method shows better results in both cases of underestimating and overestimating the real noise level, and is especially superior when underestimating it.
Qualitative results for noise sensitivity under overestimation are shown in Fig. 16. The methods trained on σ = 50 were evaluated on BSD68 images with different Gaussian noise levels (σ = 15, 35). The second row shows the result of each model for the noisy image with σ = 15, and the third row for σ = 35. In the case of overestimation, the existing methods remove noise from the entire image at a fixed intensity, leading to smoothing of texture regions. However, since the proposed method extracts the texture map from the noisy image itself, it is able to adjust the noise removal strength based on the textureness of each region. Therefore, our results are superior in preserving texture details compared to the existing methods.
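The sensitivity experiment can be summarized by the following evaluation loop, sketched under the assumptions of unit-range images and additive Gaussian noise; `model` stands for any denoiser trained at a fixed σ, and the function names are ours, not the paper's.

```python
import numpy as np

def psnr(clean: np.ndarray, denoised: np.ndarray, peak: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB for unit-range images."""
    mse = np.mean((clean - denoised) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

def sensitivity_curve(model, clean_images, sigmas, rng):
    """Evaluate one fixed model on a range of mismatched noise levels.

    Returns {sigma: mean PSNR}, i.e., one point per sigma of the
    noise-level sensitivity curve (cf. Fig. 17).
    """
    curve = {}
    for sigma in sigmas:
        scores = []
        for img in clean_images:
            noisy = img + rng.normal(0.0, sigma / 255.0, img.shape)
            scores.append(psnr(img, model(noisy)))
        curve[sigma] = float(np.mean(scores))
    return curve
```

A model trained at σ = 50 is simply passed in unchanged for every test sigma; no per-sigma retraining occurs, which is exactly the mismatch condition being measured.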

VI. CONCLUSION
In this paper, we presented a novel denoising method that adaptively controls the noise removal strength according to image regions. We first proposed a classification network, operating in the DCT domain, that divides an image into texture and non-texture regions and generates a texture map. Guided by this texture map, we proposed a region adaptive denoising network that conducts denoising tasks for texture and non-texture regions independently. By fusing the denoised results of both regions, we adjust the denoising strength according to the texture characteristics of each region and further enhance texture denoising performance by minimizing texture over-smoothing. Various experiments have demonstrated that the proposed network can control the noise removal strength using texture maps and achieves better performance than blind denoising.