Multi-scale Adaptive Fusion Network for Hyperspectral Image Denoising

Removing noise and improving the visual quality of hyperspectral images (HSIs) is a challenging problem in both academia and industry. Great efforts have been made to leverage local, global, or spectral context information for HSI denoising. However, existing methods remain limited in exploiting feature interactions across multiple scales and in preserving rich spectral structure. In view of this, we propose a Multi-scale Adaptive Fusion Network (MAFNet) for HSI denoising, which learns the complex nonlinear mapping between noisy and clean HSIs. Two key components contribute to the denoising performance: a progressive multiscale information aggregation network and a co-attention fusion module. Specifically, we first generate a set of multiscale images and feed them into a coarse-fusion network to exploit the contextual texture correlation. Thereafter, a fine-fusion network follows to exchange information across the parallel multiscale subnetworks. Furthermore, we design a co-attention fusion module to adaptively emphasize informative features from different scales, thereby enhancing the discriminative learning capability for denoising. Extensive experiments on synthetic and real HSI datasets demonstrate that the proposed MAFNet achieves better denoising performance than other state-of-the-art techniques. Our code is available at https://github.com/summitgao/MAFNet.


I. INTRODUCTION
Hyperspectral images (HSIs) have developed rapidly with the maturity of remote sensing technology. HSIs have been extensively applied in land cover classification [1]-[4], semantic segmentation [5], change detection [6]-[11], oil spill monitoring [12]-[14], and geographic transport prediction [15]. In these applications, high-quality images are commonly desired. However, noise corruption is inevitable during HSI acquisition and degrades the visual quality considerably [16]. Hence, removing noise from the acquired HSI is a critical step for many remote sensing applications [17]. Efficient HSI denoising has recently attracted considerable research attention.
The HSI denoising task aims to recover an underlying clean image $I$ from noisy observed data $I_N$. The degradation model is commonly formulated as $I_N = I + I_N^*$, where $I_N^*$ denotes the mixed noise. To solve this ill-posed inverse problem, many methods model prior knowledge of the clean image $I$ to constrain the solution space. Total variation [18]-[21], sparsity-driven models [22]-[24], and low-rank representations [25]-[28] are commonly used. These optimization-based models capture the intrinsic structures in HSIs for noise removal.
During the past few years, deep learning-based models for low-level vision tasks have demonstrated significant potential and performance improvements. They have been extensively applied to image restoration tasks such as compression artifact reduction [29], image denoising [30], [31], and image super-resolution [32], [33]. In HSI denoising, deep learning-based models have recently yielded excellent results. Deep convolutional neural networks (CNNs) can learn rich feature representations from large-scale training data, instead of relying on hand-crafted features designed according to prior knowledge. Most existing CNN-based HSI denoising methods follow high-resolution feature processing [34]-[36]. These methods do not employ any downsampling operation; hence, more accurate spatial details can be retained in the denoising results. However, contextual information tends to be lost due to the limited receptive field. To effectively encode contextual information, some researchers employ an encoder-decoder architecture [37], [38]. The input HSI is progressively mapped into a low-resolution representation and then gradually mapped back to the original resolution. Broad context can be learned in the low-resolution representation [39]. However, some fine spatial details are hard to preserve, and such detailed information is hard to recover in the decoding stages.
arXiv:2304.09373v1 [eess.IV] 19 Apr 2023
It is crucial to encode contextual information and preserve spatial details simultaneously for robust HSI denoising. However, this is a non-trivial task due to the following challenges: 1) Tradeoff between spectral-spatial detail preservation and contextual information modeling. HSI denoising establishes a pixel-to-pixel correspondence from the observed data with complex noise to the clean image, and it is essential to preserve the detailed spectra and texture via high-resolution feature processing networks. However, existing models struggle to encode contextual information while preserving spatial details. Hence, how to preserve the detailed spectra and texture while effectively encoding the contextual information is of great significance. 2) Multiscale information aggregation. Image contents from multiple scales encode complementary information for feature representation, as shown in Fig. 1. Existing multiscale models rarely exchange information across different scales flexibly, so the correlations among different scales have not been fully exploited. Therefore, how to aggregate multiscale information into a unified framework is a challenging task. Wang et al. [40] proposed a deep High-Resolution representation Network (HRNet) for visual recognition. It maintains high-resolution representations throughout the whole network and repeatedly exchanges information among multi-resolution features. It has been widely used for human pose estimation [41], semantic segmentation [42]-[44], and multispectral image classification [45]. If such a framework could be introduced into HSI restoration, the denoising performance could be further improved.
To solve the aforementioned issues, we propose a Multiscale Adaptive Fusion Network (MAFNet) for hyperspectral image denoising. The framework of MAFNet is illustrated in Fig. 2. Specifically, we first generate a set of multiscale images and feed them into a coarse-fusion network to exploit the contextual texture correlation. Thereafter, a fine-fusion network follows to exchange information across the parallel multiscale subnetworks. Furthermore, we design a co-attention fusion module to adaptively emphasize informative features from different scales, thereby enhancing the discriminative learning capability for denoising. We then adopt a reconstruction loss together with a global gradient regularization to optimize the network. We conduct extensive experiments on five publicly available datasets. The experimental results show that the proposed MAFNet outperforms several state-of-the-art baselines. Our MAFNet differs from HRNet in two respects. First, in the coarse-fusion network, information flows only from the low-resolution features to the high-resolution features. In this stage, we aim to increase the receptive field to capture more content. Second, in multiscale feature fusion, HRNet transforms features from different resolutions to the same size and then concatenates them as the fusion output. The proposed MAFNet instead uses adaptive instance normalization and a co-attention mechanism for adaptive feature fusion.
The contributions of this work can be summarized as follows:
• We propose a novel hyperspectral image denoising model, MAFNet, which progressively fuses multiscale information. Hence, global contextual information modeling and spatial detail preservation can be achieved simultaneously.
• We present a co-attention fusion module to dynamically select the useful features from each scale subnetwork, enhancing the discriminative learning capability. Thereby, multiscale information is adaptively aggregated, and the correlations among different scales are concurrently enhanced.
• Extensive experiments are conducted on two benchmark datasets, which demonstrate the rationality and effectiveness of the proposed MAFNet. Meanwhile, we have released our code to benefit the remote sensing image restoration community.
The remainder of this paper is organized as follows. In Section II, we review closely related HSI denoising methods. The details of MAFNet are described in Section III. Experiments on several HSI datasets are presented in Section IV. Conclusions are drawn in Section V.

II. RELATED WORK
HSI denoising is an essential step to improve image quality before interpretation. To date, a great number of methods have been proposed to reduce the noise in HSIs. In this paper, the existing methods are classified into three categories and introduced in turn.

A. Filter-Based Methods for HSI Denoising
In the beginning, the filter operator is generally used for HSI denoising, and it aims to separate the clean image from noisy signals by non-local means filter or Fourier transform. Othman et al. [46] presented a wavelet shrinkage method for HSI denoising. The method benefits from the feature dissimilarity between the spectral and spatial domains. Zelinski et al. [47] proposed a method based on wavelet decomposition and sparse approximation for HSI denoising, thereby exploiting the correlation between bands and higher quality band information. Maggioni et al. [48] proposed BM4D for volumetric data denoising, which is an extension of the BM3D filter. It embeds the grouping and collaborative filtering paradigms, thus integrating spatial and frequency domain filtering. Letexier and Bourennane [49] used the multidimensional Wiener filter for HSI denoising. Quadtree decomposition is also utilized to keep local characteristics. These filter-based methods are sensitive to the transform function and, therefore, can hardly remove the mixed noise in HSIs.

B. Model-Based Methods for HSI Denoising
The model-based method is the most popular representation tool for HSI denoising. Total variation [18]-[21], sparsity-driven models [22]-[24], and low-rank representations [25]-[28] are commonly employed to establish an optimization model for HSI denoising. For instance, Yuan et al. [18] presented an adaptive total variation model for HSI denoising, which simultaneously models the spectral and spatial noise distribution. Zhang et al. [20] proposed an HSI denoising method based on nonlocal low-rank tensor decomposition, in which the nonlocal similarity between the data cubes is captured to build a clean image. Zhuang et al. [50] proposed a denoising method that exploits the low-rank structure of the HSI data, using the low-rank and self-similar characteristics contained in the HSI for sparse and compact representation. Cao et al. [51] presented a subspace-based nonlocal low-rank method for HSI denoising. These methods achieve satisfying performance due to their comprehensive consideration of the image's prior information.

C. CNN-Based Methods for HSI Denoising
In recent years, research on natural image restoration has been dominated by deep CNNs [29]-[32], [52]-[54]. Chang et al. [34] introduced a deep CNN model for HSI denoising that uses residual learning, dilated convolution, and multi-channel filtering to enhance the ability to express spectral features. Liu et al. [35] presented a 3-D atrous convolution method for HSI denoising, in which atrous convolution is employed to enlarge the receptive fields. Zhang et al. [55] employed a gradient learning strategy to capture the intrinsic and deep features of HSI. Cao et al. [56] proposed two global reasoning modules to exploit the contextual information along the channel and spatial dimensions, respectively. Both modules are combined in dense CNNs to exploit rich feature representations. Lin et al. [57] built a CNN-constrained nonnegative matrix factorization model for HSI denoising, which optimizes the noisy image in three stages: update of the spectral matrix, update of the abundance matrix, and estimation of the sparse noise. Wei et al. [58] designed QRNN3D, which adopts an encoder-decoder structure, uses a three-dimensional recurrent unit to exploit spatial pixel correlation and global spectral information, and applies an alternating-direction structure to eliminate unreasonable causal dependencies. The model is therefore capable of flexibly processing hyperspectral images.
Leveraging the powerful nonlinear modeling capability of deep CNNs, these methods achieve promising performance on HSI denoising. However, existing CNN-based methods rarely build feature communication between cascaded multiscale layers; thus, the correlated noise information across different scales is not fully exploited.

III. METHODOLOGY
In this section, we present the MAFNet for hyperspectral image denoising, which exploits the inherent correlation of noise across multiple scales. As illustrated in Fig. 2, MAFNet consists of four parts: the initial layer, the coarse-fusion network, the fine-fusion network, and noise reconstruction. The four parts work together to estimate the noise image $I_N^*$. The noise-free data is generated by subtracting $I_N^*$ from the observation $I_N$. The details of each part of the network are presented in the following.

A. Coarse-Fusion Network
For a given input hyperspectral image, the proposed model first downsamples the input image into 1/2 and 1/4 scales by using Gaussian kernels. Shallow features are extracted by multiple parallel convolutions, as illustrated in Fig. 2 (the initial layer). Next, the coarse-fusion network extracts deep features and fuses the multiscale information through several parallel adaptive instance (AIN) modules. The motivations for designing the coarse-fusion network are twofold: 1) The multiscale structure presents a solution to increase the receptive field to capture more content. 2) The AIN module can transfer the basic structures from the low-resolution feature maps to the high-resolution ones. We choose adaptive instance normalization [59] to build the AIN module due to its efficiency and compact representation. Fig. 3 shows the architecture of the AIN module.
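The multiscale input generation can be illustrated with a minimal pure-Python sketch. It is not the released implementation: the 5-tap binomial kernel, the border replication, and the names `blur_1d`, `downsample_band`, and `build_pyramid` are assumptions made for illustration, since the text only states that Gaussian kernels are used to obtain the 1/2 and 1/4 scales.

```python
# Binomial approximation of a Gaussian kernel (an assumption; the paper
# does not specify the kernel size or standard deviation).
KERNEL = [1 / 16, 4 / 16, 6 / 16, 4 / 16, 1 / 16]


def blur_1d(seq):
    """Smooth a 1-D sequence, replicating border values at the edges."""
    n, r = len(seq), len(KERNEL) // 2
    return [
        sum(w * seq[min(max(i + k - r, 0), n - 1)] for k, w in enumerate(KERNEL))
        for i in range(n)
    ]


def downsample_band(band):
    """Blur one [H][W] band separably, then keep every second row and column."""
    rows = [blur_1d(row) for row in band]
    cols = [blur_1d(list(col)) for col in zip(*rows)]  # indexed as [W][H]
    blurred = [list(row) for row in zip(*cols)]        # back to [H][W]
    return [row[::2] for row in blurred[::2]]


def build_pyramid(cube, levels=3):
    """Return [full, 1/2, 1/4] versions of a [bands][H][W] cube."""
    pyramid = [cube]
    for _ in range(levels - 1):
        pyramid.append([downsample_band(band) for band in pyramid[-1]])
    return pyramid
```

Each level halves the spatial size while keeping the number of bands, so an 8 × 8 input yields 8 × 8, 4 × 4, and 2 × 2 cubes that would feed the three parallel branches.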
The adaptive instance normalization applies an affine transform to the normalized feature map $h \in \mathbb{R}^{H \times W \times C}$ by taking an input $h'$ from the lower-resolution scale. Here $H$ and $W$ denote the height and width of the feature map, respectively, and $C$ is the number of channels. Specifically, the adaptive instance normalization takes the current feature $h$ and the downscale feature $h'$ as input. First, we convert $h'$ to the size of $H \times W \times C$ by transposed convolution so that it has the same dimensions as $h$. Afterwards, to use the contextual semantic information contained in the downscale features, we obtain the affine transform parameters (shift $\beta$ and scale $\gamma$) for each pixel from the transformed $h'$. Every feature map is pixel-wise affine transformed and channel-wise normalized, as illustrated in Fig. 3. The updated value in the feature map at position $(i, j, c)$ can be formally represented as:
\[ h^{new}_{i,j,c} = \gamma_{i,j,c} \, \frac{h_{i,j,c} - \mu_c}{\sigma_c} + \beta_{i,j,c}, \]
where $\mu_c$ denotes the mean of $h$ in channel $c$ and $\sigma_c$ denotes the standard deviation of the features in channel $c$. To be more specific, $\mu_c$ is computed as:
\[ \mu_c = \frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} h_{i,j,c}, \]
and $\sigma_c$ is computed as:
\[ \sigma_c = \sqrt{\frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} \left( h_{i,j,c} - \mu_c \right)^2}. \]
It should be noted that $\gamma_{i,j,c}$ and $\beta_{i,j,c}$ are generated pixel-wise from $h'$. Therefore, images with spatially variant noise can be handled adaptively. Finally, a convolutional layer is applied to $h^{new}$, and a residual connection is used to better transfer feature information.
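The channel-wise normalization followed by a pixel-wise affine transform can be sketched as follows. This is a simplified illustration rather than the released code: features are stored as nested lists in [C][H][W] order, `gamma` and `beta` are passed in directly (in the network they would be predicted from the upsampled low-resolution feature), and the small constant `eps` is an assumption added for numerical stability.

```python
import math


def adaptive_instance_norm(h, gamma, beta, eps=1e-5):
    """Normalize each channel of h, then scale and shift every pixel.

    h, gamma, beta: nested lists shaped [C][H][W].
    eps: stability constant (an assumption, not taken from the paper).
    """
    out = []
    for h_c, g_c, b_c in zip(h, gamma, beta):
        values = [v for row in h_c for v in row]
        mu = sum(values) / len(values)                      # channel mean
        var = sum((v - mu) ** 2 for v in values) / len(values)
        sigma = math.sqrt(var + eps)                        # channel std
        out.append([
            [g * (v - mu) / sigma + b for v, g, b in zip(h_row, g_row, b_row)]
            for h_row, g_row, b_row in zip(h_c, g_c, b_c)
        ])
    return out
```

Because `gamma` and `beta` vary per pixel, two locations in the same channel can be rescaled differently, which is what allows spatially variant noise to be handled.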

B. Fine-Fusion Network
The outputs of the coarse-fusion network are fed into the fine-fusion network to refine the information from multiple scales. It is well known in cognitive science that in the primate visual cortex, the local receptive fields of neurons are of different sizes. Hence, the capability of collecting multiscale information should be taken into account in deep networks. Inspired by HRNet [40], we conduct repeated multiscale fusion by exchanging information across parallel multiresolution subnetworks. Furthermore, we design a co-attention fusion module to adaptively emphasize informative features from different scales, and therefore enhance the discriminative learning capability of the network for image denoising.
Fig. 2. MAFNet is composed of an initial layer, a coarse-fusion network, a fine-fusion network, and noise reconstruction. The initial layer obtains the feature representation from the HSI, the coarse-fusion network transfers information from the low-scale network to the high-scale network, the fine-fusion network fully integrates the contextual global information, and finally the noise is reconstructed to obtain the denoised image. For upsampling, we use transposed convolution to get the high-resolution representation of the feature.
As illustrated in Fig. 2, the fine-fusion network starts from multiscale feature representations $\{X^1_r, r = 1, 2, 3\}$, where $r$ denotes the spatial resolution index. In the second layer, the feature representations are $\{X^2_r, r = 1, 2, 3\}$. Each feature representation is computed as:
\[ X^2_r = \mathrm{CA}\big( f(X^1_1), f(X^1_2), f(X^1_3) \big), \quad r = 1, 2, 3, \]
where CA is the co-attention fusion module and $f(\cdot)$ is a transform function. The transform function $f(\cdot)$ depends on the input and output spatial resolution of the feature map. As depicted in Fig. 4, a strided $3 \times 3$ convolution is employed for $2\times$ downsampling, and two consecutive strided $3 \times 3$ convolutions are employed for $4\times$ downsampling. For upsampling, nearest neighbor sampling following a $1 \times 1$ convolution is used.
If the input and the output have the same resolution, we adopt the identity connection. Note that after transformation, the feature maps from different scales are of the same size, and they are fed into the co-attention fusion module.
In deep neural networks, features from different states or sources contribute differently to the feature representations [60]. In HRNet, multiscale features are directly fused by element-wise summation. We argue that simply combining multiscale features by concatenation or summation lacks the flexibility to modulate these features, which weakens the discriminative ability of deep models. Therefore, this paper proposes a co-attention fusion module to adaptively emphasize the important information from different scales. The structure of the co-attention fusion module is sketched in Fig. 5. It consists of two parts: 1) concatenation and split, and 2) fusion and self-calibration.
Concatenation and split. The co-attention fusion module receives multiscale features and generates trainable weights for feature fusion. Given input features $Y_1$, $Y_2$, and $Y_3$, each of size $C \times H \times W$, we first concatenate the three features:
\[ U = \mathrm{cat}(Y_1, Y_2, Y_3), \]
where $\mathrm{cat}(\cdot)$ is the concatenation operation. Next, global average pooling is used to compute the channel-wise statistics $s \in \mathbb{R}^{3C \times 1 \times 1}$ along the spatial dimensions of $U \in \mathbb{R}^{3C \times H \times W}$. A downsampling convolution layer is then employed to produce a compact feature $u \in \mathbb{R}^{\frac{3C}{r} \times 1 \times 1}$, where the reduction ratio $r = 4$ is used in our experiments.
Finally, the feature $u$ is passed through three parallel upsampling layers, which provide three feature descriptors $u_1$, $u_2$, and $u_3$, each with dimension $C \times 1 \times 1$. The softmax function is applied to $u_1$, $u_2$, and $u_3$, yielding three attention activation vectors $\alpha_1$, $\alpha_2$, and $\alpha_3$, respectively.
Fusion and self-calibration. The three attention activation vectors are used to recalibrate the input features as:
\[ \tilde{Y} = \alpha_1 \cdot Y_1 + \alpha_2 \cdot Y_2 + \alpha_3 \cdot Y_3. \]
Then, we adjust and integrate the features with the self-calibration module, which computes $H_{sc}(\tilde{Y})$, where $H_{sc}(\cdot)$ denotes the self-calibrated convolution [61]. As a follow-up operation after fusion, the self-calibrated convolution applies convolution filters to the fused feature map to enhance the feature representation ability.
In summary, the proposed co-attention module transforms the input features into compact descriptors and generates three sets of weights to model channel-wise interdependencies. In this way, the co-attention module can adaptively emphasize the important information from multiple scales and generate trainable weights for representative feature fusion.
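The weighting step of the co-attention fusion module can be sketched in pure Python as follows. This is an illustrative simplification under several assumptions: the 1 × 1 convolutions on pooled statistics are written as plain matrix-vector products with hypothetical weights `w_down` and `w_ups`, a ReLU is assumed after the compression layer, the softmax is assumed to act across the three scales for each channel, and the self-calibrated convolution is omitted.

```python
import math


def global_avg_pool(feat):
    """feat: [C][H][W] -> list of C channel means."""
    return [sum(v for row in ch for v in row) / (len(ch) * len(ch[0])) for ch in feat]


def linear(weights, x):
    """Matrix-vector product; weights is shaped [out][in]."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]


def co_attention_weights(feats, w_down, w_ups):
    """Return three per-channel weight vectors, one for each input scale."""
    s = [v for f in feats for v in global_avg_pool(f)]   # 3C statistics
    u = [max(0.0, v) for v in linear(w_down, s)]          # compact descriptor (ReLU assumed)
    logits = [linear(w, u) for w in w_ups]                # three C-dim descriptors
    n_channels = len(logits[0])
    alphas = [[0.0] * n_channels for _ in logits]
    for c in range(n_channels):
        peak = max(lg[c] for lg in logits)                # subtract max for stability
        exps = [math.exp(lg[c] - peak) for lg in logits]
        total = sum(exps)
        for k, e in enumerate(exps):
            alphas[k][c] = e / total                      # softmax across the scales
    return alphas


def fuse(feats, alphas):
    """Channel-wise weighted sum of the three [C][H][W] features."""
    C, H, W = len(feats[0]), len(feats[0][0]), len(feats[0][0][0])
    return [[[sum(a[c] * f[c][i][j] for a, f in zip(alphas, feats))
              for j in range(W)] for i in range(H)] for c in range(C)]
```

Under these assumptions the three weights for any channel sum to one, so the fused feature is a convex combination of the inputs and the module can shift emphasis between scales without changing the overall feature magnitude.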

C. Denoising and Reconstruction
At the end of the fine-fusion network, multiscale features are fused by the co-attention fusion module. Then, one convolution layer is employed to learn the residual noise image $I_N^*$. Finally, the noise-free image $\hat{I}$ is computed by subtracting $I_N^*$ from the observation $I_N$.
We use the $L_1$ loss to optimize our network, and the reconstruction loss is:
\[ \mathcal{L}_{rec} = \| \hat{I} - I \|_1, \]
where $\hat{I}$ denotes the estimated noise-free HSI, and $I$ denotes the real noise-free HSI. In the HSI case, however, the hundreds of bands with abundant spectral information mean that the noise type and intensity usually differ from band to band. Therefore, differences along the spatial and spectral directions can provide additional complementary information for denoising. We introduce a global gradient regularizer to constrain the details of $\hat{I}$:
\[ \mathcal{L}_{grad} = \| \nabla_h \hat{I} - \nabla_h I \|_1 + \| \nabla_v \hat{I} - \nabla_v I \|_1 + \| \nabla_s \hat{I} - \nabla_s I \|_1, \]
where $\nabla_h$, $\nabla_v$, and $\nabla_s$ denote the gradient operators along the horizontal, vertical, and spectral directions, respectively. Then, the total loss function is as follows:
\[ \mathcal{L} = \mathcal{L}_{rec} + \lambda \mathcal{L}_{grad}, \]
where $\lambda$ is the weight parameter of $\mathcal{L}_{grad}$. We empirically set $\lambda$ to 0.01 to balance the loss terms.
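The combination of a reconstruction term and a gradient term along the horizontal, vertical, and spectral directions can be sketched as below. The sketch makes assumptions that the text does not fix: gradients are taken as forward differences, the gradient term uses the same $L_1$ penalty as the reconstruction term, and sums rather than means are used. The function names are hypothetical.

```python
def flatten(cube):
    """Flatten a [S][H][W] cube into a single list."""
    return [v for band in cube for row in band for v in row]


def diff(cube, axis):
    """Forward differences along axis 0 (spectral), 1 (vertical) or 2 (horizontal)."""
    S, H, W = len(cube), len(cube[0]), len(cube[0][0])
    ds, dh, dw = [(1, 0, 0), (0, 1, 0), (0, 0, 1)][axis]
    return [cube[s + ds][i + dh][j + dw] - cube[s][i][j]
            for s in range(S - ds) for i in range(H - dh) for j in range(W - dw)]


def l1(a, b):
    """Sum of absolute differences between two flat lists."""
    return sum(abs(p - q) for p, q in zip(a, b))


def total_loss(pred, clean, lam=0.01):
    """Reconstruction loss plus lam times the gradient regularizer."""
    rec = l1(flatten(pred), flatten(clean))
    grad = sum(l1(diff(pred, axis), diff(clean, axis)) for axis in (0, 1, 2))
    return rec + lam * grad
```

A constant offset between the estimate and the clean cube is penalized only by the reconstruction term, whereas an isolated wrong voxel also changes the differences along all three directions, which is how the regularizer targets fine spatial and spectral detail.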

IV. EXPERIMENTAL RESULTS AND ANALYSIS

A. Experiment Setup
Benchmark datasets. To verify the effectiveness of the proposed MAFNet for HSI denoising, we evaluate it on several datasets, and training is conducted using data from the ICVL [62] and CAVE [63] hyperspectral datasets. The images in the ICVL dataset were collected over 31 spectral bands with a size of 1392 × 1300, while the images in the CAVE dataset were collected over 31 spectral bands with a size of 512 × 512. The training data are randomly cropped into cubes with a size of 128 × 128 × 31. Basic data augmentation (rotation and scaling) is used for regularization. Twenty thousand training samples are generated in total. To verify the robustness of the proposed MAFNet on real data, spaceborne hyperspectral data are used in our experiments, including Pavia University, Urban, and Indian Pines. Through experiments on these remote-sensing HSI datasets, we verify the generalization ability and denoising effect of the proposed MAFNet.
Noise setting. Hyperspectral data captured by real spaceborne sensors are commonly contaminated by mixture noise, such as Gaussian noise, impulse noise, and deadline noise. In the testing phase, five types of complex noise are defined as follows:
1) Case 1: Non-i.i.d. Gaussian noise. Data in all spectral bands are contaminated by Gaussian noise with various intensities. The variances of the Gaussian noise are randomly selected from 30 to 70.
2) Case 2: Gaussian + stripe noise. Every band is contaminated by non-i.i.d. Gaussian noise, as in Case 1. In addition, some spectral bands are randomly selected to add stripe noise. In each of these bands, 5% to 15% of the columns are polluted with stripes.
3) Case 3: Gaussian + deadline noise. Every band is corrupted by non-i.i.d. Gaussian noise, as in Case 1. In addition, deadline noise is randomly added to one-third of the spectral bands. In each of these bands, 5% to 15% of the columns are corrupted by deadlines.
4) Case 4: Gaussian + impulse noise. Each band is contaminated by Gaussian noise, as in Case 1. One-third of the bands are randomly selected to add impulse noise with intensity ranging from 10% to 70%.
5) Case 5: Mixture noise. As in the other cases, every spectral band is corrupted by Gaussian noise. Then, each band is contaminated by a random combination of the other three noise types.
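The noise cases listed above can be simulated with a short pure-Python sketch. Several details are assumptions rather than facts from the paper: pixel values are taken to lie in [0, 1] with the Gaussian level expressed on a 0 to 255 scale, stripes are modelled as a constant offset per column with a hypothetical `amplitude`, deadlines set whole columns to zero, impulse noise is salt-and-pepper, and stripes use the same one-third band fraction as the other structured noises.

```python
import random


def add_gaussian(cube, rng, lo=30, hi=70):
    """Case 1: every band gets its own noise level drawn from [lo, hi].

    Returns a new [S][H][W] cube; the input is left untouched.
    """
    noisy = []
    for band in cube:
        sigma = rng.uniform(lo, hi) / 255.0
        noisy.append([[v + rng.gauss(0.0, sigma) for v in row] for row in band])
    return noisy


def pick_bands(n_bands, rng, fraction=1 / 3):
    """Randomly choose a fraction of the band indices."""
    return rng.sample(range(n_bands), max(1, round(n_bands * fraction)))


def pick_columns(width, rng):
    """Randomly choose 5% to 15% of the column indices."""
    share = rng.uniform(0.05, 0.15)
    return rng.sample(range(width), max(1, round(width * share)))


def add_stripes(cube, rng, amplitude=0.25):
    """Case 2 extra step: offset selected columns (modifies cube in place)."""
    for b in pick_bands(len(cube), rng):
        for col in pick_columns(len(cube[b][0]), rng):
            offset = rng.uniform(-amplitude, amplitude)
            for row in cube[b]:
                row[col] += offset
    return cube


def add_deadlines(cube, rng):
    """Case 3 extra step: zero out selected columns (modifies cube in place)."""
    for b in pick_bands(len(cube), rng):
        for col in pick_columns(len(cube[b][0]), rng):
            for row in cube[b]:
                row[col] = 0.0
    return cube


def add_impulse(cube, rng):
    """Case 4 extra step: salt-and-pepper noise on 10% to 70% of the pixels."""
    for b in pick_bands(len(cube), rng):
        ratio = rng.uniform(0.10, 0.70)
        for row in cube[b]:
            for j in range(len(row)):
                if rng.random() < ratio:
                    row[j] = rng.choice((0.0, 1.0))
    return cube
```

For example, `add_deadlines(add_gaussian(clean, random.Random(0)), random.Random(1))` would produce a Case 3 sample, and Case 5 could be approximated by chaining a random subset of the three structured steps after `add_gaussian`.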
Competing methods and quantitative metrics. The proposed method is compared with six state-of-the-art methods, covering both traditional and deep learning-based approaches. Specifically, for traditional methods, BM4D [48] and the low-rank methods LRMR [64] and LRTV [65] are considered. For deep learning-based methods, the proposed MAFNet is compared with HSID-CNN [36], MemNet [66], and QRNN3D [58]. To give a fair evaluation, three quantitative metrics are used: peak signal-to-noise ratio (PSNR), structural similarity (SSIM) [67], and spectral angle mapper (SAM) [68]. SAM is a spectral-based index, and a smaller SAM value indicates better denoising performance. PSNR and SSIM are spatial-based indexes, and larger PSNR and SSIM values indicate better denoising performance.
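As a concrete illustration of two of these metrics, the following pure-Python sketch computes PSNR and SAM for cubes stored as [S][H][W] nested lists. It rests on stated assumptions: PSNR is averaged over bands with a peak value of 1.0, SAM is reported as the mean angle in radians over all pixels, and `eps` is a small constant added to avoid division by zero. The exact evaluation protocol used in the experiments may differ.

```python
import math


def psnr(clean, est, peak=1.0):
    """Mean PSNR in dB over bands."""
    scores = []
    for clean_band, est_band in zip(clean, est):
        n = len(clean_band) * len(clean_band[0])
        mse = sum((c - e) ** 2
                  for c_row, e_row in zip(clean_band, est_band)
                  for c, e in zip(c_row, e_row)) / n
        scores.append(float("inf") if mse == 0 else 10 * math.log10(peak ** 2 / mse))
    return sum(scores) / len(scores)


def sam(clean, est, eps=1e-12):
    """Mean spectral angle in radians between per-pixel spectra."""
    S, H, W = len(clean), len(clean[0]), len(clean[0][0])
    total = 0.0
    for i in range(H):
        for j in range(W):
            a = [clean[s][i][j] for s in range(S)]
            b = [est[s][i][j] for s in range(S)]
            dot = sum(x * y for x, y in zip(a, b))
            norm_a = math.sqrt(sum(x * x for x in a))
            norm_b = math.sqrt(sum(y * y for y in b))
            cos = dot / (norm_a * norm_b + eps)
            total += math.acos(max(-1.0, min(1.0, cos)))  # clamp against rounding
    return total / (H * W)
```

SAM is insensitive to a uniform scaling of a pixel's spectrum, so it complements PSNR: an estimate that is too bright everywhere lowers PSNR but leaves SAM near zero.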
Incremental learning policy. We use an incremental learning policy for stable training, which effectively prevents the network from converging to a suboptimal minimum. Specifically, the training goes through three stages, and the same network is used in every stage. In the first stage, Gaussian noise with a fixed noise level (σ = 30, 50, 70) is employed in turn to build the training data. The network weights of each training phase are saved, and we load the most recently trained weights to initialize the network parameters for the next training phase, instead of retraining the network from scratch. Next, we use blind Gaussian noise (randomly selected from σ = 30, 50, 70) to construct the training data, and the network is again initialized with the weights trained in the first stage. Finally, complex noise is employed to produce the training data (from Case 1 to Case 5). As the noise complexity increases from stage to stage, the denoising task becomes harder for the model. Therefore, making full use of the network pre-trained in the previous stage helps to improve denoising performance and enables the network to converge better. To examine the effectiveness of the incremental learning strategy, we plot the loss curves of the training process in Fig. 6. The incremental learning strategy converges faster and achieves better denoising performance.
It should be noted that we make the model handle data with different noise types sequentially. Following the easy-to-difficult strategy of curriculum learning [69], we incrementally learn the noise from Case 1 to Case 4, thereby gradually improving the generalization and learning ability of the model. Finally, the model learns the complex noise, which includes all kinds of noise from Case 1 to Case 4. Hence, the final model is robust for complex noise removal. Through incremental learning, the proposed MAFNet achieves better denoising performance.
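The staged schedule can be summarized with a small sketch in which each stage is initialized from the weights of the previous one. The stage names and the `train_stage` callback are hypothetical placeholders for an actual training routine, and the sketch deliberately leaves out epochs and learning rates.

```python
def build_curriculum():
    """Ordered list of training stages, from easy to difficult."""
    stages = ["gaussian_sigma_%d" % sigma for sigma in (30, 50, 70)]
    stages.append("blind_gaussian")
    stages += ["complex_case_%d" % case for case in range(1, 6)]
    return stages


def run_curriculum(stages, train_stage, weights=None):
    """Train stage by stage, passing the latest weights on as initialization.

    train_stage(name, init) is a user-supplied function (hypothetical here)
    that trains on the named noise setting and returns the updated weights.
    """
    for name in stages:
        weights = train_stage(name, weights)
    return weights
```

Only the first stage starts from `None`, i.e. from scratch; every later stage resumes from what was learned before, which is the property that the loss curves in Fig. 6 are meant to evaluate.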
We initialize the learning rate at $10^{-4}$ and decay it every epoch to accelerate training. The training process of MAFNet took 100 epochs for Gaussian noise and 150 epochs for complex noise. The network is optimized using the Adam optimizer with the PyTorch framework on a machine with an NVIDIA RTX 2080 Ti GPU, an Intel(R) Xeon(R) E5 CPU at 2.50 GHz, and 32 GB of RAM.
Fig. 6. Training errors with and without incremental learning (horizontal axis: iteration). The blue curve denotes training on the mixture noise data without incremental learning. The red curve denotes training on the mixture noise data with incremental learning.

B. Experiments on Gaussian Noise Cases
This paper uses a single model to process various Gaussian noise levels. Specifically, additive Gaussian noise with different variances is imposed to produce a set of noisy HSI patches. The average evaluation indexes are listed in Table I. The best performance for each quality index is marked in bold. Figs. 7 and 8 show the denoising results under noise levels σ = 30 and σ = 70 to give a detailed comparison.
Through comparison, we observe that the proposed MAFNet obtains better performance metrics (PSNR, SSIM, and SAM) when dealing with Gaussian noise cases. This is because the proposed MAFNet takes multi-scale contextual information into account. Furthermore, benefiting from the AIN module and co-attention fusion, MAFNet also achieves better denoising results than HSID-CNN, MemNet, and QRNN3D. In Figs. 7 and 8, we select one band to show the denoising results. It can be seen that the proposed MAFNet effectively reduces the Gaussian noise while precisely preserving the basic texture details of the

C. Experiments of Complex Noise Removal on ICVL Dataset
As mentioned before, the model trained at the final stage of training is used to deal with the five complex noise cases simultaneously. The five types of complex noise are non-i.i.d. Gaussian noise, Gaussian + stripe noise, Gaussian + deadline noise, Gaussian + impulse noise, and mixture noise. We conducted experiments on the ICVL dataset for complex noise removal. The visual results of denoising under complex noise conditions are shown in Fig. 9, and the corresponding quantitative values are listed in Table II. The results show that our MAFNet significantly outperforms the other methods. By comparing the denoising performance indicators in Table II, it can be observed that our MAFNet performs better than LRMR and LRTV, since they are low-rank matrix-based methods and some basic structures are lost in the denoising process. Furthermore, our MAFNet performs better than the other deep learning-based methods (MemNet, HSID-CNN, and QRNN3D). This indicates that multi-scale feature exploitation and contextual information integration help the network to capture more intrinsic characteristics of the HSI and to retain more structural information about the input image. As shown in Fig. 9, our MAFNet not only removes complex noise, but also retains the structure and spatial details. Moreover, compared with other methods, the proposed MAFNet generates HSIs with a more natural and vivid appearance and with better global contrast.

D. Experiments of Complex Noise Removal on CAVE Dataset
We conducted complex noise removal experiments on the CAVE dataset [63] to verify the denoising performance of MAFNet. Each image in the dataset is acquired at 10 nm wavelength intervals in the range of 400-700 nm. We divided the dataset into two sets, with 20 images for training and 12 images for testing. Table IV presents the quantitative evaluation results of different methods. We compare the proposed MAFNet with BM4D [48], OLRT [70], NGMeet [71], MemNet [66], HSID-CNN [36], and QRNN3D [58]. It can be observed that the proposed MAFNet achieves the highest PSNR value, which demonstrates its superior denoising performance.
It should be noted that we employ state-of-the-art low-rank tensor recovery models (NGMeet [71] and OLRT [70]) on the CAVE dataset. These low-rank tensor methods can effectively utilize both the spatial-spectral information, and preserve the high-dimensional spatial and spectral structure information in HSIs. NGMeet learns the orthogonal basis matrix and reduced image, which produces impressive recovered images. OLRT

E. Experiments on Remote-Sensing HSI Datasets
To verify the robustness and denoising performance of the proposed MAFNet, we conduct extensive experiments on three remote-sensing HSI datasets. The first dataset is the Pavia University dataset. It contains 103 bands, and the spatial size of the image is 610 × 340 pixels. It was captured by the reflective optics system imaging spectrometer sensor (ROSIS-3) over Pavia University, Italy. The second dataset is the Urban dataset. It was captured by the HYDICE sensor, which provides 210 bands ranging from 400 nm to 2500 nm. To further verify the denoising performance of MAFNet on real hyperspectral noise, we introduce the Indian Pines dataset, which was captured by the AVIRIS sensor and contains 220 bands with a spatial size of 145 × 145 pixels.
1) Results on the Pavia University Dataset with Mixture Noise. We added mixture noise to the Pavia University dataset, and the experimental results are listed in Table III. It can be seen that the proposed model achieves the best quantitative metrics. The corresponding visual results are provided in Fig. 10, and the PSNR and SSIM values within different bands of the restored HSI are depicted in Fig. 11. Our method not only effectively removes the complex noise, but also preserves the high-frequency texture details. Furthermore, the mean normalized digital number curves obtained by different methods are shown in Fig. 12, which demonstrates that the proposed MAFNet effectively removes the complex noise without introducing obvious spectral distortion.
As demonstrated in Fig. 11, the proposed method achieves the best PSNR values on almost all spectral bands. We find that nearly all the methods perform differently across spectral bands. This is mainly because the denoising difficulty varies from band to band: the intrinsic noise levels of the spectral bands differ and, owing to the characteristics of the hyperspectral sensor, so do their spatial details.
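The band-wise comparison discussed above can be illustrated with a minimal sketch that computes one PSNR value per spectral band. The function name `bandwise_psnr`, the (bands, height, width) layout, the [0, 1] data range, and the synthetic band-dependent noise levels are assumptions made for illustration and are not taken from the paper's evaluation code.

```python
import numpy as np

def bandwise_psnr(clean, restored, data_range=1.0):
    """Return one PSNR value (in dB) per spectral band.

    clean, restored: arrays of shape (bands, height, width) on the same scale.
    data_range: assumed peak signal value (1.0 for data normalized to [0, 1]).
    """
    clean = np.asarray(clean, dtype=np.float64)
    restored = np.asarray(restored, dtype=np.float64)
    # Mean squared error computed separately for every band.
    mse = np.mean((clean - restored) ** 2, axis=(1, 2))
    # Guard against division by zero for bands that match exactly.
    mse = np.maximum(mse, 1e-12)
    return 10.0 * np.log10(data_range ** 2 / mse)

# Synthetic stand-in for an HSI cube (assumption: 4 bands, 32 x 32 pixels).
rng = np.random.default_rng(0)
clean = rng.random((4, 32, 32))
# Hypothetical band-dependent noise levels, mimicking unequal difficulty.
sigmas = np.array([0.01, 0.05, 0.10, 0.20])
noisy = clean + sigmas[:, None, None] * rng.standard_normal(clean.shape)

psnr = bandwise_psnr(clean, noisy)  # one value per band
mpsnr = psnr.mean()                 # average over bands
```

In this toy setting the PSNR drops as the per-band noise level rises, which is the kind of band-to-band variation a plot such as Fig. 11 makes visible.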
2) Results on the Urban Dataset with Mixture Noise. To further validate the efficacy of MAFNet in denoising, we corrupted the Urban dataset with severe noise. Figs. 13 and 14 demonstrate that deep learning-based methods are effective in removing such severe noise. Notably, the proposed MAFNet achieves the best denoising performance among all the methods.
3) Results on the Indian Pines Dataset with Real-World Noise. Some bands in the Indian Pines dataset are seriously corrupted by the atmosphere and water, and are polluted by complex noise. We show the denoising results of various methods on this dataset in Fig. 15. It can be observed that MAFNet still achieves satisfactory denoising performance on real hyperspectral noise. Furthermore, it performs better in preserving spatial details.
The ablation results for the co-attention and AIN modules are listed in Table VI. It is evident that the combination of co-attention and AIN achieves the best denoising performance.
The AIN module provides semantic guidance via channel-wise normalization and pixel-wise affine transformation. The transform parameters γ and β are visualized in Fig. 16. In the AIN module, γ calibrates the input feature and highlights the important regions. It can be observed that, in the low-resolution branch, γ highlights the object boundary and thus transfers the basic structure to the high-resolution branch. β serves as complementary information for feature calibration in the AIN module, providing more details to complete the denoising task.
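The two steps named above, channel-wise normalization followed by a pixel-wise affine transformation, can be sketched as follows. This is a simplified illustration rather than the authors' implementation: the function name `ain_modulate` is hypothetical, γ and β are passed in as ready-made maps (how the network predicts them is not shown), and the "important region" is invented for the demonstration.

```python
import numpy as np

def ain_modulate(feat, gamma, beta, eps=1e-5):
    """Normalize each channel, then apply a per-pixel scale and shift.

    feat:        feature map of shape (channels, height, width)
    gamma, beta: per-pixel transform maps with the same shape as feat
    """
    feat = np.asarray(feat, dtype=np.float64)
    # Channel-wise statistics over the spatial dimensions.
    mean = feat.mean(axis=(1, 2), keepdims=True)
    var = feat.var(axis=(1, 2), keepdims=True)
    normalized = (feat - mean) / np.sqrt(var + eps)
    # Pixel-wise affine transformation: gamma scales, beta shifts.
    return gamma * normalized + beta

rng = np.random.default_rng(0)
high_res = 3.0 + 2.0 * rng.standard_normal((8, 16, 16))

# Hypothetical gamma that emphasizes a central region; beta left at zero.
gamma = np.ones_like(high_res)
beta = np.zeros_like(high_res)
gamma[:, 4:12, 4:12] = 2.0

out = ain_modulate(high_res, gamma, beta)
# With gamma = 1 and beta = 0 the module reduces to plain normalization.
identity = ain_modulate(high_res, np.ones_like(high_res), np.zeros_like(high_res))
```

Because γ varies per pixel, regions with larger γ are amplified relative to the rest of the normalized feature map, which matches the role of highlighting important regions described in the text.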

Feature Fusion in Co-Attention Module. Here, we discuss the multiscale feature fusion in the co-attention module. In the proposed MAFNet, multiscale features are fused by element-wise summation. We also designed two other schemes, and the experimental results on the Pavia University dataset are shown in Table VII. "Concat" uses concatenation for multiscale feature fusion and employs a 1 × 1 convolution layer to reduce the channels. "Multiply" uses element-wise multiplication for feature fusion. Compared with "Concat" and "Multiply", the proposed method achieves slightly better denoising performance.
Split Attention and Self-Calibration. Two attention mechanisms are employed in the co-attention module: split attention and self-calibration. To verify the effectiveness of both mechanisms, we design three variants, and the experimental results on the Pavia University dataset are shown in Table VIII. "Split" uses only the concatenation and split part of the co-attention module, and the self-calibration is removed. "C-Attn" uses channel attention [72] instead of the concatenation and split part of the co-attention module, and the self-calibration is removed. In the proposed MAFNet, both split attention and self-calibration are used. It can be observed that "Split" slightly outperforms "C-Attn", since multiscale features are adaptively emphasized by split attention. The proposed MAFNet performs the best by combining split attention and self-calibration.
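The three multiscale fusion schemes compared in Table VII (element-wise summation, "Concat", and "Multiply") can be sketched as below. The sketch assumes that the features from all scales have already been resampled to a common shape; the function names, tensor sizes, and random weights are hypothetical and only illustrate the operations, not the trained model.

```python
import numpy as np

def fuse_sum(features):
    """Element-wise summation of same-shaped feature maps."""
    return np.sum(features, axis=0)

def fuse_multiply(features):
    """Element-wise multiplication of same-shaped feature maps."""
    return np.prod(features, axis=0)

def fuse_concat(features, weight, bias):
    """Concatenate along channels, then reduce channels with a 1 x 1 convolution.

    features: list of n arrays, each of shape (C, H, W)
    weight:   array of shape (C, n * C), the 1 x 1 convolution kernel
    bias:     array of shape (C,)
    """
    stacked = np.concatenate(features, axis=0)  # (n * C, H, W)
    # A 1 x 1 convolution is a per-pixel linear map over the channel axis.
    return np.einsum("oc,chw->ohw", weight, stacked) + bias[:, None, None]

rng = np.random.default_rng(0)
C, H, W, n = 8, 16, 16, 3
features = [rng.standard_normal((C, H, W)) for _ in range(n)]

summed = fuse_sum(features)
multiplied = fuse_multiply(features)
weight = 0.1 * rng.standard_normal((C, n * C))
concat = fuse_concat(features, weight, np.zeros(C))

# Summation is the special case of "Concat" whose kernel is [I, I, I].
identity_kernel = np.hstack([np.eye(C)] * n)
concat_as_sum = fuse_concat(features, identity_kernel, np.zeros(C))
```

The last two lines show that summation corresponds to one fixed choice of the 1 × 1 kernel, so "Concat" is the more flexible scheme at the cost of extra parameters; the reported results suggest that this flexibility does not translate into better denoising here.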
Global Gradient Regularizer. In our loss function, λ is a critical parameter that weights the global gradient regularizer and thus affects the denoising performance. We evaluate the denoising performance with different values of λ on the ICVL dataset while keeping the network unchanged. The results are shown in Table V, and it can be observed that the proposed MAFNet reaches the best PSNR value when λ = 0.010. Therefore, in our implementation, λ is set to 0.010.
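To make the role of λ concrete, the following sketch combines a reconstruction term with a λ-weighted gradient term. The exact form of the regularizer is not given in this section, so everything here is an assumption for illustration: an L1 reconstruction term, forward differences along all three axes of the cube, and the hypothetical names `gradient_l1` and `denoising_loss`.

```python
import numpy as np

def gradient_l1(pred, target):
    """Mean absolute difference between forward differences of pred and target.

    Assumption: gradients are taken along every axis of a (bands, H, W) cube.
    """
    total = 0.0
    for axis in range(pred.ndim):
        grad_pred = np.diff(pred, axis=axis)
        grad_target = np.diff(target, axis=axis)
        total += np.mean(np.abs(grad_pred - grad_target))
    return total

def denoising_loss(pred, target, lam=0.010):
    """Assumed form: L1 reconstruction term plus lam times the gradient term."""
    recon = np.mean(np.abs(pred - target))
    return recon + lam * gradient_l1(pred, target)

rng = np.random.default_rng(0)
target = rng.random((6, 16, 16))
noisy_pred = target + 0.05 * rng.standard_normal(target.shape)
shifted_pred = target + 0.1  # constant offset, identical gradients

loss_noisy = denoising_loss(noisy_pred, target, lam=0.010)
loss_shifted = denoising_loss(shifted_pred, target, lam=0.010)
```

Under these assumptions a constant offset is penalized only by the reconstruction term, whereas noisy predictions are penalized by both terms, and a larger λ shifts the emphasis toward matching edges and spectral transitions.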
Number of Channels. We explore the influence of the channel number on the denoising performance, and design three variants of MAFNet according to the channel number and model size, i.e., MAFNet-S, MAFNet-B, and MAFNet-L. The channel numbers of the three scales of MAFNet-S are set to (32, 64, 128). The corresponding channel numbers in MAFNet-B and MAFNet-L are set to (64, 128, 256) and (128, 256, 512), respectively. The denoising performance of the three variants and of other deep learning methods is reported in Table IX. It can be observed that MAFNet-L achieves the best PSNR value, but its computational complexity is large. MAFNet-S achieves good denoising performance while being computationally efficient. MAFNet-B achieves excellent denoising performance while its computational complexity remains within an acceptable range. Therefore, in our implementation, we set the channel numbers of the three scales to (64, 128, 256). Compared with other deep learning-based methods, our method is competitive in terms of FLOPs and the number of parameters. Specifically, although the FLOPs of our method are higher than those of HSID-CNN, the denoising performance of our method is much better. Compared with MemNet and QRNN3D, the proposed method is more computationally efficient.
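The trade-off between channel width and model size can be illustrated with a back-of-the-envelope parameter count. The sketch below counts a single hypothetical 3 × 3 convolution per scale with equal input and output widths; it is not the actual MAFNet architecture, and the measured figures are those in Table IX. Only the channel tuples are taken from the text.

```python
def conv_params(c_in, c_out, kernel=3, bias=True):
    """Number of learnable parameters in one 2-D convolution layer."""
    return kernel * kernel * c_in * c_out + (c_out if bias else 0)

# Channel widths of the three scales, as listed in the text.
variants = {
    "MAFNet-S": (32, 64, 128),
    "MAFNet-B": (64, 128, 256),
    "MAFNet-L": (128, 256, 512),
}

def per_scale_conv_params(channels):
    """Assumption: one 3 x 3 convolution per scale, c channels in and out."""
    return sum(conv_params(c, c) for c in channels)

counts = {name: per_scale_conv_params(ch) for name, ch in variants.items()}
ratio_b_to_s = counts["MAFNet-B"] / counts["MAFNet-S"]
ratio_l_to_b = counts["MAFNet-L"] / counts["MAFNet-B"]
```

Because the weight count of a convolution grows with the product of input and output channels, doubling every width roughly quadruples the parameters of such a layer, which is consistent with the observation that MAFNet-L is considerably more expensive than MAFNet-B.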

V. CONCLUSION AND FUTURE WORK
In this paper, we present a multiscale adaptive fusion network for hyperspectral image noise reduction. Two key components contribute to improving the hyperspectral image denoising: a progressively multiscale information aggregation framework and a co-attention fusion module. Specifically, a set of multiscale images is generated and fed into a coarse-fusion network to exploit the contextual texture correlation. Thereafter, a fine fusion network exchanges information across the parallel multiscale subnetworks. Finally, the co-attention fusion module adaptively emphasizes informative features from different scales and reinforces the discriminative learning capability for denoising. Experiments on both synthetic and real HSI datasets verify the superiority of the proposed method over other state-of-the-art HSI denoising methods.
Although the MAFNet in this paper exhibits outstanding denoising performance, the utilization of multi-scale branches for feature extraction also results in an increase in the number of parameters and computational complexity. Consequently, in the future, we will concentrate on devising lightweight HSI denoising techniques.