Multispectral Image Noise Removal With Adaptive Loss and Multiple Image Priors Model

Multispectral image (MSI) denoising is a crucial preprocessing step for various subsequent image processing tasks, including classification, recognition, and unmixing. This article proposes a novel image denoising model that integrates both noise modeling and image prior modeling. Specifically, to account for the complexity and nonuniformity of the noise, a nonindependent identically distributed mixture of Gaussian model is employed for noise modeling, from which a weighted loss function is derived. The weights used in the loss function are adaptively learned from the noisy MSI and employed to adjust the denoising strength of each pixel. In addition, the model leverages image prior knowledge through a nonlocal low-rank matrix model that captures the spatial–spectral correlation and nonlocal spatial similarity priors of the image. Moreover, our model adopts a weighted spatial–spectral TV model to encode the local smoothness prior of the image. Both prior models are translated into regularization terms in the denoising model. The efficacy of the proposed method is demonstrated through both simulated and real image experiments.


I. INTRODUCTION
A MULTISPECTRAL image (MSI) is a 3-D image consisting of a set of 2-D images, each of which is the imaging result of a certain spectral band. MSIs are used extensively in mineral exploration, pharmaceutical counterfeiting, and food safety due to their ability to provide richer information than traditional 2-D images [1], [2], [3]. However, noise contamination is inevitable in MSIs due to sensor sensitivity, calibration errors, and physical mechanisms. The noise distribution is complex and the noise intensity varies across spectral bands [4], [5], [6], [7], leading to significant degradation in image quality and negative impacts on subsequent processing, such as classification, unmixing, and target detection. As a result, MSI noise reduction is a critical and challenging task within the MSI processing field.
In recent decades, there has been significant advancement in MSI denoising methods. These methods can be broadly classified into two categories: 1) deep learning-based methods and 2) traditional machine learning-based methods. The deep learning-based methods rely heavily on data and require the design of complex network structures to extract deep prior knowledge from images. For instance, Chang et al. [8] first introduced the deep convolutional neural network (CNN) for MSI denoising. Later, Yuan et al. [9] proposed a deep residual CNN with multiscale and multilevel feature representation for bandwise denoising. To encode the spatial-spectral correlation of images, Zhang et al. [10] presented a deep CNN incorporating spatial-spectral gradient information, and Shi et al. [11] proposed a 3-D attention network with two separate branches. Pan et al. [12] and Wei et al. [13] introduced a quasi-recurrent network to capture the correlation of spatial features across the spectral domain. Dong et al. [14] presented a modified 3-D U-net architecture, and Cao et al. [15] proposed a global reasoning network with three well-designed modules. More recently, some transformer-based approaches have been applied to HSI and have performed well [16], [17]. Although they provide outstanding denoising results, they lack theoretical support and often do not generalize well to new datasets. In contrast, the traditional machine learning-based methods do not rely on training data, usually have good theoretical support, and demonstrate better generalization. This type of denoising model typically consists of two primary components, i.e., the loss function term and the regularization term. The loss function term measures the deviation between the ground truth image and the observed image. Most denoising models use the l2 norm as the loss function, which is simple and convex, leading to fast and efficient algorithms for obtaining the globally optimal solution.
However, the l2-norm loss function assumes that the noise obeys an independent identically distributed (i.i.d.) Gaussian distribution, which deviates from the true MSI noise distribution, resulting in a lack of robustness in removing mixed noise. To address this issue, many denoising models have adopted a Gaussian plus sparse noise assumption, which treats the non-Gaussian noise as sparse noise and embeds it as a parameter to be learned within the model [18], [19], [20]. This improves the robustness of the algorithm and has thus been widely used in MSI denoising tasks. However, this noise model roughly treats all non-Gaussian noise as sparse noise, which still deviates from the real mixed noise distribution. To alleviate this issue, the mixture of Gaussian (MoG) noise model was proposed [21], [22], which can theoretically approximate any distribution as long as there are enough Gaussian components. The MoG model derives a weighted l2-norm loss function based on the maximum likelihood principle. However, the approximation ability of the MoG model is often limited by the small number of Gaussian components used in practice. To solve this problem, Cao et al. [23], [24] proposed the mixture of exponential power (MoEP) distribution model, which results in a weighted lp-norm loss function. In addition, Chen et al. [25] proposed a non-i.i.d. mixture of Gaussian (NMoG) noise model considering the non-i.i.d. statistical characteristics of MSI noise and derived a weighted l2-norm loss function. Further, Barron [26] proposed a more general noise model that yields a general loss function by incorporating a set of robust loss functions.
Another critical component of a denoising model is the regularization term, which is built on image prior knowledge. To date, the most significant priors that have been demonstrated include spatial-spectral correlation, nonlocal spatial similarity, and local smoothness. Extensive research has been dedicated to exploring more accurate and efficient models for these priors. For the spatial and spectral correlation prior, low-rank matrix decomposition models [18], [27], [28], [29] and low-rank tensor decomposition models [30], [31], [32], [33], [34] have been proposed. To incorporate the nonlocal spatial similarity prior, it is common practice to partition the image into small blocks, group them into subimage groups based on their similarity, and apply the model to each subimage group separately: BM4D [35] applies 4-D filtering to each subimage group, whereas Chang et al. [36] arranged each group into a matrix and then applied a low-rank matrix model. In addition, Xue et al. [32] and Zhang et al. [37] reorganized each subimage group into a 3-D tensor and a 4-D tensor, respectively, and applied a low-rank tensor model to it. For the local smoothness prior, a commonly used approach is to apply 2DTV band by band and then sum the results [38]. However, this method does not account for the smoothness of the image along the spectral dimension, so the 3DTV model [39] was transferred to MSI and named spatial-spectral total variation (SSTV) [7]. To address the large differences in pixel scale and noise intensity across spectral bands in MSI, an adaptive SSATV model was developed by introducing weights into the SSTV model to adjust the denoising strength of each pixel [40], [41]. In addition, to protect image boundary information, Chen et al. [42] presented an adaptive SSTV model. Furthermore, Peng et al. [43], [44] confirmed that the gradient domain of an image is also low-rank, which resulted in the development of the enhanced TV and CTV models. To achieve improved denoising performance, a denoising model often incorporates multiple prior models. These may include a combination of spatial-spectral correlation priors with local smoothness priors [45], [46], [47], or a combination of spectral correlation priors with nonlocal similarity priors [48]. Moreover, some models integrate all three priors completely or partially [7], [49], [50], [51], [52].
The denoising methods mentioned above, which incorporate multiple prior models, have demonstrated exceptional performance. However, to improve model efficiency, these methods often use relatively simple loss functions, such as the l2 norm. Unfortunately, noise in MSI is often complex and non-i.i.d., which limits the performance of these methods since they assume i.i.d. statistical characteristics. In this article, we propose a novel MSI denoising method that combines noise modeling with prior modeling to address this issue.
We begin by assuming a non-i.i.d. noise structure for MSI and use the NMoG model to describe it. Based on this noise model, we derive an adaptive weighted l2-norm loss function whose weights reflect the noise intensity of each pixel: a pixel with high noise intensity receives a small weight to reduce the adverse impact of strong noise on the model results, whereas a pixel with low noise intensity receives a large weight to protect the information in the original image from being distorted. The weight information can be efficiently and adaptively learned from noisy images using the variational expectation maximization (VEM) algorithm. Next, we build the regularization term by completely integrating the three image priors mentioned previously. Specifically, we use the nonlocal low-rank matrix model to capture the nonlocal similarity and spatial-spectral correlation priors. In addition, we adopt an edge-preserving total variation model to encode the local smoothness prior. Finally, we develop an effective ADMM algorithm to solve the denoising model. We validate the effectiveness of our proposed method on both synthetic and real HSI datasets, and our results indicate that our approach achieves competitive or superior performance compared with other state-of-the-art methods.

II. NOTATIONS AND PRELIMINARIES
A tensor is a multidimensional array represented by a decorated letter, such as X ∈ R^(I1×I2×···×IN), and its elements are denoted by X_(i1,i2,...,iN). In addition, matrices, vectors, and scalars are represented by nonbold uppercase letters X, bold lowercase letters x, and nonbold lowercase letters x, respectively.
The MSI data cube can be treated as a 3-D tensor X ∈ R^(M×N×S) with two spatial modes and a spectral mode, where M, N, and S represent the spatial height, spatial width, and number of spectral bands, respectively. The MSI tensor X can be unfolded into a matrix along the spectral mode, denoted as X_(3), with the element (X_(3))_(l,s) corresponding to the element X_(i,j,s), where l indexes the spatial location (i, j). The observed MSI is often corrupted by complex types of noise, including Gaussian noise, stripe noise, deadline noise, and others. To simplify the noise model, the noise is assumed to be additive, and therefore, the degradation model can be expressed as Y = X + E, where Y represents the noisy MSI tensor, X is the clean MSI tensor, and E is the noise tensor.
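The spectral-mode unfolding and the additive degradation model can be sketched as follows (an illustrative NumPy sketch; the column-major index convention l = i + j·M is an assumption, and all names are hypothetical):

```python
import numpy as np

def unfold_mode3(X):
    """Unfold an M x N x S MSI tensor along the spectral mode.

    Row l of the result collects the spectrum of one spatial location
    (i, j); the output has shape (M*N, S). Column-major ('F') order is
    assumed, so l = i + j*M.
    """
    M, N, S = X.shape
    return X.reshape(M * N, S, order="F")

# Additive degradation model: Y = X + E
rng = np.random.default_rng(0)
X = rng.random((4, 5, 3))                 # toy "clean" MSI
E = 0.01 * rng.standard_normal(X.shape)   # noise tensor
Y = X + E                                 # observed noisy MSI

X3 = unfold_mode3(X)
```

The same reshape (with the same memory order) applied in reverse folds the matrix back into the tensor.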
The distribution of E is complex and unknown. To model this type of distribution, a mixture of Gaussian (MoG) distribution is often used. However, since the noise types and intensities can vary significantly between bands, a more effective approach is to apply the MoG model to the noise of each band. This results in different model parameters for each band, with the parameters of all MoG models generated from the same prior distribution. This approach is referred to as NMoG and can be expressed as follows [25]: where α_sk and τ_sk are the proportion and precision of the kth Gaussian component for the sth band, with the constraint that Σ_{k=1}^{K} α_sk = 1, and K is the total number of Gaussian components. To capture the common properties of noise across bands, the precision τ_s is assumed to be sampled from the following prior distribution: where Gam(·) represents the Gamma distribution, and η_0, μ_0, and ν_0 are the hyperparameters.

III. PROPOSED METHOD
In this section, we propose a new denoising model, which falls under the category of traditional machine learning and has the following general form: min_X Loss(Y, X) + R(X), where Loss(Y, X) represents the loss function, which measures the deviation between the observed MSI and the ground truth, and R(X) is the regularization term, which encodes the prior information of the MSI. We introduce the proposed method in detail from two aspects: 1) the loss function and 2) the regularization term.

A. Adaptive Loss Function
To construct a more appropriate loss function, it is crucial to have a comprehensive understanding of the noise and to reasonably model its distribution characteristics, which in turn helps to measure the deviation between the observed data and the ground truth. Moreover, since the noise at each data point can vary, the loss function must adapt to the specific noise characteristics. To meet these requirements, we select the NMoG noise model and use the VEM algorithm to estimate the model parameters, which enables us to derive an explicit loss function suitable for our purposes. We slightly simplify the NMoG model by assuming that the noise is zero-mean, and we introduce a hidden variable Z into the noise model. The NMoG model can then be expressed as follows: where Mul(·) and Dir(·) represent the multinomial distribution and the Dirichlet distribution, respectively. The maximization of the marginal likelihood p(Y|X) can be transferred to the maximization of the variational lower bound (ELBO) as follows [53]: We employ the variational inference algorithm to solve for the approximate posterior distributions of the parameters involved in (6). The estimated posterior distribution is assumed to have the following factorized form: The posterior distribution of the parameter τ_s is updated as follows: where the parameters involved in the estimated posterior are obtained as follows: where ⟨·⟩ represents the expectation operator. The posterior distribution of the latent variable Z is as follows: where the closed-form solution of the corresponding distribution parameter is as follows: The posterior distribution of the mixing proportion α_s is as follows: where μ = μ_0 + c_0 KS and ν = ν_0 + Σ_{s,k} ⟨τ_sk⟩. Next, we present the necessary expectations required for the above update equations, based on the current posterior distributions. The details are as follows: where ψ(·) is the digamma function.
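The core mechanism — responsibilities and precisions learned from the noise estimate yielding per-pixel weights — can be illustrated with a deliberately simplified stand-in: a single-band, i.i.d., zero-mean MoG fitted with plain EM rather than the paper's full NMoG-VEM procedure. All names and initializations are illustrative assumptions:

```python
import numpy as np

def mog_weights(residual, K=2, iters=30):
    """Fit a zero-mean K-component Gaussian mixture to a band's residual
    (noise estimate) with plain EM and return a per-pixel weight map.

    Simplified i.i.d. analogue of the NMoG/VEM step in the text: the
    weight is the expected precision under the responsibilities, so
    high-noise pixels receive small weights.
    """
    e = residual.ravel()
    alpha = np.full(K, 1.0 / K)                 # mixing proportions
    var = np.var(e) * (2.0 ** np.arange(K))     # spread-out initial variances
    for _ in range(iters):
        # E step: responsibilities r[n, k]
        logp = (np.log(alpha) - 0.5 * np.log(2 * np.pi * var)
                - 0.5 * e[:, None] ** 2 / var)
        r = np.exp(logp - logp.max(axis=1, keepdims=True))
        r /= r.sum(axis=1, keepdims=True)
        # M step: update proportions and variances
        Nk = r.sum(axis=0) + 1e-12
        alpha = Nk / len(e)
        var = (r * e[:, None] ** 2).sum(axis=0) / Nk + 1e-12
    w = (r / var).sum(axis=1)                   # expected precision per pixel
    return w.reshape(residual.shape)

rng = np.random.default_rng(1)
res = 0.05 * rng.standard_normal((32, 32))
res[:4, :4] += rng.standard_normal((4, 4))      # a patch of strong noise
W = mog_weights(res)
```

In the full method, one mixture is fitted per band and the component parameters share a common prior, but the weight-from-precision logic is the same.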
E Step: We calculate the evidence lower bound (ELBO) of the log-likelihood p(Y|X) as in (6). Note that we only focus on the components related to X and treat the others as constants. Therefore, the ELBO can be written as a function of X with respect to X_old, denoted as Q(X, X_old). M Step: We maximize the ELBO function Q(X, X_old) with respect to X. The loss function of this optimization problem can be reformulated as Loss(Y, X) = ||W^(1/2) ⊙ (Y − X)||_F^2, where ⊙ represents the elementwise product and W is the weight tensor. The loss function obtained here takes the form of a weighted l2 norm. The weights are calculated adaptively from the noise estimated during the denoising process: high-intensity noise receives a small weight, reducing the negative impact of strong noise on the model results, whereas low-intensity noise receives a large weight to preserve the structural information in the original image.
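The weighted l2 loss above is straightforward to evaluate; a minimal sketch (the function name is hypothetical):

```python
import numpy as np

def weighted_l2_loss(Y, X, W):
    """Weighted l2 loss: each squared residual is scaled by its adaptively
    learned weight, so strongly noisy pixels contribute less to the fit."""
    return float(np.sum(W * (Y - X) ** 2))

rng = np.random.default_rng(2)
Y = rng.random((8, 8, 3))
X = rng.random((8, 8, 3))
W = np.ones_like(Y)
# With unit weights the loss reduces to the ordinary squared Frobenius norm.
loss = weighted_l2_loss(Y, X, W)
```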

B. Multiple Image Priors Model-Based Regularization
As mentioned previously, the nonlocal spatial similarity, the spatial and spectral correlation, and the local smoothness are the most significant prior knowledge for MSI restoration. The regularization term in our denoising model encodes all of these priors and takes the form R(X) = R_NSC(X) + R_LS(X), where R_NSC(X) represents the regularization related to the nonlocal spatial similarity and spatial-spectral correlation priors, and R_LS(X) represents the regularization related to the local smoothness prior.
To model the nonlocal spatial similarity prior, it is necessary to divide the image into overlapping patches for matching and grouping. To simultaneously encode the spatial-spectral correlation prior, the MSI is divided into overlapping full-band patches, ensuring that the correlation is present in each patch. This allows a low-rank model to be applied to each group of full-band patches, enabling modeling of both the nonlocal spatial similarity prior and the spatial-spectral correlation prior. Note that Chang et al. [36] demonstrated that, for MSI, the low-rank property between patches is much more significant than that between bands. To simplify the model, each patch in a group is vectorized to form a matrix, which is then subjected to low-rank matrix decomposition. Therefore, the regularization related to the nonlocal spatial similarity and spatial-spectral correlation priors can be expressed as Σ_{i=1}^{P} ||P_i(X)||_(w,*), where P_i represents the operator that groups the patches similar to the ith patch and rearranges them into a matrix, P denotes the number of patches, and ||·||_(w,*) represents the weighted nuclear norm. The local smoothness prior is not only present in the two spatial modes but also extends to the spectral mode. To capture this prior, the spatial-spectral total variation (SSTV) model is commonly used. Due to significant variations in noise intensity and pixel values, it is necessary to impose TV regularization constraints of different strengths on each pixel. To address this, we introduce a weighted SSTV model, ||S ⊙ (DX)||_1, which applies a varying regularization strength to each pixel, where the operator D = [D_1, D_2, D_3] represents the 3-D first-order forward finite-difference operator, and D_1, D_2, and D_3 correspond to the first-order finite-difference operators along the spatial horizontal, spatial vertical, and spectral modes, respectively.
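A weighted SSTV term of this kind can be sketched with forward differences (periodic boundaries are an assumption, chosen to match the FFT-based solver used later; names are illustrative):

```python
import numpy as np

def forward_diff(X, axis):
    """First-order forward finite difference with periodic boundary."""
    return np.roll(X, -1, axis=axis) - X

def weighted_sstv(X, S):
    """Weighted SSTV: sum over the two spatial modes and the spectral mode
    of || S_d ⊙ D_d X ||_1, where S is a list of three weight tensors."""
    return sum(float(np.sum(np.abs(S[d] * forward_diff(X, d))))
               for d in range(3))

rng = np.random.default_rng(3)
X = rng.random((16, 16, 5))
S = [np.ones_like(X)] * 3
flat = np.ones_like(X)   # a constant image has zero total variation
```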
The parameter S = [S_1, S_2, S_3] is the weight tensor, which is used to preserve image texture while enforcing the sparsity constraint on the pixels. We estimate the weight tensor from the gradient maps of the image. Although a large amount of noise has been removed from the denoised image X̂, its SNR is not high enough to extract the texture information of the image. To alleviate this, we carry out a low-rank approximation with the rank set to 1 to further denoise X̂ and obtain X_0 with a higher SNR, so that the texture information can be effectively obtained from the gradient maps G_i of X_0. Therefore, the weight tensor is obtained from the restored MSI during the denoising process as follows [42]: where LR_1 is the low-rank approximation operator using a low-rank matrix decomposition model with a fixed rank of 1, X̂ represents the restored MSI during the denoising process, and the threshold δ is utilized to prevent undesirably large values in the weight tensor. Typically, it is set to the first quartile of G_i.
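This weight-map construction can be sketched as follows. The exact form of the update is assumed here to be S_d = 1/(|G_d| + δ), with G_d the gradient maps of the rank-1 approximation and δ the first quartile of |G_d|; this is a hedged reading of the text, not the paper's verbatim formula:

```python
import numpy as np

def rank1_approx(Xhat):
    """LR_1: rank-1 approximation of the spectrally unfolded MSI, a heavily
    smoothed image whose gradients expose texture rather than noise."""
    M, N, S = Xhat.shape
    U, s, Vt = np.linalg.svd(Xhat.reshape(M * N, S, order="F"),
                             full_matrices=False)
    X0 = s[0] * np.outer(U[:, 0], Vt[0])
    return X0.reshape(M, N, S, order="F")

def weight_tensor(Xhat):
    """Assumed weight rule: S_d = 1 / (|G_d| + delta), small weight on
    strong gradients (texture/edges), large weight on flat regions."""
    X0 = rank1_approx(Xhat)
    weights = []
    for d in range(3):
        G = np.abs(np.roll(X0, -1, axis=d) - X0)   # gradient map of X0
        delta = np.quantile(G, 0.25) + 1e-12       # first-quartile threshold
        weights.append(1.0 / (G + delta))
    return weights

rng = np.random.default_rng(4)
Xhat = rng.random((12, 12, 4))
Sw = weight_tensor(Xhat)
```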

C. Denoising Algorithm
Based on the adaptive loss function and the regularization terms introduced above, the denoising model can be represented as follows: min_X ||W^(1/2) ⊙ (Y − X)||_F^2 + λ_1 ||S ⊙ (DX)||_1 + λ_2 Σ_{i=1}^{P} ||P_i(X)||_(w,*) (21) where λ_1 and λ_2 are the tradeoff parameters.
To solve this model, auxiliary variables are introduced, and thus, (21) can be rewritten as the following optimization problem: The ADMM methodology can then be used to minimize this problem by transforming it into an augmented Lagrangian function as follows: where μ_1, μ_2, and μ_3 are penalty parameters, and L_1, L_2, and L_3 are the Lagrange multipliers.
The ADMM framework provides an alternative approach to optimize the augmented Lagrangian function by fixing all variables except one and optimizing it.
Update A: The subproblem for optimizing the variable A is derived from (23) by removing the terms not related to A. This problem can be treated as solving the linear system (25), where D* represents the adjoint operator of D. The fast Fourier transform is adopted to solve this system efficiently, and the closed-form solution for A can be obtained by the following equations: where |·|^2 represents the elementwise square, the operators fftn(·) and ifftn(·) are the fast 3-D Fourier transform and its inverse, respectively, and 1 is the tensor with all elements equal to 1.
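The FFT trick above has the following shape. This sketch assumes periodic boundaries and reduces the subproblem to a generic system (μ_1 I + μ_2 D*D)A = rhs; the exact right-hand side assembled from the multipliers in (25) is omitted:

```python
import numpy as np

def solve_A(rhs, mu1, mu2, shape):
    """Solve (mu1*I + mu2*D^*D) A = rhs under periodic boundaries.
    D^*D is diagonalized by the 3-D FFT; its eigenvalues are the sum over
    the three first-difference kernels of |fftn(kernel)|^2."""
    denom = np.full(shape, mu1, dtype=complex)
    for axis in range(3):
        kernel = np.zeros(shape)
        idx = [0, 0, 0]
        kernel[tuple(idx)] = -1.0
        idx[axis] = 1
        kernel[tuple(idx)] = 1.0          # forward difference along `axis`
        denom += mu2 * np.abs(np.fft.fftn(kernel)) ** 2
    return np.real(np.fft.ifftn(np.fft.fftn(rhs) / denom))

rng = np.random.default_rng(5)
rhs = rng.random((8, 8, 4))
A = solve_A(rhs, mu1=1.0, mu2=0.1, shape=rhs.shape)
```

Because every operator in the system is circulant under the periodic assumption, the division in the Fourier domain is exact, not iterative.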
Update B: The subproblem with respect to B is as follows: This subproblem can be solved with the well-known soft-thresholding operator [54] as follows: where R_λ(·) represents the soft-thresholding operator with parameter λ. Update C_i: The subproblem concerning C_i is as follows: To solve this subproblem, we utilize the off-the-shelf WNNM algorithm [55]. The optimal solution for C_i is given by the following equations: where the diagonal elements of Σ correspond to the singular values of P_i(X) + (1/μ_3)L_3 arranged in decreasing order. In addition, Σ̂ is the diagonal matrix whose diagonal elements are calculated as follows: where the parameters α_1 and α_2 are expressed as follows: Update X: The subproblem with respect to X can be written as follows: where C = {C_i}_{i=1}^{P}, and the operator P^(-1) represents the inverse of P; its function is to arrange each patch back to its original position in the MSI while averaging the overlapping portions. The solution for X has the following closed form:
The multipliers are updated as follows:
After updating the denoised data, we update the weight tensor S by (20) and the weights involved in the loss function using (16). More information on this step can be found in Section III-A. The denoising algorithm is summarized in Algorithm 1.
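The alternating structure of Algorithm 1 can be illustrated on a deliberately simplified 1-D analogue — plain (unweighted) l2 loss plus 1-D TV, with no nonlocal term — so that every update has a short closed form. This is a miniature sketch of the ADMM loop, not the paper's full algorithm; all names are illustrative:

```python
import numpy as np

def soft(v, t):
    """Soft-thresholding R_t(v), the proximal operator of the l1 norm."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def admm_tv_1d(y, lam=1.0, mu=2.0, iters=200):
    """ADMM for min_x 0.5*||y - x||^2 + lam*||b||_1  s.t.  b = Dx,
    with periodic forward differences. Each sweep mirrors the paper's
    loop: FFT-diagonalized linear solve, soft-threshold, multiplier
    update."""
    n = len(y)
    h = np.zeros(n); h[0] = -1.0; h[1] = 1.0
    eig = np.abs(np.fft.fft(h)) ** 2            # eigenvalues of D^*D
    D = lambda x: np.roll(x, -1) - x            # forward difference
    Dt = lambda v: np.roll(v, 1) - v            # its adjoint
    b = np.zeros(n); l = np.zeros(n)
    x = y.copy()
    for _ in range(iters):
        rhs = y + Dt(mu * b - l)                # x-subproblem RHS
        x = np.real(np.fft.ifft(np.fft.fft(rhs) / (1.0 + mu * eig)))
        b = soft(D(x) + l / mu, lam / mu)       # b-subproblem
        l = l + mu * (D(x) - b)                 # multiplier update
    return x

# Piecewise-constant signal plus noise: the ADMM-TV loop flattens the noise.
rng = np.random.default_rng(6)
clean = np.repeat([0.0, 1.0, 0.3], 40)
noisy = clean + 0.1 * rng.standard_normal(clean.size)
denoised = admm_tv_1d(noisy, lam=0.5, mu=2.0)
```

The full algorithm additionally carries the weighted loss, the WNNM patch update, and the weight refreshes for S and W inside the same while-loop.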

IV. EXPERIMENTS
To demonstrate the effectiveness of our proposed denoising method, we compared it with several state-of-the-art denoising methods on both simulated and real MSI data: BM4D [35], TDL [30], LRTV [38], NMoG [25], LRTDTV [47], LLRT [36], and RCTV [56]. Specifically, BM4D is a classical method that utilizes block-matching and 4-D filtering for denoising. TDL utilizes the l2-norm loss function and a spatial-spectral correlation prior model. LRTV is based on the Gaussian plus sparse noise model, combined with the low-rank matrix decomposition model. NMoG is a popular method that uses the NMoG noise model and the low-rank matrix decomposition model. LRTDTV combines the Gaussian plus sparse noise model with the low-rank tensor decomposition prior model and SSTV regularization. LLRT uses the l2-norm loss function and combines the nonlocal similarity, spectral correlation, and local smoothness prior models. RCTV uses the Gaussian plus sparse noise model, as well as the local smoothness and spectral correlation prior models. Overall, the comparison involved a diverse range of denoising methods, allowing us to evaluate the performance of our proposed method against the current state of the art. All experiments were implemented in Matlab R2021a on a PC with a 3.4-GHz CPU and 32 GB of RAM.

A. Simulation Experiments
This experiment aims to evaluate the performance of our proposed denoising method quantitatively. We utilized two MSI datasets in the simulation experiments: the Balloons data from the CAVE dataset 1 [57] with a size of 512 × 512 × 31, and the Washington DC Mall dataset 2 with a size of 1208 × 307 × 191. We resized the Balloons data to 200 × 200 × 31 and cropped the main part of the Washington DC Mall dataset to obtain a size of 200 × 200 × 191. These datasets were used as clean MSI data since they contain no significant visual noise, and their gray values were normalized to [0, 1].
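The preprocessing described above (cropping the main part, normalizing to [0, 1]) amounts to something like the following sketch; the crop offset and min-max normalization are assumptions, and all names are illustrative:

```python
import numpy as np

def preprocess(cube, crop=200):
    """Crop the main spatial part and min-max normalize gray values to
    [0, 1], as done for the clean reference data (crop region assumed)."""
    sub = cube[:crop, :crop, :]
    return (sub - sub.min()) / (sub.max() - sub.min())

rng = np.random.default_rng(8)
raw = 1000.0 * rng.random((300, 300, 31)) + 50.0   # toy raw-sensor-scale cube
clean_ref = preprocess(raw)
```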
To simulate real noise cases, we added four kinds of noise to the clean MSI data. In the mixture noise case, Gaussian noise and stripe noise were added to the clean MSI; in addition, 40% (on the Balloons dataset) and 45% (on the Washington DC Mall dataset) of the bands were randomly selected to add impulse noise, with the percentage of impulse noise ranging from 50% to 70% (on the Balloons dataset) and from 90% to 100% (on the Washington DC Mall dataset), respectively. We conducted 20 repetitions of the noise addition and denoising experiments on the two datasets for each noise case. In these experiments, the maximum number of iterations of our method was set to 10, and the hyperparameters λ_1 and λ_2 were adjusted in each noise case. The parameter sensitivity is analyzed in Section IV-C. Five quantitative measurements were employed to evaluate the denoising performance: 1) MPSNR, the mean peak signal-to-noise ratio (PSNR) across bands; 2) MSSIM, the mean structural similarity (SSIM) across bands; 3) ERGAS, the Erreur Relative Globale Adimensionnelle de Synthese; 4) SAM, the spectral angle mapper; and 5) time, the runtime of each method. The average results of the 20 repeated experiments on the Balloons dataset are presented in Table I, with the best value of each measurement marked in bold. The LLRT method performs remarkably well in the i.i.d. Gaussian case since its noise assumption matches the actual noise well. However, as the noise complexity increases, the performance of LLRT tends to decrease; BM4D and TDL, which also use the l2-norm loss function, show a similar trend. On the other hand, methods utilizing the Gaussian plus sparse noise model, such as LRTV, LRTDTV, and RCTV, demonstrate more robust performance. However, their denoising performance is weaker than that of our method due to the better image prior models adopted in our denoising approach.
The NMoG method performs stably under all noise conditions, but its denoising measurements are worse than the others' since it models only the spectral correlation prior, which may not be the best choice for this dataset with only 31 bands. Overall, our method exhibits strong robustness and remarkable denoising performance compared with the other methods.
As a specific example, Fig. 1 provides a visual presentation of the denoising results of one experiment under the mixture noise case. To facilitate comparison, we enlarged a common region of each figure, marked by a red box. The figures demonstrate that our method visually achieves the best denoising performance compared with the other methods. Table II presents the denoising results on the Washington DC Mall dataset. It is worth noting that this dataset has a greater number of spectral bands and exhibits strong spectral correlation, making the NMoG method remarkably effective in all noise cases. On the other hand, the i.i.d. Gaussian noise model-based methods are unstable, while the Gaussian plus sparse noise model-based methods are relatively more stable. Our method achieves the best or second-best performance under all noise cases. The visual comparison results of one realization are illustrated in Fig. 2. From the figure, it is evident that our method outperforms the other methods in removing noise and preserving detailed information.
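Two of the metrics reported in Tables I and II, MPSNR and SAM, follow standard definitions and can be computed as below for images scaled to [0, 1] (a sketch; function names are illustrative):

```python
import numpy as np

def mpsnr(ref, est):
    """Mean PSNR across bands for MSIs with gray values in [0, 1]."""
    psnrs = []
    for b in range(ref.shape[2]):
        mse = np.mean((ref[:, :, b] - est[:, :, b]) ** 2)
        psnrs.append(10 * np.log10(1.0 / max(mse, 1e-12)))
    return float(np.mean(psnrs))

def sam(ref, est):
    """Mean spectral angle mapper (degrees) over all pixel spectra."""
    r = ref.reshape(-1, ref.shape[2])
    e = est.reshape(-1, est.shape[2])
    cos = np.sum(r * e, axis=1) / (np.linalg.norm(r, axis=1)
                                   * np.linalg.norm(e, axis=1) + 1e-12)
    return float(np.degrees(np.mean(np.arccos(np.clip(cos, -1.0, 1.0)))))

rng = np.random.default_rng(7)
ref = rng.random((16, 16, 8))
```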

B. Real Data Experiments
In this section, we evaluate the performance of our method on a real MSI dataset, namely, the AVIRIS Indian Pines dataset 3 with a size of 145 × 145 × 220. This dataset is corrupted by various types of noise, including Gaussian noise, stripes, atmospheric absorption, and other unknown noise. In these experiments, the maximum number of iterations of our method is set to 10, and the hyperparameters are set as λ_1 = 500 and λ_2 = 5.
Figs. 3-5 show the visual presentation of the denoising results on bands 103, 149, and 165, respectively. It is evident that the TDL method fails to denoise this dataset effectively, and the BM4D method clearly leaves residual noise. This is due to their overly simplistic noise assumptions and inadequate image prior modeling. Conversely, the other methods yield significant noise reduction. Compared with all the other methods, our approach exhibits the most substantial noise removal in the spatial domain and the least visual image distortion.
To facilitate further comparison, Fig. 6 displays the spectral signatures of a pixel located at (24, 88) before and after restoration. The horizontal axis denotes the band number, while the vertical axis represents the digital number value of the given location. Due to the presence of noise, the curve exhibits rapid fluctuations, as shown in Fig. 6(a). After denoising, the fluctuations are substantially suppressed. In addition, we observe that the NMoG, LLRT, and our method yield the most notable fluctuation suppression, consistent with the visual results depicted in Figs. 3-5.

C. Parameters Setting
In the proposed denoising model, the parameters λ_1 and λ_2 play a crucial role in balancing the loss term and the regularization terms. In our algorithm, we initially estimate the noise variance σ^2 using the initialized X and then scale the selected parameters by σ^2, i.e., λ_1 ← λ_1 σ^2 and λ_2 ← λ_2 σ^2; note that the parameter selection for λ_1 and λ_2 occurs before multiplying them by σ^2. To analyze the parameter sensitivity, we conducted denoising experiments on the Balloons dataset under various parameter settings in the mixture noise case. The results are presented as contour maps. In Fig. 7(a), the MPSNR value is plotted against λ_1 and λ_2. It is observed that MPSNR changes slowly with variations in λ_2, suggesting insensitivity to this parameter. Conversely, MPSNR changes rapidly with variations in λ_1, indicating sensitivity to this parameter. Similarly, Fig. 7(b) shows that MSSIM is also insensitive to λ_2 but relatively sensitive to λ_1; however, within a large range, MSSIM remains stable at a high level. Hence, λ_2 can be considered insensitive while λ_1 is sensitive. Furthermore, λ_1 represents the intensity of the TV regularization. Therefore, its value can be determined

V. CONCLUSION
In this article, we propose a novel MSI denoising model with an adaptive loss function and multiple image prior model-based regularization. Specifically, we apply the NMoG distribution to model the complex and unknown noise distribution, and then obtain the weighted l2-norm loss function by solving the noise model with the VEM algorithm. The weights involved in the loss function are adaptively and efficiently learned from the observed MSI. In addition, we comprehensively model the nonlocal spatial similarity, spatial-spectral correlation, and local smoothness priors and integrate them into the regularization term. Finally, we conducted simulation and real data experiments to demonstrate that the proposed method effectively reduces noise compared with state-of-the-art methods.