Tensor Decomposition-Inspired Convolutional Autoencoders for Hyperspectral Anomaly Detection

Anomaly detection from hyperspectral images (HSI) is an important task in the remote sensing domain. Considering the third-order structure of HSI, many tensor decomposition-based hyperspectral anomaly detection (HAD) models have been proposed and have drawn much attention during the past decades. However, as most tensor decomposition-based detectors operate directly on the original HSI, their detection accuracy is usually limited by the high dimensionality and noise corruption of the HSI. Benefiting from the strong feature extraction capacity of autoencoders (AE), an enhanced tensor decomposition-inspired convolutional AE for HAD, named TDNet, is proposed in this article to address those problems. Within the proposed TDNet, the traditional canonical polyadic (CP) tensor decomposition model is replaced by a deep neural network (DNN), which performs more stably and is more robust to noise. Specifically, a potential abnormal pixel removal strategy is first applied to obtain the background training set. Then, a DNN-based tensor decomposition-inspired convolutional AE, consisting of an encoder, a low-rank tensor decomposition network, and a decoder, is used to recover the background of the original HSI. Finally, the residual errors between the input HSI and the recovered background are used for anomaly detection. Extensive experiments demonstrate the superiority of TDNet in terms of both AUC values and ROC curves.


I. INTRODUCTION
HYPERSPECTRAL images (HSI) include hundreds of spectral bands with much more information than RGB images, and are widely used for change detection [1], [2], unmixing [3], [4], classification [5], [6], and anomaly detection [7]-[9]. Among these tasks, hyperspectral anomaly detection (HAD) has attracted much interest due to its significance in military surveillance and mineral exploration [10]. Generally, anomalies differ from their surrounding pixels in the spatial or spectral domain, and the goal of HAD is to separate the abnormal elements from the background.
In the past decades, researchers have proposed a large number of methods for HAD, which can be divided into traditional methods and deep-learning-based methods. The traditional methods mainly consist of statistical modeling-based methods, representation modeling-based methods, matrix or tensor decomposition modeling-based methods, and other methods [11]. The statistical modeling-based methods use statistical variables to estimate the background and detect anomalies, such as the Reed-Xiaoli (RX) detector [12] and the local RX (LRX) detector [13]. However, the assumed background distribution is often unrealistic for the complex image scenes encountered in the real world [14]. The representation modeling-based methods include sparse representation [15] and collaborative representation [16]; these methods assume that background pixels can be represented by a set of bases extracted from the input HSI, while anomalies cannot.
Within the matrix decomposition modeling-based methods, the background is generally assumed to be low-rank due to the redundancy of HSI, while anomalies are sparsely distributed in the background. Xu et al. [17] used dictionary learning and low-rank representation for HAD. Qu et al. [14] first proposed an abundance and dictionary-based low-rank decomposition method for HAD. Afterward, the tensor decomposition technique was introduced for HAD, which can explore the spatial-spectral features of the HSI [10], [18]-[20]. However, matrix or tensor decomposition methods may not be stable due to the high dimensionality and noise corruption of the HSI. Other methods include support vector data description (SVDD) [21], which assumes that the background of the HSI is enveloped by a minimum enclosing hypersphere while anomalies fall outside this hypersphere. In [22], Xie et al. used the structure tensor and guided filter (STGF) for HAD. In STGF, informative bands are selected by the structure tensor, and then the background is removed through an attribute filter. Moreover, the first selected band serves as the guidance image for a guided filter that refines the final detection map.
Recently, various deep-learning-based HAD methods have been proposed due to the high representation capability of deep neural networks (DNN). The deep-learning-based HAD methods mainly consist of convolutional neural network (CNN) based methods, autoencoder (AE) network based methods, and generative adversarial network (GAN) based methods. To the best of our knowledge, Li et al. [23] first introduced CNN into the HAD domain. The authors use transfer learning and reference data to measure the similarity of the test pixel, which is then used for anomaly detection. To further utilize the spatial and spectral information of HSI, Zhang et al. proposed the TCNNT detector [24]. TCNNT combines tensor theory and CNN in a joint unsupervised framework, and anomalies are detected by the difference between the test block and the local neighboring tensor. Nevertheless, the performance of CNN-based methods is limited by the lack of labeled training sets. The AE-based HAD methods are the main component of deep-learning-based HAD [25], which generally use reconstruction errors for anomaly detection. A sparse AE for HAD was proposed in [26], and the anomalies are determined under the two cases of their network. Wang et al. [27] designed an adaptive weighted loss to train a skip-connected AE, and the reconstruction errors are used for HAD. Lu et al. [7] introduced manifold learning into the AE architecture, so that the local structure of the latent representation can be preserved. Concurrent to this method, a graph AE was proposed in [28].
Besides, to better model the background or constrain the latent representation, Xie et al. broadened the prospects of HAD by introducing the GAN into this domain [8], [29]-[32]. An unsupervised discriminative reconstruction-constrained GAN for HAD, named HADGAN, was proposed in [30]. To learn the distribution of the background, the encodings are constrained to a Gaussian distribution. Meanwhile, Xie et al. [32] proposed an adversarial AE (AAE) for HAD, where features are learned by the AAE and anomalies are detected by the Mahalanobis distance.
Considering the third-order structure of HSI and the strong feature extraction capability of AE, a tensor decomposition-inspired convolutional AE, named TDNet, is proposed for HAD in this article. TDNet tries to recover the background of the input HSI from a limited training set, and anomalies are then represented as the residual errors between the input HSI and the recovered background. Specifically, TDNet mainly consists of three steps. First, a potential abnormal pixel removal strategy is built to obtain the background training set. Then, the training set is fed into the DNN-based decomposition network to recover the background of the input HSI. Note that the traditional CP decomposition is replaced by a DNN in this article, which performs more stably and is more robust to noise. Finally, the residual errors between the recovered background and the input HSI are employed to detect anomalies.
The major contributions of TDNet are threefold. 1) To the best of our knowledge, the CP-inspired convolutional AE (TDNet) is the first work that uses a deep convolutional AE for tensor decomposition in the HAD domain. 2) To alleviate blurry reconstructions, an SSIM term is added to the loss function, which further improves the detection accuracy. 3) A large number of experiments are conducted on four datasets, which demonstrate the superiority of the proposed TDNet.
The rest of this article is organized as follows: Section II reviews some related works. Section III presents the proposed TDNet in detail. The parameter analysis and detection results are shown in Section IV. Finally, Section V concludes this article.

II. RELATED WORKS

A. Statistical Based Methods
As a classic statistical-based method, the RX detector [12] scores each pixel vector x as

RX(x) = (x − μ)^T Γ^{-1} (x − μ)

where μ is the mean vector of the samples, Γ denotes the covariance matrix of the image, and N represents the total number of pixels. The RX detector can detect anomalies effectively when the background is close to a multivariate normal distribution, but it can hardly find anomalies in complex image scenes. There are two main directions for improving the RX detector. One is to enhance the adaptive selection of the background [11], e.g., the weighted RX detector (WRX), which improves anomaly detection accuracy by giving smaller weights to potential anomalous pixels and larger weights to background pixels [33]. The other is to remove anomaly contamination and obtain a relatively pure background [34]. In [34], the detector eliminates the influence of potential abnormal pixels on the final detection result through multiple iterations, and stops iterating when two adjacent results are consistent.
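The RX score above, a Mahalanobis distance between each pixel and the global background statistics, can be sketched in a few lines of NumPy. The function name and the use of a pseudo-inverse for numerical stability are illustrative choices, not taken from the paper's implementation:

```python
import numpy as np

def rx_detector(hsi):
    """Global RX anomaly scores for an HSI cube of shape (H, W, B).

    Implements the Mahalanobis-distance form
    RX(x) = (x - mu)^T Gamma^{-1} (x - mu) per pixel.
    """
    h, w, b = hsi.shape
    pixels = hsi.reshape(-1, b).astype(np.float64)   # N x B spectra
    mu = pixels.mean(axis=0)                         # mean spectrum
    centered = pixels - mu
    gamma = centered.T @ centered / len(pixels)      # B x B covariance
    gamma_inv = np.linalg.pinv(gamma)                # pseudo-inverse for stability
    # quadratic form (x - mu)^T Gamma^{-1} (x - mu) for every pixel at once
    scores = np.einsum("nb,bc,nc->n", centered, gamma_inv, centered)
    return scores.reshape(h, w)
```

A pixel whose spectrum deviates strongly from the global statistics receives a large score, as a quick synthetic check confirms.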

B. Representation Based Methods
The CRD detector [16] assumes that a background pixel can be approximately represented by its neighborhoods, while anomalies cannot. The objective function is

min_α ‖y − X_s α‖_2^2 + λ‖α‖_2^2

where y represents the pixel under test, X_s denotes the pixels inside the outer window but outside the inner window, and λ is a regularization parameter. When the representation process is finished, anomalies can be determined from the residual

r_1 = ‖y − ŷ‖_2,  ŷ = X_s α̂

where ŷ is the represented pixel vector; a pixel is declared anomalous if r_1 is larger than a predefined threshold. The main drawback of the CRD method is that it ignores the correlations among all the pixels, and there is no general rule for choosing appropriate window sizes [35]. To improve the computational speed of the CRD algorithm, Ma et al. reduced the iteration time of CRD by constructing a correlation matrix between adjacent pixels [36]. Wang et al. improved the performance of the CRD algorithm by assigning different weights to different spectral bands [37]. Moreover, Zhao et al. [19] imposed a collaborative representation constraint on a stacked AE, thus making the extracted features easier for CRD detection.
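Because the collaborative representation has a closed-form ridge solution per test pixel, the residual score admits a compact sketch. This simplified version omits CRD's distance-weighted regularizer and uses illustrative names:

```python
import numpy as np

def crd_score(y, Xs, lam=1e-2):
    """Collaborative-representation residual for one test pixel.

    y  : (B,) spectrum under test
    Xs : (B, K) neighbor spectra from the dual-window region
    Returns ||y - Xs @ alpha||_2 with the ridge solution
    alpha = (Xs^T Xs + lam*I)^{-1} Xs^T y.
    """
    k = Xs.shape[1]
    alpha = np.linalg.solve(Xs.T @ Xs + lam * np.eye(k), Xs.T @ y)
    return float(np.linalg.norm(y - Xs @ alpha))
```

A background pixel lying in the span of its neighbors yields a near-zero residual, whereas an arbitrary spectrum does not.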

C. Matrix Decomposition Based Methods
The matrix decomposition based methods utilize the low-rank and sparse priors of hyperspectral data for anomaly detection. Given hyperspectral data X, the LRASR method [17] models it as

X = DS + E   (5)

where D is the background dictionary, obtained by the k-means algorithm, S represents the coefficients over D, and E denotes the anomalous part. (5) can be written as the following energy minimization problem:

min_{S,E} ‖S‖_* + β‖S‖_1 + λ‖E‖_{2,1}  s.t.  X = DS + E   (6)

where ‖·‖_* denotes the matrix nuclear norm, and ‖·‖_1 and ‖·‖_{2,1} represent the ℓ_1 norm and ℓ_2,1 norm, respectively. β and λ are parameters controlling low-rankness and sparsity, and the linearized alternating direction method with adaptive penalty (LADMAP) [38] is adopted to solve (6). In practice, those parameters are difficult to determine due to the lack of prior knowledge about anomalies or background [27]. To reduce the redundancy of adjacent bands, a spectral-difference low-rank representation learning framework was proposed in [39]; the residuals of adjacent bands are added to the background dictionary, thus improving the accuracy of the original LRASR. Feng et al. [40] used both the sparse part and the low-rank part for anomaly detection, which achieved considerable detection results. Afterward, Zhang et al. applied the Tucker decomposition method to HAD.
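Solvers for nuclear-norm problems such as (6), including LADMAP, repeatedly apply the proximal operator of the nuclear norm, i.e., singular value thresholding. A minimal NumPy sketch (function name illustrative, not from the cited implementations):

```python
import numpy as np

def svt(M, tau):
    """Singular value thresholding: shrink every singular value of M
    by tau and clip at zero. This is the proximal operator of the
    nuclear norm used inside low-rank solvers such as LADMAP."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vt
```

Thresholding strictly reduces the nuclear norm of the iterate, which is what drives the low-rank background estimate.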
Wang et al. [20] recovered the background of the HSI from preprocessed data, and anomalies are then detected from the residual errors.

D. Autoencoder Based Methods
As a typical unsupervised DNN [41], the AE is popularly used for anomaly detection; it is composed of an encoder and a decoder. The encoder learns a low-dimensional latent representation from the input data, while the decoder tries to reconstruct the input from the latent representation. Assume x ∈ R^D is the input data with dimension D. The AE first encodes it to a latent representation z ∈ R^d via a nonlinear transform function f: z = f(x). The decoder then reconstructs the input data from the latent representation z: x̂ = g(z). The general loss function of the AE is the mean square error (MSE), defined by

L_MSE = (1/N) Σ_{i=1}^{N} ‖x_i − x̂_i‖_2^2

where N is the number of training samples, and the parameters are updated by the back-propagation method [42]. Compared with the original data, the latent representation is more abstract and informative. In [43], the autoencoder was first used for HAD: the input image is encoded by a sparse encoder and then decoded, and the final anomaly score of each pixel is determined by the reconstruction error. To effectively extract the spatial-spectral features of hyperspectral images, Sun et al. combined low-rank representation and a 3-D convolutional autoencoder for HAD [44]. In that method, the 3-D convolutional autoencoder extracts features, low-rank representation is applied to the encoded features to obtain an initial result, and the final abnormal score is obtained by fusing the initial result with the reconstruction error of the autoencoder. Zhang et al. designed a 3-D variational autoencoder for HAD, where the reconstruction errors are regarded as abnormal scores [45]. Arisoy et al. [46] proposed a GAN-based HAD algorithm: a GAN generates a synthetic background image as close as possible to the original background, the background information in the original image is removed using the synthesized image, and the detection result is obtained by applying the RX detector to the residual image.
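The encode-decode-score pattern described above can be illustrated with a deliberately simplified stand-in: for a *linear* encoder/decoder pair trained under the MSE loss, the optimum is the top-d PCA subspace, so a truncated SVD replaces gradient training. This is an illustrative simplification, not the paper's convolutional AE:

```python
import numpy as np

def fit_linear_ae(train, d):
    """Fit the optimal linear AE under MSE: the top-d principal
    directions of the training spectra serve as encoder weights."""
    mu = train.mean(axis=0)
    _, _, vt = np.linalg.svd(train - mu, full_matrices=False)
    return mu, vt[:d]                  # mean and encoder rows (d, B)

def reconstruction_mse(pixels, mu, W):
    """Per-sample MSE between x and g(f(x)) = decode(encode(x))."""
    X = pixels - mu
    X_hat = (X @ W.T) @ W              # z = f(x), then x_hat = g(z)
    return ((X - X_hat) ** 2).mean(axis=1)
```

Pixels consistent with the learned background subspace reconstruct almost perfectly, while off-subspace anomalies leave a large residual, which is the detection principle the AE-based methods share.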
Besides that, the works most similar to TDNet are [47]-[49], but these are mainly designed for HSI recovery tasks. Saragadam et al. [47] used deep generative networks for low-rank decomposition, and the method is robust to a wide range of distributions. Luo et al. [48] utilized a multilayer neural network to learn the nonlinear nature of multidimensional images, which is more effective than linear transforms. Zhang et al. [49] constructed the Kronecker bases by performing a max-pooling operation on the input features, and then the degraded HSI is recovered by a Conv layer with a spatial size of 3 × 3. Unlike the above-mentioned works, this article focuses on the HAD task. The proposed method first extracts the low-dimensional embedding of the revised HSI using an encoder, then performs a low-rank tensor decomposition in the feature space, and finally recovers the background information of the original HSI by a decoder.

III. PROPOSED METHOD
In this section, the proposed TDNet is presented in detail. As shown in Fig. 1, TDNet mainly consists of three parts: potential abnormal pixel removal, deep CP decomposition for background recovery, and anomaly detection. In summary, the proposed TDNet is based on the autoencoder framework. In particular, we use a neural network to simulate the traditional CP decomposition model on the latent representations, so as to improve the accuracy of anomaly detection. Different from the traditional iterative optimization of tensor decomposition, we update parameters via backpropagation through end-to-end training, so that the network can adapt to different datasets.

A. Potential Abnormal Pixel Removal
The potential abnormal pixel removal step constructs the background training set for the deep CP decomposition network. Many strategies can achieve this goal, such as clustering-based methods [8], attribute filter-based methods [50], and naive detector-based methods [20]. In this article, the RX detector is used to remove potential abnormal pixels due to its effectiveness and simplicity. Given a hyperspectral dataset X ∈ R^{W×H×B}, where W, H, and B denote the width, height, and number of spectral bands of the HSI, respectively, the detection result of the RX detector can be denoted as M ∈ R^{W×H}. Obviously, the larger the value in M, the higher the probability of being an abnormal pixel. Thus, the W × H × η pixels with the smallest response values in M are set to 1, and the others are set to 0, where η (η ∈ (0, 1]) denotes the ratio of the background training set. Finally, the rectified M is used to obtain the background training set X_B ∈ R^{W×H×B}.
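The removal strategy above reduces to thresholding the RX map at the η-quantile. A minimal sketch (names illustrative):

```python
import numpy as np

def background_mask(rx_scores, eta=0.5):
    """Binary mask keeping the eta fraction of pixels with the
    smallest RX responses as background training pixels."""
    h, w = rx_scores.shape
    k = int(h * w * eta)                 # number of background pixels to keep
    flat = rx_scores.ravel()
    keep = np.argsort(flat)[:k]          # lowest responses = likely background
    mask = np.zeros(h * w, dtype=np.uint8)
    mask[keep] = 1
    return mask.reshape(h, w)
```

Applying the mask to the HSI yields the background training set described above.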

B. Deep CP Decomposition for Background Recovery
The deep CP decomposition network (CPNet) aims to learn a low-rank tensor from the background training set, and then to recover the background information of the original HSI. As shown in Fig. 1, CPNet consists of an encoder, a low-rank tensor generate network (LTGN), and a decoder. The encoder reduces the dimension of the input training set, and then the LTGN generates the low-rank tensor from the encodings [49], [51]. Fig. 2(b) shows the structure of the LTGN, which is composed of rank-1 tensor generate blocks (RTGB) and residual learning. After the low-rank tensor is obtained, the decoder is used to recover the background information.
The CP decomposition is a higher-order analogue of the matrix singular value decomposition, which can describe the discriminative information of different components of an image. A three-order tensor X ∈ R^{I_1×I_2×I_3} can be decomposed into a linear combination of N rank-1 tensors

X = Σ_{n=1}^{N} α_n a_n ∘ b_n ∘ c_n   (8)

where α_n is the weight of the nth rank-1 tensor, a_n ∈ R^{I_1}, b_n ∈ R^{I_2}, c_n ∈ R^{I_3} are the Kronecker bases in the three directions, and ∘ denotes the Kronecker (outer) product. Generally, (8) can be transformed into an optimization problem, and the alternating least squares (ALS) algorithm is usually used to solve it. However, due to the high dimensionality and noise corruption of the HSI, the solutions to this problem may not be stable. Fortunately, with the powerful representation capacity of DNNs, Chen et al. [51] proposed a tensor low-rank reconstruction network for semantic segmentation, and Zhang et al. [49] exploited the tensor low-rank prior for HSI reconstruction. Those works provide a new, more flexible and effective perspective on the tensor decomposition problem. In this article, a CP-inspired convolutional AE is proposed for HAD. As shown in Fig. 2(a), the RTGB is designed to generate rank-1 tensors. With the image features fed into the RTGB, a global average pooling (GAP)-1 × 1 convolution (Conv)-Sigmoid pipeline generates Kronecker bases in the channel, height, and width directions. Those bases are then combined into a rank-1 tensor by the Kronecker product.
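The CP model in (8) can be written as one `einsum` over the rank index; factor-matrix names are illustrative, not from the paper:

```python
import numpy as np

def cp_reconstruct(weights, A, B, C):
    """Sum of N weighted rank-1 tensors a_n ∘ b_n ∘ c_n, as in (8).

    A: (I1, N), B: (I2, N), C: (I3, N) hold the per-mode bases,
    weights: (N,) holds the alpha_n coefficients."""
    # einsum over n builds all outer products and sums them at once
    return np.einsum("n,in,jn,kn->ijk", weights, A, B, C)
```

The result matches summing the individual weighted outer products term by term.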
As shown in Fig. 2(b), the LTGN is used to generate a low-rank tensor. First, the encodings of the training set act as the input of the LTGN, and the first rank-1 tensor O_1 is generated by an RTGB. Then, the residual between the input of the RTGB and the generated rank-1 tensor is used to generate the second rank-1 tensor O_2. N rank-1 tensors are obtained by repeating this process N times, and each of them can be regarded as capturing a different frequency of the encoding features [49]. Finally, those rank-1 tensors are aggregated into a low-rank tensor along the channel direction, and a 3 × 3 Conv is used to reduce the dimension. Note that a skip connection is used to avoid overfitting and enhance the discriminability of the feature representation [49]. The above processing can be expressed as

X_low = X_encoding ⊙ Conv(Concat(O_1, O_2, ..., O_N))   (9)

where ⊙ is the element-wise product, and X_low ∈ R^{W×H×S} and X_encoding ∈ R^{W×H×S} are the obtained low-rank tensor and the encodings of the training set, respectively. Afterward, the low-rank tensor is fed into the decoder to recover the background information.
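The RTGB-plus-residual loop can be sketched in NumPy. The learned 1 × 1 convolutions are omitted here, so average pooling followed by a sigmoid stands in for each basis generator; this is an illustrative simplification, not the paper's trained network:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def rtgb(x):
    """Rank-1 tensor generate block (sketch): pool the (H, W, C)
    feature cube to one basis per direction, squash with a sigmoid,
    and take the outer product of the three bases."""
    h_basis = sigmoid(x.mean(axis=(1, 2)))   # height-direction basis (H,)
    w_basis = sigmoid(x.mean(axis=(0, 2)))   # width-direction basis  (W,)
    c_basis = sigmoid(x.mean(axis=(0, 1)))   # channel-direction basis (C,)
    return np.einsum("i,j,k->ijk", h_basis, w_basis, c_basis)

def ltgn(x, n_ranks=3):
    """Residual generation of N rank-1 tensors: each block sees
    what the previous rank-1 terms failed to explain."""
    residual, terms = x, []
    for _ in range(n_ranks):
        o = rtgb(residual)
        terms.append(o)
        residual = residual - o
    return terms
```

Each generated term is genuinely rank-1: its mode-1 unfolding has matrix rank 1.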
C. Anomaly Detection

1) Training Loss: The MSE is usually used for training AE networks, but it may suffer from blurry reconstructions [52]. In this article, inspired by [53], the structural similarity (SSIM) is added to the loss function to alleviate the blurry problem. Given the input image x and the reconstructed image x̂, the SSIM is defined as

SSIM(x, x̂) = ((2μ_x μ_x̂ + c_1)(2σ_xx̂ + c_2)) / ((μ_x^2 + μ_x̂^2 + c_1)(σ_x^2 + σ_x̂^2 + c_2))   (10)

In (10), μ_x and μ_x̂ denote the means of the input image x and the reconstructed image x̂, respectively, σ_x^2 and σ_x̂^2 represent their variances, and σ_xx̂ is their covariance. As described in [53], c_1 and c_2 are set to 0.01 and 0.03, respectively. Obviously, the larger the SSIM value, the higher the similarity between x and x̂. Thus, the negative SSIM is used as a penalty term [53]:

Loss_ssim(X, X̂_B) = −SSIM(X, X̂_B)   (11)

where X ∈ R^{W×H×B} is the input HSI and X̂_B ∈ R^{W×H×B} represents the reconstructed background. The total loss function in this article is then defined as

Loss = MSE(X, X̂_B) + λ Loss_ssim(X, X̂_B)   (12)

The first term is the MSE, and λ is a parameter balancing the SSIM term. The effect of λ is discussed in Section IV-C.
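The combined objective, MSE plus a weighted negative SSIM, can be sketched as follows. A single global SSIM window is used here for brevity (the original SSIM is computed over local windows), and c_1, c_2 follow the settings stated above:

```python
import numpy as np

def ssim_global(x, y, c1=0.01, c2=0.03):
    """Single-window SSIM between two image cubes (sketch)."""
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return ((2 * mx * my + c1) * (2 * cov + c2)) / \
           ((mx ** 2 + my ** 2 + c1) * (vx + vy + c2))

def total_loss(x, x_hat, lam=0.5):
    """MSE plus lam times the negative-SSIM penalty."""
    mse = ((x - x_hat) ** 2).mean()
    return mse - lam * ssim_global(x, x_hat)
```

A perfect reconstruction gives SSIM = 1, so the loss bottoms out at −λ, which is why a larger λ pushes the network toward structure-preserving reconstructions.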
2) Abnormal Score: During the test phase, the original HSI X ∈ R^{W×H×B} is directly fed into the network. The abnormal score of each pixel is calculated as

Map(i, j) = ‖X(i, j, :) − X̂_B(i, j, :)‖_2^2

where Map denotes the final detection result.
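The scoring step is a per-pixel residual energy over the spectral axis; a one-line sketch (names illustrative):

```python
import numpy as np

def anomaly_map(x, x_bg):
    """Per-pixel squared l2 residual between the input HSI (H, W, B)
    and the recovered background cube of the same shape."""
    return ((x - x_bg) ** 2).sum(axis=2)
```

Pixels whose spectra the background-trained network cannot reproduce receive large scores and are flagged as anomalies.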

IV. EXPERIMENTS
In this section, we first introduce the datasets and compared methods used in the experiments. Then, we verify the effectiveness of TDNet on four real hyperspectral datasets. Finally, ablation studies are performed to evaluate the necessity of each component of the framework.

A. Datasets
Four datasets are used in our experiments, and their details are listed in Table I. Furthermore, these datasets can be obtained from the corresponding links [54].

1) AVIRIS-I Dataset:
This dataset was collected by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor over San Diego. The original image size is 400 × 400 pixels, with a total of 224 bands and a wavelength range of 366-2496 nm. Following previous literature, a cropped region of 100 × 100 pixels is used as the experimental image. In addition, the low signal-to-noise ratio bands and the bad bands were removed. Three airplanes occupying 189 pixels are regarded as anomalies. Fig. 3(a) and (e) shows the pseudocolor image and the ground truth, respectively.
2) El Segundo Dataset: The size of the El Segundo dataset is 250 × 300 × 224, and the spatial resolution is 7.1 m/pixel. The anomalies are storage tanks and towers, which occupy 2048 pixels. The image scene and ground truth are shown in Fig. 3(b) and (f).

3) Texas Coast Dataset:
The Texas Coast dataset comes from [55]; it has 207 spectral bands with wavelengths from 0.45 to 1.35 μm. The image size is 100 × 100, and the spatial resolution is 17.2 m/pixel. Overall, the anomalies occupy 155 pixels, and the pseudocolor image and the ground truth are shown in Fig. 3(c) and (g).

4) Los Angeles Dataset:
There are 205 spectral bands in this image, and a region of 100 × 100 pixels is used for the experiment. The anomalies mainly consist of buildings, and the image scene as well as the ground truth are shown in Fig. 3(d) and (h).

B. Experimental Setup

1) Parameter Settings: Here, S is the number of hidden nodes of the encoder. Considering the tradeoff between detection accuracy and computation time, S is set to 24. Furthermore, the rate of the training set is 0.5, the rank number is 15, and the weight parameter λ is 0.5. The effect of these parameters on the final detection accuracy is discussed in Section IV-C.

2) Compared Methods and Evaluation Metric:
The receiver operating characteristic (ROC) curve and the AUC value are generally used for evaluating HAD [56], and seven widely used detectors are employed for comparison. The RX [12] and LRX [13] detectors are statistical-based methods, and the CRD [16] detector is based on collaborative representation theory. The main parameters of the LRX and CRD detectors are the window sizes. The optimal window size of LRX is set as (13,25), (9,31), (17,19), and (5,25) for the AVIRIS-I, El Segundo, Texas Coast, and Los Angeles datasets, respectively. For the CRD detector, the window size is (19,23), (13,31), (23,35), and (7,13) for the four datasets, respectively. The LRASR [17] is based on the low-rank matrix decomposition technique. The DAEAD [57] is based on the AE architecture, and the TDAD [10] method is a tensor decomposition-based detector. The KIFD [54] is a state-of-the-art detection method. The DAEAD method and TDNet are implemented with the TensorFlow framework, and the other methods are implemented in MATLAB 2018b.

C. Parameter Analysis
As shown in Fig. 4, the main parameters of TDNet are the rate of the training set η, the number of hidden nodes S, the number of rank-1 tensors N, and the weight parameter λ in the loss function. Fig. 4(a) shows that the Texas Coast dataset is stable across different numbers of hidden nodes, and the other datasets achieve the best detection accuracy when S is set to 24. Under this setting, Fig. 4(b)-(d) shows the effect of N, λ, and η, respectively. N controls the rank of the generated low-rank tensor; when N is set to 15, the accuracy on the four datasets is stable, and three of them obtain the highest AUC values. The λ balances the effect of the SSIM term in the loss function: for a small λ, the reconstructed background may lose structural information, while for a very large λ, the detection results degrade. For a large η, anomalies are mistakenly included in the background training set, which decreases the final accuracy.

D. Detection Performance
The detection maps, ROC curves, and AUC values are shown in Figs. 5-9 and Table II, respectively. For the AVIRIS-I dataset, the proposed TDNet obtains the second-highest AUC value. The RX, LRX, and TDAD detectors cannot detect the anomalies well, as reflected by their poor AUC values. Compared with the LRASR and KIFD detectors, the anomaly responses of TDNet are weaker, but TDNet suppresses the background better. To further illustrate the effectiveness of TDNet, the ROC curves of the seven detectors are shown in Fig. 9(a). The curves of TDNet and LRASR are higher than those of the other methods, which demonstrates the effectiveness of TDNet.
The El Segundo dataset is larger than the other datasets, and its image scene is more complex. From Fig. 6, we can see that the RX, LRX, CRD, and TDAD detectors fail to detect the anomalies, and the LRASR and DAEAD detectors find only parts of the targets. The detection maps of KIFD and TDNet are closer to the ground truth than the others, and the shapes of the anomalies are preserved. Compared with KIFD, TDNet has a lower false alarm rate and obtains the highest AUC value, 0.9943. Fig. 9(b) shows the ROC curves of the detectors on the El Segundo dataset; the curves of the compared methods are all enclosed by that of TDNet, which indicates the superiority of TDNet.
As shown in Fig. 7, for the Texas Coast dataset, RX, DAEAD, TDAD, and TDNet achieve better visual results. The LRASR and KIFD detectors are influenced by stripe noise, so they obtain lower AUC values. From the detection map of TDNet, we can see clearly that TDNet not only finds the locations of the abnormal targets but also preserves the shapes of the anomalies. The ROC curves in Fig. 9(c) further validate the effectiveness of TDNet.
The anomalies of the Los Angeles dataset mainly consist of buildings, and the image scene is relatively simple. All seven detectors obtain satisfactory detection accuracy, and TDNet achieves the highest AUC value. The ROC curves in Fig. 9(d) clearly and intuitively show the performance of these detectors, and the curves of the other detectors are enclosed by that of TDNet.
Overall, in terms of ROC curves, AUC values, and detection maps, the proposed TDNet not only yields better visual results but also obtains the highest AUC values. Meanwhile, the ROC curves of the other detectors are enclosed by that of TDNet, which verifies the effectiveness of the proposed TDNet.

E. Ablation Study
To analyze the effect of the SSIM term and the CP decomposition network in the proposed framework, an ablation study is shown in the last three columns of Table II. "AE" uses only the AE network for anomaly detection by calculating the reconstruction errors, and TDNet(M) uses the MSE alone as the loss function. Compared with the plain AE, the CP decomposition network improves the detection accuracy by 0.0106, 0.1160, 0.0043, and 0.0058 for the AVIRIS-I, El Segundo, Texas Coast, and Los Angeles datasets, respectively, which indicates the effectiveness of the CP decomposition network. When the SSIM term is added to the loss function, the AUC values on the four datasets are further improved. As mentioned above, the SSIM term helps preserve the structure of the original dataset.

V. CONCLUSION
In this article, a CP-inspired convolutional AE, named TDNet, is proposed for HAD. In particular, the traditional CP decomposition model is replaced with a DNN, which makes the model more stable and more robust to noise. To alleviate the blurry reconstruction problem, an SSIM term is added to the loss function.
The comparative experiments on four datasets demonstrate the effectiveness of TDNet, especially for large abnormal targets. Furthermore, the ablation studies verify the superiority of TDNet. Nevertheless, TDNet also has some weaknesses; for instance, its detection ability is limited for small targets. In the future, we will analyze the performance of TDNet on other object detection tasks.