A New Deep Convolutional Network for Effective Hyperspectral Unmixing

Hyperspectral unmixing extracts pure spectral constituents (endmembers) and their corresponding abundance fractions from remotely sensed scenes. Most traditional hyperspectral unmixing methods require the results of other endmember extraction algorithms to complete the abundance estimation step. Owing to the impressive learning and data fitting capabilities of convolutional neural networks (CNNs), deep learning (DL)-based hyperspectral unmixing technologies have rapidly developed in the literature. According to the procedure used to combine different layers (i.e., fully connected layers, convolution layers, and activation layers), these techniques fall into three main categories, i.e., those based on autoencoder networks, convolutional neural networks, and convolutional autoencoder networks. They usually extract the weights and outputs of a specific activation layer as endmember signatures and abundance maps, respectively. Moreover, most existing DL-based unmixing approaches use 2-D CNNs to learn the features contained in hyperspectral images, and very few approaches employ 3-D CNNs to extract spectral and spatial information. However, 2-D CNN-based techniques cannot capture good discriminative feature maps from the spectral viewpoint, and 3-D CNN-based techniques usually incur a high computational overhead. In this work, to further exploit the feature extraction capability of CNNs, we combine 3- and 2-D convolutions to propose a cross-convolution unmixing network (CrossCUN) for hyperspectral unmixing. Simultaneously, to better illustrate the improvements brought by our proposed CrossCUN, we also build the corresponding 2-D convolution unmixing network (2-DCUN) and 3-D convolution unmixing network (3-DCUN). We evaluate the performance of our newly developed networks on two types of synthetic datasets and three real hyperspectral images.
Experimental results show that the proposed networks not only obtain better results than other DL-based unmixing methods but also do not require any prior knowledge (e.g., the results of other endmember extraction algorithms) to estimate the abundance maps.

Traditional endmember extraction methods include vertex component analysis (VCA), the automatic target generation process (ATGP), negative abundance-oriented (NABO) extraction, and alternating decoupled volume max-min (ADVMM), among others. VCA [26] iteratively projects the HSI data onto a direction that is orthogonal to the subspace spanned by the endmembers determined so far. At each iteration, the pixel with the extreme projection is extracted as the new endmember. ATGP [29] uses the notion of orthogonal subspace projection to extract endmember signatures. NABO [30] considers the pixels outside the hull as alternative candidate endmembers to complete the unmixing task. ADVMM [31] addresses the worst-case simplex volume maximization problem by alternating optimization. In addition to these approaches, the current literature also contains other interesting probabilistic techniques, developed as an attempt to address some of the shortcomings faced by the previous methods [35], [36]. Nevertheless, these methods are computationally very expensive. In contrast, maximum distance analysis (MDA) [37], [38] has proven to be a simple but effective unmixing method that does not require any prior knowledge about the number of endmembers. However, all of the above methods focus on endmember identification and need to be combined with other abundance estimation methods to perform the abundance estimation task.
Traditional abundance estimation methods include fully constrained least squares (FCLS) [39], sparse unmixing by variable splitting and augmented Lagrangian (SUnSAL) [40], spatial group sparsity regularized nonnegative matrix factorization (SGSNMF) [41], minimum volume-constrained nonnegative matrix factorization (MVCNMF) [27], minimum-volume enclosing simplex (MVES) [24], and robust collaborative nonnegative matrix factorization (R-CoNMF) [42], among others. SUnSAL utilizes the alternating direction method of multipliers to decompose a difficult problem into a sequence of simpler ones. SGSNMF incorporates the group-structured prior information of hyperspectral images into the nonnegative matrix factorization optimization to complete abundance estimation, where the data are organized into spatial groups. MVCNMF includes a volume constraint in the nonnegative matrix factorization formulation to integrate least-squares analysis and the convex geometry model to perform the unmixing task. MVES incorporates convex analysis and Craig's criterion to enhance the performance of spectral unmixing. It is worth noting that FCLS, SUnSAL, and SGSNMF need the endmember extraction results of other methods to complete abundance estimation, whereas MVCNMF and MVES can simultaneously complete endmember extraction and abundance estimation. Compared with the above methods, R-CoNMF can simultaneously complete three tasks, i.e., determining the number of endmembers, extracting endmember signatures, and estimating the abundance maps. Notably, a distinct advantage of R-CoNMF over other abundance estimation methods is that it completes abundance estimation without any prior knowledge of the number of endmembers.

A. Deep Learning-Based Methods for Hyperspectral Unmixing
Recently, owing to their impressive learning and data fitting capabilities, several DL-based unmixing methods have been proposed in the literature [43], [44], [45], [46], [47], [48], [49], [50]. Depending on how different layers are combined, these DL-based unmixing methods are mainly divided into three groups, i.e., autoencoder networks, convolutional neural networks, and convolutional autoencoder networks. These three types of unmixing methods usually extract the endmember signatures and the corresponding abundance maps from the weights and outputs of a specific activation layer, respectively.
Regarding the first group, unmixing methods based on autoencoder networks mainly use fully connected layers and activation layers. In [43], Guo et al. proposed an autoencoder cascade framework that integrates a denoising autoencoder and a nonnegative sparse autoencoder in order to conduct data denoising and endmember estimation, respectively. In this sense, the autoencoder cascade framework is the first attempt to apply DL-based methods to spectral unmixing, improving unmixing performance by considering the observation noise and a sparsity prior. Nevertheless, its performance may be affected by the fact that the same (tied) encoder and decoder weights are used. To overcome this limitation, in [44], Qu et al. proposed a so-called untied denoising autoencoder with sparsity (uDAS) method to address the unsupervised unmixing problem for HSIs with high noise levels. uDAS incorporates denoising capabilities in the form of a constraint to reduce the reconstruction errors and introduces a novel backpropagation method that uses simple matrix operations to effectively force the endmembers to be nonnegative (while the abundance vector is constrained to sum to one). However, both the autoencoder cascade framework and uDAS ignore spatial information and only address LMM-based problems.
On the other hand, unmixing methods based on CNNs mainly contain convolutional and activation layers. For instance, in [45], Palsson et al. proposed a novel spectral and spatial linear mixture model by developing an associated estimation method based on a convolutional neural network autoencoder unmixing (CNNAEU) to perform the overall unmixing task, i.e., endmember extraction and abundance estimation. CNNAEU extends the LMM to consider the situation where neighboring pixels take part in the pixel reconstruction. As a result, it represents the first attempt to use a CNN to directly exploit the spatial correlation in HSIs for spectral unmixing. Nevertheless, CNNAEU requires a significant number of training samples to adequately fit a large number of learnable parameters. In this sense, the network quickly tends to overfit. In addition, CNNAEU is designed for LMM-based problems. In [46], Gao et al. proposed a cycle-consistency unmixing network (Cycu-Net) which learns two cascaded autoencoders (in an end-to-end manner) to more effectively improve unmixing performance. Cycu-Net designs a cycle-consistency strategy and introduces a new self-perception loss containing two spectral reconstruction terms and an abundance reconstruction term to further refine the unmixing process. However, it ignores the challenges introduced by the high spectral dimensionality of the data.
Finally, unmixing methods based on convolutional autoencoder networks mainly use fully connected layers, convolutional layers, and activation layers. For instance, in [47], Yasiru et al. developed a convolutional autoencoder (CAE) architecture for spectral unmixing. CAE captures the spatial distribution in the original data through convolution filters, which are used to parameterize the spectral features. The encoder and decoder parts of CAE are used to translate the feature space into a latent space representation and to reconstruct the input from the latent space, respectively. However, CAE is based on the LMM and is generally unsuitable for nonlinear unmixing problems. In [48], Qi et al. proposed a deep spectral convolution network with spectral library (SCSL) support to conduct spectral unmixing. SCSL uses convolutional layers and fully connected layers to capture features and estimate abundance maps, respectively. Moreover, it exploits another fully connected layer to reconstruct the original data. Although SCSL reduces complexity and improves the utilization of the network, it heavily relies on the quality of the spectral library.

B. Contributions of This Work
As mentioned, existing unmixing methods based on DL architectures usually adopt autoencoder networks, CNNs, and convolutional autoencoder networks to perform spectral unmixing. The methods based on autoencoder networks are the most commonly used, whilst those involving CNNs and convolutional autoencoder networks are quite rare. Moreover, most existing unmixing methods employ 2-D convolution kernels to extract features, even though 3-D CNNs can jointly capture spectral and spatial features. Nevertheless, using purely 2-D or purely 3-D CNNs has some disadvantages, such as the inability to capture relationships between channels or the need to design a very complex model. For instance, by only using 2-D kernels, the model cannot learn good discriminative feature maps from the spectral information. Similarly, a deep 3-D CNN is significantly more complex and expensive in computational terms. At the same time, it can obtain worse results for classes with similar textures across many spectral bands.
To address the aforementioned drawbacks and limitations faced by existing DL-based methods, in this article we combine 3- and 2-D convolutions to propose a new cross convolution unmixing network (CrossCUN) for hyperspectral unmixing. The proposed CrossCUN can not only directly complete the abundance estimation task without any prior knowledge about endmember extraction, but also obtain more accurate results than traditional unmixing methods. Notably, most existing DL-based methods use autoencoder networks or 2-D convolutional networks to complete abundance estimation, and our proposed CrossCUN is the first one to exploit cross convolution (which combines 2- and 3-D convolutions) to perform abundance estimation. Our newly proposed CrossCUN shows the potential of cross convolution compared to using 2- or 3-D convolution alone in terms of abundance estimation, and obtains better results than other traditional and DL-based methods.
The rest of this article is organized as follows. Section II introduces the newly proposed CrossCUN and the corresponding 2-D and 3-D convolution networks, i.e., 2-DCUN and 3-DCUN. In Section III, we test the performance of different unmixing methods in abundance estimation on synthetic data and three real HSIs. Finally, Section IV concludes this article.

II. PROPOSED FRAMEWORK
Our proposed CrossCUN combines 2- and 3-D convolutions and aims at estimating the abundance fraction of each endmember at each pixel in the scene. Let $Y \in \mathbb{R}^{N_x \times N_y \times N_b}$ denote the HSI data, where $N_x$ is the width, $N_y$ is the height, and $N_b$ is the number of spectral bands. Since the 3-D CNN is more computationally complex and hyperspectral pixels contain mixed land cover classes (with the subsequent inter-class similarity and intra-class variability), we employ principal component analysis (PCA) to remove spectral redundancy before the data enter the proposed CrossCUN. PCA reduces the number of bands from $N_b$ to $D$, maintaining the same spatial dimensions whilst keeping the most relevant spectral information. In this sense, let $X \in \mathbb{R}^{N_x \times N_y \times D}$ denote the data reduced by PCA (in our work, $D$ is empirically set to 15).
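The PCA reduction step can be sketched in NumPy as follows; this is a minimal illustration (the function name `reduce_bands` and the SVD-based projection are our own choices, not taken from the original implementation):

```python
import numpy as np

def reduce_bands(Y, D=15):
    """Project an (Nx, Ny, Nb) hyperspectral cube onto its first D
    principal components, keeping the spatial dimensions intact."""
    Nx, Ny, Nb = Y.shape
    flat = Y.reshape(-1, Nb)                 # one row per pixel
    flat = flat - flat.mean(axis=0)          # center each band
    # Right singular vectors of the centered data are the principal axes
    _, _, Vt = np.linalg.svd(flat, full_matrices=False)
    scores = flat @ Vt[:D].T                 # scores on the first D axes
    return scores.reshape(Nx, Ny, D)

Y = np.random.rand(95, 95, 156)              # e.g., the Samson subset size
X = reduce_bands(Y, D=15)
print(X.shape)                               # (95, 95, 15)
```

The spatial layout is untouched: only the band axis shrinks from $N_b$ to $D$.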
We build 3-D patches to further improve the performance of 2-DCUN, 3-DCUN, and CrossCUN when dealing with the unmixing task. In this context, let $P \in \mathbb{R}^{S \times S \times D}$ denote a 3-D patch, where $S \times S$ is the window size. The number of 3-D patches created by these three networks is $(N_x - S + 1) \times (N_y - S + 1)$. We assume that $P_{a,b}$ is the 3-D patch whose center pixel is $(a, b)$. In this sense, the width and the height of the 3-D patch $P_{a,b}$ span from $a - (S - 1)/2$ to $a + (S - 1)/2$ and from $b - (S - 1)/2$ to $b + (S - 1)/2$, respectively. In our work, $S$ is empirically set to 9.
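The patch construction can be illustrated with a short sketch; `extract_patches` is a hypothetical helper that reproduces the $(N_x - S + 1) \times (N_y - S + 1)$ patch count described above:

```python
import numpy as np

def extract_patches(X, S=9):
    """Slide an S x S window over (Nx, Ny, D) data. Returns an array of
    shape ((Nx-S+1)*(Ny-S+1), S, S, D) plus the center coordinates."""
    Nx, Ny, D = X.shape
    r = (S - 1) // 2                         # half-window radius
    patches, centers = [], []
    for a in range(r, Nx - r):
        for b in range(r, Ny - r):
            patches.append(X[a - r:a + r + 1, b - r:b + r + 1, :])
            centers.append((a, b))
    return np.stack(patches), centers

X = np.random.rand(20, 20, 15)
P, centers = extract_patches(X, S=9)
print(P.shape)   # (144, 9, 9, 15): (20-9+1) * (20-9+1) = 144 patches
```

Each patch is centered on one interior pixel, so the count matches the formula exactly.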
On the one hand, our proposed 2-DCUN and CrossCUN involve 2-D convolutions, and their input patches are convolved with 2-D kernels. 2-D convolution computes the sum of the dot products between the input patch and the 2-D kernel to obtain the corresponding results, and the kernel spans the input data to cover the entire spatial dimension. The convolved features use an activation function to introduce nonlinearity in the model.
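The sum-of-dot-products operation described above can be written directly in NumPy; this is an illustrative single-kernel, valid-mode sketch (the function name and the ReLU choice for the activation are our assumptions):

```python
import numpy as np

def conv2d_feature_map(V, W, b, phi=lambda t: np.maximum(t, 0.0)):
    """One output feature map of a valid-mode 2-D convolution.
    V: (d_prev, H, W_sp) feature maps of the previous layer;
    W: (d_prev, 2*rho+1, 2*gamma+1) kernel spanning all input maps;
    b: scalar bias; phi: activation function (ReLU here)."""
    d_prev, H, Wsp = V.shape
    _, kh, kw = W.shape
    out = np.empty((H - kh + 1, Wsp - kw + 1))
    for x in range(out.shape[0]):
        for y in range(out.shape[1]):
            # sum of dot products between the input window and the kernel
            out[x, y] = np.sum(V[:, x:x + kh, y:y + kw] * W) + b
    return phi(out)

V = np.random.rand(15, 9, 9)   # D = 15 maps of a 9 x 9 input patch
W = np.random.rand(15, 3, 3)   # one 3 x 3 kernel
fmap = conv2d_feature_map(V, W, b=0.1)
print(fmap.shape)              # (7, 7)
```

A framework convolution layer simply repeats this for every kernel to produce a stack of feature maps.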
In this regard, the activation value $v_{i,j}^{x,y}$ at spatial position $(x, y)$ in the $j$th feature map of the $i$th layer is obtained by

$$v_{i,j}^{x,y} = \phi\Big(b_{i,j} + \sum_{m=1}^{d_{i-1}} \sum_{\rho'=-\rho}^{\rho} \sum_{\gamma'=-\gamma}^{\gamma} w_{i,j,m}^{\rho',\gamma'}\, v_{i-1,m}^{x+\rho',\, y+\gamma'}\Big)$$

where $\phi$ denotes the activation function, $w_{i,j}$ is the value of the weight parameter for the $j$th feature map of the $i$th layer, $b_{i,j}$ is the bias for the $j$th feature map of the $i$th layer, $d_{i-1}$ is the number of feature maps in the $(i-1)$th layer, and $2\rho + 1$ and $2\gamma + 1$ are the width and height of the kernel, respectively. On the other hand, 3-DCUN and the proposed CrossCUN first employ 3-D convolution to convolve the input patches with 3-D kernels. Similarly, 3-D convolution also computes the sum of the dot products between the input patch and the 3-D kernel to obtain the corresponding results. We then use an activation function to learn the nonlinear features. In this regard, the activation value $v_{i,j}^{x,y,z}$ at spatial position $(x, y, z)$ in the $j$th feature map of the $i$th layer is obtained by

$$v_{i,j}^{x,y,z} = \phi\Big(b_{i,j} + \sum_{m=1}^{d_{i-1}} \sum_{\rho'=-\rho}^{\rho} \sum_{\gamma'=-\gamma}^{\gamma} \sum_{\delta'=-\delta}^{\delta} w_{i,j,m}^{\rho',\gamma',\delta'}\, v_{i-1,m}^{x+\rho',\, y+\gamma',\, z+\delta'}\Big)$$

where $2\delta + 1$ is the depth of the 3-D kernel along the spectral dimension. The 2-DCUN and 3-DCUN frameworks and our proposed CrossCUN are graphically illustrated in Figs. 1-3. From Fig. 1, we see that 2-DCUN contains three 2-D convolutional layers, one flatten layer, and one dense layer. The kernel sizes of the three 2-D convolutional layers are $3 \times 3$ (where $f_j^i$ denotes the kernel size of the $i$th dimension in the $j$th convolution), and the number of kernels in each layer is detailed in Tables I-III. Notably, applying a 2-D convolutional layer after the 3-D convolutional layer and before the flatten layer can better distinguish spatial information in different spectral bands without losing a large amount of spectral information, which is very significant for HSI data analysis. Similar to existing unmixing methods, the estimated abundance maps need to satisfy two constraints, i.e., the abundance sum-to-one constraint (ASC) and the abundance nonnegativity constraint (ANC). We use SoftMax as the activation function of the last dense layer in 2-DCUN, 3-DCUN, and CrossCUN, which is computed by

$$\hat{a}_{ij} = \frac{\exp(z_j)}{\sum_{c=1}^{C} \exp(z_c)}$$

where $\hat{a}_{ij}$ is the estimated abundance of the $j$th endmember on the $i$th pixel, $z_j$ is the output value of the dense layer, and $C$ is the total number of endmembers.
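A minimal sketch of the SoftMax output layer, showing that it enforces ANC and ASC by construction (the helper name is ours):

```python
import numpy as np

def softmax_abundances(z):
    """Turn the C dense-layer outputs z into abundance estimates that
    are nonnegative (ANC) and sum to one (ASC) by construction."""
    e = np.exp(z - z.max())    # subtract max for numerical stability
    return e / e.sum()

a_hat = softmax_abundances(np.array([2.0, 1.0, 0.1]))
print(round(a_hat.sum(), 6))   # 1.0
```

Since every exponential is positive and the outputs are normalized by their sum, no extra projection step is needed to satisfy the two constraints.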
We use cross-entropy as the loss function to train 2-DCUN, 3-DCUN, and CrossCUN:

$$L = -\frac{1}{N} \sum_{i=1}^{N} a_i^{T} \log(\hat{a}_i)$$

where $\hat{a}_i$ denotes the abundance vector estimated for the $i$th pixel, $a_i$ is the real abundance vector of the $i$th pixel, and $N$ is the number of pixels. Tables I-III show a detailed description of the three networks. Although our proposed CrossCUN does not need the fewest number of parameters, it obtains more accurate results on the experimental data. Moreover, it is worth noting that the number of nodes of the last dense layer in the three networks is 3, which is equal to the number of endmembers in the Samson data.
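The cross-entropy loss over $N$ pixels can be sketched as follows (the small epsilon added for numerical stability is our assumption, not a detail from the paper):

```python
import numpy as np

def cross_entropy(A_true, A_est, eps=1e-12):
    """Mean cross-entropy between real abundances a_i (rows of A_true)
    and estimates â_i (rows of A_est) over N pixels."""
    return -np.mean(np.sum(A_true * np.log(A_est + eps), axis=1))

A = np.array([[0.5, 0.3, 0.2],
              [0.1, 0.6, 0.3]])
B = np.array([[0.2, 0.3, 0.5],
              [0.3, 0.6, 0.1]])
# By Gibbs' inequality, the loss is smallest when the estimate matches
assert cross_entropy(A, A) < cross_entropy(A, B)
```

Minimizing this loss therefore pushes the SoftMax outputs toward the reference abundance distributions.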

III. EXPERIMENTS
To assess the performance of 2-DCUN, 3-DCUN, and our proposed CrossCUN in terms of abundance estimation, two big families of synthetic data, i.e., with and without pure pixels, and three real HSIs (Samson, Jasper, and Urban scenes) are adopted in our experiments. We use the following comparison methods: 1) minimum-volume enclosing simplex (MVES) [24]; 2) spatial group sparsity regularized nonnegative matrix factorization (SGSNMF) [41]; 3) robust collaborative nonnegative matrix factorization (R-CoNMF) [42]; 4) convolutional autoencoder (CAE) [47]; and 5) untied denoising autoencoder with sparsity (uDAS) [44]. In particular, the initial endmember results required by SGSNMF are extracted by VCA. We employ the root-mean-square error (rmse) and the mean rmse as metrics to test the performance of 2-DCUN, 3-DCUN, and CrossCUN. The rmse is computed by

$$\mathrm{rmse}_j = \sqrt{\frac{1}{N} \sum_{i=1}^{N} (\hat{a}_{ij} - a_{ij})^2}$$

where $N$ is the number of pixels, and $\hat{a}_{ij}$ and $a_{ij}$ are the estimated and real abundance fractions of the $j$th endmember on the $i$th pixel. The mean rmse is computed by

$$\overline{\mathrm{rmse}} = \frac{1}{C} \sum_{j=1}^{C} \mathrm{rmse}_j$$

where $C$ is the number of endmembers.
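The two metrics can be computed as follows; abundances are arranged as $N \times C$ matrices (rows: pixels, columns: endmembers):

```python
import numpy as np

def rmse_per_endmember(A_true, A_est):
    """rmse_j over the N pixels, for each endmember column j."""
    return np.sqrt(np.mean((A_est - A_true) ** 2, axis=0))

def mean_rmse(A_true, A_est):
    """Average of the per-endmember rmse values over the C endmembers."""
    return rmse_per_endmember(A_true, A_est).mean()

A_true = np.array([[1.0, 0.0],
                   [0.0, 1.0]])
A_est = np.array([[0.9, 0.1],
                  [0.1, 0.9]])
print(mean_rmse(A_true, A_est))   # close to 0.1
```

Here every entry is off by 0.1, so each per-endmember rmse, and hence the mean rmse, is approximately 0.1.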

A. Data Descriptions

1) Synthetic Dataset: To evaluate the impact of pixel purity, the number of endmembers, and the level of noise on abundance estimation, we generate two big families of synthetic data, i.e., with and without pure pixels. Each kind of synthetic data comprises two types: 1) synthetic data with 2500 pixels and 20 dB noise with different numbers of endmembers, i.e., 5, 10, 15, and 20; and 2) synthetic data with 2500 pixels and 5 endmembers under different levels of noise, i.e., 10, 20, 30, and 40 dB.
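The paper does not detail its synthetic-data generator, but a common recipe consistent with the description (linear mixing, ASC/ANC-compliant abundances, additive Gaussian noise at a target SNR) can be sketched as follows; `synth_data` and the Dirichlet abundance prior are our assumptions:

```python
import numpy as np

def synth_data(M, n_pixels=2500, snr_db=20, seed=0):
    """Hypothetical generator: M is an (Nb, C) endmember matrix.
    Abundances are drawn from a flat Dirichlet, so every pixel satisfies
    ASC and ANC; Gaussian noise is scaled to the requested SNR in dB."""
    rng = np.random.default_rng(seed)
    Nb, C = M.shape
    A = rng.dirichlet(np.ones(C), size=n_pixels)   # (N, C), rows sum to 1
    Y = A @ M.T                                    # linear mixing, (N, Nb)
    noise_var = np.mean(Y ** 2) / 10 ** (snr_db / 10)
    Y = Y + rng.normal(0.0, np.sqrt(noise_var), Y.shape)
    return Y, A

M = np.random.rand(156, 5)            # e.g., 156 bands, 5 endmembers
Y, A = synth_data(M, n_pixels=2500, snr_db=20)
print(Y.shape, A.shape)               # (2500, 156) (2500, 5)
```

A "with pure pixels" variant could additionally overwrite a few abundance rows with one-hot vectors.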
2) Samson Dataset: The Samson dataset was captured by the SAMSON sensor, and it is one of the most widely used datasets for evaluating hyperspectral unmixing algorithms. The scene contains three endmembers, i.e., soil, tree, and water. The original Samson data has 952 × 952 pixels and 156 bands covering the wavelengths from 401 to 889 nm. In our experiments, to reduce the computational burden, we start from the (252, 332)th pixel and utilize a region of 95 × 95 pixels to validate the performance of different unmixing methods in estimating abundance maps. The false color composition and the corresponding endmember spectra of Samson are shown in Fig. 4(a) and (d), respectively.
3) Jasper Dataset: The Jasper dataset was obtained by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) of NASA's Jet Propulsion Laboratory (JPL). The original Jasper data contains 512 × 614 pixels recorded in 224 bands, ranging from 380 to 2500 nm. It has four endmembers, i.e., tree, water, soil, and road. Since the original Jasper data are too large and would bring a high computational overload, we start from the (105, 269)th pixel and keep 100 × 100 pixels in our experiments. In addition, due to dense water vapor and atmospheric artifacts, we remove bands 1-3, 108-112, 154-166, and 220-224 and retain 198 bands in our experiments. The false color composition and the corresponding endmember spectra of Jasper are shown in Fig. 4(b) and (e), respectively.

B. Implementation Details and Performance Metrics
In our work, we empirically set the number of layers, randomly select 80% of the pixels, and run 50 epochs to train 2-DCUN, 3-DCUN, and our proposed CrossCUN. In particular, to randomly select 80% of the pixels, we first determine the number of pixels (e.g., N) and then randomly generate N × 80% nonrepeating numbers between 1 and N. Next, we use these N × 80% numbers to extract the corresponding data and abundances from the original experimental data and abundances to ensure randomness. Finally, we use the extracted data and abundances to train our proposed network. The parameters that need to be adjusted in our proposed CrossCUN are the learning rate, input size, input dimension, dropout rate, and weight decay. We perform a parameter sensitivity analysis on the Samson data and report the results in Fig. 5. From Fig. 5(a), we see that the mean rmse increases as the learning rate grows, reaching its lowest value when the learning rate is 0.0001. Therefore, we choose 0.0001 as the final learning rate for our proposed CrossCUN. From Fig. 5(b), the mean rmse achieves the lowest value when the input size is 9 × 9. In this sense, we choose 9 × 9 as the final input size to train our proposed CrossCUN. From Fig. 5(c), we can see that the value of the mean rmse fluctuates as the input dimension increases, reaching an optimal value when the input dimension is 13. Therefore, we consider 13 as the final input dimension in our proposed CrossCUN. From Fig. 5(d), we also observe that the mean rmse fluctuates as the dropout rate grows, reaching its lowest value when the dropout rate is 0.03. Therefore, we choose 0.03 as the final dropout rate to train our proposed CrossCUN. Fig. 5(e) shows that our method obtains the best results in terms of mean rmse when the weight decay is 0.0001. As a result, 0.0001 is considered the final weight decay in our experiments.
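The random 80% training split described above can be sketched as follows (`train_split` is an illustrative helper):

```python
import numpy as np

def train_split(N, frac=0.8, seed=0):
    """Pick floor(N * frac) distinct pixel indices in [0, N) at random,
    mirroring the nonrepeating-number selection described above."""
    rng = np.random.default_rng(seed)
    n_train = int(N * frac)
    # A truncated random permutation guarantees no repeated indices
    return rng.permutation(N)[:n_train]

idx = train_split(95 * 95)             # e.g., the 95 x 95 Samson subset
print(len(idx))                        # 7220
```

The same index vector is then applied to both the data cube and the reference abundances, so each training pixel stays paired with its label.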

C. Experiments With Synthetic Data
We generate two big families of synthetic data, i.e., with and without pure pixels, to evaluate the performance of different unmixing methods in estimating abundance maps, and retrain the network whenever the synthetic data have a different number of endmembers or a different level of noise. In particular, considering that the synthetic data include many endmembers, we only report the mean rmse results of the different unmixing methods on the two big families of synthetic data. Table IV presents the mean rmse results obtained by different unmixing methods on synthetic data with pure pixels under different numbers of endmembers. From Table IV, we see that 2-DCUN, 3-DCUN, and CrossCUN all obtain better results than the other methods, and CrossCUN obtains the most accurate results on synthetic data with any number of endmembers. The results shown in Table IV reflect that the number of endmembers has almost no effect on the performance of our proposed CrossCUN in abundance estimation. Table V displays the mean rmse results of different unmixing methods on synthetic data with pure pixels under different levels of noise, i.e., 10, 20, 30, and 40 dB. From Table V, we observe that our proposed CrossCUN obtains the best results on synthetic data with 10 and 20 dB, and R-CoNMF achieves the best results on synthetic data with 30 and 40 dB. From Table V, we also see that, although our proposed CrossCUN does not achieve the best results on all synthetic data, the gap between it and R-CoNMF is very small. Therefore, our proposed CrossCUN can be used as an effective abundance estimation method. Table VI shows the mean rmse results obtained by different unmixing methods on synthetic data without pure pixels using different numbers of endmembers. From Table VI, we observe that MVES and R-CoNMF exhibit better results than the other traditional methods. Moreover, 2-DCUN, 3-DCUN, and CrossCUN achieve more accurate results than MVES and R-CoNMF, and CrossCUN obtains the best results in terms of abundance estimation on synthetic data with any number of endmembers. Table VII presents the mean rmse results obtained by different abundance estimation methods on the synthetic data without pure pixels under different levels of noise, i.e., 10, 20, 30, and 40 dB. From Table VII, we observe that our proposed CrossCUN is better than the other methods on synthetic data with 10 and 20 dB, and R-CoNMF obtains the most accurate results on synthetic data with 30 and 40 dB (but the gap between R-CoNMF and our proposed CrossCUN is small). Therefore, our proposed CrossCUN is also very effective in terms of abundance estimation.

D. Experiments With Samson Data
Table VIII quantitatively compares the performance of different unmixing methods in the task of estimating abundance maps on the Samson data, reporting the rmse and mean rmse scores obtained by them. From Table VIII, we see that SGSNMF obtains the best results in estimating the abundance maps of soil, tree, and water compared with the other two traditional unmixing methods, i.e., MVES and R-CoNMF. Moreover, CAE is the best DL-based abundance estimation method compared with uDAS. From Table VIII, we also find that 2-DCUN, 3-DCUN, and CrossCUN are better than the best traditional and DL-based methods, i.e., SGSNMF and CAE, and our proposed CrossCUN achieves better abundance estimation results compared with 2-DCUN and 3-DCUN. This fact reveals that our proposed CrossCUN is not only very effective for abundance estimation purposes but also improves the results obtained by the corresponding 2- and 3-D convolution networks, i.e., 2- and 3-DCUN. Fig. 6 presents the abundance maps obtained by different unmixing methods from the Samson data. From Fig. 6, we observe that SGSNMF and CAE provide results that are more similar to the ground-truth compared with the other traditional and DL-based unmixing methods. We also find that 2-DCUN, 3-DCUN, and our proposed CrossCUN are closer to the ground-truth than all comparison methods, and CrossCUN is slightly better than 2- and 3-DCUN.

E. Experiments With Jasper Data
Table IX shows the rmse and mean rmse results obtained by different unmixing methods and reports their performance evaluation in abundance estimation on the Jasper data. From Table IX, we observe that SGSNMF is the best method in estimating the abundance maps of tree, water, soil, and road compared with the other two traditional unmixing methods, i.e., MVES and R-CoNMF. CAE obtains the best results in estimating the abundance maps of tree and water, and uDAS obtains the best abundance estimations for soil and road.
In fact, uDAS is the best among DL-based methods in terms of the value of the mean rmse. In addition, from Table IX we also see that 2-DCUN, 3-DCUN, and CrossCUN are not only better than the best traditional and DL-based methods, i.e., SGSNMF and uDAS, but can also get quite accurate results when estimating the abundance maps of all endmembers. Simultaneously, CrossCUN is better than 2- and 3-DCUN. Therefore, CrossCUN is not only very effective in abundance estimation but also improves on its 2- and 3-D counterparts. The results obtained by different unmixing methods on the Jasper data are presented in Fig. 7. From Fig. 7, we see that MVES and CAE provide the worst results compared with the other traditional and DL-based methods, while 2-DCUN, 3-DCUN, and our proposed CrossCUN are more similar to the ground-truth than all other methods. This fact, together with the results in Table IX, leads us to conclude that our proposed CrossCUN is not only an effective abundance estimation method but also outperforms 2- and 3-DCUN.

F. Experiments With Urban Data
Table X shows the rmse and mean rmse values obtained by different unmixing methods and reports their performance evaluation in abundance estimation on the Urban data. From Table X, we can conclude that the three traditional unmixing methods, i.e., MVES, SGSNMF, and R-CoNMF, and the two DL-based methods, i.e., CAE and uDAS, cannot obtain ideal results in abundance estimation. Clearly, 2-DCUN, 3-DCUN, and CrossCUN obtain better results in terms of rmse and mean rmse compared with all traditional and DL-based unmixing methods. As in previous experiments, CrossCUN is better than 2- and 3-DCUN. Fig. 8 reports the abundance maps obtained by different unmixing methods on the Urban data. From Fig. 8, we see that 2-DCUN, 3-DCUN, and our proposed CrossCUN are all close to the ground-truth. After analyzing the results in Table X and Fig. 8, we can conclude that CrossCUN not only gets the best results in abundance estimation compared with other traditional and DL-based unmixing methods, but is also more effective than 2- and 3-DCUN.

G. Computational Cost
We conducted all experiments on a computer with a 2.6-GHz Intel Core i7 CPU, 16 GB of memory, and an NVIDIA GeForce RTX 2060 GPU, and evaluated the computational cost of the different unmixing methods using different datasets. By analyzing the experimental results on synthetic data, we find that the existence of pure pixels does not significantly affect the efficiency of the different unmixing methods. In this sense, we report the average running time (including training and testing time) of the different unmixing methods on synthetic data with and without pure pixels. From Table XI, we see that MVES achieves the highest efficiency on synthetic data with five endmembers, and that the number of endmembers has a greater impact on MVES. In addition, R-CoNMF is the fastest on synthetic data with 10, 15, and 20 endmembers, and it is stable on synthetic data with any number of endmembers. From Table XI, we also find that 2-DCUN is faster than 3-DCUN and CrossCUN, and the efficiency of 3-DCUN and CrossCUN is similar. Although our proposed CrossCUN is not faster than MVES and R-CoNMF, the time gap between them is not significant. Moreover, the results in Tables IV and VI show that CrossCUN is more effective than MVES and R-CoNMF. Therefore, our proposed CrossCUN is very effective and efficient for abundance estimation. Table XII shows the results of different unmixing methods on synthetic data under different levels of noise, i.e., 10, 20, 30, and 40 dB. The experimental results in Table XII indicate that MVES is the fastest method on synthetic data with any level of noise. Moreover, we find that 2-DCUN achieves higher efficiency than 3-DCUN and CrossCUN, and the running time of 3-DCUN and CrossCUN is similar. Although our proposed CrossCUN does not have higher efficiency than MVES, the results in Tables V and VII show that CrossCUN is more effective than MVES. Therefore, our proposed CrossCUN is an effective and efficient abundance estimation method.
Table XIII shows the running time of different unmixing methods on the three real hyperspectral datasets, i.e., Samson, Jasper, and Urban. From Table XIII, we see that MVES is the most efficient among the considered methods. Although MVES is faster than CrossCUN, the results in Tables VIII-X show that CrossCUN is more effective than MVES. Moreover, the computational cost of CrossCUN is acceptable for abundance estimation.

H. Sensitivity to the Number of Training Samples
Although the proposed CrossCUN is a supervised unmixing method, it obtains accurate abundance estimation results without requiring a lot of training samples. To illustrate this point, we choose different training set sizes (i.e., 20%, 40%, 60%, and 80% of the available labeled samples) to evaluate the performance of 2-DCUN, 3-DCUN, and CrossCUN on different datasets. The related mean rmse results are reported in Table XIV. From Table XIV, we see that: 1) more training pixels lead to more accurate results (as expected); and 2) CrossCUN provides more accurate results than 2- and 3-DCUN with limited training samples. For clarity, we summarize the mean rmse results of the different unmixing methods in Table XV (using 80% of the available samples). After analyzing the results in Tables XIV and XV, we can conclude that 2-DCUN, 3-DCUN, and CrossCUN outperform the other compared methods even when the training pixels are very few, and that CrossCUN obtains the best results compared to all other methods.

IV. CONCLUSION
In this work, a new cross convolution unmixing network (CrossCUN) has been introduced. The proposed CrossCUN can not only obtain more accurate results than other DL-based methods but also directly complete abundance estimation without any prior results about endmember extraction, as opposed to traditional unmixing methods. We evaluated the performance of the newly proposed CrossCUN in terms of abundance estimation on synthetic data and three real HSIs, and our experiments reveal that CrossCUN outperforms the competitors in most cases.
As with any new approach, there are some unresolved issues that may present challenges over time. Similar to other DL-based methods, our proposed CrossCUN is a supervised method that requires labeled samples to train the network. Although the process of collecting training data for unmixing applications is costly, it can be carried out using image data at multiple resolutions or even field visits. In future work, we will design a new unsupervised network to directly complete abundance estimation.

ACKNOWLEDGMENT
The BBVA Foundation accepts no responsibility for the opinions, statements, and contents included in the project and/or the results thereof, which are entirely the responsibility of the authors.