GPU Implementation of Graph-Regularized Sparse Unmixing With Superpixel Structures

To enhance spectral unmixing performance, a large number of algorithms have jointly exploited the spatial and spectral information in hyperspectral images. However, sophisticated algorithms with high computational complexity can be very time-consuming when a large amount of data is involved in processing hyperspectral images. In this article, we first introduce a group sparse graph-regularized unmixing method with a superpixel structure, to promote piecewise consistency of the abundances and reduce the computational burden. Segmenting the image into several nonoverlapping superpixels also makes it possible to decompose the unmixing problem into uncoupled subproblems that can be processed in parallel. An implementation of the proposed algorithm on graphics processing units (GPUs) is then developed based on the NVIDIA compute unified device architecture (CUDA) framework. The proposed scheme achieves parallelism at both the intrasuperpixel and intersuperpixel levels, where multiple concurrent streams are used to enable multiple kernels to execute on the device simultaneously. Simulation results from a series of experiments demonstrate the advantages of the proposed algorithm. The performance of the GPU implementation also illustrates that the parallel scheme largely expedites the processing.


I. INTRODUCTION
Mixed pixels are widespread in hyperspectral images due to the limited spatial resolution of sensors and the intimate material interactions in complex scenes. In the past three decades, spectral unmixing techniques have been intensively studied to analyze the constituent substances in mixed pixels by separating each observed pixel into a set of spectral signatures (or endmembers) with corresponding fractions (or abundances) [2]. The majority of unmixing algorithms in the literature are based on the linear mixture model (LMM). Traditional LMM-based unmixing includes endmember identification by endmember extraction algorithms [3], [4], [5] and abundance estimation by least squares algorithms [6]. To improve the abundance estimation performance, prior information or constraints on the abundances have been introduced in many studies. Sparse regression is the most common one used in the unmixing task. In this framework, a sparsity constraint is enforced on the abundance vectors, due to the fact that, compared to the number of substances in the endmember library, only a small number of endmembers are active in any one pixel. For example, an ℓ1-norm sparse regularizer on abundance vectors is introduced in sparse unmixing by variable splitting and augmented Lagrangian (SUnSAL) [7], a stronger sparsity promotion using the ℓq-norm (0 ≤ q ≤ 1) is proposed in [8] and [9], and collaborative SUnSAL (CLSUnSAL) [10] introduces the ℓ2,1-norm to impose sparsity across all pixels.
In addition, many works have also incorporated the spectral/spatial correlation among pixels into sparse regression unmixing to further enhance the estimation performance [11], [12], [13], [14], [15], [16], [17]. Sparse unmixing by variable splitting, augmented Lagrangian, and total variation (SUnSAL-TV), described in [11], makes use of a total variation (TV) regularizer to encourage smoothness between each pixel and its four neighborhoods. Multiscale graphs are created as the graph regularization term constraining the unmixing model in [18]. In [19], a graph-based total variation regularization is imposed on the reconstructed pixels rather than on the abundances. The work in [12] incorporates a graph Laplacian regularization on the abundances to improve the smoothness of the abundance maps. Superpixel structures have gained interest in hyperspectral image unmixing as a way to exploit local spatial information. To alleviate the computational overhead of working on the entire image, for example, the work in [1] adopts a graph-based TV regularizer with superpixel construction. Borsoi et al. [20] proposed a multiscale unmixing algorithm (MUA), where the image is unmixed first on a coarse scale and then on the original scale. Based on MUA, Zhang et al. [21] introduced two weight factor matrices into the coarse unmixing model and the model on the original image, respectively, in order to exploit intersuperpixel information. In [22], [23], and [24], different reweighted ℓ1-norms on the abundances are introduced, with different global factors and local superpixel-based weighting factors to enforce sparsity of the abundance matrix. The authors in [25], [26], and [27] introduced low-rank constraints on the abundances in each superpixel.
However, due to the high spectral resolution of hyperspectral images and the fact that many of the aforementioned techniques are computationally intensive, the unmixing process becomes quite laborious in terms of computation. To speed up the processing, it is necessary to look into high-performance computing infrastructures. In recent years, practical implementations of hyperspectral imaging tasks, such as classification [28], [29], [30], detection [31], [32], [33], compression [34], [35], [36], and unmixing [37], [38], [39], have attracted attention to parallel computing techniques. Hardware devices used for this purpose range from field-programmable gate arrays (FPGAs) and multicore central processing units (CPUs) to GPUs. Among these candidates, GPUs have been widely used to accelerate hyperspectral imaging tasks, given their large number of concurrent threads and huge memory bandwidth [40]. For instance, Samat et al. [41] presented a GPU-accelerated CatBoost algorithm for hyperspectral classification. In [39], a full unmixing chain is implemented on multicore processors and GPUs. In [40], the vertex component analysis (VCA) and SUnSAL algorithms are combined and parallelized on GPUs. In [42], a fast algorithm with a high degree of parallelism for linearly unmixing hyperspectral images (pFUN) is proposed; this algorithm has also been implemented using OpenCL [43]. In [44], a parallel implementation of simplex identification for unsupervised unmixing is applied on GPUs based on CUDA.
In this article, we present a parallel implementation of graph-based spectral unmixing with superpixel construction. Specifically, adjacent pixels with similar spectral signatures are clustered into several subgraphs and, with the established subgraphs, the unmixing problem is solved with structural sparsity and graph regularization in each superpixel. Due to the high degree of parallelism exhibited by the unmixing algorithm, we develop a parallel GPU implementation using the NVIDIA CUDA platform. With the application of CUDA streams, calculations across superpixels are computed in parallel on the GPU and, with the ADMM-based solving algorithm, the computations within each superpixel are also parallelized in each stream. The contributions are summarized as follows.
1) To address the computational burden of working on an entire image, superpixels are constructed to benefit the graph-regularized sparse unmixing problem. Compared to spatial regularization within uniform blocks, the unmixing performance is enhanced because superpixel structures maximally benefit from the spectral homogeneity within irregular regions.

2) A GPU implementation of the proposed unmixing method is presented, where parallelism is achieved at both the intrasuperpixel level and the intersuperpixel level. At the intersuperpixel level, the specific challenges of processing irregular blocks are addressed by applying concurrent CUDA streams, which enables kernels to process different superpixels concurrently. In addition, the intrasuperpixel parallelism for matrix operations is carried out using the cuBLAS library and purpose-built kernels. Experimental results illustrate that the parallel scheme largely expedites the sequential implementation.

The rest of this article is organized as follows. Section II recalls related linear sparse regression unmixing works. Sections III and IV introduce the group sparse graph-regularized unmixing method and its parallel implementation using the NVIDIA CUDA framework, respectively. Section V evaluates the performance of the proposed algorithm and its parallel GPU-based implementation. Finally, Section VI concludes this article.

II. RELATED WORK
Suppose that the total number of pixels in the hyperspectral image is N and each pixel consists of L spectral bands. Let Y = [y_1, ..., y_N] ∈ R^{L×N} be the observed spectral matrix, M = [m_1, ..., m_r, ..., m_R] ∈ R^{L×R} be the endmember matrix, with R being the number of endmembers and m_r being the rth spectral signature vector, and X = [x_1, ..., x_n, ..., x_N] ∈ R^{R×N} be the abundance matrix. Other symbols are defined where they first appear.

A. Linear Sparse Unmixing
The LMM can be expressed as

Y = M X + Z,                                                          (1)

where Z ∈ R^{L×N} is the additive noise. Each column x_n of X is usually subject to two constraints, namely, the abundance nonnegativity constraint (ANC) and the abundance sum-to-one constraint (ASC). Under the sparse regression framework, the unmixing problem is generally formulated by incorporating the ℓ0-norm on the abundances

min_X (1/2) ||Y − M X||_F^2 + μ ||X||_0   s.t.  X ≥ 0,               (2)

where ||·||_F and ||·||_0 are the Frobenius norm and the ℓ0-norm, respectively, and μ is a nonnegative hyperparameter. The ASC is ignored as it is prone to strong criticisms in the literature [45]. As (2) is nonconvex and difficult to solve, it is common to replace the ℓ0-norm by the ℓ1-norm as its surrogate [7], leading to

min_X (1/2) ||Y − M X||_F^2 + μ ||X||_{1,1}   s.t.  X ≥ 0,           (3)

where ||X||_{1,1} = Σ_{n=1}^{N} ||x_n||_1, and optimization problem (3) is convex.
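As a concrete illustration, the LMM of (1) and the ℓ1-relaxed problem (3) can be simulated and solved with a few lines of NumPy. The sketch below is purely illustrative (a toy endmember matrix and a basic projected ISTA iteration, not the solver evaluated in this article; `ista_unmix` is a hypothetical helper name):

```python
import numpy as np

def ista_unmix(Y, M, mu=1e-3, n_iter=500):
    """Solve min_X 0.5||Y - MX||_F^2 + mu||X||_{1,1} s.t. X >= 0 by projected ISTA."""
    L_step = np.linalg.norm(M, 2) ** 2            # Lipschitz constant of the gradient
    X = np.zeros((M.shape[1], Y.shape[1]))
    for _ in range(n_iter):
        G = M.T @ (M @ X - Y)                     # gradient of the data-fidelity term
        X = np.maximum(X - (G + mu) / L_step, 0.0)  # soft-threshold + ANC projection
    return X

rng = np.random.default_rng(0)
M = np.abs(rng.normal(size=(50, 5)))              # toy endmember matrix (L=50, R=5)
X_true = np.zeros((5, 20))
X_true[1], X_true[3] = 0.7, 0.3                   # only 2 of 5 endmembers active (sparse)
Y = M @ X_true + 1e-4 * rng.normal(size=(50, 20)) # LMM observation with additive noise
X_hat = ista_unmix(Y, M)
```

Because only a few endmembers are active per pixel, the ℓ1 penalty drives the inactive abundance rows toward zero while the projection enforces the ANC.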

B. Graph-Regularized Sparse Unmixing
Based on the LMM, the graph-regularized sparse unmixing algorithm in [1] introduces a framework that regularizes the sparse unmixing with a graph structure, establishing subgraphs of the image with a superpixel segmentation technique. The unmixing problem is performed with a graph-based total variation regularization in each superpixel and is expressed as

min_{X^(i) ≥ 0} (1/2) ||Y^(i) − M X^(i)||_F^2 + μ ||X^(i)||_{1,1} + λ g(X^(i)),      (4)

where Y^(i) and X^(i) represent the observed matrix and the associated abundances within the ith superpixel, and μ and λ are two regularization parameters. The graph-regularized term g(X^(i)) measures the difference between pairs of abundance vectors weighted by the similarity of the pixels

g(X^(i)) = Σ_{n,m} A^(i)_{nm} ||x_n^(i) − x_m^(i)||_1 = ||X^(i) B^(i)||_{1,1},        (5)

where A^(i) is the adjacency matrix encoding the similarity between abundances of the pixels in each superpixel and B^(i) represents the incidence matrix of the subgraph [19]. In each superpixel, the binary adjacency matrix is

A^(i)_{nm} = 1 if ||y_n^(i) − y_m^(i)||_2^2 ≤ δ (n ≠ m), and 0 otherwise,             (6)

where the threshold δ is defined by the user. A pair of pixels is considered linked provided that the squared distance between their associated spectra is smaller than this threshold. Then problem (4) is written as follows:

min_{X^(1),...,X^(Ns) ≥ 0} Σ_{i=1}^{Ns} [ (1/2) ||Y^(i) − M X^(i)||_F^2 + μ ||X^(i)||_{1,1} + λ g(X^(i)) ],   (7)

where N_s is the number of superpixels after segmentation. The alternating direction method of multipliers (ADMM) algorithm is applied to solve problem (7).
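The graph construction of (6) for one superpixel can be sketched in NumPy as follows. This is an illustrative rendition (`build_graph` and the variable names are hypothetical, and the incidence convention of one ±1 column per edge is one common choice):

```python
import numpy as np

def build_graph(Y_sp, delta):
    """Binary adjacency A and incidence B for the pixels of one superpixel.

    Y_sp  : (L, N_i) spectra of the pixels inside the superpixel.
    A[n,m]: 1 if ||y_n - y_m||^2 <= delta (n != m), else 0.
    B     : one column per undirected edge, with +1/-1 on its two endpoints.
    """
    # pairwise squared spectral distances between all pixels of the superpixel
    d2 = np.sum((Y_sp[:, :, None] - Y_sp[:, None, :]) ** 2, axis=0)
    A = ((d2 <= delta) & ~np.eye(Y_sp.shape[1], dtype=bool)).astype(float)
    edges = np.argwhere(np.triu(A, k=1) > 0)      # count each undirected edge once
    B = np.zeros((Y_sp.shape[1], len(edges)))
    for e, (n, m) in enumerate(edges):
        B[n, e], B[m, e] = 1.0, -1.0
    return A, B
```

With this convention, B B^T equals the graph Laplacian D − A of the subgraph, which is why products with B^(i) implement the pairwise abundance differences penalized by the graph term.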

III. GROUP SPARSE GRAPH-REGULARIZED UNMIXING
In [1], subgraph components are first constructed by a superpixel segmentation method that jointly considers the spectral and spatial relations of pixels, and connections are then established in each subgraph. However, considering that pixels may share the same endmembers within a homogeneous superpixel, it can be beneficial to impose group sparsity within each superpixel instead of the ℓ1-norm regularizer, which is expressed by the ℓ2,1-norm. Such a strategy has often been used in the hyperspectral unmixing literature, as in [46]. Then, subproblem i may be written as

min_{X^(i) ≥ 0} (1/2) ||Y^(i) − M X^(i)||_F^2 + μ ||X^(i)||_{2,1} + λ ||X^(i) B^(i)||_{1,1},   (8)

where ||X||_{2,1} = Σ_{r=1}^{R} ||x^r||_2 and x^r denotes the rth row of X.

Algorithm 1: ADMM Solution for Problem (10).
We now present the solution to the subproblems in (8) via the ADMM algorithm. Introducing auxiliary variables V_1 to V_4 (associated with the data-fidelity term, the two regularizers, and the nonnegativity constraint), problem (8) can be rewritten in an equivalent constrained form (10), where ι_{R+}(X) = Σ_{n=1}^{N} ι_{R+}(x_n) is the indicator function such that ι_{R+}(x_n) is 0 when x_n is in the nonnegative orthant and +∞ otherwise. The augmented Lagrangian function couples the objective with the splitting constraints, where ρ is the penalty parameter and D_1, D_2, D_3, D_4 are the Lagrange multipliers. The ADMM algorithm is used to solve the optimization problem (10). The routine is summarized in Algorithm 1, where in step 5 soft_row(·) is the row-shrinkage thresholding operator [47] and in step 7 soft(·) denotes the shrinkage thresholding operator introduced in [48]. Algorithm 1 shows that the updates of X^(t+1) and V_3^(t+1) are the two most expensive steps. In each superpixel, the terms (2I + M^T M)^{-1} and (I^(i) + B^(i) B^(i)T)^{-1} are precomputed. Then, in each iteration, the computational complexity for updating X is O(RNL) and, assuming B^(i) ∈ R^{N×N_eg}, where N_eg denotes the number of edges in the graph, the computational complexity for updating V_3 is O(RN N_eg). The iterations stop when the maximum number of iterations is reached or the residual ||GX − VH||_F, defined from the stacked splitting constraints, falls below a preset tolerance. The convergence is illustrated in Section V-B.
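The two thresholding operators used in steps 5 and 7 of Algorithm 1 are standard proximal operators, and can be sketched in NumPy as follows (an illustrative rendition of the operators from [47] and [48], not the CUDA kernels developed later):

```python
import numpy as np

def soft(V, tau):
    """Entrywise shrinkage-thresholding: proximal operator of tau*||.||_{1,1}."""
    return np.sign(V) * np.maximum(np.abs(V) - tau, 0.0)

def soft_row(V, tau):
    """Row-wise shrinkage-thresholding: proximal operator of tau*||.||_{2,1}.

    Each row (one endmember's abundances across the superpixel) is shrunk
    toward zero as a group, so entire rows can vanish at once.
    """
    norms = np.linalg.norm(V, axis=1, keepdims=True)
    scale = np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0)
    return V * scale
```

Note that soft_row zeroes whole rows rather than single entries; this is exactly what encourages the pixels of a superpixel to share a common set of active endmembers.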

IV. PARALLEL IMPLEMENTATION ON GPU
This section proposes a parallel implementation of the presented algorithm (denoted by GSGRU) on the CUDA platform to leverage GPU acceleration. Compared with CPUs, GPUs are generally more efficient at processing large data blocks in parallel due to their highly parallel architecture [49]. Based on the CUDA platform, we either write our own kernels to achieve parallelism or use existing libraries that take advantage of the parallel architecture.

A. Algorithm Analysis
An examination of Algorithm 1 reveals that the algorithm can be parallelized at two levels: 1) the intrasuperpixel level and 2) the intersuperpixel level. First, within each superpixel, each iteration of the abundance estimation consists of a large number of matrix operations (multiplication, addition, inversion, and transposition). We use the cuBLAS library to parallelize these matrix operations; cuBLAS is an implementation of the basic linear algebra subprograms (BLAS) on top of the NVIDIA CUDA runtime that provides basic matrix computation functions executing on the device with a maximal degree of parallelism [50]. For the two shrinkage thresholding operators in lines 5 and 7, corresponding kernel functions are created that launch threads to achieve parallelism. In addition to the intrasuperpixel parallelization of matrix operators, segmenting the image into a number of nonoverlapping superpixels permits problem (8) to be decomposed into uncoupled subproblems that are also highly parallelizable. However, the irregular shapes of the superpixels pose a challenge: a new parallel mechanism must be developed rather than straightforwardly using the batched functions of the cuBLAS library (e.g., cublas<t>gemmBatched), since the latter only operate on a batch of identically sized matrices. To tackle this issue, CUDA streams are used in this article to enable multiple kernel functions to execute concurrently. In our problem, estimating the abundances for the N_s subproblems can be regarded as multiple tasks, and streams are used to overlap these different tasks.
The CUDA implementation of this algorithm is depicted in Fig. 1. Each superpixel block is associated with one stream. At the intrasuperpixel level, operations including graph construction and abundance estimation are implemented in parallel. At the intersuperpixel level, operations in different streams can run concurrently. In the following, we describe the details of each step of the parallelization accomplished by the GSGRU algorithm. It should be noted that setting the number of streams to N_s does not guarantee that all streams execute concurrently, as there is a hardware limit.
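The scheduling idea — independent, unequally sized superpixel tasks overlapped on a fixed pool of workers — can be mimicked on the CPU with a thread pool. The sketch below is only a conceptual Python analogy of the CUDA-stream scheme, not the GPU code itself; `unmix_superpixel` is a hypothetical stand-in for one per-superpixel solve:

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

def unmix_superpixel(task):
    """Stand-in for one per-superpixel abundance estimation (illustrative workload)."""
    i, Y_sp, M = task
    # nonnegative least squares via a few projected-gradient steps
    X = np.zeros((M.shape[1], Y_sp.shape[1]))
    step = 1.0 / np.linalg.norm(M, 2) ** 2
    for _ in range(100):
        X = np.maximum(X - step * (M.T @ (M @ X - Y_sp)), 0.0)
    return i, X

rng = np.random.default_rng(1)
M = np.abs(rng.normal(size=(30, 4)))
# superpixels of different sizes, like the irregular blocks handled by streams
tasks = [(i, np.abs(rng.normal(size=(30, n))), M) for i, n in enumerate([5, 17, 9])]
with ThreadPoolExecutor(max_workers=8) as pool:   # 8 workers ~ 8 concurrent streams
    results = dict(pool.map(unmix_superpixel, tasks))
```

As with streams, a pool larger than the number of tasks buys nothing, and unequal task sizes make some workers finish earlier than others, which mirrors the behavior observed in the profiler timelines discussed in Section V.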

B. GPU-Based GSGRU Implementation
In general, the CPU first receives the data and transfers them to the GPU. Kernel functions then invoke threads on the GPU. Meanwhile, CUDA streams are used to perform multiple kernel operations of different tasks simultaneously. Finally, the results are copied back from the GPU to the CPU. Algorithm 2 describes the process of the GPU implementation.
1) Graph Construction: Graph construction focuses on building the adjacency matrix A^(i) and the incidence matrix B^(i), both implemented on the GPU through the kernels kernel_A and kernel_B.

In general, this GPU implementation scheme can be applied to parallelize any unmixing algorithm involving similar superpixel processing: CUDA streams can be used at the intersuperpixel level, and the mathematical manipulations within each superpixel can also be accelerated with specific kernels. For example, implementations of the algorithms in [25] and [26] can be straightforwardly achieved based on this principle.

V. EXPERIMENTAL RESULTS
A series of experiments were conducted to evaluate the performance of the GSGRU algorithm on the CPU and GPU. The experimental setting and the hyperspectral datasets used in the experiments are given at the beginning. After that, we analyze the unmixing accuracy and the performance of the parallel implementation.

2) Hyperspectral Datasets: Two images are used for validating the proposed algorithm. The first image D1 is a synthetic image consisting of 100 × 100 pixels. Five endmembers are randomly picked from the pruned USGS spectral library with 216 spectral bands. The abundance map is created using the Synthesis tools package of the HYDRA project (HYperspectral Data Retrieval and Analysis) [51], where samples of the abundance image are generated with the Legendre method. The pure image is created using the LMM and corrupted by white Gaussian noise, with signal-to-noise ratios (SNRs) of 30 and 20 dB, respectively. Superpixels are constructed using the SLIC method from the vlfeat library [52] after applying PCA over the spectral dimension of the image. The false color image of D1 with SNR = 30 dB and the constructed superpixels are illustrated in Fig. 2. A total of 289 superpixels are created and, clearly, the spatial patterns in this image are well captured by the superpixel structures.
The real image tested is a subimage of the Cuprite scene, acquired over the Cuprite mining district (NV, USA) by AVIRIS. This subimage (denoted by Cuprite) consists of 250 × 191 pixels with 188 spectral bands. The number of endmembers is set to 12 and VCA is used to extract the endmembers. Fig. 3 illustrates the false color image of Cuprite and the created superpixels, 336 in total.

B. Performance Analysis of Unmixing Accuracy
The GSGRU algorithm is compared with SUnSAL [7], CLSUnSAL [10], SUnSAL-TV [11], GLUP-Lap [12], and MUA [20] in terms of unmixing performance. The parameter μ is the sparsity regularization parameter for all algorithms except MUA. λ_TV and λ_Lap represent the TV regularization parameter of SUnSAL-TV and the graph Laplacian regularization parameter of GLUP-Lap, respectively. MUA includes three parameters: λ_sp1 and λ_sp2 are sparsity parameters controlling sparsity at the coarse scale and the original scale, respectively, and β controls the fitting term between the coarse abundance matrix and the original abundance matrix. For comparison, the same superpixel segmentation as in GSGRU is used in MUA. Different values are tested for the sparsity and spatial regularization parameters μ, λ_sp1, λ_sp2 and λ_TV, λ_Lap, λ via a grid search over the range {1, 0.5, 0.1, 0.05, 0.01, 0.005, 0.001, 0.0005, 0.0001}; β is set to 10 and ρ is set to 0.05. In each superpixel, the binary adjacency matrix A^(i) used in GLUP-Lap and the proposed method is defined as in (6).

1) Convergence of GSGRU:
We first analyze the convergence of Algorithm 1 using D1 (SNR = 30 dB). Fig. 4 shows the residual ||GX − VH||_F as a function of the number of iterations in one superpixel. The maximum number of iterations is set to 200. We can see that the residual decreases quickly as the number of iterations increases and finally converges close to 0.
2) Unmixing Performance: To evaluate the performance of abundance estimation, the root mean square error (RMSE) is used, defined by

RMSE = sqrt( (1/(RN)) Σ_{n=1}^{N} ||x_n − x̂_n||_2^2 ),

where x_n and x̂_n are the true and estimated abundance vectors, respectively. The reconstruction error (RE) is also used as complementary information when ground-truth abundances are unavailable:

RE = sqrt( (1/(LN)) Σ_{n=1}^{N} ||y_n − ŷ_n||_2^2 ),

where y_n and ŷ_n are the observed pixel and the reconstructed pixel, respectively.

Table I reports RMSEs on D1 with the optimal parameter setting for each algorithm (the optimal parameters are given alongside the results). Increasing the SNR improves all algorithms, as expected. For SNR = 20 dB, the results of GSGRU are similar to those of MUA; for SNR = 30 dB, however, the RMSE of GSGRU is clearly lower than that of the other algorithms. Fig. 5 illustrates the ground truth and the estimated abundance maps for D1 with SNR = 30 dB. Compared to the ground truth, the results of GSGRU are noticeably better than those of the other algorithms, especially in the second column.

For Cuprite, we set μ and λ_sp1 to 0.001, λ_TV, λ_Lap, λ_sp2, and λ to 0.0001, and β in MUA to 0.1. Note that the image is divided into four parts for processing when applying the GLUP-Lap algorithm, as memory would run out when calculating the adjacency matrix due to the large size of Cuprite. Table II reports REs for the algorithms. Our algorithm exhibits a low RE, similar to SUnSAL, SUnSAL-TV, and MUA. However, GSGRU highlights localized targets while SUnSAL-TV tends to oversmooth the abundance maps in Fig. 6, especially for the Andradite material in the second row.
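The RMSE and RE metrics used above can be computed in a few lines of NumPy (a sketch of the standard definitions; the function names are our own):

```python
import numpy as np

def rmse(X_true, X_hat):
    """Root mean square error over an R x N abundance matrix."""
    R, N = X_true.shape
    return np.sqrt(np.sum((X_true - X_hat) ** 2) / (R * N))

def reconstruction_error(Y, M, X_hat):
    """RE between observed pixels and the pixels reconstructed by the LMM."""
    L, N = Y.shape
    return np.sqrt(np.sum((Y - M @ X_hat) ** 2) / (L * N))
```

As noted in the text, a low RE does not guarantee a low RMSE: very different abundance matrices can reconstruct the observations almost equally well, which is why RE is only complementary information.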
3) Sensitivity to Hyperparameters: The sensitivity to hyperparameters μ and λ of the proposed algorithm is discussed. In this experiment, D1 with SNR = 30 dB is tested. Fig. 7 shows RMSE results with respect to μ and λ. We can see that around the optimal parameters (μ = 0.0005/0.001 and λ = 0.005), the proposed algorithm shows the smallest RMSE.

Table II: REs for Cuprite for SUnSAL, CLSUnSAL, SUnSAL-TV, GLUP-Lap, MUA, and GSGRU, with δ for GLUP-Lap and GSGRU.

4) Number of Endmembers: For GSGRU, the influence of the number of endmembers is also analyzed using D1 with SNR = 30 dB. Table III gives RMSEs for different numbers of endmembers with the optimal hyperparameters (μ, λ). We can see that as the number of endmembers increases, the optimal μ increases, i.e., sparsity weighs more in the unmixing model, and the accuracy of the abundance estimation at convergence decreases. When the number of endmembers reaches 50, the algorithm needs more iterations to converge (Fig. 8) compared with Fig. 4.

C. Performance Analysis of the Parallel Implementation
This part focuses on the parallel performance analysis of the GSGRU algorithm via several experiments.
First, we verify that the kernels run concurrently in CUDA streams by creating 8 streams for 16 superpixels of D1 (with SNR = 30 dB). In this test, the number of iterations is set to 200. Fig. 9 shows the timeline of the memory copy from host to device, kernel_A (line 3 in Algorithm 2), and kernel_edge, which computes the edges for the incidence matrices B, as provided by the NVIDIA Visual Profiler. We can observe that 8 streams operate simultaneously on the first 8 superpixels and are then looped again for another 8 superpixels. Because the superpixel sizes differ, the processing time for each stream varies. Fig. 10 shows kernel_B (line 4 in Algorithm 2) in different streams as viewed in the profiler. The two narrow vertical bars in different colors (shown in each stream row close to kernel_B) in the timeline represent the kernels generating the identity matrices I^(i) and the matrices (I^(i) + B^(i) B^(i)T) for the 16 superpixels, respectively. In Fig. 11, a part of the kernels of the ADMM operations is displayed (lines 6 to 19 in Algorithm 2). In the timeline, each compact bar composed of kernels (lines 9-17 of Algorithm 2) indicates one iteration for one superpixel. It is also clear that the kernels overlap among streams.

1) Time Comparison With CPU:
In this part, we evaluate the performance of the GPU implementation by comparing its processing time to that of the CPU implementations on several datasets. Although it would be optimal to assign each superpixel to a stream, the number of streams is limited to 64 in this experiment due to hardware constraints.
We generate another two images using the true abundance maps of D1. The pure images are contaminated by white Gaussian noise with SNR = 30 dB, and the spatial sizes are extended to 200 × 200 and 500 × 500, respectively, via upsampling from the original image (100 × 100). For the 200 × 200 and 500 × 500 sizes, the numbers of constructed superpixels are 400 and 1764, respectively. For the real datasets, the Cuprite scene and another well-known hyperspectral image, the HYDICE Urban data, are tested. The Urban scene consists of 307 × 307 pixels with 210 spectral bands, reduced to 162 after removing the low-SNR bands [53]. The six endmembers are Asphalt, Grass, Tree, Roof, Metal, and Dirt. Fig. 12 illustrates the false color image of Urban and the constructed superpixels (441 in total), and Fig. 13 shows the ground truth and the abundance maps estimated by the GPU implementation. Table IV reports the running time (averaged over 5 runs) of the CPU and GPU implementations on different platforms for D1, Cuprite, and Urban. The speedup is calculated between the GPU and the CPU (C++) implementations. Note that the processing time on the GPU includes the data copies between the CPU and the GPU as well as the kernels' execution time. We also provide the running time in MATLAB, which is advantageous in matrix operations. As the image size increases, the performance of the GPU implementation becomes much more remarkable. In all cases, compared to the CPU (C++) version built in Visual Studio 2017 and the MATLAB version, the running time on the GPU is reduced, especially for D1 with 500 × 500 pixels and for Urban, where the speedups over the CPU (C++) reach 15× and 8×, respectively.
2) Number of Superpixels: In this part, we analyze how the number of superpixels impacts the performance of the algorithm. Different numbers of superpixels are obtained by presetting the superpixel size. For the D1 and Urban scenes, we can observe that as the number of superpixels increases, both CPU and GPU processing times decrease, along with the RMSEs. When the number of superpixels rises from 81 to 289 and from 441 to 784, respectively, the number of pixels within each superpixel decreases, so fewer threads are needed per kernel on the device, leaving the remaining threads available to overlap more kernels. Meanwhile, more superpixels lead to more homogeneous regions (highly similar spectra and texture) within every superpixel, which benefits group sparsity and thus explains the higher accuracy. However, for Cuprite, the RE remains unchanged, as it is not always directly related to accuracy, and the processing time rises slightly, which may be caused by the larger number of superpixels to be processed when their number reaches 192 in this case.
3) Number of Streams: As noted previously, the number of active streams is actually smaller than N_s because of hardware limits. We are still interested in how different numbers of streams affect the results. Table VI reports the running time of the GPU implementation with 16, 32, and 64 streams, respectively. As the number of streams increases, the running time is reduced only slightly in all cases rather than proportionally. It seems that activating 16 streams already reaches the concurrency capacity of the GPU.

VI. CONCLUSION
In this work, based on the graph-regularized sparse unmixing method with superpixel construction, we have developed the corresponding GPU implementation on the CUDA platform.
Unmixing in each superpixel not only largely reduces the computational burden of the analysis but also exhibits a high degree of parallelism. A two-level strategy covering the intrasuperpixel and intersuperpixel levels was proposed. On an NVIDIA GeForce GTX 1660, the experimental results demonstrated the efficiency of the GPU implementation compared to serial CPU versions based on the Eigen library and MATLAB. In the future, we will continue exploring strategies to optimize the parallel method and also take advantage of multiple GPUs.