Regularized Sparse Band Selection via Learned Pairwise Agreement

Inspired by sparse subset learning, this paper proposes a hyperspectral band selection method based on pairwise band agreement with a spatial-spectral graph regularizer, referred to as Regularized Band Selection via Learned Pairwise Agreement (RBS-LPA). The selection process is formulated as a graph-regularized, row-sparse constrained optimization problem that selects a few representative bands to encode all bands based on the learned pairwise band agreement. In RBS-LPA, a spatial-spectral informative graph, constructed from spatial-spectral neighbor relationships, is incorporated to encode both the spatial and the spectral geometric structure. Because the learning procedure and the graph regularizer are optimized jointly, the graph regularizer adapts to the sparse representative bands, which ensures that the selected bands preserve the local manifold structure well. Experimental results on three real urban hyperspectral datasets demonstrate the efficiency of the proposed RBS-LPA and its convincing performance.


I. INTRODUCTION
The advancement of remote sensors has made hyperspectral data collection easier and more informative than ever before, which makes hyperspectral imaging one of the most promising techniques for various applications, such as precision agriculture, environmental science, and military applications [1], [2]. Consisting of two spatial dimensions and one high-resolution spectral dimension, a hyperspectral image cube provides valuable observations at hundreds of contiguous and very narrow spectral bands. Intuitively, having more spectral bands implies more discriminative power in classification and recognition [3], [4]. However, this does not always hold in practice, because not all the features present in a high-dimensional data set help in class prediction. The presence of irrelevant and redundant features not only increases the computational cost and memory requirements but can also be detrimental to performance, owing to noise effects and an insufficient number of samples with respect to the ambient space dimension. This family of problems in the analysis of high-dimensional data is commonly referred to as the ''curse of dimensionality'' [5]-[8].
Finding low-dimensional structure in the data not only reduces the computational cost and memory requirements of algorithms, but also reduces the side effects of noise in high-dimensional data and improves performance on classification tasks [9]. High-dimensional data sets are inherently sparse and can therefore be projected into a lower-dimensional space without losing too much discriminative information; this projection operation is called dimensionality reduction (DR). In [10], Bioucas-Dias et al. noted that the high spectral resolution of hyperspectral data leaves the original spectral features highly redundant. In particular, adjacent bands are strongly correlated, and the number of original spectral features may be too high for classification purposes. In addition, the original spectral features may not be the most effective ones for separating the objects of interest from others. Therefore, DR techniques, often performed prior to practical hyperspectral tasks, which reduce the redundancy of the original space while preserving the informative and representative information about the objects of interest, have played a paramount role in both research and applications. Generally, DR techniques transform the original high-dimensional data into a smaller, more manageable representation of reduced dimensionality. Existing algorithms can be broadly categorized into two groups: feature extraction and feature selection. Feature extraction approaches create a new set of features from the original feature space by a transformation. Among them, principal component analysis (PCA) [11] and linear discriminant analysis (LDA) [11] are two well-known and widely used linear feature extraction algorithms.
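As a concrete illustration of the feature-extraction side (not taken from the paper), the following minimal sketch applies PCA to a toy ''pixels × bands'' matrix; all sizes are made up for the example. Note that each extracted feature is a mixture of all bands, which is exactly why extraction loses the physical interpretability that selection preserves.

```python
import numpy as np

# Illustrative sketch only: PCA as linear feature extraction on a toy
# "pixels x bands" matrix. Sizes are invented for the example.
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 50))          # 500 pixels, 50 spectral bands

# Center the data, then project onto the top-k principal components.
Xc = X - X.mean(axis=0)
_, _, Vt = np.linalg.svd(Xc, full_matrices=False)
k = 5
X_reduced = Xc @ Vt[:k].T               # each new feature mixes all bands

print(X_reduced.shape)                  # (500, 5)
```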
Feature selection methods, in contrast, select from the original feature set an informative, compact and accurate feature subset that is highly effective in discriminating classes, such as FAST [12] and ranking-based sparse SVM [13]. Compared with feature extraction, feature selection is not only easy to implement but also preserves the original semantic and physical properties of the features, thereby offering the advantage of interpretability by a domain expert.
Feature selection can be conducted in a supervised or an unsupervised mode, in terms of whether label information is used to guide the selection process [14]. Supervised feature selection methods generally require a large amount of label information, and in real hyperspectral scenarios supervised information is expensive to obtain. In this paper, we therefore consider only unsupervised feature selection algorithms, which are of great value when no prior knowledge of the observed phenomena is available. Recently, many unsupervised hyperspectral band selection methods have been proposed to support efficient practical applications [15]. Existing techniques can be broadly classified into five categories: ranking based [16]-[18], clustering based [19], [20], learning based [21], evolution selection based [22], and information metric based [23]. The main idea of ranking based methods is to find the most distinctive and informative bands. Various ranking based methods for unsupervised band selection appear in the literature, such as information divergence (ID) [17], maximum-variance principal component analysis (MVPCA) [16], and similarity-based band selection [18]. Clustering based methods [19] cluster the bands (treated as patterns) according to their correlation and select one band from each cluster to represent the whole group. Several clustering based band selection techniques for hyperspectral images exist in the literature, e.g., Ward's linkage strategy using divergence (WaLuDi) [19] or mutual information (WaLuMI) [19], and band selection using affinity propagation (AP) [20]. In addition, among learning based methods [15], [21], the multi-task sparsity pursuit (MTSP) learning framework was introduced to improve selection performance.
Some sparsity-inspired feature learning methods assume that the features can be expressed as a linear combination of only a few representative features, which formulates feature selection as a joint sparse recovery problem. For example, Li et al. proposed a sparse representation-based band selection method (SpaBS), which assumes that the bands of interest contain the maximum information [21]. These algorithms do not work well where the features lie in nonlinear models.
Although many unsupervised band selection approaches have been proposed for hyperspectral image classification, some issues still deserve further study. First, the pairwise relationships among samples should be preserved during the selection procedure. Second, the spatial information buried in the hyperspectral image should be exploited, which can be done by incorporating a local smoothness regularization stating that spatial-spectral neighbor pixels should have similar representative band subsets.
In this paper, we propose an unsupervised band selection algorithm that exploits the geometric structure of the image by combining a manifold regularizer with pairwise agreement, referred to as Regularized Band Selection via Learned Pairwise Band Agreement (RBS-LPA). The pairwise agreement between bands can be computed with a chosen metric, such as Euclidean distance, mutual information, or Kullback-Leibler divergence, or obtained through a learning process. In our work, the pairwise agreement is defined as refined low-rank representation coefficients under spatial relative position constraints. With the pairwise agreement encoding the relationships among the original bands, band selection can be implemented by solving a row-sparse, graph-regularized trace minimization problem with the Alternating Direction Method of Multipliers (ADMM) [25].
The key idea behind the proposed algorithm is the so-called self-expressiveness property of the data [26]: a few informative bands suffice to encode all the original bands well. We therefore assume that the original bands can be expressed as a linear combination of a few representative and informative bands, which leads to a sparse optimization problem of finding those representatives among the original bands. Meanwhile, the geometric information of the image is important for discrimination [27], and this structure can be exploited through the manifold assumption, which has been shown to be effective in classification and clustering tasks [28]-[30]. We therefore introduce a geometric spatial-spectral graph regularizer to exploit the locally invariant structure of the hyperspectral data [26], i.e., nearby pixels in the original feature space are likely to have similar embeddings, which yields a pixel-wise graph regularizer. We define an undirected graph to reveal this spatial-spectral geometric structure.
The contributions of the proposed method can be summarized as follows: (1) RBS-LPA improves on sparse representation by allowing models beyond linear subspaces; it does not require the hyperspectral bands to be sampled from underlying independent subspaces.
(2) The pairwise agreement learned via refined low-rank representation with spatial relative position can well encode the latent geometric structure of the spatial-spectral bands and is robust to band noise.
(3) The proposed algorithm combines the sparse learning mechanism with a manifold-structured regularizer: the graph spectral vectors and the sparse responsibility matrix are learned jointly, so the graph spectral vectors adapt to the sparse responsibility matrix and the selected representative bands are refined to preserve the graph structure.
(4) By incorporating the geometric information of the data space through a neighbor graph regularizer, we ensure that spatial-spectral neighbors have similar representative band subsets, which explicitly enforces local invariance.
The remainder of this paper is organized as follows. Section II first reviews work related to representative selection and then details the proposed RBS-LPA. Section III evaluates the effectiveness of the proposed method through experiments on three real urban hyperspectral datasets. Conclusions and discussion are given in Section IV.

II. THE PROPOSED METHOD
Given a hyperspectral data set X ∈ R^{d×N}, let x_k denote the k-th row of X, an N-dimensional vector corresponding to the k-th band of the given hyperspectral data. Without loss of generality, feature selection can be formulated as computing

S = S(W X),

where W ∈ R^{d×d} is a transform matrix whose non-representative rows are forced to zero, and S(·) denotes a selection operator that keeps the nonzero entries of its input; that is, the features/bands corresponding to the indices of the nonzero rows of W are selected. Hence the dimension of each column s_i of S equals the number of nonzero rows of W, i.e., the dimensionality of the low-dimensional space.
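The role of the selection operator S(·) can be sketched as follows; W here is a hand-made toy matrix with two nonzero rows, not a learned solution.

```python
import numpy as np

# Hedged sketch of the selection operator S(.): the bands kept are the
# indices of the nonzero rows of the transform matrix W.
d, N = 6, 100
rng = np.random.default_rng(1)
X = rng.normal(size=(d, N))             # rows of X are spectral bands

W = np.zeros((d, d))
W[[1, 4]] = rng.normal(size=(2, d))     # only rows 1 and 4 are nonzero

selected = np.flatnonzero(np.linalg.norm(W, axis=1) > 0)
S = X[selected]                          # the selected band subset

print(selected.tolist())                 # [1, 4]
print(S.shape)                           # (2, 100)
```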

A. A BRIEF REVIEW OF RELATED WORKS ON SPARSE REPRESENTATIVE SELECTION BASED ON PAIRWISE AGREEMENT
The performance of machine learning methods depends heavily on the choice of data representation (or features) to which they are applied. A good data representation should be compact, representative and informative. Inspired by the idea of DS3 [26], [31], we assume that the bands of the pixels can be well represented by only a few ''informative and representative'' bands, so that the informative bands form a compact representation of the complete original set. Given the hyperspectral data set X ∈ R^{d×N}, it is easy to construct a nonnegative pairwise agreement matrix A = [A_ij], i, j = 1, ..., d, between the bands. Each A_ij indicates how well the i-th band represents the j-th band, i.e., the larger A_ij is, the better x_i encodes x_j. Given the agreement matrix A, the goal of RBS-LPA is to find a small representative and informative subset of the bands of X that represents the entire X well. The agreement matrix can be computed from a predefined metric, such as Euclidean distance, or by a learning criterion. In our work, the agreement measure is obtained from low-rank representation coefficients refined by spatial relative position, because hyperspectral pixels are usually mixed owing to low spatial resolution and noise. To better handle the mixed hyperspectral data, we leverage the more general framework of nonnegative low-rank representation (LRR) to obtain the pairwise relationship between bands.
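For intuition, the following sketch builds one admissible *predefined* agreement matrix, using a Gaussian kernel on Euclidean distances between bands. This is a stand-in for the learned LRR-based agreement actually used in the paper; the sizes and the bandwidth choice are illustrative.

```python
import numpy as np

# Minimal sketch of a nonnegative pairwise agreement matrix A between
# bands (the paper learns A via nonnegative LRR; here a Gaussian kernel
# on band-to-band Euclidean distance serves as a simple stand-in).
rng = np.random.default_rng(2)
d, N = 8, 200
X = rng.normal(size=(d, N))             # rows are bands

# Squared Euclidean distances between every pair of bands.
sq = np.sum(X**2, axis=1)
dist2 = sq[:, None] + sq[None, :] - 2 * X @ X.T
dist2 = np.maximum(dist2, 0.0)          # guard tiny negative round-off

sigma2 = np.median(dist2)               # a common heuristic bandwidth
A = np.exp(-dist2 / sigma2)             # larger A_ij = stronger agreement
```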
The nonnegative LRR program is

min_A ||A||_*  s.t.  X^T = X^T A,  A ≥ 0,

where X^T serves as a ''dictionary'' that linearly spans the data space and ||·||_* denotes the nuclear norm. The optimal representation A, obtained via the Alternating Direction Method of Multipliers, can be regarded as the optimal pairwise representation of the bands. Since neighboring bands usually have similar responses, we further refine this relationship with a sigmoid function of the spatial relative position; Fig. 1 shows the spatial disagreement weight with respect to spatial relative position. We denote the refined agreement matrix by D.

We consider an optimization program with a responsibility variable Z_ij associated with each agreement D_ij, and denote the matrix of all variables by Z. Z_ij is regarded as the responsibility of x_i for being a representative band of x_j, and we restrict Z_ij ∈ [0, 1]. Notice that x_j can have multiple representative bands, in which case Z_ij > 0 for every x_i that represents x_j. We further require Σ_{i=1}^d Z_ij = 1 to ensure that the probability of x_j being represented by X equals 1. The problem of selecting a few bands to represent all bands according to the agreements can then be formulated as a row-sparsity regularized trace minimization problem on Z. First, we select representative bands that encode X well according to the agreement: if x_i is a representative band for x_j with responsibility Z_ij, the cost of encoding x_j is −D_ij Z_ij, so the total cost of encoding x_j using all of its representative bands is −Σ_{i=1}^d D_ij Z_ij. Second, we would like to use as few bands of X as possible: if a row of Z equals 0, the corresponding band does not contribute to the encoding. Putting these two goals together, the selection procedure can be formulated as the following optimization problem:

min_Z  λ Σ_{i=1}^d I(||Z_i||_q) − Σ_{i,j} D_ij Z_ij   s.t.  Z ≥ 0,  1^T Z = 1^T,   (3)

where ||·||_q denotes the l_q norm, Z_i denotes the i-th row of Z, and I(·) denotes the indicator function. The first term in (3) counts the number of selected bands and the second term corresponds to the total cost of encoding X via the selected representative bands. The parameter λ sets the trade-off between the two terms. Since the minimization of Eq.
(3) is NP-hard, we consider the following convex relaxation:

min_Z  λ Σ_{i=1}^d ||Z_i||_q − Σ_{i,j} D_ij Z_ij   s.t.  Z ≥ 0,  1^T Z = 1^T,   (4)

which can be rewritten in matrix form as

min_Z  λ ||Z||_{1,q} − Tr(D^T Z)   s.t.  Z ≥ 0,  1^T Z = 1^T,   (5)

where ||Z||_{1,q} = Σ_{i=1}^d ||Z_i||_q and 1 denotes a column vector whose elements all equal 1. Once the optimum is found, the representative band indices are the nonzero rows of the solution Z*. By optimizing the objective function in Eq. (5), we obtain a representative feature subset that approximates the original feature space well.

B. REGULARIZED BAND SELECTION VIA LEARNED PAIRWISE BAND AGREEMENT (RBS-LPA)
The optimization in Eq. (5) fails to discover the intrinsic geometric structure of the image data space, which is essential to real-world hyperspectral applications. One might further hope that the selected features preserve the intrinsic spatial structure of the hyperspectral cube. We assume that if two pixels x_i, x_j are close in the original feature space, then s_i, s_j, their representations in the new feature space, are also close to each other. This is usually referred to as the local smoothness assumption [32]. Recent studies on manifold learning have demonstrated that the local geometric structure can be effectively modeled by a neighbor relationship graph over the data points. Consider a graph with N vertices, each corresponding to a pixel. For each pixel x_i, we find its p spatial-spectral nearest neighbors and add an edge between x_i and each neighbor. In our work, we use a 15 × 15 window, since experiments reveal that the window size has little effect on the final classification, and set p = 6. Figure 2 illustrates the spatial-spectral neighbor relationship (a given pixel, its spatially connected region, and the dashed lines to its homogeneous neighbors). There are many ways to define the weight matrix G on the graph; the most commonly used are 0-1 weighting, Gaussian kernel weighting and dot-product weighting. In this framework, we use the Euclidean distance to measure the neighbor relationship. We then define the following regularization term:

R(S) = (1/2) Σ_{i,j} ||s_i − s_j||^2 G_ij = Tr(S^T Q S) − Tr(S^T G S) = Tr(S^T L S),
where we define the 0-1 weights as

G_ij = 1, if x_j is among the p spatial-spectral nearest neighbors of x_i;
G_ij = 0, otherwise,

with the neighborhood measured by the spatial Euclidean distance a_ij between x_i and x_j. Here Tr(·) denotes the trace of a matrix, Q is the degree matrix of G, a diagonal matrix whose entries are the column sums of G, Q_ii = Σ_j G_ij, and L = Q − G is the Laplacian matrix of the graph. By minimizing this regularization term, we expect that if two pixels x_i and x_j are close, their embeddings s_i, s_j are also close to each other. Combining this geometric regularizer with the sparse representative selection of Eq. (5) yields the objective function of Eq. (7), which augments Eq. (5) with the graph regularization term Tr(S^T L S). Eq. (7) can be solved within the Alternating Direction Method of Multipliers (ADMM) optimization framework [25]. The overall ADMM procedure introduces appropriate auxiliary variables into the optimization program and iteratively optimizes the Lagrangian with respect to each variable. We introduce an auxiliary variable matrix M ∈ R^{d×d} into Eq. (7) and reformulate it accordingly; the resulting iterative procedure is summarized in Table 1.
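The construction of G, Q and L and the value of the regularization term can be sketched as follows; the p-nearest-neighbor graph here is built on random toy pixels, not true spatial-spectral neighbors, and all sizes are illustrative.

```python
import numpy as np

# Sketch of the graph regularizer Tr(S^T L S) with L = Q - G,
# using a symmetrized p-nearest-neighbor 0-1 graph on toy pixels.
rng = np.random.default_rng(3)
n, p = 50, 6
pixels = rng.normal(size=(n, 10))

sq = np.sum(pixels**2, axis=1)
dist2 = sq[:, None] + sq[None, :] - 2 * pixels @ pixels.T

G = np.zeros((n, n))
for i in range(n):
    nbrs = np.argsort(dist2[i])[1:p + 1]     # skip self at position 0
    G[i, nbrs] = 1.0
G = np.maximum(G, G.T)                       # symmetrize the 0-1 weights

Q = np.diag(G.sum(axis=0))                   # degree matrix
L = Q - G                                    # graph Laplacian

S = rng.normal(size=(n, 5))                  # toy embedding of the pixels
reg = np.trace(S.T @ L @ S)                  # >= 0 since L is PSD
```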
The optimization problem of Step 1 in Table 1, in the case of q = 2, can be solved through a system of linear equations via the singular value thresholding operator in [29]. One can use matrix inversion to obtain Z with complexity O(d^3) when d is not very large; for large values of d, more efficient approaches such as the conjugate gradient method [31] should be employed. Since d is not very large in our formulation, Z is obtained by matrix inversion. In the case of q = ∞, the update reduces to a projection onto the l1 ball. For the proposed RBS-LPA, the computational complexity of updating Z and M is O(d^3); therefore, the total computational complexity of the RBS-LPA method is O(k d^3), where k is the number of ADMM iterations. The selected representative bands correspond to the indices of the nonzero rows of Z, and the selected band subset is then used to produce the final classification. In the proposed algorithm, the spatial-spectral graph structure and the representative bands are jointly optimized: the graph characterizes the spatial-spectral structure and adapts to the optimal representative bands. Thus the graph not only describes the manifold structure but also indicates the representative bands.
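Since Table 1 is not reproduced here, the following is a hedged sketch of an ADMM of the kind described, for the q = 2 relaxation of Eq. (5) and omitting the graph term for brevity. The particular splitting, the step size, and the `rbs_admm` name are assumptions of this sketch, not the paper's exact algorithm.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of a vector onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u - css / np.arange(1, len(v) + 1) > 0)[0][-1]
    theta = css[rho] / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def rbs_admm(D, lam=0.5, rho=1.0, iters=200):
    """Sketch ADMM for: min_Z lam*||Z||_{1,2} - Tr(D^T Z)
    s.t. Z >= 0, 1^T Z = 1^T, splitting Z (simplex columns)
    from an auxiliary M (row-sparsity prox), with Z = M."""
    d = D.shape[0]
    Z = np.full((d, d), 1.0 / d)
    M = Z.copy()
    U = np.zeros((d, d))
    for _ in range(iters):
        # Z-step: linear term + quadratic penalty, columns on the simplex.
        V = M - U + D / rho
        Z = np.apply_along_axis(project_simplex, 0, V)
        # M-step: row-wise group soft-thresholding (prox of lam*||.||_{1,2}).
        W = Z + U
        norms = np.linalg.norm(W, axis=1, keepdims=True)
        M = W * np.maximum(1.0 - (lam / rho) / np.maximum(norms, 1e-12), 0.0)
        # Dual update.
        U += Z - M
    return Z

rng = np.random.default_rng(4)
D = rng.random((10, 10))                     # toy nonnegative agreement
Z = rbs_admm(D)
selected = np.flatnonzero(np.linalg.norm(Z, axis=1) > 1e-3)
```

By construction, the returned Z has nonnegative entries and columns that sum to one, and the representative bands are read off the nonzero rows.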

III. EXPERIMENTS AND DISCUSSION
This section verifies the performance of the proposed method in hyperspectral classification by comparing the proposed framework with some classical and state-of-the-art unsupervised band selection methods on three datasets.

A. DATASETS
1) INDIAN PINES DATASET
The first scene was gathered by the AVIRIS sensor over the Indian Pines test site in north-western Indiana and consists of 145 × 145 pixels and 224 spectral reflectance bands in the wavelength range (0.4-2.5)×10^-6 meters. The Indian Pines scene contains two-thirds agriculture and one-third forest or other natural perennial vegetation. There are two major dual-lane highways, a rail line, as well as some low-density housing, other built structures, and smaller roads. The available ground truth is designated into sixteen classes, which are not all mutually exclusive. Figure 3 shows the true color image of this dataset overlaid with the ground truth map.

2) PAVIAU DATASET
This hyperspectral dataset was acquired in 2001 by the ROSIS instrument, flown over the city of Pavia, Italy. The scene, which includes 9 classes, is centered at the University of Pavia; some of the pixels in the image contain no information and have to be discarded before the analysis. The geometric resolution is 1.3 meters. After removing 12 bands due to noise and water absorption, the dataset comprises 103 spectral channels. A true color image and the ground truth map are shown in Figure 4.

3) KSC DATASET
This dataset was collected over the Kennedy Space Center (KSC), Florida, on March 23, 1996 by the NASA AVIRIS instrument. AVIRIS acquires data in 224 bands of 10 nm width with center wavelengths from 400-2500 nm. The KSC data, acquired from an altitude of approximately 20 km, have a spatial resolution of 18 m. After removing water absorption and low-SNR bands, 176 bands were used for the analysis. Discrimination of land cover in this environment is difficult owing to the similarity of the spectral signatures of certain vegetation types. For classification purposes, 13 classes representing the various land cover types occurring in this environment were defined for the site.

B. CONVERGENCE RESULTS
In practice, how fast the algorithm converges is crucial for overall computational efficiency. Hence we conducted experiments to test the convergence of the proposed algorithm. Figure 5 (a) and (b) give numerical results on the Indian Pines dataset showing the convergence behavior of the proposed iterative procedure. As seen from the figure, the objective of RBS-LPA converges to a fixed value in all cases, and the number of iterations required when q = ∞ is much smaller than when q = 2.

C. COMPARISONS WITH STATE-OF-THE-ART METHODS
In this part, we compare our method holistically with five state-of-the-art methods:
1) Clustering-based Band Selection (CBS) [19]. CBS uses a hierarchical clustering structure to group bands so as to minimize the intra-cluster variance and maximize the inter-cluster variance under a chosen information measure. We also compare with k-medoids, a clustering algorithm related to the k-means and medoid-shift algorithms, which attempts to minimize the distance between the points assigned to a cluster and a point designated as the center of that cluster; in contrast to k-means, k-medoids chooses data points as centers. The other method is Affinity Propagation (AP) [6], [31], [32], which takes as input measures of agreement between pairs of bands and exchanges messages between bands until a high-quality set of exemplars and corresponding clusters gradually emerges; we used AP to cluster the hyperspectral bands. AP tries to find representatives from pairwise similarities between data points using an approximate message passing algorithm. While it has been shown empirically that AP performs well in problems such as unsupervised image categorization [23], there is no guarantee that AP finds the desired solution and, in addition, it works only with a single dataset.
2) Maximum-Variance Principal Component Analysis (MVPCA) [16]: This method is based on the eigenvalue decomposition of the data matrix and selects the bands with higher variance.
3) Exemplar Component Analysis (ECA) [32]: ECA chooses exemplars as the selected bands, where an exemplar is a band that has a high local density and lies at a large distance from any band with higher density.
4) Sparsity-based band selection (SpaBS) [21]: This method casts band selection as a sparse representation of the hyperspectral image data through K-SVD, which decomposes the image data into the product of an overcomplete dictionary (or signature matrix) and a coefficient matrix. The coefficient matrix, which possesses the sparsity property, reveals how much each band contributes to forming the hyperspectral data. By computing the histogram of the coefficient matrix, the top K most frequently appearing bands are selected, serving the need for dimensionality reduction while preserving the physical meaning of the selected bands.

D. EVALUATION METRICS
Since the proposed RBS-LPA method is not tied to a specific classifier, we expect the bands selected by the proposed framework to perform well with various types of classifiers. To test this, we consider two widely used classifiers, K-Nearest Neighbors (KNN) and Support Vector Machine (SVM). The SVM classifier uses the radial basis function kernel, with the variance parameter and penalization factor estimated via cross validation. The Euclidean distance is used in the KNN classifier with the neighborhood size equal to 1 and 3.
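This evaluation protocol can be sketched with scikit-learn as below; the data, the labels, and the ''selected'' band indices are synthetic placeholders, not results from the paper.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# Hedged sketch: train 1-NN, 3-NN and an RBF-kernel SVM on a
# selected-band subset of toy labeled pixels. All values are synthetic.
rng = np.random.default_rng(5)
n, d = 300, 50
X = rng.normal(size=(n, d))
y = (X[:, 7] + X[:, 23] > 0).astype(int)     # toy 2-class ground truth

selected = [7, 23, 40]                       # hypothetical selected bands
Xs = X[:, selected]
Xtr, Xte, ytr, yte = train_test_split(Xs, y, test_size=0.5, random_state=0)

for name, clf in [("1-NN", KNeighborsClassifier(1)),
                  ("3-NN", KNeighborsClassifier(3)),
                  ("SVM", SVC(kernel="rbf", gamma="scale"))]:
    acc = clf.fit(Xtr, ytr).score(Xte, yte)
    print(name, round(acc, 3))
```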
In our experiments, the regularization parameters are associated with the number of selected bands. Figure 7 gives the classification results on the Pavia University scene with different classifiers; the accuracy curves are sketched by varying the number of selected bands from 1 to 30. As observed from Figure 7 (a-e), the proposed RBS-LPA dominates the competitors on all the employed classifiers. Figure 8 shows the selected band indices of the proposed method and its competitors for the Pavia University scene, produced by ECA [32], MVPCA [16], WaLuMI [19], WaLuDi [19] and our algorithm, respectively. As observed from the figures, our algorithm selects bands across the whole band space. In addition, the band indices selected by RBS-LPA (q = ∞) and RBS-LPA (q = 2) differ in only 3 of the 20 bands. Figure 9 gives the classification results on the KSC scene with different classifiers, again varying the number of selected bands from 1 to 30. As observed from Figure 9 (a-c), the proposed RBS-LPA dominates the competitors when the number of bands is small, but performance is basically flat relative to the other competitors once the number of bands increases to 20. Figure 10 shows the classification map and numerical results with 10 selected bands when 20 samples are labeled; the proposed method achieves the best results among the compared methods and better spatial consistency.

IV. CONCLUSION
This paper proposes an effective sparsity-induced band selection method named RBS-LPA. By refining the learned pairwise agreement with spatial relative position, the final agreement encodes the intrinsic agreement between bands well. Given the pairwise agreement between bands, band selection can be treated as a representation learning process. Compared with sparsity-based band selection, we do not make any assumption about the independence between bands, and the band information and pixel information are exploited jointly to find a few representative bands that efficiently encode the target bands.