Hyperspectral Band Selection Using Attention-Based Convolutional Neural Networks

Hyperspectral imaging has become a mature technology which brings exciting possibilities in various domains, including satellite image analysis. However, the high dimensionality and volume of such imagery are a serious problem which needs to be faced in Earth Observation applications, where efficient acquisition, transfer and storage of hyperspectral images are key factors. To reduce the time (and ultimately cost) of transferring hyperspectral data from a satellite back to Earth, various band selection algorithms have been proposed. They are built upon the observation that for a vast number of applications only a subset of all bands conveys the important information about the underlying material, hence we can safely decrease the data dimensionality without deteriorating the performance of hyperspectral classification and segmentation techniques. In this paper, we introduce a novel algorithm for hyperspectral band selection that couples new attention-based convolutional neural networks, used to weight the bands according to their importance, with an anomaly detection technique which is exploited for selecting the most important bands. The proposed attention-based approach is data-driven and re-uses convolutional activations at different depths of a deep architecture to identify the most informative regions of the spectrum. Also, it is modular, easy to implement, seamlessly applicable to any convolutional network, and can be trained end-to-end using gradient descent. Our rigorous experiments, performed over benchmark sets and backed up with statistical tests, showed that the deep models equipped with the attention mechanism are competitive with the state-of-the-art band selection techniques and can work orders of magnitude faster; they deliver high-quality classification, and consistently identify significant bands in the training data, permitting the creation of refined and extremely compact sets that retain the most meaningful features.
Also, the attention modules do not deteriorate the classification abilities, and slow down neither training nor inference of the deep models.


I. INTRODUCTION
This work was funded by European Space Agency (HYPERNET project). P. Ribalta, L. Tulczyjew, and J. Nalepa are with Silesian University of Technology, Gliwice, Poland. L. Tulczyjew, M. Marcinkiewicz, and J. Nalepa are with KP Labs, Gliwice, Poland. * Corresponding author: jnalepa@ieee.org
Current advancements in sensor technology bring exciting possibilities in hyperspectral satellite imaging (HSI), which is being actively applied in various domains, including precision agriculture, surveillance, military, land cover applications, and more [1]. It captures a wide spectrum of light for each pixel; such detailed information can be effectively exploited in HSI classification (assigning a class label to each pixel) and segmentation (determining the boundaries of objects of a given class in an input HSI) [2]. However, the high dimensionality of hyperspectral data is an important challenge for its efficient analysis, transfer, and storage. There are two approaches for dealing with such noisy, almost always imbalanced, and often redundant data. Feature extraction algorithms (with principal component analysis and its variations being the mainstream [3]) generate new low-dimensional descriptors from hyperspectral images, whereas feature selection techniques retrieve a subset of all HSI bands carrying the most important information. Although the former approaches can be applied to reduced HSI sets, they are generally exploited to process raw HSI data; they are computationally expensive, can suffer from band noisiness, and may not be interpretable [4]. Band selection techniques are divided into filter (unsupervised) and wrapper (supervised) algorithms. Applied before classification, filter approaches do not require ground-truth data to select specific bands [5]-[7].
They, however, suffer from several drawbacks: (i) it is difficult to select the optimal dimensionality of the reduced feature space, (ii) band correlations are often disregarded, leading to data redundancy [4], (iii) bands which might be informative when combined with others are removed, and (iv) noisy bands are often labeled as informative due to low correlation with other bands. Wrapper approaches use the classifier performance as the objective function for optimizing the subset of HSI bands [8], [9]. Although band selection alleviates the computational burden of the HSI analysis, wrapper algorithms themselves induce serious computational overhead. In this work, we mitigate this problem, and incorporate the selection process into the deep network training. Such approaches have not been explored in the literature so far. Deep learning (DL) has enabled unprecedented achievements and established the state of the art in a plethora of domains, including HSI analysis [10]. In general, the HSI segmentation algorithms encompass conventional machine learning techniques which require feature engineering [11], and DL approaches [12]. DL can conveniently elaborate spectral features [13] or both spectral and spatial features without any user intervention. These features are intrinsically extracted by the deep nets operating on the full HSI. Therefore, we need to face the aforementioned challenges concerning the high HSI dimensionality in both conventional and DL-powered segmentation approaches. Attention mechanisms allow humans and animals to effectively process enormous amounts of visual stimuli by focusing only on the most informative chunks of data. An analogous approach can be applied in DL to localize the most informative parts of an input image to focus on. We build upon the painless attention mechanism which is trained during the network's forward-backward pass [14], and exploit it in our convolutional architectures for HSI band selection.
To the best of our knowledge, attention mechanisms have been used neither for this purpose, nor for HSI segmentation before.
In this letter, we introduce a new HSI band selection method (Section II) which exploits attention-based convolutional neural networks (CNNs). The goal of this system is to learn which bands convey the most important information, as an outcome of the training process, alongside a ready-to-use deep model. Thus, our method is an embedded approach: the generation of attention heatmaps is embedded into the CNN training. These heatmaps quantify the importance of specific parts of the spectrum, and they are later processed using an anomaly detection algorithm. We build upon our observation that only a (very) small subset of all bands within an original HSI convey the important information, and these bands can be seen as outliers (the other bands, which are in the majority, are not informative). The contribution of this work is multi-fold:
- We introduce a new HSI band selection algorithm (Section II) which couples attention-based CNNs and anomaly detection (Section II-A) to find the most important bands.
- We introduce attention-based CNNs to extract attention heatmaps that show which parts of the spectrum are important during the training. Our CNNs are spectral, and use the spectral information during the classification. However, they could be potentially extended by incorporating convolutional layers which would operate in the spatial dimension, as the attention modules are topology-agnostic.
- We performed a rigorous experimental study (Section III) to: (i) compare our technique with the state of the art in HSI band selection, (ii) verify the impact of band selection on various supervised learners, (iii) understand the impact of appending the attention modules to our CNNs, and (iv) verify the statistical significance of the results.
arXiv:1811.02667v3 [cs.CV] 9 Jan 2020

II. METHOD
1) General overview of the deep network architecture: In the attention-based CNNs for HSI (Fig. 1), an attention module is inserted after each max-pooled activation of a convolutional layer $Z^l$ ($l$ denotes the depth within the network topology, and $l \geq 1$), in order to reduce the computational burden of the attention mechanism. This module is composed of two elements: an attention estimator, extracting the most important regions of a feature map, and a confidence gate, producing a confidence score for the prediction. We can easily modify the number of building blocks (BBs) in our CNNs: each BB encompasses the one-dimensional (1D) convolution followed by the non-linearity, batch normalization, and 1D max pooling layer (we operate only in the spectral dimension, hence both types of the layers are one-dimensional), alongside the attached attention module. We exploit the rectified linear unit (ReLU) as a non-linearity, which outputs zero for any negative input $x$, and returns the value of $x$ otherwise. Hence, it can be formally written as $\mathrm{ReLU}(x) = \max(0, x)$. In this work, we experimentally analyzed the attention-based CNNs with two, three, and four BBs (Section III).
Each layer in our CNN is parameterized with the corresponding hyper-parameter values: the number of kernels n, together with the size of the kernels in the convolutional layers, and the size of the pooling windows (both sizes are denoted as k in Fig. 1 for brevity). The kernel size is kept constant for all convolutional layers (k = 5, unit stride, and the padding of 2). Similarly, the pooling window size is kept unchanged (k = 2 with the stride of 2) in all max pooling layers. The number of kernels n in the convolutional layers decreases in the deeper BBs (Fig. 1) to reduce the complexity of the model, alongside its memory footprint. We expect that the shallower convolutional layers will be able to extract low-level features based on the spectral information, as they should be manifested in specific parts of the spectrum.
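As a quick check of how these hyper-parameters shrink the spectral dimension, the following minimal sketch applies the standard output-length formulas for 1D convolution and pooling. It assumes floor rounding in the pooling layers (not stated in the text) and uses the band counts reported in Section III-B; the function names are illustrative only.

```python
def conv1d_out_len(length, kernel=5, stride=1, padding=2):
    """Output length of a 1D convolution (standard formula)."""
    return (length + 2 * padding - kernel) // stride + 1


def pool1d_out_len(length, kernel=2, stride=2):
    """Output length of a 1D max pooling layer without padding (floor rounding assumed)."""
    return (length - kernel) // stride + 1


def spectral_lengths(num_bands, num_blocks):
    """Spectral length after each building block (convolution, then pooling)."""
    lengths = []
    length = num_bands
    for _ in range(num_blocks):
        length = pool1d_out_len(conv1d_out_len(length))
        lengths.append(length)
    return lengths


# Band counts from the dataset descriptions: 204 for Salinas Valley
# (224 minus 20 removed bands) and 103 for Pavia University.
salinas = spectral_lengths(204, 4)
pavia = spectral_lengths(103, 4)
print(salinas, pavia)
```

With k = 5, unit stride, and a padding of 2, the convolution preserves the spectral length, so only the pooling layers reduce it (roughly halving it per BB). This is consistent with the later remark that the deepest model cannot be trained on very small band subsets.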
In Table I, we present the dimensionality of the input and output tensors for the selected operations in our deep architecture (annotated as A, B, C, and D in Fig. 1; b is the number of bands in the input HSI). The dimensions of the corresponding steps in the deeper parts of the CNN can be calculated analogously. The details of the C and D steps (in the attention module) are discussed in the following sections.
Fig. 1. In attention-based CNNs, features at different levels $Z^l$ are processed to generate the attention heatmaps, and they are used to output (i) a class hypothesis based on the local information, and (ii) a confidence score $c^l$. The final output is the softmaxed weighted sum of the attention estimators, and the output of the network's classifier (here, an artificial neural network, ANN, with two hidden layers containing 512 and 128 neurons with ReLU).
2) Attention estimator: The attention estimator module encompasses the 1D convolution with one kernel applied (therefore, n = 1; in [14], the number of kernels is greater). The kernel size is equal to the number of feature maps extracted by the corresponding BB. This kernel performs the dimensionality reduction (all feature maps are reduced to one), and it is followed by the ReLU activation and a spatial softmax (across all entries within the feature map) to produce the final heatmap. We do not softmax separate confidence scores, as suggested in [14], to decrease the computational burden.
The attention estimator learns the following embedding: $\mathbb{R}^{b \times n} \rightarrow \mathbb{R}^{b \times 1}$, where $b$ and $n$ denote the number of HSI bands and the number of feature maps, respectively. The attention estimator merges all feature maps (FMs) at depth $l$ into a single one. The estimator builds a heatmap $\hat{Z}^l$, which is used to normalize each map in $Z^l$ (the set of all activation maps at the level $l$), i.e., the attention heatmap is exploited to highlight the importance of each entry in each map. The hypothesis $H^l$ of the output space given its local information is produced: $H^l = \mathrm{AvgPool}(\hat{Z}^l \odot Z^l)$, where $\odot$ is the element-wise product. The number of activation maps $Z^l$ at a given level $l$ is variable (i.e., 96 after the first BB, 54 after the second BB, and 36 after the third BB), and this normalization is applied to each of them; they are then average-pooled to produce the hypothesis $H^l$. It is used by a linear classifier to predict the label of the input sample.
3) Confidence gate: The local features are very often not enough to output a high-quality class hypothesis. Thus, we couple each attention module with the network's output to predict the confidence score $c$ by means of an inner product with the gate weight matrix $W_c$ (at the $l$-th level). The network's output is the softmaxed weighted sum of the attention estimators and the output of the classifier $o_{net}$ multiplied by its confidence score $c_{net}$. The softmax function converts a real-valued score $x$ into a probability value $p$ in the multi-class classification. Thus, a vector of such scores $x \in \mathbb{R}^C$ is converted into a vector of probabilities $p \in [0, 1]^C$, where $p_i$ is the probability of an input HSI pixel belonging to the $i$-th (out of $C$) class:
$$p_i = \frac{\exp(x_i)}{\sum_{c=0}^{C-1} \exp(x_c)}.$$
The output vector $p$ is a probability vector, therefore it is non-negative, $\sum_{c=0}^{C-1} p_c = 1$, and it can be used to predict the class label for each HSI pixel.
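The linear classifier, the confidence gate, and the final output are described above in words; one plausible way to write them compactly is given below. This is only a sketch under assumptions: the symbols $o^l$, $W^l$ and $\hat{o}$ are introduced here for illustration, and any non-linearity or normalization applied inside the gate is not specified in the text and is therefore omitted.

```latex
\begin{align}
o^l &= H^l W^l
  && \text{(local class scores from the hypothesis } H^l\text{)} \\
c^l &= o_{net} W_c^l
  && \text{(confidence score at level } l\text{)} \\
\hat{o} &= \mathrm{softmax}\Big( c_{net}\, o_{net} + \sum_{l} c^l\, o^l \Big)
  && \text{(softmaxed weighted sum)}
\end{align}
```

Read this way, each attention module contributes its local class scores in proportion to its confidence, and the network's own classifier output $o_{net}$ enters the sum with the weight $c_{net}$.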
To better understand the influence of the dimensionality reduction in the deeper parts of our attention-based CNNs, see an illustrative attention-based CNN example in the Supplementary Material¹ (Fig. 1SM).
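The attention estimator described above can be illustrated with a small NumPy sketch. It is not the authors' implementation: it assumes that the single kernel acts as a weighted sum over the n feature maps at each spectral position, uses random weights instead of trained ones, and the names `attention_module`, `kernel`, and `W_cls` are hypothetical.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1D array."""
    e = np.exp(x - np.max(x))
    return e / e.sum()


def attention_module(Z, kernel, W_cls):
    """Toy attention estimator for one building block.

    Z      : (b, n) activations, b spectral positions and n feature maps.
    kernel : (n,) weights of the single kernel merging the feature maps (assumed form).
    W_cls  : (n, C) weights of a hypothetical linear classifier.
    """
    merged = np.maximum(Z @ kernel, 0.0)  # one map of length b, followed by ReLU
    heatmap = softmax(merged)             # softmax across all b entries of the map
    weighted = heatmap[:, None] * Z       # element-wise product with every feature map
    H = weighted.mean(axis=0)             # average pooling gives the hypothesis, shape (n,)
    return heatmap, H @ W_cls             # heatmap and local class scores, shape (C,)


rng = np.random.default_rng(0)
b, n, C = 102, 96, 16  # illustrative sizes only
Z = rng.normal(size=(b, n))
heatmap, scores = attention_module(Z, rng.normal(size=n), rng.normal(size=(n, C)))
probs = softmax(scores)
print(heatmap.shape, scores.shape)
```

The heatmap sums to one over the spectral positions, so larger entries mark the regions of the spectrum that the module emphasizes; this is the quantity later passed to the anomaly detector.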

A. Selection of HSI bands as anomaly detection
We exploit an Elliptical Envelope (EE) algorithm to extract the most discriminative bands based on the final attention heatmap. Since the number of important bands should be low, they can be understood as an anomaly in the input set (Fig. 1SM and Fig. 2). In EE, the data is modeled as a Gaussian distribution with covariances between feature dimensions (here, the entries of the heatmap). The input tensor to EE is b × 1, and an ellipse which covers the majority of the data is determined. The samples which lie outside of this ellipse are anomalous [15]. EE utilizes a fast minimum covariance determinant estimator [16], where the data is divided into non-overlapping sub-samples for which the mean (µ) and covariance matrix in each feature dimension (C) are calculated. The Mahalanobis distance D for each sample x is:
$$D = \sqrt{(x - \mu)^{T} C^{-1} (x - \mu)},$$
and the samples with the smallest values of D are retained. In EE, the fractional contamination rate (λ) defines how much data in the analyzed dataset should be selected as anomalies. These data samples (i.e., spectral bands) are selected as important in our band selection technique; they are assigned significantly larger attention values in the heatmap.
¹ We use SM to refer to entities included in the Supplementary Material.
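The selection step can be sketched in a few lines of plain Python. This is a deliberate simplification, not the EE algorithm itself: µ and C are the ordinary sample mean and variance of the b × 1 heatmap rather than robust minimum covariance determinant estimates, the number of flagged bands is simply the contamination rate times b, and the heatmap is synthetic. A robust implementation is available, for example, as `EllipticEnvelope` in scikit-learn.

```python
import math
import statistics


def select_bands(heatmap, contamination):
    """Return sorted indices of the bands flagged as anomalous (i.e., important)."""
    mu = statistics.fmean(heatmap)
    var = statistics.pvariance(heatmap, mu)
    # In one dimension, D = sqrt((x - mu)^2 / C) = |x - mu| / sqrt(C).
    distances = [abs(x - mu) / math.sqrt(var) for x in heatmap]
    n_selected = max(1, round(contamination * len(heatmap)))
    ranked = sorted(range(len(heatmap)), key=lambda i: distances[i], reverse=True)
    return sorted(ranked[:n_selected])


# Synthetic heatmap over 100 bands: near-uniform attention with three clear peaks.
heatmap = [0.005 + 0.0001 * (i % 7) for i in range(100)]
for band, value in ((12, 0.15), (47, 0.20), (80, 0.12)):
    heatmap[band] = value
total = sum(heatmap)
heatmap = [v / total for v in heatmap]  # normalize so that the entries sum to one

selected = select_bands(heatmap, contamination=0.03)
print(selected)
```

The bands with the largest distances are the outliers of the heatmap, which matches the idea that the few bands receiving much larger attention values are the ones to keep.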

III. EXPERIMENTS

A. Experimental setup
In all experiments, we perform Monte-Carlo cross-validation and randomly divide each HSI dataset (Section III-B) 30 times into balanced (with under-sampling) training (T) and validation (V) sets, and the unseen test sets (Ψ). These sets encompass 80%, 10%, and 10% of all pixels in the HSI, respectively, and they never overlap. Since we analyze only spectral segmentation, this training-validation-test division does not cause a training-test information leak [2]. The T and V sets are used during the CNN training, whereas Ψ is utilized to quantify the generalization of the trained models. We report per-class and average accuracy (AA), and the values of Cohen's kappa: $\kappa = 1 - \frac{1 - p_o}{1 - p_e}$, where $p_o$ and $p_e$ are the observed and expected agreement (assigned vs. correct class label), respectively, and $-1 \leq \kappa \leq 1$ (the higher, the better). All the measures are averaged across all 30 runs.
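To make the reported measures concrete, the sketch below computes Cohen's kappa and the average accuracy from a confusion matrix. The matrix is hypothetical and serves only as an illustration; AA is taken here as the mean of the per-class accuracies, which is the common definition and an assumption about the exact variant used in the experiments.

```python
def cohen_kappa(confusion):
    """Cohen's kappa from a square confusion matrix (rows: true, columns: predicted)."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    p_o = sum(confusion[i][i] for i in range(n)) / total  # observed agreement
    row_sums = [sum(row) for row in confusion]
    col_sums = [sum(confusion[i][j] for i in range(n)) for j in range(n)]
    p_e = sum(r * c for r, c in zip(row_sums, col_sums)) / total ** 2  # expected agreement
    return 1 - (1 - p_o) / (1 - p_e)


def average_accuracy(confusion):
    """Mean of the per-class accuracies (fraction of each true class labeled correctly)."""
    return sum(row[i] / sum(row) for i, row in enumerate(confusion)) / len(confusion)


# Hypothetical 3-class confusion matrix, for illustration only.
cm = [[40, 5, 5],
      [4, 20, 6],
      [1, 5, 14]]
kappa = cohen_kappa(cm)
aa = average_accuracy(cm)
print(round(kappa, 4), round(aa, 4))
```

For this matrix, the observed agreement is 0.74 and the expected agreement is 0.365, so kappa is noticeably lower than the raw accuracy because it discounts agreement that would occur by chance.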

B. Datasets
We focused on two imbalanced multi-class HSI benchmarks: Salinas Valley (acquired using the NASA Airborne Visible/Infrared Imaging Spectrometer, AVIRIS, sensor), and Pavia University (Reflective Optics System Imaging Spectrometer, ROSIS, sensor). AVIRIS registers 224 contiguous bands with wavelengths in a 400 to 2450 nm range (visible to near-infrared), with 10 nm bandwidth, and it is calibrated to within 1 nm. ROSIS collects the spectral radiance data in 115 bands in a 430 to 850 nm range (4 nm nominal bandwidth).
1) Salinas Valley: This set (217 × 512 pixels) was captured over Salinas Valley, California, USA, with a spatial resolution of 3.7 m. The image shows different sorts of vegetation (16 classes). The original data contains 224 bands; however, 20 bands were removed by the authors of this set due to either atmospheric absorption or noise contamination (see https://tiny.cc/grsl).
2) Pavia University: This set (340 × 610 pixels) was captured over Pavia University, Italy, with a spatial resolution of 1.3 m. It shows an urban scenery (9 classes) with 103 bands, as the 12 noisiest bands (out of 115) were removed by its authors.

C. Selection of bands using the proposed algorithm
We extracted bands from the benchmark HSI using our attention-based CNNs followed by the anomaly detection (Table II). Although the contamination rate is a hyperparameter of our method, the differences, in terms of the number of selected bands, across different λ values are small. Our technique drastically decreased the number of HSI bands for all datasets, and for all λ's (less than 14% and 9% of bands were selected as important for λ = 0.01 for Salinas and Pavia, which amounts to 28 and only 9 bands, respectively). The average attention scores for Salinas and Pavia are visualized in Fig. 2SM. There are several attention peaks for Salinas Valley indicating the most meaningful part of the spectrum. Although for Pavia University there are fewer such clearly selected bands, some parts of the spectrum are definitely more distinctive than the others. This experiment showed that our method retrieves consistent attention scores annotating the most important bands, and that it is data-driven.

D. Influence of attention modules on classification
This experiment verifies whether applying attention modules in a CNN has any impact on its performance and convergence. For each set, we trained the deep networks with and without attention using original HSI data. The CNNs without attention are referred to as CNN-2, CNN-3, and CNN-4 (two, three, and four convolutional-pooling blocks, as depicted in Fig. 1).
The average per-class accuracy scores (averaged across 30 executions) for Salinas and Pavia are gathered in Tables 1SM and 2SM. The differences between the architectures are not statistically significant (i.e., CNN-2 compared with CNN-2A, CNN-3 with CNN-3A, and CNN-4 with CNN-4A), according to the Wilcoxon tests at p < 0.01. Therefore, attention modules did not adversely impact the performance of the CNNs; they allow for building a high-quality model and selecting the most important bands at once. Deeper CNNs delivered more stable results (std. dev. of the accuracy over Ψ decreased from 0.007 to 0.005 for Salinas, and from 0.03 to 0.01 for Pavia). On the other hand, we can observe only minor improvements in the performance when more BBs are appended. It shows that the shallower models can extract high-quality features using just two convolutional-pooling blocks. The same observation can be drawn from Figs. 3SM-4SM, where we render the kappa scores for Salinas and Pavia. There are classes (C8 and C15 for Salinas, and C1, C2 and C8 for Pavia) which are "difficult" for all classifiers (Tables 1SM-2SM). In both cases, it is observed for the most numerous classes, and it can be attributed to the fact that they are under-sampled while creating the balanced training sets. Therefore, the sampled examples are not representative.
The average number of epochs before convergence, and the average processing time of a single epoch are presented in Figs. 5SM-6SM. Appending attention or adding BBs increases neither of them, hence attention modules can be considered a seamless CNN extension to enhance its operational ability.

E. Classification accuracy over reduced datasets
We evaluated the performance of state-of-the-art models trained using full and reduced sets. They included Support Vector Machines (SVMs), Random Forests (RFs), and Decision Trees (DTs). We additionally executed grid search to optimize the hyper-parameters of all models: C and γ of the radial-basis kernel function in SVMs, number of trees in RFs, minimum samples per leaf in DTs, and minimum samples in a split in both RFs and DTs. The training with grid search was repeated 30 times (Monte-Carlo cross-validation). Table 3SM shows that reducing the number of HSI bands helps shorten the grid-search time, which can easily become large for full sets. Hyper-parameter optimization is not necessary in our CNNs.
The average-accuracy results gathered in Tables 1SM-2SM show that for most of the classes, the performance of the classifiers is not diminished by our band selection. Although there are classes for which the accuracy decreased (e.g., C2 and C3 in Pavia), the differences for other classes are negligible, especially for CNNs for λ ≥ 0.03. This is confirmed by the Wilcoxon tests executed to analyze the differences between models trained with different datasets (with and without reduction). Although the differences in AA of the classifiers trained with the reduced numbers of bands are statistically significant (at p < 0.01), they are not as dramatic as in other band selection algorithms [18]. Note that CNN-4A could not be trained for a very small number of bands because of the dimensionality reduction in pooling layers.
The inference time of all investigated learners was very short. Reducing the number of bands decreased the total inference time of all examples in Ψ which amounted to approx. The decrease in the time was analogous for Pavia. Also, we do not report the times for CNNs without attention as they were practically the same as for the attention-based CNNs.

F. Comparison with the state of the art
We compare our algorithm with other state-of-the-art techniques. For the sake of thoroughness, we took into consideration both filter and wrapper approaches. As a filter algorithm, we implemented the mutual information-based method (MI) [19]. In [19], the authors used the estimated reference maps to calculate the mutual information. Since this map should be estimated using available knowledge about the spectral signatures of the materials encountered within the scene, the lack of them may lead to incorrect maps. For fair comparison, we used the original ground-truth information instead of such estimated reference maps, which can render over-optimistic results for this method (i.e., our MI implementation is "handicapped" by the availability of ground truth). As the wrapper approaches, we selected two modern algorithms: a multi-objective immune algorithm (BOMBS) [20], and the Improved Classification Map (ICM) algorithm [21], in which the authors assess the quality of selected subsets of HSI bands using the pixel-wise classification map enhanced by edge-preserving filtering. We extracted the same number of bands as in Table II.
The kappa scores are presented in Fig. 7SM: for virtually all classifiers, they consistently grow for all techniques with the increase of the number of extracted bands, and ultimately converge to the same values. As already mentioned, the results for MI may be over-optimistic, as we utilize the entire ground-truth information to extract the important bands. Hence, we "leak" the information across the training and test sets because the training-validation-test splits are created after the band selection step, and before training a supervised learner. The execution times (Table 4SM) show that our technique is orders of magnitude faster when compared with the wrapper algorithms (BOMBS and ICM) while delivering competitive classification results. Also, the number of bands selected in all methods was set according to our contamination factors; if we did not know the desired number of bands, we would have to execute each method in a grid search-like manner, and it would drastically increase their running time.

IV. CONCLUSION
We proposed new attention-based CNNs coupled with anomaly detection for selecting bands from HSI. The experimental validation showed that the proposed algorithm extracts important bands from HSI, and allows us to obtain state-of-the-art accuracy using only a fraction of bands (14-19% for Salinas, and 9-27% for Pavia). Overall, it revealed that:
- Attention-based CNNs deliver high-quality classification, and adding attention modules does not impact classification abilities and training time of an underlying CNN.
- Attention-based CNNs extract the most informative bands in HSI during the training in an embedded approach.
- Selected bands can be used to identify relevant and discard unimportant parts of the spectrum, drastically shortening training times of a classifier, and compressing HSI without sacrificing the amount of conveyed information.
- Our technique is applicable to any HSI set and any CNN.
- Our technique is competitive with the state-of-the-art approaches, and works orders of magnitude faster.