Deep Reinforcement Learning for Band Selection in Hyperspectral Image Classification

Band selection refers to the process of choosing the most relevant bands in a hyperspectral image. By selecting a limited number of optimal bands, we aim at speeding up model training, improving accuracy, or both. It reduces redundancy among spectral bands while trying to preserve the original information of the image. By now, many efforts have been made to develop unsupervised band selection approaches, of which the majority are heuristic algorithms devised by trial and error. In this paper, we are interested in training an intelligent agent that, given a hyperspectral image, is capable of automatically learning a policy to select an optimal band subset without any hand-engineered reasoning. To this end, we frame the problem of unsupervised band selection as a Markov decision process, propose an effective method to parameterize it, and finally solve the problem by deep reinforcement learning. Once trained, the agent has learned a band-selection policy that guides it to sequentially select bands by fully exploiting the hyperspectral image and previously picked bands. Furthermore, we propose two different reward schemes for the environment simulation of deep reinforcement learning and compare them in experiments. This, to the best of our knowledge, is the first study that explores a deep reinforcement learning model for hyperspectral image analysis, thus opening a new door for future research and showcasing the great potential of deep reinforcement learning in remote sensing applications. Extensive experiments are carried out on four hyperspectral data sets, and experimental results demonstrate the effectiveness of the proposed method.

This is the preprint version. To read the final version, please go to IEEE Transactions on Geoscience and Remote Sensing. The code is publicly available.
Index Terms-Deep reinforcement learning, deep Q-network, hyperspectral band selection, hyperspectral image classification, neural network, unsupervised learning.

I. INTRODUCTION
In remote sensing, spectral sensors are widely used for Earth observation tasks, such as land cover classification [1]-[15], anomaly detection [16]-[20], and change detection [21]-[33]. A hyperspectral image often comprises hundreds of spectral bands within and beyond the visible spectrum. Such an image can be regarded as a hypercube, providing rich spectral information that helps to identify various land covers. This high dimensionality, however, also raises some issues, e.g., a high level of redundancy among spectral bands, high computational overheads, and large storage requirements. Therefore, it is beneficial to reduce data redundancy.
In the literature, two kinds of methodologies, namely feature extraction [34], [35] and band selection [36]-[55], are commonly used to reduce redundancy in hyperspectral images. The former transforms original hyperspectral data into a lower dimension via a linear or nonlinear mapping. For example, [34] makes use of independent component analysis (ICA) to extract features from a hyperspectral image in an unsupervised way. In [35], the authors investigate a supervised feature extraction approach based on linear discriminant analysis (LDA). Moreover, several works put effort into using manifold learning algorithms, e.g., Laplacian eigenmaps (LE) [56], locally linear embedding (LLE) [57], and isometric feature mapping (Isomap) [58], to learn low-dimensional features by taking advantage of the underlying geometric structure of hyperspectral data. On the other hand, band selection refers to the process of choosing a cluster of informative spectral bands and discarding ones that are often not discriminative enough for the considered problem. Unlike feature extraction, band selection can keep the physical meaning of original hyperspectral images and be better interpreted for certain tasks [52]. Hence in this paper, we are interested in hyperspectral band selection. Band selection is applicable to tasks as diverse as hyperspectral image classification, change detection, and anomaly detection. In this work, we use classification tasks to validate the effectiveness of selected bands.
From the perspective of the availability and use of labeled data, band selection methods are grouped into the following three categories: unsupervised, semi-supervised, and supervised. Semi-supervised and supervised models exploit labeled samples to learn a band selection strategy. Such labeled data, however, are often not available in practical remote sensing applications. Hence unsupervised band selection is more desirable in the community. In this direction, the existing methods can be roughly sorted into the following categories:
• Others. Some hybrid approaches, e.g., combining ranking and clustering [46]-[48], are proposed for band selection tasks. Furthermore, sparse learning, low rank representation, and deep learning also provide new insights [45], [49], [50].
In essence, hyperspectral band selection can be treated as a combinatorial optimization problem. The aforementioned methods that use exact and heuristic algorithms have proven to be effective for such a task. However, these heuristic algorithms are devised based on domain knowledge from human experts by trial and error. Hence we are curious as to whether this heuristic design procedure for unsupervised band selection tasks can be automated using artificial intelligence techniques. If feasible, there would be much to be gained. Reinforcement learning systems are trained from their own experience, in principle allowing them to operate in tasks where human expertise is lacking and thus making them suitable for discovering new band selection methods without any hand-engineered reasoning. Recently, deep reinforcement learning, which introduces deep learning into reinforcement learning, has demonstrated breakthrough achievements in various fields [59]-[63]. In this paper, we propose a framework that can solve the problem of unsupervised band selection using deep reinforcement learning. This work's novel contributions are in the following aspects:
• We cast the problem of unsupervised hyperspectral band selection as a Markov decision process of an agent and then solve this problem with a deep reinforcement learning algorithm. To the best of our knowledge, this is the first study that makes use of deep reinforcement learning for the task of band selection.
• We propose an effective solution to parameterize the Markov decision process for optimal band selection. More specifically, for the agent, we devise the set of actions, the set of states, and an environment simulation tailored for this task.
• We propose two reward schemes, namely information entropy and correlation coefficient, for unsupervised hyperspectral band selection.
• We train a deep reinforcement learning model using a Q-network to learn a band-selection policy whose effectiveness has been validated extensively with various data sets and classifiers.

Fig. 1. An overview of the proposed deep reinforcement learning model for unsupervised hyperspectral band selection. In the training phase, an intelligent agent (Q-network) interacts with a tailored environment in order to learn a band-selection policy by trial and error. Specifically, the Q-network takes as input the state representation encoding selected bands and outputs a vector in which each component is the Q-value of one band. In the test phase, the agent selects bands according to the learned policy.

We organize the remainder of this paper as follows. Hyperspectral band selection is detailed in Section I. Section II introduces the proposed model. Section III tests the proposed model and presents experimental results as well as the discussion. Lastly, the paper is concluded in Section IV.

II. METHODOLOGY
Let us consider a hyperspectral image with L bands. Our goal is to select K optimal bands to reduce redundancy. The number of all possible combinations is $\binom{L}{K}$. For example, with L = 200 and K = 30, this number is about $4 \times 10^{35}$. In this work, we first formulate the task as a Markov decision process, as detailed in Section II-A. Afterwards, a deep reinforcement learning model is used to solve this problem (see Section II-B). Section II-C discusses implementation details.
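As a quick check of the size of this search space, the count of K-band subsets can be evaluated exactly with Python's standard library. This is only an illustrative calculation; the variable names are ours.

```python
import math

# Number of ways to choose K bands out of L (order does not matter).
L, K = 200, 30
n_subsets = math.comb(L, K)

# Print in scientific notation to compare with the order of magnitude in the text.
print(f"{n_subsets:.2e}")
```

An exhaustive search over this many subsets is clearly out of reach, which motivates the sequential formulation that follows.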

A. Problem Formulation: Band Selection as a Markov Decision Process
We view the task of hyperspectral band selection as a sequential forward search process, i.e., a sequential decision-making problem of an agent which interacts with a tailored environment (cf. Fig. 1). To be more specific, the agent needs to decide which spectral band it should pick at each time step so that it can find an optimal combination of K bands in K steps, and during this procedure, the agent explores the environment through actions and observes rewards and states. In this paper, we cast this problem as a Markov decision process that offers a formal framework for modeling the procedure of sequential decision-making when outcomes are partially uncertain.
A 5-tuple (A, S, P, R, γ) is often used to define a Markov decision process [64]. Here A denotes the set of all actions, S is the set of all countable or uncountable states, P : S × A → P(S) represents the Markov transition function, R : S × A → P(ℝ) is the distribution of immediate rewards of state-action pairs, and γ ∈ (0, 1) denotes a discount factor. Specifically, upon taking an action a ∈ A at a state s ∈ S, the probability distribution of the next state is given by P(·|s, a), and R(·|s, a) describes the distribution of the immediate reward for the chosen action. In what follows, we detail how we parameterize the Markov decision process for our case.
Action. The action of the agent in our case is to choose a spectral band from the hyperspectral image at each time step. The complete set of all actions C is identical to the set of bands, i.e., C = {1, 2, ..., L}. Let B be a set consisting of actions that have been taken before. Then the actual set of actions for the current time step is A = C \ B. During the training phase, an action a ∈ A is taken by the agent and subsequently sent to the environment, and the latter receives the action, evaluates it, and gives the agent a positive or negative reward. In the test phase, the agent acts according to a learned policy to sequentially select bands.
State. The state s in our case is represented as the action history of the agent and is denoted as an L-dimensional vector with multi-hot encoding that records which actions have been taken (i.e., which spectral bands have been chosen) in the past. For example, s_i = 1 means that the i-th band has been picked in previous time steps, while s_i = 0 means that it is still selectable. Taking the action history as the state implies dependencies among spectral bands, which helps to select the next band. Note that there exists a one-to-one correspondence between s and B.
Transition. The transition function P treats the next state as the outcome of taking an action at a state. In this work, the transition function is deterministic, which means that the next state is fully specified by each state-action pair. Specifically, P updates the state by changing the action history as follows:

$$B' = B \cup \{a\}, \tag{1}$$

where B' represents the set of selected bands associated with the next state s'.
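The action, state, and transition definitions above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, and bands are indexed from 0 here rather than from 1 as in the text.

```python
def initial_state(num_bands):
    """Multi-hot state with no band selected yet."""
    return [0] * num_bands


def available_actions(state):
    """Actual action set A = C \\ B: bands whose entry in the state is still 0."""
    return [i for i, picked in enumerate(state) if picked == 0]


def transition(state, action):
    """Deterministic transition: mark the chosen band as selected."""
    if state[action] == 1:
        raise ValueError("band already selected")
    next_state = list(state)  # copy so the previous state stays intact
    next_state[action] = 1
    return next_state


# Select bands 2 and 4 out of 6 (0-based indices).
state = initial_state(6)
for band in (2, 4):
    state = transition(state, band)

print(state)                     # [0, 0, 1, 0, 1, 0]
print(available_actions(state))  # [0, 1, 3, 5]
```

Because the state is just the multi-hot record of B, the set of selected bands can always be read back from it, which is the one-to-one correspondence noted above.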
Reward. The reward function R should be in proportion to the advancement that the agent makes after picking a specific band. In this work, we discuss two ways to instantiate our reward scheme and measure the improvement from one state to another in our setup. They are detailed as follows.
• Information entropy: Information entropy quantitatively measures the amount of information carried by a random variable. Hence we make use of it to evaluate the richness of spectral information of bands. More specifically, denoting by x_i ∈ ℝ^N the i-th band vector, we calculate the mean information entropy of the selected bands, i.e., the entropy of x_i averaged over all i ∈ B, where B is associated with the state s (Eq. (2)). When the agent takes an action a and moves from state s to s', the reward R(s, s') is calculated from these mean entropies (Eq. (3)).
• Correlation coefficient: The correlation coefficient measures how strong the relationship between two variables is. Here, we use it to estimate inter-band correlations among selected bands. There are several types of correlation coefficients, and we exploit a commonly used one, Pearson's correlation, a.k.a. Pearson's R, to calculate the mean correlation coefficient for s (Eq. (4)). The agent is then rewarded according to Eq. (5).

Intuitively, Eq. (3) and Eq. (5) state that the reward is positive if the quality of selected bands is improved from state s to state s', and negative otherwise. Driven by this reward scheme, the agent pays a penalty for choosing a non-informative band and is rewarded for adding a band that increases the informative content of the whole set of selected bands. We quantitatively compare the above two instantiations of the reward scheme in Section III-C. The information entropy and correlation coefficient are two commonly used metrics to assess the quality of bands selected by an unsupervised band selection model [42], [50], which is the reason why we consider them as the reward scheme. Furthermore, we believe that more alternatives are possible and may improve results in the future.
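To make the two reward schemes concrete, the sketch below shows one plausible way to compute them. It rests on several assumptions that are ours rather than the paper's: entropy is estimated from a fixed-width histogram with `num_bins` bins, the mean correlation is the average absolute Pearson coefficient over distinct pairs of selected bands (0 for a single band), and each reward is the plain difference between the quality of the next state and that of the current state, signed so that an improvement is positive.

```python
import math
from collections import Counter


def entropy(band, num_bins=16):
    """Shannon entropy (bits) of a band, estimated from a histogram (assumed estimator)."""
    lo, hi = min(band), max(band)
    width = (hi - lo) / num_bins or 1.0  # avoid a zero bin width for constant bands
    counts = Counter(min(int((v - lo) / width), num_bins - 1) for v in band)
    n = len(band)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


def pearson(x, y):
    """Pearson correlation coefficient; 0.0 if either band is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx > 0 and sy > 0 else 0.0


def mean_entropy(bands, selected):
    return sum(entropy(bands[i]) for i in selected) / len(selected)


def mean_abs_correlation(bands, selected):
    pairs = [(i, j) for i in selected for j in selected if i < j]
    if not pairs:
        return 0.0  # convention assumed here for fewer than two bands
    return sum(abs(pearson(bands[i], bands[j])) for i, j in pairs) / len(pairs)


def entropy_reward(bands, selected, action):
    """Positive if adding `action` raises the mean entropy of the selection."""
    before = mean_entropy(bands, selected) if selected else 0.0
    return mean_entropy(bands, selected + [action]) - before


def correlation_reward(bands, selected, action):
    """Positive if adding `action` lowers the mean absolute correlation."""
    return (mean_abs_correlation(bands, selected)
            - mean_abs_correlation(bands, selected + [action]))


# Toy data: band 1 is a scaled copy of band 0; band 2 is almost constant.
bands = [
    [0, 1, 2, 3, 4, 5, 6, 7],
    [0, 2, 4, 6, 8, 10, 12, 14],
    [3, 3, 3, 3, 3, 3, 3, 4],
]
print(entropy_reward(bands, [0], 2))      # negative: a low-entropy band is added
print(correlation_reward(bands, [0], 1))  # negative: a redundant band is added
```

In this toy example both rewards penalize the poor choice, which matches the intuition stated above; other estimators or normalizations would change the magnitudes but not this qualitative behavior.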

B. Deep Reinforcement Learning for Band-selection Policy
In Section II-A, we discuss the parameterization of the Markov decision process for our task. By doing so, the band selection task is transformed into a sequential decision-making problem. Next, we show how we use deep reinforcement learning to learn a band-selection policy in this setup.
The policy we seek is an action-value function, denoted by Q(s, a), that specifies the action a to be taken when the current state is s. Based on this function, the agent chooses the action which is associated with the maximum reward value. That is to say, in our task, bands with high information entropy or low correlation are expected to be chosen. Q-learning [65], a classical reinforcement learning algorithm, is often employed to approximate Q(s, a) by iteratively updating the action-selection policy using the Bellman equation:

$$Q(s, a) = r + \gamma \max_{a'} Q(s', a'), \tag{6}$$

where r denotes the immediate reward and the second term, $\gamma \max_{a'} Q(s', a')$, is a discounted future reward. In Q-learning, a lookup table, termed Q-table, serves as the Q-function for the agent to query the best action. However, this becomes impractical when action and state spaces are very large. To tackle such a problem, in this paper, we exploit a network named Q-network to approximate the action-value function.

Training procedure (excerpt):
      add the experience (s, a, r, s') into M;
      s ← s';
   end
   randomly sample a mini-batch B from M;
   for all (s, a, r, s') ∈ B do
      calculate the learning target according to Eq. (8): y = r + γ max_{a'} Q(s', a'; w);
   end
   carry out a gradient descent step on L w.r.t. w according to Eq. (9);
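For intuition, the tabular form of Q-learning can be sketched as follows. This is a generic textbook-style update, not code from the paper: the learning rate `alpha`, the value of `gamma`, and the use of a `frozenset` of selected bands as the table key are all assumptions made for illustration.

```python
from collections import defaultdict


def q_learning_update(q_table, state, action, reward, next_state,
                      next_actions, alpha=0.1, gamma=0.9):
    """Move Q(s, a) a step of size alpha towards r + gamma * max_a' Q(s', a')."""
    future = max((q_table[(next_state, a)] for a in next_actions), default=0.0)
    target = reward + gamma * future
    q_table[(state, action)] += alpha * (target - q_table[(state, action)])
    return q_table[(state, action)]


# States are keyed by the set of already selected bands.
q_table = defaultdict(float)
s, s_next = frozenset(), frozenset({3})
new_value = q_learning_update(q_table, s, 3, reward=1.0,
                              next_state=s_next, next_actions=[0, 1, 2])
print(new_value)
```

With L bands, such a table would need an entry for every subset of bands and every remaining action, which illustrates why a lookup table becomes impractical and a function approximator is used instead.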
Q-network architecture. The Q-network takes as input the state representation introduced in Section II-A and outputs a vector in which each component is the Q-value of one action. A detailed description of the Q-network we use is as follows. The input is an L-dimensional vector. The first fully connected layer has 2L units, followed by rectified linear units (ReLUs) [66]. The second fully connected layer has the same structure as the first layer, again followed by ReLUs. Finally, the last layer is a linear fully connected layer with L units. The structure of the Q-network is outlined in Table I.
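The layer sizes described above (L inputs, two hidden layers of 2L units with ReLU, and a linear output of L units) can be illustrated with a dependency-free forward pass. This is only a sketch of the shape of the network: the weight initialization (zero biases and Gaussian weights scaled by the fan-in) is our assumption, and a practical implementation would use a deep learning framework.

```python
import random


def init_layer(n_in, n_out, rng):
    """Random weights (assumed fan-in scaled Gaussian) and zero biases."""
    scale = (2.0 / n_in) ** 0.5
    weights = [[rng.gauss(0.0, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out


def dense(x, layer, relu):
    """Fully connected layer, optionally followed by a ReLU."""
    weights, biases = layer
    out = [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, biases)]
    return [max(0.0, o) for o in out] if relu else out


def build_q_network(num_bands, seed=0):
    """Layer sizes L -> 2L -> 2L -> L, as described in the text."""
    rng = random.Random(seed)
    sizes = [num_bands, 2 * num_bands, 2 * num_bands, num_bands]
    return [init_layer(a, b, rng) for a, b in zip(sizes[:-1], sizes[1:])]


def q_values(network, state):
    """Map a multi-hot state to one Q-value per band; the last layer is linear."""
    x = [float(v) for v in state]
    for idx, layer in enumerate(network):
        x = dense(x, layer, relu=idx < len(network) - 1)
    return x


network = build_q_network(num_bands=8)
print(len(q_values(network, [0, 1, 0, 0, 1, 0, 0, 0])))  # 8
```

The output has one entry per band, so selecting the next band amounts to taking the arg-max over the entries of bands that are still available.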
Q-network learning. The Q-network is learned by minimizing the following mean squared Bellman error:

$$L(w) = \mathbb{E}\left[\left(y - Q(s, a; w)\right)^2\right], \tag{7}$$

where w represents network weights, and y is the one-step ahead learning target:

$$y = r + \gamma \max_{a'} Q(s', a'; w). \tag{8}$$

From Eq. (8), it can be seen that the target is composed of the immediate reward r and a discounted future reward.
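The learning target and the mean squared Bellman error can be computed for a mini-batch as in the sketch below. Here `q_fn` stands for any callable that maps a state to a list of Q-values (a hypothetical stand-in for the Q-network), the value of `gamma` is arbitrary, and restricting the maximum to the still-available actions of the next state is our assumption.

```python
def learning_target(q_fn, reward, next_state, next_actions, gamma):
    """One-step target y = r + gamma * max_a' Q(s', a'); just r if no action is left."""
    if not next_actions:
        return reward
    return reward + gamma * max(q_fn(next_state)[a] for a in next_actions)


def bellman_mse(q_fn, batch, gamma=0.9):
    """Mean of (y - Q(s, a))^2 over a mini-batch of experiences."""
    errors = []
    for state, action, reward, next_state, next_actions in batch:
        y = learning_target(q_fn, reward, next_state, next_actions, gamma)
        errors.append((y - q_fn(state)[action]) ** 2)
    return sum(errors) / len(errors)


# A dummy Q-function that predicts 0.5 for every band.
def constant_q(state):
    return [0.5] * len(state)


batch = [([0, 0, 0], 1, 1.0, [0, 1, 0], [0, 2])]
loss = bellman_mse(constant_q, batch)
print(loss)  # (1.0 + 0.9 * 0.5 - 0.5) ** 2, i.e. about 0.9025
```

A gradient-based optimizer would then adjust the network weights to reduce this quantity on each sampled mini-batch.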
Ideally, the prediction of the current action-selection policy, Q(s, a; w), coincides with the learning target y. In practice, however, the learning of the Q-network for estimating the action-value function tends to be unstable. Therefore, in deep Q-learning, several techniques are used to address this problem, and they are detailed below.
Experience replay. Here an experience refers to a 4-tuple (s, a, r, s'). Consecutively generated experiences in our model are highly correlated with each other, and this could result in unstable and inefficient learning, which is also a notorious problem in Q-learning. One solution to make the learning converge is to collect and store experiences in a replay memory; during the training phase of the Q-network, mini-batches are randomly drawn from this replay memory and used for the Q-network training. This method has the following advantages:
• One experience can be potentially used for many gradient descent steps, which improves data efficiency.
• Randomizing experiences breaks correlations among consecutive samples and therefore reduces the variance of gradient descent steps and stabilizes the learning of the network.
Exploration-exploitation. To train the Q-network, we use an ε-greedy policy, which means the agent either chooses actions at random with a probability ε or takes the best actions relying on the already learned band-selection policy with a probability 1 − ε. The learning of the Q-network starts with a relatively large ε, which is then gradually decayed. The main idea behind this policy is that the agent is encouraged to try as many actions (i.e., various band combinations) as possible to begin with before it starts to see patterns. When it does not select actions at random, given a state, the agent is able to estimate the reward for each action. Thus the best action leading to the highest reward can be picked. Moreover, note that the ε-greedy policy of our model is carried out on the actual action set A, instead of the complete action set C.
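The two techniques can be sketched together as follows. The class and function names are hypothetical, and the linear decay schedule is an assumption, since the text only states that ε starts large and is gradually decayed.

```python
import random
from collections import deque


class ReplayMemory:
    """Fixed-size buffer of experiences; the oldest entries are dropped first."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def add(self, experience):
        self.buffer.append(experience)

    def sample(self, batch_size, rng=random):
        """Uniformly sample a mini-batch without replacement."""
        return rng.sample(list(self.buffer), min(batch_size, len(self.buffer)))


def epsilon_greedy(q_values, state, epsilon, rng=random):
    """Pick a band from the actual action set A (entries of `state` equal to 0)."""
    candidates = [i for i, picked in enumerate(state) if picked == 0]
    if rng.random() < epsilon:
        return rng.choice(candidates)                  # explore
    return max(candidates, key=lambda a: q_values[a])  # exploit


def decay_epsilon(step, total_steps, start=1.0, end=0.01):
    """Linear decay from `start` to `end` (assumed schedule)."""
    frac = min(step / total_steps, 1.0)
    return start + frac * (end - start)


memory = ReplayMemory(capacity=3)
for t in range(5):
    memory.add((t, t + 1))
print(len(memory.buffer))                                     # 3
print(epsilon_greedy([0.1, 0.9, 0.5], [0, 1, 0], epsilon=0.0))  # 2
```

In the second call, band 1 has the largest Q-value but is already selected, so the greedy choice falls on band 2; this is the effect of acting on A rather than on C.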

C. Implementation Details
In this work, we set the maximum size of the replay memory to 50000 and make use of a batch size of 100. The ε-greedy policy starts with ε = 1 and decreases until ε = 0.01 in

III. EXPERIMENTS AND ANALYSIS
A. Hyperspectral Data Set Description
1) Indian Pines: This scene was collected in June 1992 via NASA/JPL's Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor. It covers a geographical area in Northwestern Indiana, United States. This data set includes 145×145 pixels, and its spatial resolution is 20 m/pixel. There are 220 spectral bands in total, and their wavelengths range between 400 nm and 2500 nm. The ground truth provided by the data set involves 16 classes of interest, the majority of which are related to crops at various growth stages (cf. Fig. 2). Before performing band selection algorithms, we remove 20 bands, i.e., 104-108, 150-163, and 220, as they are all water absorption bands, and as a result, 200 spectral bands are eventually used.
2) Pavia University: The second scene was captured by the Reflective Optics Spectrographic Imaging System (ROSIS) on an aircraft operated by the German Aerospace Center (DLR) in 2002. It covers an area of the University of Pavia and comprises 610×340 pixels. After discarding 12 noisy bands, it is composed of 103 spectral bands in the wavelength range of 430-860 nm. The spatial resolution of this scene is 1.3 m/pixel.
Table II outlines the number of labeled samples and classes of each data set.

B. Experiment Setting
1) Evaluation:
We use classification tasks to validate the effectiveness of selected bands. As to evaluation measurements, we make use of the following ones:
• Overall accuracy (OA): This metric is calculated by summing the amount of correctly identified data and dividing by the total amount of data.
• Average accuracy (AA): This measurement is computed by averaging all per-class accuracies.
• Kappa coefficient: This coefficient evaluates the agreement between predictions and labels.
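The three measurements can be computed from label lists as in the following sketch. It uses the standard definitions (Cohen's kappa for the agreement coefficient); the function name and the toy labels are ours.

```python
from collections import Counter


def classification_metrics(y_true, y_pred):
    """Return (OA, AA, kappa) for two equally long label sequences."""
    n = len(y_true)
    classes = sorted(set(y_true))
    correct = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    true_counts = Counter(y_true)
    pred_counts = Counter(y_pred)

    oa = sum(correct.values()) / n                                  # overall accuracy
    aa = sum(correct[c] / true_counts[c] for c in classes) / len(classes)
    # Chance agreement estimated from the marginal label frequencies.
    expected = sum(true_counts[c] * pred_counts[c] for c in classes) / (n * n)
    kappa = 1.0 if expected == 1.0 else (oa - expected) / (1.0 - expected)
    return oa, aa, kappa


y_true = [0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 1, 1, 0]
oa, aa, kappa = classification_metrics(y_true, y_pred)
print(round(oa, 4), round(aa, 4), round(kappa, 4))  # 0.6667 0.625 0.25
```

The example shows why AA and kappa are reported alongside OA: with imbalanced classes, the three values can differ noticeably for the same predictions.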

2) Band Selection Methods in Comparison:
To evaluate the proposed approach, we compare it with several state-of-the-art band selection algorithms that are listed as follows:
• MVPCA [36]: A ranking-based band selection method that uses an eigenanalysis-based criterion to prioritize spectral bands.
• ICA [37]: A band selection approach that compares mean absolute independent component analysis (ICA) coefficients of individual spectral bands and picks independent ones including the maximum information.
• IE [38]: A ranking-based band selection algorithm in which band priority is calculated based on information entropy.
• MEV-SFS [40]: A searching-based band selection method that combines the maximum ellipsoid volume (MEV) [71] method with sequential forward search (SFS). The MEV model deems an optimal band subset as a band combination with the maximum volume.
• OPBS [40]: An accelerated version of MEV-SFS that takes advantage of a relationship between orthogonal projections (OPs) and the ellipsoid volume of bands to find an optimal band combination.
• WaLuDi [41]: A hierarchical clustering-based band selection method that uses Kullback-Leibler divergence as the criterion of the clustering algorithm.
• E-FDPC [42]: A clustering-based band selection approach that makes use of an enhanced version of the fast density peak-based clustering (FDPC) [72] algorithm by introducing an exponential learning rule and a parameter to control the weight between local density and intracluster similarity.
• OCF [43]: A band selection method using an optimal clustering algorithm that is capable of achieving optimal clustering results for an objective function with a carefully designed constraint.
• ASPS [44]: A clustering-based band selection method that exploits an adaptive subspace partition strategy.

The bands selected by each method are then evaluated with the following classifiers:
• k-NN: A k-nearest neighbors algorithm (the number of neighbors is set to 3).
• RF: A random forest made up of 200 decision trees.
• MLP: A multilayer perceptron that consists of three fully connected layers. The first two layers contain 256 units, and their outputs are activated by Leaky ReLU. For the last layer, the number of units equals the number of classes, and the activation function is softmax. In the learning phase, we select Adam as the optimizer and define the loss function as categorical cross-entropy. The learning rate is set to 0.0005, and the number of training epochs is 2000 for the purpose of sufficient learning.
• SVM-RBF: A support vector machine (SVM) equipped with a radial basis function (RBF) kernel. Five-fold cross-validation is utilized to determine optimal hyperparameters, i.e., γ and C.

For both the Indian Pines and the Botswana data sets, we randomly select 10% of the samples from each class as training instances, while the remaining ones are used to test models. Regarding the Pavia University and MUUFL Gulfport data sets, 1% of the samples per class are chosen randomly to build the training set, and all the other samples are used for testing. In order to assess the stability of various band selection models, final results are obtained by averaging 10 individual runs, and we report the mean and standard deviation of the performance metrics of different approaches over the 10 runs.
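The per-class sampling protocol and the aggregation over repeated runs can be sketched as below. The helper names are hypothetical, and two details are our assumptions: at least one training sample is kept for every class, and the sample standard deviation is used when summarizing the runs.

```python
import random
import statistics
from collections import defaultdict


def stratified_split(labels, train_fraction, rng):
    """Randomly pick `train_fraction` of the samples of each class for training."""
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, test = [], []
    for indices in by_class.values():
        rng.shuffle(indices)
        n_train = max(1, round(train_fraction * len(indices)))  # assumed lower bound
        train += indices[:n_train]
        test += indices[n_train:]
    return train, test


def summarize(scores):
    """Mean and (sample) standard deviation of a metric over repeated runs."""
    return statistics.mean(scores), statistics.stdev(scores)


labels = [0] * 50 + [1] * 30
train_idx, test_idx = stratified_split(labels, 0.10, random.Random(0))
print(len(train_idx), len(test_idx))  # 8 72

# Hypothetical OA values from three runs, only to show the aggregation step.
print(summarize([0.80, 0.82, 0.81]))
```

Repeating the split with a different random seed for each run and summarizing the resulting metrics gives the kind of mean and standard deviation reported in the result tables.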

C. Information Entropy or Correlation: Whose Call Is It in Building The Reward Scheme?
Fig. 3 compares two instantiations of the reward scheme, namely information entropy and correlation coefficient (cf. Section II-A), on the Pavia University data set. To quantitatively evaluate them, we make use of k-NN to perform classification using spectral bands selected by models using these two schemes. From Fig. 3, it can be observed that the former achieves higher OA, AA, and Kappa coefficient than the latter. Moreover, the computation cost of information entropy is lower than that of the correlation coefficient. Hence we choose information entropy as the reward scheme in our model for the following experiments.

D. Results and Discussion
In this subsection, we assess the proposed approach by comparing it with several state-of-the-art band selection methods mentioned in Section III-B. For each data set, we plot OA curves showing OA variations w.r.t. the number of chosen bands K in Fig. 4, Fig. 5, Fig. 6, and Fig. 7. In our experiments, K varies from 5 to 60. Furthermore, in Table III, Table IV, Table V, and Table VI, we also report OAs, AAs, and Kappa coefficients of different methods with a fixed K (following the setup in [52], it is set to 30).
Fig. 4 and Table III present results on the Indian Pines data set. As can be seen in Fig. 4, the proposed DRL is capable of achieving the highest OA using an SVM classifier with 5 to 60 selected bands. Although the OA of DRL is slightly lower than that of WaLuDi when a k-NN is employed with 5 bands, our DRL model outperforms other competitors when more spectral bands are chosen. Moreover, we can see that, compared to other band selection models, the proposed approach can also provide gains when using an MLP (the only exception is when K = 5). With an RF classifier, OCF and WaLuDi outperform DRL when K = 5 and 20, but in other cases, the proposed model is able to provide the best results. In Table III, we take the example of selecting 30 bands for classification and report numerical results. It can be observed that our DRL obtains the best results. In particular, when using a k-NN classifier, our approach gains an improvement of 2.05%, 2.13%, and 2.31% in OA, AA, and Kappa coefficient, respectively, compared with the second-best model. In addition, it is noteworthy that, in comparison with the original data with all bands, our method can offer almost the same or better results in some cases, e.g., when over 20 bands are selected for k-NN.
Fig. 5 and Table IV exhibit classification results for the Pavia University data set. In Fig. 5 (with k-NN, RF, and MLP), when only a few bands are selected, e.g., 5, the OA of DRL is lower than that of WaLuDi and/or E-FDPC. But when that number goes beyond 5, the proposed method outperforms other competitors. On the other hand, DRL performs well with an SVM classifier, and its OA exceeds the accuracies of all competitors (cf. Fig. 5(d)). For instance, Table IV shows that, compared to the second-best models, MEV-SFS and OPBS, our DRL is able to obtain a gain of 1.54% and 2.07% in OA and Kappa coefficient, respectively. Besides, OAs produced by most methods grow when more bands are selected, and our band selection method can achieve higher accuracies than all bands for certain values of K, e.g., K ≥ 20 in Fig. 5(b) and K = 30 and K ≥ 50 in Fig. 5(d).
Classification results on the Botswana data set are shown in Fig. 6 and Table V. As shown in Fig. 6, the proposed DRL model delivers the best and most stable results with k-NN, RF, and SVM, except for K = 60 in Fig. 6(a) and Fig. 6(d). For MLP, DRL performs best when K = 30, 50, and 60, and in other cases, it achieves the second-best results. Besides, we notice that the performance of DRL and some competitors is very similar when selecting more than 50 spectral bands. But overall we can see significant gains on this data set.
In Fig. 7 and Table VI, we report results on the MUUFL Gulfport data set. It can be seen that several band selection models, e.g., WaLuDi, E-FDPC, OCF, ASPS, and DRL, behave very similarly. This may be because, compared to the other three data sets, the MUUFL Gulfport data set has only 64 bands.
Overall, from the tables, we can see that among all band selection models, the ranking-based methods perform relatively poorly, while the clustering-based approaches tend to achieve good results. The searching-based models, i.e., MEV-SFS and OPBS, can deliver good selected bands on some data sets like the Pavia University scene, but it is noteworthy that they are not robust across different data sets; for example, their performance on the Indian Pines scene is not satisfactory. By contrast, our method shows superior performance. This may be due to the fact that our approach is a data- and objective-driven learning-based model. Compared to other heuristic algorithms, it is able to explore more possible band subsets during the training phase.
In addition, we visualize bands selected by the proposed method on all four data sets in Fig. 8, Fig. 9, Fig. 10, and Fig. 11. From these figures, we see that DRL tends to select spectral bands with high information entropy. This is in line with our presumption and existing studies in hyperspectral band selection, in which information entropy is an important measurement. Classification maps using 30 bands selected by the proposed DRL model and an SVM-RBF classifier on the four data sets are shown in Fig. 12. Basically, these maps present satisfactory classification results, although we see some salt-and-pepper noise, which is inevitable in spectral classification.

E. Stability against Classifiers and Robustness against Data Sets
From the experimental results, we observe that some competitors behave unstably with different classifiers. For example, ASPS works quite well on the Indian Pines data set when RF, MLP, and SVM are employed but somewhat poorly when using a k-NN. Similarly, E-FDPC can provide decent results on the Pavia University data set with k-NN, while with an MLP or SVM classifier, it performs rather poorly as compared to other band selection algorithms. This is probably because there exist noisy bands among the selected bands, which leads to an unsatisfactory performance on noise-sensitive classifiers. In contrast to most competitors, the proposed DRL model is more stable against classifiers.
Furthermore, we also notice that the robustness of several competitors against different data sets is not satisfactory. For example, when 30 bands are selected and an SVM classifier is used, OCF is capable of achieving the second highest OA and Kappa coefficient on the Indian Pines data set (cf. Table III) but shows a lackluster performance on the Pavia University and Botswana data sets (see Table IV and Table V). This may be because choosing an optimal combination of spectral bands is a non-trivial task, and locally optimal solutions are not always easy to avoid. In this aspect, the proposed method is more robust against data sets.

F. Limitations
Further, we would like to discuss limitations of the proposed method. Firstly, as to computational time, compared to other heuristic band selection methods, the proposed model needs more time, as it is a learning-based algorithm and takes some time to explore an effective band-selection policy during the training phase. Taking the Indian Pines data set and 30 selected bands as an example, most heuristic band selection approaches take a few seconds to several tens of seconds [50], whereas the proposed model needs around 350 seconds. But we note that DARecNet [45], a CNN-based unsupervised band selection model, takes about 9000 seconds under recommended settings. Overall, the computational time of our model is acceptable. Secondly, since the objective of our unsupervised DRL is to maximize the reward rather than classification accuracy, we cannot directly assess the quality of the model in terms of classification accuracy during the training phase, which may lead to unstable model training and makes it inconvenient to monitor.

IV. CONCLUSION AND OUTLOOK
This paper proposes a deep reinforcement learning model for unsupervised hyperspectral band selection. In the training phase, the goal of the deep reinforcement learning agent (i.e., the Q-network) is to learn a band-selection policy that guides its sequential decision-making process. The policy is a function specifying the band to be chosen given the current state. Note that the training process does not need any labeled data. In the test phase, the agent acts sequentially according to the learned policy. We conduct extensive experiments, and the results show the effectiveness of our approach. Moreover, two instantiations of the reward scheme in Section II-A are quantitatively compared, and we believe that more alternatives are possible and may improve results. In the future, we intend to carry out several further studies. For example, combining deep reinforcement learning with heuristic band selection frameworks (e.g., the clustering-based method) is likely to offer better band selection solutions. Considering that different classes may have different optimal band subsets (with a variable number of bands), how to determine the best band combination for each category is an interesting but challenging problem. A supervised deep reinforcement learning model may be able to provide insights. Moreover, we believe that deep reinforcement learning can be applied to more remote sensing applications, such as multitemporal data analysis, visual reasoning in airborne or space-borne images, and other combinatorial optimization tasks in remote sensing.

• Ranking-based methods. These methods aim at seeking an effective criterion to measure the significance of each spectral band and prioritize all bands. Afterwards, top-ranked bands are selected. Some representative ranking-based band selection methods are [36]-[38].
• Searching-based methods. The searching-based band selection approaches usually have two components: an objective function and a sequential search algorithm. The former is a criterion that the latter seeks to minimize over all feasible band subsets by adding or removing bands from a candidate set. The searching-based methods have two variants: sequential forward selection and sequential backward selection. [39], [40] are both representative works in this direction.
• Clustering-based methods. In these methods, all spectral bands are first grouped into several clusters via a clustering algorithm. Afterwards, the most representative band is selected from each cluster. Representative clustering-based band selection methods include [41]-[44].
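To make the searching-based family more concrete, the sketch below implements a generic sequential forward selection loop: starting from an empty set, it repeatedly adds the band whose inclusion minimizes a user-supplied objective. It illustrates the general scheme only and is not the method of [39] or [40]; the function name, the `objective` callable, and the toy per-band costs are assumptions made for this example.

```python
from typing import Callable, List, Sequence


def sequential_forward_selection(
    n_bands: int,
    k: int,
    objective: Callable[[Sequence[int]], float],
) -> List[int]:
    """Greedily build a subset of k band indices that minimizes `objective`."""
    if not 0 <= k <= n_bands:
        raise ValueError("k must lie between 0 and n_bands")
    selected: List[int] = []
    remaining = set(range(n_bands))
    for _ in range(k):
        # Evaluate every candidate extension of the current subset and keep the
        # best one; iterating in sorted order breaks ties by the lowest index.
        best = min(sorted(remaining), key=lambda b: objective(selected + [b]))
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy objective (hypothetical): every band carries a fixed cost and the cost of
# a subset is the sum over its members.
costs = [5.0, 1.0, 4.0, 2.0, 3.0]
print(sequential_forward_selection(len(costs), 2, lambda s: sum(costs[b] for b in s)))
```

Each greedy step commits to a band for good, so the search can settle in a locally optimal subset, which is the weakness of this family that a learned selection policy tries to mitigate.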

Algorithm 1: Training
  randomly initialize Q-network weights w;
  initialize replay memory M;
  initialize the complete set of all actions C;
  while not converged do
    initialize state: s = 0;
    empty the set of chosen bands: B = ∅;
    for t = 1 to K do
      compute the actual set of actions: A = C \ B;
      simulate one step with the ε-greedy policy π: a = π(s);
      s', r = STEP(s, a);
      B ← B ∪ {a};
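The inner loop of Algorithm 1 restricts the ε-greedy policy to the bands that have not been chosen yet (A = C \ B). The sketch below shows one way to write that masked action selection and the resulting K-step episode. Several details are assumptions made for illustration: the state is encoded as a multi-hot vector that starts at all zeros, `q_fn` stands in for the Q-network and returns one value per band, and the reward computation and replay-memory updates of the full algorithm are omitted.

```python
import random
from typing import Callable, List, Sequence, Set


def epsilon_greedy_action(q_values: Sequence[float], chosen: Set[int],
                          epsilon: float, rng: random.Random) -> int:
    """Pick a band from A = C \\ B: random with probability epsilon, else greedy."""
    available = [a for a in range(len(q_values)) if a not in chosen]
    if not available:
        raise ValueError("no bands left to choose from")
    if rng.random() < epsilon:
        return rng.choice(available)                   # explore
    return max(available, key=lambda a: q_values[a])   # exploit


def select_bands(q_fn: Callable[[Sequence[int]], Sequence[float]], n_bands: int,
                 k: int, epsilon: float, rng: random.Random) -> List[int]:
    """Run one K-step episode and return the bands in the order they were chosen."""
    state = [0] * n_bands        # assumed multi-hot state, s = 0 at the start
    chosen: Set[int] = set()     # B, the set of chosen bands
    order: List[int] = []
    for _ in range(k):
        action = epsilon_greedy_action(q_fn(state), chosen, epsilon, rng)
        state[action] = 1        # assumed state transition: mark the band as taken
        chosen.add(action)       # B <- B ∪ {a}
        order.append(action)
    return order


# Hypothetical stand-in for a trained Q-network: fixed values, state ignored.
toy_q = lambda state: [0.1, 0.9, 0.5, 0.7]
print(select_bands(toy_q, n_bands=4, k=3, epsilon=0.0, rng=random.Random(0)))
```

With `epsilon=0.0` the same loop reduces to the purely greedy rollout that an agent would follow at test time.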

Algorithm 2: Environment simulation (based on information entropy)
  function s', r = STEP(s, a):
    get s' based on s and a;
    if s is 0 then
      $r = -\sum_{n=1}^{N} P(x_n^a) \log_2 P(x_n^a)$;
    else
      calculate r according to Eq. (3): r = MIE(s') − MIE(s);
    end

supposed to be very close to the target, i.e., we want the error to decrease. Hence we carry out a gradient descent step on L w.r.t. w according to:
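Returning to the entropy-based reward of Algorithm 2: the first branch (taken when s is 0, i.e., for the first selected band) is the Shannon entropy of that band. The sketch below computes this quantity under an assumption that this section does not confirm, namely that P is estimated from the empirical histogram of discrete (quantized) pixel values of band a. The function name is hypothetical, and the MIE term of the second branch is defined by Eq. (3) elsewhere in the paper and is not reproduced here.

```python
import math
from collections import Counter
from typing import Sequence


def band_entropy(pixels: Sequence[int]) -> float:
    """Shannon entropy, in bits, of the empirical value distribution of one band."""
    if len(pixels) == 0:
        raise ValueError("the band must contain at least one pixel")
    n = len(pixels)
    counts = Counter(pixels)     # histogram over the distinct pixel values
    # -sum_v P(v) * log2 P(v), with P(v) estimated as count(v) / n
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


# Hypothetical toy bands: a two-level band carries 1 bit, a flat band carries none.
print(band_entropy([0, 0, 1, 1]), band_entropy([0, 1, 2, 3]))
```

Under this estimate a nearly constant band yields a reward close to zero, so the first pick is steered towards bands with a richer distribution of values.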

Fig. 2. From left to right and top to bottom: True-color composite images and ground truth data of the Pavia University, Botswana, Indian Pines, and MUUFL Gulfport Data Sets.

Fig. 3. Comparison of two reward schemes, namely information entropy and correlation coefficient, on the Pavia University data set.

Fig. 4. OA curves of different band selection methods on the Indian Pines data set. The x-axis indicates OA (%), and the y-axis indicates the number of selected bands. (a) OA by k-NN. (b) OA by RF. (c) OA by MLP. (d) OA by SVM-RBF. All OAs are achieved by averaging 10 individual runs.

Fig. 6. OA curves of different band selection methods on the Botswana data set. The x-axis indicates OA (%), and the y-axis indicates the number of selected bands. (a) OA by k-NN. (b) OA by RF. (c) OA by MLP. (d) OA by SVM-RBF. All OAs are achieved by averaging 10 individual runs.

Fig. 7. OA curves of different band selection methods on the MUUFL Gulfport data set. The x-axis indicates OA (%), and the y-axis indicates the number of selected bands. (a) OA by k-NN. (b) OA by RF. (c) OA by MLP. (d) OA by SVM-RBF. All OAs are achieved by averaging 10 individual runs.

Fig. 8. Visualization of bands selected by the proposed method on the Indian Pines data set. We also show the average spectral signature of each class. 30 bands are selected here.

Fig. 9. Visualization of bands selected by the proposed method on the Pavia University data set. We also show the average spectral signature of each class. 30 bands are selected here.

Fig. 10. Visualization of bands selected by the proposed method on the Botswana data set. We also show the average spectral signature of each class. 30 bands are selected here.

Fig. 11. Visualization of bands selected by the proposed method on the MUUFL Gulfport data set. We also show the average spectral signature of each class. 30 bands are selected here.

Fig. 12. Classification maps using 30 bands selected by the proposed DRL model and an SVM-RBF classifier on the four data sets.

TABLE I
ILLUSTRATION OF THE Q-NETWORK WE USE, TAKING THE INDIAN PINES DATA SET AS AN EXAMPLE.

TABLE II
NUMBER OF LABELED SAMPLES IN THE INDIAN PINES, PAVIA UNIVERSITY, BOTSWANA, AND MUUFL GULFPORT DATA SETS.
• DARecNet [45]: An unsupervised convolutional neural network (CNN) for band selection tasks. It employs a dual-attention mechanism, i.e., spatial position attention and channel attention, to learn to reconstruct hyperspectral images. Once the network is trained, bands are selected according to entropies of the reconstructed bands.
• DRL: Our proposed deep reinforcement learning model for unsupervised hyperspectral band selection.

3) Classification Setting: We consider four commonly used classifiers in the remote sensing community to implement hyperspectral image classification. They are as follows: