An Encoder–Decoder Convolution Network With Fine-Grained Spatial Information for Hyperspectral Images Classification

The Convolutional Neural Network (CNN) is widely used in Hyperspectral Images (HSIs) classification. However, most CNN-based HSIs classification methods discard the fine-grained spatial (FGS) details during a sequence of convolution and pooling operations. To address this issue, a unified encoder-decoder framework, denoted FGSCNN, is proposed to integrate high-level semantics and FGS details for HSIs classification. The encoder, which consists of a series of convolution and pooling layers, captures the high-level semantic information with low-resolution feature maps. The decoder fuses the high-level low-resolution semantics with the fine-grained high-resolution spatial information to obtain FGS features with high-level semantics. Deconvolution layers and skip connections are used in the decoder to retain the FGS details, while convolution layers combine the FGS features with the high-level semantics. Based on the encoder-decoder framework, a unified loss function integrates the high-level semantic information and the FGS details in an end-to-end manner for HSIs classification. Experiments conducted on three public datasets, i.e., Indian Pines, Pavia University and Salinas, demonstrate the effectiveness of the proposed method for HSIs classification.


I. INTRODUCTION
Hyperspectral Images (HSIs) contain a great deal of spatial geometric information and spectral information reflecting various characteristics of ground objects. The goal of HSIs classification is to assign each pixel to one of a set of land-use/land-cover classes with a classifier. HSIs classification has attracted extensive attention due to its role in atmospheric environment research [1], ocean remote sensing [2], environmental monitoring [3], military reconnaissance, urban planning [4], [5] and so on. HSIs offer great potential for finer classification [6]. Meanwhile, HSIs classification presents numerous challenges, particularly in utilizing high-dimensional spectral data and high-resolution spatial information.
The associate editor coordinating the review of this manuscript and approving it for publication was Shouguang Wang.
In the past decades, many traditional machine learning methods have been applied to HSIs classification, such as K-Nearest Neighbor (KNN), Support Vector Machine (SVM) [7], Low Rank [8] and Sparse Representation Classifier (SRC) [9]. These methods use the spectral data of a pixel to determine its class label and offer conceptual simplicity and easy implementation. However, the same object may exhibit different spectra and different materials may share the same spectral signature, so spectral information alone often gives unsatisfactory results in distinguishing classes. The abundant spatial information in HSIs was neglected. Observing that neighboring pixels usually carry correlated information within a smooth spatial domain, a large number of researchers turned to integrating spectral and spatial information for HSIs classification. Nevertheless, the early spectral-spatial methods rely on handcrafted features, which require extensive experience for specific application scenarios.
With the development of the Convolutional Neural Network (CNN), various CNN architectures [10]-[13], such as Google Inception [17], VGG, ResNet [18] and DenseNet [19], have been applied to HSIs to extract high-level spectral, spatial and spectral-spatial features [14]-[16]. These CNN-based methods [20]-[22] are trained end to end under the supervision of high-level class labels [23]-[26]. Only the low-resolution outputs of the last convolution layer are sent to the fully connected layer for HSIs classification. The low-resolution feature maps have been shown to be effective in capturing the high-level semantic labels. However, this leads to a major limitation: the fine-grained spatial (FGS) details are discarded during a sequence of convolution and pooling operations. Conversely, the high-resolution feature maps lack high-level semantic information. Therefore, extracting effective features that carry the FGS details together with high-level semantics is a challenging task for HSIs classification.
Considering the issues mentioned above, we aim to build a unified encoder-decoder framework, namely FGS-based CNN (FGSCNN), that integrates high-level semantics and FGS details for HSIs classification. In this method, the encoder captures the high-level semantic information with low-resolution feature maps. The decoder retains the high-resolution spatial details and forms a new branch compared to the traditional CNN-based HSIs classification methods. The high-level semantic information and the FGS details are fused in a unified loss function for HSIs classification in an end-to-end manner. To summarize, the main contributions of this paper are as follows:
1) An encoder-decoder network with skip connections is proposed to capture the FGS features with high-level semantics. These features are produced by the decoder, which retains the FGS distribution for HSIs classification.
2) The high-level low-resolution semantic information and the FGS features with high-level semantics are aggregated for end-to-end HSIs classification. To alleviate the loss of high-level semantic information during the decoding process, the high-level semantics from the encoder output are used in an auxiliary loss to assist the final classification.
3) Experimental results on the public datasets demonstrate the effectiveness of the proposed FGSCNN, which achieves competitive performance compared with existing methods.

II. RELATED WORK
A. HAND-CRAFTED HSIS CLASSIFICATION
Over the past two decades, a great number of effective methods have been developed for HSIs classification. Before the surge of deep learning, HSIs classification relied on hand-crafted spatial-spectral features and on models such as Markov Random Fields (MRFs), SVM and SRC. Tarabalka et al. [27] used MRFs to combine spectral features with spatial features. The authors of [28], [29] enhanced the traditional SVM by considering the spatial information as well as the spectrum. However, the hand-crafted methods are still unable to extract abundant semantic information and require extensive experience for specific application scenarios.

B. CNN-BASED HSIS CLASSIFICATION
With the success of deep learning in the machine learning and computer vision fields, deep learning based methods have shown great potential by learning discriminative spectral and spatial features adaptively. Lin et al. [30] [35] proposed a multi-grained network not only to extract the joint spectral-spatial information, but also to combine spectral and spatial relationships of different grains for HSIs classification. However, most CNN-based HSIs classification methods discard the FGS details during a sequence of convolution and pooling operations. Jiao et al. [36] used pre-trained Fully Convolutional Networks (FCN) [37] for spatial distribution prediction. However, an FCN model pre-trained on a natural image dataset cannot retain the high-resolution spectral information in HSIs. In this paper, a unified encoder-decoder network is built to extract the FGS features with high-level semantics for HSIs classification.

III. METHODS
In this section, we explain the proposed FGSCNN for HSIs classification in detail. It contains three blocks. The first block extracts high-level semantics with low-resolution feature maps. The second block retains the FGS distribution with high-resolution feature maps. The third block uses a balance parameter to unify the previous two parts into one loss function for HSIs classification.

A. OVERVIEW OF THE FGSCNN
In general, the traditional HSIs classification methods make use of the high-level semantics. However, these methods pay little attention to retaining the FGS information. To address this issue, we propose an encoder-decoder convolution network with FGS information for hyperspectral images classification. The overall framework of FGSCNN is illustrated in Figure 1. The encoder block extracts the high-level semantics, while the decoder block fuses the high-level low-resolution semantics with the fine-grained high-resolution spatial information to obtain the FGS features with high-level semantics. These features are then fed into the fully-connected network.
Two fully-connected layers are applied: the first one reduces the dimensionality of the features to increase computational efficiency, and the second one generates the final classification results with a softmax classifier. Moreover, the overall loss function of the FGSCNN can be divided into two parts: a main loss function and an auxiliary loss function. The auxiliary loss function, which is only used in the training process, guarantees that the encoder block extracts effective high-level semantics. The main loss function ensures correct pixel labels. Figure 3 illustrates the details of the proposed FGSCNN, and more detailed information is given in the next subsections. Unlike an autoencoder, which takes the output of the encoder as the feature, FGSCNN classifies each pixel using the FGS features with high-level semantics, which are learned under the supervision of high-level semantic labels. The proposed FGSCNN is also similar to U-Net. However, U-Net is designed for segmentation, while FGSCNN is designed for classification. Above all, an auxiliary term is applied to the output of the encoder to retain as much high-level semantic information as possible, which is one of the most important differences from U-Net.

B. THE ENCODER ARCHITECTURE OF FGSCNN
The encoder block, built on a CNN, is designed to exploit the spatial and spectral correlation across neighboring pixels. Let $X = \{x_1, x_2, \ldots, x_n\} \in \mathbb{R}^{1 \times 1 \times B}$ denote the labeled pixels, where $B$ is the number of spectral bands of the HSI and $n$ is the number of pixels that have a ground-truth label, namely $Y = \{y_1, y_2, \ldots, y_n\}$. For each pixel $x_i$ in $X$, an image patch $x_i^*$ of size $w \times w \times B$ ($w$ is the window size) is extracted, where $x_i$ is the center pixel and the ground truth of the patch is $y_i$. Therefore, the pre-processed $X$ is represented as $X^* = \{x_1^*, x_2^*, \ldots, x_n^*\} \in \mathbb{R}^{w \times w \times B}$. Then, each patch $x_i^*$ is fed into the encoder block. The encoder block can be formulated as:
$$\hat{x}_i^* = f(x_i^*)$$
where $\hat{x}_i^*$ is the extracted high-level semantics and $f(\cdot)$ denotes the CNN-based encoder model.
The general process of $f(\cdot)$ includes convolution, pooling and batch normalization steps. The detailed structure of this encoder network is illustrated in Figure 2. Specifically, the encoder network takes $x_i^*$ as input and employs convolution layers to extract semantics. The pooling layers reduce the number of parameters in the network and enhance its generalization ability. The hyperspectral pixels in a small neighborhood around the central pixel are jointly represented by the CNN model for spectral-spatial feature extraction. After this series of operations, the low-resolution feature map is obtained, carrying the high-level semantics for HSIs classification. However, the obtained low-resolution feature map loses the correlation information across neighboring pixels. For example, the FGS details $\{x_{i1}^*, x_{i2}^*, x_{i3}^*\}$ captured by the early convolutional layers are abandoned.
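The patch extraction step described in this subsection can be sketched as follows. This is a minimal illustration, not the authors' code: the function name `extract_patches` and the mirror padding at the image border are assumptions, since the text does not say how pixels near the border are handled.

```python
import numpy as np


def extract_patches(cube, coords, w):
    """Cut a w x w x B patch centred on each (row, col) in coords.

    cube   : array of shape (H, W, B), the hyperspectral image
    coords : iterable of (row, col) positions of labelled pixels
    w      : odd window size, e.g. 11
    """
    if w % 2 == 0:
        raise ValueError("window size w must be odd so the pixel is centred")
    r = w // 2
    # Assumption: mirror-pad the two spatial axes so border pixels get full patches.
    padded = np.pad(cube, ((r, r), (r, r), (0, 0)), mode="reflect")
    # Pixel (i, j) of the original image sits at (i + r, j + r) in the padded one,
    # so the slice [i:i + w, j:j + w] is centred on it.
    patches = [padded[i:i + w, j:j + w, :] for i, j in coords]
    return np.stack(patches)  # shape (n, w, w, B)


# Toy image with H=5, W=6, B=3 and two labelled pixels.
cube = np.arange(5 * 6 * 3, dtype=np.float32).reshape(5, 6, 3)
patches = extract_patches(cube, [(0, 0), (2, 3)], w=3)
```

Each row of the returned array corresponds to one patch $x_i^*$, and its central pixel is the labelled pixel $x_i$.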

C. THE DECODER ARCHITECTURE OF FGSCNN
In the decoder block, the FGS details of the encoder network are retained to generate the FGS features with high-level semantics. In the traditional CNN model, early convolutional layers with high spatial resolution capture local details, while the later convolutional layers with low spatial resolution extract high-level semantics. To combine the advantages of the two stages, the decoder network consists of deconvolution layers and skip connection layers, which combine the FGS details generated by the encoder. Most HSIs classifiers rely on the high-level semantics to assign labels to pixels and discard the FGS details. Exploiting only the high-level semantic features is insufficient for classifying HSIs with high spatial and spectral resolution. To tackle this issue, a decoder network is designed that translates the learned high-level low-resolution features into the FGS features with high-level semantics. The skip connection operations retain the FGS details for the high-level semantics. The details of the decoder network are illustrated in Figure 4.
The inputs of the decoder block are the learned high-level semantic features $\hat{x}_i^*$ and the FGS details $\{x_{i1}^*, x_{i2}^*, x_{i3}^*\}$, which are extracted from the encoder block. Using the FGS details $\{x_{i1}^*, x_{i2}^*, x_{i3}^*\}$ directly for HSIs classification is ineffective because they lack the guidance of semantics. In order to fuse the fine-grained details and the high-level semantics, deconvolution and skip connections are used to generate joint features. Simply combining the two types of information is inadequate for effective HSIs classification. Therefore, convolution layers are also adopted to capture the FGS features with high-level semantics. In this way, the FGS features with high-level semantics are learned under the supervision of semantic labels. The decoder block can be formulated as:
$$s_i = g(\hat{x}_i^*, x_{i1}^*, x_{i2}^*, x_{i3}^*)$$
where $g(\cdot)$ denotes the decoder block, which includes deconvolution, convolution, skip connection and batch normalization steps, and $s_i$ denotes the FGS features with high-level semantics.
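A skip connection can only fuse an up-sampled decoder map with an encoder map of the same spatial size. The sketch below checks this with the standard output-size formulas for convolution and transposed convolution. The layer settings (four unpadded 3 × 3 convolutions mirrored by three 3 × 3 deconvolutions on an 11 × 11 patch) are hypothetical and chosen only so that three FGS maps exist, as in the text. They are not the configuration shown in the figures.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Spatial size after a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1


def deconv_out(size, kernel, stride=1, padding=0):
    """Spatial size after a transposed convolution (no output padding)."""
    return (size - 1) * stride - 2 * padding + kernel


# Hypothetical encoder: four unpadded 3x3 convolutions on an 11x11 patch.
enc_sizes = [11]
for _ in range(4):
    enc_sizes.append(conv_out(enc_sizes[-1], kernel=3))

fgs_sizes = enc_sizes[1:4]   # sizes of the three FGS maps kept for skip connections
size = enc_sizes[-1]         # size of the high-level, low-resolution encoder output

# Hypothetical decoder: each 3x3 deconvolution must match the skipped encoder map.
dec_sizes = []
for skip in reversed(fgs_sizes):
    size = deconv_out(size, kernel=3)
    if size != skip:
        raise ValueError(f"cannot fuse decoder map {size} with encoder map {skip}")
    dec_sizes.append(size)
```

With these settings the encoder sizes are 11, 9, 7, 5, 3 and the decoder recovers 5, 7, 9, so every decoder stage lines up with one stored FGS map.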

D. LOSS FUNCTION
Two kinds of information can be used for classification: the high-level semantics from the encoder output ($\hat{x}_i^*$) and the FGS features with high-level semantics from the decoder output ($s_i$). In the proposed FGSCNN, the FGS features with high-level semantics are used to calculate the main loss. In order to alleviate the loss of high-level semantics during the decoding process, the high-level semantics from the encoder output are used in an auxiliary loss to assist the final classification.
The main loss function is computed on the FGS features with high-level semantics from the decoder output ($s_i$). The cross entropy is adopted for the loss calculation:
$$L_m = -\sum_{k=1}^{m} y_k \log p_k$$
where $p = \{p_1, p_2, \ldots, p_m\}$ is the probability distribution predicted by the classifier on the decoder block, $p_k$ is the probability of class $k$, $m$ is the number of pixel label classes, and $y = \{y_1, y_2, \ldots, y_m\}$ is the one-hot representation of the ground truth of the pixel. The auxiliary loss helps to optimize the learning process and is computed on the high-level semantics from the encoder output:
$$L_a = -\sum_{k=1}^{m} y_k \log \hat{p}_k$$
where $\hat{p} = \{\hat{p}_1, \hat{p}_2, \ldots, \hat{p}_m\}$ is the probability distribution predicted by the classifier on the encoder block and $\hat{p}_k$ is the probability of class $k$. We add a parameter to balance the two loss functions. The unified loss function is then formulated as:
$$L = (1 - \omega) L_m + \omega L_a$$
where $L_m$ and $L_a$ denote the main loss and the auxiliary loss, respectively, and $\omega$ is a balance parameter.
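The two cross-entropy terms and their combination can be sketched in a few lines. The weighting used here, $(1 - \omega) L_m + \omega L_a$, is an assumption: it is chosen because it matches the description of $\omega$ as a balance parameter for which $\omega = 1$ leaves only the encoder term and $\omega = 0$ leaves only the decoder term. The function names are hypothetical.

```python
import math


def cross_entropy(probs, one_hot, eps=1e-12):
    """Cross entropy between a predicted distribution and a one-hot label."""
    # eps guards against log(0) when a predicted probability is exactly zero.
    return -sum(y * math.log(max(p, eps)) for p, y in zip(probs, one_hot))


def unified_loss(p_decoder, p_encoder, one_hot, omega):
    """Assumed form: (1 - omega) * main loss + omega * auxiliary loss."""
    if not 0.0 <= omega <= 1.0:
        raise ValueError("omega must lie in [0, 1]")
    loss_main = cross_entropy(p_decoder, one_hot)   # on the decoder output
    loss_aux = cross_entropy(p_encoder, one_hot)    # on the encoder output
    return (1.0 - omega) * loss_main + omega * loss_aux


# Toy three-class example with the true class at index 0.
p_dec = [0.7, 0.2, 0.1]
p_enc = [0.5, 0.3, 0.2]
y = [1, 0, 0]
loss = unified_loss(p_dec, p_enc, y, omega=0.3)
```

Only the decoder branch is needed at test time, so the auxiliary term affects training alone.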

IV. EXPERIMENTS
A. DATASETS
VOLUME 8, 2020
The performance of the proposed FGSCNN is evaluated on three publicly available datasets: Indian Pines, Pavia University and Salinas. For each dataset, we randomly selected 200 labeled pixels per class for training and used the other pixels in the ground-truth map for testing. The details of the three datasets are as follows:
1) The Indian Pines dataset was collected by the AVIRIS sensor over northwestern Indiana and contains 145 × 145 pixels. It has 200 spectral bands covering a wavelength range from 0.4 to 2.5 µm. There are 16 land-cover classes in the ground truth of this dataset. We removed the classes with fewer samples and selected the 8 classes with more samples [34], [38], [39]. The numbers of training and testing samples are listed in Table 1.
2) The Pavia University dataset was captured by the ROSIS sensor over Pavia, northern Italy. This image contains 610 × 340 pixels and 103 spectral bands with a spatial resolution of 1.3 m. There are 9 classes and 42,776 samples in total. Table 2 lists the numbers of training and testing samples.
3) The last dataset is Salinas, which has the most labeled samples of the three. It was gathered with the AVIRIS sensor over Salinas Valley and contains 512 × 217 pixels and 204 spectral bands. It has a total of 16 classes and 54,129 samples. The numbers of training and testing samples are listed in Table 3.
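The per-class split described above (a fixed number of randomly chosen labeled pixels per class for training, the rest for testing) can be sketched as follows. The function name `split_per_class` and the fixed seed are illustrative assumptions. The check on class size reflects why classes with too few samples cannot be used with 200 training pixels each.

```python
import random
from collections import defaultdict


def split_per_class(labels, n_train, seed=0):
    """Pick n_train random indices per class for training, the rest for testing."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    train, test = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        if len(indices) <= n_train:
            raise ValueError(f"class {label} has too few samples for n_train={n_train}")
        rng.shuffle(indices)
        train.extend(indices[:n_train])
        test.extend(indices[n_train:])
    return train, test


# Toy label list: 5 pixels of class 0 and 7 pixels of class 1.
labels = [0] * 5 + [1] * 7
train_idx, test_idx = split_per_class(labels, n_train=2)
```

For the experiments in the text, `n_train` would be 200 and `labels` would hold the ground-truth class of every labeled pixel.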

B. DATA AUGMENTATION
A deep network usually requires sufficient training data to learn a model with a great number of parameters. However, in HSIs classification tasks, only a few labeled samples may be available in practice. To address this issue, we use a data augmentation method to train the proposed FGSCNN in a small-sample setting. The first step flips the original samples horizontally or vertically. The second step adds Gaussian noise to the flipped data. The last step assigns the newly generated data the same class labels as the original data. Table 4 reports the effect of data augmentation. The performance of FGSCNN improves by 0.88%, 0.44% and 0.42% on Indian Pines, Pavia University and Salinas, respectively. On the whole, augmenting the training data plays a positive role in HSIs classification.
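The three augmentation steps can be sketched as below. Several details are assumptions made for illustration: the noise standard deviation `noise_std` is not given in the text, and this sketch generates both a horizontal and a vertical flip of every patch, whereas the text only says that samples are flipped horizontally or vertically.

```python
import numpy as np


def augment(patches, labels, noise_std=0.01, seed=0):
    """Return the original patches plus noisy horizontal and vertical flips.

    patches : array of shape (n, w, w, B)
    labels  : array of shape (n,)
    """
    rng = np.random.default_rng(seed)
    # Step 1: flip every patch along each spatial axis.
    flipped = np.concatenate([
        patches[:, :, ::-1, :],  # horizontal flip (reverse the column axis)
        patches[:, ::-1, :, :],  # vertical flip (reverse the row axis)
    ])
    # Step 2: add zero-mean Gaussian noise to the flipped patches.
    noisy = flipped + rng.normal(0.0, noise_std, size=flipped.shape)
    # Step 3: new samples inherit the labels of the patches they came from.
    new_patches = np.concatenate([patches, noisy])
    new_labels = np.concatenate([labels, labels, labels])
    return new_patches, new_labels


# Four toy 11 x 11 patches with 200 bands each.
patches = np.random.default_rng(1).random((4, 11, 11, 200))
labels = np.array([0, 1, 2, 1])
aug_x, aug_y = augment(patches, labels)
```

Under these assumptions the training set triples in size, with every generated patch keeping the label of its source patch.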

C. ON THE TRADE-OFF BETWEEN $L_m$ AND $L_a$
The parameter $\omega$ is used to balance $L_m$ and $L_a$. $L_m$ uses the FGS features with high-level semantics and $L_a$ uses the high-level semantics. When $\omega = 1$, the decoder block (main loss) is completely abandoned. As $\omega$ decreases, the decoder block plays an increasingly important role. When $\omega = 0$, the architecture of FGSCNN is similar to U-Net [40]. U-Net is proposed for image segmentation, while FGSCNN is designed for HSIs classification. Table 5 reports the results with varied $\omega$ on the three datasets. When $\omega = 0.3$, FGSCNN achieves the best classification performance on all three datasets. The high-level semantics and the FGS features complement each other to give the best classification results. The output of the encoder (the high-level semantics $\hat{x}_i^*$) is not used in the testing phase; however, it provides enhanced semantic information for $s_i$ (the FGS features with high-level semantics). Compared with using no auxiliary loss, FGSCNN ($\omega = 0.3$) improves by 0.88% on Indian Pines, 0.18% on Pavia University and 0.55% on Salinas.
Besides, we also compare the classification performance obtained from the output of the encoder ($\hat{x}_i^*$) and from that of the decoder ($s_i$). The results are shown in Table 6. The output of the decoder, which carries FGS details through the skip connections, contains more abundant information than that of the encoder.

D. THE EFFECT OF DIFFERENT WINDOW SIZES
The proposed FGSCNN utilizes the spatial information across neighboring pixels for HSIs classification. It is therefore necessary to investigate the influence of the window size on HSIs classification. Figure 5 illustrates the classification performance with different window sizes, which vary from 3 × 3 to 15 × 15. When the window size is 11 × 11, the classification performance of FGSCNN becomes satisfactory. Although better performance can be obtained with window sizes of 13 × 13 and 15 × 15, the window size of 11 × 11 is used, in view of the computational cost, for all experiments outside this subsection.
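As a rough illustration of the cost argument, the number of input values per patch grows with the square of the window size. The sketch below computes it from the band counts given for the three datasets. Treating the input size as a proxy for computational cost is an assumption, since the real cost also depends on the network layers.

```python
# Spectral band counts as stated in the dataset descriptions.
BANDS = {"Indian Pines": 200, "Pavia University": 103, "Salinas": 204}


def patch_values(w, bands):
    """Number of input values in one w x w x B patch."""
    return w * w * bands


def relative_cost(w, baseline=11):
    """Input size of a w x w window relative to the baseline window."""
    return (w * w) / (baseline * baseline)


for w in (3, 11, 13, 15):
    print(w, patch_values(w, BANDS["Indian Pines"]), round(relative_cost(w), 2))
```

On Indian Pines, an 11 × 11 patch holds 24,200 values, while a 15 × 15 patch holds 45,000, which is about 1.86 times as many.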

E. CLASSIFICATION PERFORMANCE
To validate the effectiveness of the proposed FGSCNN, the method is compared with several related HSI classification approaches, namely CNN [41], CNN-PPF [32], SS-CNN [42] and DRCNN [34]. The results are reported in Tables 7-9. In general, the classification performance of the proposed FGSCNN is superior to that of the other methods, especially on the Indian Pines and Pavia University datasets. For example, in Table 8, the proposed FGSCNN yields an OA of 99.76%, about 7.5 percentage points higher than CNN (92.27%), about 3.3 points higher than CNN-PPF (96.48%), about 1.4 points higher than SS-CNN (98.41%) and 0.2 points higher than DRCNN (99.56%). A similar trend can be found on the other experimental data. The main reason is that the proposed FGSCNN retains the FGS details, whereas the other CNN-based methods use only the high-level semantics for classification. This confirms that the FGS features with high-level semantics are advantageous for HSIs classification.
It is worth noting that DRCNN performs better than our FGSCNN on the Salinas dataset. We argue that FGSCNN can achieve good performance with a small number of training samples. To further demonstrate this advantage, experiments are conducted on the three datasets with varied numbers of training samples per class. The results are shown in Table 10. From Table 10, we can see that FGSCNN achieves the best performance among all the methods when the number of training samples is 100 or 150. Specifically, the proposed FGSCNN yields an OA of 96.95%, about 3.1 percentage points higher than CNN-PPF (93.88%) and about 1.4 points higher than DRCNN (95.54%). Both CNN-PPF and DRCNN are designed to address the problem of few labeled samples. This demonstrates that the proposed FGSCNN is also suitable for HSIs classification with limited training samples. FGSCNN depends not only on the traditional high-level semantics but also on the proposed FGS details. However, the performance of FGSCNN is worse than that of CNN-PPF and DRCNN when the number of training samples is 50. There is a sharp decline in the performance of FGSCNN when the number of training samples drops from 100 to 50. In this case, the training samples are too few to extract effective high-level semantics for classification, even with the support of the FGS details.
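The overall accuracy (OA) used in these comparisons is the share of correctly labeled test pixels. The sketch below computes it for a toy prediction and then derives the gaps between FGSCNN and the other methods from the OA values quoted above for Table 8. The helper name `overall_accuracy` is illustrative.

```python
def overall_accuracy(y_true, y_pred):
    """Overall accuracy (OA) in percent: share of correctly labelled test pixels."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return 100.0 * correct / len(y_true)


# Toy check: three of four pixels are labelled correctly.
toy_oa = overall_accuracy([1, 2, 3, 3], [1, 2, 3, 1])

# OA values quoted in the text for Table 8 (in percent).
reported_oa = {"CNN": 92.27, "CNN-PPF": 96.48, "SS-CNN": 98.41, "DRCNN": 99.56}
fgscnn_oa = 99.76
gaps = {name: round(fgscnn_oa - oa, 2) for name, oa in reported_oa.items()}
```

The resulting gaps are 7.49, 3.28, 1.35 and 0.20 percentage points for CNN, CNN-PPF, SS-CNN and DRCNN, respectively.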
All the experiments were conducted on a computer with an NVIDIA GeForce RTX-2070 SUPER GPU (8GB GDDR5). Based on the standard back-propagation algorithm, stochastic gradient descent is adopted to learn the network parameters, with a batch size of 160 and a learning rate of 0.001. Table 11 shows the computational complexity of the training and test processes for the proposed FGSCNN, CNN, CNN-PPF and DRCNN. The proposed FGSCNN converges faster than the other methods.
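The optimizer settings above can be illustrated with a minimal mini-batch loop. Plain SGD without momentum and reshuffling of the samples in every epoch are assumptions, as the text gives only the batch size and the learning rate. The helper names are hypothetical.

```python
import random


def minibatches(n_samples, batch_size=160, seed=0):
    """Yield shuffled index lists that cover all samples once (one epoch)."""
    order = list(range(n_samples))
    random.Random(seed).shuffle(order)
    for start in range(0, n_samples, batch_size):
        yield order[start:start + batch_size]


def sgd_step(params, grads, lr=0.001):
    """One plain SGD update: move each parameter against its gradient."""
    return [p - lr * g for p, g in zip(params, grads)]


# 400 training samples give two full batches of 160 and one batch of 80.
batches = list(minibatches(400))
updated = sgd_step([1.0, -2.0], [10.0, 0.0])
```

In a real training run, the gradients passed to `sgd_step` would come from back-propagating the loss over each mini-batch.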

V. CONCLUSION
In recent years, deep learning methods have attracted much attention in HSI analysis. These methods are able to capture spatial and spectral information automatically. However, the FGS details are discarded during a sequence of convolution and pooling operations for HSIs classification. To tackle this issue, a new branch compared to the traditional CNN-based HSIs classification has been presented to build an encoder-decoder convolution network with FGS information for HSIs classification, denoted by FGSCNN. This model contains two main blocks: (i) the encoder block, which extracts high-level semantics from the original images, and (ii) the decoder block, which retains the FGS details with high-level semantics by using a series of deconvolution, skip connection and convolution operations. In this study, the performance of the proposed FGSCNN has been assessed on three public datasets. From the results, it can be concluded that the proposed FGSCNN achieves convincing results compared with the state-of-the-art methods.