Deep Neural Network Based Hyperspectral Pixel Classification With Factorized Spectral-Spatial Feature Representation

Deep learning has been widely used for hyperspectral pixel classification due to its ability to generate deep feature representations. However, how to construct an efficient and powerful network suitable for hyperspectral data is still under exploration. In this paper, a novel neural network model is designed to take full advantage of the spectral-spatial structure of hyperspectral data. First, we extract pixel-based intrinsic features from the rich yet redundant spectral bands using a subnetwork with a supervised pre-training scheme. Second, in order to utilize the local spatial correlation among pixels, we share the previous subnetwork as a spectral feature extractor for each pixel in an image patch, after which the spectral features of all pixels in the patch are combined and fed into the subsequent classification subnetwork. Finally, the whole network is fine-tuned to improve its classification performance. In particular, the spectral-spatial factorization scheme applied in our model architecture makes the network size and the number of parameters considerably smaller than those of existing spectral-spatial deep networks for hyperspectral image classification. Experiments on hyperspectral data sets show that, compared with some state-of-the-art deep learning methods, our method achieves better classification results while having a smaller network size and fewer parameters.


Introduction
Hyperspectral imaging has opened up new opportunities for analyzing a variety of materials in remote sensing, as it provides rich information on the spectral and spatial distributions of distinct materials. One of its most important applications is pixel classification, which is widely applied in material recognition, target detection, geoindexing, and so on [23,19,11,24]. However, the classification of hyperspectral images (HSIs) still faces several challenges, such as the imbalance between the small number of available training samples and the large number of narrow spectral bands, the high variation of spectral signatures within identical materials, the high similarity of spectral signatures between some different materials, and the noise introduced by sensors and the environment [5]. In order to address these problems, it is necessary to extract robust and discriminative features. Popular spectral feature extraction algorithms include principal component analysis (PCA) [2], independent component analysis (ICA) [46], linear discriminant analysis (LDA) [3], manifold learning [1,7], and various band selection methods [32,30,12]. In addition, many studies have demonstrated that it is difficult to distinguish pixels well with spectral information alone, hence spectral-spatial feature extraction attracts more and more attention. A number of joint spectral-spatial features have been proposed, such as extended morphological profiles [4] and the 3-D discrete wavelet transform [35].
Deep neural network based approaches for HSI classification can be roughly categorized into spectral based methods and spectral-spatial based methods. The first type directly uses the spectral signatures of pixels for classification, including stacked autoencoder based methods [10], pre-training based methods [48], and convolutional neural network (CNN) based methods [18]. Even though they bring notable improvements in HSI classification over conventional techniques such as k-nearest-neighbor (KNN) and support vector machine (SVM), some pixels are still difficult to classify accurately using spectral information alone due to high inter-class similarity and high intra-class variation. On the other hand, spectral-spatial deep learning approaches have proven to achieve better performance than spectral-only ones. Examples include the spatial stacked autoencoder (SAE) based method [29], CNN based spectral-spatial feature extraction methods [31,37], and spectral-spatial CNN based classifiers [27,51,40]. In particular, the three-dimensional (3D) CNN [9] is a typical model used in most of these spectral-spatial deep learning approaches. 3D CNN modifies the standard CNN to convolve along both the spatial and spectral dimensions for HSI classification. Such a scheme can employ local spatial information in HSI patches to perform classification learning and inference. However, compared with 1D and 2D CNNs, a 3D CNN requires a larger model to capture useful features, as the number and size of kernels grow rapidly with respect to input size and dimension, especially when the spatial and spectral dimensions are distinct in terms of physical mechanism. In general, a 3D CNN tends to have excessive parameters, making it prone to over-fitting and hard to train. As suggested by He et al.
in [15], an oversized network may encounter certain approximation difficulties. Han et al. also pointed out in [14] that the weights of a CNN may be redundant and that most of them do not carry significant information. Li et al. [22] showed a similar fact, namely that prevailing deep models suffer from a weight redundancy problem. Zhang et al. [50] also indicated that oversized models with an excessive number of parameters tend to memorize data sets instead of learning generic task solutions. Furthermore, given the limited number of class-labeled pixels in hyperspectral data sets, this problem causes even more severe degradation of classification performance. Therefore, reducing model size becomes a critical issue for spectral-spatial CNN based HSI pixel classification methods.
Besides routine model compression techniques for deep learning models such as normalization, regularization, and network pruning, operation factorization is popularly used in CNNs and other deep neural networks to decouple a complex computation into many much smaller steps that have far fewer parameters in total [43,17]. For example, in [16], a convolutional layer with 73,728 parameters is factorized into two separate operations, one depth-wise and one point-wise convolution, with only 576 + 8192 = 8768 parameters in total. In [44], an n × n convolution is factorized into a 1 × n convolution followed by an n × 1 convolution. The decoupled/factorized operations can improve generalization capability, consume fewer resources, and make training and inference faster. Therefore, operation factorization is also one of the prevailing and practical schemes applied in spectral-spatial deep learning models for HSI classification. For example, in [25], the CNN with pixel-pair features (CNN-PPF) is based on the similarity of local constituents: it takes a pair of pixels as input, and the output tells whether the two pixels belong to different classes or, if not, which class they share. The spectral-spatial feature in CNN-PPF is factorized into a class-related latent spectral feature and a pixel-pair-consistency based spatial feature. The final label of the target pixel is determined via a voting strategy based on the neighboring pixel-pair information. CNN-PPF performs well using its augmented data and an L2 regularization scheme. Similarly, in [28], the Siamese CNN (S-CNN) uses paired image patches as training samples and trains on them to classify the central pixels of patches; its spectral-spatial feature is factorized into the spectral feature extracted by the Siamese network and the spatial information used by an SVM that combines the spectral feature vectors of patches output by the Siamese network.
Unfortunately, the schemes combining the spectral and spatial operations in both CNN-PPF and S-CNN are fixed rather than adaptively learned, since CNN-PPF relies on a voting strategy and S-CNN on a separate SVM. Therefore, an end-to-end CNN with global optimization cannot be achieved.
In this paper, we propose a deep neural network that utilizes both spectral and spatial information in a novel decoupled/factorized manner, designed to be concise, adaptive, end-to-end, and easy to train. Our spectral-spatial classification model is factorized into a pixel-based spectral feature extraction subnetwork (SFE-Net) and a patch-based spatial classification subnetwork (PSC-Net). The SFE-Net is pre-trained in a supervised learning scheme, as a rough prediction based on spectral information alone, and it can be constructed like any spectrum-based deep neural network. In order to refine the prediction, the PSC-Net is then trained to determine the class of the central pixel in a patch by taking the prior spectral information of surrounding pixels into account, as there are strong spatial relationships among neighboring pixels. These two subnetworks are combined via a sequential connection: we first share the SFE-Net across each pixel in the patch, then concatenate all the extracted spectral features as the input vector of the PSC-Net. Moreover, the two factorized subnetworks are trained collaboratively as an entire network by the backpropagation algorithm to improve the overall performance. Because both SFE-Net and PSC-Net are based on deep neural networks, and spectral-spatial features are factorially represented, we name the proposed model the Factorized Spectral-Spatial Feature Network (FSSF-Net). A demonstration of the proposed framework is shown in Fig. 1.
In summary, a novel deep learning based framework is proposed in this paper that better utilizes the joint spectral-spatial structure of HSI through a flexible and generalized feature factorization scheme. It allows various kinds of deep neural networks to act as subnetworks or hidden layers, while remaining an integrated end-to-end model. Compared to some state-of-the-art deep learning based HSI classification methods, including CNN based ones, the proposed approach achieves competitive classification performance; in particular, the lightweight model resulting from feature factorization brings faster training, lower storage cost, and better generalization with a small number of class-labeled samples.
The paper is organized as follows. In Section 2, we introduce the mathematical formulation of the problem and the proposed model, followed by the details of the implementation. In Section 3, a number of experiments are done to investigate the effectiveness of the proposed model and the primary techniques used in it, and experimental comparisons with some state-of-the-art approaches are also given. Finally, we conclude our work in Section 4.

Methods
In this section, we explain the details of our network, including the specific network architecture and the training scheme. We first describe the two subnetworks, SFE-Net and PSC-Net, and how they are connected. The training scheme, a major concern, is also explained in detail, especially how the error is back-propagated through PSC-Net to SFE-Net.

Model Architecture
FSSF-Net takes a 3D patch extracted from an HSI as input. As aforementioned, there are two parts in FSSF-Net: SFE-Net and PSC-Net. Given a 3D patch, we first apply and share the SFE-Net across each pixel to extract spectral features. These spectral features are then concatenated into a vector to fuse the spatial information embodied in the patch, which the PSC-Net utilizes to classify the central pixel. Generally, there is no restriction on the structure or the number of layers of these two subnetworks: SFE-Net and PSC-Net can be any type of neural network, i.e., they are not limited to a specific form. For example, the classic 1D-CNN and SAE can be used. For the classification task at hand, the flexibility of our framework allows us to choose and/or develop any network structure in order to maximize performance. In this paper, we use MLP (multilayer perceptron) based architectures for both subnetworks. To improve generalization capability and avert over-fitting, batch-normalization (BN) layers, scaled exponential linear unit (SELU) layers, and dropout layers are added to the architecture as regularization/normalization components. The network architecture of SFE-Net and PSC-Net is illustrated in Fig. 2, where FC stands for fully-connected layers. For both SFE-Net and PSC-Net, the hidden fully-connected layers have the same number of output units u = 100, and all dropout layers share the same retention probability r = 0.5. Both SFE-Net and PSC-Net output a C-dimensional vector activated by softmax as the final output, where C is the number of classes.
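To make the factorized architecture concrete, the following NumPy sketch runs a toy forward pass under stated assumptions: a one-hidden-layer MLP stands in for SFE-Net (the actual model also uses BN and dropout, omitted here), a single softmax layer stands in for PSC-Net, and the dimensions D, u, C are illustrative, not the paper's settings. The key point is that one set of SFE-Net weights is shared across all W × W pixels before concatenation.

```python
import numpy as np

def selu(x, alpha=1.6733, scale=1.0507):
    # SELU activation used between fully-connected layers
    return scale * np.where(x > 0, x, alpha * (np.exp(x) - 1))

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def sfe_net(pixel, W1, W2):
    # Pixel-wise spectral feature extractor g: D -> C (softmax output)
    h = selu(pixel @ W1)           # hidden layer with u units
    return softmax(h @ W2)         # C-dimensional softmax spectral feature

def fssf_forward(patch, W1, W2, V):
    # Share the SAME SFE-Net weights across all W*W pixels of the patch,
    # concatenate the per-pixel spectral features, then classify (PSC-Net).
    Wp, _, _ = patch.shape
    feats = [sfe_net(patch[i, j], W1, W2)
             for i in range(Wp) for j in range(Wp)]
    fused = np.concatenate(feats)  # length W*W*C spatial-fusion vector
    return softmax(fused @ V)      # final class probabilities

rng = np.random.default_rng(0)
D, u, C, Wp = 10, 5, 3, 7          # toy bands, hidden units, classes, patch side
patch = rng.normal(size=(Wp, Wp, D))
W1, W2 = rng.normal(size=(D, u)), rng.normal(size=(u, C))
V = rng.normal(size=(Wp * Wp * C, C))
probs = fssf_forward(patch, W1, W2, V)
print(probs.shape)                 # (3,)
```

The same weight matrices W1, W2 serve every pixel position, which is exactly what keeps the parameter count independent of the patch size.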

Model Training
A patch-based HSI data set can be represented as X = {x_i | x_i ∈ R^(W×W×D), i = 1, 2, ..., N}, where W is the width of the square HSI patch, D is the number of spectral bands/channels, and N is the number of patches in the data set. The corresponding classification label set is Y = {y_i | y_i ∈ Z^C, i = 1, 2, ..., N}, where each element y_i is the one-hot label of the central pixel in x_i, and C again is the number of classes. The pixel classification of an HSI is to find a function f(x) such that the output Ŷ = {ŷ_i | ŷ_i = f(x_i), i = 1, 2, ..., N} is as close to the label set Y as possible.
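As a minimal illustration of this formulation, the sketch below extracts W × W × D patches from a toy HSI cube and labels each by its central pixel. The function name and the convention that label 0 marks unlabeled pixels are our own assumptions for the example, not details from the paper.

```python
import numpy as np

def extract_patches(hsi, labels, W=7):
    # Slide a W x W window over the labeled pixels of an H x Wd x D cube;
    # each patch gets the label of its central pixel (0 = unlabeled, skipped).
    r = W // 2
    X, Y = [], []
    H, Wd, _ = hsi.shape
    for i in range(r, H - r):
        for j in range(r, Wd - r):
            if labels[i, j] > 0:
                X.append(hsi[i - r:i + r + 1, j - r:j + r + 1, :])
                Y.append(labels[i, j])
    return np.stack(X), np.array(Y)

cube = np.random.rand(20, 20, 8)   # toy 20 x 20 scene with 8 bands
gt = np.zeros((20, 20), dtype=int)
gt[10, 10] = 3                     # a single labeled pixel of class 3
X, Y = extract_patches(cube, gt, W=7)
print(X.shape, Y)                  # (1, 7, 7, 8) [3]
```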
In FSSF-Net, the submodule SFE-Net is pre-trained using classification supervision, so as to provide softmax vectors as output spectral features. Then, the submodule PSC-Net takes these spectral features of an HSI patch and outputs the final classification result, which is also achieved using a classification loss metric. In summary, both SFE-Net and PSC-Net are trained with the cross-entropy loss. Specifically, given an output vector ŷ and its corresponding one-hot label y, with the j-th element of a vector indexed by the superscript (j), the loss function can be formulated as L(ŷ, y) = −∑_{j=1}^{C} y^(j) log ŷ^(j). The training of FSSF-Net includes two stages: pre-training SFE-Net, and fine-tuning SFE-Net and PSC-Net together. In the pre-training stage, we extract the central pixel of each patch x_i and train SFE-Net with the corresponding labels. In the fine-tuning stage, FSSF-Net uses the 3D patches x_i as input, whose labels are provided by their central pixels. During this stage, both SFE-Net and PSC-Net are set to be trainable. The entire training process completes when the fine-tuning of FSSF-Net converges. In particular, when back-propagating the error through PSC-Net to SFE-Net, the error used for updating SFE-Net is the average of the errors calculated on each pixel in a patch. The ADAM (adaptive moment estimation [20]) optimizer is used in our experiments with a learning rate of 0.001 in both stages. The input patch size W is set to 7, which is a trade-off between training storage cost and the amount of spatial information fed into the network. Overall, there are four hyper-parameters C, W, u, r in our implementation, where W, u, r are constant throughout all our experiments.
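For a one-hot label, the cross-entropy loss above reduces to the negative log-probability assigned to the true class; a minimal numeric check with illustrative values:

```python
import numpy as np

def cross_entropy(y_true, y_pred, eps=1e-12):
    # L(y_hat, y) = -sum_j y^(j) * log y_hat^(j), for a one-hot label y;
    # eps guards against log(0)
    return -np.sum(y_true * np.log(y_pred + eps))

y = np.array([0, 1, 0])              # one-hot label, C = 3
y_hat = np.array([0.2, 0.7, 0.1])    # a softmax output
loss = cross_entropy(y, y_hat)
print(round(loss, 4))                # -log(0.7) ~ 0.3567
```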

Results and Discussion
In this section, we first introduce the five data sets used in our experiments and clarify the corresponding experimental settings. Then, we examine three significant ideas applied in our method: the pre-training of SFE-Net (g), the spatial sharing of this spectral feature extractor, and the flexibility in the type of hidden layers. Finally, we evaluate our method further by comparing it with some state-of-the-art approaches. TensorFlow based on the CUDA library is used as the computational framework, and the unified interface wrapper Keras is applied to simplify implementation. All experiments run on a workstation equipped with an Intel Xeon E5-2620 v4 at 2.1 GHz and an Nvidia GeForce GTX 1080 graphics card. Overall accuracy (OA), average accuracy (AA), and the kappa coefficient are used as the criteria of classification accuracy. All experiments were repeated five times.

Experimental Data
The first is the Indian Pines data set, gathered by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) in 1992 over Northwest Indiana, including 16 vegetation classes. There are 220 spectral bands in the 0.4-2.45 µm region of the visible and infrared spectrum, with 145 × 145 pixels per band. The pseudocolor image of the Indian Pines data set is shown in Fig. 3.
The second data set Salinas is also collected by the AVIRIS sensor over Salinas Valley, California, including 512 × 217 pixels for each band. After 20 water absorption bands are removed, 204 bands remain for the experiments. There are 16 classes contained in the image such as vegetables, bare soils, and vineyard fields. Its pseudocolor image is illustrated in Fig. 4.
The third is the KSC data set, also acquired by the AVIRIS sensor, over the Kennedy Space Center (KSC), Florida, in 1996. After removing water absorption and low-SNR bands, 176 bands were used. There are 13 land-cover classes and 512 × 614 pixels, of which 5211 are labeled. Its pseudocolor image is shown in Fig. 5.
The fourth is the University of Pavia data set, acquired by the Reflective Optics System Imaging Spectrometer (ROSIS) over Northern Italy in 2001. The image scene contains 9 urban land-cover types and 610 × 340 pixels per band. Once the very noisy bands have been removed, the remaining 103 spectral bands, in the 0.43-0.86 µm range of the visible and infrared spectrum, are employed. Its pseudocolor image is shown in Fig. 6.
The fifth is the Pavia Center data set also acquired by the ROSIS sensor over Pavia, northern Italy. After removing noise bands, there are 102 spectral bands left. The original size of Pavia Center is 1096 × 1096, but some of the samples in the image contain no information and have to be discarded, thus a subset with a size of 1096 × 715 is selected, which includes 9 different classes. Its pseudocolor image can be seen in Fig. 7.

Model Setting
We implement the proposed architecture as stated in Section 2. In particular, since our network must adapt to different data sets, the spectral dimension of the input layer of SFE-Net may differ. The number of neurons in the last softmax layer of both SFE-Net and PSC-Net also varies with the number of classes (C) in each data set.
The optimization process is also an important factor influencing classification performance. We adopt the ADAM [20] optimizer to harness and speed up training. The initial learning rate is set to 0.001 for both pre-training and fine-tuning. The learning rate decay is set to 0.005 for pre-training and to 0.01 for fine-tuning. Given the limited training data in our task, we can set the batch size equal to the whole training set for better gradient behaviour. Pre-training and fine-tuning run for 10000 epochs and 1000 epochs respectively.
Last but not least, it is worth noting that we fix all hyperparameters mentioned above in all experiments, which means that we do not tune our model to overfit any specific data sets.

Effectiveness of Proposed Model
Here we will evaluate the effectiveness of three significant ideas applied in our proposed architecture, which are the supervised pre-training of SFE-Net for extracting spectral feature, the sharing of such feature extractor spatially to include the local spatial correlation into PSC-Net, and the flexibility of network type for hidden layers. In addition, during the fine-tuning stage, the whole FSSF-Net is trained, including the pre-trained network SFE-Net. Therefore, we demonstrate whether these three schemes in the design of the architecture of our model actually contribute to improving accuracy in the following experiments.

Supervised pre-training
To verify the necessity of pre-training, we compare a model whose SFE-Net g is initialized by the pre-training process with a model whose parameters of g are randomly initialized. In other words, the structure of the whole FSSF-Net model remains the same, and only the parameter initialization of the network g differs.
The University of Pavia and the Indian Pines data sets are chosen for this comparison, with training sets of small and medium sizes. For the Indian Pines data set, we randomly selected 5% and 10% of the labeled samples from each class to form the training sets, leaving the rest as the test sets. As for the Pavia University data set, because the training samples have already been separated from the total labeled ones (see Table 1), we randomly selected 1% and 10% of the samples from each class of the separated training samples as the training sets, with all remaining labeled samples used as test sets. The compared results are displayed in Table 2, from which we can observe that on both Pavia University and Indian Pines, supervised pre-training significantly improves classification accuracy, especially in the case of small training-sample size.

Sharing of pre-trained network
Besides the pre-training scheme, parameter sharing is another popular strategy embodied in a variety of deep learning models. For example, CNN extracts shift-invariant features from images via the convolutional operation, and the recurrent neural network (RNN) models sequential dependence through loops in the network. Different from the sharing implied in CNNs and RNNs, we assume that the adjacent pixels around a central pixel share similar spectral information, so it is natural to use the same pre-trained network g to extract spectral features from each pixel of a 7 × 7 × D patch. To verify the effectiveness of this sharing architecture, we set up a contrastive experiment. First, we use the same network g to extract each pixel's spectral features in a patch. Second, in contrast, a different network is used for each pixel in a patch; these networks have the same structure as g but do not share their parameters. In the latter case, the parameters of those networks are still initialized from the pre-trained network, but during fine-tuning the networks for different pixels in a patch may learn different parameters. We apply the same experimental setting as in Section 3.3.1, and the classification results are displayed in Table 3. As seen from Table 3, classification accuracy is dramatically boosted by the sharing scheme. The improvement on the Indian Pines data set is more obvious than on the Pavia University data set, because the local spatial correlation is stronger in Indian Pines. In addition, with fewer parameters in the whole network under the sharing scheme, the model complexity is reduced and over-fitting becomes less likely.
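A back-of-the-envelope parameter count illustrates why sharing helps. The layer sizes below (a one-hidden-layer g with D inputs, u hidden units, and C outputs) are illustrative assumptions, not the exact FSSF-Net configuration:

```python
# Parameter-count comparison for the sharing scheme (illustrative numbers).
D, u, C, W = 200, 100, 9, 7          # bands, hidden units, classes, patch side

# One copy of the spectral extractor g: two weight matrices plus biases
g_params = D * u + u + u * C + C

shared = g_params                    # one g reused across all W*W pixels
unshared = (W * W) * g_params        # a separate copy per pixel position

print(shared, unshared, unshared // shared)   # ratio is W*W = 49
```

With these numbers, sharing keeps the spectral-extractor parameter count at 21,009 instead of roughly a million, independent of the patch size.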

Flexibility of network type in hidden layers
When applying the proposed framework to build our classification network, we adopt the MLP, also called a fully connected network (FCN), for the hidden layers. Specifically, the hidden layers in SFE-Net g are FC layers together with dropout and batch-normalization layers, see Fig. 2(a), and the hidden layers in PSC-Net are FC layers with a dropout layer, see Fig. 2(b). However, the main contribution of this paper is a network architecture suitable for fusing the spectral-spatial information in HSI, while the network type used to implement this architecture remains flexible.
To explore the flexibility of the proposed architecture, we compare the FCN with the locally connected network (LCN) and the convolutional neural network (CNN), because LCN and CNN are, like FCN, widely used as hidden layers.
In fact, an FC layer in network g can be viewed as a convolutional layer with kernel size (1, 1, D) and stride (1, 1, 1), where the number of convolutional kernels equals the number of neurons in the FC layer. In contrast, the LCN can be considered a CNN that does not share parameters across the convolution. In the comparative experiment, we only replace the first two hidden layers in Fig. 2(a) with LCN or CNN layers, keeping the rest of the whole network the same as in the FCN case. In the LCN case, the first hidden layer of network g contains 20 convolutional kernels with kernel size (1, 1, 5) and stride (1, 1, 3), and the second hidden layer has 15 output channels with the same kernel and stride sizes as the first layer. For CNN, to make a fair comparison, the first two hidden layers have the same number of kernels and kernel structure as LCN. The Pavia University data set was used in the experiment, with 50 samples randomly selected from each class as training samples, and the compared results are displayed in Table 4, in which the spectral-only model is just SFE-Net used for classification, and the spectral-spatial model is the whole network including SFE-Net and PSC-Net. We find that when adjacent spatial information is taken into consideration, classification performance is further improved whichever type of hidden layer is used. Moreover, LCN and CNN achieve competitive results compared with FCN when spectral-spatial information is used. These facts indicate that the classification improvement comes from the proposed architecture rather than from any specific type of hidden layer.
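The equivalence between an FC layer applied per pixel and a convolution with kernel size (1, 1, D) can be checked numerically; this NumPy sketch uses toy dimensions and random weights:

```python
import numpy as np

rng = np.random.default_rng(1)
Wp, D, K = 7, 8, 4                   # patch side, bands, kernels/output units
patch = rng.normal(size=(Wp, Wp, D))
weights = rng.normal(size=(D, K))    # one weight set, shared spatially

# A fully-connected layer applied to each pixel along the spectral axis ...
dense_out = (patch.reshape(-1, D) @ weights).reshape(Wp, Wp, K)

# ... equals a convolution with K kernels of size (1, 1, D) and unit stride:
conv_out = np.einsum('ijd,dk->ijk', patch, weights)

print(np.allclose(dense_out, conv_out))   # True
```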

Comparison with Other Methods
To further demonstrate the effectiveness of the proposed method, we compare it with some prominent traditional methods, namely SVM on spectral features [47] and SVM on 3D-DWT spectral-spatial features [35], and with some state-of-the-art deep learning methods: CNN [18], 3D-CNN [9], CNN-PPF [25], and S-CNN [28]. In the experiments, the SVM has a radial basis function (RBF) kernel and is implemented with the libsvm toolbox. As described earlier, our proposed model is linked to the convolutional operation; therefore, we compare our method with several CNN based methods. CNN applies the convolutional operation to hyperspectral image classification but considers only spectral information. 3D-CNN takes a 3D patch as input and convolves it with 3D convolutional kernels. CNN-PPF and S-CNN adopt data augmentation by pairing similar pixels.
It is difficult to make a fair comparison between various deep learning methods because many hyper-parameters need to be adjusted properly. Therefore, we first compare our method to each of these deep learning methods (CNN, 3D-CNN, CNN-PPF, S-CNN) under the same setting stated in its original paper, and the classification results of these four methods are copied directly from their papers, since the reported results were obtained with the optimal or near-optimal structures and parameters tuned by their authors. According to the papers of CNN and CNN-PPF, Salinas, Indian Pines, and Pavia University are used as experimental data sets; we denote their experimental setting as "setting 1", in which 200 samples are randomly selected from each class as training samples, leaving the rest as test samples. "Setting 2" refers to the experimental setting in the paper of 3D-CNN, in which the separations of training and test samples on Kennedy Space Center, Indian Pines, and Pavia University are given in Tables 5, 6, and 7 respectively. "Setting 3" is based on the paper of S-CNN, in which 200 samples are randomly selected from each class as training samples in the Pavia Center, Indian Pines, and Pavia University data sets, while all labeled samples in each data set serve as test samples.
For all three settings, we obtain the results of the proposed FSSF-Net method with the aforementioned architecture and parameters in Section 3.2. The comparison results of all five deep learning methods in the three settings are gathered in Table 8. In addition to classification accuracy, we also evaluate the complexities of these models by counting their parameters on the different data sets, as recorded in Table 9. Overall, as observed from Table 8 and Table 9, our method reaches a good trade-off between model complexity and classification accuracy. CNN and CNN-PPF have fewer parameters than ours, but 3D-CNN and S-CNN contain many more parameters than ours. Our method obtains better results than CNN and CNN-PPF, and outperforms 3D-CNN and S-CNN in most cases.

Table 6. Training and test samples of the Indian Pines data set in "setting 2":

  #   Class                   Training   Test
  1   Alfalfa                       30     16
  2   Corn-notill                  150   1198
  3   Corn-min                     150    232
  4   Corn                         100      5
  5   Grass-pasture                150    139
  6   Grass-trees                  150    580
  7   Grass-pasture-mowed           20      8
  8   Hay-windrowed                150    130
  9   Oats                          15      5
  10  Soybean-notill               150    675
  11  Soybean-mintill              150   2032
  12  Soybean-clean                150    263
  13  Wheat                        150     55
  14  Woods                        150    793
  15  Buildings-Grass-Trees         50     49
  16  Stone-Steel-Towers            50     43
      Total                       1765   6223

Table 7. Training and test samples of the Pavia University data set in "setting 2":

  #   Class           Training    Test
  1   Asphalt              548    5472
  2   Meadows              540   13750
  3   Gravel               392    1331
  4   Trees                542    2573
  5   Metal Sheets         256    1122
  6   Bare soil            532    4572
  7   Bitumen              375     981
  8   Bricks               514    3363
  9   Shadows              231     776
      Total               3930   33940

The next experiment evaluates the HSI classification methods in the situation of a small training-sample size. As is well known, deep learning techniques always need a significant amount of labeled data for training. However, due to the limited availability of labeled data in HSI, it is necessary to evaluate the deep learning methods with limited training samples.
Four spectral-spatial deep learning models, 3D-CNN, CNN-PPF, S-CNN, and FSSF-Net, are evaluated, and we also include two prominent traditional methods, SVM and 3D-DWT, to contrast with the deep learning methods. In these experiments, we randomly select 50 samples from each class to form the training set, leaving the rest as the test set. Salinas, Indian Pines, and Pavia University are used as experimental data sets. In particular, for Indian Pines, because some classes have fewer than 50 labeled samples, we only choose the 9 classes that contain more than 400 samples for classification. The experimental setting of our method remains the same as stated in Section 3.2, and the settings of the other methods follow their original papers. The compared results are listed in Tables 10, 11, and 12. Compared with the classification results in Table 8, all deep learning methods suffer performance degradation to different degrees. However, our method consistently outperforms the others, indicating that it can better handle the small training-sample-size problem by taking full advantage of the spectral and spatial properties of HSI in a well-designed network architecture. As the size of the training set decreases, the impact on 3D-CNN and S-CNN is greater than on CNN-PPF, because CNN-PPF has fewer parameters and is less prone to over-fitting. Surprisingly, 3D-DWT achieves quite competitive results compared to the deep learning methods.
We further drill down into the details of these deep learning methods in comparison with our model. To overcome the problem of the small number of available labeled samples in HSI, CNN-PPF and S-CNN use a data augmentation technique in which each pixel is combined with other pixels belonging to the same or different classes to form paired training samples. In our model, we merge the spectral and spatial properties of HSI into the network structure, which also plays a role of data augmentation, but in a new way: the central pixel in a patch is enhanced by its neighboring pixels, with or without labels. However, both CNN-PPF and S-CNN are just feature extractors and adopt separate classifiers to classify the extracted features. In contrast, our model is end-to-end, i.e., it combines feature extraction with classification, which is not only easy to implement but also improves the performance of both feature extraction and classification. 3D-CNN also takes an HSI patch as input like our method, but it lacks the sharing architecture used in our method, so it contains more parameters, making it work worse than our model in the case of small training-sample size.
As a supplement, we further record the training and inference times in Table 13, from which it can be seen that our method converges faster than 3D-CNN, CNN-PPF, and S-CNN, but costs more time than CNN-PPF and S-CNN for inference. The reason may be that in our method the pre-trained SFE-Net g needs to process each pixel in a patch individually, whereas CNN-PPF and S-CNN process the patch as a whole at once. Moreover, for better visualization of classification performance, we plot the classification maps of our method on the three data sets, see Figs. 8, 9, and 10. Compared to the corresponding ground truth, our results are very close to the real ones, except for failures at some edges of certain patches in the classification maps.

Conclusions
Deep learning applied to HSI classification is a fast-developing area. Motivated by the spectral-spatial structure of HSIs, a novel end-to-end network architecture is proposed in this paper, which factorizes the spectral-spatial feature into two subnetworks. Because of the rich spectral information of hyperspectral data, pre-training the subnetwork used for spectral feature extraction with labeled samples not only reduces the feature dimensionality but also learns an inherent spectral structure that is better for classification. To fuse the local spatial information, we share the pre-trained subnetwork across a patch and combine the extracted features into a vector fed to another subnetwork for final classification. The sharing scheme decreases the number of parameters, mitigating the over-fitting problem. Compared with some state-of-the-art deep learning methods and typical conventional classification methods, the proposed method consistently outperforms them in all experiments.