Deep Multiview Learning for Hyperspectral Image Classification

Recently, the field of hyperspectral image (HSI) classification has been dominated by deep learning-based methods. However, training deep learning models usually requires a large number of labeled samples to optimize thousands of parameters. In this article, a deep multiview learning method is proposed to deal with the small-sample problem of HSI classification. First, two views of an HSI scene are constructed by applying principal component analysis to different groups of bands. Second, a deep residual network is designed to embed the different views of a sample into a latent space. The designed deep residual network is trained by maximizing agreement between differently augmented views of the same data sample via a contrastive loss in the latent space. Note that the training procedure of the designed deep residual network does not use label information. Therefore, the proposed method belongs to the category of unsupervised learning, which alleviates the lack of labeled training samples. Finally, a conventional machine learning method (e.g., a support vector machine) is used to complete the classification task in the learned latent space. To demonstrate the effectiveness of the proposed method, extensive experiments are carried out on four widely used hyperspectral data sets. The experimental results demonstrate that the proposed method improves the classification accuracy with small samples.


I. INTRODUCTION
HYPERSPECTRAL image (HSI) classification involves assigning a category tag to each sample according to its spectral information and spatial information [1], [2]. In this task, one of the greatest challenges is determining what types of features should be used as the input of a classifier. In HSI, each pixel can be regarded as a high-dimensional vector whose entries correspond to the spectral reflectance at a specific wavelength. Naturally, traditional classification methods focus on exploring the spectral signatures for the classification tasks. Thus, support vector machines (SVMs) [3], extreme learning machine (ELM) [4], random forest (RF) [5], sparse representation [6], and other pixelwise classifiers are used for HSI classification tasks. A disadvantage of these pixelwise classifiers is that they do not consider spatial information in the classification procedure. In this context, feature extraction methods that can include spatial information are introduced to improve the classification performance, e.g., Gabor filters [7], local binary patterns [8], morphological profiles [9]-[11], and wavelets [12]. A major limitation of these spatial features is that they require a great deal of tuning to work well on a particular data set. Deep learning can learn to extract features for classification from data without the artificial design of feature extraction rules. Thus, deep learning has gained increasing attention in the field of HSI classification [13]. Deep learning-based pixelwise classifiers for HSI include the stacked autoencoder (SAE) [14], 1-D convolutional neural network (1D-CNN) [15], deep belief network (DBN) [16], and recurrent neural network (RNN) [17]. To utilize spatial-spectral features to improve the classification accuracy, 2-D-CNNs [18]-[22] and 3-D-CNNs [23]-[26] have also been designed for HSI classification.
Moreover, a multiscale dense network [24] is designed to make full use of different scale information in the network structure and combine scale information throughout the network, which improves the training speed and accuracy for HSI classification. A deep learning ensemble framework [27] based on the integration of deep learning model and random subspace-based ensemble learning is proposed to further boost the classification performance. A cascaded RNN model is designed to fully explore the redundant and complementary information of the high-dimensional spectral signature in [28]. In addition to the traditional deep learning models (SAE, CNN, and RNN), some variants of deep learning are also applied to HSI classification, e.g., deep multigrained cascade forest [29] and capsule network [30].
The abovementioned supervised deep learning classifiers can achieve a higher classification accuracy than the traditional methods. However, the shortage of training samples remains one of the main obstacles in applying deep neural networks to HSI classification tasks. Unsupervised learning can learn useful information from unlabeled data for subsequent tasks. Therefore, researchers have conducted some valuable explorations of unsupervised learning for HSI. The autoencoder is a common unsupervised framework and has been widely used for HSI [31]-[33]. Meanwhile, some improved autoencoder variants are used in HSI classification, e.g., the deep residual conv-deconv network [34] and the 3-D convolutional autoencoder [35]. To further improve the classification accuracy, a self-taught learning framework is designed in [36]. In addition, the generative adversarial network [37] is introduced to train a deep learning-based feature extractor in an unsupervised manner, which improves the results of unsupervised training. The majority of published unsupervised feature learning frameworks for HSI involve complex training procedures, and they cannot deal with small-sample problems well.
In real-world applications, data are usually collected from diverse domains or obtained from various feature extractors and exhibit heterogeneous properties, as variables of each data example can be naturally partitioned into groups. Each variable group is referred to as a particular view. Multiview learning, aiming to learn one function to model each view and jointly optimize all the functions to improve the generalization performance, has made great progress and developments in recent years [38]. Notably, deep multiview learning has achieved great success in image classification and recognition. For example, a contrastive coding scheme is proposed in [39] and achieves state-of-the-art results on the Imagenet benchmark. Momentum contrast learning [40] has obtained positive results of unsupervised learning in a variety of computer vision tasks and data sets.
Bands of HSI can be naturally considered as different views of a scene, as different bands reflect the different properties of ground objects. Motivated by this and the recent success of multiview learning, a novel deep multiview learning method is proposed to deal with the classification issue of HSI with a small sample. Especially, a deep residual network, a variant of Resnet50 [41], is designed to map the different views to a latent space. The designed deep residual network is then trained by maximizing agreement between differently augmented views of the same data sample via a contrastive loss in the feature space. Unsupervised training manner of the proposed method ensures enough training data and alleviates the problem of lacking labeled training samples. More importantly, the learned view-invariant features could greatly improve the small sample classification accuracy of HSI.
The main contributions of this article can be summarized as follows.
1) A deep multiview learning method is proposed for HSI classification, which makes the deep network learn to extract view-invariant features. Extensive experiments and analysis on four benchmarks demonstrate the effectiveness of the proposed deep multiview learning method.
2) A deep residual network with 51 layers is designed to extract features from different views of HSI. Note that the 51-layer residual network is larger than the existing networks in the field of HSI classification. The depth of the designed network ensures the generalization ability of the learned features.
3) Two data augmentations are adopted to improve the unsupervised learning results. Experiments show that data augmentation can further improve the classification accuracy of the proposed method.

The remainder of this article is organized as follows. In Section II, the proposed classification framework is described in detail. In Section III, the experimental results and corresponding analysis are presented. In Section IV, this article concludes with some discussion.

II. PROPOSED METHOD
In this section, the proposed method is described in detail. First, the architecture of deep multiview learning is given. Then, the deep residual network and the training and testing procedures are described.

A. Deep Multiview Learning
Traditional unsupervised loss functions [e.g., the mean square error (MSE)] calculate the distance between the predicted value and the original input. However, it is difficult to guarantee the effectiveness of the features only by optimizing the reconstruction error. To make the learned features more effective for classification tasks, we optimize a contrastive loss function that makes the features from different views of the same sample consistent. This makes the features of the same class aggregate with each other, while the features of different classes are pushed far away from each other. Therefore, the features obtained by optimizing the contrastive loss function over different views can effectively improve the classification accuracy. We use a deep CNN as the base feature extractor. We call this proposed method deep multiview learning.
The forward propagation of the proposed deep multiview learning, illustrated in Fig. 1, mainly includes three steps: constructing two views of a sample, inputting each view into the networks f(·) and g(·) to generate the latent feature h and the output feature z, and calculating the contrastive loss according to the output feature z. Note that f(·) is a Resnet50 without the classification layer, and g(·) is a fully connected network that reduces the dimensionality of the output features.
A large number of studies show that considering spatial neighborhood information in a neural network can improve the classification performance. Thus, an m × m × b cube is used as the feature of a sample, where m is the size of the neighborhood and b is the number of bands. In fact, each band of an HSI can be treated as a view of a scene, as different bands reflect different properties of ground objects. On the one hand, there is a strong correlation between adjacent bands of an HSI. On the other hand, there are many bands in an HSI, which means a large number of views. A large number of views complicates the process of deep multiview learning. Taking these factors into account, we design a simple method to construct two views of an HSI. As shown in Fig. 1, the bands of the HSI are divided into two groups. The first group of bands is transformed by PCA to generate the first view. Here, the first three principal components are taken as the first view. Similarly, the second group of bands is used to generate the second view. Taking the Indiana Pines data set as an example, there are 200 bands in this HSI. The first 100 bands are used to generate the first view. The remaining bands are used to generate the second view.

The base feature extractor f(·) maps each view of the i-th sample to a high-level feature vector. A multilayer perceptron with two fully connected layers g(·) is used to transform the latent features h_1^(i) and h_2^(i) into z_1^(i) and z_2^(i). Then, we define the contrastive loss on z_1^(i) and z_2^(i). In the training procedure, a minibatch of N samples is randomly selected as the training samples for one parameter update. The contrastive prediction task is defined on pairs of samples derived from the minibatch, resulting in 2N views. Among these 2N views, two views from the same sample are taken as a positive pair, and two views from different samples are taken as negative pairs.
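The two-view construction described above can be sketched as follows; the function name and the use of scikit-learn's PCA are illustrative assumptions, and the random cube merely stands in for a real HSI scene:

```python
import numpy as np
from sklearn.decomposition import PCA

def build_two_views(hsi, n_components=3):
    """Split the bands of an HSI cube (H x W x B) into two halves and
    reduce each half to n_components principal components per pixel."""
    h, w, b = hsi.shape
    views = []
    for band_group in (hsi[:, :, : b // 2], hsi[:, :, b // 2 :]):
        flat = band_group.reshape(-1, band_group.shape[-1])   # (H*W, bands)
        pcs = PCA(n_components=n_components).fit_transform(flat)
        views.append(pcs.reshape(h, w, n_components))
    return views  # two H x W x 3 arrays

# Toy example: a random 20 x 20 "scene" with 200 bands (as in Indian Pines).
cube = np.random.rand(20, 20, 200)
v1, v2 = build_two_views(cube)
```

Each view is then cropped into m × m × 3 neighborhoods around every pixel before being fed to the network.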
The contrastive loss for a positive pair of views (i, j) is defined as

ℓ(i, j) = −log [ exp(sim(z_i, z_j)/τ) / Σ_{k=1}^{2N} 1_{[k≠i]} exp(sim(z_i, z_k)/τ) ]

where 1_{[k≠i]} is an indicator function evaluating to 1 if k ≠ i, τ denotes a temperature parameter, and sim(z_i, z_j) = z_i^T z_j / (‖z_i‖ ‖z_j‖) denotes the cosine similarity between two vectors z_i and z_j. In a minibatch, the total loss is computed across all positive pairs. Our goal is to learn representations that capture information shared between multiple sensory views without human supervision. This loss is defined according to the similarity between views, which means that it does not need any human supervision information. In other words, it is an unsupervised method. More importantly, this loss ensures that the network learns to extract view-invariant features, which is a useful representation of the samples.
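Assuming this is the standard normalized temperature-scaled cross-entropy formulation, the loss can be sketched framework-agnostically in NumPy (the temperature value and batch layout below are illustrative choices, not the paper's reported settings):

```python
import numpy as np

def nt_xent_loss(z1, z2, tau=0.5):
    """Contrastive loss over a minibatch.
    z1, z2: (N, d) projected features of the two views of the same N samples."""
    z = np.concatenate([z1, z2], axis=0)              # 2N view features
    z = z / np.linalg.norm(z, axis=1, keepdims=True)  # unit vectors -> cosine sim
    sim = z @ z.T / tau                               # pairwise similarities / tau
    np.fill_diagonal(sim, -np.inf)                    # exclude the k == i terms
    n = z1.shape[0]
    # the positive partner of view i is view i+n (and vice versa)
    pos = np.concatenate([np.arange(n, 2 * n), np.arange(n)])
    log_prob = sim[np.arange(2 * n), pos] - np.log(np.exp(sim).sum(axis=1))
    return -log_prob.mean()

rng = np.random.default_rng(0)
z1, z2 = rng.normal(size=(8, 16)), rng.normal(size=(8, 16))
loss = nt_xent_loss(z1, z2)
```

Minimizing this quantity pulls the two views of each sample together while pushing apart views from different samples.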
When the number of views is more than two, the features of the different views are combined in pairs. The contrastive loss is computed separately for each pair of views. Then, the sum of the contrastive losses computed over all pairs is used as the final loss function.
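A minimal sketch of this pairwise generalization, assuming a hypothetical helper name and any two-view contrastive loss function plugged in as an argument:

```python
from itertools import combinations

def multiview_loss(view_features, pair_loss):
    """Sum a two-view contrastive loss over all unordered pairs of views.
    view_features: a list with one feature batch per view.
    pair_loss: any two-view contrastive loss function."""
    return sum(pair_loss(za, zb) for za, zb in combinations(view_features, 2))

# With 4 views there are C(4, 2) = 6 pairwise loss terms; a dummy pair_loss
# that returns 1 per pair makes the count visible.
n_pairs = multiview_loss([None] * 4, lambda a, b: 1)
```

This quadratic growth in pairwise terms is why, as noted above, more views significantly increase training cost.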

B. Deep Residual Network
The f(·) that extracts representation vectors from views can adopt various network architectures. Recent studies [41], [42] reveal that the classification performance benefits from bigger models. Residual learning has become a common method to improve the accuracy in natural image recognition and HSI classification [43], [44]. Therefore, a variant of Resnet50 [41] is used as the network that extracts representation vectors from views. Deep residual learning makes training deep networks easier. Thus, it has been widely used in a variety of classification tasks. As shown in Fig. 2, the core idea of deep residual learning is to introduce a shortcut connection, which directly skips one or more layers. The deep residual network is built from residual blocks. As shown in Fig. 2, there are three convolutional layers in a standard residual block. Each convolutional layer is followed by a batch normalization layer (BatchNorm) and a ReLU layer. The original input is then added to the output of the last convolutional layer as the output of a residual block, which is the shortcut operation. The output of a residual block is activated by a ReLU layer and serves as the input of the next residual block. Note that the dimension of the input data may differ from the output dimension of the last convolutional layer of the residual block. When these dimensions differ, a 1 × 1 convolutional layer followed by a batch normalization layer is applied to the input as a resampling operation to ensure consistent data dimensions.
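As a sketch of such a bottleneck residual block in PyTorch (the class name and channel sizes are illustrative choices, not the authors' exact implementation):

```python
import torch
import torch.nn as nn

class Bottleneck(nn.Module):
    """Resnet50-style bottleneck block: three conv layers (1x1 -> 3x3 -> 1x1),
    each followed by BatchNorm, with a 1x1 projection on the shortcut path
    whenever the input and output dimensions differ."""
    def __init__(self, in_ch, mid_ch, out_ch, stride=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, mid_ch, 1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, mid_ch, 3, stride=stride, padding=1, bias=False),
            nn.BatchNorm2d(mid_ch), nn.ReLU(inplace=True),
            nn.Conv2d(mid_ch, out_ch, 1, bias=False),
            nn.BatchNorm2d(out_ch),
        )
        self.shortcut = nn.Identity()
        if stride != 1 or in_ch != out_ch:  # resample input to match dimensions
            self.shortcut = nn.Sequential(
                nn.Conv2d(in_ch, out_ch, 1, stride=stride, bias=False),
                nn.BatchNorm2d(out_ch),
            )
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        # shortcut addition, then the final ReLU activation
        return self.relu(self.body(x) + self.shortcut(x))

block = Bottleneck(64, 64, 256)
out = block(torch.randn(2, 64, 27, 27))  # a batch of two 27 x 27 views
```

Here the 64-channel input and 256-channel output differ, so the 1 × 1 projection branch is active, matching the resampling described above.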
The original Resnet50 consists of one convolutional layer, 16 residual blocks, two pooling layers, and one classification layer (fully connected layer). The purpose of training the network is not to classify but to learn the representations of views. Therefore, the Resnet50 without the classification layer is used as the base feature extractor f(·). As shown in Fig. 3, the Resnet50 f(·) actually consists of 49 convolutional layers: 1 + 3 × (3 + 4 + 6 + 3) = 49. The details of the deep residual network used as the base feature extractor f(·) are shown in Table I. Note that the output of the deep residual network is a 2048-dimensional vector. Subsequently, a multilayer perceptron with two fully connected layers g(·) is applied to the output vector of the Resnet50 f(·) to reduce the dimensionality of the output features. In fact, the network used to extract features includes 49 convolutional layers and two fully connected layers.

C. Training and Testing Procedure
The contrastive loss is defined on the outputs of the multilayer perceptron. More specifically, the pseudocode for a training minibatch procedure is given in Algorithm 1.
Data augmentation is a common technique that can effectively improve the generalization ability of a model and has been widely used in supervised deep learning. However, data augmentation has not been used in the contrastive prediction task for HSI. Consequently, two data augmentations (random cropping and random Gaussian blur) are used to improve the robustness of network training.
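A minimal sketch of the two augmentations; the crop size and blur strength below are illustrative assumptions, as the paper does not report these values:

```python
import numpy as np
from scipy.ndimage import gaussian_filter

rng = np.random.default_rng(0)

def random_crop(view, out_size):
    """Randomly crop an m x m x c view patch to out_size x out_size x c."""
    m = view.shape[0]
    top = rng.integers(0, m - out_size + 1)
    left = rng.integers(0, m - out_size + 1)
    return view[top : top + out_size, left : left + out_size, :]

def random_gaussian_blur(view, max_sigma=1.5):
    """Blur the two spatial dimensions with a randomly drawn sigma,
    leaving the channel dimension untouched."""
    sigma = rng.uniform(0.1, max_sigma)
    return gaussian_filter(view, sigma=(sigma, sigma, 0))

patch = rng.random((27, 27, 3))                 # one 27 x 27 x 3 view patch
aug = random_gaussian_blur(random_crop(patch, 24))
```

Each view of a sample is augmented independently before being passed through f(·), so the positive pair differs both in band content and in augmentation.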
In the testing procedure, the deep residual network trained on a specific HSI is used as a feature extractor. Then, all samples of this HSI pass through the deep residual network to output the corresponding feature vectors. At this point, conventional machine learning methods can be applied to the extracted features to complete the classification task. Here, an SVM classifier and an RF classifier are used.
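The testing-stage pipeline can be sketched as follows; the random 128-D vectors merely stand in for the features the trained network would output, and the hyperparameter values are placeholders:

```python
import numpy as np
from sklearn.svm import SVC

# Stand-in features: in practice each row would be the output of the trained
# deep residual network for one sample (here, random placeholders).
rng = np.random.default_rng(1)
features = rng.normal(size=(200, 128))
labels = rng.integers(0, 9, size=200)          # 9 classes, as in Pavia

# Five labeled samples per class train the classifier (small-sample setting).
train_idx = np.concatenate(
    [np.flatnonzero(labels == c)[:5] for c in range(9)]
)
svm = SVC(kernel="rbf", C=8.0, gamma="scale")
svm.fit(features[train_idx], labels[train_idx])
pred = svm.predict(features)                   # classify every sample in the scene
```

An RF classifier drops in the same way by replacing `SVC` with `RandomForestClassifier`.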

III. EXPERIMENTAL RESULTS AND ANALYSIS
The proposed method is implemented with the PyTorch library. The results are generated on a PC equipped with a 2.6-GHz Intel Core i7-9750H CPU, an Nvidia GeForce RTX 2070M GPU, and 16 GB of memory.

A. Data Sets
To demonstrate the effectiveness of the proposed method, the University of Pavia data set, the Indiana Pines data set, the Salinas data set, and the Houston data set are used to conduct classification experiments. In the feature learning procedure, 50% of the samples (unlabeled) are used as the training data, and the remaining 50% are used as the testing data. In each data set, five labeled samples per class are randomly selected as the training samples for the supervised classifier in the classification procedure.
The University of Pavia data set was acquired by the ROSIS sensor during a flight campaign over Pavia, Northern Italy. It has 103 spectral bands covering 0.43-0.86 μm and a geometric resolution of 1.3 m. The image size is 610 × 340 pixels. In this data set, 42 776 pixels with nine classes are labeled. Labels, the number of labeled training samples, and the number of testing samples are listed in Table II.
The second data set is the Indiana Pines data set. This data set was gathered by the Airborne Visible/Infrared Imaging Spectrometer (AVIRIS) sensor over the Indian Pines test site in Northwestern Indiana and consists of 145 × 145 pixels and 224 spectral reflectance bands in the wavelength range 0.4-2.5 μm; 24 bands covering the region of water absorption are removed, resulting in 200 bands for classification. This scene contains two-thirds agriculture and one-third forest or other natural perennial vegetation; 10 249 pixels with 16 classes are labeled. Labels, the number of labeled training samples, and the number of testing samples are listed in Table III.

Fig. 3. Illustration of the deep residual network used as the base feature extractor. ZeroPAD denotes a padding operation, CONV denotes a convolutional layer, BatchNorm denotes a batch normalization layer, ReLU denotes a ReLU layer, CONV Block denotes a residual block consisting of three convolutional layers, and MaxPool and AVGPool denote the max-pooling layer and the global average pooling layer, respectively. Block*n represents the residual block repeated n times.

The third data set is the Salinas data set, gathered by the AVIRIS sensor over the Salinas Valley, California. There are 224 spectral channels ranging from 0.4 to 2.5 μm with a spatial resolution of 3.7 m. The area covered comprises 512 × 217 pixels. As with the Indian Pines data set, 20 water absorption bands are discarded; 54 129 pixels with 16 classes are labeled. Labels, the number of labeled training samples, and the number of testing samples are listed in Table IV.
The fourth data set is the Houston data set, gathered by the ITRES CASI-1500 sensor. This data set is composed of 349 × 1905 pixels with 144 spectral channels ranging from 364 to 1046 nm. There are 15 classes in this scene. Labels, the number of labeled training samples, and the number of testing samples are listed in Table V.

B. Parameter Setting and Analysis
The neighborhood size is an important parameter that affects the classification performance. To analyze the influence of the neighborhood size on the classification accuracy, the neighborhood size is set to 5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29, 31, 33, and 35, respectively. The classification results are shown in Fig. 4. According to the experimental results, we find that a small neighborhood size reduces the classification accuracy, and the optimal neighborhood sizes of the four data sets are 25, 27, 35, and 27, respectively. However, for the University of Pavia data set, setting the neighborhood size to 25 or 27 has little effect on the final classification results. The Salinas data set shows a similar situation. Considering the adaptability of parameters across different data sets, the neighborhood size is set to 27 × 27 for all four data sets. Consequently, the dimension of each view is 27 × 27 × 3.
In general, training a CNN requires setting the learning rate, the number of epochs, the optimizer, and the batch size. In this article, the widely used Adam [45] optimizer is used to optimize the designed deep residual network. The batch size is set to 128. The training loss value is adopted as the evaluation index of network training, as the training procedure is unsupervised. The learning rate is set to 0.1, 0.01, and 0.001, respectively. The training loss values with different learning rates are shown in Fig. 5. From Fig. 5, we can see that a large learning rate (e.g., 0.1 or 0.01) is not conducive to network training and leads to a large loss value. In contrast, a small learning rate (e.g., 0.001) enables the network to be fully trained. Therefore, a small learning rate and a large number of epochs are used to ensure the convergence of the network. Finally, the learning rate is set to 0.001, and the number of epochs is set to 50.
The nonlinear SVM with a radial basis function (RBF) kernel is used as the supervised classifier in this section. The goal of this article is to deal with the problem of small-sample classification. Therefore, only five labeled samples per class are randomly selected as the training samples of the SVM classifier to analyze the influence of parameters on classification accuracy. The SVM classifier with an RBF kernel requires setting the parameters C (a parameter that controls the amount of penalty during the SVM optimization) and γ (the spread of the RBF kernel). The optimal hyperparameters C and γ are traced in the ranges C = 2^-2, 2^-1, ..., 2^7 and γ = 2^-2, 2^-1, ..., 2^7 using fivefold cross validation [46].
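This grid search can be sketched with scikit-learn's GridSearchCV; the data below are synthetic placeholders standing in for the learned feature vectors:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# Search C and gamma over 2^-2 ... 2^7, as in the article, with fivefold CV.
param_grid = {
    "C": [2.0**k for k in range(-2, 8)],
    "gamma": [2.0**k for k in range(-2, 8)],
}

# Synthetic placeholder data; real inputs would be the extracted features
# of the five labeled training samples per class.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 16))
y = np.arange(60) % 3                      # three balanced placeholder classes

search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5)
search.fit(X, y)
```

`search.best_params_` then holds the selected (C, γ) pair used to refit the final classifier.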
To study the importance of data augmentation composition, the network is trained with applying augmentations individually or in pairs. The classification results are shown in Fig. 6. In Fig. 6, "None" represents that no data augmentation is applied, "RC" represents that only random cropping is applied, "RG" represents that only random Gaussian blur is applied, and "RC + RG" represents that random cropping and random Gaussian blur are applied. From the results of Fig. 6, we find that both random cropping and random Gaussian blur could improve the classification accuracy. In contrast, without data augmentation, the subsequent classification accuracy will be greatly reduced. Therefore, both random cropping and random Gaussian blur are used in the training procedure.
Previous studies have shown the advantages of large-scale deep CNNs (e.g., GoogLeNet and Resnet50) in natural image classification and recognition. However, large-scale deep CNNs have not been used in HSI classification tasks. In this article, a deep residual network with 51 layers derived from the standard Resnet50 is used as the feature extractor. To prove the necessity of using large-scale networks in multiview learning, LeNet, Resnet18, and Resnet34 are also used as the feature extractors on the four HSI data sets. We also test the channelwise attention Resnet [47] (SE + Resnet50). The depth and parameters of the different networks are listed in Table VI. The classification results are shown in Fig. 7. It is found that the classification accuracy decreases as the network scale decreases. In addition, the introduction of channelwise attention has little effect on improving the final classification accuracy but increases the training time. This is because the classification accuracy gradually stabilizes or decreases as the network complexity increases. When the classification accuracy tends to be stable, increasing the network complexity (e.g., introducing channelwise attention) will further reduce the classification accuracy. Therefore, we finally use the Resnet50 as the base feature extractor. Note that Resnet50 is one of the most commonly used classical network models in computer vision tasks. Using Resnet50 enables us to reuse a classic network model. This not only saves the work of network design but also proves that a deep network model can be used to improve the classification accuracy in the HSI classification task.
To analyze the influence of the number of views on the classification accuracy, we also divide the HSI into four views and eight views. The results are shown in Fig. 8. From these results, we find that more views (e.g., four) improve the classification accuracy slightly. However, more views increase the complexity of the model, which leads to a significant increase in training time. More importantly, two views alone can achieve satisfactory classification results. Consequently, only two views are used to train the designed deep network in subsequent experiments.
To prove the necessity of using PCA, we test inputting the grouped HSI data cubes (raw data) directly into the designed deep residual network for multiview learning. The results on the four data sets are shown in Fig. 9. It can be found that directly using raw data as input greatly reduces the accuracy of subsequent classification. This is because PCA not only retains the main information of each view but also increases the difference between the two views. Thus, it helps the contrastive loss function mine the features of HSIs.

C. Comparison Results With the State-of-the-Art Methods
In this section, the performance of the proposed deep multiview learning (DMVL + SVM) is compared with several  state-of-the-art methods. The compared methods are listed as follows.
1) EMP + SVM [10] is a traditional spatial feature extraction method that improves the classification accuracy. It has been widely used in the HSI classification task. For EMP, two commonly used morphological filters based on square structuring elements (opening and closing) are used to construct the morphological attribute profiles. The radius of the structuring elements is set to 1, 3, 5, 7, and 9, respectively. The optimal hyperparameters of the SVM classifier are determined by fivefold cross validation.
2) TSVM [48], [49] is a semisupervised method that uses unlabeled samples to improve the classification accuracy. It also uses an SVM classifier with a radial basis function kernel. All unlabeled samples are used for training.
3) Joint within-class collaborative representation (JCR) [50] is a representation-based classification method. In this method, neighbors near the test pixel are simultaneously represented via linear combinations of available training samples. Several strategies for incorporating contextual information are used to improve the classification performance.
4) 3DCAE [35] is an unsupervised spatial-spectral feature learning method based on a 3-D convolutional autoencoder. It is very effective in extracting spatial-spectral features. The parameters are set as in the original paper.
5) GAN [51] is an unsupervised feature learning method based on a generative adversarial network. PCA is used to reduce the HSI to three dimensions. Then, a 2-D GAN is used to learn features. The neighborhood size is also set to 28 × 28.
6) DFSL + SVM [52] is a transfer learning method. It trains a deep 3-D-CNN to learn a metric space on data collected in advance. The trained network is then transferred to the target HSI. It achieves excellent results with small samples. The parameters are set as in the original paper.
7) In the case of only a single view, DMVL degenerates to a supervised classifier. In other words, a ResNet50 classifier is used for HSI classification. Therefore, we also test a supervised ResNet50 classifier. The learning rate is set to 0.001, the batch size is set to 5, and the number of epochs is set to 300.

We also test the proposed DMVL with an RF classifier. Note that five labeled samples per class are randomly selected as the supervised samples, and all labeled training samples for the different methods are exactly the same.
The class-specific accuracy, overall accuracy (OA), average accuracy (AA), and κ of the different methods on the four HSI data sets are listed in Tables VII-X. From these results, we can see that the Resnet50 classifier has the worst classification accuracy when only five labeled samples are used per class. This shows that training a deep network with a small sample produces a serious overfitting problem, which leads to low classification accuracy. In contrast with the Resnet50 classifier, the proposed method DMVL combined with an SVM classifier or an RF classifier achieves higher OA, AA, and κ than the other compared methods. For example, in Table VII, DMVL + SVM (i.e., 86.96%) yields over 10% higher accuracy than EMP + SVM (i.e., 70.77%) and approximately 5% higher accuracy than DFSL + SVM. Note that DFSL is a transfer learning method designed for small-sample problems. In particular, in Table VIII, the OA of DMVL + SVM (78.01%) is over 20% higher than that of 3DCAE (52.21%).
In fact, the contrastive loss function is an unsupervised loss function. Thus, an unsupervised autoencoder (3DCAE) that adopts the traditional reconstruction error loss function is used as a compared method. Compared with the traditional reconstruction error, the contrastive loss function greatly improves the performance of subsequent HSI classification tasks. To further illustrate the effectiveness of the algorithm, the proposed method is also compared with the GAN-based method. The experimental results show that the contrastive loss function is superior to GAN. In addition, the contrastive loss function is compared with the traditional cross-entropy loss function (ResNet50). The experimental results show that the classification accuracy of the contrastive loss function is much higher than that of the traditional cross-entropy loss function.
In order to better observe the classification results, the classification maps of different methods on four HSI data sets are shown in Figs. 10-13. To facilitate the comparison between different methods, the ground-truth maps are shown in Figs. 10-13. From these maps, we can learn that the compared methods exhibit higher classification errors than the proposed method.
The abovementioned experimental results prove the effectiveness of the proposed method in the case of small samples. To further test the effectiveness of the proposed method, we vary the number of labeled samples used for supervised training. The classification results are shown in Fig. 14. First, when the number of labeled samples is further reduced (e.g., one or three samples per class), the accuracy of all classification algorithms is greatly reduced. However, the proposed method still achieves the highest classification accuracy. Second, the classification accuracy of all methods increases with the number of labeled samples. This is easy to understand because the overfitting problem is alleviated as the number of labeled samples increases. Third, DMVL + SVM, DMVL + RF, and DFSL + SVM generally outperform the other classification methods. More importantly, DMVL + SVM and DMVL + RF obtain the best classification accuracy in most cases. This further demonstrates that the proposed method not only deals with the problem of small samples but also adapts well to different numbers of labeled samples. That is to say, the proposed method obtains higher classification accuracy than the other compared methods across different numbers of labeled samples. In addition, as the number of labeled samples increases, the difference in classification accuracy between DMVL + SVM and DMVL + RF becomes smaller and smaller. In any case, the proposed method combined with SVM or RF achieves better classification results than the compared methods.

D. Feature Visual Analysis
In order to analyze the effectiveness of the proposed method, we visualize the original spectral features, the features obtained with the MSE loss function, and the features extracted by the proposed method. The visualization results are shown in Fig. 15. The first row of Fig. 15 shows the results for the University of Pavia data set obtained from the original spectral features, the MSE loss function, and the contrastive loss function; rows 2-4 correspond to the Indiana Pines, Salinas, and Houston data sets, respectively. In Fig. 15, different colors represent samples of different classes. From the visualization results, we can see that the distribution of the features extracted with the MSE loss function is not significantly improved over that of the original spectral features. On the contrary, the distance between features of different classes extracted by the proposed method is significantly increased, which is a benefit of the contrastive loss function used in this article. Thus, the proposed method can effectively improve the classification accuracy of HSIs.

TABLE XI: EXECUTION TIMES OF TRAINING AND TESTING PROCEDURES IN THE IP DATA SET
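The qualitative observation above (larger spatial distance between classes in the learned space) can also be quantified. The sketch below is a simplified stand-in for the visualization pipeline: it projects features to 2-D with PCA (rather than whichever embedding Fig. 15 actually uses) and reports the ratio of between-class to within-class distance, so the function name and the separation metric are illustrative assumptions, not the paper's procedure.

```python
import numpy as np

def class_separation_2d(features, labels):
    """Project features to 2-D and measure class separation.

    Returns (coords, ratio): 2-D coordinates suitable for a scatter
    plot, and the ratio of the mean between-class centroid distance
    to the mean within-class distance. A larger ratio means the
    classes are more spread apart in the projected space.
    """
    X = features - features.mean(axis=0)
    _, _, vt = np.linalg.svd(X, full_matrices=False)      # top principal directions
    coords = X @ vt[:2].T                                 # (n, 2) projection
    classes = np.unique(labels)
    centroids = {c: coords[labels == c].mean(axis=0) for c in classes}
    # mean distance of each sample to its own class centroid
    within = np.mean([np.linalg.norm(coords[labels == c] - centroids[c], axis=1).mean()
                      for c in classes])
    cents = np.array([centroids[c] for c in classes])
    diff = cents[:, None, :] - cents[None, :, :]
    between = np.linalg.norm(diff, axis=-1)[np.triu_indices(len(classes), 1)].mean()
    return coords, between / within
```

Under this proxy metric, features from a contrastively trained encoder would be expected to yield a higher ratio than the raw spectral features, matching the visual impression of Fig. 15.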

E. Execution Time Analysis
The training time of a deep neural network is mainly affected by the number of samples, the dimension of the inputs, and the network parameters. The training and feature extraction times of the different methods are listed in Table XI. As for DFSL, the model is trained on the same data for all HSIs; thus, its training time is the same for the three HSIs. The number of labeled testing samples differs across the three HSIs, which leads to different feature extraction times. For the proposed method, on the one hand, all unlabeled samples are used to train the deep residual network; on the other hand, a 51-layer deep residual network is used to extract features. Note that the 51-layer residual network is larger than the existing networks in the field of HSI classification. Therefore, compared with similar algorithms, our method is more time-consuming. Training takes about 3 min per epoch on the Indiana Pines data set; with the number of epochs set to 50, the training procedure takes approximately 150 min on this data set. The time-consuming training procedure is a disadvantage of the method proposed in this article; however, the increase in time is acceptable, and more importantly, the proposed method greatly improves the classification accuracy in the case of small samples.
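The per-epoch figures above (about 3 min per epoch times 50 epochs, roughly 150 min) can be recorded with a simple wall-clock wrapper. In this sketch, `train_one_epoch` is a hypothetical callable standing in for one pass of the contrastive training over the unlabeled samples; it is not a function from the paper.

```python
import time

def time_training(train_one_epoch, n_epochs=50):
    """Run training for n_epochs, timing each epoch with a monotonic clock.

    Returns (total_seconds, per_epoch_seconds), the kind of wall-clock
    measurement reported in Table XI.
    """
    per_epoch = []
    for _ in range(n_epochs):
        start = time.perf_counter()
        train_one_epoch()
        per_epoch.append(time.perf_counter() - start)
    return sum(per_epoch), per_epoch
```

Using `time.perf_counter` rather than `time.time` avoids distortions from system clock adjustments during long training runs.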

IV. CONCLUSION
Recently, deep learning-based methods have been widely explored in HSI classification. However, training a deep-learning classifier notoriously requires hundreds or thousands of labeled samples. Therefore, training models to learn useful representations of HSIs in an unsupervised manner is the Holy Grail for researchers. In this article, we proposed a deep multiview learning method for HSI classification. By training the network to learn view-invariant features, the proposed method greatly improves classification accuracy, especially in the case of small samples. Moreover, we are the first to explore a 51-layer deep residual network in the HSI field, and the experiments demonstrate the necessity of using larger models. Although the proposed method achieves excellent classification performance, the improvement in classification accuracy comes at the cost of training time; the time-consuming training procedure is a disadvantage of the method. To keep the process simple, we constructed only two views; in the future, we will construct more views to further improve classification performance. Finally, the proposed method is easy to combine with existing supervised classifiers. We tested only the SVM and RF classifiers and will evaluate more classifiers in the future.
Xuchu Yu received the Ph.D. degree from the Institute of Surveying and Mapping, Zhengzhou, China, in 1997.
He is working at Information Engineering University, Zhengzhou, as a Professor and Doctoral Supervisor. His research interests include photogrammetry, remote sensing, and pattern recognition.
Ruirui Wang received the B.S. degree in surveying and mapping engineering from Xuchang University, Xuchang, China, in 2014, and the B.S. degree in photogrammetry and remote sensing from Information Engineering University, Zhengzhou, China, in 2017.
She is working at the Institute of Surveying Mapping and Geo Information of Henan, Zhengzhou, as an Assistant Engineer. Her research interests include machine learning and feature extraction.
Kuiliang Gao received the B.S. degree in remote sensing science and technology from Information Engineering University, Zhengzhou, China, in 2019, where he is pursuing the M.S. degree.
His research interests include hyperspectral image processing, pattern recognition, and deep learning.
Wenyue Guo received the bachelor's and master's degrees in cartography and geographic information engineering and the Ph.D. degree in surveying and mapping from Information Engineering University, Zhengzhou, China, in 2012, 2015, and 2018, respectively.
She is working at PLA Strategic Support Force Information Engineering University, Zhengzhou, as a Lecturer. Her research interests include geographic information science and graph representation.