Bag-of-Features-Driven Spectral-Spatial Siamese Neural Network for Hyperspectral Image Classification

Deep learning (DL) exhibits commendable performance in hyperspectral image (HSI) classification because of its powerful feature expression ability. Siamese neural networks further improve the performance of DL models by learning within-class similarities and between-class differences from sample pairs. However, siamese neural networks still have some limitations. On the one hand, they usually need a large number of negative sample pairs during training, leading to computational overhead. On the other hand, current models may lack interpretability because of their complex network structures. To overcome these limitations, we propose a spectral-spatial siamese neural network with bag-of-features (S3BoF) for HSI classification. First, we use a siamese neural network with 3-D and 2-D convolutions to extract spectral-spatial features. Second, we introduce a stop-gradient operation and a prediction head so that the siamese neural network works without negative sample pairs, thus reducing the computational burden. Third, a bag-of-features (BoF) learning module is introduced to enhance model interpretability and feature representation. Finally, a symmetric loss and a cross-entropy loss are used for contrastive learning and classification, respectively. Experimental results on four common hyperspectral datasets indicate that S3BoF outperforms traditional and state-of-the-art deep learning HSI classification methods in terms of classification accuracy and generalization performance, with OA improvements of about 1.40%–30.01%, 0.27%–8.65%, 0.37%–6.27%, and 0.22%–6.64% on the Indian Pines, University of Pavia, Salinas, and Yellow River Delta datasets, respectively, under 5% labeled samples per class.


I. INTRODUCTION
HYPERSPECTRAL image (HSI) consists of dozens or even hundreds of finely subdivided, contiguous spectral bands with extremely high spectral resolution. HSI has the ability to finely describe the characteristics of ground targets and distinguish subtle differences in spectral features. Therefore, HSI has obvious advantages in target detection and classification. HSI classification aims to assign a unique category label to each pixel of the image [1], and it has been widely applied in fields such as military reconnaissance [2], precision agriculture [3], environmental monitoring [4], and mineral exploration [5].
Among various DL architectures, the convolutional neural network (CNN) [24] has been widely applied in HSI classification and has become the leading architecture due to its powerful image-processing capability. According to the dimension of convolution, CNNs can be divided into 1-D CNNs, 2-D CNNs, 3-D CNNs, and hybrid variants. A 1-D CNN convolves the HSI along the spectral dimension and obtains vectorized features; however, it ignores spatial information by flattening the spatial image into a vector. To overcome this limitation, the 2-D CNN was proposed to convolve the HSI along the spatial dimension, which can aggregate spatial context information [25]. Nevertheless, since a 1-D or 2-D CNN focuses only on spectral or spatial features while ignoring latent spectral-spatial features, the 3-D CNN is a very promising approach that convolves the HSI in the spectral and spatial dimensions synchronously by treating it as a data cube. Currently, various 3-D CNN-based DL methods have been sufficiently exploited in HSI classification. A 3-D autoencoder with 3-D CNN was constructed to polymerize spectral-spatial features [26]. To exploit spatial information, a spectral-spatial residual network (SSRN) was proposed in [27], which used several spectral and spatial residual blocks to reduce dimensionality and extract features, while fusing shallow and deep features via skip connections. To make more effective use of deep and shallow features, [28] proposed a spectral-spatial unified network (SSUN), which used both a long short-term memory (LSTM) model and a multiscale convolutional network to extract features.
To further extract joint spectral-spatial information, the authors in [29] combined 2-D and 3-D CNNs to propose a hybrid network (HybridSN), which further improved classification performance.
Nevertheless, the 3-D CNN has a complex structure and many parameters; training such a deep network requires a large quantity of labeled samples, but labeled samples in real classification scenarios are very limited. To this end, siamese neural networks have been proposed for few-shot HSI classification. For example, a dual-path siamese CNN method was proposed for HSI classification by combining extended morphological profiles, CNN, a siamese network, and spectral-spatial feature fusion [30]. To obtain discriminative spectral features, a siamese spectral attention network with channel consistency was proposed for few-shot HSI classification [31]. To overcome the difficulty brought by limited labeled samples, the work in [32] embedded an autoencoder module into a siamese network in a semisupervised paradigm. To upgrade the performance of siamese networks under few training samples, [33] ameliorated the contrastive loss function and designed a lightweight spectral-spatial siamese network. Similarly, the authors in [34] proposed a hyperspectral image classification algorithm based on contrastive learning, which used the information of abundant unlabeled samples. To consider local structure information, an iterative semisupervised CNN framework combining active learning and superpixel segmentation techniques was proposed [35]. To leverage global structure information, a minibatch GCN (miniGCN) was developed, which allows training large-scale GCNs in a minibatch manner and enables inference on out-of-sample data without retraining the network [36].
However, there are still some limitations. On the one hand, these siamese networks usually rely on a large number of negative sample pairs, which prevent the output of the siamese network from collapsing to a constant. On the other hand, how to effectively learn spectral-spatial features and enhance network interpretability also needs to be addressed. To address the collapsing issue, a stop-gradient operation with a prediction head was used in a siamese network in place of negative sample pairs [37]. To enhance model interpretability, bag-of-features (BoF) [38] has been combined with CNNs. In [39], a spectral-spatial residual network learning framework with bag-of-features was proposed for HSI classification. Overall, BoF uses the obtained statistical information to improve the representation ability of the model. By representing samples as a statistical histogram over all codewords, the importance of different features can be quantified, enhancing the interpretability of the model. Meanwhile, BoF has few parameters, and the aggregation of features can be performed with a relatively simple structure without increasing model complexity.
In this article, we propose a BoF-driven spectral-spatial siamese convolutional neural network (S3BoF) for HSI classification. First, we use principal component analysis (PCA) to obtain a channel-compressed image with less redundant information. Second, patches around a target pixel are sampled from the compressed image and then fed into several 2-D and 3-D convolution modules with batch normalization and activation layers in order to aggregate spectral-spatial joint features. Third, a BoF feature encoding module is connected after the convolution layers, so that spectral-spatial features are extracted as compact layer-wise representations more efficiently. The convolution layers and BoF constitute the backbone network. In our framework, there are two processes in the training stage: contrastive training and classification. In the contrastive training process, the backbone network, projection head, and prediction head are used. Two 1-D vectors are gathered from the two branches of the siamese network, and we minimize the loss between the two feature vectors. Meanwhile, we stop the gradient backpropagation on the branch without the prediction head, which avoids learning collapse due to the lack of dissimilar information between classes. In the classification process, image cubes go through the backbone and fully connected layers to perform classification. Finally, we adopt a symmetric loss in the contrastive training process and a cross-entropy loss in the classification process.
The main contributions of this work can be summarized as follows.
1) We introduce a simple siamese network framework, which can train our model without negative sample pairs, avoiding the computational burden and time cost brought by a large number of negative sample pairs while maintaining good classification results.
2) We propose a siamese network embedding BoF, which can encode the features obtained by convolution in the siamese network and obtain statistical features to increase distribution-shift invariance and enhance the feature learning and expression ability of the model.

II. PROPOSED METHODOLOGY
An overview of the proposed framework is described in Fig. 1. The first part of the framework is a backbone network, where the sample is fed into the convolution squeeze module, which includes a 3-D or 2-D convolution layer, a batch normalization layer, and a ReLU activation layer. Through this step, spectral-spatial joint features are extracted. A BoF module is then embedded to encode the feature maps into statistical features. The second part of the model is the projection head. The third part of the model is the prediction head, which is added to receive the gradient information. The projection head and prediction head are only used in the contrastive training process. In the classification process, a fully connected layer is directly added after the backbone network to perform classification. Let the HSI dataset be denoted as X ∈ R^{M×N×D}, where M and N are the spatial dimensions and D is the number of spectral bands, and let the labels Y contain C classes. PCA is applied to the original HSI data X to reduce the spectral redundancy. The resulting HSI data cube is denoted by X_r ∈ R^{M×N×B}, where B is the reduced spectral dimension. We take the target pixel as the center point, take the square neighborhood of a certain size around it as a patch, and use the patch as the input of the proposed framework. The output of our framework is a 1-D vector representing the predicted label of the target pixel.
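As a concrete illustration, the patch sampling around a target pixel can be sketched in plain Python as below. The clamp-at-border padding and the function name are assumptions for illustration only; the paper does not specify its border handling.

```python
def extract_patch(cube, row, col, window):
    # cube: H x W x B array (nested lists); window: odd patch size.
    # Border pixels are handled by clamping indices (one common choice;
    # the paper does not state its padding scheme).
    H, W = len(cube), len(cube[0])
    half = window // 2
    patch = []
    for dr in range(-half, half + 1):
        r = min(max(row + dr, 0), H - 1)
        prow = []
        for dc in range(-half, half + 1):
            c = min(max(col + dc, 0), W - 1)
            prow.append(cube[r][c])
        patch.append(prow)
    return patch  # window x window x B

# usage: 4x4 image with 2 bands, 3x3 patch centered on pixel (0, 0)
cube = [[[float(r), float(c)] for c in range(4)] for r in range(4)]
patch = extract_patch(cube, 0, 0, 3)
```

The center element of the patch is always the target pixel itself; the surrounding elements supply the spatial context the convolutions aggregate.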

A. Siamese Network Backbone
We use 2-D and 3-D convolutions to construct the siamese network backbone, because this effectively aggregates spatial and spectral information while avoiding the excessive number of parameters caused by using 3-D convolution alone. The structures of the 2-D and 3-D convolutions are given as [40]

$$f_{ij}^{xy} = \phi\left(b_{ij} + \sum_{m}\sum_{p=0}^{h-1}\sum_{q=0}^{w-1} w_{ijm}^{pq}\, f_{(i-1)m}^{(x+p)(y+q)}\right) \quad (1)$$

$$f_{ij}^{xyz} = \phi\left(b_{ij} + \sum_{m}\sum_{p=0}^{h-1}\sum_{q=0}^{w-1}\sum_{r=0}^{d-1} w_{ijm}^{pqr}\, f_{(i-1)m}^{(x+p)(y+q)(z+r)}\right) \quad (2)$$

where the position of a pixel in the 2-D data matrix or 3-D data cube is represented as (x, y, z), and f_{ij}^{xyz} represents the output at that position in the jth feature map of the ith layer. The parameters h, w, and d respectively represent the height, width, and depth of the convolution kernel; p, q, and r are the indexes within the kernel, and m is the index over the feature maps of the previous layer. The activation function is represented as φ. The two other parameters w and b represent the kernel weights and the bias, respectively, which are initialized and updated during the training process.
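To make the 2-D case concrete, the following minimal sketch evaluates the 2-D convolution formula above for a single input and a single output feature map (so the sum over m collapses to one term), with a ReLU activation and zero bias. It is an illustration, not the paper's implementation.

```python
def relu(v):
    return max(0.0, v)

def conv2d_single(fmap, kernel, bias):
    # fmap: H x W input feature map; kernel: h x w weights.
    # Valid (no-padding) convolution followed by ReLU.
    h, w = len(kernel), len(kernel[0])
    H, W = len(fmap), len(fmap[0])
    out = []
    for x in range(H - h + 1):
        row = []
        for y in range(W - w + 1):
            s = bias
            for p in range(h):          # kernel height index p
                for q in range(w):      # kernel width index q
                    s += kernel[p][q] * fmap[x + p][y + q]
            row.append(relu(s))
        out.append(row)
    return out

fmap = [[1.0, 2.0, 3.0],
        [4.0, 5.0, 6.0],
        [7.0, 8.0, 9.0]]
kernel = [[1.0, 0.0],
          [0.0, 1.0]]
out = conv2d_single(fmap, kernel, 0.0)  # 2x2 output map
```

The 3-D case adds one more loop over the kernel depth index r and slides along the spectral dimension as well.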

B. Contrastive Training Without Negative Samples
Under normal circumstances, if negative sample pairs are not used in a siamese network, all data converge to the same constant solution after feature mapping. Negative examples play the role of pushing the representation vectors of different samples away from each other, so that the distribution of representation vectors is uniform, which helps avoid learning collapse. In [37], the authors explored a stop-gradient operation to prevent model collapse without relying on negative examples. In our framework, the collapse problem is further prevented by the alternation of contrastive training and classification learning.
In the contrastive training process, (x_i, x_j) is a positive sample pair, i.e., a pairwise combination of two samples of the same class from the training dataset. When training starts, one sample of the pair is fed into the backbone network, the projection head, and the prediction head in turn. We mark the output vector as p, where p = g(h(f(x))). The other sample of the pair is fed into the backbone network and the projection head h only. We mark this output vector as z, where z = h(f(x)).
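A minimal numeric sketch of one branch pair is given below: p = g(h(f(x_i))) passes through the prediction head, while z = h(f(x_j)) does not, and the negative cosine similarity used by the loss in Section II-D is evaluated between them. The toy linear maps f, h, and g are hypothetical stand-ins for the learned modules; in an autodiff framework, z would additionally be detached to realize the stop-gradient.

```python
import math

def neg_cosine(p, z):
    # D(p, z): negative cosine similarity. In a real framework z is
    # detached (stop-gradient) so no gradient flows through this branch.
    dot = sum(a * b for a, b in zip(p, z))
    return -dot / (math.hypot(*p) * math.hypot(*z))

# Toy stand-ins for backbone f, projection head h, prediction head g
# (hypothetical linear maps; the real modules are learned networks).
f = lambda x: [2.0 * v for v in x]
h = lambda x: [v + 1.0 for v in x]
g = lambda x: [0.5 * v for v in x]

x_i, x_j = [1.0, 0.0], [1.0, 0.0]   # a positive pair (same class)
p = g(h(f(x_i)))                    # branch with prediction head
z = h(f(x_j))                       # branch with stop-gradient
loss = neg_cosine(p, z)             # one term of the symmetric loss
```

For an identical pair under these toy maps, p and z are parallel, so the loss attains its minimum of −1.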

C. Bag-of-Features Learning Module
The BoF module is shown in Fig. 2. The first two layers respectively represent the feature map and the feature vectors obtained by convolution. The third layer is a differentiable normalized similarity function, i.e., the radial basis function (RBF), which is used to measure the similarity between the extracted feature vectors and the codebook. Then, a histogram is obtained in the last layer by accumulating the responses of the RBF neurons to each feature vector.
After extracting the spectral-spatial features by 3-D and 2-D convolutions, we let the model learn the BoF codebook. Each feature vector x_{ij} ∈ R^B (i = 1, . . ., N_s, j = 1, . . ., N_fe) is compared with the K codewords (cluster centers) v_k, and the normalized similarity between the feature vector x_{ij} and each centroid v_k is calculated using the kth RBF neuron as

$$[\phi(\mathbf{x}_{ij})]_k = \frac{\exp\left(-\|\mathbf{x}_{ij} - \mathbf{v}_k\|_2 / \sigma_k\right)}{\sum_{m=1}^{K} \exp\left(-\|\mathbf{x}_{ij} - \mathbf{v}_m\|_2 / \sigma_m\right)} \quad (3)$$

where σ_k represents the hyperparameter of the Gaussian function of each RBF neuron. The normalization used in (3) helps the model withstand mild distribution shifts and thus improves the scale invariance of the model [41]. The extracted feature dimension is determined by the number of RBF neurons used in the BoF layer. The encoding process and training are unsupervised because no label information is required. Finally, the histogram is built by accumulating the responses of the RBF neurons over all feature vectors:

$$\mathbf{F}_{his} = \frac{1}{N_{fe}} \sum_{j=1}^{N_{fe}} \phi(\mathbf{x}_{ij}) \quad (4)$$

The histogram F_his has unit L1 norm and reflects the distribution of the RBF neurons' responses. The BoF model calculates a codeword statistical histogram for each patch to describe the importance of different codewords, which helps identify the features most useful for classification and thus improves the interpretability of the model.
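The normalized RBF encoding and histogram accumulation described above can be sketched in plain Python as follows; the toy feature vectors, codebook, and σ values are made up for illustration (in S3BoF the codebook is learned end-to-end rather than fixed).

```python
import math

def bof_histogram(features, codebook, sigmas):
    # features: list of B-dim feature vectors extracted from one patch.
    # codebook: K centroids v_k; sigmas: per-neuron scale sigma_k.
    K = len(codebook)
    hist = [0.0] * K
    for x in features:
        # normalized RBF responses: soft assignment of x to the codewords
        resp = [math.exp(-math.dist(x, codebook[k]) / sigmas[k])
                for k in range(K)]
        total = sum(resp)
        for k in range(K):
            hist[k] += resp[k] / total
    # average over feature vectors -> histogram with unit L1 norm
    n = len(features)
    return [v / n for v in hist]

features = [[0.0, 0.0], [1.0, 1.0], [0.1, 0.0]]   # toy feature vectors
codebook = [[0.0, 0.0], [1.0, 1.0]]               # toy codewords
hist = bof_histogram(features, codebook, [1.0, 1.0])
```

Because the per-vector responses are normalized before accumulation, the resulting histogram always sums to one, regardless of feature scale.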

D. Symmetric Loss and Cross Entropy Loss
We project the feature vector obtained by BoF for contrastive learning. The vectors are fed into the projection head and prediction head. Following the setting of [42], we set the dimension of the output vector to 256, i.e., z ∈ R^256, where z = h(f(x)). Accordingly, the output of the prediction head is p ∈ R^256, where p = g(h(f(x))). The stop-gradient operation is also indicated in the formula. Because the vectors p and z have the same length, we can minimize their negative cosine similarity as

$$\mathcal{D}\left(\mathbf{p}_1, \text{stop-g}(\mathbf{z}_2)\right) = -\frac{\mathbf{p}_1}{\|\mathbf{p}_1\|_2} \cdot \frac{\text{stop-g}(\mathbf{z}_2)}{\|\text{stop-g}(\mathbf{z}_2)\|_2} \quad (5)$$

where || · ||_2 is the L2-norm and stop-g represents the stop-gradient operation. The symmetric loss function is defined as

$$\mathcal{L}_{sym} = \frac{1}{2}\mathcal{D}\left(\mathbf{p}_1, \text{stop-g}(\mathbf{z}_2)\right) + \frac{1}{2}\mathcal{D}\left(\mathbf{p}_2, \text{stop-g}(\mathbf{z}_1)\right) \quad (6)$$

Because of the stop-gradient, the encoder on x_2 receives no gradient from z_2 in the first term, and the same holds for the second term. In the classification process, the samples of the training dataset are fed into the backbone network, and linear layers are used for classification; notably, they do not go through the projection head or prediction head. The cross-entropy loss is used to train the network:

$$\mathcal{L}_{ce} = -\sum_{i} y^{(i)} \log \hat{y}^{(i)} \quad (7)$$

where y^{(i)} and ŷ^{(i)} are the true and predicted labels, respectively.

3) Salinas: The third dataset was collected by the AVIRIS sensor over Salinas Valley, California. The height and width of the image are 512 and 217, respectively, and the spatial resolution is 3.7 m. The false-color composite image of Salinas and the corresponding ground truth image are shown in Fig. 5.

4) Yellow River Delta:
The fourth dataset was collected by the ZY-02D Advanced Hyperspectral Instrument (AHSI) from the Yellow River Delta wetland, China. The height and width of the image are 1147 and 1600, respectively. Except for 47 bands that were damaged by noise, the remaining 119 bands were preserved, ranging from 0.395 to 2.501 μm. There are 23 classes in this dataset, with 30 m spatial resolution. The false-color composite image of Yellow River Delta and the corresponding ground truth image are shown in Fig. 6.

1) Hyperparameters:
We use exponential decay for the classification learning rate, and the initial learning rate will be determined in the following experiment. The number of training epochs is set to 40. The batch size is set to 32 for the Indian Pines and Yellow River Delta datasets and to 128 for the University of Pavia and Salinas datasets. The reduced dimension is set to 30 for Indian Pines and to 15 for the University of Pavia, Salinas, and Yellow River Delta datasets.
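The exponentially decayed classification learning rate can be sketched as follows. The decay factor (0.9 here) is a hypothetical value for illustration, since the section only discusses the initial rate; the initial value of 5e-4 is the one selected later in the parameter analysis.

```python
def exp_decay_lr(initial_lr, gamma, epoch):
    # lr_t = lr_0 * gamma^t; gamma is a hypothetical decay factor,
    # not specified in the paper.
    return initial_lr * (gamma ** epoch)

# one learning rate per training epoch (40 epochs)
lrs = [exp_decay_lr(5e-4, 0.9, e) for e in range(40)]
```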
2) Training Samples: We randomly select 3% of the samples per class to construct the training set. Since our framework has two different processes, contrastive training and classification training, the required inputs are described separately. For the classification training process, all samples in the training set are used as inputs to the network for supervised training. For the contrastive training process, we need pairs of samples as input to the siamese network. We construct positive sample pairs from samples of the same class in the training set, while no negative sample pairs are required. In the contrastive training, 100 random sample pairs are used per epoch to improve computational efficiency while ensuring sufficient coverage of the sample space. More details of the training and test samples are listed in Tables I-IV.

3) Performance Comparison: We select several representative methods for comparison, including the support vector machine with a radial basis function (SVM-RBF) [43], the 2-D convolutional neural network (2-D CNN) [44], the 3-D convolutional neural network (3-D CNN) [45], the hybrid spectral convolutional neural network (HybridSN) [29], the spectral-spatial residual network (SSRN) [27], and the spectral-spatial unified network (SSUN) [28]. These methods include one machine learning method, three common DL methods, and two advanced DL methods. We set their hyperparameters according to the recommended values.
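The positive-pair construction described above (same-class pairs only, a fixed number of random pairs per epoch) can be sketched as follows; the function and variable names are hypothetical.

```python
import random
from itertools import combinations

def sample_positive_pairs(labels, n_pairs, seed=0):
    # labels: class label of each training sample, by index.
    # Returns n_pairs random same-class index pairs; no negative pairs.
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    # all same-class pairs; classes with a single sample contribute none
    candidates = [p for idxs in by_class.values() if len(idxs) > 1
                  for p in combinations(idxs, 2)]
    rng = random.Random(seed)
    return [rng.choice(candidates) for _ in range(n_pairs)]

labels = [0, 0, 1, 1, 1, 2]          # toy training-set labels
pairs = sample_positive_pairs(labels, 100)  # 100 pairs per epoch
```

Drawing pairs with replacement keeps the per-epoch cost fixed at 100 pairs regardless of training-set size.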

4) Running Platforms and Metrics:
All experiments are conducted with an Intel(R) Xeon(R) CPU, a GeForce RTX 3060 (12 GB) GPU, and 32 GB RAM, using the PyTorch DL framework. All experimental results are averaged over ten independent runs. We adopt overall accuracy (OA), average accuracy (AA), and the Kappa coefficient (κ) to evaluate classification performance. Moreover, we also conduct a complexity analysis, where training time (s), test time (s), and parameter size are used as the metrics. The highest results in Tables V-VII are shown in bold typeface.
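For reference, OA, AA, and κ can all be computed from a single confusion matrix; the following sketch (with a made-up 2-class matrix) shows the standard formulas.

```python
def metrics(conf):
    # conf[i][j]: number of samples of true class i predicted as class j.
    n = sum(map(sum, conf))
    C = len(conf)
    # OA: fraction of all samples classified correctly
    oa = sum(conf[i][i] for i in range(C)) / n
    # AA: mean of per-class (producer's) accuracies
    aa = sum(conf[i][i] / sum(conf[i]) for i in range(C)) / C
    # Kappa: agreement corrected for chance, using row/column marginals
    pe = sum(sum(conf[i]) * sum(r[i] for r in conf)
             for i in range(C)) / (n * n)
    kappa = (oa - pe) / (1 - pe)
    return oa, aa, kappa

conf = [[40, 10],
        [ 5, 45]]   # toy 2-class confusion matrix
oa, aa, kappa = metrics(conf)
```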

C. Parameter Sensitive Analysis
1) Learning Rate: In this experiment, we aim to optimize the two learning rates for the classification and contrastive learning processes. We set the values of both the classification learning rate and the contrastive learning rate in the range from 1e-5 to 1e-2. As shown in Fig. 7, 5e-4 is a demarcation point for OA. When the classification learning rate is in the range of 1e-5 to 5e-4, OA shows an increasing trend. When it equals 5e-4, OA reaches its highest value. When the classification learning rate is greater than 5e-4, OA decreases as the learning rate increases. This trend is particularly evident on the Indian Pines, University of Pavia, and Yellow River Delta datasets. Based on the above analysis, we set the classification learning rate to 5e-4. Regarding the contrastive learning rate, when it is less than 1e-4, OA stably remains at its highest value, but when the learning rate is greater than 1e-3, OA decreases rapidly as the learning rate increases. Therefore, we choose 5e-5 as the optimal value for the contrastive learning rate.

2) Window Size: In this experiment, we focus on the impact of window size on classification performance. To this end, we set the window size to 9, 11, 13, 15, 17, 19, and 21. The evolution of OA as a function of window size is shown in Fig. 8. On the four datasets, OA first rises and then stabilizes as the window size increases. For Indian Pines, the classification accuracy is low when the window size is less than 15, and OA keeps steady when the window size is 15 or greater. For the University of Pavia, Salinas, and Yellow River Delta datasets, our model achieves high classification accuracy and remains stable once the window size reaches 11. To ensure higher OAs on all four datasets, we set the window size to 15.
3) Codebook Size: In this experiment, we analyze the influence of codebook size on classification accuracy. We set the codebook size to 4, 10, 20, 50, 100, 200, and 400 for the four datasets. The evolution of OA as a function of codebook size is shown in Fig. 9. We can see that OA increases first and then remains unchanged. For the University of Pavia and Salinas datasets, OA remains stable when the codebook size is 10 or greater. For the Indian Pines and Yellow River Delta datasets, OA is more sensitive to the codebook size, and it basically stabilizes when the codebook size reaches 100. In order to achieve high and stable accuracy, the codebook size is set to 100.
We also observe that OA is not sensitive to the codebook size once it is set to a large value. This phenomenon is particularly evident on the Indian Pines and Yellow River Delta datasets. We believe this may be due to the more obvious category imbalance in these two datasets: when the codebook size is small, the model performs worse on categories with few samples but better overall, resulting in higher OAs and lower AAs.

D. Ablation Study

1) Effectiveness of Discarding Negative Samples: In this experiment, we compare the variants "Pos+Neg" (using both positive and negative samples), "Pos" (using only positive samples), "Pos+Neg+BoF", and "Pos+BoF" (the same two settings with BoF added) to verify the effectiveness of discarding negative samples. As reported in Table V, OA, AA, and Kappa all improve on the four datasets when comparing "Pos" with "Pos+Neg." For the Indian Pines dataset, OA improves by 0.33%. For University of Pavia, OA improves by 0.08%. For the Salinas dataset, OA improves by 0.05%. For the Yellow River Delta dataset, OA improves by 0.01%. Comparing "Pos+BoF" with "Pos+Neg+BoF" shows similar improvements: OA improves by 0.43%, 0.12%, 0.1%, and 0.32%, respectively, on the four datasets.
2) Effectiveness of BoF: As shown in Table V, OA, AA, and Kappa all improve on the four datasets when comparing "Pos+Neg+BoF" with "Pos+Neg." The improvement is most pronounced on the Indian Pines dataset, where OA improves by 7.85%, and good classification performance is maintained on the other three datasets: OA improves by 0.55% for University of Pavia, 0.26% for Salinas, and 0.16% for Yellow River Delta. Similar observations can be made by comparing "Pos" and "Pos+BoF." The experimental results show that BoF can significantly improve the average precision and effectively alleviate the misclassification problem caused by unbalanced samples.

E. Generalization Performance
In this experiment, we evaluate the generalization performance of S3BoF under different training sample sizes. For the four datasets, we randomly select 1%, 2%, 3%, 4%, and 5% of the samples per class for training. As shown in Fig. 10, S3BoF performs better than the other methods in most cases. For the Indian Pines dataset, S3BoF achieves excellent generalization performance with 1%-5% training samples per class. For example, S3BoF has an OA of 83.73±1.92% when the training size is 1% per class, a 2.87%-27.91% improvement over the other methods. For the University of Pavia dataset, the improvement of S3BoF is particularly obvious when the number of samples is small. For example, S3BoF obtains an OA of 97.51±0.39% when using 1% training samples per class, a 1.27%-8.02% improvement over the other methods. For the Salinas dataset, S3BoF also achieves better classification performance. For the Yellow River Delta dataset, S3BoF also performs well under 3%-5% training samples. For example, S3BoF obtains an OA of 97.36±0.40% with 5% training samples per class, a 0.92%-7.34% improvement over the other methods.

F. Classification Results
The classification results under 5% training samples per class are reported in Tables VI-IX for the four datasets, respectively. We also exhibit the ground truth and classification maps of different methods in Figs. 11-14, respectively.
1) Indian Pines: Table VI reports the classification results on the Indian Pines dataset. It can be observed that, compared with the other methods, S3BoF reaches the highest classification accuracy. The OA reaches 96.82±0.79%, a 1.40%-30.01% improvement over the other methods. The AA reaches 95.28±1.44%, a 0.64%-33.80% improvement. The Kappa reaches 96.38±0.90%, which is 1.60%-34.48% higher than the other methods. As for specific classes, S3BoF has a better classification effect on most classes, obtaining the highest accuracy on ten classes. Among the other methods, HybridSN, SSRN, and SSUN provide competitive performance; we consider that this is because these three networks combine spectral and spatial information. Fig. 11 exhibits the classification maps of S3BoF and the compared methods: SVM and 3-D CNN yield noisy results because these two methods do not aggregate spatial information, whereas S3BoF produces a smoother classification map.
2) University of Pavia: The experimental results on University of Pavia are reported in Table VII. According to the results, the OA of S3BoF reaches 99.73±0.07%, which is 0.27%-8.65% higher than the other methods. The AA reaches 99.49±0.13%, a 0.15%-11.82% improvement over the other methods. The Kappa reaches 99.64±0.09%, which is 0.36%-11.55% higher than the other methods. Out of nine classes in total, S3BoF achieves the highest classification accuracy on five, including asphalt and painted metal sheets. Meanwhile, on the self-blocking bricks class, where the other methods suffer more misclassifications, the accuracy of S3BoF is greatly improved, reaching 99.63±0.36%, which is 1.07%-14.56% higher than the other methods. The classification maps of S3BoF and the compared methods are shown in Fig. 12. From the classification maps, we can see that the methods without spatial information produce noisy results, while HybridSN has blurred edges and loses many details.
3) Salinas: The classification results on Salinas are reported in Table VIII. Compared with the other methods, the OA of S3BoF is 99.81±0.06%, an improvement of 0.37%-6.27%. The AA is 99.82±0.06%, an improvement of 0.10%-3.27%. The Kappa is 99.79±0.07%, an improvement of 0.41%-6.99%. S3BoF achieves the highest classification accuracy on half of the classes. Among the compared methods, both 2-D CNN and HybridSN also perform well in terms of OA, AA, and Kappa; we consider that this is because the Salinas dataset benefits more from spatial information. The classification maps of S3BoF and the compared methods are shown in Fig. 13. For HybridSN, the boundaries of the classification map are very fuzzy, because a large window size introduces too much neighborhood information that does not belong to the same category, leading to misclassification of surrounding pixels.

4) Yellow River Delta:
The experimental results on the Yellow River Delta dataset are reported in Table IX. We can see that the OA of S3BoF is 96.66±0.68%, the AA is 91.14±1.54%, and the Kappa is 95.57±0.91%, improvements of 0.22%-6.64%, 0.31%-13.90%, and 0.29%-7.10%, respectively, compared with the other methods. Owing to the imbalance in the number of labeled samples in this dataset, the compared methods perform poorly in AA, whereas S3BoF is able to greatly improve the average classification accuracy. Fig. 14 shows the classification maps of S3BoF and the compared methods. According to the results, S3BoF obtains smoother classification results on this dataset while preserving more details.

A. Complexity Analysis

It is worth noting that the test time cost is generally low for all compared methods. As for parameter size, S3BoF has more parameters than SSRN, but far fewer than HybridSN and SSUN. HybridSN has the largest parameter size because it has more fully connected layers. In general, S3BoF guarantees better classification performance without significant computational burden.
B. Feature Analysis

1) Feature Visualization: In order to analyze the feature learning power of the proposed S3BoF method, we visualize the output feature maps learned by different variants of S3BoF, including "Pos+Neg," "Pos," "Pos+Neg+BoF," and S3BoF, following the ablation study. As shown in Fig. 15, we observe the following. 1) S3BoF can learn more representative features with stronger discriminant information, which helps guide the model to pay more attention to important features.
2) The variants without BoF (i.e., "Pos+Neg" and "Pos") yield much rougher features, which cannot capture the local details of targets well. 3) The variants without negative samples (i.e., "Pos" and S3BoF) produce competitive results owing to the stop-gradient operation, which enables effective feature representation by maximizing the similarity of positive sample pairs.
2) Feature Separability: To verify feature separability, we further plot the clustering results of different variants of the proposed method. To this end, we use t-distributed stochastic neighbor embedding (t-SNE) [46] to map the output features onto a 2-D plane. As shown in Fig. 16, the features learned by S3BoF have better separability, especially on the Indian Pines and Yellow River Delta datasets, whereas the University of Pavia and Salinas datasets can be distinguished clearly by all compared variants, which may be due to their higher spatial homogeneity. At the same time, it can be verified that BoF also increases feature separability.

V. CONCLUSION
In this article, we propose S3BoF for HSI classification. First, PCA is applied to reduce the spectral redundancy. Second, spectral-spatial joint features are extracted by using 3-D and 2-D convolutions. Then, a BoF module is embedded as a pooling layer to aggregate features and reduce redundant information by modeling the features with statistical information. The interpretability of deep learning is improved by constructing a feature-vector descriptor. Meanwhile, we improve efficiency by discarding negative sample pairs. Experimental results on four hyperspectral datasets verify the superior classification performance of S3BoF compared with traditional and state-of-the-art deep learning HSI classification methods. Although our experiments obtain encouraging results, some aspects are still worth exploring. At present, we select sample pairs randomly, but the quality of the sample pairs further influences the accuracy of contrastive learning. Meanwhile, BoF only makes use of first-order statistics and ignores higher-order statistical features.