DFL-LC: Deep Feature Learning With Label Consistencies for Hyperspectral Image Classification

Deep learning approaches have recently been widely applied to the classification of hyperspectral images (HSIs) and have achieved good performance. Compared with traditional hand-crafted methods, deep learning can extract features from HSI data more effectively. Most deep learning methods extract image features through traditional convolution, which has demonstrated impressive ability in HSI classification. However, traditional convolution can only be performed with fixed size and weight on regular square image regions. Moreover, it considers the spectral features of adjacent pixels but ignores the spectral features of long-range data related to the training sample. Although a graph convolutional network (GCN) can process irregular image regions, the pixels' relationships used for graph construction cannot be well ensured with limited iterations. Hence, the features extracted by the GCN alone have limited performance. To extract more representative and discriminative image features, in this article, the deep feature learning with label consistencies (DFL-LC) method is developed for HSI classification. In the proposed method, a multiscale convolutional neural network is adopted to obtain basic HSI features, and the GCN further captures relationships between pixels and extracts more representative HSI features. To obtain discriminative features, we add two regularization terms to the objective function: the label consistency of single pixels and the label consistency of group pixels. These terms maintain label consistency for both general and long-range data and alleviate the shortage of labeled samples. The experimental results on three representative datasets demonstrate that the DFL-LC method is superior to other methods in both quantitative and qualitative aspects.


I. INTRODUCTION
There are several hundred channels in hyperspectral images (HSIs), which contain high-resolution spectral information of land covers. Each pixel in an HSI records the spectral reflectance over these wavelengths, so it can be regarded as a high-dimensional vector. Such spectral signatures have been widely used in HSI classification of land covers. In the past few decades, HSI classification has developed into a significant part of remote sensing. In general, it is challenging for traditional machine learning to achieve precise classification because of the complex characteristics of HSIs. In addition, HSI recognition must handle the inherently nonlinear relationship between the spectral information and the corresponding class [1]. As a powerful tool for extracting features, deep learning is widely adopted in image processing tasks and can effectively solve nonlinear problems. Therefore, deep learning has also been used for HSI classification and has shown good performance.
In the initial stage of HSI classification research, a number of methods focused on exploiting the spectral characteristics of HSIs. Numerous classification methods were proposed, such as the support vector machine (SVM) [2], multinomial logistic regression [3], and dynamic subspace detection [4]. Although most HSI classification methods based on spectral and spatial information have obtained excellent performance, they are heavily dependent upon hand-crafted features, and such features are limited. On the one hand, hand-crafted features are regarded as shallow features, so they are not robust when the imaging environment changes sharply [5]. On the other hand, most hand-crafted features rely on expert knowledge and are thus limited by human factors. Moreover, designing hand-crafted features usually requires considerable time, limiting the applicability of those methods in different scenarios [1].
In recent years, deep learning has become a development trend in HSI classification and has achieved good performance. Deep learning methods can effectively exploit features from HSI data compared with other traditional hand-crafted methods. The process of deep learning is automatic, which makes it more suitable to deal with various situations. Because different networks can extract different feature types, deep learning is considered to be a significant feature extraction approach in HSI classification. Thus, deep learning networks for HSI classification can be divided into spectral-feature networks (e.g., DBN [6], 1-D convolutional neural network (CNN) [7], 1-D GAN [8], [9], and RNN [10]), spatial-feature networks (e.g., 2-D CNN [11], FCN-8 [12], and spectral-spatial-feature-based classification (SSFC) [13]), and spectral-spatial-feature networks (e.g., SAE [14], 3-D CNN [15], and 3-D GAN [8]).
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Among deep learning approaches, CNNs have become a powerful tool for HSI classification because they can effectively extract spatial and spectral features, and they have achieved impressive classification performance. Chen et al. [11] first applied PCA to reduce the dimension of the HSI and then used a 2-D CNN to extract the spatial features within the pixel neighborhood. Combining PCA and the CNN in this way extracts spatial features while greatly reducing the computational cost. Liang and Li [16] proposed a sparse representation method to improve the feature representation ability and the classification accuracy, in which the deep spatial features extracted by the CNN are encoded into low-dimensional sparse features. In the SSFC framework [13], balanced local discriminant embedding and a CNN are used to extract spectral and spatial features, respectively. However, the traditional CNN can only perform convolutions with fixed size and weight on regular square image regions. It considers the spectral features of adjacent pixels but ignores the spectral features of long-range data related to the sample. For example, pixels at different positions in an HSI often belong to the same class, and these pixels should have similar features. Therefore, the classification performance of CNN-based methods needs to be further improved.
Moreover, with the rapid development of graph theories, graph convolutional networks (GCNs) have been widely used in various applications, such as text classification [17]-[20] and semantic segmentation [21]-[24]. In addition, the GCN has made great progress in image classification [25]-[27]. The GCN can process irregular image regions. The learned hidden layers in the GCN can encode both features of node and local graph structure. Therefore, the GCN can flexibly retain class boundaries while adequately exploiting image features. However, it is not appropriate when the GCN is directly applied in HSI classification. The GCN can assemble and transform features from a defined graph containing the neighbor information of each graph node. In the GCN, the neighborhood structure of the graph adaptively governs the graph convolution operation. Although the GCN can capture relationships based on the predefined graph that contains global information, pixels' relationships for graph construction cannot be well ensured. The main reason lies in that accurate image features cannot be obtained only with the limited iterations of the deep learning framework. Thus, this deep learning framework still cannot ensure the quality of HSI features with the predefined graph.
Inspired by the above discussions, aiming to learn more representative and discriminative image features of HSI, label consistency (LC) is embedded into the deep learning framework in this article. LC includes not only the label consistency of single pixels (LCSP) but also the label consistency of group pixels (LCGP). LCSP means that LC is maintained by calculating the error between the predicted label and the real label of the sample. Moreover, LCGP refers to considering long-range data by introducing a group label matrix to solve problems such as different ground objects in the same spectrum. LCGP can also realize label reuse on the basis of limited labeled data so that the model has better learning ability.
In this article, deep feature learning with label consistencies (DFL-LC) is proposed, which considers both LCSP and LCGP, and its framework is shown in Fig. 1. In this approach, we adopt the multiscale convolutional neural network (MSCNN) to extract basic HSI features. The features obtained from the MSCNN are further fed into the GCN, which considers pixels' relationships by constructing an adjacency matrix. The output layer of the GCN is activated by the ReLU function. In order to enhance the performance of HSI classification, LC should also be applied to the deep learning framework. LC includes not only LCSP but also LCGP. With LC, the cross-entropy loss is used to calculate the difference between the outputs and the real labels to keep LCSP. Moreover, to keep LC for the long-range data and alleviate the shortage of labeled samples, LCGP regularization is added to the objective function. Finally, an iterative optimization algorithm is used to optimize the objective function.
The main contributions of this article are summarized as follows.
1) DFL-LC is developed to extract HSI features and ensure LC, whose structure contains the MSCNN and the GCN. The LC constraint is embedded in the objective function, and end-to-end optimization is implemented.
2) In DFL-LC, we formulate two kinds of constraints to boost the classification accuracy: LCSP and LCGP constraints. LCSP ensures LC between the outputs and the real labels of the sample. LCGP refers to considering the long-range data and alleviating deficiently labeled sample problem.
3) DFL-LC is optimized through an iterative algorithm. The test results on three representative datasets demonstrate that the DFL-LC method is superior to the relevant latest HSI classification methods.

A. Feature Extraction
The HSI contains abundant spatial and spectral information, so it is important to exploit spatial and spectral features efficiently and accurately when classifying HSIs. According to how labeled data are used, classification methods can be divided into supervised, semisupervised, and unsupervised methods.
Supervised methods require a large amount of labeled data. Liu et al. [6] proposed an effective classification model based on active learning and DBN, in which the active learning algorithm is used to repeatedly select high-quality labeled samples for training, and DBN is used to deeply extract spectral features. In [28], a diversified DBN model was proposed, in which the classification performance of the model is significantly improved by normalizing the DBN pretraining and fine-tuning processes.
Semisupervised methods need less labeled data than supervised methods. In [29], a semisupervised deep feature learning method with feature consistency was proposed, where the CNN is used to extract spectral-spatial features, and fully connected layers are used to model feature consistency. Sun et al. [14] proposed a semisupervised method to obtain features by training SAE using a batch training scheme. Then, a mean convergence method is used to generate deep characteristics by further fusing the spectral and local spatial features. Because HSI data can be represented as a 3-D cube, performing 3-D convolution over the spectral and spatial dimensions provides an effective way to extract spectral and spatial features simultaneously. In [30], a subspace learning with the conditional random field (CRF) method was developed to obtain the subspace of the HSI pixels using the semisupervised approach, in which the CRF is embedded in subspace learning to classify HSI.
Different from supervised and semisupervised methods, unsupervised methods do not use labeled data. Some traditional methods (PCA [31] and ICA [32]) can effectively extract spectral features, but these models perform only simple linear processing, which makes it difficult for them to handle complex spectral features in HSIs. Kuo et al. [33] proposed kernel nonparametric weighted feature extraction, which combines linear and nonlinear transformation.
However, the traditional CNN performs convolutions only with fixed size and weight on regular square image regions, so it considers the spectral features of adjacent pixels but ignores the spectral features of long-range data related to the sample. Therefore, the GCN is introduced to realize HSI classification, since it encodes the graph structure to take long-range data into account.

B. Graph Convolution
The GCN has been extensively explored in the problem areas of supervised, semisupervised, and unsupervised networks. Gori et al. first proposed the concept of graph neural network, which can process graph data [34]. Compared with the CNN and the RNN, the advantage of the GCN is that it can process non-Euclidean data with graph structure. The GCN is a multilayer neural network that operates directly on a graph and studies the features of the graph through the eigenvalues and eigenvectors of the Laplace matrix of the graph. The GCN can correctly transform the graph into a new discriminative space by integrating the adjacency relationships and features of the nodes in the graph.
Recently, the GCN has been widely used for text classification. Hamilton et al. proposed an inductive framework named GraphSAGE, which efficiently generates node embeddings for previously unseen data utilizing node features [17]. By sampling and aggregating features from the local neighborhood of a node, it learns a function that generates embeddings instead of training individual embeddings for each node. A fast approximate localized graph convolution was proposed to avoid numerical instabilities and exploding or vanishing gradients [18]. Because the GCN is simplified by the first-order approximation of graph convolution, it can encode both node features and graph structure and leads to more efficient filtering operations. Monti et al. proposed a unified framework that generalizes the CNN to non-Euclidean domains, such as graphs and manifolds, and learns stationary and local features [19].
The GCN is also widely used in image classification. Garcia and Bruna studied few-shot learning through the prism of inference on a partially observed graphical model, which is constructed from a collection of input images whose labels may or may not be observed [35]. Wang et al. proposed a method based on the GCN, which uses semantic embeddings and categorical relationships to classify images [36]. Given a learned knowledge graph, this method takes the semantic embedding of each node (representing a visual category) as input. Some scholars also use the GCN to realize HSI classification. Qin et al. proposed a spectral-spatial GCN to approximate convolution by using adjacent nodes in the graph [37]. Thus, this method takes full advantage of the spatial information of the current pixel in the process of approximate convolution. Wan et al. proposed a multiscale dynamic GCN, whose graph is dynamically updated during graph convolution, and whose input graphs have different neighborhood scales to utilize multiscale information in HSIs [38]. The GCN can capture relationships based on the predefined graph that contains global information, but the pixels' relationships for graph construction cannot be well ensured. Therefore, LC is embedded into the deep learning framework in this article to learn more features of HSIs.

III. PROPOSED METHOD
In this section, a new feature learning method, DFL-LC, is introduced. First, the motivation of this article is presented in Section III-A. Next, the DFL-LC framework is given in Section III-B. Finally, we optimize DFL-LC in Section III-C.

A. Motivation
Traditional convolution only refers to the spectral features of the adjacent pixels but ignores the spectral features of long-range data with the sample. The GCN cannot ensure the pixels' relationships for graph construction because accurate image features cannot be obtained with only the limited iteration of the deep learning framework. Therefore, we combine MSCNN and GCN, and LCSP and LCGP are added to the objective function to keep LC. LCSP denotes that the LC is maintained by calculating the error between the predicted output label and the real label of the sample. Moreover, LCGP refers to considering long-range data by introducing a group label matrix and realizing label reuse on the basis of limited labeled data. Finally, the objective function is optimized by an iterative optimization algorithm.

B. Framework of DFL-LC
1) Multiscale Feature Extraction: In recent years, classification, detection, and recognition problems have been addressed by CNNs, whose design is inspired by the structure of the human visual system. Two aspects distinguish the CNN architecture from other deep learning methods: shared weights and local connections. Shared weights reduce the number of network parameters, and local connections allow the CNN to exploit the local correlation between the neurons of adjacent layers.
The objects in an HSI usually have different geometric appearances, so multiscale features have been proven to be useful in HSI problems [39]. The multiscale structure contains plentiful contextual HSI information [40]. Deep learning can extract abundant local characteristics of image regions from different levels by using the contextual information exposed by different scales. To obtain more detailed features, we embed multiscale information into the CNN. The MSCNN can exploit both shallow features and deep features, which makes it better adapted to HSI classification, and multiscale features can effectively improve the classification results. Using the MSCNN, spectral-spatial features are introduced to describe HSIs. The MSCNN adopts three different convolutional filters to locally convolve patches $X_1$, $X_2$, and $X_3$ of three different sizes. Then, all the features extracted from these three layers are stacked together as the input to the fully connected layer. With the MSCNN process, we obtain the spectral-spatial features $Z_1$ according to (1), where $f$ is the fully connected operation, $\otimes$ represents the traditional convolution operation, $\oplus$ represents stacking the features together along the third dimension, and $W^{(0,i)}$ and $b^{(0,i)}$ are the weight and bias for $X_i$.
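For reference, one form of (1) that agrees with the symbol definitions above is sketched below. It is an assumption reconstructed from those definitions, with $\oplus$ denoting stacking along the third dimension; no activation function is shown because none is named in the text, so the original equation may differ in such details.

```latex
Z_1 = f\Big(
  \big(W^{(0,1)} \otimes X_1 + b^{(0,1)}\big) \oplus
  \big(W^{(0,2)} \otimes X_2 + b^{(0,2)}\big) \oplus
  \big(W^{(0,3)} \otimes X_3 + b^{(0,3)}\big)
\Big)
```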

2) Graph Convolution Process:
There are hundreds of thousands of pixels in the HSI, which makes the computational cost of graph convolution and HSI classification prohibitive. To solve this problem, each sample, rather than each pixel of the HSI, is treated as a node of the graph in the GCN. This can significantly reduce the number of graph nodes and improve the computational efficiency. Different from the CNN, which extracts features by convolution, the GCN studies the features of the graph through the eigenvalues and eigenvectors of the Laplacian matrix of the graph. The GCN can find simple and clear neighbor connections between the nodes of a complex graph and smooth the label information via these connections over the graph until a global steady state is reached.
To perform graph convolution, we first construct an undirected graph, which is defined as $G = (V, E)$, where $V$ and $E$ are the sets of nodes and edges, respectively. $A$ denotes the adjacency matrix of $G$, which represents the connection relationship between nodes in the graph. Here, the adjacency matrix is constructed according to the spatial relationship among patches and is calculated according to (2), where $\gamma$ is empirically set to 0.01 in the experiments, $x_i$ represents a patch, and $N(x_i)$ is the set of neighbors of $x_i$.
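A minimal sketch of how such an adjacency matrix might be built is given here. The Gaussian weighting exp(-γ‖x_i - x_j‖²) between neighboring patches is an assumption (a common choice for GCN-based HSI classification), not a restatement of (2); the function name `build_adjacency` and the explicit `neighbors` argument are hypothetical.

```python
import math

def build_adjacency(patches, neighbors, gamma=0.01):
    """Return a symmetric adjacency matrix as a list of lists.

    patches   -- list of feature vectors, one flattened patch per node
    neighbors -- neighbors[i] is the set of node indices in N(x_i)
    gamma     -- kernel width; 0.01 is the value quoted in the text
    """
    n = len(patches)
    A = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            # Connect i and j if either one lists the other as a neighbor.
            if j in neighbors[i] or i in neighbors[j]:
                sq_dist = sum((a - b) ** 2
                              for a, b in zip(patches[i], patches[j]))
                # Assumed Gaussian weighting: similar patches get weights near 1.
                A[i][j] = math.exp(-gamma * sq_dist)
    return A

# Toy example: node 1 is a neighbor of nodes 0 and 2; nodes 0 and 2 are not linked.
patches = [[0.0, 0.0], [1.0, 0.0], [10.0, 10.0]]
neighbors = [{1}, {0, 2}, set()]
A = build_adjacency(patches, neighbors)
```

Under this assumed kernel, a smaller γ pushes the nonzero weights toward 1, so the choice of γ controls how strongly dissimilar neighbors are down-weighted.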
The normalized Laplacian of the graph is $L = I_N - D^{-1/2} A D^{-1/2} = U \Lambda U^T$, where $D$ denotes the degree matrix of $G$, $U$ is the matrix composed of the eigenvectors of $L$, $\Lambda$ is a diagonal matrix containing the eigenvalues of $L$, and $I_N$ denotes the identity matrix with the proper size. According to the graph convolution theorem, the graph convolution can be written as

$$g \star x = U\big((U^T g) \odot (U^T x)\big). \tag{3}$$

Then, we can consider that $g_\theta(\Lambda) = U^T g$ is a function of the eigenvalues of $L$. We can define spectral convolutions on graphs as the multiplication of a signal $x$ with a filter in the Fourier domain

$$g_\theta \star x = U g_\theta U^T x \tag{4}$$

where $g_\theta = \mathrm{diag}(\theta) = g_\theta(\Lambda)$ is the filter. However, the eigendecomposition of $L$ is prohibitively expensive for large graphs. To address this problem, Hammond et al. approximated $g_\theta(\Lambda)$ up to the $K$th order by a truncated expansion in Chebyshev polynomials $T_k(x)$ [41]

$$g_\theta(\Lambda) \approx \sum_{k=0}^{K} \theta_k T_k(\tilde{\Lambda}) \tag{5}$$

with a rescaled $\tilde{\Lambda} = \frac{2}{\lambda_{\max}} \Lambda - I_N$. Here, $\lambda_{\max}$ is the maximum eigenvalue of $L$ and $\theta$ is a vector of Chebyshev coefficients. Therefore, the convolution can be written as

$$g_\theta \star x \approx \sum_{k=0}^{K} \theta_k T_k(\tilde{L}) x \tag{6}$$

where $\tilde{L} = \frac{2}{\lambda_{\max}} L - I_N$ is the scaled Laplacian matrix. The identity $(U \Lambda U^T)^k = U \Lambda^k U^T$ can be used to verify (6). Since this formula is a $K$th-order polynomial in the Laplacian, the filtering at a node depends only on the nodes that are at most $K$ steps away from it.
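The truncated Chebyshev filtering described above can be sketched without any eigendecomposition by using the standard recurrence T_0(y) = 1, T_1(y) = y, T_k(y) = 2y T_{k-1}(y) - T_{k-2}(y). The helper names `matvec` and `chebyshev_filter` are hypothetical, and the two-node graph is only a toy example.

```python
def matvec(M, x):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(m * v for m, v in zip(row, x)) for row in M]

def chebyshev_filter(L_scaled, x, theta):
    """Compute sum_k theta[k] * T_k(L_scaled) x with the Chebyshev recurrence."""
    t_prev = list(x)                              # T_0(L~) x = x
    out = [theta[0] * v for v in t_prev]
    if len(theta) == 1:
        return out
    t_curr = matvec(L_scaled, x)                  # T_1(L~) x = L~ x
    out = [o + theta[1] * v for o, v in zip(out, t_curr)]
    for k in range(2, len(theta)):
        # T_k(L~) x = 2 L~ T_{k-1}(L~) x - T_{k-2}(L~) x
        t_next = [2.0 * a - b
                  for a, b in zip(matvec(L_scaled, t_curr), t_prev)]
        out = [o + theta[k] * v for o, v in zip(out, t_next)]
        t_prev, t_curr = t_curr, t_next
    return out

# Two connected nodes: L = I - A = [[1, -1], [-1, 1]], lambda_max = 2,
# so the scaled Laplacian is L~ = L - I.
L_scaled = [[0.0, -1.0], [-1.0, 0.0]]
x = [1.0, 2.0]
y_first_order = chebyshev_filter(L_scaled, x, [1.0, 0.5])          # K = 1
y_second_order = chebyshev_filter(L_scaled, x, [1.0, 0.5, 0.25])   # K = 2
```

Each additional order costs one more multiplication by the (typically sparse) scaled Laplacian, which is why the polynomial form avoids the expensive eigendecomposition.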
Stacking graph convolutional layers of the form of (6), each followed by a pointwise nonlinearity, builds a graph convolutional network model. Considering only the first-order neighborhood ($K = 1$), (6) becomes linear in $L$, i.e., a linear function on the Laplacian spectrum of the graph. We further approximate $\lambda_{\max} \approx 2$ in this linear formulation of a GCN. Therefore, (6) can be simplified to

$$g_\theta \star x \approx \theta_0 x + \theta_1 (L - I_N) x = \theta_0 x - \theta_1 D^{-1/2} A D^{-1/2} x \tag{7}$$

where $\theta_0$ and $\theta_1$ are two free parameters. To avoid overfitting caused by many parameters, (7) is converted to

$$g_\theta \star x \approx \theta \big(I_N + D^{-1/2} A D^{-1/2}\big) x \tag{8}$$

with a single parameter $\theta = \theta_0 = -\theta_1$. Since the eigenvalues of $I_N + D^{-1/2} A D^{-1/2}$ are in the range [0, 2], repeatedly applying this operator can lead to numerical instabilities and exploding or vanishing gradients. To alleviate this problem, Kipf and Welling introduced the renormalization trick [18], which replaces $I_N + D^{-1/2} A D^{-1/2}$ with $\tilde{D}^{-1/2} \tilde{A} \tilde{D}^{-1/2}$, where $\tilde{A} = A + I_N$ and $\tilde{D}_{ii} = \sum_j \tilde{A}_{ij}$. Since the spectral-spatial features $Z_1$ have been obtained, we build, as an example, a GCN with two graph convolution layers for node classification on a graph. The forward model then takes the form given in (9), where $W^{(0)} \in \mathbb{R}^{C \times H}$ and $W^{(1)} \in \mathbb{R}^{H \times M}$ are the weight matrices of the input-to-hidden layer and the hidden-to-output layer, respectively, which are updated via the backpropagation algorithm.
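The renormalization trick of Kipf and Welling [18] and a two-layer propagation can be sketched as follows. This is an illustrative pure-Python version, not the implementation used in the experiments: the function names are hypothetical, the hidden ReLU follows the common two-layer GCN of [18], and the output activation is deliberately left to the caller.

```python
import math

def renormalized_adjacency(A):
    """Return D~^{-1/2} (A + I) D~^{-1/2} for a square adjacency matrix A."""
    n = len(A)
    A_tilde = [[A[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
               for i in range(n)]
    deg = [sum(row) for row in A_tilde]           # diagonal of D~
    return [[A_tilde[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]

def matmul(P, Q):
    """Plain matrix product of two lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)]
            for row in P]

def relu(M):
    return [[max(0.0, v) for v in row] for row in M]

def gcn_forward(A_hat, Z1, W0, W1):
    """Two graph-convolution layers; returns the pre-activation output scores."""
    H = relu(matmul(matmul(A_hat, Z1), W0))       # hidden layer, shape n x H
    return matmul(matmul(A_hat, H), W1)           # output layer, shape n x M

# Toy example: two connected nodes, C = 2 input features, H = 2, M = 1.
A_hat = renormalized_adjacency([[0.0, 1.0], [1.0, 0.0]])
Z1 = [[1.0, 0.0], [0.0, 1.0]]
W0 = [[1.0, -1.0], [1.0, 1.0]]
W1 = [[1.0], [2.0]]
scores = gcn_forward(A_hat, Z1, W0, W1)
```

Adding the self-loops before normalizing keeps each node's own features in the aggregation and bounds the spectrum of the propagation matrix, which is what makes repeated application stable.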
3) Output Layer: In the output layer, we formulate two kinds of constraints to boost the classification accuracy: LCSP and LCGP constraints, which are shown in Fig. 2. LCSP is maintained by calculating the error between the label prediction and the real label of the sample. LCGP refers to considering the long-range data by introducing a group label matrix and realizing label reuse on the basis of limited labeled data. Let s be the number of labeled patches, q be that of unlabeled patches, and n = s + q be the number of patches. The cross-entropy loss and LC constraint train the features extracted from the MSCNN and the GCN with the labeled data.
LCSP Loss: In the GCN, the output layer is activated by the ReLU activation function to transform the features into the probabilities $Z$ of all class labels. The LCSP loss, defined in (10), measures the difference between the output of the network $Z$ and the real label $Y$, where $L$ is the number of classes and $Z_i$ is the label prediction for the $i$th patch. The value of $I(j)$ is 1 when $j$ equals the desired label $Y_i$ of the $i$th patch ($1 \le i \le s$); otherwise, the value is 0. In (10), the probability of all class labels is predicted and optimized using the cross-entropy loss.
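A cross-entropy of this kind over the s labeled patches can be sketched as follows. The function name `lcsp_loss`, the summation without averaging, and the small `eps` added for numerical safety are assumptions made for illustration; the exact normalization used in (10) is not reproduced here.

```python
import math

def lcsp_loss(Z, labels, eps=1e-12):
    """Cross-entropy between predicted class probabilities and true labels.

    Z      -- Z[i][j] is the predicted probability of class j for labeled patch i
    labels -- labels[i] is the integer class index Y_i of labeled patch i
    """
    total = 0.0
    for probs, y in zip(Z, labels):
        # The indicator I(j) keeps only the term of the true class j = Y_i.
        total -= math.log(probs[y] + eps)
    return total

confident = lcsp_loss([[1.0, 0.0], [0.0, 1.0]], [0, 1])   # correct and certain
uncertain = lcsp_loss([[0.5, 0.5]], [1])                  # maximally unsure
```

The loss is close to zero when every labeled patch receives probability 1 for its true class and grows as the probability assigned to the true class shrinks.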
LCGP Loss: The LCGP is achieved by introducing the group label matrix $G$, which considers the long-range data and realizes label reuse. For example, assume that patch $X_1$ is from class 1, $X_2$ and $X_3$ are from class 2, and $X_4$ is from class 3. Then, $G$ is defined as

$$G = \begin{bmatrix} 1 & 0 & 0 & 0 \\ 0 & 1 & 1 & 0 \\ 0 & 1 & 1 & 0 \\ 0 & 0 & 0 & 1 \end{bmatrix}. \tag{11}$$

The objective function of the LC is then described by (12), where $T \in \mathbb{R}^{s \times L}$ is a transformation matrix that transforms the predicted label $Z \in \mathbb{R}^{L \times s}$ into a matrix of the same size as $G \in \mathbb{R}^{s \times s}$, and $\alpha$ is the balance term.
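The group label matrix from the four-patch example can be built programmatically as sketched here. The rule "entry (i, j) is 1 when labeled patches i and j share a class" is inferred from that example, and the squared Frobenius residual between G and TZ is only an assumed data term chosen to match the stated shapes of T, Z, and G; how α enters (12) is not reproduced. All function names are hypothetical.

```python
def group_label_matrix(labels):
    """G[i][j] = 1 when labeled patches i and j belong to the same class."""
    return [[1 if a == b else 0 for b in labels] for a in labels]

def lcgp_residual(G, T, Z):
    """Squared Frobenius norm of G - T Z (assumed form of the LCGP data term).

    T has shape s x L and Z has shape L x s, so T Z has the s x s shape of G.
    """
    TZ = [[sum(t * z for t, z in zip(row, col)) for col in zip(*Z)]
          for row in T]
    return sum((g - tz) ** 2
               for g_row, tz_row in zip(G, TZ)
               for g, tz in zip(g_row, tz_row))

# Example from the text: X1 in class 1, X2 and X3 in class 2, X4 in class 3.
G = group_label_matrix([1, 2, 2, 3])

# With one-hot predictions Z (L x s) and T equal to its transpose, T Z equals G.
Z = [[1, 0, 0, 0],
     [0, 1, 1, 0],
     [0, 0, 0, 1]]
T = [list(col) for col in zip(*Z)]
residual = lcgp_residual(G, T, Z)
```

Because every pair of same-class patches contributes an entry to G, a handful of labels yields many pairwise constraints, which is one way to read the "label reuse" mentioned in the text.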

4) Overall Objective Function of DFL-LC:
Considering the constraints of (10) and (12), we formulate the joint objective function of DFL-LC in (13), where $\lambda$ is the balance term.

C. Optimization of DFL-LC
In this section, we propose an iterative algorithm to optimize the parameters in the DFL-LC, and Algorithm 1 summarizes the optimization procedure. Let ϕ 1 be the collection of weights and bias of the CNN, and ϕ 2 be the collection of weights of the GCN. In each iteration of the algorithm, the parameters ϕ 1 , ϕ 2 , and T are optimized.
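The alternating structure of such an algorithm can be illustrated on a deliberately tiny problem. Everything in this sketch is a stand-in: the scalar `w` plays the role of the network parameters, the scalar `t` plays the role of T, and the toy objective (g - t·w·x)² + α·t² is hypothetical and is not the objective in (13). The closed-form update of `t` is specific to this toy objective.

```python
def objective(w, t, x, g, alpha):
    """Toy objective: squared fitting error plus a penalty on t."""
    return (g - t * w * x) ** 2 + alpha * t ** 2

def alternate(w, t, x=1.0, g=1.0, alpha=0.01, beta=0.1, iters=20):
    """Alternate a gradient step on w (t fixed) with an exact update of t (w fixed)."""
    history = [objective(w, t, x, g, alpha)]
    for _ in range(iters):
        # Step 1: one gradient-descent step on w with learning rate beta.
        grad_w = -2.0 * t * x * (g - t * w * x)
        w -= beta * grad_w
        # Step 2: exact minimizer over t for the current w (ridge-style solution).
        z = w * x
        t = g * z / (z * z + alpha)
        history.append(objective(w, t, x, g, alpha))
    return w, t, history

w, t, history = alternate(w=0.5, t=0.5)
```

Each half-step only changes one block of variables while the other is held fixed, which is the pattern the subsection describes for the network parameters and the transformation matrix.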
The parameters $\phi_1$ and $\phi_2$ are solved when $T$ is fixed, so the optimization problem defined in (13) is rewritten with respect to $\phi_1$ and $\phi_2$ only. Then, we update these parameters in each iteration, where $\beta$ is the learning rate of the DFL-LC. $T$ is solved when $\phi_1$ and $\phi_2$ are fixed. Therefore, the optimization problem defined in (13) is rewritten with respect to $T$ only, and the transformation matrix $T$ is updated accordingly.

Fig. 3 shows the false color image and ground-truth map of the Indian Pines dataset, and Table I lists the number of labeled and unlabeled pixels of various classes.
2) Salinas: The Salinas dataset was collected by the 224-band AVIRIS sensor over Salinas Valley, CA, USA, with a spatial resolution of 3.7 m. After removing 20 water absorption bands, the image comprises 204 bands with 512 × 217 pixels. It includes vegetables, bare soils, and vineyard fields. The ground truth of Salinas contains 16 classes. The false color image and ground-truth map are shown in Fig. 4, and Table II lists the numbers of training and test samples in the Salinas dataset.

3) University of Pavia:
The University of Pavia dataset was acquired by the ROSIS sensor during a flight over Pavia in northern Italy, with a spatial resolution of 1.3 m. The scene is 610 × 340 × 103 after removing 12 noisy bands. The University of Pavia dataset contains nine classes of interest. Fig. 5 shows the false color image and the ground-truth map of the University of Pavia dataset, and Table III lists the numbers of training and test samples.

B. Experimental Settings
The proposed method is implemented in PyTorch with the Adam optimizer, and the backpropagation algorithm is used to optimize the parameters of the whole network. For training the network, we set the learning rate to 0.01, the number of epochs to 9000, and the hidden layer size to 24 units. We crop each pixel and its surrounding neighboring pixels as the input of DFL-LC.

C. Comparison Approaches
To verify and evaluate the classification ability of the proposed DFL-LC, other traditional and state-of-the-art methods for HSI classification (SVM [2], CNN [11], CNN-MRF [42], HybridSN [43], SSCNN [44], and SDP [45]) are also used for comparison. We also compare CNN-GCN with DFL-LC to verify the validity of the MSCNN.
1) SVM [2]: Combining the SVM with a feature-reduction technique is effective in HSI classification.
2) CNN [11]: It is a 3-D CNN model that effectively extracts spectral and spatial features for HSI classification.
3) CNN-MRF [42]: The CNN is used to learn the posterior class distributions, and then a Markov random field prior is used to consider the spatial information.
4) HybridSN [43]: It is a spectral-spatial 3-D CNN followed by a spatial 2-D CNN. The 3-D CNN can represent spectral and spatial features, and the 2-D CNN can further learn more spatial features.
5) SSCNN [44]: It is a semisupervised CNN for HSI classification, which can automatically learn features from complex data structures.
6) SDP [45]: It is a semisupervised active learning approach for HSI classification that improves machine generalization by using pseudo-labeled samples.

D. Classification Results
In these experiments, three objective metrics, namely overall accuracy (OA), average accuracy (AA), and the Kappa coefficient, are used to quantitatively evaluate the performance of DFL-LC and the other methods. The OA is the ratio of the number of correctly classified test samples to the total number of test samples. The AA is the average of the per-class classification accuracies. The Kappa coefficient is a robust measure of the degree of agreement, which corrects the observed accuracy for the agreement expected by chance. The experiments are conducted on the Indian Pines, Salinas, and University of Pavia datasets. The quantitative classification results are summarized in Tables IV-VI, and the highest accuracy in each class is highlighted in bold. The classification maps obtained by different methods are shown in Figs. 6-8. From these results, we make the following observations.
1) DFL-LC achieves higher classification accuracy than the other methods and the best performance on all three datasets. This demonstrates that DFL-LC, which considers long-range data and keeps LC, can learn more representative features of HSIs. Compared with the other classification maps, the map produced by DFL-LC is closer to the ground-truth map, shows fewer misclassifications, and has a smoother visual appearance. This indicates that DFL-LC is well suited to HSI classification because it can effectively construct the relationships among the samples.
2) The CNN-based methods, including CNN, CNN-MRF, HybridSN, SSCNN, and SDP, achieve relatively low accuracy compared with DFL-LC. The reason is that they can only perform convolutions on regular image regions and cannot extract specific local spatial information. This also shows that the GCN and LCGP can take the spectral features of long-range data into account, which plays a significant role in HSI classification.
3) We also observe that the DFL-LC method yields relatively good performance compared with SSCNN and SDP, which are semisupervised classification methods. This indicates that LC can realize label reuse based on limited labeled data to improve the feature learning ability of DFL-LC.
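The three metrics defined at the start of this subsection can be computed from a confusion matrix as sketched below. The function name `classification_metrics` is hypothetical, the Kappa coefficient follows the standard Cohen's kappa definition, and the small confusion matrix is made-up illustrative data rather than a result from the tables.

```python
def classification_metrics(conf):
    """Return (OA, AA, Kappa) from a confusion matrix.

    conf[i][j] -- number of test samples of true class i predicted as class j
    """
    n_classes = len(conf)
    total = sum(sum(row) for row in conf)
    correct = sum(conf[i][i] for i in range(n_classes))

    # Overall accuracy: correctly classified samples over all test samples.
    oa = correct / total

    # Average accuracy: mean of the per-class accuracies.
    per_class = [conf[i][i] / sum(conf[i]) for i in range(n_classes)]
    aa = sum(per_class) / n_classes

    # Kappa: observed agreement corrected for the agreement expected by chance.
    row_tot = [sum(row) for row in conf]
    col_tot = [sum(conf[i][j] for i in range(n_classes))
               for j in range(n_classes)]
    pe = sum(r * c for r, c in zip(row_tot, col_tot)) / (total * total)
    kappa = (oa - pe) / (1.0 - pe)
    return oa, aa, kappa

# Illustrative two-class confusion matrix (not taken from the experiments).
oa, aa, kappa = classification_metrics([[8, 2], [1, 9]])
```

OA can be dominated by large classes, which is why AA and Kappa are usually reported alongside it for datasets with imbalanced class sizes.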

E. Parameters Analysis
1) Impact of γ in the Adjacency Matrix: In the proposed method, the adjacency matrix is calculated as in (2), so the value of γ affects the classification accuracy. Thus, we vary the value of γ from 0.001 to 0.1 and report the OA on the Indian Pines, Salinas, and University of Pavia datasets. The classification results are shown in Fig. 9. We can observe that the classification accuracy is improved when the maximal value of the nonzero elements in the adjacency matrix approaches 1. According to the experimental results, the classification accuracy is highest when γ is 0.01. Moreover, the performance of DFL-LC is more stable than that of CNN-GCN as the value of γ changes.
2) Impact of the Number of Hidden Nodes: The GCN learns features by encoding node features and graph structure in the hidden layer. There is one hidden layer in the proposed method, and the number of nodes in this hidden layer also influences the classification result. Therefore, we vary the number of hidden nodes among 16, 24, and 32. The OA values on the Indian Pines, Salinas, and University of Pavia datasets are shown in Fig. 10. We can observe that the classification accuracy is improved when the number of hidden nodes is larger than the number of categories in the dataset. We choose the number of hidden nodes according to the best accuracy in the experiment, so 24 hidden nodes are set in the method.
3) Influence of the Multiscale Filter Bank: To verify the effectiveness of filter banks with different scales in extracting feature information, we compare four filter-bank configurations: 1 × 1, ∼3 × 3, ∼5 × 5, and ∼7 × 7. Here, ∼7 × 7 denotes that the sizes of the convolutional filters are 1 × 1, 3 × 3, 5 × 5, and 7 × 7, and the other configurations are defined in the same way. As shown in Table VII, the multiscale filters achieve higher classification accuracy than the single 1 × 1 convolutional filter. Multiscale convolution can exploit the spatial-spectral feature, which is better adapted for HSI classification. Additionally, since ∼7 × 7 contains more noise, the ∼5 × 5 multiscale filter shows better performance.

V. CONCLUSION
In order to effectively extract features and keep LC, we propose a novel DFL-LC method for HSI classification, which is based on traditional convolution and graph convolution. In DFL-LC, the MSCNN is used to obtain basic features, the GCN captures relationships between pixels and realizes HSI classification, and LCSP and LCGP are embedded in the objective function. LCSP ensures LC between the predicted label and the real label of the sample. DFL-LC is a semisupervised method, and it considers the truthful neighborhood information of all samples. LCGP ensures the quality of the extracted features when only a small number of labeled samples are available, so DFL-LC can alleviate the problem of insufficient labeled samples. Compared with traditional and state-of-the-art classification methods, the experimental results demonstrate that the proposed method yields better HSI classification performance. In future research, DFL-LC will be applied to other recognition tasks, such as high-spatial-resolution remote sensing image segmentation.

He is currently a Professor with the School of Land Science and Technology, China University of Geosciences, Beijing, China.
His research interests include temporal-spatial data analysis, surveying adjustment, applied statistics, and their associated application in surveying engineering, image geodesy, remote sensing, and satellite geodesy. He also held visiting faculty long-term honorary academic appointments as Guest Professor with South West Jiao Tong University, Chengdu, China, and a Guest (Global) Professor with Keio University, Tokyo, Japan. He has delivered numerous invited presentations, including plenary and keynote lectures, and has taught many short courses all over the world. He has coauthored 135 journal papers published mainly in various IEEE journals, one book (edited), five book chapters, and more than 140 conference papers in the areas of his research activities and contributions, which have dealt with wireless terrestrial and satellite communication systems and network as well as in remote sensing, LiDAR systems, and information technology, including blockchain systems.
Dr. Mathiopoulos has been or is currently on the editorial board of several archival journals, including the IET Communications as an Area Editor, the IEEE TRANSACTIONS ON COMMUNICATIONS, the Remote Sensing Journal, and as a Specialty Chief Editor of the Aerial and Space Network Journal of Frontiers. From 2001 to 2014, he was a Greek Representative to high-level committees in the European Commission and the European Space Agency. He has been a member of the Technical Program Committees (TPC) for numerous IEEE and other international conferences and was the TPC Vice-Chair of several IEEE conferences. As a faculty member at UBC, he was the recipient of an Advanced Systems Institute (ASI) Fellowship as well as a Killam Research Fellowship. He was also the co-recipient of two best conference paper awards and the Satellite and Space Communication Technical Committee 2017 Distinguished Service Award for outstanding contributions in the field of Satellite and Space Communications from the IEEE Communication Society.

Yong Li received the Ph.D. degree in pattern recognition and intelligent systems from Northeastern University, Shenyang, China, in 2020.
He is currently an Assistant Professor with the School of Electrical Engineering, Guangxi University, Nanning, China. His research interests include intelligent robots, point cloud processing, computer vision, and pattern recognition.