Patch-Based Training of Fully Convolutional Network for Hyperspectral Image Classification With Sparse Point Labels

Fully convolutional networks (FCNs), which have an excellent capability for capturing spatial context, were introduced to improve the performance of hyperspectral image classification (HSIC). However, training an FCN usually requires a huge amount of pixel-level labels, which are difficult to obtain for HSIC in practical applications. How to train an FCN effectively under the supervision of limited sparse point labels has therefore attracted increasing attention. The patch-free training pattern with sparse point labels has been proven effective for the HSIC task. Then, as a general training mode for remote sensing image semantic segmentation, is patch-based training of FCN also effective for HSIC with sparse point labels? To answer this question, a patch-based training framework with a novel fully convolutional network is proposed for HSIC in this article. First, cropped hyperspectral image (HSI) patches with sparse labels are input for training. Second, considering the limited supervision provided by sparse points, a lightweight network based on an encoder–decoder structure with shallow channels is specially designed for HSIC, with the aid of residual connections in the encoder and the integration of multiple attention modules to fully exploit the spectral-spatial information of HSI. Third, conditional random field loss is adopted as a prior complement to the point supervision for further excavation of spatial context information. The performance of the proposed method is quantitatively evaluated on three HSI datasets and achieves state-of-the-art results in comparison with other representative methods, demonstrating the effectiveness of the patch-based training framework for HSIC.


I. INTRODUCTION
Composed of hundreds of contiguous and narrow spectral bands, hyperspectral images (HSIs) are able to capture more abundant spectral signatures about the ground surface than multispectral images [1], [2], [3] and have been widely used for earth observation missions, including land cover mapping [4], precise crop identification [5], [6], mineral mapping [7], and environmental monitoring [8], [9]. Hyperspectral image classification (HSIC), which aims to assign a label to every pixel in an image, is a challenging task due to the limited labeled data and the nonlinear relations, feature redundancy, and curse of dimensionality brought by the high spectral resolution [1], [2].
For the past few decades, plentiful methods have been proposed for HSIC, which can be roughly categorized into spectral-based classification and spectral-spatial classification [10]. Spectral-based methods, such as support vector machine (SVM) [11], random forest [12], and dynamic or random subspace [13], [14], aim to explore the spectral signature difference in terms of pixels. However, adjacent pixels can provide vital spatial information for improving the performance of HSIC [15]. Hence, an increasing number of spectral-spatial classification methods integrating spectral signatures and spatial contexts emerged, including Gabor filter [16], [17], extended morphological profiles [18], [19], and multiple kernel [20], [21]. Although the abovementioned methods were proven to be effective for the HSIC task, they heavily relied on the handcrafted feature descriptors. Furthermore, the setting of the parameters always depends on expert knowledge [22].
With the rapid development of deep learning, deep learning-based methods have been widely used for HSIC due to their prominent ability to automatically extract discriminative features in a hierarchical learning way [23], [24], [25]. The stacked autoencoder (SAE) was first introduced to HSIC in [26]. Chen et al. [27] proposed a deep belief network (DBN) to learn the restricted Boltzmann machine network layer by layer. However, due to the fully connected layers in both SAE and DBN, these two models have a large number of parameters, which leads to a longer training process. Considering the characteristics of HSIs, recurrent neural network (RNN) models were used for HSIC to better understand the spectral signatures by processing the spectral information as time sequences [28]. Mou et al. [29] first proposed an RNN framework with modified gated recurrent units and proved the effectiveness of deep recurrent networks for HSIC. After that, multiple RNN-based networks were proposed for improving the performance of HSIC. In addition to making full use of spectral signatures, some researchers attempted to combine spatial context in RNN-based structures. For example, Zhou et al. [30] adopted a two-branch long short-term memory (LSTM) network integrating spectral and spatial branches, and Sharma et al. [31] designed a patch-based RNN to capture spatial information from adjacent pixels.
Convolutional neural network (CNN), which reduces the number of parameters through local receptive fields and weight sharing and has hierarchical spectral-spatial feature representation ability [32], is a powerful technique for geographical information extraction [33], [34] and has been widely used for HSIC. HSI patches, cropped with a fixed size around a centric pixel and its surrounding adjacent pixels, are the input of CNN structures. Since the aim of HSIC is to assign a unique label to every single pixel, the CNN-based networks for HSIC can be summarized into two categories: classification networks and fully convolutional networks (FCNs).
Classification networks aim at classifying the centric pixel of the target image patch. The image patches and their centric pixel labels are utilized for training the network. At the inference stage, given an input image patch, the category of its centric pixel can be predicted by the trained classification network. Numerous CNN-based classification networks were developed to mine spectral-spatial information for HSIC. One kind of method focuses on proposing more effective spectral-spatial modules. For example, three-dimensional convolutional layers were introduced into a CNN network to extract effective features [35], [36]. A CNN structure was proposed to obtain high-level spectral-spatial features [37]. The capsule units were redefined as spectral-spatial units for accurate classification [38]. The other kind of method develops the fusion of spectral and spatial features for better performance. Two-branch CNNs were designed for extracting spectral and spatial information separately, with an integration module for feature fusion [39], [40]. In [41], a fusion module was proposed for multilevel feature integration.
The size of the HSI patch (patch size) determines the local receptive field of the network, which influences the extraction of spatial contexts. In theory, a larger patch size brings more spatial information about the surrounding spatial dependence. For classification networks, however, since the HSI patch represents the spectral-spatial information of the central pixel, the neighborhood pixels are required to have spatial homogeneity as high as possible [42]. Unfortunately, HSI patches may inevitably contain pixels from different land-cover categories when they are located near the edges of land-cover regions [43]. As a result, small patches have been used in most classification networks to maintain high spatial homogeneity and avoid the negative effect of pixels from other categories [44], which limits the receptive field.
In order to enlarge the receptive field and bring more spatial information into CNN models, FCNs were introduced for HSIC in recent years. Compared with classification networks, an FCN forms an end-to-end structure by jointly learning feature extraction from the original input data to the final output with great generalization, and is trained with pixelwise labels, as shown in Fig. 1(a). With the benefit of the consecutive down-sampling operations in the convolutional forward stage, the extracted features contain rich context and work well for pixel-level classification. Furthermore, the training and inference of FCNs are performed whole-image-at-a-time by dense feedforward computation and back-propagation [45]. Hence, FCNs impose no limitation on the input image size and can greatly improve inference efficiency.
However, a huge number of pixel-level labels is required to train FCNs, and few pixel-level labels are available for HSIC in practical applications. How to train an FCN effectively under the supervision of limited sparse point labels has thus aroused concern. Since the prior information provided by the sparse point labels is inadequate, the key to training an FCN for HSIC is to ensure the convergence of the network under limited constraints.
Several works have successfully trained FCNs with sparse point labels. In [46], an end-to-end fully convolutional and fully supervised network was proposed for HSIC with a fine pixelwise label style. Instead of labeling all pixels of the HSI patch, an HSI patch generation method was used to generate full pixel-level labels from the existing training pixels. However, the HSI patches for training are generated, which cannot reveal the complexity of realistic imagery. The other works adopted the patch-free training style of FCNs. As shown in Fig. 1(b), the patch-free framework takes the whole hyperspectral image as input and the sparse point training samples as labels. A spectral-spatial FCN was proposed for HSIC with a mask matrix to assist back-propagation in the training stage, considering the sparsity of the point training samples [47]. A global stochastic stratified sampling strategy was proposed to guarantee the convergence of a patch-free FCN with a spectral attention-based encoder and a lightweight decoder [48]. A spectral-spatial dependent global learning framework was introduced [49], based on a global convolutional LSTM and a global joint attention mechanism with a hierarchically balanced sampling strategy and a weighted Softmax loss for insufficient and imbalanced HSIC. However, the sparsity of the training samples has to be considered when applying patch-free methods for HSIC. Besides, it is hard to apply patch-free methods to large-scale images due to the limitation of computational memory.
In this article, instead of the patch-free training style, we propose a patch-based training framework under the supervision of sparse point labels, aiming at providing a new way of applying FCN to HSIC. As shown in Fig. 1(c), the training framework takes cropped patches as input. Considering the limited supervision provided by sparse points for training, a novel lightweight fully convolutional network (LFCNet) is designed on the basis of an encoder-decoder structure with shallow channels, with the aid of residual connections in the encoder and the integration of multiple attention modules to fully exploit the abundant spatial-spectral information of HSIs: attention gates (AGs) are applied to make the network pay attention to more important areas at each level of layers, and gated channel transformation is utilized to weight the different significance of the abundant channels. Furthermore, considering the limitation of the prior information provided by the sparse point labels, we adopt conditional random field (CRF) loss as a complementary constraint for further excavation of the spatial context information of HSI. The main contributions of this article are as follows.
1) Patch-based training of FCN with sparse point labels is proposed and demonstrated to hold the advantages of capturing spatial context and inference efficiency compared with classification networks, and to be less sensitive to the sparsity of training samples in comparison with patch-free training of FCN, which provides a novel way for HSIC.
2) A lightweight and effective fully convolutional network, LFCNet, is designed for HSIC with patch-based training, consisting of AG and gated channel transformation (GCT) modules for better exploitation of spectral-spatial information, as well as the CRF-loss for further excavation of the spatial context information of HSI.
3) The proposed method achieves state-of-the-art performance through a large number of experiments on three HSI datasets, and LFCNet is proved to be appropriate and universal for HSIC.

II. METHOD
A patch-based training framework on sparse point labels is proposed for HSIC with the LFCNet, which takes HSI patches as input and is designed on the basis of a shallow-layer encoder-decoder structure with residual connections and multiple attention modules to fully exploit the spectral-spatial information under limited point supervision. Furthermore, CRF-loss is adopted as a complement for capturing spatial information between pixels.

A. Patch-based Training Framework for HSIC
The overall framework of the patch-based training is shown in Fig. 2. In this framework, HSI patches for training and for inference are generated in different ways. According to a certain number of randomly selected sample points per category, HSI patches for training are generated by taking each labeled pixel as the centric pixel and cropping a fixed patch of size s × s from its surrounding adjacent pixels, which ensures that each training patch has at least one point label. Patches for inference are cropped in order from left to right and top to bottom with an overlay ratio of 50% to make use of information from all surrounding patches. The fully convolutional network is trained on the HSI patches and sparse point labels with CRF-loss. During inference, the cropped overlapping patches are predicted by the trained network to prevent context limitation and are stitched together by averaging the probability of each land cover category for every pixel in the overlapped areas.
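The training-patch cropping rule above (one s × s patch centered on every labeled pixel, so each patch carries at least one point label) might be sketched as follows. The reflect padding at image borders and the odd patch size used for exact centering are illustrative assumptions, not details fixed by the article:

```python
import numpy as np

def crop_training_patches(image, labels, s, ignore_index=-1):
    """Crop one s x s patch around every labeled pixel.

    image: (H, W, B) hyperspectral cube; labels: (H, W) map in which
    unlabeled pixels equal ignore_index.  Centering each patch on a
    labeled pixel guarantees at least one point label per patch.
    Assumes odd s so the labeled pixel sits exactly at the center."""
    half = s // 2
    # reflect-pad so patches centered near the border keep size s x s
    img = np.pad(image, ((half, half), (half, half), (0, 0)), mode="reflect")
    lab = np.pad(labels, half, mode="constant", constant_values=ignore_index)
    patches, patch_labels = [], []
    for r, c in zip(*np.where(labels != ignore_index)):
        pr, pc = r + half, c + half  # position in the padded arrays
        patches.append(img[pr - half:pr + half + 1, pc - half:pc + half + 1])
        patch_labels.append(lab[pr - half:pr + half + 1, pc - half:pc + half + 1])
    return np.stack(patches), np.stack(patch_labels)
```

Each returned label patch is still sparse; only the point labels inside it supervise the partial loss, while the remaining pixels stay at `ignore_index`.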

B. Lightweight Fully Convolutional Network (LFCNet)
Since the FCN is trained with sparse point labels that provide limited supervision and prior information for training, a lightweight and efficient structure is designed to ensure the convergence and effectiveness of the network. As shown in Fig. 3(a), LFCNet is designed on the basis of a lightweight encoder-decoder structure with shallow channels for each block. In both the encoder and decoder parts, the residual block (ResBlock) is adopted as a basic unit to extract features. Due to the limited number of training samples and the high depth of the network structure, the degradation problem would arise for HSIC [4]. The degradation of accuracy can be alleviated by bringing shortcut connections into the convolutional layers, as successfully used by ResNet [50], [51]. Hence, a shortcut connection is adopted in our ResBlock to facilitate the model training process [36]. As shown in Fig. 3(b), our ResBlock consists of two convolutional layers, with a batch normalization (BN) layer and a ReLU activation function after each convolution to speed up the convergence of the network. In the decoder part, the UpBlock plays the role of gradually improving the spatial resolution to restore the detailed information; it is made up of an up-sampling layer and a convolutional layer with BN and ReLU, as illustrated in Fig. 3(c).
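A rough PyTorch sketch of the two building blocks is given below. The 3×3 kernels, bilinear up-sampling, and the 1×1 projection shortcut when channel counts differ are assumptions of this sketch; the article does not pin down these choices:

```python
import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """Two Conv-BN-ReLU layers with a shortcut connection, as in Fig. 3(b)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True),
            nn.Conv2d(out_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch))
        # hypothetical 1x1 projection when channel counts change;
        # the paper only states that a shortcut connection is used
        self.skip = (nn.Identity() if in_ch == out_ch
                     else nn.Conv2d(in_ch, out_ch, 1, bias=False))
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        return self.relu(self.body(x) + self.skip(x))

class UpBlock(nn.Module):
    """Up-sampling followed by Conv-BN-ReLU, as in Fig. 3(c)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.up = nn.Upsample(scale_factor=2, mode="bilinear",
                              align_corners=False)
        self.conv = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, 3, padding=1, bias=False),
            nn.BatchNorm2d(out_ch), nn.ReLU(inplace=True))

    def forward(self, x):
        return self.conv(self.up(x))
```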
The encoder-decoder structure can extract high-level context information but leads to the loss of detailed information, which is of great importance for accurate segmentation [52]. Hence, skip connections between the encoder and decoder are adopted to combine multilevel features from shallow convolutional layers to leverage detailed spatial features. As illustrated in Fig. 3(a), an AG is inserted into the encoder-decoder structure with skip connections to precisely supplement the spatial details. Furthermore, considering the high spectral resolution and correlation of HSI, GCT is applied as a stem block in the front end of the encoder part to learn the spectral relationships. The detailed structures of AG and GCT are described below.
1) Attention Gate: Since the down-sampling operations in the encoder part capture a large receptive field for high-level context information, detail information is lost, which harms precise segmentation performance. A skip connection between the encoder and decoder in shallow layers can recover detailed spatial information to a certain extent. However, it is vital to use low-level features accurately as a supplement for segmentation. Hence, we adopt an AG in the skip connection to progressively stress effective attention on the features of each level [53].
The structure of AG is illustrated in Fig. 3(d). Here, g represents feature maps from the encoder part, x_l represents feature maps from the decoder part, and α represents the attention coefficient. First, g and x_l each go through a convolutional layer with BN and are added together. Second, the added feature passes sequentially through a ReLU activation function, a convolutional layer with BN, and a sigmoid activation function to obtain the attention coefficient, which contains low-level feature responses. The output of the AG is the elementwise multiplication of the input x_l and the attention coefficient. By inserting AGs into the skip connections, the network tends to pay attention to more important areas at each level of layers.
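A minimal PyTorch sketch of this gate follows. The 1×1 convolutions and the configurable intermediate channel count are hypothetical details chosen for the sketch, not values stated in the article:

```python
import torch
import torch.nn as nn

class AttentionGate(nn.Module):
    """AG of Fig. 3(d): g and x_l are mapped by Conv-BN, summed, then
    passed through ReLU, another Conv-BN, and a sigmoid to produce the
    attention coefficient alpha that rescales x_l elementwise."""
    def __init__(self, g_ch, x_ch, inter_ch):
        super().__init__()
        self.wg = nn.Sequential(nn.Conv2d(g_ch, inter_ch, 1, bias=False),
                                nn.BatchNorm2d(inter_ch))
        self.wx = nn.Sequential(nn.Conv2d(x_ch, inter_ch, 1, bias=False),
                                nn.BatchNorm2d(inter_ch))
        self.psi = nn.Sequential(nn.Conv2d(inter_ch, 1, 1, bias=False),
                                 nn.BatchNorm2d(1), nn.Sigmoid())

    def forward(self, g, x):
        # alpha lies in (0, 1) and is broadcast over the channels of x_l
        alpha = self.psi(torch.relu(self.wg(g) + self.wx(x)))
        return x * alpha
```

Because alpha is bounded by the sigmoid, the gate can only attenuate features, steering the decoder toward the spatial positions the low-level responses mark as important.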
2) Gated Channel Transformation: Due to the high spectral resolution and complexity of HSI, how to fully extract the useful spectral information to model spectral correlation is crucial for HSIC. We adopt GCT as a stem block to process the original HSI patches. GCT is a lightweight channel normalization layer that employs a normalization method to create competition or cooperation relationships among channels [54]. As illustrated in Fig. 3(e), GCT consists of three parts as follows.
First, a global context embedding module with the $\ell_2$-norm is proposed to aggregate global context information in the channel dimension. The module is expressed as

$$ s_c = \alpha_c \left\{ \left[ \sum_{i=1}^{H} \sum_{j=1}^{W} \left( x_c^{i,j} \right)^2 \right] + \epsilon \right\}^{1/2} $$

where $c$ indexes the channels of the input, $\alpha_c$ is a trainable embedding weight that evaluates the different significance of different channels, and $\epsilon$ is a small constant that avoids the problem of derivation at the zero point.
Second, $\ell_2$ normalization is adopted to realize channel normalization, which can be expressed as

$$ \hat{s}_c = \frac{\sqrt{C}\, s_c}{\lVert s \rVert_2} = \frac{\sqrt{C}\, s_c}{\left[ \sum_{c'=1}^{C} s_{c'}^2 + \epsilon \right]^{1/2}} $$

where $\sqrt{C}$ is used to normalize the scale of $\hat{s}_c$. Finally, a gating adaptation is proposed to adapt the original feature for competition or cooperation among the channels. The gating function is expressed as

$$ \hat{x}_c = x_c \left[ 1 + \tanh\left( \gamma_c \hat{s}_c + \beta_c \right) \right] $$

where $\gamma_c$ represents the weight, $\beta_c$ represents the bias, and $1 + \tanh(\gamma_c \hat{s}_c + \beta_c)$ represents the gating weight for the original feature.
In essence, GCT is a channel attention module to learn channel relationships by weighting the significance of different channels in each location. Hence, we adopt it as a stem block in the front end of the network to explore the spectral correlations of the original HSI patches.
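The three GCT steps condense into a few lines of NumPy for a single (C, H, W) sample; the per-channel parameter shapes follow the formulation above, and the epsilon value is a placeholder:

```python
import numpy as np

def gct(x, alpha, gamma, beta, eps=1e-5):
    """Gated channel transformation for one sample x of shape (C, H, W).
    alpha, gamma, beta are per-channel parameters of shape (C,)."""
    C = x.shape[0]
    # 1) global context embedding via the l2-norm of each channel
    s = alpha * np.sqrt((x ** 2).sum(axis=(1, 2)) + eps)      # (C,)
    # 2) channel normalization, rescaled by sqrt(C)
    s_hat = np.sqrt(C) * s / np.sqrt((s ** 2).sum() + eps)    # (C,)
    # 3) gating adaptation: 1 + tanh(gamma * s_hat + beta)
    gate = 1.0 + np.tanh(gamma * s_hat + beta)                # (C,)
    return x * gate[:, None, None]
```

Note that with gamma and beta initialized to zero the gate equals one, so GCT starts as an identity mapping and gradually learns channel competition or cooperation during training.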

C. CRF-Loss as a Complement Training Constraint
An effective loss function plays a critical role in training an FCN and prevents the gradient from vanishing. However, supervision with only the sparse point labels is insufficient for training, which leads to imprecise segmentation [55], [56]. We adopt CRF-loss as an auxiliary loss for further constraint, promoting the performance of the network by making full use of the spatial context information in HSI [57]. Hence, we train LFCNet with the composite loss function, which is defined as

$$ L = L_{\mathrm{pce}} + \lambda L_{\mathrm{CRF}} $$

where $L_{\mathrm{pce}}$ represents the partial cross-entropy loss over the sparsely labeled pixels, $L_{\mathrm{CRF}}$ represents the CRF-loss over the extra unlabeled pixels, and $\lambda$ represents the weight of the CRF-loss. The detailed principle of the CRF-loss is as follows.
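A toy NumPy sketch of the composite loss is given below. The relaxed Potts CRF term follows the Gaussian appearance kernel detailed next, but is computed here with a dense pairwise matrix suitable only for tiny patches (the article relies on efficient high-dimensional filtering); the theta values and the weight w are placeholder assumptions:

```python
import numpy as np

def partial_ce(p, labels, ignore_index=-1, eps=1e-12):
    """L_pce: cross-entropy averaged over the labeled pixels only.
    p: (C, H, W) softmax output; labels: (H, W) sparse point labels."""
    rows, cols = np.where(labels != ignore_index)
    picked = p[labels[rows, cols], rows, cols]
    return -np.log(picked + eps).mean()

def crf_loss(p, image, theta_a=5.0, theta_b=0.1, w=1.0):
    """Relaxed Potts energy with a dense pairwise Gaussian kernel.
    image: (B, H, W) spectra used for the appearance kernel; w stands
    in for the normalized weight W found via EM in the paper."""
    C, H, W = p.shape
    n = H * W
    yy, xx = np.mgrid[:H, :W]
    d = np.stack([yy.ravel(), xx.ravel()], axis=1).astype(float)  # (n, 2)
    I = image.reshape(image.shape[0], n).T                        # (n, B)
    dd = ((d[:, None] - d[None]) ** 2).sum(-1)   # squared spatial distances
    II = ((I[:, None] - I[None]) ** 2).sum(-1)   # squared spectral distances
    G = w * np.exp(-dd / (2 * theta_a ** 2) - II / (2 * theta_b ** 2))
    P = p.reshape(C, n)
    # sum_c p_c^T G (1 - p_c), normalized by the number of pixels
    return sum(P[c] @ G @ (1.0 - P[c]) for c in range(C)) / n

def composite_loss(p, labels, image, lam=0.02):
    """L = L_pce + lambda * L_CRF."""
    return partial_ce(p, labels) + lam * crf_loss(p, image)
```

The CRF term is small when pixels that are close in space and similar in spectrum receive similar class probabilities, which is exactly the spatial prior the sparse point labels cannot provide on their own.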
A quadratic relaxation version of the standard Potts/CRF model adapted for FCNs is adopted as the energy formulation [54]

$$ E(p) = \sum_{c=1}^{C} \sum_{i,j \in \Phi} G(i,j)\, p_i^c \left( 1 - p_j^c \right) $$

where $\Phi$ represents the set of pixels in the HSI patch, $C$ represents the number of categories, $p$ is the Softmax output of the network, and $G$ is a matrix of pairwise discontinuity costs, which is expressed as

$$ G(i,j) = W \exp\left( -\frac{\lVert d_i - d_j \rVert^2}{2\theta_\alpha^2} - \frac{\lVert I_i - I_j \rVert^2}{2\theta_\beta^2} \right) $$

where $G(i,j)$ makes use of the appearance kernel of [58], $W$ represents the normalized weight, which can be found by combining high-dimensional filtering and expectation maximization [56], $d_i$ is the spatial position of pixel $i$, $I_i$ is the spectral value of pixel $i$, and $\theta_\alpha$ and $\theta_\beta$ are hyperparameters that control the scale of the Gaussian kernels. Hence, the CRF-loss calculation is expressed as

$$ L_{\mathrm{CRF}} = \frac{1}{\lvert \Phi \rvert} E(p). $$

III. EXPERIMENTS

A. Dataset Description
The experiments are conducted on three publicly available benchmark HSI datasets. The chosen datasets contain different types of landscape patterns to ensure the richness of the validation area.
1) Wuhan UAV-Borne Hyperspectral Image (WHU-Hi) Dataset: The dataset was acquired in farming areas with various crop types in Hubei Province, China, via a Headwall Nano-Hyperspec sensor mounted on a UAV platform [5]. WHU-Hi-HanChuan (WHHC) and WHU-Hi-HongHu (WHHH) were selected from WHU-Hi for the experiment. WHHC was acquired in Hanchuan with an 8-mm focal length imaging sensor equipped on a Leica Aibot X6 UAV V1 platform. The study area contains seven crop species: strawberry, cowpea, soybean, sorghum, water spinach, watermelon, and greens. There are 274 bands from 400-1000 nm, and the size of the imagery is 1217 × 303 pixels. The spatial resolution of the data is approximately 0.109 m. WHHH was acquired in Honghu City with a 17-mm focal length imaging sensor equipped on a DJI Matrice 600 Pro UAV platform. The study area is typical of the regions affected by fragmentation and is planted with 17 crop types, including cotton, rape, and cabbage. Notably, the region is planted with different cultivars of the same crop type, for example, Chinese cabbage/cabbage and Brassica chinensis/small Brassica chinensis. The size of the imagery is 940 × 475 pixels, there are 270 bands from 400-1000 nm, and the spatial resolution of the data is approximately 0.043 m.
2) Pavia Center (PC) Dataset: PC was acquired by the ROSIS sensor during a flight campaign over Pavia, northern Italy. The imagery of the PC dataset has 1096 × 715 pixels with 102 spectral bands.

4) Network Settings:
We adopt the batched stochastic gradient descent optimizer with momentum of 0.9 and weight decay of 0.0005 to train the network. The "poly" learning rate policy is utilized with the initial learning rate setting as 0.001. The network is trained for 50 epochs, and the batch size is set as 3 for all the experiments. The setting of the parameters for the CRF-loss follows [59]. The detailed settings of the network, such as the output channels of each ResBlock, are listed in Table IV.
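The "poly" policy decays the learning rate as base_lr · (1 − iter/max_iter)^power; a one-line sketch follows, where the power value of 0.9 is a common default assumed here rather than a setting stated in the article:

```python
def poly_lr(base_lr, cur_iter, max_iter, power=0.9):
    """'Poly' learning-rate policy: starts at base_lr and decays
    smoothly to zero at max_iter."""
    return base_lr * (1.0 - cur_iter / max_iter) ** power
```

In practice the returned value would be assigned to the optimizer's learning rate at each iteration (e.g., via a lambda-based scheduler), with base_lr = 0.001 as specified above.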

1) Comparison With Representative Methods:
In order to validate the effectiveness of the proposed method, we compare the LFCNet results on the three HSI datasets with seven representative methods, including five classification methods and two FCN-based methods with the patch-free training pattern. Specifically, the classification methods include SVM [11] and four state-of-the-art CNN-based classification methods, i.e., DHCNet [60], SSRN [37], CNNCRF [5], and SPRN [61]. The patch-free FCN methods for comparison are FPGA [48] and SSDGL [49]. It is noted that all the experiments share the same training and test samples.

TABLE II: Number of training and test samples for the WHU-Hi-HongHu dataset.
TABLE III: Number of training and test samples for the Pavia Center dataset.
TABLE IV: Configuration details of the LFCNet.

As illustrated in the tables, the proposed LFCNet outperforms the other methods on all three datasets. LFCNet surpasses the second-highest method, SSDGL, by approximately 0.5% in OA, 0.006 in Kappa, and 0.6% in AA on the WHHC dataset, and by 0.7% in OA, 0.01 in Kappa, and 0.3% in AA on the WHHH dataset. For the PC dataset, LFCNet outperforms the other FCN methods by a relatively large margin, surpassing the second-highest method, FPGA, by approximately 0.5% in OA, 0.007 in Kappa, and 0.6% in AA. The accuracy superiority over several classification methods, e.g., SPRN, is less apparent because the accuracies of these classification methods on this dataset have already attained a high level (larger than 99.7% in terms of OA). Besides, we speculate that another reason is that a larger receptive field does not bring in more spatial information due to the large-area blanks in the labels of this dataset. Through the visualization of the classification results of the above methods, we can go a step further to understand and show the advantage of the proposed patch-based FCN. Overall, the FCN methods have better and smoother visual performance than the classification methods owing to the suppression of the obvious salt-and-pepper noise.
For the WHHH dataset, which is characterized by class imbalance, LFCNet is able to precisely capture the large-area categories and delicately extract the small-area categories. For the WHHC dataset, characterized by a large number of edges, the results from LFCNet possess clear boundaries. For the PC dataset, in which many unlabeled areas exist, LFCNet is able to extract abundant features and obtain accurate results. To sum up, among the three FCN methods, the results of the proposed LFCNet have clearer edges and show a better performance in capturing thin and broken areas.
2) Sensitivity Analysis of LFCNet: To demonstrate the effectiveness of each component of the patch-based FCN framework, a number of experiments are conducted for sensitivity analysis.
In the proposed framework, we designed an efficient and effective network for HSIC with the aid of AG and GCT to explore the abundant spectral signatures and spatial context of HSI. To verify the effectiveness of the modules in LFCNet, we set the network without AG and GCT as the baseline and compare the performance of the networks with AG and with both AG and GCT against the baseline. Moreover, we compare the performance of LFCNet with and without CRF-loss to demonstrate the effectiveness of the further constraint. All the abovementioned experiments are conducted on the WHHC dataset. It is noted that the weight of CRF-loss is set as 0.02 here, and the overlay inference is not adopted. As shown in Table VIII, the application of both AG and GCT improves the performance. When only AG is used, all the evaluation metrics increase by approximately 0.3% compared to the baseline. Furthermore, the addition of GCT further boosts the performance, resulting in an OA improvement from 98.18% to 98.30% and a Kappa improvement from 0.9787 to 0.9801. Hence, the application of AG and GCT is proven to be effective. The classification maps on WHHC are presented in Fig. 7 to visually illustrate the effect of each component. Overall, with the addition of different components, the classification maps gradually perform better visually. As shown in the red rectangle in the WHHC map, which covers a complex distribution of different land covers, the addition of AG brings more accurate capture of fragmented and sparse small regions. In addition, GCT and CRF-loss further boost the performance. Finally, the classification map trained with CRF-loss has a more complete result and more accurate shapes than the other network structures.
Since the classification maps are obtained by patch-based inference, the phenomenon of seam stitching occurs, as shown in the black circle of Fig. 8(a). To eliminate the effect of seam stitching, an overlay strategy for inference is adopted. Here, we set the overlay ratio of the inference patch as 50%. Experiments on all three datasets with and without overlay inference are conducted to verify the effectiveness of the inference overlay strategy.

TABLE IX: Quantitative comparison of the impact of inference overlay strategies on the WHHC, WHHH, and PC datasets.

As given in Table IX, the usage of the inference overlay brings improvements of 0.3 and 0.25 percentage points in the Kappa of WHHC and WHHH, respectively. However, it does not improve the accuracies on PC, which can be attributed to the large unlabeled areas of PC, so that no obvious accuracy change results.
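The overlay inference described above reduces to accumulating per-class probabilities over the image and dividing by the per-pixel patch count before taking the argmax. A compact sketch, with patch coordinates assumed to be the top-left corners produced by the 50%-overlap sliding crop:

```python
import numpy as np

def stitch_overlapped(prob_patches, coords, image_hw, num_classes):
    """Average per-class probabilities of overlapping inference patches.

    prob_patches: list of (C, s, s) softmax outputs; coords: top-left
    (row, col) of each patch; returns the (H, W) argmax label map."""
    H, W = image_hw
    acc = np.zeros((num_classes, H, W))
    cnt = np.zeros((H, W))
    for p, (r, c) in zip(prob_patches, coords):
        s = p.shape[1]
        acc[:, r:r + s, c:c + s] += p
        cnt[r:r + s, c:c + s] += 1
    acc /= np.maximum(cnt, 1)  # average where patches overlap
    return acc.argmax(axis=0)
```

Averaging the probabilities, rather than the hard labels, lets every overlapping patch vote with its confidence, which is what suppresses the visible seams.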
The composite loss function influences the optimization of the network. CRF-loss brings a further constraint on training by making use of the spatial connections between pixels, which is proven to be effective by the abovementioned experiments on WHHC. As given in Table X, the addition of CRF-loss for training also improves the performance on the WHHH and PC datasets, which illustrates the effectiveness and universality of the composite loss.

TABLE X: Accuracy comparison of the impact of the weight of CRF-loss on the WHHC, WHHH, and PC datasets.

Specifically, the parameter λ is introduced to control the proportion of the cross-entropy loss and the CRF-loss. Hence, experiments with different weights are conducted to find the most appropriate proportion for the two kinds of loss. Table X shows the performance on the three HSI datasets under different weights, where the weight ranges from 0 to 0.06 in 0.01 intervals. For the WHHC and PC datasets, all the evaluation metrics achieve the highest values when the weight is set as 0.02 and 0.01, respectively. For the WHHH dataset, AA achieves the highest value when the weight is set as 0.02, but OA and Kappa achieve the best values when the weight is set as 0.04. However, the differences between the accuracy at a weight of 0.02 and the best accuracies of WHHH and PC are very small. It can be concluded from the table that a weight of 0.02 performs well on all three datasets. Hence, we uniformly set λ as 0.02 for all three datasets.
Since we adopt patch-based samples to train the network, experiments with different patch sizes on the WHHC dataset are conducted to study the influence of patch size. The patch size is set from 50 × 50 to 300 × 300 in intervals of 50. It should be noted that the patch size for inference is equal to the training patch size. As shown in Table XI, the performance of LFCNet improves as the patch size increases from 50 × 50 to 200 × 200, but the accuracy begins to decline when the patch size surpasses 200. Hence, we set the patch size to 200 × 200.
In order to study the impact of the number of training samples on the proposed framework, we conducted extensive experiments on the WHHC dataset. The results of four methods, LFCNet, FPGA, SSDGL, and SPRN, are shown in Fig. 9, where the number of training samples per class is set from 25 to 150 in intervals of 25. Overall, the performance of the four methods improves with an increasing number of training samples per class. It can be concluded from the figure that the accuracy gains diminish as the number of training samples increases; after the training samples exceed 100 per class, the increase in the evaluation metrics becomes very small. Generally, LFCNet outperforms the other three methods at all levels of training sample numbers, especially for small numbers, e.g., 50, demonstrating its effectiveness. When the training samples per class are set to 50, which is only approximately 0.6% of the total samples, the proposed LFCNet still achieves a high accuracy of 97.64% OA, 0.9723 Kappa, and 96.93% AA. Hence, the proposed framework is also robust when only limited training samples are used.

1) Classification Network or Fully Convolutional Network?:
Through the comparison in Section III-C, the FCN methods achieve higher accuracies than the classification methods. To further understand the difference between FCN and classification networks, the classification method with the best performance on all three datasets, SPRN, is compared with the proposed LFCNet. The partially enlarged classification maps of WHHH and WHHC are shown in Fig. 10 for an intuitive visual comparison. In addition to the higher accuracy, the results of LFCNet are smoother and have more precise boundaries than those of SPRN. As shown in Fig. 10, unlike SPRN, which produces considerable salt-and-pepper noise within a region, LFCNet maintains unity within a region. This can be attributed to FCN capturing more spatial context, since the input patch sizes of FCN are much larger than those of the classification network. Furthermore, a comparison of the inference speed between SPRN and LFCNet is conducted to show the efficiency difference. In theory, classification networks need to infer pixel by pixel, whereas FCNs predict the result of one patch at a time. Table XII lists the comparison of inference speed on the WHHC, WHHH, and PC datasets. The inference speed of LFCNet is approximately 40 times faster than that of SPRN on all three datasets, which proves the superiority of FCN in time efficiency.
2) Patch-Free or Patch-Based Network?: As presented in Section III-C, the proposed patch-based FCN surpasses the patch-free FCNs in accuracy on all three datasets. Essentially, the patch-free pattern changes the point labels at every training iteration but preserves the spatial context of the whole image, while the patch-based pattern does the opposite. For a fair comparison, we conduct experiments with different networks under different training patterns to show the effectiveness of the proposed patch-based method on the WHHC, WHHH, and PC datasets. Here, the fully convolutional network proposed in FPGA, named FreeNet [48], is trained in the patch-free and patch-based frameworks, respectively, for comparison with the proposed patch-based training of LFCNet. As shown in Table XIII, on all three datasets, training FreeNet in the patch-based pattern outperforms training it in the patch-free pattern but remains inferior to the patch-based training of LFCNet, which further demonstrates the effectiveness of the patch-based training framework across different FCNs. Furthermore, many strategies and techniques from remote sensing image semantic segmentation could be adopted in the future, given the universality of the patch-based pattern.
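The core of patch-based training with sparse point labels is that the loss must be computed only at the few labeled pixels inside each cropped patch, ignoring everything else. The following NumPy sketch of such a masked cross-entropy is an illustrative assumption, not the exact loss used in LFCNet; the `ignore_index` convention for unlabeled pixels is borrowed from common segmentation practice.

```python
import numpy as np

def masked_cross_entropy(logits, labels, ignore_index=-1):
    """Cross-entropy averaged only over labeled pixels.

    logits: (H, W, C) class scores for one patch.
    labels: (H, W) integer map with ignore_index at the (many)
            unlabeled pixels -- i.e., sparse point supervision.
    """
    # Numerically stable log-softmax over the class axis.
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    mask = labels != ignore_index
    if not mask.any():          # patch contains no labeled points
        return 0.0
    # Pick log p(true class) at each labeled pixel only.
    picked = log_probs[mask, labels[mask]]
    return float(-picked.mean())

# Toy patch: uniform scores over 4 classes, two labeled points.
logits = np.zeros((3, 3, 4))
labels = np.full((3, 3), -1)
labels[0, 0], labels[1, 2] = 2, 0
loss = masked_cross_entropy(logits, labels)  # log(4) for uniform logits
```

Because unlabeled pixels contribute no gradient, cropping patches does not require dense annotation; this is what lets the standard segmentation training loop run under point supervision.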
3) Applicability: The effectiveness of the proposed patch-based training framework has been demonstrated qualitatively and quantitatively. As shown in Fig. 11, as the image size changes, the OA of the proposed method remains relatively stable, while the performance of FPGA fluctuates greatly. The accuracy gap between the two methods shows an upward trend with increasing image size. According to the analysis of the sample distribution, as the size of the HSI increases, the sparsity of the samples relative to the whole image increases for an unchanged sample number, which may cause the degradation of the patch-free method on large images. By contrast, the patch-based method is not sensitive to sample sparsity for a fixed sample number. Hence, when applying a patch-free method for HSIC, one needs to consider the sparsity of the training samples, which depends on both the number of training samples and the image size. When applying a patch-based method, the sparsity of the training samples can be disregarded, but whether the number of training samples is sufficient must still be considered.
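The sparsity argument above is simple arithmetic: for a fixed label budget, the fraction of labeled pixels shrinks quadratically with image side length. The numbers below (50 points per class, 9 classes, three image sizes) are hypothetical and chosen only to make the trend concrete.

```python
# Label density (fraction of labeled pixels) for a fixed sample
# budget across growing image sizes -- illustrative numbers only.
n_samples = 50 * 9  # assumed: 50 points per class, 9 classes
sides = (256, 512, 1024)
densities = []
for side in sides:
    density = n_samples / (side * side)  # labeled pixels / total pixels
    densities.append(density)
    print(f"{side}x{side}: {density:.4%} of pixels labeled")
```

Quadrupling the pixel count at each step divides the label density by four, which is why a patch-free method trained on the whole image sees increasingly sparse supervision while a patch-based method, cropping around the labeled points, does not.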

IV. CONCLUSION
In this article, a novel patch-based training framework of FCN on sparse point labels with a lightweight and efficient network is proposed for HSIC. As a general training mode for remote sensing image semantic segmentation, the training framework is improved in network structure and training constraint in order to achieve good classification results under the limited supervision of sparse point labels. A lightweight network that guarantees the convergence of training is proposed with the aid of AGs and gated channel transformation, on the basis of a shallow-channel encoder-decoder with residual connections, to excavate the spectral-spatial information of HSI. CRF-loss is adopted as a complementary constraint for further excavation of spatial context, making up for the insufficiency of prior information from sparse point labels in the training stage. In extensive experiments on three HSI datasets, the proposed method achieves state-of-the-art performance compared with other types of methods, which demonstrates the effectiveness of the patch-based training framework of FCN with sparse point labels. In addition to classification accuracy, the advantages of the proposed method were also demonstrated in terms of efficiency and stability in dealing with images of various sizes.
In the future, we would like to explore more universal network structures for HSIC in patch-based training patterns with limited point labels. Furthermore, we hope that the proposed method can be used in the practical application of hyperspectral classification and even extended to multispectral remote sensing images.
Xueliang Zhang (Member, IEEE) received the B.S. degree in geographical information system and the Ph.D. degree in remote sensing of resources and environment from Nanjing University, Nanjing, China, in 2010 and 2015, respectively.
From 2014 to 2015, he was a visiting student with the Informatics Institute, University of Missouri, Columbia, MO, USA. From 2016 to 2018, he was an Associate Researcher with the Department of Geographic Information Science, Nanjing University. He is currently an Associate Professor with the Department of Geographic Information Science, Nanjing University. His research interests include high-resolution remote sensing image analysis, semantic segmentation, and deep learning for remote sensing.

Zixian Zheng received the B.S. degree in geographic information science from Sun Yat-sen University, Guangzhou, China, in 2019, and the M.S. degree in cartography and geographic information system from Nanjing University, Nanjing, China, in 2022.
Her research interests include semantic segmentation and deep learning for remote sensing.