Land Use Classification of High-Resolution Multispectral Satellite Images With Fine-Grained Multiscale Networks and Superpixel Postprocessing

Land use recognition from multispectral satellite images is fundamentally critical for geological applications, but the results remain unsatisfactory. The scale dimension of current multiscale learning is too coarse to account for the rich scales in multispectral images, and pixel-wise classification tends to produce “salt-and-pepper” labels due to misclassification in heterogeneous regions. In this article, these issues are addressed by proposing a new pixel-wise classification model with finer scales for convolutional neural networks. The model is designed to extract multiscale contextual information using multiscale networks at a fine-grained level, addressing the issue of insufficient multiscale learning for classification. Furthermore, a small-scale segmentation-combination method is introduced as a postprocessing solution to smooth fragmented classification results. The proposed method is tested on GF-1, GF-2, DEIMOS-2, GeoEye-1, and Sentinel-2 satellite images, and compared with six neural-network-based algorithms. The results demonstrate the effectiveness of the proposed model in identifying objects with large differences in scale, improving classification accuracy, and reducing classified fragments. The discussion also illustrates that convolutional neural networks and pixel-wise inference are more practical than transformers and patch-wise recognition.

I. INTRODUCTION
Classical methods, such as k-means clustering, iterative self-organizing data analysis technique (ISODATA) clustering, decision tree, random forest, and support vector machine, perform pixel-to-pixel classification with one-dimensional spectral information. Newer classification algorithms are based on neural networks and utilize both spectral information and two-dimensional structures. Distinguished by the output form, they can be categorized into two groups, namely pixel-wise and patch-wise. Patch-wise classification is essentially semantic segmentation, as the input and the output are both patches. Pixel-wise classification applies scene classification pixel by pixel: the input is a pixel or patch, while the output is a single class value. The input/output differences between the classification methods are presented in Fig. 1.
Fig. 1. Land use classification from remote sensing images. Green denotes the pixel-to-pixel classification, blue denotes the patch-to-pixel (pixel-wise) classification, and red denotes the patch-to-patch (patch-wise) classification.
Pixel-wise classification will be carried out in this article. Patch-wise classification, or semantic segmentation, has been carried out in numerous works. For example, Liu et al. [1] used multiple images for training in context aware cascade network and expanded the training set using operations, such as overlapping block taking, rotation, and mirroring. With the context-aware encoder network, Liang et al. [2] presented a patch sampling strategy, in which manual intervention is needed to maximize the separation of training and validation sets. The accuracy of segmentation is usually lower than that of classification. In addition, patch-wise classification uses the comprehensively labeled patches matched with the input patches, which requires the labeled data to be as continuous and complete as possible. In contrast, pixel-wise classification requires only discrete or irregular labels. Therefore, pixel-wise classification is more suitable to the application needs of remote sensing classification.
This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/
Pixel-wise classification tends to produce "salt-and-pepper" results. The derived land cover maps are too fragmented to be incorporated into a geographic information system database. The cause is the limited patch size, which forces each label to be judged locally. Although a small patch is effective for homogeneous regions, it leads to misclassification in heterogeneous regions due to the lack of training data, illegible features, or vague classes. Therefore, fragmentation needs to be avoided when pixel-wise classification is used.
High-resolution multispectral remote sensing images offer valuable spatial contextual information that can enhance classification accuracy through the acquisition of local multiscale features. Derived from the wavelet transform, multiscale learning is the sampling of different scales of a signal or image. Smaller scales can exhibit structures and textures, while larger scales focus more on the spectral features. In convolutional neural networks (CNNs), convolutional layers and residual modules that are cascaded with different kernel sizes can be used for multiscale learning, as has been proposed with different levels of semantic information or various receptive fields. Li et al. [3] presented an adaptive multiscale deep fusion residual network (AMDF) that uses the information contained in shallow features to mine multiscale features with different levels of semantic information. In contrast, Liu et al. [1] and Hua et al. [4] aimed to perceive boundaries, regions, and semantic categories of the target objects by learning features on multiscale receptive fields. Because the integration of shallow and deep features must reconcile differences in size and number of channels, multiscale features from various receptive fields have more potential. In addition, Hang et al. [5] proposed a multiscale progressive network that gradually segments objects into small scale, large scale, and other scales by cascading three subnetworks.
However, the scale dimension of the current multiscale learning for pixel-wise classification is too coarse to account for the rich scales in multispectral images. Detail emerges in high-resolution multispectral images when the spatial resolution approaches 1 m, accompanied by more complex spectral features of ground objects [6]. In very high-resolution remote sensing images, land use objects are larger than a single pixel, and the phenomena of "same object with different spectrums" and "same spectrum for different objects" become more prevalent. Multiscale feature learning at a finer granularity is, therefore, needed to deal with these challenges.
This article focuses on the two issues mentioned above. A new pixel-wise classification model is designed to extract the multiscale contextual information in images with multiscale networks at a fine-grained level to address the issue of insufficient multiscale learning for the classification of high-resolution multispectral images. A superpixel combination technique is proposed as a postprocessing solution to smooth the fragmented classification results.
The main contributions of this work are summarized as follows.
1) To achieve fine-grained multiscale learning in CNN, new downsampling and residual modules are proposed for the classification task of high-resolution satellite images.
2) To improve the fragmentation of classification results, a small-scale segmentation method is introduced for postprocessing to combine labels across class boundaries.

II. RELATED WORK
In this section, classical, neural network-based, and object-oriented classification methods are introduced. Classical and neural network-based classification methods are pixel-wise: they output only one label, for the center pixel of the input patch. Object-oriented classification is a typical representative of patch-wise classification, which outputs labels for all pixels in a patch.

A. Traditional Pixel-Level Classification
Unsupervised classification methods can directly classify image pixels based on gray-scale spatial features and are suitable for feature classification scenarios with simple prior knowledge. Currently, clustering techniques such as k-means and ISODATA are often used for unsupervised classification. Increasingly sophisticated unsupervised classification methods have been developed to extract an appropriate group of features for the classification of remote sensing images more efficiently. For example, to achieve accurate classification of remote sensing images, Marinoni and Gamba [7] proposed an unsupervised approach for feature extraction based on data-driven discovery, which exploits mutual information maximization to retrieve the most relevant features with respect to information measures. Huang et al. [8] proposed a multiview subspace clustering model, which effectively exploits the rich information from multiple features extracted either from a single data source or from multiple sources. Because they lack the necessary a priori knowledge, unsupervised classification methods can only distinguish different categories and cannot determine the attributes of the categories.
By learning data relationships from a given training set containing ground truth information, supervised classification methods are more suitable than unsupervised classification methods for remote sensing images with complex ground object types, and usually have higher classification accuracy. There are two primary groups of supervised classification methods. The minimum distance method and the maximum likelihood method are two prominent examples of the first group of supervised classification methods based on statistical models. The minimum distance method, which classifies pixels based on how far they are from the center of each category, is a relatively simplified classification method. The maximum likelihood method is a nonlinear classification based on Bayesian criterion with minimal probability of classification error.
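As an illustration of the minimum distance rule described above, the following Python sketch computes one spectral center per category from labeled pixels and assigns a new pixel to the nearest center. The function names, the Euclidean distance, and the toy four-band reflectance values are assumptions made for the example; they are not taken from any of the cited methods.

```python
import math


def class_centers(samples, labels):
    """Return the mean spectral vector of each class from labeled pixels."""
    sums, counts = {}, {}
    for vector, label in zip(samples, labels):
        acc = sums.setdefault(label, [0.0] * len(vector))
        for i, value in enumerate(vector):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [s / counts[label] for s in acc] for label, acc in sums.items()}


def classify_min_distance(pixel, centers):
    """Assign the pixel to the class whose center is closest (Euclidean)."""
    return min(centers, key=lambda label: math.dist(pixel, centers[label]))


# Toy training pixels with four bands (blue, green, red, near infrared).
train = [
    [0.10, 0.10, 0.10, 0.60],
    [0.10, 0.20, 0.10, 0.50],
    [0.05, 0.10, 0.05, 0.02],
    [0.05, 0.08, 0.05, 0.04],
]
train_labels = ["vegetation", "vegetation", "water", "water"]
centers = class_centers(train, train_labels)
```

Because each pixel is judged only by its own spectrum, neighboring pixels are labeled independently of one another.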
The second group comprises supervised classification methods based on machine learning, mainly including decision tree, random forest, support vector machine, neural network-based classification, and object-oriented classification. Decision tree classification compares the eigenvalues of pixels with a set baseline value in a hierarchical manner. The classification and regression tree model, proposed by Breiman et al. in 1984, is a widely used decision tree classification method. Random forest is an ensemble classification model proposed by Breiman [9] in 2001. To address the problem of noise in training data, Gislason et al. [10] used random forest to classify multispectral images. The support vector machine is based on the structural risk minimization criterion, which maximizes the generalization ability of the classifier while minimizing the sample classification error, and it has strong nonlinear and high-dimensional data processing capability. A fuzzy support vector machine-based multispectral image classification method was proposed by Wang and Ma [11] and achieves higher accuracy than the direct use of a support vector machine.
These classical algorithms belong to pixel-wise classification. Their advantages are fast training or unsupervised operation. However, they utilize only the spectral information of images and cannot utilize the spatial contextual information. With the emergence of a large amount of detail in high-resolution multispectral remote sensing images and the growing complexity of spectral features, the classification accuracy of these methods is poor. Moreover, since the classification is performed pixel by pixel, the categories assigned to neighboring pixels are decided independently of one another. Misclassification is prone to occur in regions of ground object category transition or feature ambiguity, resulting in fragmented classification results.

B. Neural Network-Based Pixel-Wise Classification
Neural network-based classification contains many different approaches. The multilayer perceptron neural network based on error back propagation (BP) was the first algorithm introduced for remote sensing image classification. In recent research, deep learning-based classification methods, which can be seen as a development of neural networks, have been widely used. The deep belief network (DBN) achieves image classification by unsupervised pretraining on unlabeled data and supervised fine-tuning on labeled data. Liu et al. [12] calculated the texture features of high-spatial resolution remote sensing images through the nonsubsampled contourlet transform, and used DBN to classify images based on spectral and texture features. Subsequently, Zhong et al. [13] developed a new diversified DBN by regularizing the pretraining and fine-tuning procedures with a diversity-promoting prior over latent factors. Chen et al. [14] proposed a stacked autoencoder (SAE) to extract high-level features for remote sensing images using spectral-spatial information. Chen et al. [15] used a stacked denoising autoencoder to extract features, and used a logistic regression approach in the top layer of the network to perform supervised fine-tuning and classification. However, the inputs of these models are in vectorized form, which may ignore the neighborhood structures around pixels.
CNN models allow the use of spatial patches as data input, providing a natural way to integrate spatial contextual information with higher classification accuracy compared to BP networks, DBN, and SAE. Based on this, Maggiori et al. [16] designed a fully convolutional architecture for the dense classification of remote sensing images and addressed the issue of imperfect training data through a two-step training approach. Ji et al. [17] proposed a novel three-dimensional CNN-based method that automatically classifies crops from spatio-temporal remote sensing images. Liu et al. [18] proposed an end-to-end learning framework based on deep multiple instance learning, using CNN and SAE to extract the spatial features of panchromatic images and the spectral features of multispectral images, respectively. In recent years, the development of CNNs has brought continuous breakthroughs in multispectral image classification. To address the exploding gradients, vanishing gradients, and nonconvergence brought by deeper networks, ResNet [19] used residual learning and skip connections, accepting greater network depth in exchange for higher classification performance. On the other hand, some research has focused on improving the computational efficiency of networks, such as LinkNet [20], which is lightweight yet performs well.
Recent advancements in deep learning networks have significantly improved the extraction of discriminative features from remote sensing data. However, the performance bottleneck in identifying and recognizing objects of interest when only using single satellite data has become increasingly evident. To overcome this limitation, multimodal networks have been proposed and used for remote sensing images. These networks combine multiple sources of data to improve classification accuracy and obtain better results than only using single satellite data source. For example, Gadiraju et al. [21] developed a multimodal deep learning framework for crop classification using multispectral and multitemporal satellite images. Similarly, Hang et al. [22] proposed an unsupervised feature learning model to extract features by using the relationship between hyperspectral and light detection and ranging data. These studies demonstrate the potential of multimodal networks to improve the accuracy and efficiency of remote sensing image analysis. However, the classification of a single multispectral image still has the greatest universality in terms of the burden of data preparation.
Neural network-based pixel-level classification methods can obtain higher classification accuracy than the classical methods. Furthermore, CNN-based pixel-level classification adopts the patch-to-pixel approach, which utilizes both the spatial contextual information and the spectral information of images, and achieves a high classification accuracy. However, patch-to-pixel classification still produces the same fragmented classification results as the classical methods and has a slower training speed.

C. Object-Oriented Classification
Object-oriented classification is a new remote sensing image classification method that emerged for high-resolution remote sensing image applications. Object-oriented classification takes regions containing similar semantic information as the processing objects for classification, and can use not only the spectral features of images, but also the geometric features, texture features, adjacency relationships, and other spatial features of images. Image segmentation is used in object-oriented classification, where the image to be classified is segmented to generate image objects. Then, the image objects are classified using methods, such as nearest neighbor classification or fuzzy classification.
On the basis of object-oriented classification method, Zhang et al. [23] extracted Zhangjiangkou mangrove communities from QuickBird images with segmentation, merge, computing, and attributes selection. Mirzapour and Ghassemian [24] proposed an object-based method for multispectral image segmentation and classification. In addition Jin et al. [25] presented a method that combines object-oriented approach with deep CNNs. Baroud et al. [26] also proposed an artificial neural network combined with object-oriented method for land cover classification of high-resolution multispectral remote sensing images.
Object-oriented classification can guarantee continuous classification results and is faster than pixel-wise classification methods. However, the classification accuracy of object-oriented classification is limited by the segmentation accuracy. The existing segmentation algorithms have limited accuracy, which is significantly lower than the classification accuracy, making the overall classification accuracy of object-oriented classification slightly worse.
Although the transformer shows advantages in many applications, CNN is more economical in terms of training data and computational burden. CNN focuses mainly on local features, while the cross attention in the transformer can capture global similarity over a larger receptive field. However, the attention mechanism contributes little to pixel-wise classification, which uses only local features. Instead, it introduces three disadvantages. First, a very large amount of training data is required to achieve the same performance as CNN. Second, far more GPU memory is used to train a transformer than a CNN. Lastly, the computational effort of a transformer is usually larger than that of a CNN. Since a portion of the image to be classified has to be manually labeled to pursue the best accuracy, pixel-wise classification needs to be trained online, which favors a smaller amount of labeled data. In terms of computational volume, a remote sensing image commonly has more than 100 million pixels, so performing scene classification for each pixel is a huge burden. Global features can also be captured by a CNN with an attention module, but the abovementioned disadvantages cannot be avoided. The combination of a pretrained transformer and transfer learning can reduce the amount of training data, but the consumption of memory and computation is still greater than that of CNN.

III. METHODOLOGY
In this section, new solutions are proposed for land use classification of high-resolution multispectral satellite images. First, a new pixel-wise classification network is created, consisting of downsampling blocks and multiscale residual blocks (MRBs), to capture multiscale contextual information. The MRBs extract multiscale features at a fine-grained level. Second, a postprocessing module is designed that uses superpixels derived from small-scale image segmentation to refine the pixel-wise classification results by stitching the fragments together. The new model is named the fine-grained multiscale classification network, or FGMCN. The superpixel postprocessing is abbreviated as SPP. The whole method is called FGMCN-SPP.

A. Multiscale Residual Network
There is rich scale diversity in high-resolution satellite images. Woodlands and water can be identified point by point through normalized difference indices. Cultivated land has to be identified over an area. Artificial buildings are identified over larger and more diverse patch sizes. When secondary classification is involved, the diversity of scales is even more complex. However, the multiscale learning in land use classification tasks can only learn features at given scales. Therefore, when plenty of computational resources and training data are available, the classification accuracy can be improved by designing extractors at as many diverse scales as possible.
The proposed network structure is shown in Fig. 2. At the beginning of the network, a batch normalization (BN) layer is first used to normalize the input data. A 3 × 3 convolutional layer is then applied to the normalized image blocks to extract shallow features. Four MRBs are alternately cascaded with four downsampling blocks (DSBs) to gradually extract high-level features. The separation of downsampling and feature blocks ensures that more features can be learned individually. Multiscale features are learned with the newly designed MRBs. After the feature learning, the high-level feature map is downsampled with a 2 × 2 average pooling with stride 1. Next, a fully connected layer and a softmax activation layer are employed to convert the multichannel feature map into a multiclass output for pixel classification. The dropout operation is applied after the average pooling to reduce possible overfitting.
The network parameters are listed in Fig. 2. A patch with a size of w × w is created as the feature region centered at each labeled pixel. Therefore, the actual size of the input data is $I \in \mathbb{R}^{w \times w \times B}$, where B is the number of channels of the input image. The suggested size of the input patch for satellite images with spatial resolution greater than 1 m is 27 × 27. The number of output channels of the first convolutional layer is set to 64, and the numbers of output channels of the subsequent blocks are set to 128, 128, 256, 256, 512, 512, 1024, and 1024, respectively. The dropout probability is 0.5. The scale of the training data was taken into consideration when choosing these settings.
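To make the layer arrangement concrete, the following Python sketch traces the spatial size and channel count of a feature map through the network described above. It is a bookkeeping aid under stated assumptions, not the authors' implementation: the first convolution and the stride-2 downsampling are assumed to use "same" padding, so each DSB maps a size w to ceil(w/2), the final average pooling is assumed to use no padding, and the four input bands correspond to blue, green, red, and near infrared. The function name `trace_shapes` is hypothetical.

```python
import math


def trace_shapes(w=27, bands=4, first_channels=64, num_stages=4):
    """List (layer name, spatial size, channels) for each stage of the network."""
    shapes = [("input", w, bands)]
    size, channels = w, first_channels
    shapes.append(("conv3x3", size, channels))  # assumed 'same' padding
    for stage in range(1, num_stages + 1):
        size = math.ceil(size / 2)  # stride-2 downsampling, assumed 'same' padding
        channels *= 2               # each DSB doubles the channel count
        shapes.append((f"DSB{stage}", size, channels))
        shapes.append((f"MRB{stage}", size, channels))  # MRB keeps size and channels
    size = size - 2 + 1  # 2 x 2 average pooling, stride 1, assumed no padding
    shapes.append(("avgpool2x2", size, channels))
    return shapes


shapes = trace_shapes()
for name, size, channels in shapes:
    print(f"{name:>10}: {size} x {size} x {channels}")
```

Under these assumptions, a 27 × 27 patch shrinks to 14, 7, 4, and 2 pixels per side across the four DSBs, and the 2 × 2 average pooling leaves a single 1024-channel vector for the fully connected layer.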
1) Downsampling Block: Existing classification networks use either convolution or pooling for downsampling. Max pooling picks salient responses out of commonplace information, which improves the nonlinear representation of the network. Cascaded downsampling convolution maintains the indistinctive features instead of the salient features. Average pooling smooths out salient features, too. However, there are both distinctive and indistinctive features in high-resolution satellite images, as the ground resolutions are roughly between 2 and 30 m. The former are beneficial for identifying artificial facilities, while the latter can be used to identify natural resources such as forest land and water. Since using a single downsampling method may lose features, the complementarity of the two downsampling methods is harnessed to design the downsampling module. Fig. 3 depicts the three parallel branches that make up the downsampling procedure. The network automatically searches for branches that are appropriate for various structures. After the first convolution layer, the input of the first DSB is $I_1 \in \mathbb{R}^{w \times w \times N}$. The left branch uses a 3 × 3 convolution layer with stride 2 to transform $I_1$ into $I_{D\_L} \in \mathbb{R}^{\frac{w}{2} \times \frac{w}{2} \times \frac{N}{2}}$ for feature extraction and downscaling. BN follows the convolution to avoid gradient vanishing. The calculation can be formalized as
$$I_{D\_L1} = \mathrm{Conv}(I_1) = \omega * I_1 + b$$
$$I_{D\_L} = \frac{I_{D\_L1} - E(I_{D\_L1})}{\sqrt{\mathrm{Var}(I_{D\_L1}) + \varepsilon}}$$
where Conv denotes the convolutional operator, $\omega$ and $b$ are the weight and bias of the convolution layer, respectively, $E(I_{D\_L1})$ and $\mathrm{Var}(I_{D\_L1})$ are the expectation and variance of $I_{D\_L1}$, respectively, and $\varepsilon$ is a small constant (i.e., 1e−5) to maintain stability. The middle branch uses a 1 × 1 convolution, a 3 × 3 convolution, and a 3 × 3 convolution with stride 2 to enlarge the receptive field and extract features at a wider scale. In the right branch, a 3 × 3 max pooling with stride 2 is used to obtain detailed texture information.
For the input feature map $I_1 \in \mathbb{R}^{w \times w \times N}$, the max pooling selects the maximum value of a specific area $R^{c}_{k,k}$ as its representation, where $1 \le c \le N$ and $1 \le k \le w$. The output of the DSB, $I_{DSB}$, is obtained by concatenating the outputs of the three branches. Our downscaling module is based on the Reduction A module from Inception V4 [27] but deliberately modified to suit classification applications. The number of output channels in the convolution part is reduced to half the number of input channels. The ReLU activation layer in the convolutional branch is removed to prevent feature loss. Finally, the two convolutional branches are concatenated with the max pooling branch, so that the convolutional part and the pooling part each contribute half of the output channels. As a result, the number of output channels is expanded to twice the number of input channels, while the scale is reduced to half of the input.
2) Multiscale Residual Block: The proposed MRB is depicted in Fig. 4. It incorporates the multilevel residual connection in [28] and two residual modules similar to Res2Net [29]. Four parallel branches make up the Res2Net-like residual module, which extracts features at four different scales. The network is expected to automatically discover the scale features that best suit the input image content. To obtain distinguishable receptive fields at a fine-grained level, the Res2Net module is introduced. By stacking convolutional layers, CNNs may learn coarse-to-fine multiscale features and increase the receptive field. The Res2Net module builds hierarchical residual connections within one single residual block, which can broaden the range of receptive fields. The multiscale residual module has been applied to extract multiscale convolution features of remote sensing images [30].
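The hierarchical connection inside a Res2Net-style module can be illustrated with a toy Python sketch. It follows the scheme of the original Res2Net design [29], in which the first channel group is passed through unchanged and each later group is added to the previous branch output before being transformed; the module proposed in this article differs slightly from that design, and its exact branch equations are not reproduced here. The per-channel numbers stand in for feature maps, and `branch_op` is a hypothetical placeholder for the 3 × 3 convolution of each branch.

```python
def res2net_style_split(channels, branch_op, scales=4):
    """Apply a Res2Net-style hierarchical split to a flat list of channel values."""
    if len(channels) % scales != 0:
        raise ValueError("channel count must be divisible by the number of scales")
    width = len(channels) // scales
    groups = [channels[i * width:(i + 1) * width] for i in range(scales)]

    outputs = [groups[0]]  # y1 = x1: the first group is passed through unchanged
    previous = None
    for x in groups[1:]:
        if previous is None:
            y = branch_op(x)  # y2 = K(x2)
        else:
            # yi = K(xi + y(i-1)): reuse the previous branch output
            y = branch_op([a + b for a, b in zip(x, previous)])
        outputs.append(y)
        previous = y
    # Concat(y1, ..., ys): flatten the branch outputs back into one channel list
    return [value for y in outputs for value in y]


def double(group):
    """Toy stand-in for a convolution: scale every value by two."""
    return [2 * value for value in group]


mixed = res2net_style_split([1, 2, 3, 4, 5, 6, 7, 8], double)
```

Because each branch also receives the output of the branch before it, later groups pass through more stacked operations, which is what gives the four branches different effective receptive fields.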
The structure of our residual block is slightly different from the Res2Net residual module. The input of the first residual module is $I_{DSB} \in \mathbb{R}^{\frac{w}{2} \times \frac{w}{2} \times 2N}$. The output of the 1 × 1 convolution in the first layer is $I_{M1-1} \in \mathbb{R}^{\frac{w}{2} \times \frac{w}{2} \times 2N}$, which is separated into four equal portions that are fed into the four branches separately. Let $x_i$ and $y_i$ represent the $i$th input and output parts, respectively. The four branches are combined to obtain the multiscale features $I_{M1-2} \in \mathbb{R}^{\frac{w}{2} \times \frac{w}{2} \times 2N}$:
$$I_{M1-2} = \mathrm{Concat}(y_1, y_2, y_3, y_4) = \{y_1, y_2, y_3, y_4\} \quad (6)$$
After that, the multiscale features are convolved and connected with the input of the first residual module to give the output of the first residual module. In addition, the channel quantities of the input and the output are the same to guarantee the feature learning capacity of the MRB. To avoid feature loss, the ReLU activation layer in the Res2Net residual block is removed. Finally, the output of the MRB is

B. SPP to Smooth Fragmented Labels
Our classification method is pixel-wise, that is, a patch is fed into the network and a single label is output, giving the class of the pixel at the patch center. Broken spots are observed in the results of pixel-wise classification methods. In particular, the classification results based on deep learning are not stable enough because training data are lacking for the mixed pixel features in cross-category transition zones. A small proportion of pixels in heterogeneous regions appear as "salt-and-pepper" noise in the output label images.
A postprocessing technique using small-scale segmentation is therefore suggested to address the label inaccuracy issue in heterogeneous regions. The patch size in pixel-wise classification is fixed, which leads to ambiguous category judgments for pixels in transition regions. In this case, the human vision system is accustomed to searching for salient borders before judging the category. Segmentation is, therefore, incorporated into classification as a processing method to mimic the experience of human eyes.
SPP is proposed in light of this. In parallel with the classification, small-scale segmentation is performed on the image to be classified. A superpixel is defined as the pixel set in a segmented image block. An image can be divided into multiple superpixels based on the similarity of feature, shape, or texture.
The superpixel segmentation results are fused with the classification outcomes. Taking the superpixel $S$ as an example, different categories $C_n$ account for different numbers of pixels in $S$, and their proportions satisfy
$$P(C_1) + \cdots + P(C_n) = 1 \quad (8)$$
where $P(C_n)$ denotes the proportion of category $n$ in $S$, that is, the number of pixels of category $n$ contained in $S$ divided by the total number of pixels in $S$. The procedure of label correction is analogous to voting. A superpixel is considered to be of a uniform class when the fraction of that class is dominant, whereas the remaining portion is more likely to carry wrong labels.
An image segmentation method is chosen to meet the demands of our task. Small-scale segmentation results are desired because transition zones are typically tiny in size. Hu et al. [31] proposed a stepwise evolution analysis (SEA) framework. In SEA, the evolution of scale, local variance (LV), and Moran's index (MI) is analyzed step by step, and four LV- and MI-metrics-based methods are technically integrated for automated scale parameterization. Later, in [32], they experimentally concluded that the optimal scale of an object-based image classification task is a range rather than a single value, and demonstrated the possibility of using the framework to automatically estimate the optimal classification scale. This technique is chosen as our segmentation tool because, with the right parameters, it can produce small block segmentation. Fig. 5 displays the results obtained with the suggested segmentation tool.
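The voting-style label correction of SPP can be sketched in a few lines of Python. The sketch assumes that the classification map and the superpixel map are given as equal-sized two-dimensional lists of integer ids. The text does not state a numerical criterion for a "dominant" class, so the threshold `min_fraction` (default 0.5) and the function name `smooth_labels` are assumptions introduced for illustration.

```python
from collections import Counter


def smooth_labels(labels, superpixels, min_fraction=0.5):
    """Relabel each superpixel with its dominant class, if one class dominates.

    labels and superpixels are equal-shaped 2-D lists holding, per pixel,
    the predicted class id and the superpixel id, respectively.
    """
    members = {}
    for r, row in enumerate(superpixels):
        for c, sp_id in enumerate(row):
            members.setdefault(sp_id, []).append((r, c))

    smoothed = [list(row) for row in labels]
    for pixels in members.values():
        counts = Counter(labels[r][c] for r, c in pixels)
        winner, votes = counts.most_common(1)[0]
        # votes / len(pixels) is P(C_n) for the most frequent class in S
        if votes / len(pixels) > min_fraction:
            for r, c in pixels:
                smoothed[r][c] = winner
    return smoothed


# Two superpixels, each containing one isolated "salt-and-pepper" label.
labels = [[1, 1, 2, 2],
          [1, 3, 2, 2],
          [1, 1, 2, 1]]
superpixels = [[0, 0, 1, 1],
               [0, 0, 1, 1],
               [0, 0, 1, 1]]
smoothed = smooth_labels(labels, superpixels)
```

Superpixels without a dominant class are left unchanged in this sketch, which keeps the correction conservative in genuinely mixed regions.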

IV. EXPERIMENTAL SCHEME
The proposed method is tested on five satellite images. Six existing classification algorithms are compared to validate the performance. The details of competing algorithms are given in this section. The objective metrics are used to evaluate the experimental results.

A. Datasets
Five remote sensing images from GF-1, GF-2, DEIMOS-2, GeoEye-1, and Sentinel-2 satellites are used for the experiment. Images of GF-2 and DEIMOS-2 are from public datasets while others are labeled by us. Their spatial resolutions range from 1 to 10 m. Each image contains blue, green, red, and near infrared bands. The numbers of marked pixels are within the range of 900 000 to 32 000 000. The

B. Cluster Sampling
Before the experiment, the cluster sampling strategy in [35] is adopted. The way in which the labeled data are divided into training and test sets has a significant impact on the quality and reliability of the estimated generalization error [36], and the cluster sampling strategy largely ensures fairness.
The K-means algorithm partitions all samples into two groups according to their spatial coordinates. One group is chosen at random to provide the training samples for each category, and the other group is used for testing. For each category in GF-1, DEIMOS-2, GeoEye-1, and Sentinel-2, 10% of its pixels in one group were randomly chosen for training, while all the pixels in the other group were used for testing. Because of the vast amount of labeled data in the GF-2 image, 1% of the entire dataset was randomly chosen from one group for training and 5% from the other group for testing.
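The sampling procedure above can be sketched as follows. This is a minimal pure-Python illustration of the 10% setting, not the implementation of [35]: the two-cluster k-means uses a simple deterministic initialization, and the names `two_means` and `cluster_split`, the fixed iteration count, and the seed are assumptions.

```python
import math
import random


def two_means(points, iterations=20):
    """Split 2-D points into two spatial groups with a plain k-means (k = 2)."""
    first = points[0]
    second = max(points, key=lambda p: math.dist(p, first))  # deterministic init
    centers = [first, second]
    assignment = [0] * len(points)
    for _ in range(iterations):
        assignment = [min((0, 1), key=lambda k: math.dist(p, centers[k]))
                      for p in points]
        for k in (0, 1):
            group = [p for p, a in zip(points, assignment) if a == k]
            if group:
                centers[k] = (sum(x for x, _ in group) / len(group),
                              sum(y for _, y in group) / len(group))
    return assignment


def cluster_split(coords, classes, train_fraction=0.10, seed=0):
    """Return (train indices, test indices) from spatially separated groups."""
    rng = random.Random(seed)
    group = two_means(coords)
    train_group = rng.choice((0, 1))  # one group is chosen at random for training
    by_class, test = {}, []
    for index, (g, cls) in enumerate(zip(group, classes)):
        if g == train_group:
            by_class.setdefault(cls, []).append(index)
        else:
            test.append(index)  # all pixels of the other group are tested
    train = []
    for indices in by_class.values():
        n = max(1, round(train_fraction * len(indices)))
        train.extend(rng.sample(indices, n))  # per-category random draw
    return sorted(train), test


# Two spatially separated blocks of labeled pixels with two classes.
coords = [(r, c) for r in range(10) for c in range(10)]
coords += [(r, c + 100) for r in range(10) for c in range(10)]
classes = [i % 2 for i in range(len(coords))]
train, test = cluster_split(coords, classes)
```

Because the training and test pixels come from different spatial clusters, neighboring and therefore highly correlated pixels do not appear on both sides of the split.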

C. Parameter Setting
To ascertain the efficacy of the proposed approach, the proposed FGMCN and SPP methods are compared with six CNN-based algorithms: ResNet-34 [19], SSRN [37], MSPSSRN [38], AMDF [3], CANet [39], and SDF²N [40]. Among them, ResNet-34 is a standard residual network, CANet is a residual network with an attention mechanism, SSRN is designed for hyperspectral images, while MSPSSRN, AMDF, and SDF²N are specifically designed for multispectral images. The programming language is Python, with the Keras framework for deep learning.
Training parameters are set for the algorithms. The input patch has 27 × 27 pixels. AMDF uses the SGD optimizer and trains 200 epochs. The initial learning rate is 0.001 and the input patch has 31 × 31 pixels. SDF 2 N uses the Adam optimizer and trains 100 epochs. The initial learning rate is 0.001 and the input patch has 33 × 33 pixels.
Metrics are used to evaluate the classification results. Accuracy is the most commonly used rule for this topic. In addition to accuracy, recall, and F1-score are also evaluated for category by category. To give a general conclusion related to the accuracy, overall accuracy (OA), average accuracy (AA), and the Kappa coefficient are also used to measure the classification quality. All these metrics output scores within the range [0,1] where the higher gives the better.
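For reference, these scores can all be derived from a confusion matrix. The sketch below assumes the usual remote sensing conventions, which the text does not spell out: rows are reference classes, columns are predicted classes, and AA is the mean of the per-class recalls. The function names are illustrative.

```python
def overall_accuracy(cm):
    """Fraction of all samples on the diagonal of the confusion matrix."""
    total = sum(sum(row) for row in cm)
    return sum(cm[i][i] for i in range(len(cm))) / total


def recall(cm, k):
    """Correctly classified samples of class k over all reference samples of k."""
    return cm[k][k] / sum(cm[k])


def f1_score(cm, k):
    """Harmonic mean of precision and recall for class k."""
    tp = cm[k][k]
    fn = sum(cm[k]) - tp
    fp = sum(row[k] for row in cm) - tp
    return 2 * tp / (2 * tp + fp + fn)


def average_accuracy(cm):
    """Mean of the per-class recalls (assumed definition of AA)."""
    return sum(recall(cm, k) for k in range(len(cm))) / len(cm)


def kappa(cm):
    """Cohen's Kappa: agreement corrected for the agreement expected by chance."""
    n = len(cm)
    total = sum(sum(row) for row in cm)
    p_o = overall_accuracy(cm)
    # Chance agreement from the row (reference) and column (prediction) totals.
    p_e = sum(sum(cm[k]) * sum(row[k] for row in cm) for k in range(n)) / total ** 2
    return (p_o - p_e) / (1 - p_e)


# Toy two-class example: rows are reference labels, columns are predictions.
cm = [[40, 10],
      [5, 45]]
print(overall_accuracy(cm), average_accuracy(cm), kappa(cm), f1_score(cm, 0))
```

In this toy matrix OA and AA coincide because both classes have 50 reference samples; with imbalanced classes, as in the DEIMOS-2 image, AA weights rare categories more heavily than OA does.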

V. EXPERIMENTAL RESULTS
The experimental results are visually presented in this section. They are also quantitatively evaluated with metrics. The proposed FGMCN model and the SPP solution are validated separately.

A. Quantitative Comparison on Test Sets
The test sets of the five images were classified using FGMCN and the competing algorithms. To reduce the effect of random clustering, each algorithm is run five times with randomly drawn training and test sets, and the average scores are recorded as the final results, which are presented in Tables I-V. The proposed SPP was not used at this stage because the test samples are discrete. Bold digits represent the best scores.
TABLE I. ASSESSMENT FOR THE TEST DATASET OF THE GF-1 IMAGE
In Tables I-V, OA, AA, and the Kappa coefficient indicate that the newly proposed FGMCN method performs better than the competing algorithms. SSRN performs poorly and is outperformed by ResNet-34 and AMDF. MSPSSRN, CANet, and SDF 2 N are better still, but the proposed FGMCN algorithm consistently gives the best results.
The DEIMOS-2 classification is the most challenging because it has the largest number of classes, and there is a large class imbalance between the categories. Bridges are identified only by ResNet-34, CANet, SDF 2 N, and our model. In this scene, our method has the highest stability, which can be partially explained by fine-grained multiscale learning. Table IV shows that grassland is difficult for all algorithms to identify because of insufficient labeled pixels and its similarity to forest and farmland. The confusion matrix shows that grassland tends to be misclassified as farmland. The accuracy of the other algorithms is poor, at around 45%. The proposed FGMCN algorithm gives the highest recall and F1-score.

B. Quantitative Comparison on Full Images
To verify the effectiveness of SPP, classification was performed on the full images using the trained models. All labeled pixels are evaluated, and the results are presented in Tables VI-X. The last columns of these tables, abbreviated as "+SPP," give the scores obtained when SPP is used to correct the FGMCN results over the full images. Tables VI-X show that the classification accuracy is improved for all algorithms, which indirectly indicates the effectiveness of the cluster sampling strategy for generalization comparison. The scores of OA, AA, and the Kappa coefficient indicate that our method performs far better than the competing methods. When SPP is added for classification refinement, OA, AA, and the Kappa coefficient are slightly improved for almost all images.
TABLE II. ASSESSMENT FOR THE TEST DATASET OF THE GF-2 IMAGE
TABLE III. ASSESSMENT FOR THE TEST DATASET OF THE DEIMOS-2 IMAGE
TABLE IV. ASSESSMENT FOR THE TEST DATASET OF THE GEOEYE-1 IMAGE
TABLE V. ASSESSMENT FOR THE TEST DATASET OF THE SENTINEL-2 IMAGE
TABLE VI. ASSESSMENT FOR THE WHOLE GF-1 IMAGE
TABLE VII. ASSESSMENT FOR THE WHOLE GF-2 IMAGE
TABLE VIII. ASSESSMENT FOR THE WHOLE DEIMOS-2 IMAGE
TABLE IX. ASSESSMENT FOR THE WHOLE GEOEYE-1 IMAGE
In addition, the optimal SPP threshold varies with the number of annotated pixels. Based on the experiments, it is recommended to set the threshold to 0.90 for fewer than one million annotated pixels, 0.75 for millions of annotated pixels, and 0.60 for tens of millions of annotated pixels. To obtain the best combination performance, different thresholds can be set for separate categories.
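The threshold recommendation can be written as a small lookup. This is only a sketch: the function name `recommended_spp_threshold`, the exact boundary of ten million pixels between "millions" and "tens of millions," and the optional per-category override dictionary are assumptions introduced for illustration.

```python
def recommended_spp_threshold(n_annotated, category=None, per_category=None):
    """Suggest an SPP threshold from the number of annotated pixels.

    per_category is an optional, hypothetical mapping from category name to a
    hand-tuned threshold that takes priority over the general recommendation.
    """
    if per_category and category in per_category:
        return per_category[category]
    if n_annotated < 1_000_000:       # fewer than one million annotated pixels
        return 0.90
    if n_annotated < 10_000_000:      # millions (assumed upper bound: 10 million)
        return 0.75
    return 0.60                       # tens of millions of annotated pixels


# Pixel counts at the two ends of the range reported for the five images.
print(recommended_spp_threshold(900_000))
print(recommended_spp_threshold(32_000_000))
```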

C. Visual Comparison
The classification results for the whole images are shown in Figs. 6-10, where local blocks are magnified to compare the details. Fig. 7 shows that the river in the city tends to be recognized as farmland by all methods except ResNet-34, SDF 2 N, and our method. In Fig. 8, the boats are identified only by SDF 2 N and our method.
The results confirm that SPP can improve accuracy and reduce fragmentation. For example, roads in Fig. 6 are unrecognizable in the FGMCN result but are recovered by SPP. The fragments of the river areas in Fig. 7 are also effectively removed by SPP. As can be seen in these figures, SPP can outline ground types at a small scale, which smooths out minor classification errors in the category transition regions.

VI. ABLATION STUDY
The key to the performance improvement achieved by our FGMCN model lies in the DSB and MRB modules designed for the classification problem. Unlike existing deep-learning-based classification models, our FGMCN model first uses the DSB to automatically search for branches that fit various structures while performing downscaling. Then, the MRB is used for feature extraction to obtain fine-grained, distinguishable receptive fields.
To illustrate the contribution of these two modules, an ablation study was conducted. In this experiment, the DSB and MRB components were selectively removed to understand their contribution to the overall classification model. Specifically, DSB was replaced by a 3 × 3 convolution with a stride of 2, and MRB was replaced by a common residual block. The modified FGMCN models were then trained, and their convergence is presented in Fig. 11 in terms of the validation error during training. The assessment results are presented in Table XI. In general, the error increases significantly when DSB is removed, and removing MRB also incurs a slight accuracy loss. In contrast, the full FGMCN model has the fastest convergence and the lowest error on the validation dataset.

VII. DISCUSSION
As mentioned in the first section, our classification method is pixel-wise and CNN-based. These choices are motivated by practical considerations, namely the best performance and convenience. In this section, the advantages of the two strategies are discussed.
To compare pixel-wise classification with patch-wise classification, an additional experiment is conducted with a patch-wise classification algorithm. Since patch-wise training requires complete patch labels, it is tested only on the GF-2 image. The patch-wise algorithm CAEN [2] is chosen as the competing algorithm. CAEN is trained with the Adam optimizer for 80 epochs with a learning rate of 0.001. The input patch size is set to 56 × 56 pixels.
Two partition ratios are used on the GF-2 data to verify the performance difference. In the first test, 10% of the dataset is used for training and 50% for testing, which keeps the same training-to-test proportion as in the experimental section. In this case, the OA score is 0.899 and the Kappa score is 0.850. Comparing these results with the scores in Table II shows that CAEN performs worse than CANet, SDF 2 N, and the proposed FGMCN model. To explore the performance limit of the patch-wise classification method, 50% of the dataset is used for training and 50% for testing, which is consistent with the ratio used in [2]. In this case, the OA score is 0.907 and the Kappa score is 0.861. These results are better and are lower only than those of FGMCN. However, the patches in the training and test datasets are then very similar, because the high training ratio makes random clustering infeasible for a fair comparison. In other words, pursuing the best performance of patch-wise classification methods comes at the cost of a huge number of consistently labeled samples, a requirement that is difficult to satisfy. In contrast, our pixel-wise classification method achieves higher accuracy with far fewer labels, which may be discrete and irregular.
The weaker classification performance of transformers can be indirectly demonstrated by results in the existing literature. In the SDF 2 N study, SpectralFormer [41], a transformer-based classification method, was tested on three images, including two multispectral images and an airborne hyperspectral image. The evaluation results of SDF 2 N and SpectralFormer are partly shown in Table XII, where the transformer-based SpectralFormer algorithm is not as good as the CNN-based SDF 2 N method. Moreover, Tables I-V show that the proposed FGMCN method performs better than SDF 2 N for multispectral classification, and that the scores of MSPSSRN and CANet are similar to those of SDF 2 N, which indicates the superiority of CNN-based methods over transformer-based methods.

VIII. CONCLUSION
This study introduces a novel classification method that utilizes FGMCN and SPP to enhance the quality of multispectral classification in high-resolution images. To learn multiscale information from images more effectively, the proposed method constructs a multiscale residual network at a finer scale. The proposed method is compared with six widely used image classification algorithms on five remote sensing images acquired by the GF-1, GF-2, DEIMOS-2, GeoEye-1, and Sentinel-2 satellites. The experimental results demonstrate that the proposed method performs well in terms of OA, AA, and the Kappa coefficient, with good classification accuracy for high-resolution multispectral images. Additionally, SPP substantially reduces the speckles in pixel-wise classification results, thereby improving both the accuracy and the visual quality of the proposed method.