Convolutional Neural Networks With Class-Driven Loss for Multiscale VHR Remote Sensing Image Classification

Because land covers always have different scales, multiscale methods are widely used in very-high-resolution (VHR) remote sensing image classification. Traditional multiscale methods usually capture multiscale information by using rectangular windows of different sizes. Each scale contains the same number of training samples and is trained independently; hence, the training process is time-consuming. In this article, a novel convolutional neural network with a class-driven loss (CNNs-CDL) model is proposed for multiscale VHR remote sensing image classification. First, a multiscale sample construction method is proposed to select training samples and capture the relationships among samples at different scales. The lowest-scale samples are selected on the lowest-resolution image and are mapped to the higher-resolution images without additional label information. Then, a CNN with the class-driven loss is trained on the lowest-scale training samples. The class-driven loss can effectively learn the spatial dependence between nonadjacent samples to improve classification accuracy. Finally, the CNN model is fine-tuned with the higher-scale samples. Although the number of higher-scale training samples increases, the fine-tuning process requires only a small number of iterations to converge. Hence, the proposed model can effectively reduce the training time. Experimental results on three VHR remote sensing images show that the proposed method performs better than several recently proposed methods.


I. INTRODUCTION
Very-high-resolution (VHR) remote sensing image analysis plays an important role in Earth observation [1]-[4]. Many efforts have been made to accurately classify VHR remote sensing images [5], [6]. In earlier decades, this research mainly focused on pixel-level classification methods such as support vector machines (SVMs) [7], random forests [8], principal component analysis (PCA) [9], independent component analysis (ICA) [10], local Fisher discriminant analysis (LFDA) [11], multinomial logistic regression (MLR) [12], and neural networks [13]. Pixelwise results often suffer from salt-and-pepper noise, and it is well known that combining spectral and spatial information is more effective for accurate land-cover mapping.
Recently, deep neural networks have become a topic of interest in VHR remote sensing image classification. Chen and Lin [18] first introduced stacked autoencoders (SAEs) into hyperspectral image (HSI) classification in the spectral domain. Later, they proposed 3D convolutional neural networks (3D CNNs) for spectral-spatial feature extraction and compared three CNN models in their work, including 1-D CNNs (spectral domain), 2-D CNNs (spatial domain), and 3-D CNNs (spectral-spatial domain) [19]. Li et al. [20] exploited the application of deep belief networks (DBNs) on HSI classification. Cheng et al. [21] and Kang et al. [22] combined a Gabor filter with CNNs for contextual feature extraction and spectral-spatial feature learning.
Qing et al. [23] combined the Markov random fields model with CNNs for HSI classification. Mou et al. [24] proposed recurrent neural networks (RNNs) for HSI classification to handle the relationships between HSI pixels. Zhou et al. [25] applied long short-term memory (LSTM) to learn the spectral-spatial features for HSI classification. Li et al. [26] proposed a double-branch dual-attention mechanism network (DBDA) for hyperspectral image classification to capture the abundant spectral-spatial features. In [27], a multilevel context-guided classification method with object-based CNNs was proposed to exploit the deep discrimination features. Land covers always have different scales; hence, it is difficult to capture object features accurately with a single scale. Multiscale-based classification methods have received wide attention.
Multiscale-based classification methods usually extract multiscale samples from the original resolution image by using multiscale windows. These methods fully consider the multiscale characteristics of the land covers; however, the feature extraction process of each scale is independent in many studies. Meanwhile, sample testing also requires multiple forward propagation calculations, which is time consuming. In addition, traditional multiscale-based classification methods obtain spatial information by a spatial window. Although the spatial window (i.e., rectangular window or superpixel window) can effectively extract the local spatial features of the sample, the design of the networks does not consider the spatial dependence of nonadjacent samples.
Based on the abovementioned analysis, there are two main aspects that limit the performance of the multiscale-based classification methods: first, how to fully explore the multiscale information to obtain higher classification accuracy with limited training samples and training time, and second, how to effectively learn the spatial dependency of the samples to improve the performance of the CNNs.
Hence, in this article, we mainly focus on these two aspects and propose the CNNs-CDL model for VHR remote sensing image classification. On the one hand, we use multiresolution images to construct multiscale samples and propose a novel multiscale sample construction method to establish the relationships among different scales, which can effectively reduce the number of samples. Meanwhile, a multiscale network training method with a pretraining and fine-tuning model is designed to effectively reduce the training and testing time. On the other hand, the purpose of classification is to reduce the intraclass differences and increase interclass differences. Hence, we use the class centre as the constraint in the design of the class-driven loss to improve the spatial dependency during the pretraining process.
The main contributions of the proposed method are as follows.
1. Multiscale sample construction and learning methods are proposed to capture the multiscale characteristics of the image and effectively reduce the training and testing time of the networks.
2. Class-driven loss is proposed to learn the spatial dependency and consider the spatial dependency between classes during the pretraining process.
3. Experimental results verify the superiority of the proposed multiscale structure with limited training samples.
The rest of this article is organized as follows. Section II reviews related work on multiscale deep feature extraction. The details of the proposed method are described in Section III. Section IV presents the experimental results and analysis, followed by the conclusion of our work.

II. RELATED WORK ON MULTISCALE DEEP FEATURE EXTRACTION
In recent years, multiscale-based deep feature extraction methods have been studied to adaptively capture objects of different sizes. The most commonly used multiscale-based classification approach extracts features with rectangular windows. In [28], a six-diverse-region-based CNN model (DR-CNNs) was proposed to capture objects with different positions and scales. Zhao et al. [29] captured multiscale features using a multiscale convolutional autoencoder in an unsupervised manner; however, it was difficult to obtain good results for complex scenes. Later, they proposed supervised multiscale convolutional neural networks (MCNNs) for remote sensing image classification [30]. Sun et al. [31] proposed a novel spectral-spatial framework that concatenates localized spectral features and hierarchical multiscale spatial features to improve the classification performance. In [32], a novel CNN with multiscale filter banks was proposed to extract the deep multiscale features of HSI. He et al. [33] proposed multiscale covariance maps to increase the robustness of the CNN model. In [34], densely connected fully convolutional networks were introduced for multiscale and multimodal high-resolution remote sensing image semantic segmentation.
To adaptively capture the multiscale information of objects, Li et al. [35] proposed a multiscale superpixel classification method for spectral-spatial classification (MS-SSC), in which multiscale superpixels are used to adaptively capture objects at different scales. In [36], a multiscale superpixel-guided filter method was proposed to accurately represent the edge information in the image. In [37], Wan et al. proposed a graph convolutional network applied to irregular image regions to improve the classification accuracy at class boundaries. To better determine the number of scales, adaptive multiscale CNNs were proposed for scale learning [38], [39].
In addition, several works have studied the effective fusion of multiscale features. Liu et al. [40] proposed a multiscale feature fusion method to combine detailed and semantic features, thus improving the classification performance of hyperspectral images. In [41], Mu also considered integrating multiscale spectral-spatial features and proposed multiscale and multilevel spectral-spatial feature fusion networks for HSI classification. In [42], Li et al. proposed a multiscale deep fusion residual network with a backbone network and a fusion network to adaptively fuse multiple hierarchical features for remote sensing image classification. In [43], multiscale hierarchical recurrent neural networks were proposed to describe the relationships among samples at different scales. In [44], a multiple-kernel technique was introduced to fuse multiscale features.
Although many multiscale methods have been proposed for VHR remote sensing image classification, each scale usually has the same number of training samples. When the number of training samples is limited, the features obtained at each scale are not reliable, which reduces the performance of multiscale classification. Hence, we try to improve the classification accuracy with limited training samples. The proposed method selects fewer initial training samples and improves the classification performance by enlarging the training set without additional label information.

III. THE PROPOSED METHOD
In this section, a novel framework of the CNNs-CDL model is proposed for multiscale VHR remote sensing image classification. The proposed model mainly consists of two steps: pretraining the CNN with the class-driven loss and fine-tuning the CNN to capture the multiscale information. The lowest-scale samples are used for pretraining the networks, and higher-scale samples are used for fine-tuning with fewer iterations. The flowchart is shown in Fig. 1. The advantages of the proposed method are as follows: 1) the proposed model can capture multiscale information through the multiscale samples; 2) the number of lowest-scale samples is small, and the number of iterations for the higher-scale samples is reduced, which lowers the network training time; and 3) the class-driven loss can effectively establish the spatial dependency of samples from the perspective of classification targets. In the following, we describe each part of the proposed method in detail.

A. CONSTRUCTION OF MULTISCALE SAMPLES
Traditional multiscale samples are extracted with rectangular windows of different sizes on the original image; hence, the numbers of training samples per scale are the same. In contrast to traditional multiscale training sample extraction, the proposed method extracts multiscale training samples from images of different resolutions, and the numbers of training samples per scale are different. The construction process is shown in Fig. 2. Low-resolution images can effectively suppress noise and have better spatial consistency. Therefore, the pixel has a higher correlation with the surrounding pixels. We can select fewer samples on the low-resolution image to effectively learn large-scale objects. As the image resolution increases, the details of the image increase, and the correlation of the adjacent pixels decreases. Hence, we increase the number of samples to better capture the features of the small-scale object. The following shows the extraction process of multiscale samples in detail.
Assume the original VHR image is I_1 with size M × N, and the corresponding ground-truth map is G_1. The low-resolution image I_{i+1} is constructed by downsampling the high-resolution image I_i:

I_{i+1} = D_Img(I_i), i = 1, 2, ..., S − 1    (1)

where S is the scale number and D_Img(·) is the average downsampling operator. To extract the main energy of the high-resolution image, we calculate the average of the pixels in nonoverlapping 2 × 2 windows on the high-resolution image I_i to obtain the low-resolution image I_{i+1}.
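As an illustrative sketch (in NumPy; the function names are ours, not from the paper), the average downsampling operator D_Img and the resulting resolution pyramid I_1, ..., I_S can be implemented as follows:

```python
import numpy as np

def downsample_image(img):
    """Average-downsample an image by a factor of 2 using
    non-overlapping 2x2 windows (the D_Img operator).
    img: array of shape (H, W) or (H, W, bands), H and W even."""
    h, w = img.shape[0] // 2, img.shape[1] // 2
    # Group pixels into 2x2 blocks and average over each block.
    blocks = img[:2 * h, :2 * w].reshape(h, 2, w, 2, *img.shape[2:])
    return blocks.mean(axis=(1, 3))

def build_pyramid(img, num_scales=3):
    """Return [I_1, I_2, ..., I_S]: the original image followed by
    progressively lower-resolution versions."""
    pyramid = [img]
    for _ in range(num_scales - 1):
        pyramid.append(downsample_image(pyramid[-1]))
    return pyramid
```

Each application of `downsample_image` halves both spatial dimensions, so the S-resolution image is 2^(S−1) times smaller per side than the original.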
A similar process is used to construct the multiresolution ground-truth maps:

G_{i+1} = D_GT(G_i), i = 1, 2, ..., S − 1    (2)

where D_GT(·) is the statistical downsampling operator.
Considering that the class of the pixels within the 2 × 2 window may be inconsistent, we consider the class of the four pixels and select the class with the most occurrences as the class of the low-resolution pixel.
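A minimal NumPy sketch of the statistical downsampling operator D_GT described above (the function name is ours; ties within a 2 × 2 block resolve to the smallest label here, a detail the paper does not specify):

```python
import numpy as np

def downsample_gt(gt):
    """Majority-vote downsampling of a label map (the D_GT operator):
    each 2x2 block of class labels is replaced by the label that
    occurs most often within the block."""
    h, w = gt.shape[0] // 2, gt.shape[1] // 2
    # Rearrange into (h, w, 4): the four labels of each 2x2 block.
    blocks = (gt[:2 * h, :2 * w]
              .reshape(h, 2, w, 2)
              .transpose(0, 2, 1, 3)
              .reshape(h, w, 4))
    out = np.empty((h, w), dtype=gt.dtype)
    for i in range(h):
        for j in range(w):
            labels, counts = np.unique(blocks[i, j], return_counts=True)
            out[i, j] = labels[np.argmax(counts)]  # most frequent label
    return out
```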
Through the above approach, the downsampling process, the multiresolution images, and their ground-truth maps are obtained. In the following, we extract the multiscale samples from the multiresolution images. The extraction process of multiscale samples is the inverse process of the downsampling process. We randomly select a small number of samples (S-scale sample) on the lowest-resolution image (S-resolution image) with a square window, restore the position of these samples to the corresponding four positions of the (S − 1)-resolution image, and extract the samples of the four pixels separately on the corresponding (S −1)-resolution image. We restore the image layer by layer, and samples of each scale are finally obtained.
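The layer-by-layer position restoration can be sketched as follows (Python; the helper name is ours). Each sample position at a low resolution maps to four positions at the next higher resolution, so the sample count quadruples per scale while every expanded sample inherits the label of its low-scale parent, requiring no additional labelling:

```python
def expand_positions(positions):
    """Map sample positions at scale i to the four corresponding
    positions at the next higher resolution (scale i-1): pixel (r, c)
    in the low-resolution image covers pixels (2r, 2c), (2r, 2c+1),
    (2r+1, 2c), and (2r+1, 2c+1) in the high-resolution image."""
    expanded = []
    for r, c in positions:
        expanded.extend([(2 * r, 2 * c), (2 * r, 2 * c + 1),
                         (2 * r + 1, 2 * c), (2 * r + 1, 2 * c + 1)])
    return expanded
```

Starting from n samples at the lowest scale S, repeated expansion yields 4n samples at scale S − 1, 16n at scale S − 2, and so on.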

B. PRETRAIN THE CNN WITH CLASS-DRIVEN LOSS ON THE LOWEST-SCALE SAMPLE
The lowest-scale (S-scale) sample can effectively represent a large-scale object and has better regional consistency and fewer training samples. To reduce the training time and improve the regional consistency of the initial classification map, the lowest-scale sample is used to pretrain the CNN.
CNNs can effectively extract the high-level features of low-scale samples. A CNN consists of several convolutional layers, pooling layers, full connection layers, and a classification layer. Assuming that the input lowest-scale sample is F_S^1, the propagation process is described below.
Forward propagation:
Convolutional layer: The convolutional layer extracts convolutional features through trainable convolutional filters, as shown in Eq. (5):

F^l = f(F^{l-1} * K^l + b^l)    (5)

where F^l is the l-level convolutional feature map, K^l is the filter, b^l is the bias, and f(·) is the activation function, chosen here as the rectified linear unit.
Pooling layer: The pooling layer downsamples the input feature map to reduce the feature size and training time, as shown in Eq. (6):

F^l = g(F^{l-1})    (6)

where g(·) is the max-pooling operator applied over nonoverlapping 2 × 2 windows.
Full connection layer and classification layer: The output features F_S^l are flattened into a column vector and fed into a softmax classifier, as shown in Eq. (7):

c_{i,a} = exp(θ_a^T x_i) / Σ_b exp(θ_b^T x_i)    (7)
where θ denotes the trainable parameters of the softmax classifier, c_{i,a} indicates the probability that the i-th sample belongs to class a, and the class with the maximum probability is taken as the output label.
Back propagation: In this section, we define a novel class-driven loss to constrain the spatial dependencies of nonadjacent samples. The purpose of classification is to reduce the intraclass difference and increase the interclass difference. Therefore, we improve the classification accuracy by learning the class centre of each class and constraining the outputs through the class centres. The class-driven loss is defined as

ξ_total = ξ_softmax + α(ξ_intra − ξ_inter)    (8)

and the process is shown in Fig. 3, where ξ_softmax is the cross-entropy loss and ξ_intra is the intraclass difference, which represents the difference between the L-layer features of sample x_i and the centre of the class in which x_i is located:

ξ_intra = Σ_i ||F^L(x_i) − C(x_i^p)||_2^2    (9)

ξ_inter is the interclass difference, which represents the difference between the L-layer features of sample x_i and the class centres of the other classes:

ξ_inter = Σ_i Σ_{q≠p} ||F^L(x_i) − C(x_i^q)||_2^2    (10)

Here, C(x_i^p) indicates the class centre of the class containing sample x_i, and C(x_i^q) indicates the centre of another class q. α is an adjustable parameter that balances the cross-entropy loss term, the intraclass difference term, and the interclass difference term.
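As a concrete sketch of the quantities above (NumPy; the function names and the per-sample averaging are our choices, and the combination of terms is our reconstruction of the description, not the paper's verbatim formulation):

```python
import numpy as np

def softmax_probs(feats, theta):
    """Softmax classifier of Eq. (7): one column of theta per class;
    returns c[i, a], the probability that sample i belongs to class a
    (bias terms omitted for brevity)."""
    logits = feats @ theta
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)

def class_driven_loss(probs, feats, labels, centres, alpha):
    """Cross-entropy plus alpha * (intraclass pull-in - interclass
    push-away). Terms are averaged over samples here; the paper's
    exact normalization may differ.
    probs:   (n, K) softmax outputs
    feats:   (n, d) L-layer features F^L
    labels:  (n,)   ground-truth class indices
    centres: (K, d) current class-centre features"""
    n, K = probs.shape
    xent = -np.log(probs[np.arange(n), labels] + 1e-12).mean()
    # Intraclass term: squared distance to the sample's own class centre.
    intra = ((feats - centres[labels]) ** 2).sum(axis=1).mean()
    # Interclass term: mean squared distance to the other class centres.
    dists = ((feats[:, None, :] - centres[None, :, :]) ** 2).sum(axis=2)
    own = dists[np.arange(n), labels]
    inter = ((dists.sum(axis=1) - own) / (K - 1)).mean()
    return xent + alpha * (intra - inter)
```

With α = 0 the loss reduces to ordinary cross-entropy; a positive α pulls features toward their own class centre and pushes them away from the other centres.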
The optimization of the loss function ξ_total involves two sets of quantities: the weights and biases of the CNN and the class centres. The weights and biases are updated by the gradient descent method. To update the class centres, at the end of each iteration, we compute the features of all training samples via the forward process and obtain the class centres as the feature mean of each class. The initial value of each class centre is set to a zero vector. The adjustable parameter α starts from 0 and increases by 0.001 every 50 iterations. Since the class-centre estimates are inaccurate at the beginning of training, α is initialized to 0; that is, the class-driven loss initially degenerates into the ordinary cross-entropy loss.
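The centre update and the schedule for α described above can be sketched as follows (NumPy; function names are ours):

```python
import numpy as np

def update_centres(feats, labels, num_classes):
    """After each iteration, recompute each class centre as the mean
    L-layer feature of that class's training samples. Centres of
    classes with no samples stay at the zero-vector initialization."""
    centres = np.zeros((num_classes, feats.shape[1]))
    for k in range(num_classes):
        mask = labels == k
        if mask.any():
            centres[k] = feats[mask].mean(axis=0)
    return centres

def alpha_schedule(iteration):
    """Alpha starts at 0 (pure cross-entropy while the centre
    estimates are still unreliable) and grows by 0.001 every
    50 iterations."""
    return (iteration // 50) * 0.001
```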

C. FINE-TUNE THE CNN WITH HIGHER-SCALE SAMPLES
After pretraining the CNN with the lowest-scale samples, we fine-tune the CNN with the (S − 1)-scale samples, then the (S − 2)-scale samples, and so on until reaching the original-resolution samples. The distributions of the higher-scale samples are consistent with those of the lowest-scale samples; hence, only local information needs to be refined during the fine-tuning process.
The loss function of the fine-tuning process is the cross-entropy loss, i.e., ξ_fine = ξ_softmax. The CNN further learns small-scale object information during the fine-tuning process and achieves better classification results.
In the proposed method, the number of lowest-scale training samples is small; hence, the pretraining time can be effectively reduced. Although the number of samples quadruples at each finer scale, fine-tuning requires far fewer iterations, so the overall process is more efficient than traditional multiscale classification methods.
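The overall pretrain-then-fine-tune schedule can be summarized in a short Python sketch (the `fit` interface is a hypothetical stand-in, not an API from the paper; the iteration counts follow the experimental setup in Section IV):

```python
def multiscale_training(model, samples_per_scale,
                        pretrain_iters=1000, finetune_iters=300):
    """Pretrain on the lowest-scale samples with the class-driven
    loss, then fine-tune scale by scale (S-1, ..., 1) with
    cross-entropy only. `model` is any object exposing a hypothetical
    fit(samples, iters, loss) method."""
    scales = sorted(samples_per_scale)      # e.g. [1, 2, 3]; 3 = lowest
    model.fit(samples_per_scale[scales[-1]], pretrain_iters,
              loss="class_driven")          # pretraining at scale S
    for s in reversed(scales[:-1]):         # fine-tune S-1, ..., 1
        model.fit(samples_per_scale[s], finetune_iters,
                  loss="cross_entropy")
    return model
```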

IV. EXPERIMENTAL RESULTS AND ANALYSIS

A. DATASETS
In this section, three datasets are used to verify the performance of the proposed method, as shown in Figs. 4-6.
Aerial data [45] (Fig. 4(a)) shows a scene of Hangzhou, China. It was acquired by an airborne ADS80 sensor. The spatial resolution of the Aerial image is approximately 0.32 m, and its spatial size is 560 × 360. Six classes are available: grass, water, road, trees, building, and shadow. The ground-truth map is shown in Fig. 4(b).
Xian data (Fig. 5(a)) was acquired by the QuickBird satellite on May 30, 2008, and is a multispectral image with four bands. The size of the Xian data is 500 × 500, with a spatial resolution of 2.44 m. Seven classes are available: building, flat land, road, shadow, soil, tree, and water. The ground-truth map is shown in Fig. 5(b).
Zh11 data (Fig. 6(a)) belongs to the Zurich Summer Dataset v1.0 [46]. The dataset contains 20 VHR remote sensing images acquired by QuickBird over Zurich, Switzerland, in August 2002. The Zh11 image contains four bands with six available classes, and its size is 910 × 786. The ground-truth map is shown in Fig. 6(b).
The numbers of training and testing samples are shown in Table 1, and the average values of the overall accuracy (OA), average accuracy (AA), and kappa coefficient are used to evaluate the classification results.

B. COMPARISON RESULTS
In this section, the proposed method is compared with five available methods, which are shown as follows.
Spectral-based method: the SAEs method [18]. SAEs are well-known deep neural networks; however, they accept only one-dimensional vector inputs, so they are used here as a spectral-based classification method.
Spectral-spatial-based deep neural network classification methods: 3D CNNs [19], Gabor-CNNs [21], DR-CNNs [28], and MCNNs [30]. 3D CNNs use a 3D convolutional filter for spectral-spatial classification; Gabor-CNNs combine the Gabor filter with CNNs to improve the classification results; DR-CNNs exploit the deep features of six different regions of the sample to learn more discriminative features; and MCNNs extract multiscale features through multiscale samples with different window sizes. In contrast to the proposed method, these multiscale samples are all extracted from the original images.

In our experiments, the proposed CNNs-CDL model contains two feature extraction layers, two max-pooling layers, and one full connection layer. The two feature extraction layers include 20 filters of size 6 × 6 and 40 filters of size 5 × 5, respectively. The fully connected layer contains 100 units, and the learning rate is set to 0.01. The iteration number for pretraining the CNNs-CDL model is 1000, and for fine-tuning, it is 300. There is no theoretical support for the parameter selection in the networks; hence, these parameters are determined empirically. The convolutional filter parameters are randomly initialized. For a fair comparison, the parameter settings of the networks and the input window size in the comparison methods are the same as in the proposed method, and for the MCNNs method, the window sizes of the three scales are 21 × 21, 42 × 42, and 63 × 63. In addition, the number and position of the training samples are the same as those at the 1-scale of the proposed method.

Tables 2-4 show the classification accuracies. The spectral-spatial-based methods clearly achieve better classification results than the SAEs method, and this contrast is more obvious when the land-cover information is more complicated. Hence, spatial information is very important for VHR image classification. 3D CNNs and Gabor-CNNs are considered single-scale CNNs.
Due to the effective extraction of texture information by the Gabor transform, the classification accuracy of the Gabor-CNNs is slightly higher than that of the 3D CNNs, by 0.24%, 0.06%, and 0.4% for the Aerial, Xian, and Zh11 data, respectively. However, because the Gabor filter requires preset parameters, some small details are lost in the classification maps, such as the ship in the upper right corner of the Zh11 data. The DR-CNNs, the MCNNs, and the proposed method are considered multiscale-based classification methods, and they also achieve higher classification accuracy. Taking a closer look at the classification maps, the DR-CNNs and MCNNs exhibit oversmoothing, and several small details are lost in the classification maps, such as the port in the Aerial data and the boat in the Zh11 data. In Tables 2-4, we note that the accuracies of the DR-CNNs, the MCNNs, and the proposed method are very similar. However, the DR-CNNs and MCNNs use the same number of training samples per scale, and each scale is trained independently, which requires more training and testing time than the proposed method. The proposed method achieves higher classification accuracy while requiring fewer iterations during the fine-tuning process; hence, it has a lower time cost. Balancing classification accuracy and time complexity, the proposed method achieves the best overall performance among the compared methods.

C. PARAMETER ANALYSIS
Specific parameters of the proposed method, including input window size and scale number, will be analysed below.

1) INPUT WINDOW SIZE
Windows that are too small cannot represent the structural information of the sample, and windows that are too large contain more interfering information. Therefore, an appropriate window size is important for accurate classification. To analyse the effect of the window size on the performance of the proposed model, the input window size is varied from 17 × 17 to 25 × 25. Fig. 10 shows the OA value with different window sizes. For the Xian data and Zh11 data, 21 × 21 is the best choice. For the Aerial data, there is no significant change in classification performance for window sizes from 19 × 19 to 23 × 23. Hence, the window size is set to 21 × 21 in the proposed model.

2) SCALE NUMBER
In the proposed model, multiscale samples are selected from images with different resolutions. The image size decreases dramatically when the image resolution is reduced. Table 5 shows the image sizes at different resolutions. When the image resolution is reduced too much, more detailed information is lost, resulting in a dramatic decrease in the classification result. Fig. 11 shows the classification accuracies at each single scale. The training samples of a single scale are used to train the class-driven CNN, and the training and testing numbers of each scale are shown in Table 1. The testing samples are all selected from the original-resolution image. Although more scales can effectively reduce the number of training samples and the time cost, when the scale is reduced to 3, the classification accuracy is significantly reduced. In the proposed model, we need to pretrain the networks with the lowest-scale samples, and a classification accuracy that is too low will affect the precision of the fine-tuning process. Hence, to balance accuracy and time cost, the scale number is selected as 3. The classification accuracies under different scale conditions are reported in Tables 6-8.
It can be noted that the classification maps of the proposed CNNs-CDL model show a significant improvement as the sample resolution increases. Figs. 12(a)-14(a) show the classification results of the pretrained networks. Although these classification results are not ideal, the low-scale samples are fewer in number, require shorter training times, and provide better initial classification results for the fine-tuning process than random initialization. Furthermore, a closer look at Tables 6-8 shows that, under the same training set conditions, the 2-scale condition (fine-tuned from the 3-scale result) achieves higher classification accuracy than the only-2-scale condition (trained at that scale alone). The same holds for the 1-scale and only-1-scale conditions. Therefore, in the proposed model, the classification accuracy of multiscale samples is effectively improved by using the classification information of the previous scale.

D. COMPARISON RESULTS BETWEEN CLASS-DRIVEN LOSS AND CROSS-ENTROPY LOSS
In the proposed multiscale classification framework, CNNs with class-driven loss are used during the pretraining process to constrain the spatial dependencies of nonadjacent samples and improve the initial classification accuracy. Table 9 presents the classification accuracies. For the Aerial data, compared with the cross-entropy loss, the classification accuracy with the class-driven loss improves by only 0.19%; however, Fig. 15(a) and (b) show that the boundary information is preserved much better with the class-driven loss. For the Xian and Zh11 datasets, the class-driven classification accuracies achieve an obvious improvement. Hence, the accuracy of pretraining also affects the final classification accuracy, and the proposed class-driven loss can effectively improve the initial classification map and further improve the final classification accuracies.

E. EVALUATING THE CLASSIFICATION RESULTS WITH DIFFERENT NUMBERS OF TRAINING SAMPLES
The influence of the number of training samples on the classification accuracy is analysed in this section. We further reduce the number of training samples to analyse the effectiveness of the comparison methods and the proposed method. The number of training samples per class under the 3-scale condition is changed from 20 to 100, and the numbers of training samples at the corresponding scales are obtained by the construction algorithm of Section III-A. The comparison methods are trained with 1-scale training samples. Fig. 16 shows the OA value of each method with different numbers of training samples. The proposed method performs better than the comparison methods, even with fewer training samples.

V. CONCLUSION
This article proposes the CNNs-CDL method for VHR remote sensing image classification. Through a novel multiscale sample construction and classification method, the number of training samples and the time complexity are reduced. Furthermore, a class-driven loss is proposed for pretraining the CNN to improve the spatial dependency and further improve the classification accuracy. Experiments were conducted on three VHR remote sensing images: Aerial data, Xian data, and Zh11 data. These datasets contain relatively complex land-cover information, especially the Xian and Zh11 data. Experimental results show that the proposed method achieves better performance than five state-of-the-art methods.
In general, multiscale classification methods can obtain better classification performance than single-scale methods. However, the time complexity will also increase. Although we tried to reduce the time complexity by controlling the number of samples of different scales, classification methods that can further balance accuracy and time cost need to be studied in our future work. In addition, the proposed method only considers the relationships among samples of different scales from the perspective of sample construction but does not consider the relationships among the features of different scale samples. We will consider the relationships among features of different scale samples in our future work, such as using RNN or LSTM network models.