Adaptive Multiscale Superpixel Embedding Convolutional Neural Network for Land Use Classification

Currently, a large number of remote sensing images with different resolutions are available for Earth observation and land monitoring, which inevitably demand intelligent analysis techniques for accurately identifying and classifying land use (LU). This article proposes an adaptive multiscale superpixel embedding convolutional neural network architecture (AMUSE-CNN) for tackling LU classification. Initially, the images are parsed via the superpixel representation so that the object-based analysis (via a superpixel embedding convolutional neural network scheme) can be carried out with the pixel context and neighborhood information. Then, a multiscale convolutional neural network (MS-CNN) is proposed to classify the superpixel-based images by identifying object features across a variety of scales simultaneously, in which multiple window sizes are used to fit the various geometries of different LU classes. Furthermore, a proposed adaptive strategy is applied to best exert the classification capability of the MS-CNN. Subsequently, two modules are developed to fully implement the AMUSE-CNN architecture. More specifically, Module I determines the most suitable classes for each window size (scale) by applying majority voting to a series of MS-CNNs. Module II carries out the classification of the classes identified in Module I for the given scale used in the MS-CNN and, therefore, completes the LU classification over all classes. The proposed AMUSE-CNN architecture is both quantitatively and qualitatively validated using remote sensing data collected from two cities, Kano and Lagos in Nigeria, chosen for their spatially complex LU distributions. Experimental results show the superior performance of our approach against several state-of-the-art techniques.

to severe degradation to the natural vegetation ecosystem. In turn, this has created substantial effects on the land use (LU), such as loss of prime agricultural lands, destruction of important wetlands, and increased flood risk [1]. Having access to reliable LU maps, which categorize scene images into a discrete set of meaningful LU classes, greatly assists tasks in natural resource monitoring and sustainable land management [2]. As the rate of LU change grows ever faster, particularly in developing countries [3], the importance of having accurate, up-to-date LU information grows too. Although these maps can be produced by hand, this is both time-consuming and expensive to do [4]. As access to high-power computing resources has significantly increased in recent years, it is perhaps unsurprising that various computational techniques have been presented that aim to automate this task. In the early days, due to the low spatial resolution of satellite images (such as the Landsat series), pixel-level classification paradigms dominated the studies in the field. Some popular machine learning methods were successfully applied to categorize scene images at the pixel level, including support vector machines (SVMs) and random forests [5]. However, as remote sensing images have reached fine spatial resolutions, the performance of these pixel-level methods has not kept pace, owing to issues inherent in high-resolution images, such as high within-class variability and low between-class separability [6]. As a result, object-level delineation and analysis of remote sensing images have emerged to study semantic entities or scene components rather than individual pixels, under the solution paradigm named object-based image analysis (OBIA) [7], [8], [9].
A typical method, OBIA-SVM [10], [11], which incorporates this object-based paradigm into the traditional SVM solution, has successfully moved from pixel-based techniques toward object-based representation for better LU classification. In the last decade, however, the successes of deep learning methods in many areas have led those in the remote sensing community to investigate their potential to become the new state of the art in the field. From convolutional neural networks (CNNs) [12], [13], to recurrent neural networks [14], to autoencoders [15], it is now commonplace to find deep learning methods being used to process remote sensing data, such as multispectral imaging [6], [16] and hyperspectral imaging [17], [18], where they often perform tasks such as LU classification with phenomenal precision [5], [6]. Among these methods, the pixelwise CNN has been widely applied by combining a pixel-based classifier with a context-based classifier, such as the multilayer-perceptron-based CNN [19], [20], providing different feature representations with strong complementarity. Recently, an object-based CNN (OCNN) [16] method has also been broadly accepted in LU classification, which incorporates the OBIA technique into the CNN framework to better learn LU objects via within-object and between-object information.
However, while deep learning methods are capable of producing high-accuracy LU classification maps, they face the challenge that they often require large volumes of labeled data to do so, which can be nearly as time-consuming and expensive as producing LU classifications by hand [5]. There have, therefore, been various studies that have implemented techniques such as transfer learning [21] and active learning [22] in order to attempt to replicate high-accuracy results with smaller quantities of training data; however, the performance improvement is limited, especially for data with complicated scenes. To address this challenge, a two-stage approach has been developed that delivers automatic, high-accuracy LU classification from decent-sized training data. More specifically, the multiscale superpixel CNN (MS-CNN) model is first proposed to allow object features to be learnt across multiple scales simultaneously, and then, an adaptive strategy based on majority voting is developed to exert the role of the MS-CNN by finding the most suitable scale for each class in order to obtain the best LU classification. To this end, the MS-CNN is designed to replicate the ability of deeper CNN models to extract features [23] but does so without using such a deep network, in order to avoid requiring the very large quantities of training data necessary to protect against overfitting and to ensure that the model generalizes well. The MS-CNN is, therefore, built upon an initial image segmentation using superpixels [24], where the superpixel segmentation is used to extract image patches that inform targeted sampling locations, so that fewer locations need CNN classification. Thus, the MS-CNN is an object-based CNN with contextual features for LU classification.
Nonetheless, from the viewpoint of its model architecture, the MS-CNN extracts multiscale features to classify all the objects without considering the geometric characteristics of each class; for example, in this study, the residential class may be better identified using a small scale, whereas the highway class may be better recognized using a large scale. To exploit this scale adaptability in the MS-CNN and improve the performance of LU classification, an Adaptive MUltiscale Superpixel Embedding Convolutional Neural Network (AMUSE-CNN) architecture is developed to fully exert the classification capabilities of the MS-CNN when dealing with complicated geographical situations. To this end, the AMUSE-CNN architecture is designed to consist of two modules, each of which performs its own respective function. Module I finds the most suitable classes for each window size (scale) by applying majority voting to the outcomes of a series of MS-CNNs. Module II applies the scale identified in Module I to the associated classes so that these classes can be better classified than with a single MS-CNN. With the seamless integration of the MS-CNN model and the adaptive strategy of scale selection, the proposed AMUSE-CNN architecture performs LU classification effectively and efficiently, as an adaptive, multiscale, and object-based LU classification system. We compare our method against the state-of-the-art techniques, OCNN [16], OBIA-SVM [11], and pixelwise CNN [20], on two real-world datasets in Nigeria.
The main contributions of this study can be summarized as follows.
1) An adaptive and object-based LU classification system is developed in terms of a superpixel embedding representation of remote sensing images.
2) For better extracting object features with regard to diverse geometries of LU objects, the MS-CNN model is designed by applying multiple window sizes to perform LU classification.
3) An adaptive strategy of scale selection is proposed to improve the performance of the MS-CNN using majority voting. The generated LU classification system is named AMUSE-CNN.
4) Two urban areas (Lagos and Kano) are used and annotated to validate the proposed AMUSE-CNN architecture.

II. PROPOSED METHOD
The proposed method is composed of two parts. First, an MS-CNN is designed to implement the initial LU classification capability with multiple window sizes. Then, building on this MS-CNN model, an AMUSE-CNN is proposed to provide an overall adaptive and accurate solution for LU classification, accounting for the fact that the MS-CNN's capability is limited by its reliance on feeding all data through predetermined window sizes.

A. AMUSE-CNN Architecture
As mentioned above, the MS-CNN relies on fixed window sizes, which results in a flawed LU classification because not all individual classes perform best under any single window size. This concern has driven the design and implementation of the AMUSE-CNN architecture with an adaptive strategy for enhancing the scale capability of the MS-CNN in LU classification. The AMUSE-CNN architecture consists of two modules whose block diagrams are shown in Fig. 1. Module I automatically assigns each LU class to its most suitable window size through majority voting, which then drives the training of further MS-CNN models in Module II, where each class is trained to its highest potential using the most suitable window size obtained from Module I. The relevant functions of each module are described below.
1) Module I: In this module, three specified window sizes are used to select the best window size for each class. In this study, the window sizes are 48, 96, and 120. As shown in Fig. 1, two functions carry out the selection for each specified window size.

a) Initial model training: A series of four separate MS-CNN architectures are trained with a given window size in order to generate a series of initial accuracies for each class, resulting in 4 × N votes for each class, where N is the number of window sizes (3 in this study).

b) Majority voting: Majority voting is applied to determine the most suitable window size for each class by comparing the accuracies obtained during initial training; the window size with the most votes for each class is assigned to that class.

Fig. 1. The AMUSE-CNN architecture consists of two modules. Module I conducts the initial training that finds the most suitable classes for a given window size via majority voting. Module II performs LU classification using the classifiers obtained from Module I for the relevant classes.

2) Module II: This module conducts the new MS-CNN model training so that accurate LU classification can be achieved. To this end, an MS-CNN model is trained for each window size, involving only the classes assigned to that window size in Module I, in order to gain the highest possible accuracies for these classes; for example, in this study, Classes 1, 3, and 6 are involved in the model training for the window size 48, as shown in Fig. 1. Afterward, the resulting models carry out the classification of their respective classes so that all the LU classes are finally classified, and the classification results are then output into a single table, as shown in Fig. 1.
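As an illustration of the voting step in Module I, the sketch below assigns each class its best window size from per-class validation accuracies. It is a minimal sketch under one plausible reading of the scheme (each of the four replicate MS-CNNs casts one vote per class for the window size on which that class scores highest, and the majority wins); the function name `assign_scales` and the toy accuracy values are illustrative, not taken from the article.

```python
from collections import Counter

WINDOW_SIZES = [48, 96, 120]

def assign_scales(acc, classes):
    """Assign each LU class the window size that most replicate MS-CNNs rank best.

    acc[r][w][c] is the validation accuracy of replicate r, window size w, class c.
    """
    assignment = {}
    for c in classes:
        # One vote per replicate: the window size where class c scores highest.
        votes = [max(WINDOW_SIZES, key=lambda w: acc[r][w][c])
                 for r in range(len(acc))]
        assignment[c] = Counter(votes).most_common(1)[0][0]
    return assignment

# Toy accuracies for two classes over 4 replicate MS-CNNs.
acc = [
    {48: {"Residential": 0.82, "Highway": 0.60},
     96: {"Residential": 0.75, "Highway": 0.70},
     120: {"Residential": 0.70, "Highway": 0.88}},
] * 4
print(assign_scales(acc, ["Residential", "Highway"]))
# {'Residential': 48, 'Highway': 120}
```

In Module II, one MS-CNN per window size would then be retrained on exactly the classes mapped to that size.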

B. MS-CNN Model
This MS-CNN model is proposed to implement a multiscale LU classification CNN, in which input data are rescaled in order to produce data at various sizes. In this study, three scales are used, namely, full, half, and quarter resolution, which means that when any image is entered into the MS-CNN model as input data, the image dimensions are halved and quartered in order to be provided as input to each individual CNN block. The block diagram of the MS-CNN model can be seen in Fig. 2.
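The rescaling step can be sketched as follows: a full-resolution input window is reduced to half and quarter resolution before being fed to the individual CNN blocks. This is a minimal NumPy sketch; the helper name `downscale` and the use of block averaging (the article does not specify the interpolation method) are assumptions.

```python
import numpy as np

def downscale(img, factor):
    """Block-average an (H, W, C) image by an integer factor."""
    h, w, c = img.shape
    return img[:h - h % factor, :w - w % factor].reshape(
        h // factor, factor, w // factor, factor, c).mean(axis=(1, 3))

window = np.random.rand(48, 48, 3)   # a full-resolution input window
pyramid = [window, downscale(window, 2), downscale(window, 4)]
print([p.shape for p in pyramid])    # [(48, 48, 3), (24, 24, 3), (12, 12, 3)]
```

Each element of `pyramid` would go to the pathway operating at the corresponding scale.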

1) LU Classification Problem Descriptor:
The set X of input data consists of A × B vectors, where A and B are the height and width of the data image, respectively. This can be expressed as

X = {x_i | i = 1, 2, ..., A × B}.    (1)

Each x_i also has a corresponding u_i and v_i corresponding to the pixel's vertical and lateral position in the image data, respectively, where

1 ≤ u_i ≤ A,  1 ≤ v_i ≤ B.    (2)

The set Y of output data contains each pixel's corresponding LU class, expressed as

Y = {y_i | y_i ∈ {1, 2, ..., K}, i = 1, 2, ..., A × B}    (3)

where K is the number of LU classes considered. The training set, L ⊂ X, is composed of a subset of labeled pixels for which target labels Y_L ⊂ Y are given. The task of LU classification is to estimate the unknown labels.
2) Superpixel-Based LU Classification: By grouping pixels that have sufficient similarity to one another (in terms of their spatial proximity and spectral homogeneity), it can be expected that such groups of pixels will tend to belong to a common output class. It is this assumption upon which superpixel-based segmentation methods are based. Exactly what metric is used to measure similarity and what is deemed "sufficient" similarity will vary with the choice of the segmentation algorithm and its parameters, but it suffices for now to say that algorithms that attempt to perform such a task exist.
Using the term superpixel to refer to these segmented pixel groups, each superpixel set S is a subset of X. The set X is divided into nonoverlapping superpixel sets whose collective union equals X, i.e.,

∪_S S = X    (4)

and

S_j ∩ S_k = ∅, ∀ j ≠ k.    (5)

The shorthand notation i ∈ S will be used to refer to the set {i | x_i ∈ S}. The common output class assumption supposes that

y_i = y_j, ∀ i, j ∈ S.    (6)

3) LU Classification of Superpixels Using Data Windows: In order to harness the power of convolutional layers to perform data-driven feature extraction, a second assumption is made, which considers that the LU label of the pixels within a superpixel depends only on a local neighborhood of the input data, a square window, S_w, centered on the segment's centroid. This allows any superpixel of arbitrary geometry to be classified using a square of the input data in its vicinity. Put probabilistically,

P(y_i | X) = P(y_i | {x_j | j ∈ S_w}), ∀ i ∈ S.    (7)

The superpixel centroid vertical and horizontal positions, denoted u_S and v_S, respectively, can be found according to

u_S = (1/|S|) Σ_{i ∈ S} u_i,  v_S = (1/|S|) Σ_{i ∈ S} v_i    (8)

where |S| refers to the cardinality of the superpixel set or, equivalently, the number of pixels contained within the superpixel. The shorthand i ∈ S_w, used to index pixels contained within the superpixel local neighborhood, refers to the set

S_w = {i | |u_i − u_S| ≤ W/2, |v_i − v_S| ≤ W/2}    (9)

where W is the width and height in pixels of the data neighborhood considered relevant for LU classification.
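The centroid computation and window extraction above can be sketched as follows, assuming superpixels are given as an integer label image. The function name `superpixel_window` and the clamping of windows at image borders (which the article does not discuss) are illustrative assumptions.

```python
import numpy as np

def superpixel_window(data, labels, sp_id, W):
    """Extract the W x W window centered on superpixel sp_id's centroid."""
    u, v = np.nonzero(labels == sp_id)                     # member pixel coords
    u_s, v_s = int(round(u.mean())), int(round(v.mean()))  # centroid (u_S, v_S)
    half = W // 2
    u0 = np.clip(u_s - half, 0, data.shape[0] - W)  # clamp at image borders
    v0 = np.clip(v_s - half, 0, data.shape[1] - W)
    return data[u0:u0 + W, v0:v0 + W]

labels = np.zeros((100, 100), dtype=int)
labels[40:60, 40:60] = 1                  # one square superpixel, id 1
data = np.random.rand(100, 100, 3)
patch = superpixel_window(data, labels, 1, 48)
print(patch.shape)                        # (48, 48, 3)
```

The extracted patch is what the MS-CNN classifies, and its predicted label is then shared by every pixel of the superpixel under assumption (6).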
4) Data-Driven Feature Extraction Using CNNs: CNNs are designed such that successive convolutional layers are capable of learning increasingly abstract and translationally invariant features using relatively few parameters [25]. This makes them very powerful in many computer vision tasks, particularly LU classification, as in OCNN [16] and pixelwise CNN [20]. While these methods are capable of producing state-of-the-art performance, their ability to learn multiscale features is limited by how many filters they contain and the depth of their architecture, which are, in turn, limited by the amount of training data available. With insufficient training data, deeper CNNs are liable to overfit training data and perform poorly on unseen data regions [19].

5) MS-CNN Architecture:
In order to more easily allow a model to identify features at various scales, we propose the MS-CNN architecture, as shown in Fig. 2, which directly operates at multiple scales simultaneously. For a given scale, if features that would distinguish one class from another exist at that scale, the CNN block (see Fig. 2) that has filters of size similar to those features ought to more readily learn to identify them.
The design of the MS-CNN model is based upon the notion that, when working with data resolution as high as 50 cm per pixel, some of the features useful for performing LU classification might be comparatively large in terms of the number of pixels they make up. While a deep CNN would likely be capable of identifying these features, the MS-CNN model should be able to learn them with comparatively fewer layers used in succession. Fig. 2 illustrates that the proposed MS-CNN architecture has three main components: Multiple Scales, CNN Blocks, and Classification Block. The Multiple Scales component provides multiscale routes for the MS-CNN to learn features through parallel pathways, where each pathway operates at a different scale (e.g., half or quarter resolution). The CNN Blocks component extracts the features for each pathway at its respective scale. Each convolutional block is made up of repeated units, consisting of a convolutional layer, a max-pooling layer, and a rectified linear unit (ReLU) activation function. In the Classification Block, the feature maps yielded by the parallel pathways are combined using a simple linear combination, which is encompassed within the "weighted feature combine" layer. This layer has one scalar parameter per resolution pathway such that its output, X, is given by

X = Σ_i w_i X_i    (10)

where X_i refers to one of its inputs coming from the output of the ith pathway. This combines the features identified at each of the network scales into one feature map according to a pathway importance weight w_i, whose value is based on empirical knowledge. In this study, each pathway is assumed to be equally important, so w_i = 1/3 is given for each i as three pathways are involved. Subsequently, as shown in Fig. 2, the resulting feature map is flattened and classified using fully connected layers, finally applying a softmax activation function to the model output.
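The "weighted feature combine" layer described above amounts to a per-pathway scalar weighting. Below is a minimal sketch, assuming the pathway outputs have already been brought to a common shape (a resizing step the article does not detail); the function name is illustrative.

```python
import numpy as np

def weighted_feature_combine(feature_maps, weights):
    """Linear combination X = sum_i w_i * X_i of same-shape pathway outputs."""
    assert len(feature_maps) == len(weights)
    return sum(w * X for w, X in zip(weights, feature_maps))

# Three toy pathway outputs (constant maps 1, 2, 3) with equal weights 1/3.
maps = [np.full((6, 6, 32), v) for v in (1.0, 2.0, 3.0)]
combined = weighted_feature_combine(maps, [1/3, 1/3, 1/3])
print(combined[0, 0, 0])   # (1 + 2 + 3) / 3 = 2.0
```

With learned (rather than fixed) weights, the same layer could let training decide each pathway's importance.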

III. EXPERIMENTS

A. Area and Data of Study
Using very fine spatial resolution remotely sensed optical images that were released as part of Digital Globe's Open Data Program, two cities of Lagos and Kano, Nigeria, were selected for LU classification analysis. The data, seen in Fig. 3, have three bands (red, green, and blue) and a spatial resolution of 50 cm per pixel.

B. Data Annotation
The OpenStreetMap (https://www.openstreetmap.org) definition of LU ("what the area is used for by humans") will be adopted as the definition of LU used in this study. Following a consideration of the LU classes recognized by OpenStreetMap and a visual analysis of the Lagos and Kano data, the set of LU classes described by Tables I and II was adopted.
By labeling polygons within the raster, which were accurate representations of each of these LU classes, a set of labels was obtained, which could be used to represent both the training and validation sets. The data annotation was not exhaustive, with the total annotated area making up 36.8% of the total area. The labeled polygons were randomly assigned into one of five sets, each acting as the validation set for one experiment in a fivefold cross-validation analysis, with the others acting as that particular k-fold experiment's training data.

TABLE II. Identified LU classes within the Kano dataset and brief descriptions of what they encompass.
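The fivefold assignment of labeled polygons can be sketched as follows; the function name `assign_folds` and the round-robin assignment after shuffling are illustrative assumptions, not the article's exact procedure.

```python
import random

def assign_folds(polygon_ids, k=5, seed=0):
    """Randomly assign labeled polygons to k cross-validation folds."""
    rng = random.Random(seed)
    ids = list(polygon_ids)
    rng.shuffle(ids)
    return {pid: i % k for i, pid in enumerate(ids)}

folds = assign_folds(range(100), k=5)
# Fold f's polygons validate the f-th experiment; the rest train it.
val_0 = [p for p, f in folds.items() if f == 0]
print(len(val_0))   # 20
```

Assigning whole polygons (not individual pixels) to folds keeps spatially adjacent pixels out of both the training and validation sets of the same experiment.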

C. Initial Image Segmentation
Different segmentation algorithms place different importance on various measures of pixel homogeneity (e.g., texture, color, and spatial proximity). In order to segment the data into superpixels that would represent locally coherent subregions with a shared LU class [as per the assumption of (6)], the mean-shift algorithm [26], implemented using the open-source software Orfeo Toolbox, was selected. The parameters of the algorithm were adjusted to obtain superpixels small enough to sufficiently separate individual LU classes but not so small that the number of superpixels making up the image became unnecessarily high, which would increase inference times (the model inference time increases approximately linearly with the number of superpixels in the data). The mean-shift algorithm has three main hyperparameters: the spatial radius, the range radius, and the minimum region size. Following a trial-and-error approach, these were set to 30, 8, and 196, respectively.
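The mode-seeking idea underlying mean shift (iteratively moving each point to the mean of its neighbors within a radius until it settles on a density mode) can be illustrated with a toy NumPy sketch. This is not the Orfeo Toolbox implementation; the radius, the data, and the flat kernel are illustrative simplifications.

```python
import numpy as np

def mean_shift_modes(points, radius, n_iter=10):
    """Shift each point toward the mean of its neighbors until it reaches a mode."""
    shifted = points.astype(float).copy()
    for _ in range(n_iter):
        for k, p in enumerate(shifted):
            d = np.linalg.norm(points - p, axis=1)       # distances to all points
            shifted[k] = points[d <= radius].mean(axis=0)  # flat-kernel mean shift
    return shifted

rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(0, 0.5, (30, 2)),   # cluster near (0, 0)
                 rng.normal(8, 0.5, (30, 2))])  # cluster near (8, 8)
modes = np.unique(np.round(mean_shift_modes(pts, radius=5.0)), axis=0)
print(len(modes))   # two well-separated clusters -> 2 modes
```

In an image setting, the feature vector of each pixel would combine spatial position and spectral values, with the spatial and range radii bounding each component of the neighborhood.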

D. Parameter Settings

1) MS-CNN Model:
For the purposes of this investigation, MS-CNN models were trained using three scales (full, half, and quarter resolution). Various models with various window sizes have been trained, in order to determine the optimal window size for performance. The CNN block shown in Fig. 2 for the models trained consisted of three repeated blocks, each consisting of a convolutional layer (32 3 × 3 kernels), a maximum pooling layer (2 × 2 window), and a ReLU activation function. The first dense layer had 32 neurons and the second had seven neurons (one for each LU class).
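Under the stated configuration, the spatial size of the feature map entering the classification block can be worked out per pathway. This sketch assumes 'same' convolution padding (not stated in the article), so each repeated block simply halves the spatial size via its 2 × 2 pooling.

```python
def mscnn_feature_shape(window, n_blocks=3, n_kernels=32):
    """Feature-map shape after repeated [conv 3x3 ('same') -> 2x2 max-pool] blocks."""
    size = window
    for _ in range(n_blocks):
        size //= 2   # 'same' conv keeps size; 2x2 pooling halves it
    return size, size, n_kernels

for scale, w in [("full", 48), ("half", 24), ("quarter", 12)]:
    print(scale, mscnn_feature_shape(w))
# full (6, 6, 32), half (3, 3, 32), quarter (1, 1, 32)
```

This kind of shape walk is a quick sanity check that the chosen window size survives three pooling stages at every pathway resolution.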
2) AMUSE-CNN Architecture: To effectively implement the adaptive strategy, four separate MS-CNN models are used to conduct the initial model training, and thus, 4 × 3 = 12 outcomes participate in majority voting (here, 3 is the number of window sizes used in this study).
3) Sample Size: With the annotated data introduced in Section III-B, the number of samples used for training and testing the MS-CNN and AMUSE-CNN is shown in Table III.

TABLE III. Number of samples used for training and testing MS-CNN and AMUSE-CNN.

E. LU Classification Results and Analysis
To thoroughly evaluate the AMUSE-CNN model, three state-of-the-art LU classification models were used as benchmarks for the performance comparison: OBIA-SVM [11], Pixelwise CNN [19], and OCNN [16]. In addition, the performance of the MS-CNN at each scale is also presented to show the role of the adaptive strategy used in the AMUSE-CNN. In the following subsections, a quantitative analysis is first carried out, followed by a qualitative comparison.

1) Quantitative Results: Tables IV and V show the classification accuracies achieved for each of the models assessed. The classification accuracy is measured as the percentage of the area in the validation set representing a given class that is correctly classified. Average accuracy (AA) is calculated by averaging the classification accuracies over all the classes. Likewise, the overall accuracy (OA) is measured as the percentage of the area of all the polygons in the validation set that is correctly classified. The accuracies presented are the averages achieved over fivefold cross validation.

It is clearly shown in Tables IV and V that the AA of 93.25% achieved by the AMUSE-CNN in Kano is significantly better than the 85.44% in Lagos, while its OA of 90.16% in Kano is clearly inferior to the 96.71% in Lagos. These accuracy differences are determined by the geographical complexities of the two cities and the nature of the relevant LU classes. In Lagos, some classes, such as Residential and Retail, are very similar in color and geometry, which hampers the differentiation of the relevant classes and results in a lower AA in Lagos compared to Kano. Fig. 8, presented in Section III-E2, visually shows how this issue affects the performance of the relevant methods. On the other hand, the Lagoon class takes up most of the area in Lagos, so its high detection accuracy dominates the OA of Lagos, which, therefore, greatly helps Lagos achieve a better OA than Kano.
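The AA and OA measures can be sketched as follows, simplified to per-pixel rather than per-area counting; the function and variable names are illustrative. The sketch also makes the AA/OA divergence concrete: a dominant, well-classified class lifts OA without lifting AA.

```python
import numpy as np

def class_accuracies(y_true, y_pred, n_classes):
    """Per-class accuracy, average accuracy (AA), and overall accuracy (OA)."""
    per_class = np.array([
        (y_pred[y_true == c] == c).mean() for c in range(n_classes)])
    aa = per_class.mean()          # unweighted mean over classes
    oa = (y_pred == y_true).mean() # fraction of all samples correct
    return per_class, aa, oa

# Class 0 dominates and is always correct; class 1 is correct half the time.
y_true = np.array([0, 0, 0, 0, 1, 1])
y_pred = np.array([0, 0, 0, 0, 1, 0])
per_class, aa, oa = class_accuracies(y_true, y_pred, 2)
print(per_class, aa, oa)   # [1.0, 0.5], AA = 0.75, OA = 5/6
```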
To demonstrate the benefits of the adaptive strategy applied in the AMUSE-CNN, the results for the MS-CNN model at each single scale are presented in Tables IV and V. The AMUSE-CNN clearly outperforms the MS-CNN in both Lagos and Kano. For example, for Residential in Lagos, the MS-CNN at scale 120 achieves the best accuracy of 60.99%, while the AMUSE-CNN reaches 78.34%; for Industrial in Kano, 72.81% is the best accuracy obtained by the MS-CNN at scale 96, while 91.11% is achieved by the AMUSE-CNN.
Compared to the state-of-the-art methods, the traditional handcrafted method, OBIA-SVM, is clearly ineffective for LU classification on these two datasets, with AAs of only 42.67% and 46.60%, respectively. In particular, Retail (0.17%) in Lagos is almost entirely misclassified due to its similarity in color to Residential. Pixelwise CNN and OCNN perform much better than OBIA-SVM. However, our AMUSE-CNN significantly outperforms these two methods in both AA and OA, for example, AA: 85.44% versus 78.59% versus 82.20% in Lagos and 93.25% versus 81.70% versus 83.70% in Kano. Tables IV and V show that the AMUSE-CNN is able to classify well some difficult classes on which Pixelwise CNN and OCNN perform poorly, for example, Residential and Retail in Lagos, and Industrial and Residential in Kano.
On a class-by-class basis, the AMUSE-CNN best classifies six of the seven LU classes for Lagos, the exception being Brownfield Construction (where Pixelwise CNN is best at 85.43%). The nearest benchmark by AA, OCNN, classifies the classes better on average than any of the other benchmark methods for both Lagos and Kano. For the most part, the Highway and Lagoon LU classes in Lagos and the Road and Vegetation classes in Kano are the easiest to classify (with all models achieving > 75% accuracy). The classes that have proved most difficult to classify (Industrial and Residential) are both classes where the AMUSE-CNN has greatly excelled relative to the benchmarks, particularly in Kano.
To further demonstrate the overall performance and strength of the AMUSE-CNN, the receiver operating characteristic (ROC) curve is plotted for each model used in this study, as shown in Figs. 4 and 5 for Lagos and Kano, respectively. These figures illustrate that the AMUSE-CNN achieves a higher TPR (sensitivity) and a lower FPR (1 − specificity) than the benchmarks at any threshold setting. Thus, the classification performance of the AMUSE-CNN in both Lagos and Kano is markedly superior to those of the benchmarks.
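The (FPR, TPR) pairs that trace an ROC curve can be sketched for a one-vs-rest score vector as follows; the function name, scores, and labels below are illustrative.

```python
import numpy as np

def roc_points(scores, labels, thresholds):
    """TPR and FPR of a one-vs-rest classifier at each score threshold."""
    pts = []
    for t in thresholds:
        pred = scores >= t
        tpr = (pred & (labels == 1)).sum() / (labels == 1).sum()  # sensitivity
        fpr = (pred & (labels == 0)).sum() / (labels == 0).sum()  # 1 - specificity
        pts.append((fpr, tpr))
    return pts

scores = np.array([0.9, 0.8, 0.3, 0.6, 0.2, 0.1])  # class-membership scores
labels = np.array([1,   1,   1,   0,   0,   0])    # 1 = positive class
print(roc_points(scores, labels, [0.0, 0.5, 1.0]))
# [(1.0, 1.0), (0.333..., 0.666...), (0.0, 0.0)]
```

Sweeping many thresholds and plotting the resulting points yields the curves of Figs. 4 and 5; a curve closer to the top-left corner indicates a better classifier.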
2) Qualitative Results: In order to visually compare performance, Fig. 6 presents one randomly selected region of Lagos (a 2000 × 2000 pixel area) alongside the ground truth (GT) annotations for that region and the LU classification output of each trained model. Where used, the subscripts refer to the window size of the model. These visual results show that the overall performance of the AMUSE-CNN is much better than that of the benchmarks and the MS-CNN, although some classes are still not classified perfectly due to between-class similarity, such as a small portion of "Retail" being misclassified as "Industrial." Fig. 7 shows another selected region, from the Kano data, to visually compare the performance of our method and the benchmarks. It clearly illustrates that the AMUSE-CNN classifies most classes well, except for a small portion of "Retail" being misclassified as "Residential." Promisingly, referring to the GT shown in Fig. 7, the AMUSE-CNN performs the classification much more accurately than the benchmarks and the MS-CNN. To look more closely into the classification details and issues of particular classes, we zoom in on selected parts of the images and present the classification results of these zoomed-in portions as follows.
"Residential" and "Retail" are classes with similar geographical characteristics in Lagos, and the benchmarks thus often confuse them. Fig. 8 illustrates the classification results of a selected region in which the AMUSE-CNN classifies these two classes well, whereas the benchmarks perform poorly.
In Lagos, "Industrial" is the class on which the AMUSE-CNN performs worst. Fig. 9 illustrates that the indistinguishable boundaries of "Highway" and "Retail" with "Industrial" interfere with the classification, significantly reducing the performance of the AMUSE-CNN on "Industrial." Even so, the AMUSE-CNN clearly outperforms the benchmarks, as visually presented in Fig. 9. Fig. 10 presents the classification results of the selected region in Kano where the AMUSE-CNN's performance on "Industrial" and "Residential" has the lowest accuracy in comparison with the other classes. It illustrates that the benchmarks are prone to classifying "Residential" as "Industrial," while a small portion of "Industrial" is wrongly classified as "Residential" by the AMUSE-CNN. Nevertheless, our proposed method clearly classifies these two classes better than the benchmarks.

F. Ablation Study
To examine the benefits of the superpixel solution, a nonsuperpixel-based MS-CNN is employed in one ablation study. Without loss of generality, the model is trained at the window size 48, and the results obtained are shown in Table VI. Thus, we can conclude that the superpixel-based MS-CNN is more accurate and efficient than the nonsuperpixel-based MS-CNN and is, therefore, the most suitable for use in the AMUSE-CNN. Another ablation study aims to demonstrate the role of Module II in the AMUSE-CNN, for which we conducted LU classification using only Module I of the AMUSE-CNN. Table VII presents the results generated from Module I, in which the "Best"

G. Computational Environment and Complexity
To compare the computational costs of the AMUSE-CNN with those of the benchmarks, an analysis of the time taken to train each model and for each model to perform inference on the complete data is presented in Tables VIII and IX. The experiments were carried out on a machine with an Intel i5-9400F CPU, 32-GB RAM, and an NVIDIA Quadro M400 GPU.
Overall, the time needed to train and to perform inference is at a similar level for the MS-CNN and the OCNN. Among the benchmarks, OBIA-SVM is quite time-consuming, while the Pixelwise CNN stands out as the fastest to train but the slowest at inference. For the AMUSE-CNN, because multiple MS-CNN models are involved in the adaptive strategy, its computation is costly and time-consuming; in compensation, however, its LU classification accuracy is greatly improved in comparison with the benchmarks.
Referring to the CNN time computation model introduced in [27] and [28], the computational complexity of the MS-CNN is estimated as O(N × D × S² × M²), where N is the number of scales used, D is the number of layers, S is the spatial size (length) of the kernel, and M is the spatial size of the output feature map. For the AMUSE-CNN, due to the multiple MS-CNN models trained for majority voting, its computational complexity is approximately that of a single MS-CNN multiplied by the number of MS-CNN models involved.
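As a rough illustration of the O(N × D × S² × M²) estimate, the product can be treated as a proportional (not absolute) cost; the factor of 12 below reflects the 4 × 3 MS-CNN models trained in Module I, and all concrete numbers are illustrative.

```python
def mscnn_cost(n_scales, n_layers, kernel, fmap):
    """Proportional training cost O(N * D * S^2 * M^2) of one MS-CNN."""
    return n_scales * n_layers * kernel ** 2 * fmap ** 2

# One MS-CNN: 3 scales, 3 conv layers, 3x3 kernels, 48-pixel feature maps.
single = mscnn_cost(n_scales=3, n_layers=3, kernel=3, fmap=48)
amuse = 12 * single   # Module I alone trains 4 MS-CNNs per window size
print(amuse // single)   # 12
```

Such a back-of-the-envelope count explains the training-time gap between the MS-CNN and the AMUSE-CNN seen in the timing tables.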

IV. DISCUSSION
In the proposed solution scheme, the MS-CNN model is first proposed to improve upon the benchmark methods by offering direct routes for multiscale feature identification through the use of scale-specific CNN blocks. Furthermore, the AMUSE-CNN architecture is designed to extend the capability of LU classification with the use of majority voting to implement an adaptive strategy. The experimental results shown in Sections III-E1 and III-E2 demonstrate that the AMUSE-CNN clearly outperforms the state-of-the-art methods. However, for better LU classification, there remains much room to improve the AMUSE-CNN, in terms of both classification accuracy and computational efficiency.
In this study, data preprocessing is mainly based on homogeneity representation using superpixels and the mean-shift method. However, it is worth noting that remotely sensed satellite data are prone to spectral variabilities, such as vertical artifacts [29]. Some potential techniques, for example, the augmented linear mixing model [30], could be employed to improve data quality at the preprocessing stage.
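As a toy illustration of the mean-shift idea underlying the homogeneity preprocessing, the 1-D flat-kernel sketch below moves each point toward the mean of its neighbors until it settles on a mode; the pixel values and bandwidth are illustrative, and a real pipeline would operate on multi-band spatial-spectral features:

```python
def mean_shift_1d(points, bandwidth, iters=20):
    """Toy flat-kernel mean shift: each point iteratively moves to the
    mean of the input points within `bandwidth`, converging to a mode."""
    modes = list(points)
    for _ in range(iters):
        modes = [
            sum(p for p in points if abs(p - m) <= bandwidth)
            / max(1, sum(1 for p in points if abs(p - m) <= bandwidth))
            for m in modes
        ]
    return modes

# Two homogeneous regions: values cluster around 0.11 and 0.80
pixels = [0.1, 0.12, 0.11, 0.8, 0.82, 0.79]
modes = mean_shift_1d(pixels, bandwidth=0.2)
```

Pixels that converge to the same mode are treated as one homogeneous region, which is the intuition behind mean-shift-based superpixel representation.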
It is clear that the overall performance of the AMUSE-CNN depends greatly on that of the MS-CNN. In Tables IV and V, the improved AA and OA offered by the MS-CNN at different scales suggest that multiscale architectures for LU classification merit further investigation. That being said, the improvements observed are not seen across every class; this indicates that some LU classes benefit more from multiscale feature learning than others, which motivates the adaptive strategy at the core of the proposed AMUSE-CNN architecture. Future work could consider a different model selection scheme in place of majority voting.
With regard to the window sizes used, although the MS-CNN models presented have used half and quarter resolution as their additional scales, this choice was somewhat arbitrary, and further investigation might analyze how the choice of scales in the MS-CNN affects the results, which should, in turn, improve the performance of the AMUSE-CNN. Similarly, the choice of operating with three scales was also somewhat heuristic, and an extension of this work might assess the balance between performance and increased computational cost by comparing MS-CNNs operating with more or fewer scales. In addition, four MS-CNN models are used to conduct majority voting in the current study; future studies could vary the number of models involved in the scale selection process, for example, three or five.
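One simple way to realize the full-, half-, and quarter-resolution scales discussed above is stride-based subsampling of progressively wider context windows around a superpixel center. The function below is only an illustrative sketch; the window size, strides, and clamped border handling are our assumptions, not the paper's implementation:

```python
def multiscale_windows(image, cy, cx, base, scales=(1, 2, 4)):
    """Crop a (2*base) x (2*base) window around (cy, cx) at full, half,
    and quarter resolution by subsampling with strides 1, 2, and 4.
    `image` is a 2-D list; borders are clamped for brevity."""
    h, w = len(image), len(image[0])
    windows = []
    for s in scales:
        half = base * s  # a wider field of view at each coarser scale
        rows = range(max(cy - half, 0), min(cy + half, h), s)
        cols = range(max(cx - half, 0), min(cx + half, w), s)
        windows.append([[image[r][c] for c in cols] for r in rows])
    return windows

img = [[r * 100 + c for c in range(64)] for r in range(64)]
w1, w2, w4 = multiscale_windows(img, cy=32, cx=32, base=4)
# All three windows are 8x8, but cover 8x8, 16x16, and 32x32 pixel extents.
```

Each window keeps the same tensor shape while covering a larger spatial extent, so all scales can feed scale-specific CNN blocks of identical input size.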
Regarding the sample size in this study, Table III shows the number of training and testing samples used for each class. It is worth noting that the number of samples used is largely dependent on the size of the dataset available. As introduced in Section III-B, annotated data covered only 36.8% of the entire dataset, which explains the seemingly low sample counts for some classes, such as the retail class in the Lagos dataset. Moreover, the number of samples available for each class depends on how populated each urban dataset is with that class; thus, the number of retail samples in Lagos was extremely small. From our empirical experience, a good balance between the numbers of samples used for each class is needed to keep performance as high as possible while preventing data imbalance from causing training issues such as overfitting.
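One common way to mitigate such imbalance, sketched below with invented class counts, is to weight the training loss by inverse class frequency; this is a standard heuristic offered here for illustration, not the balancing method used in this study:

```python
def inverse_frequency_weights(counts):
    """Per-class weights inversely proportional to sample counts,
    normalized so that the average weight over all samples is 1."""
    total = sum(counts.values())
    n = len(counts)
    return {cls: total / (n * c) for cls, c in counts.items()}

# Hypothetical per-class sample counts (e.g., a scarce retail class)
counts = {"residential": 800, "industrial": 400, "retail": 50}
weights = inverse_frequency_weights(counts)
# The scarce retail class receives a much larger weight than residential.
```

Scaling each sample's loss by its class weight makes underrepresented classes contribute comparably to the gradient without discarding majority-class samples.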
Although two different urban datasets are used to validate the AMUSE-CNN in this study, it is anticipated that the improved results could be replicated on more datasets. The next step would be to reproduce similar improvements on datasets collected from other continents, which would not only pose challenges unique to new datasets but also present some of the same complexities faced in classifying the Lagos and Kano datasets.
Another major challenge for the proposed method is reducing its computational complexity and making the solution scheme more efficient. The inherent drawback of majority voting is that it substantially reduces the computational efficiency of the AMUSE-CNN; thus, a more efficient scale selection method should be adopted in the future. In addition, computational capability could be improved with better hardware, such as TPUs, for more efficient calculation.

V. CONCLUSION
This article presents an adaptive LU classification CNN architecture, namely, the AMUSE-CNN, based on a proposed multiscale feature learning model (MS-CNN) designed to better identify the subtle visual cues that allow a region's LU to be determined. Using superpixels obtained via a segmentation of the data to represent regions of locally homogeneous LU, the MS-CNN learns to identify features at multiple scales simultaneously in order to classify superpixels and obtain a semantic segmentation of the data. Building on the MS-CNN, the AMUSE-CNN is designed to find the most suitable scale for each class and, thus, achieve superior classification accuracy through a scale selection process based on majority voting. The AMUSE-CNN performance is thoroughly examined on two urban datasets, Lagos and Kano in Nigeria, in comparison with three state-of-the-art benchmarks. A barrier to the AMUSE-CNN is the time-consuming scale selection step; two avenues for reducing the computational complexity could be valuably attempted in future work. One is to adopt a more efficient scale selection scheme in place of majority voting. Another is to optimize the calculation process with more efficient code and high-performance hardware units such as TPUs. In addition, to alleviate the effect of spectral variability, such as vertical artifacts [29], an augmented linear mixing model [30] could be employed for data preprocessing.