Learning Multi-Granularity Neural Network Encoding Image Classification Using DCNNs for East Africa Community Countries

Remote sensing scene classification is a fundamental task in earth observation, aiming at identifying granular information for land cover classification. Multi-granular land use categorization of multi-source remotely sensed images is now a principal task in remote sensing data augmentation and data selection. Understanding image representations is essential for the scene classification task. Training deep learning models for scene classification brings remarkable achievements, but these high-level approaches are computationally expensive and time-consuming. This paper introduces a multi-granularity neural network encoding architecture, based on the InceptionV3, InceptionResNetV2, VGG16, and DenseNet201 architectures, for remote sensing scene classification, with the aim of improving performance and addressing the intra-class variation problem in multi-class remote sensing datasets. Pre-trained CNNs, activation functions, and ensemble learning are adopted. InceptionV3 and VGG16 are used to extract features, while InceptionResNetV2 is used for fine-tuning, which consists of unfreezing the entire model and retraining it on the new data with a lower learning rate. The proposed fine-tuning of the whole pre-trained model produces better test results, up to 97.84%, than the features extracted by InceptionResNetV2. We also use DCNN ensemble averaging and weighted averaging to achieve better outcomes of 97.36% and 99.10%, respectively. In our experiments, we fine-tuned the deep convolutional neural network (DCNN) training method for remote sensing scene classification on two public datasets, UCM and SIRI-WHU, and one dataset we collected through Google Earth Engine from the East Africa Community Countries (EACC), comprising nine classes with a total of 2112 labeled images. The results indicate that our proposed fine-tuning of the pre-trained model increases accuracy with few epochs and less computational time.


I. INTRODUCTION
Land use is one of the most important applications of Remote Sensing (RS) data [1]. RS technology helps collect electromagnetic radiation information from Earth surface targets via satellites and aircraft and identify the earth's environment and resources.
It has been used in many practical applications such as RS classification, RS object detection, and RS segmentation. Image classification methods based on deep convolutional neural networks (DCNNs) are one way of turning private and public sensing data into meaningful categories [2], [3]. The land cover and land use classification system provides remote sensor data [4]-[6]. Land use (LU) plays an essential role in environmental protection, urban planning, and economic resource management [7]. Land cover detection is necessary to update land cover maps and manage natural resources, constituting a genuine challenge for building environmental satellite imagery applications [8]. LCLU poses different challenges such as various data structures, varying data sizes, uncorrelated physical properties, and scarcity of training data [9]. There is still a lack of a hybrid model for analyzing the spatial, spectral, and temporal approaches needed to address these information granularities and to support decisions for overcoming issues such as hazards, uncontrolled development, deteriorating environmental quality, loss of prime agricultural lands, destruction of essential wetlands, and loss of forest. As one application of RS, LC and LU can capture various information about the earth and its neighborhood through different methods [10]. It is possible to collect a massive amount of data using imaging technology [11]. Classification methods are crucial components in knowledge discovery from remotely sensed images; recent studies have focused on machine learning methods, and supervised, unsupervised, and reinforcement classification methods are widely used to classify remotely sensed images. It remains a challenge to collect labeled sample datasets in remote sensing areas due to the high cost and time-consuming process.
Deep convolutional neural networks (DCNNs) have shown powerful performance by extracting multi-granularity and multi-level features with a hierarchical architecture, efficiently encoding each image into a classification probability vector containing global characteristics. Using DCNNs to address RS scene classification problems raises two main issues: considerable intra-class variation and confusing information. To analyze multi-source data for ecosystems, [12] introduced learning granularity levels such as contrast, energy, and the correlation between land use and land cover class label variation. The class level has been designed to consider the purpose of the service. The main problem of land cover is that natural resource inventories in every survey define the same categories differently.
According to feature extraction and representation, existing methods can be characterized as follows: (1) low-level features, (2) intermediate features, and (3) high-level features, i.e., methods based on deep features. Low-level and intermediate features are known as traditional methods; they extract and represent remote sensing image features in a shallow layer, and complex patterns cannot be captured [13]. Low-level feature methods use image texture, shape, color, intensity, and other characteristic features extracted directly from the image, combined with supervised learning, to classify the image into several semantic classes. For example, low-level features such as color histograms were used to extract global features in [14], and the Gabor feature has been applied to produce semantic labels with a specific belief for each image. In [15], morphological texture features were used to avoid the limitations of spectral features; they were extracted with a circular covariance histogram and are rotation-invariant. Other research introduced intermediate features to reduce the gap between low-level and high-level image processing. Improving image scene classification accuracy motivated the traditional bag of visual words (BoW), which has been enhanced by integrating color information into the classification system. Furthermore, mid-level features such as the local binary pattern (LBP), scale-invariant feature transform (SIFT), histogram of oriented gradients (HOG), and spatial envelope (Gist) have been applied [16] to counter the information loss caused by the influence of spatial correlation on scene images. In [17], a multiscale LBP feature was extracted from dense patches to obtain a local patch descriptor, and the Fisher vectors were concatenated as the final representation for an image. Other research introduced the multi-neighborhood LBP method [18] for small image patches combined with frequency-domain smoothing to extract features from an image; the bag of visual words (BoVW) and a support vector machine (SVM) were then used for feature representation and classification. In the remote sensing image registration task, the scale-invariant feature transform (SIFT) [19] improves registration performance. To effectively improve the accuracy of image description and the precision of image classification, [20] combines principal component analysis (PCA) and the processed scale-invariant feature transform (P-SIFT). Meanwhile, machine learning methods have been developed in RS, such as the support vector machine (SVM) [21], iterative self-organizing data analysis (ISODATA) [22], back propagation [23], random forest [24], Hamming distance [25], and multiscale segmentation [26].
High-level feature extraction methods in computer vision have successfully used deep learning for tasks such as image classification [27], object detection [28], and image retrieval [29]. Deep learning work focused on image recognition [30] proved that CNNs and DBNs evaluate the MNIST dataset better. Convolutional neural networks (CNNs) provide a powerful ability for feature extraction [31] to improve classification accuracy under low intra-class and high inter-class variation. For high-resolution remote sensing image segmentation [32], a generative adversarial network (GAN) has been introduced, using SegNet and combining encoder and decoder features to achieve high-quality building segmentation. Research has continued to prove the value of deep attention: the multiscale network-based ATD-LinkNet model [33] achieved better results for remote sensing image segmentation.
Deep learning first grew in big data analysis fields such as image analysis, pattern recognition, data mining, content-based image retrieval, computer vision, and artificial intelligence. The success of CNNs made them one of the best technologies in 2013, through a variety of deep neural networks such as AlexNet [34], GoogleNet, VGGNet [35], and ResNet [36]. CNNs have been introduced into computer vision tasks to learn low-level, middle-level, and high-level granularity. Deep convolutional neural networks and fine-tuning are needed to classify remote sensing imagery.
Recent research has witnessed meaningful improvements in various classification tasks using deep convolutional neural networks, such as feature extraction for classification [37]. For practical use in Remote Sensing (RS) applications, EfficientNet-B3 [38] has improved RS scene classification accuracy. Studies on urban land cover and land use patterns have considered the smallest objects in high spatial resolution image classification. High spatial resolution images have improved a significant amount of natural geospatial information processing and are generally used in object-oriented classification methods. Extracting and analyzing urban land use or land cover with large information granularity in a high-level architecture can perform significantly well.
In this paper, we implement multi-granularity neural network encoding for learning granular information from remote sensing images, building the classification architecture on the intuition of pre-trained CNN models. Beyond global features, the fine-grained stage is helpful for RS scene classification. For example, we can easily recognize a ''railway station'' when we see a ''train,'' and it is similarly easy to recognize an ''airport'' when we see a ''plane.'' The second challenge is that an RS image contains hidden semantic structural granular information, which should be explorable without detailed annotation such as bounding boxes or pixel-level annotations. Here, we explore RS images, which contain redundant and confusing information granularity. For example, we cannot focus only on the central white tower if we need to distinguish a ''church'' from storage tanks; we need more structural details such as the tower, a red cross, and the surrounding objects to make the decision.
According to the above hypothesis, we propose a Multi-Granularity Feature Encoding Ensemble Network (MGFEEN) to address the RS scene classification task. Besides learning the granularity level, in an RS image the global features hold important high-level information. To extract features, we used baseline models (VGG16, DenseNet201, InceptionV3, and InceptionResNetV2). Each baseline is trained using cross-entropy loss with different weights. During validation, each baseline's final classification probability vector is fused to vote for the final classification result. The novel contributions are as follows: • We present a middle-level CNN technique for feature extraction using pre-trained neural networks. The purpose is to perform neural network feature fusion of the convolutional-layer, activation, and average-pooling-layer activations, evaluated based on the performance of linear classifiers such as RF, SVM, LR, and XGBoost and compared with existing methods.
• We propose multi-granularity neural network encoding by learning a feature extraction ensemble in RS scene classification to handle the sizeable intra-class variance problem.
• We combine all baseline model features and build an end-to-end ensemble network with better results than previous state-of-the-art networks on different datasets.

II. RELATED WORK
A. REMOTE SENSING SCENE CLASSIFICATION
Scene classification plays a role in automatically identifying the land cover category of remotely sensed imagery patches. Multi-granularity feature extraction is one phase of a hierarchical reduction process whereby granularities are divided and reduced to more manageable levels. The granularity feature extraction method is one of the best feature extraction approaches for remote sensing imagery. It applies to big datasets by selecting and combining variables into segments, effectively reducing and classifying granular information, either through supervised learning with hand-crafted engineered features or through unsupervised feature learning methods. Thinking in granularity levels, we divide the approaches into three kinds of information [39], [40]: lower-level granular, mid-level granular, and high-level methods.
Identifying the granularity level of remote sensing imagery uses many techniques, such as multi-granular feature extraction, feature selection, and machine learning algorithms. It is essential to detect fine features such as shapes, edges, and color motion in digital image or video processing. Lower-level granular methods extract low-level remote sensing information as local handcrafted features such as color histograms [41]. Many hand-crafted feature descriptors exist, such as the Gist descriptor (GIST) [17], which provides a global description expressing the spatial structure of a scene across different domain scales. The histogram of oriented gradients (HOG) [42] counts occurrences of gradient orientations by location and is one of the best features for edge detection; it underlies some successful land-use scene classification methods [43] for high-resolution imagery. The Scale-Invariant Feature Transform (SIFT) descriptor [44] performs keypoint detection and feature extraction on densely sampled local regions, extended by speeded-up robust features (SURF) [45], a local feature detector and descriptor.
The middle granularity level of feature extraction is an approach used for scene representation. It attempts to represent scenes with a statistical summary of locally extracted image features based on prior human knowledge. Unsupervised learning methods have been explored using principal component analysis (PCA) [46] for high spatial resolution aerial imagery and K-means auto-encoders for learning middle-level feature extraction [47], for instance analyzing eigenvectors of multivariate data and iterating to assign raw data to predefined centroids [48], a standard way to generate a codebook. Many local feature approaches contribute to remote sensing imagery; for example, the bag-of-visual-words (BoVW) model has contributed to remote sensing applications such as land cover/land use [49]. The deep learning approach achieves the state of the art in image understanding and pattern recognition [50], speech recognition, semantic segmentation [51], object detection [52], and natural language processing, with high-level methods such as the Convolutional Neural Network (CNN), formed by grouping several convolution layers trained on small patches of local receptive fields as the deepest input of the hierarchical structure [53], and the deep belief network, composed of numerous Restricted Boltzmann Machines (RBMs) [27]. Moreover, the auto-encoder is a data-driven, unsupervised neural network model for improving satellite remote sensing tasks [54] such as classification [55] and object understanding, thanks to its superior multi-layer channel interconnection with deep features and classifiers from different architectures. Machine learning and deep learning methods can be applied to complex tasks such as hyperspectral anomaly detection [56] and light detection and ranging (LiDAR) data for land cover classification [57]. The encoding method employs various processing layers within the multi-granularity neural network to learn a hierarchical representation of remote sensing imagery. Understanding the image classification task is a much-generalized problem that requires the characteristics of deep convolutional neural networks (DCNNs): the first layers consistently identify lower-level features of the image; by combining a number of layers we learn middle-level features, while the last layers of the network learn semantic, high-level features. See Figure 1a for learning granular feature levels with a pre-trained model and applying these features as multiple layers toward one classification output, which we used as a baseline model. We applied random forest as a machine learning classifier and compared it with other machine learning classifiers. Figure 1a shows level-wise feature extraction using the convolutional feature maps of a CNN, considering the last layer and two connection layers for constructing the input images. It consists of three phases: multi-layer feature extraction, stacked covariance pooling layers, and a machine learning classifier (SVM/RF) whose decision function determines the prediction label of each test sample and is suitable for relatively large datasets. The feature learning process is significant for transfer learning techniques, combining local and global features. We find that connecting these decisions improves classification performance and reduces the computational cost within an ensemble.
We also applied transfer learning on different labeled remotely sensed datasets, as illustrated in Table 1.
Inception-ResNet-V2 is a ConvNet trained on more than a million images from ImageNet [58]. The network is 164 layers deep and can classify images into 1000 object classes. As a result, the network has learned rich feature representations for a wide range of images, and it takes an image input size of 299 by 299. For remote sensing applications seeking enhanced accuracy, InceptionResNetV2 [49] has many layers, and residual units are added to each inception module to avoid gradient problems. See Figure 1-b, which shows the structure of InceptionResNetV2.
Deep layers extract the high-level features that help image classification, while early CNN layers find basic features like edges, shapes, and circles. A dense block model has been used in ConvNets to connect layers with matching feature map sizes [59]. In DenseNet, every layer receives additional input from all preceding layers and passes its own feature maps to all subsequent layers; concatenation is used to receive the collective knowledge of all preceding layers. This brings computational and memory effectiveness. The feature maps at each layer are serially chained to carry data from previous layers to subsequent layers. Because there is no need to relearn redundant information, the number of parameters decreases. DenseNet revised the model by concatenating all the feature maps sequentially instead of summing the output feature maps, and it uses dense blocks for down-sampling, with the blocks connected by transition layers.
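As an illustration of the concatenation pattern described above, the following tf.keras sketch builds a DenseNet-style block; the layer count and growth rate are illustrative assumptions, not DenseNet201's exact configuration.

```python
from tensorflow.keras import layers

def dense_block(x, num_layers=4, growth_rate=32):
    """Each layer receives the concatenation of all preceding feature maps."""
    for _ in range(num_layers):
        y = layers.BatchNormalization()(x)
        y = layers.Activation('relu')(y)
        y = layers.Conv2D(growth_rate, 3, padding='same')(y)
        x = layers.Concatenate()([x, y])   # pass collective knowledge forward
    return x

# usage sketch: x = dense_block(layers.Input(shape=(64, 64, 32)))
```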

III. PROPOSED WORK
A. FINE-TUNED MULTI-GRANULARITY NEURAL NETWORK ENCODING
The fine-tuning for the multi-granularity encoding network (FTMGN) method learns the intra-class relation between granularity levels for the small scene classification task. 1) We used baseline pre-trained CNNs for feature extraction; see Figure 1-a. In this part, we present all the pre-trained CNN baselines used for feature extraction in Table 5. These architectures were used for feature extraction as they demonstrated marvelous performance in image classification on the ImageNet dataset. We then compare the linear classifiers used for remote sensing image classification (RF, SVM, LR, and XGBoost) in Table 6. The general framework used for feature extraction, compared across different classifiers, is presented phase by phase in Figure 3. In the second phase, we introduce transfer learning approaches for learning features such as colors, edges, and shapes. A sketch of the first phase follows below.
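A minimal sketch of this first phase, assuming a tf.keras pre-trained backbone (VGG16 here) is used as a frozen feature extractor whose pooled activations feed a classical classifier such as a random forest; the input shape and estimator count are illustrative assumptions.

```python
from tensorflow.keras.applications import VGG16
from tensorflow.keras.applications.vgg16 import preprocess_input
from sklearn.ensemble import RandomForestClassifier

backbone = VGG16(weights='imagenet', include_top=False, pooling='avg',
                 input_shape=(256, 256, 3))   # pooled 512-d feature vector

def extract_features(images):
    """images: float array (N, 256, 256, 3) with pixel values in [0, 255]."""
    return backbone.predict(preprocess_input(images.copy()))

# feats = extract_features(x_train)
# clf = RandomForestClassifier(n_estimators=500).fit(feats, y_train)
```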
2) Fine-tuning the multi-granularity level through the multi-granularity level architecture of deep learning (MGLADL) can improve accuracy thanks to the pre-trained weights and additional fully connected layers. We used different activation functions for the different architectures combined into one architecture, forming a single network. Lastly, we combine these trained networks, which differ in architecture and optimization functions. We achieved better results than pre-trained, fully connected, additionally connected layers, and varying architecture features alone. The two-level architecture is improved with an extra fully connected layer connected to each convolutional neural network; see Figure 4. The $l$-th layer of a dense block has $l$ inputs, consisting of the feature maps of all preceding convolutional blocks, and its own feature maps are passed on to all $L - l$ subsequent layers. Every layer reads the state from the earlier layers and writes to the following layer; it changes the form but passes on information that needs preserving. The DenseNet structure explicitly differentiates between information added to the network and information retained, by concatenating features instead of summing them as ResNet does. During the experiments, we used a fixed training setting for the baseline models and our model. All models have been implemented using Python 3.7, and our code will soon be available online.
The third phase of the multi-granularity level architecture of deep learning (MGLADL) is shown in Figure 4. It takes the values of the output layer at level i − 1, which are class probability values, combines them, and passes them to group i after adding a fully connected layer with x neurons, feeding the features from the additional fully connected layer to level i, as shown in Figure 4. Figure 4 summarizes how MGLADL improves the accuracy of deep learning based on different groupings of CNNs. This appears in many works in the literature [2], [61]-[64], which enhance performance accuracy based on training size, batch size, hierarchical filters, computing power, and the number of pre-trained networks. Many previous studies [35], [65]-[68] have also been applied to improve accuracy performance. Figure 4 is the proposed deep learning architecture, which uses the features extracted by different networks. The purpose is to break down the parameter optimization of the network into multi-granularity levels based on deep understanding (DL_ij), where each block is trained and optimized independently to improve performance. Briefly, the networks at different architecture levels have different activation functions to use in further feature extraction, which has increased the motivation to find additional practical activation functions beyond ReLU.
The vanishing gradient is the major activation issue reported in most of the literature; understanding the activation function, one of the hyperparameters of neural network computation, is therefore essential. The parameters that affect the overall performance and hyperparameters of the network include filter size, learning rate, regularization strength, and quantization level position. The motivation of MGLADL is optimization based on an activation function with a small range. A linear model maps an input to an output, as performed in the hidden layers before the final prediction of the class score for each label. The transformation of an input vector $x$ is given by

$z = wx + b,$

where $x$ is the input, $w$ the weights, and $b$ the biases. Neural networks produce linear results in this equation, hence the need for an activation function, which converts the linear outputs into non-linear outputs for further computation, basically to learn patterns in the data. These outputs are fed into the next layer, and so on for multilayer deep neural networks, until the final output is obtained, which is linear by default. Activation functions (AFs) are transfer functions applied to the outputs of the linear models; the transformed non-linear output after applying the AF is given by

$y = \alpha(wx + b),$

where $\alpha$ is the activation function. The ReLU is a fast-learning AF that performs a threshold operation on each input element, setting values less than zero to zero; thus the ReLU is given by

$f(x) = \max(0, x).$

This function rectifies inputs less than zero, forcing them to zero and thereby eliminating the vanishing gradient problem observed in earlier types of AF. Networks across the granularity levels can differ in their architecture and can use different activation functions to help extract different features. In the two-level architecture, the input of the second level is produced by a modified ReLU with parameters $b$, $d$, and $n$ to be set, and the derivative of this function follows from its definition (Eqs. (6)-(8) in the original numbering). Using CNNs for feature extraction with different parameter settings allows a variety of features to be extracted; we add a modified ReLU activation function layer to the proposed function. This paper used Inception-ResNet-V2 [58] and Inception-V3 [60], trained on ImageNet with 11 inception blocks, and performed two different experiments: fine-tuning InceptionV3 from the last two inception blocks, and fine-tuning the whole pre-trained model. The training details follow the baseline models, trained with ReLU as the activation function and the Adam optimizer for 30 epochs; transfer learning from the pre-trained network then preserves the weights of the first three layers to improve the convergence of the model.
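To make the activation discussion concrete, the sketch below implements the linear transform followed by ReLU in TensorFlow, plus a stand-in parameterized ReLU; since the paper's modified ReLU (Eqs. (6)-(8)) is not fully recoverable from the text, the leaky variant shown is an illustrative assumption, not the authors' exact function.

```python
import tensorflow as tf

def dense_forward(x, w, b):
    z = tf.matmul(x, w) + b      # z = wx + b, linear by default
    return tf.nn.relu(z)         # y = max(0, z), the standard ReLU

def modified_relu(z, alpha=0.1):
    """Stand-in parameterized ReLU (NOT the paper's exact Eq. (7)):
    keeps a small slope alpha for negative inputs."""
    return tf.where(z > 0, z, alpha * z)
```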
During training, the network continuously uses the mini-batch statistics whether the BN layers are frozen or not, while during inference it uses the previously learned statistics of the frozen BN layers. We can fine-tune the weights of the top layers so they adjust to the mean and variance of the new dataset. During inference, however, the layers receive differently scaled data because the mean and variance of the original ImageNet dataset would be used. We could not use the same process when fine-tuning VGG16 as with InceptionV3 because of its poor validation accuracy. DenseNet [59] is used as a baseline model to build a legitimate comparison with the proposed model, as shown in Table 4.
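The following hedged sketch shows one way to control the batch normalization behavior discussed above when fine-tuning InceptionV3 in tf.keras; whether the BN layers are left trainable is the experimental choice described in the text.

```python
from tensorflow.keras.applications import InceptionV3
from tensorflow.keras.layers import BatchNormalization

base = InceptionV3(weights='imagenet', include_top=False)
for layer in base.layers:
    if isinstance(layer, BatchNormalization):
        layer.trainable = True    # re-estimate statistics on the new dataset
    else:
        layer.trainable = False   # keep convolutional weights frozen
```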

B. IMPLEMENTATION AND DATA ANALYSIS
Our implementation uses the TensorFlow-GPU 1.14.0 framework, Keras, and Python 3.7, tested on a high-performance computer (Intel Core i5 CPU, 8 GB RAM) equipped with an NVIDIA GTX 1070 with 8 GB of memory. First, we normalized the dataset and applied one-hot encoding to facilitate the multi-granularity neural network encoding of features. The encoder network can be used as a preparation technique to perform feature extraction on raw data before training different machine learning approaches. One-hot encoding is applied to the labels of the image data to indicate similarity or dissimilarity for each image in the EAC dataset. As the analysis proceeds through the framework, high-level granular information becomes concentrated in fewer maps. The originally trained image classifications are transferred to the dense layer, which performs the prediction task across granularity levels for multiple granularity features, while up-sampling recovers from a lower-resolution feature map.
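A minimal sketch of the preprocessing described above, assuming image arrays and integer labels; the nine-class count matches the EAC dataset.

```python
from tensorflow.keras.utils import to_categorical

def preprocess(images, labels, num_classes=9):
    x = images.astype('float32') / 255.0      # normalize pixel values
    y = to_categorical(labels, num_classes)   # one-hot label vectors
    return x, y
```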
The encoder network comprises three main steps: convolution layers, non-linear layers, and pooling layers; in the other direction, the decoder is composed of convolutional layers, non-linearities, and up-sampling processes. The output convolutional layer in the encoder contains the features captured by kernels with different-sized receptive fields. The decoder enlarges the feature maps extracted by the encoder to produce a final segmentation map with the same resolution as the input image. We used fine-tuning and ensembling of DCNNs on the remote sensing datasets. When combined, the networks across the granularity levels can differ in architecture and can use different activation functions to extract features (Figure 4). We used sample public datasets; see the samples in Figure 5 and Figure 6.
Before fine-tuning the DCNNs, we constructed a small CNN architecture to gauge the complexity of classifying the EAC dataset and compare it with different public benchmark datasets. We started with convolutional layers of 32, 64, and 128 kernels, each of size three with padding, so the size of the images is conserved, each followed by a 2×2 max-pooling layer that reduces the spatial activation size by a factor of two. The model is based on the heuristic convention in famous DCNNs of using 3 × 3 convolutional layers. Furthermore, the number of filters is doubled whenever the spatial activation size is halved, to maintain roughly constant hidden dimensions. When training the model for feature extraction, data augmentation and learning rate decay are applied to prevent overfitting.
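A hedged reconstruction of this small baseline CNN in tf.keras: the 3×3 convolutions with 32/64/128 filters, 'same' padding, and 2×2 max-pooling follow the description above, while the global-pooling classification head is an assumption, since the paper does not fully specify it.

```python
from tensorflow.keras import layers, models

def build_baseline(input_shape=(256, 256, 3), num_classes=9):
    model = models.Sequential([
        layers.Conv2D(32, 3, padding='same', activation='relu',
                      input_shape=input_shape),
        layers.MaxPooling2D(2),                 # halves spatial size
        layers.Conv2D(64, 3, padding='same', activation='relu'),
        layers.MaxPooling2D(2),
        layers.Conv2D(128, 3, padding='same', activation='relu'),
        layers.MaxPooling2D(2),
        layers.GlobalAveragePooling2D(),        # assumed head
        layers.Dense(num_classes, activation='softmax'),
    ])
    model.compile(optimizer='adam', loss='categorical_crossentropy',
                  metrics=['accuracy'])
    return model
```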
This approach transforms the training dataset in each epoch to create variety and guarantee that the model never sees the same image twice. The learning rate was initialized at 0.001, and the Adam optimizer was applied. Learning rate decay was also used to halve the learning rate whenever the validation accuracy plateaus for ten epochs. The model was trained on the EAC dataset with an 80% training and 20% validation split. We achieved a validation accuracy of 92.82% and a test accuracy of 91.50% over 100 epochs. We also achieved 78.57% validation accuracy and 83.51% test accuracy on the UCM dataset, and 80.96% validation accuracy and 83.54% test accuracy on SIRI-WHU. Second, many DCNN approaches achieve better results on ImageNet than VGG16.
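The training regime above can be sketched as follows, assuming tf.keras utilities; the augmentation parameters are illustrative assumptions, while the plateau-halving schedule (factor 0.5, patience 10) follows the text.

```python
from tensorflow.keras.preprocessing.image import ImageDataGenerator
from tensorflow.keras.callbacks import ReduceLROnPlateau

augment = ImageDataGenerator(rotation_range=20, horizontal_flip=True,
                             width_shift_range=0.1, height_shift_range=0.1)
lr_decay = ReduceLROnPlateau(monitor='val_accuracy',  # 'val_acc' in older Keras
                             factor=0.5, patience=10, verbose=1)
# model.fit(augment.flow(x_train, y_train, batch_size=32),
#           validation_data=(x_val, y_val), epochs=100, callbacks=[lr_decay])
```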
We therefore fine-tuned VGG16. The performance of the VGG16 network corresponds to the third to fifth convolutional blocks, which have four convolutional layers. To fine-tune VGG16, the top fully connected layers were removed. The new head consists of one global max-pooling layer, one FC layer with 512 units, one dropout layer with rate 0.3, and one softmax activation layer for the nine categories of the East Africa countries dataset. We froze all layers in VGG16 and performed feature extraction to train the newly added FC layers. After ten epochs of feature extraction, we unfroze the final convolutional block of VGG16 and began fine-tuning the model for 100 epochs. A learning rate of 0.0001 with a high momentum of 0.9 and the Adam optimizer were applied throughout the training process, along with the same data augmentation and learning rate decay strategy as in the baseline model. VGG16 was thus fine-tuned for a total of 100 epochs. Thirdly, we applied InceptionV3 trained on ImageNet with 11 inception blocks and fine-tuned the whole pre-trained model. During training we chose to fine-tune all layers of InceptionV3 as well as the top two inception blocks with the batch normalization layers set to trainable; each of these experiments, including fine-tuning the whole InceptionV3, lasted 100 epochs. The top performer on ImageNet is Inception-ResNet, which incorporates residual connections that proved inherently important for training very deep convolutional models. Figure 3 displays the structure of the Inception-ResNetV2 model, which was fine-tuned using the same training strategy as VGG16 and InceptionV3. The top layers of Inception-ResNetV2 were fine-tuned for 100 epochs.
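A hedged sketch of this VGG16 recipe in tf.keras: a new head with global max-pooling, a 512-unit FC layer, dropout 0.3, and a nine-way softmax, with the backbone frozen for the warm-up and the final convolutional block ('block5') unfrozen for fine-tuning at a learning rate of 1e-4; the exact layer names and compile settings are assumptions.

```python
from tensorflow.keras.applications import VGG16
from tensorflow.keras import layers, models, optimizers

base = VGG16(weights='imagenet', include_top=False,
             input_shape=(256, 256, 3))
for layer in base.layers:
    layer.trainable = False                    # warm-up: train the head only

x = layers.GlobalMaxPooling2D()(base.output)
x = layers.Dense(512, activation='relu')(x)
x = layers.Dropout(0.3)(x)
out = layers.Dense(9, activation='softmax')(x)  # nine EAC categories
model = models.Model(base.input, out)
# ... train the new head for 10 epochs, then unfreeze the last conv block:
for layer in base.layers:
    if layer.name.startswith('block5'):
        layer.trainable = True
model.compile(optimizer=optimizers.Adam(1e-4),
              loss='categorical_crossentropy', metrics=['accuracy'])
# ... continue fine-tuning for 100 epochs
```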
Lastly, we applied DenseNet201, which consists of four dense blocks, and performed two experiments on this network: first, fine-tuning only the last dense block, and second, fine-tuning the whole network. The same training strategy as in the previous sections was used to fine-tune DenseNet. The top layers of DenseNet201 were fine-tuned for a total of 10 epochs, and the whole DenseNet201 was fine-tuned for 100 epochs. When fine-tuning DenseNet201 and InceptionV3, we let the weights be initialized from the original pre-trained ImageNet weights.
Finally, we combined the three best models into ensemble and weighted averages to improve accuracy on the EAC dataset. We ran every experiment five times and report the mean and standard deviation of overall accuracy (OA). The different comparison results are shown in Tables 2-5. The comparison results, which form the backbone of MGFEEN, show better performance than the state-of-the-art (SOTA) methods. With training ratios of 40% and 50%, MGFEEN achieved 97.38% OA on WHU-RS19, 99.94% on UC-Merced, and 99.10% on the EAC dataset.
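A minimal sketch of the ensemble step, assuming each fine-tuned model exposes a predict method returning softmax probabilities; the uniform default weights are illustrative, and the reported best configuration used an equal weight of 0.5 for the combined pair.

```python
import numpy as np

def ensemble_predict(models, x, weights=None):
    """Fuse per-model softmax outputs by (weighted) averaging."""
    probs = [m.predict(x) for m in models]
    if weights is None:
        weights = [1.0 / len(probs)] * len(probs)
    fused = sum(w * p for w, p in zip(weights, probs))
    return np.argmax(fused, axis=1)    # voted class labels
```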
We ran experiments on the intra-class variation problem on three remote sensing imagery datasets with different pixel sizes; see Table 1. With multi-granularity neural network encoding, performing fine-tuning directly would result in a massive gradient. Our dataset was collected from the East African Community in six countries (Rwanda, Uganda, Kenya, Tanzania, South Sudan, and Burundi). The images were captured over the Eastern Africa Community Countries using Google Earth Engine (EAC). It is a novel land-use dataset that contains 2112 images extracted from Google Earth satellite imagery, divided into nine object categories of 256 × 256 RGB pixels, shown in Figure 6 B. We first introduce the experimental dataset; second, we evaluate indexes and testing procedures by constructing an experimental dataset of high-resolution remote sensing images.

Table 5 compares our six-layer baseline, as explored in Section III, to understand the difficulties of CNN feature extraction. Our baseline achieved its best results via data augmentation and learning rate decay with machine learning as a final classifier. In Table 6, we compare the pre-trained VGG16 structure as a feature extractor; we trained these features with different traditional machine learning methods as final classifiers. We achieved VGG-DenseNet with RF alongside various state-of-the-art methods that have reported classification accuracy on the UCM, SIRI-WHU, and EAC datasets. As shown in Table 7, we compared the baseline results with four DCNNs fine-tuned at the top layer. We also applied fine-tuning of all layers and compared the second scenario for the ensemble through a weighted average with equal weights of 0.5 in Table 8.

IV. DISCUSSION
Significant progress has been made by applying remote sensing data via deep learning methods. We analyze and compare the current state-of-the-art CNN topologies via deep learning and the intermediate-level RS image classification literature with our proposed method in Tables 2-4. Table 5 presents the comparison results for the pre-trained models with different machine learning classifiers. One approach, illustrated in Figure 4, used a DCNN with six convolution layers, alternating batch normalization and max-pooling. Using the Adam optimizer, the initial learning rate was set to 0.001. Table 8 shows the obtained results.
In the second approach illustrated in Figure 4, we combined VGG16, DenseNet201, and different traditional classifiers; the performance accuracy is significant. It is best to combine VGG16, DenseNet201, and a random forest (RF), as Table 6 illustrates. The third approach is presented in Figure 4: we trained on 80% and tested on 20% of the EAC dataset with data augmentation, yielding different test losses and parameter counts; see Table 7.
The last approach illustrated in Figure 4, the multi-granularity level architecture of deep learning (MGLADL) concept, helps learn multi-granularity feature extraction. With VGG16, Inception-v3, and DenseNet201 as the backbone, we build the fine-tuned multi-granularity neural network encoding (FTMGNN) approach by taking the features of the output layer with softmax. The parts from each network are combined into a feature vector for each input image, which is fed as input to a simple network containing convolutional layers and one fully connected layer. Combining these networks, whose individual accuracies were 96.40% and 97.84%, into a weighted average ensemble with an equal weight of 0.5 achieved 98.60% validation accuracy and 99.10% test accuracy, an improvement of 0.22% over the highest previous precision, with the test loss reduced to 0.00050. Table 8 illustrates the fine-tuning of all layers and the ensemble models. The results are shown in the confusion matrices of Figures 9-10. The third MGLADL technique overcomes this limitation by preserving a higher number of features from granularity level one and hence gives better improvement. We adopt the third MGLADL approach as the proposed model, which is used to break down the optimization of the network into levels while optimizing one large network, together with the ensembling method.
The challenge for RS imagery in the granular computing and computer vision area is that feature extraction from remote sensing images suffers from the difficulty of objects in RS images. Image classification still faces the challenge of intra-class variation. Investigating the problems of remote sensing in depth, the following shortcomings remain. First, RS scene image classification needs an advanced mathematical interpretation: deep learning can be merely a process of fitting the input dataset to the output result, and there is a lack of a rigorous mathematical basis for the design and improvement of the networks. Successful RS image classification depends on the CNN topology: the number of layers, the input data, the fully connected layers, the classifiers, the loss functions, and the network structure. Research on CNNs with multiple RS image inputs has utilized many more features extracted from spectrum, shape, and texture.
The second shortcoming is insufficient training data in remote sensing, as the requirements for training samples are high. The training dataset needs to be very large to achieve good application results, and available samples are inadequate when training on complex remote sensing datasets, although some scholars have made particular progress in small-sample training. Many training samples are required for higher accuracy in practical applications in specific areas. Finally, the comprehensibility of network features is poor.
The features extracted by the network lack practical significance after being passed through the hierarchical feature levels. Although visual development tools are available, the specific meaning of automatically extracted network features cannot be designed. The construction, adjustment, and improvement of deep networks still rely on the experience of developers.
Accelerating the training and testing of models to meet the dataset requirements of a particular production setting requires high-performance computing devices such as GPUs. In the machine learning approach where a model is developed for one task, transfer learning re-uses it as the starting point for a second or third practical approach, accelerating the CNN. An image is mapped to specific results through complex computations; however, hierarchical feature understanding does not explain the particular mapping in this process, so prior knowledge is still needed for image recognition and judgement.

V. CONCLUSION
In our studies, we learned feature construction and feature extraction, then built granularity-level decision fusion to increase accuracy in the classification stage. Fine-tuning all layers gives excellent results, and the time needed to fine-tune all layers was lower than fine-tuning only the top layers, because when fine-tuning all layers we ran only 30 epochs, while fine-tuning the top layers ran for 100 epochs. Using state-of-the-art transfer learning and ensemble learning, we created an ensemble of fine-tuned VGG16, InceptionV3, and DenseNet201. After 100 epochs we achieved 96.81% accuracy on the validation set and 97.84% on the test set for the EAC dataset.
Moreover, a 97.36% average accuracy on the test set leads to the best single weighted-average result on the test set, with accuracy up to 99.10%. The significant progress of applying remote sensing data via deep learning methods is essential to image classification studies. Global operational analysis is still a significant challenge for interpretability and reproducibility, given limited ground data, especially for developing countries without national programs. Future work will investigate different approaches to encode the dense block for multi-granular representation invariance, using intuition from CNN transfer of fine-grained features to the semantic labeling and segmentation of large satellite imagery, and will aim to avoid overfitting through better training strategies that learn different activation functions.

CONFLICTS OF INTEREST
The authors declare no conflict of interest regarding the publication of this paper.