Multilevel Capsule Weighted Aggregation Network Based on a Decoupled Dynamic Filter for Remote Sensing Scene Classification

Improving remote sensing scene classification (RSSC) by effectively extracting discriminative representations for complex and diverse scenes remains a challenging task. The capsule network (CapsNet) can encode the spatial relationships of features in an image and exhibits encouraging performance. Nevertheless, the original CapsNet is unsuitable for RSSC with complex image backgrounds. In addition, conventional neural network methods use only the features from their last convolutional layer and discard the intermediate features, which carry complementary information. To exploit the additional information in intermediate convolutional layers and improve feature aggregation, this paper proposes a multilevel capsule weighted aggregation network (MCWANet) based on a decoupled dynamic filter (DDF), in which a new multilevel capsule encoding module and a new capsule sorting pooling (CSPool) method are implemented by combining the advantageous attributes of a residual DDF block, weighted capsule aggregation, and the new CSPool method. Extensive experiments on two challenging datasets, AID and NWPU-RESISC45, demonstrate that multilevel and multiscale features can be extracted and fused into semantically strong feature representations and that the proposed MCWANet performs competitively in RSSC.


I. INTRODUCTION
A remote sensing image (RSI), captured by satellite imaging sensors, contains the detailed spatial structure of Earth's surface. Remote sensing scene classification (RSSC) has been a popular research topic in the field of high-resolution RSI in recent decades; its purpose is to automatically assign semantic labels to the images [1]. Due to the rapid development of aeronautics and astronautics technology, abundant RSIs now support many applications of RSSC, such as land planning, urban change, traffic control, disaster monitoring, military needs, agriculture, and forestry [2], [3].
The key to RSSC is to extract robust and distinctive feature representations. However, the special bird's-eye-view image acquisition mode makes traditional RSSC methods ineffective. In addition, an RSI might include multiple land cover units of different sizes, or different RSIs might include the same land cover unit, which causes great difficulties in RSSC. Thus, realizing automatic and high-precision RSSC is still a challenge [4].

(The associate editor coordinating the review of this manuscript and approving it for publication was Donato Impedovo.)
Robust feature representation is the focal point of RSSC. According to the features used, RSSC techniques can be divided into two categories: human-defined feature extraction and depth feature extraction. Owing to the excellent capability of deep learning, depth feature approaches achieve better results on RSIs than human-defined features [5], [6]. Therefore, deep learning technology has been widely adopted and has proven to be a central topic in RSSC tasks. Because of their advantages in data processing power and hierarchical learning ability, convolutional neural networks (CNNs) have been extensively applied to information extraction and RSSC and have achieved many state-of-the-art results [7], [8]. Although CNNs perform well in many tasks, they also have limitations. CNNs cannot explicitly learn the relationships between feature locations, so it is difficult for them to identify similar objects with different location relationships. Moreover, CNNs use pooling operations to reduce the number of parameters and maintain translation invariance, but conventional pooling methods cause the loss of essential details and location information.

(VOLUME 9, 2021. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see https://creativecommons.org/licenses/by/4.0/)
Recently, Hinton et al. [9] advanced the concept of the ''capsule'', an interpretable deep learning construct that is considered a powerful alternative to the CNN. Different from traditional deep learning models based on scalar neurons, the capsule network (CapsNet) uses a vector capsule representation to learn equivariant representations, which are more robust to the spatial relations between objects and to changes in their poses in the image. CapsNet has been extensively applied to classification, recognition, segmentation and prediction tasks [10]-[12]. However, the original CapsNet [13] employs only two convolutional layers and uses a great number of training parameters to attempt to explain all the contents of an image, which is unsuitable for RSIs with complex backgrounds. In addition, most of these methods focus only on the feature maps of the last layer and discard the intermediate feature maps, even though the feature maps obtained by different intermediate layers possess considerable supplementary information and discriminative ability.
This paper proposes a novel CapsNet for RSSC, named MCWANet, which employs a new multilevel capsule encoding module and a new capsule sorting pooling (CSPool) method; these elements combine a residual decoupled dynamic filter (DDF) block, weighted capsule aggregation, and the new CSPool method. A DDF [14] decouples the conventional dynamic filter into spatial and channel dynamic filters with adaptive content and lighter weight, which can enhance the abundant information in different spatial locations and different channels. Here, we construct a residual-structure [15] DDF to alleviate the vanishing gradient problem caused by increasing the depth of a deep neural network. The multilevel capsule encoding module generates multilevel capsule aggregations directly from intermediate feature layers to enrich the feature space with different receptive fields. The feature obtained by the final convolution has the largest receptive field and can locate large objects with high-level semantic information. The features obtained by the intermediate layers are more appropriate and robust for locating small objects, containing essential details and low-level texture patterns. We use a weight matrix to condition the aggregation extent when extracting intermediate features.
To alleviate the overfitting problem and reduce the number of feature parameters, the CSPool method is proposed to improve the output quality and the robustness of capsules. The novel MCWANet achieves competitive and promising performance in RSSC.
The contributions of this paper are summarized as follows. 1) A novel end-to-end MCWANet is structured to extract multilevel and informative features and provide a semantically strong capsule representation that increases RSSC accuracy. 2) A new multilevel capsule encoding module is designed with a residual-structure DDF and a weighted aggregation strategy, which highlights both the spatialwise and channelwise salient and informative features and makes full use of the feature maps of intermediate layers, playing a positive role in improving the ability of the capsule representation. 3) A new CSPool method is proposed to provide a high-quality and robust capsule representation that reduces the number of capsules to prevent feature redundancy and destruction, further improving the RSSC result. 4) Our novel model achieves excellent RSSC accuracies compared with those of some state-of-the-art algorithms on two challenging datasets.

The remainder of this paper is organized as follows. Related studies are introduced in Section II. Section III describes the novel architecture of MCWANet in detail. In Section IV, the experimental results are discussed. Finally, the conclusion is given in Section V.

II. RELATED STUDIES
Robust representation of RSIs is the key to improving RSSC accuracy. Human-defined feature descriptors are usually used for RSSC. In [16], the BoVW method was proposed to encode the scale-invariant feature transform [17] for local blocks. In [18], two kernels were used to construct spatial relationships: cooccurrence kernels and pyramid matching kernels. Zhao et al. [19] introduced the Fisher kernel [20] and used a gradient vector to encode human-defined features. Negrel et al. [21] studied several second-order methods to fuse the histogram of oriented gradients [22]. However, human-defined methods are challenging to apply in practice because of the complexity and diversity of RSIs, which encourages the development of more robust and accurate recognition schemes [23].
Hu et al. [24] extracted depth features from multiscale images based on CNNs and encoded them into scene representations by using traditional feature coding methods. To obtain high-level features, Yu and Liu [25] constructed three neural networks with different receptive fields, which were fused by a probability fusion model. In [26], multilayer stacked covariance pooling was advanced to fuse intermediate features, where a covariance matrix was proposed to represent the complementary information of the stacked feature maps. Wang et al. [27] proposed a coding hybrid resolution representation that integrates low-level and high-level features. Zhu et al. [28] integrated multilevel semantics from global texture, bags of visual words and pretrained CNNs and then introduced extra descriptors to optimize several feature representations. Unfortunately, the descriptors usually damage the original structure of the feature spatial information by reconstructing new feature tensors. In addition, DenseNet [29] found that the fusion of feature maps generates fusion redundancy.
CapsNet, which was first implemented by Sabour et al. [13], is an effective network for classification, and several innovative variants have followed. Hinton et al. described a new version of the capsule in [9], in which a matrix capsule that learns entities and poses was proposed to better represent the different attributes of an object. Chen et al. [30] embedded the routing process and all the other neural network parameters into one optimization process, which solves the problem of manually setting the optimal number of routing iterations. In [31], HitNet was proposed, an improved CapsNet built around a hit-or-miss middle layer. Rosario et al. [32] introduced MLCN, a multi-lane capsule network composed of several parallel lanes, which makes parallel computation, high-speed training and a small number of training parameters possible. Cheng et al. [33] proposed a complex-valued CapsNet that uses multiscale complex-valued convolution layers to extract multiscale complex features and construct complex capsules; experiments showed better performance than the original CapsNet. However, these designs only adopt spatial structure features, which restricts the ability of CapsNet. In addition, Yin et al. [34] applied CapsNet to hyperspectral image classification, Mobiny and Van Nguyen [35] achieved lung cancer screening by using a fast CapsNet, and Afshar et al. [36] successfully classified brain tumor images by using CapsNet.

III. PROPOSED METHOD
In this paper, a novel MCWANet is proposed, as shown in Fig. 1. Compared with the original CapsNet, the MCWANet constructs three new modules: a residual DDF module, a multilevel capsule encoding module and a capsule sorting pooling module. Dynamic routing is used between the primary capsule layer (Pri-CapsL) and the digital capsule layer (Dig-CapsL).

A. DECOUPLED DYNAMIC FILTER
The DDF has the following proven desirable properties [14]: it provides a spatially variant filter that guarantees content adaptivity; it has the same amount of computation as depthwise convolution, which makes its inference faster than that of standard convolution and conventional dynamic filters; and it significantly reduces the memory consumption of dynamic filters, which enables the direct replacement of all standard convolutions with DDFs.
Let $F \in \mathbb{R}^{c \times n}$, with $n = h \times w$, be a given input feature. The standard convolution of the feature vector $F_{(\cdot,i)}$ at the $i$-th pixel can be written as

$$F'_{(\cdot,i)} = \sum_{p_j \in \Omega(i)} W(p_i - p_j)\, F_{(\cdot,j)} + b,$$

where $b$ is the bias vector, $W$ is a $k \times k$ convolutional filter, and $\Omega(i)$ is the $k \times k$ neighborhood of pixel $i$.
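As a point of comparison for the DDF below, the standard convolution above applies one content-independent filter at every pixel. A minimal numpy sketch (the function name and explicit zero-padding loop are illustrative, not from the paper):

```python
import numpy as np

def standard_conv_pixel(F, W, b, i_row, i_col):
    """Standard convolution at one pixel: the same k x k filter W
    (shape c_out x c x k x k) is applied at every spatial location,
    i.e., the filter is independent of the input content."""
    c, h, w = F.shape          # channels, height, width
    k = W.shape[-1]            # kernel size
    r = k // 2
    out = b.copy()             # start from the bias vector
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            y, x = i_row + dy, i_col + dx
            if 0 <= y < h and 0 <= x < w:   # zero padding outside
                out += W[:, :, dy + r, dx + r] @ F[:, y, x]
    return out
```

With an all-ones input and filter, the output at an interior pixel is simply the neighborhood size, which makes the spatial invariance of the filter easy to see.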
Here, the filter W is the same for all pixels; that is, the filter is independent of the input content. Different from standard convolution, the dynamic filter uses an additional self-network to generate a filter for each pixel; that is, the spatial invariant filter in the above formula becomes a spatially variable filter. Dynamic filters can enhance content-adaptive learning, but they also entail considerable computation and occupation.
To meet the challenge that a single filter should be both ''content adaptive'' and ''lighter than standard convolution'', the DDF achieves both goals. The key is to decouple the dynamic filter into a spatial and a channel dynamic filter. The DDF is described as follows:

$$F'_{(r,i)} = \sum_{p_j \in \Omega(i)} D^{sp}_{i}(p_i - p_j)\, D^{ch}_{r}(p_i - p_j)\, F_{(r,j)},$$

where $F'_{(r,i)}$ and $F_{(r,j)}$ represent the output and input feature values at the $i$-th (resp. $j$-th) pixel in the $r$-th channel, $c'$ and $c$ represent the numbers of output and input channels, respectively, and $D^{sp}_{i}(p_i - p_j)$ and $D^{ch}_{r}(p_i - p_j)$ represent the spatial dynamic filter and the channel dynamic filter, respectively. The DDF operation and its procedure are shown in Fig. 2; the filter prediction branch is kept as light as possible. The channel and spatial dynamic filters are predicted from the input, and the output features are computed according to the above formula. Compared with a conventional dynamic filter, the DDF reduces the original pixel-wise filter values (of size $n \times c' \times c \times k \times k$) to $nk^2$ spatial filter values and $ck^2$ channel filter values. The filter prediction branches are designed in an attention style, exploiting the connection between dynamic filters and attention mechanisms. For the spatial-filter prediction branch, only one convolution is used. For the channel-filter prediction branch, a structure like the squeeze-and-excitation (SE) [37] block is adopted, i.e., GAP + FC + ReLU + FC. Because directly predicted filter values may be too large or too small and would destabilize training, the filters are normalized (here, the filter normalization FN is designed with reference to batch normalization (BN)):

$$\hat{D}^{sp}_{i} = \alpha^{sp}\,\frac{D^{sp}_{i} - \mu^{sp}_{i}}{\delta^{sp}_{i}} + \beta^{sp}, \qquad \hat{D}^{ch}_{r} = \alpha^{ch}\,\frac{D^{ch}_{r} - \mu^{ch}_{r}}{\delta^{ch}_{r}} + \beta^{ch},$$

where $D^{sp}_{i}$ and $D^{ch}_{r}$ are the filters before normalization; $\mu$ and $\delta$ represent the mean value and the standard deviation, respectively; and $\alpha$ and $\beta$ represent the moving standard deviation and moving mean value, respectively.
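The decoupled filtering operation above can be sketched in numpy. This is a hypothetical minimal sketch of the depthwise form only: the filter prediction branches and the FN normalization are omitted, and the filters are passed in as arrays:

```python
import numpy as np

def ddf_pixel(F, D_sp, D_ch, i_row, i_col):
    """Decoupled dynamic filtering at one pixel (depthwise form):
    the effective filter value is the product of a per-pixel spatial
    filter D_sp[i] (k x k, different for every pixel) and a per-channel
    filter D_ch[r] (k x k, different for every channel)."""
    c, h, w = F.shape
    k = D_ch.shape[-1]
    r = k // 2
    out = np.zeros(c)
    sp = D_sp[i_row, i_col]            # spatial filter for this pixel
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            y, x = i_row + dy, i_col + dx
            if 0 <= y < h and 0 <= x < w:
                # combined filter value: spatial * channel, per channel
                out += sp[dy + r, dx + r] * D_ch[:, dy + r, dx + r] * F[:, y, x]
    return out
```

Note how the storage cost matches the text: D_sp holds $nk^2$ values and D_ch holds $ck^2$ values, instead of a full per-pixel, per-channel filter bank.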
FN helps limit the generated filter values to a reasonable range so that it can avoid the vanishing and exploding gradient problems in the training process.
Inspired by the residual pattern of ResNet, MCWANet joins the DDF convolution modules in a residual pattern, as shown in Fig. 1. Because the input image undergoes multilevel convolution operations from its initial input to its final output, the receptive field is progressively enlarged.
B. MULTILEVEL CAPSULE ENCODING MODULE
1) MULTILEVEL FEATURE AGGREGATION
In the conventional CapsNet, the input of a convolutional operation is the output of the previous convolutional operation, and only the feature obtained from the final convolutional operation is employed as the representative feature used to construct the capsules. As shown in [25] and [26], different levels of convolutional features contain diverse spatial structure information from the image. The receptive field of low-level convolution is small, which captures appearance information from images. In contrast, high-level convolution has a large receptive field that captures the spatial structure information between objects.
To integrate the intermediate convolutional feature maps from multiple convolutional layers, a multilevel capsule encoding module is proposed, as shown in Fig. 1. The feature aggregation can be represented as

$$MF_{stack} = Ag(W_1 \otimes CF_1,\; W_2 \otimes CF_2,\; \dots,\; W_n \otimes CF_n),$$

where $Ag$ refers to the aggregation method, $CF_i$ represents the capsule feature maps of the $i$-th layer, $W_i$ expresses the corresponding weight parameters, and $n$ is the number of aggregated layers. Multilevel feature aggregation is realized by stacking all the feature maps in a parallel connection. Under plain stacking aggregation, however, some useless information is inevitably fused, which leads to redundancy, destruction of features and reduced efficiency. On the one hand, stacked feature maps from overlapping adjacent convolutional layers contain considerable redundant information. On the other hand, changes in some pixels can destroy the original feature structure, especially the immature abstractions of the features transferred from earlier layers to later ones. We design a weight matrix to condition the aggregation extent, addressing these issues and enhancing aggregation effectiveness.
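The weighted parallel stacking described above can be sketched as follows. This is a minimal sketch under two assumptions not fixed by the paper: the aggregation $Ag$ is taken to be channel-wise concatenation, and the weights are taken to be scalars rather than full matrices:

```python
import numpy as np

def weighted_aggregate(feature_maps, weights):
    """Weighted parallel stacking of multilevel capsule feature maps:
    each level CF_i is scaled by its weight W_i, then all levels are
    concatenated along the channel axis to form MF_stack."""
    scaled = [w * cf for w, cf in zip(weights, feature_maps)]
    return np.concatenate(scaled, axis=0)   # stack along channel dim
```

A small weight shrinks a level's contribution toward zero (suppressing redundant or destructive maps), while a large weight lets representative maps dominate the stack, matching the role of the weight matrix described in the next subsection.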

2) WEIGHT MATRIX FOR THE AGGREGATION
The weight matrix is used to control the degree to which each feature map participates in the aggregation. Each level of feature map is assigned a weight $W_i$ that controls its aggregation extent. When the weight is small, the feature map controlled by that weight has little influence on the aggregation, which reduces redundant information and structural destruction. When the weight is large, the proportion of representative feature maps is large, which benefits aggregation effectiveness. The prediction vectors are computed with the different weights as

$$\hat{CF}_i = W_i\, CF_i, \quad i = 1, 2, \dots, n,$$

where $W_1, W_2, \dots, W_n$ are the weight matrices between $CF_1, CF_2, \dots, CF_n$ and $\hat{CF}_1, \hat{CF}_2, \dots, \hat{CF}_n$, respectively, and $\hat{CF}_i$ is the prediction vector of the capsules obtained by the DDF.
$MF_{stack}$ represents the output of the multilevel capsule encoding module, which is then used as the proposal primary capsules (ProPri-Caps). The variety of feature representations is encoded by the weight matrices between the original capsules and the ProPri-Caps. During training, the part-whole relationship of every capsule pair is learned by conditioning the weight matrices $W_1, W_2, \dots, W_n$. Parallel stacking collects the characteristics of different feature maps and preserves the original pixels. In the final aggregation, the output capsules of the DDF blocks realize multiscale feature extraction and combine receptive fields of different numbers and sizes, so that both the structural and semantic information of the image is retained.

C. CAPSULES POOLING
The multilevel capsule encoding module creates much redundant information, so a new capsule pooling method is proposed to reduce the number of parameters of MCWANet, which prevents common overfitting while maintaining accuracy. Standard pooling operates on each element of a capsule, which can change the capsule vector's direction, alter the entity attributes that the capsule expresses, and make learning difficult. Therefore, different from standard pooling, and considering that each capsule in CapsNet is a vector, capsule pooling operates on the entire capsule vector instead of single elements to preserve feature information (location, direction, etc.). We regard each capsule as a whole and subsample along the channel direction to protect the complete expression of object parts, which ensures that the direction of each capsule remains unchanged.
In the ProPri-Caps layer, the ProPri-Caps are reshaped from the convolutional results of the DDF blocks, where the prediction branch of the channel filters has gone through an SE-like attention structure. As a result, important channels are given more weight than useless channels that represent redundant information. Therefore, a response factor $V_{rxy}$ is defined as the energy value of a capsule vector, representing the extent to which the capsule is activated:
$$V_{rxy} = \sum_{n=1}^{d} v_{rxyn}^{2},$$

where $v_{rxyn}$ is the $n$-th element of the capsule vector at location $(x, y)$ in the $r$-th capsule deck and $d$ is the dimension of the capsule vector. For the capsules at the same location in all the capsule decks, we sort the responses $V_{rxy}$ in descending order and keep the first $k$ capsules, with

$$k = \mathrm{MIN}(top\_index,\; N),$$

where $N$ is the number of capsules at position $(x, y)$, $top\_index$ is the number of capsule responses larger than the mean capsule response at position $(x, y)$ over all capsule decks, and MIN takes the smaller value. Therefore, $k$ capsules at each pixel location are passed to the next layer as the primary capsules (Pri-Caps). Since the direction of each capsule vector is unchanged, the new CSPool algorithm is well suited to CapsNet.
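The selection step above can be sketched in numpy for one spatial location. This is a hypothetical minimal sketch: the response here uses the capsule vector length (which orders capsules the same way as the squared-sum energy), and the keep-at-least-one guard is an added assumption:

```python
import numpy as np

def capsule_sorting_pool(caps):
    """Capsule sorting pooling (CSPool) sketch at one spatial location.
    caps: (N, d) array of N capsule vectors. Each capsule's response is
    its vector length; capsules are sorted by response and the
    k = min(top_index, N) strongest WHOLE vectors are kept, so capsule
    directions are never altered."""
    V = np.linalg.norm(caps, axis=1)            # response of each capsule
    top_index = int((V > V.mean()).sum())       # count of above-mean responses
    k = min(max(top_index, 1), len(caps))       # keep at least one capsule
    order = np.argsort(-V)                      # sort by descending response
    return caps[order[:k]]
```

Because whole vectors are copied rather than element-wise maxima taken, the surviving capsules keep both their length and their direction, unlike conventional max pooling.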

D. DYNAMIC ROUTING
The Pri-Caps $u$ are obtained after the CSPool operation. As shown in Fig. 3, between the Pri-CapsL and the Dig-CapsL, the dynamic routing algorithm [13], a kind of information selection mechanism, is used to iteratively update the coupling coefficients, where all the Pri-Caps of the feature maps at all levels are collected as the input. The coupling coefficients $c_{ij}$ are obtained with a softmax function,

$$c_{ij} = \frac{\exp(b_{ij})}{\sum_{k}\exp(b_{ik})},$$

where $b_{ij}$ is initialized to 0 before training begins.
In the dynamic routing algorithm, the coefficient $b_{ij}$ is iteratively refined by measuring the agreement between the prediction vector $\hat{u}_{j|i}$ and the output $v_j$. Dynamic routing updates the routing coefficient of the $j$-th digital capsule by the inner product of the two vectors:

$$b_{ij} \leftarrow b_{ij} + \hat{u}_{j|i} \cdot v_j.$$

If good agreement is reached, the Pri-Caps prediction $\hat{u}_{j|i}$ is a good prediction for the digital capsule $v_j$, and the coefficient $b_{ij}$ is significantly increased. With the input to capsule $j$ given by $s_j = \sum_i c_{ij}\,\hat{u}_{j|i}$, the output $v_j$ is obtained with a squash function:

$$v_j = \frac{\|s_j\|^2}{1 + \|s_j\|^2}\,\frac{s_j}{\|s_j\|}. \tag{11}$$

After normalization, a short vector is squashed to nearly 0, and a long vector to a length just below 1. The loss function, calculated with the margin loss, is

$$L_m = T_m \max(0,\, k^{+} - \|v_m\|)^2 + \lambda (1 - T_m) \max(0,\, \|v_m\| - k^{-})^2,$$

where $T_m = 1$ when class $m$ is really present, and $k^{+}$, $k^{-}$ and $\lambda$ are parameters established during training. The overall loss is the sum of the margin losses over all outputs of the final layer.
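The routing loop and margin loss above can be sketched in numpy. This is a minimal sketch: the iteration count, the small epsilon in the squash, and the default values $k^{+}=0.9$, $k^{-}=0.1$, $\lambda=0.5$ (the constants of the original CapsNet, whereas this paper states they are established during training) are assumptions:

```python
import numpy as np

def squash(s):
    """Squash: short vectors shrink toward 0, long ones toward length 1."""
    n2 = np.sum(s * s, axis=-1, keepdims=True)
    return (n2 / (1.0 + n2)) * s / np.sqrt(n2 + 1e-9)

def dynamic_routing(u_hat, iters=3):
    """Routing-by-agreement between prediction vectors u_hat
    (shape: num_in x num_out x d) and digital capsules v_j."""
    n_in, n_out, d = u_hat.shape
    b = np.zeros((n_in, n_out))                   # logits, initialized to 0
    for _ in range(iters):
        c = np.exp(b) / np.exp(b).sum(axis=1, keepdims=True)  # softmax over j
        s = (c[:, :, None] * u_hat).sum(axis=0)   # weighted sum per output
        v = squash(s)
        b = b + (u_hat * v[None]).sum(axis=-1)    # agreement u_hat . v_j
    return v

def margin_loss(v, T, k_pos=0.9, k_neg=0.1, lam=0.5):
    """Sum of per-class margin losses; T is the one-hot label vector."""
    norms = np.linalg.norm(v, axis=-1)
    return float(np.sum(T * np.maximum(0, k_pos - norms) ** 2
                        + lam * (1 - T) * np.maximum(0, norms - k_neg) ** 2))
```

The squash guarantees every output capsule has length below 1, which is what lets the vector length be read as a class-existence probability.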

IV. EXPERIMENTAL RESULTS AND ANALYSIS
In this section, our MCWANet is used on public datasets. First, the datasets are presented. Second, the experimental settings are introduced. Third, the experimental results are shown.

A. DATASETS
Two challenging public datasets are used to evaluate the proposed MCWANet method. Sample images of the two datasets are displayed in Fig. 4. Each dataset is introduced separately below.
AID dataset: This dataset is composed of 10,000 samples (600 × 600 pixels) in 30 classes, each of which has 220 to 420 images. The spatial resolutions of these images range from 0.5 m to 8 m. This large dataset is available from http://www.captain-whu.com/project/AID/.

B. EXPERIMENTAL DESIGN
In this paper, all experiments are carried out on a personal computer with a 4.2-GHz quad-core Intel i7-7700 CPU and an NVIDIA RTX 2070 GPU, with 16 GB of memory. The PyTorch framework is adopted to conduct the experiments. All images are scaled to 224 × 224 pixels before training. The learning rate is initialized to 0.001 and multiplied by 0.1 when the loss per epoch stops decreasing. The weight decay parameter is set to 0.0005. For the convenience of the feature aggregation and pooling operations, the stride is set to 1. Batch normalization (BN) is used after each convolutional operation, and the ReLU activation function is adopted.
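The learning-rate schedule described above (decay on loss plateau) can be sketched as follows; the helper name and the two-epoch comparison window are illustrative assumptions, not details given in the paper:

```python
def step_lr_on_plateau(lr, loss_history, factor=0.1):
    """Multiply the learning rate by `factor` (here 0.1) when the
    per-epoch loss stops decreasing; otherwise keep it unchanged."""
    if len(loss_history) >= 2 and loss_history[-1] >= loss_history[-2]:
        return lr * factor
    return lr
```

In a PyTorch training loop this behavior is typically obtained with a plateau-style scheduler rather than hand-rolled code; the sketch only makes the stated rule concrete.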
The basic CapsNet contains three layers: two convolutional layers and one capsule layer. The first convolutional layer has 256 kernels of size 9 × 9 with a stride of 1. The second layer is a Pri-CapsL with 32 capsules of 8-dimensional vectors at each pixel position, obtained from the results of the preceding layer by 8 convolution kernels of size 9 × 9 with a stride of 2. The final layer is an FC-CapsL in which a 16D capsule represents a class. Table 1 describes the parameter settings of the proposed MCWANet in detail.
Three evaluation criteria were used: overall accuracy (OA), standard deviation (SD) and confusion matrix (CM). OA is the number of correctly recognized images divided by the total number of test images. The CM is a kind of information table where the columns represent the ground truth, and rows represent the prediction.
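The OA and CM criteria defined above can be computed as follows (a short sketch; following the convention stated above, rows index the prediction and columns the ground truth):

```python
import numpy as np

def confusion_matrix(y_true, y_pred, n_classes):
    """CM with rows = prediction, columns = ground truth,
    matching the convention described in the text."""
    cm = np.zeros((n_classes, n_classes), dtype=int)
    for t, p in zip(y_true, y_pred):
        cm[p, t] += 1
    return cm

def overall_accuracy(y_true, y_pred):
    """OA = correctly recognized images / total number of test images."""
    return float(np.mean(np.asarray(y_true) == np.asarray(y_pred)))
```

The diagonal of the CM counts correct predictions, so OA equals the trace of the CM divided by the number of test images.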
To thoroughly test the effectiveness of the proposed MCWANet, we adopted 10-fold cross validation in the experiments. Furthermore, two training-testing ratios are adopted per dataset. For the AID dataset, the training-testing ratios are set to 5:5 and 2:8. For the NWPU-RESISC45 dataset, the ratios are 2:8 and 1:9. Each dataset is randomly divided into 10 parts without repeated sampling, and all experiments are performed 10 times. Finally, the mean OA and SD over the 10 runs are reported as the final results.

C. RESULTS AND ANALYSIS 1) EXPERIMENTAL RESULTS FROM THE AID DATASET
The AID dataset is a large-scale RSI dataset in which at least 100 images per scene class are used for training models. All the compared algorithms are based on deep convolutional networks. Table 2 shows the comparison results of the different algorithms. As shown, our MCWANet achieves the highest OAs of 95.89% ± 0.23% and 93.62% ± 0.18% with training ratios of 5:5 and 2:8, respectively. In particular, the CNN-CapsNet that combines VGG_16 and CapsNet achieves excellent performance on the AID dataset; for training ratios of 5:5 and 2:8, the performance of our MCWANet is 1.15% and 1.99% higher than that of CNN-CapsNet, respectively. Furthermore, the recognition OA of the proposed network is 1.05% and 2.28% higher than that of MF2Net, respectively; MF2Net is one of the newest RSSC methods. This emphasizes that MCWANet can learn more robust feature representations from RSIs. The CM of MCWANet is shown in Fig. 5 for the 5:5 training ratio. The data in each row of the CM represent the prediction category of each image. Our model handles most categories well; some misclassified scenes are difficult to distinguish even for human interpreters.

2) EXPERIMENTAL RESULTS FROM THE NWPU-RESISC45 DATASET
This dataset is a large-scale dataset and is one of the most challenging at present. The OA and SD for different methods are shown in Table 3, in which the proposed MCWANet exhibits the best performance. Compared with the second-best model (RAN), the enhancement achieved by MCWANet is 0.75% with a 1:9 training ratio and 1.53% with a 2:8 training ratio. These excellent results indicate that the proposed method can effectively accomplish RSSC, even on complex and diverse datasets. In addition, the CM of MCWANet is shown in Fig. 6 for the 2:8 training ratio. From the CM, it is clear that MCWANet is valid for most categories. The accuracies of MCWANet are higher than 90% in 33 out of 45 classes and higher than 85% in 42 out of 45 classes. Particularly for the ''Chaparral'' category, the classification results of the test images are all correct. The classes with the lowest classification accuracy are subject to error because they share very similar objects, making it challenging to classify those categories correctly. These encouraging results prove the merit of the proposed model once more.

3) EFFECTIVENESS OF DIFFERENT SUBPARTS
As described in Section III, our MCWANet includes three main improvements: a residual-structure DDF module, a multilevel capsule encoding module, and a CSPool strategy. To validate their effectiveness in MCWANet, we implement the following ablation experiments.
The OA and SD results obtained by inserting different modules into the original CapsNet [13] are shown in Table 4, and the training times per epoch and the numbers of training parameters for the AID dataset are listed in Table 5.
''Convs-Caps'' identifies the model that replaces the 2 ConvLayers with 9 × 9 convolution kernels by 4 ConvLayers with 3 × 3 convolution kernels. ''Res-Caps'' identifies the model that replaces the 3 × 3 convolution kernels in ''Convs-Caps'' by three residual-structure cascaded 1 × 1 convolution kernels. ''SF-Caps'' identifies the model that replaces the convolutions with spatial dynamic filters, and ''CF-Caps'' identifies the model that replaces the convolutions with channel dynamic filters. ''DDF-Caps'' identifies the model that uses a residual-structure DDF, as shown in Fig. 1. ''DDF-MCE-Caps'' identifies the model that adds the multilevel capsule encoding module to the ''DDF-Caps'' model, where the aggregation method is divided into weighted-matrix and nonweighted-matrix variants for analysis. ''DDF-MCE-Pool'' identifies the model that adopts conventional max pooling after the multilevel capsule encoding module. ''DDF-MCE-CPool'' identifies the model that adopts our capsule sorting pooling method in the ''DDF-MCE-Caps'' model, which is also called MCWANet.
From the ablation test results, some conclusions can be drawn. First, among the top five models, ''DDF-Caps'' performs best, owing to its use of the residual-structure DDF. When only the spatial dynamic filter is used, the model's performance decreases significantly, and when only the channel dynamic filter is used, the performance of the model is improved by 0.96% on the AID dataset and 0.98% on the NWPU-RESISC45 dataset. When the spatial and channel dynamic filters are used at the same time, the performance of the model is improved on the two datasets by 1.33% and 1.46%, respectively. These results show that joining the spatial and channel dynamic filters is helpful for learning scene representations and that local features are very useful for scene understanding.
The multilevel capsule encoding module is the crucial part of the end-to-end MCWANet, aggregating the convolutional features of the intermediate layers. Next, we study the effects of the multilevel capsule encoding module based on weight adjustment. It can be seen from Table 4 that, compared with the ''DDF-Caps'' model, the multilevel capsule encoding module improves the OAs on both datasets. In particular, the ''DDF-MCE-Caps'' model with weight adjustment achieves a more significant improvement. This means that different intermediate convolutional layers make different contributions to different scene objects and that different levels of feature detail can compensate for the information loss of high-level features. For the NWPU-RESISC45 dataset, the OA with the weight matrix is higher than that without it by up to 0.83%. These results show that weight adjustment can effectively remove feature redundancy and aid in learning differentiated scene representations.
When the new CSPool method is used in ''DDF-MCE-CPool'', the model's performance increases on both challenging datasets while the number of training parameters is reduced. Many small convolution kernels in MCWANet replace the large convolution kernels of the original CapsNet, which reduces the number of parameters and improves the capability of depth feature extraction, indicating that the CSPool method plays an active role in RSSC.
According to the experimental results, the residual-structure DDF module, the multilevel capsule encoding module, and the CSPool strategy can extract robust features that represent most of the classified scenes. The strongest Pri-Caps are employed to construct the digital capsules used to conduct classification. All these strategies help increase the capability of the original CapsNet.

V. CONCLUSION
In this paper, we proposed a novel CapsNet, named MCWANet, in which a new multilevel capsule encoding module and a new capsule sorting pooling method are incorporated to improve RSSC accuracy by making use of the superiority of CapsNet, the advantages of the residual DDF block, the weighted capsule aggregation strategy, and the new CSPool method. The proposed MCWANet adopted a deep capsule architecture, which has good application prospects for extracting capsule features and restoring high-quality and strong semantic feature representation. By weighted integration of the DDF features of intermediate layers, the MCWANet enhanced feature representation functionality and robustness. Moreover, by incorporating the CSPool module, the MCWANet largely preserved descriptive information while remaining computationally and memory efficient. Experimental results on two challenging datasets clearly show that the MCWANet can produce competitive results compared with those from a set of existing algorithms. The highest OAs for the two datasets were 95.89% and 92.15%, respectively. Our future work will embed a fine-tuning strategy into our end-to-end classification framework to further improve its classification performance. YAN GAO received the Ph.D. degree in control theory and control engineering from Donghua University, in 2013. She is currently a Teacher with Shanghai University of Engineering Science. Her research interests include stability, synchronization, control of neural networks, and complex networks. VOLUME 9, 2021