RS-CapsNet: An Advanced Capsule Network



I. INTRODUCTION
In recent years, convolutional neural networks (CNNs) [1]-[4] have developed rapidly due to their excellent learning ability and have been widely applied to many image tasks, including image classification [1]-[4], image retrieval [5], object detection [6], image segmentation [7], scene recognition [8], and so on. Although CNNs perform well in many fields, they also have several limitations. The pooling operation provides a small degree of translation invariance, but it loses the precise location information of features, whereas what is really needed is equivariance. In addition, CNNs cannot learn the relationships between features, and they tend to memorize data rather than understand it [9]. Therefore, CNNs require a large amount of training data.
To address the drawbacks of traditional CNNs, Sabour et al. [10] propose CapsNet, which abandons the pooling layer and retains the location information of features. It also replaces scalar outputs with vector outputs and increases the levels of the structure. Different from the translation invariance provided by the pooling operation, CapsNet provides translation equivariance. Furthermore, unlike CNNs, CapsNet proposes the idea of using parts to construct the whole, which is realized with capsules and the dynamic routing algorithm. (The associate editor coordinating the review of this manuscript and approving it for publication was Donato Impedovo.)
However, the original CapsNet [10] has its own drawbacks. Firstly, it uses only two convolutional layers to extract image features, which is not suitable for complex images. Secondly, the convolutional kernels in the original CapsNet are of size 9 × 9, which involves a large number of training parameters. Moreover, the original CapsNet tends to explain everything in the image, which is not suitable for the classification of images with complex backgrounds.
To overcome these limitations of the original CapsNet, we propose the following methods. To enhance the ability of the convolutional layers to extract image features, we propose to use the Res2Net block [11]. Different from the traditional residual block in ResNet [4], the Res2Net block uses small convolutional kernels and several groups of convolution operations. The different groups of convolutional kernels are connected in a hierarchical manner to increase the number of feature scales [11]. The Res2Net block can extract multi-scale image features at a granular level and increase the receptive field of each convolutional layer without involving many parameters. Because the original CapsNet suffers from too many parameters, besides the Res2Net block we also use several extra convolutional layers with small convolutional kernels to extract features and change the shape of the feature maps. To keep the tendency of the original CapsNet to explain everything in the image from harming classification, we propose to use the Squeeze-and-Excitation (SE) block [12], which highlights useful features and suppresses useless ones by assigning different weights to the different channels of the feature maps. We also employ a linear combination between capsules to enhance the ability of capsules to represent detected objects and to reduce the number of capsules at the same pixel location. In addition, based on the idea that CapsNet uses parts to construct the whole, we argue that large parts of the detected object can construct the whole better than small parts.
Thus, we propose to first construct intermediate capsules representing the large parts of the detected object, and then use these intermediate capsules together with the primary capsules representing the small parts to construct the classification capsules, so that the feature information of the different types of capsules can complement each other. Based on the above methods, we propose a novel Capsule Network, RS-CapsNet, that can extract multi-scale features of the input images, highlight useful features, and suppress useless ones. By using the linear combination between capsules, the ability of capsules to represent the detected object is enhanced. RS-CapsNet can also construct the classification capsules better through the idea of using the large parts and the small parts of the detected object together to construct the whole. In addition, thanks to the small convolutional kernels and the Res2Net block, RS-CapsNet involves fewer parameters.
The contributions of this paper can be summarized as follows: (I) We propose to use the Res2Net block to extract multi-scale image features and increase the receptive field of each convolutional layer, so that the network can learn long-range spatial relationships and extract richer image features. For the extracted features, we then use the SE block to highlight useful features and suppress useless ones, which mitigates the effect of background features on the construction of the properties that represent the detected object in the classification capsules.
(II) We use the Res2Net block and four extra convolutional layers with small convolutional kernels to reduce the number of training parameters in the convolutional layers.
(III) The method of linear combination between capsules is proposed to enhance the ability of capsules to represent the detected object and to reduce the number of capsules at the same pixel location, thereby reducing the computational complexity and the number of training parameters.
(IV) We propose a method of vertical and horizontal sliding windows and an improved routing process. Firstly, we use the vertical and horizontal sliding windows to slice the whole feature map into partial feature maps, and then use them to construct intermediate capsules that represent the large parts of the detected object. Finally, we use the intermediate capsules and the primary capsules representing the small parts of the detected object to construct the classification capsules together.
The rest of the paper is organized as follows. In Section II, some related works are introduced. Our proposed advanced Capsule Network is described in detail in Section III. In Section IV, we demonstrate our experimental results and related analysis. Finally, the conclusion will be given in Section V.

II. RELATED WORKS
Based on the concept of a ''capsule'' introduced by Hinton et al. [13] in 2010, Sabour et al. [10] propose the dynamic routing algorithm between capsules and present a novel neural network named CapsNet. Since CapsNet overcomes some drawbacks of traditional CNNs, such as the loss of location information caused by the pooling operation, and performs well on many classic datasets, many researchers have paid attention to it [14], [15] and are devoted to improving its algorithm [16]-[20] or architecture [21]-[23].
Based on Sabour's work, Hinton et al. [9] propose the matrix Capsule Network, which replaces vector capsules with matrix capsules and uses the EM algorithm to update the coupling coefficients so that lower-layer capsules can be routed to the higher layer to construct the higher-layer capsules. Xiang et al. [24] introduce the Multi-Scale CapsNet (MS-CapsNet), in which multi-scale features are extracted by multi-scale convolutional kernels and then used to construct multi-dimension primary capsules. They also propose an improved dropout for capsules. Deliege et al. [25] propose HitNet, an improved CapsNet that constructs a new layer named the Hit-or-Miss layer and introduces a centripetal loss function. HitNet also develops a hybrid data augmentation method by combining information between the data space and the feature space. Rosario et al. [26] introduce the Multi-Lane CapsNet (MLCN), a novel CapsNet that consists of multiple parallel lanes. MLCN allows parallel processing, faster training, and fewer parameters. They also propose an improved dropout for the multi-lane structure. Cheng et al. [27] propose the Complex-valued CapsNet (Cv-CapsNet). They use multi-scale complex-valued convolutional layers to extract multi-scale complex-valued features and construct complex-valued capsules. They also generalize the dynamic routing algorithm to the complex-valued domain. Their experiments show that Cv-CapsNet needs fewer training parameters and fewer training epochs, and performs better than the real-valued CapsNet. Gagana et al. [28] test the effects of different activation functions on the performance of CapsNet, including ReLU, LReLU, and e-Swish. Their experiments show that the LReLU and e-Swish activation functions optimize CapsNet better than ReLU in terms of accuracy and convergence speed.
Furthermore, CapsNet has been applied in many different fields [29]-[32]. Saqur and Vivona [33] introduce CapsGAN, a novel generative adversarial Capsule Network that replaces the traditional convolutional neural network with CapsNet as the discriminator. Yin et al. [34] apply CapsNet to hyperspectral image classification. They transfer well-initialized convolutional-layer parameters to CapsNet and show that its performance is better than that of traditional CNNs. Mobiny et al. [35] introduce a fast CapsNet for lung cancer screening by proposing a consistent dynamic routing mechanism. Afshar et al. [36] successfully use CapsNet to classify brain tumor images.

III. PROPOSED ADVANCED CAPSULE NETWORK
In this paper, a novel Capsule Network named RS-CapsNet is proposed, in which the convolutional layers are augmented by Res2Net blocks and SE blocks, and the capsule layers are based on the linear combination between capsules and the proposed routing process. An overview of the proposed RS-CapsNet is shown in Fig. 1.

A. CONVOLUTIONAL LAYERS
Compared with the original CapsNet [10], which uses only two convolutional layers to extract the features of the input images, we use the Res2Net block [11] to enhance the feature-extraction ability of the convolutional layers and increase the receptive field of each convolutional layer. We also use the SE block [12] to highlight useful features and suppress useless ones. In addition, we use four extra convolutional layers to change the shape of the convolutional feature maps.

1) RES2NET BLOCK
The Res2Net block is a variant of the residual block in ResNet that uses several groups of convolution operations and constructs hierarchical connections within a single residual block. Different from multi-scale feature extraction methods that work in a layer-wise manner, the Res2Net block extracts multi-scale features at a granular level and increases the range of receptive fields of each convolutional layer. The architecture of the Res2Net block is shown in Fig. 2. To represent multi-scale features at a granular level and increase the range of receptive fields, the Res2Net block replaces the 3 × 3 convolutional kernels of c channels with n smaller groups of convolutional kernels of m channels each, where c = n × m. These smaller groups of convolutional kernels are connected in a hierarchical manner to increase the number of scales of the features that the network outputs [11].
As shown in Fig. 2, the input is first sent to a set of 1 × 1 convolutional kernels, and the resulting feature maps are divided into four groups. The first group of feature maps x_1 undergoes no convolution operation. For the second group x_2, a set of 3 × 3 convolutional kernels extracts features from it, producing the output y_2. Next, y_2 and the third group x_3 are sent together to a second set of 3 × 3 convolutional kernels, producing y_3. Then, y_3 and the fourth group x_4 are sent together to a third set of 3 × 3 convolutional kernels, producing y_4. Finally, the output feature maps of all groups are concatenated and sent to another group of 1 × 1 convolutional kernels to fuse the features. Similar to the residual block in ResNet, the Res2Net block uses a residual connection to connect the input with the output of the last group of convolution operations. Because the input features can be transformed into the output features through multiple paths, and the receptive field grows each time the features pass through a set of convolutional kernels, the combined output feature maps of the Res2Net block achieve a multi-scale feature representation with different numbers and combinations of receptive field sizes [11].
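The hierarchical connection pattern described above can be sketched in NumPy. This is a toy illustration of our own (single-channel groups, naive convolution); the 1 × 1 split and fusion convolutions, multi-channel kernels, and batch normalization of the real Res2Net block [11] are simplified away.

```python
import numpy as np

def conv3x3(x, w):
    """Naive 'same'-padded 3x3 convolution on a single 2D feature map."""
    h, wd = x.shape
    xp = np.pad(x, 1)
    out = np.zeros_like(x)
    for i in range(h):
        for j in range(wd):
            out[i, j] = np.sum(xp[i:i + 3, j:j + 3] * w)
    return out

def res2net_block(x, kernels):
    """x: (n, H, W) feature groups after the 1x1 split; kernels: n-1 3x3 filters.
    Implements y_1 = x_1, y_2 = K_2(x_2), y_i = K_i(x_i + y_{i-1}) for i > 2,
    then concatenates the groups and adds the residual connection
    (the final 1x1 fusion convolution is omitted)."""
    ys = [x[0]]                                  # first group: no convolution
    for xi, k in zip(x[1:], kernels):
        inp = xi if len(ys) == 1 else xi + ys[-1]  # hierarchical connection
        ys.append(conv3x3(inp, k))
    return np.stack(ys) + x                      # concatenate groups, add residual
```

Because each group's input may already contain the previous group's output, later groups see progressively larger receptive fields, which is the multi-scale effect the text describes.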

2) SE BLOCK
The ''Squeeze-and-Excitation'' (SE) block is an attention block that operates on the channels of feature maps. It adaptively adjusts the channel-wise feature responses by explicitly modelling the interdependencies between channels [12]. The SE block learns to use the global information of feature maps to highlight useful features and suppress useless ones. It has been demonstrated that it can significantly improve the performance of existing CNNs at a minimal computational cost. The architecture of the SE block is shown in Fig. 3.
Any feature map G ∈ R^{H×W×C} can be transformed into M ∈ R^{H×W×C} by a convolution operation. The SE block then performs feature recalibration on the feature maps M. To model the interdependencies between the channels, M is first passed through a ''squeeze'' operation that squeezes the global information of each channel. The SE block uses global average pooling as the squeeze operation, represented by F_sq(•). When the feature maps M pass through the global average pooling, a descriptor s ∈ R^C is produced, in which each dimension describes the spatial information of the corresponding channel.
To use the information produced by the squeeze operation, the SE block applies an ''excitation'' operation, represented by F_ex(•, w), to model the interdependencies between the channels. The excitation operation consists of two fully-connected layers: the first is followed by a ReLU activation and the second by a sigmoid function. Two fully-connected layers are used because they can learn the nonlinear interactions between channels and are flexible. The sigmoid function is used as the activation of the second fully-connected layer because the SE block should allow multiple channels to be highlighted rather than just one [12].
After the excitation operation, a set of channel weights ŝ ∈ R^C is produced, in which each dimension represents the feature importance of the corresponding channel. Multiplying the channel weights ŝ with the input feature maps M channel-wise then yields the recalibrated feature maps N.
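The squeeze-excitation-recalibration pipeline above can be sketched as follows. The weight matrices W1 and W2 stand for the two fully-connected layers; the reduction ratio and bias terms are left out for brevity, and the function name is ours.

```python
import numpy as np

def se_block(M, W1, W2):
    """Squeeze-and-Excitation on feature maps M of shape (H, W, C).
    W1: (C, C//r) and W2: (C//r, C) are the two fully-connected layers."""
    s = M.mean(axis=(0, 1))              # squeeze: global average pooling -> (C,)
    z = np.maximum(0.0, s @ W1)          # excitation FC1 + ReLU
    w = 1.0 / (1.0 + np.exp(-(z @ W2)))  # excitation FC2 + sigmoid -> channel weights
    return M * w                         # channel-wise recalibration -> N
```

The sigmoid keeps every channel weight in (0, 1) independently, so several channels can be highlighted at once, as the text notes.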

3) THE ARCHITECTURE OF CONVOLUTIONAL LAYERS
Due to the excellent characteristics of the Res2Net block and the SE block, we apply them to RS-CapsNet. Different from the architecture of the convolutional layers in Res2Net, RS-CapsNet simply uses the Res2Net block as part of its convolutional layers; there is no skip connection between two adjacent Res2Net blocks. To change the shape of the feature maps, we use four extra convolutional layers. After each convolutional layer, we use a Res2Net block to enhance the feature-extraction ability of the convolutional layers and increase the receptive fields. In addition, after each Res2Net block, we use an SE block to highlight useful features and suppress useless ones.

B. CAPSULE LAYERS
The idea of the original CapsNet [10] is to use parts to construct the whole, and the primary capsules are directly used to construct the classification capsules. However, these primary capsules are reshaped from the convolutional feature maps, so they only represent small parts of the detected object and carry much redundant information. We propose to first construct intermediate capsules that represent large parts of the detected object and then use all the intermediate capsules and primary capsules together to construct the final classification capsules. In addition, we introduce a linear combination between capsules to enhance the ability of capsules to represent the detected object. This method can also reduce the number of capsules at the same pixel location.

1) THE LINEAR COMBINATION BETWEEN CAPSULES
In traditional CNNs, fully-connected layers combine the neurons of one layer, with different weights, into the neurons of another layer, fusing the different features of that layer. In principle, this is the same as a multi-layer perceptron.
In RS-CapsNet, we introduce a linear combination between capsules. The neurons at the same pixel location in the feature maps represent different features of the detected object within the same receptive field, which means that the detected object in that receptive field can be represented by these features. When we group these neurons into primary capsules, the detected object in the receptive field can be represented by these capsules. However, since the primary capsules are directly reshaped from the convolutional feature maps, they may carry a lot of redundant information, for example, features describing the background.

Fig. 4 shows the linear combination between the capsules at the same pixel location. For any feature maps, we first reshape them into capsules, then treat each capsule as a whole and construct a fully-connected layer between the capsules at the same pixel location, where D_1 is the dimension of the capsules, N_1 the number of capsules at each pixel location, and N_2 the number of capsules at each pixel location after the linear combination. Finally, we apply a squashing function to the capsules obtained by the linear combination, which shrinks the length of the capsules into [0, 1), keeps their direction constant, and provides more nonlinearity for the entire network. There are three reasons for using the linear combination between capsules. Firstly, it does not involve many parameters, because the linear combination only exists between the capsules at the same pixel location (e.g., if the feature maps have 512 channels and each capsule has 8 dimensions, there are only 64 capsules at each pixel location). Secondly, it reduces the number of capsules at the same pixel location. Thirdly, it fuses the features of the different capsules, reducing the redundant information and enhancing the fused capsules' ability to represent the detected object.
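The linear combination followed by squashing can be sketched as below. The shapes follow the example in the text (512 channels reshaped into 64 eight-dimensional capsules per pixel); the einsum-based implementation and function names are our own sketch, not the authors' code.

```python
import numpy as np

def squash(v, axis=-1, eps=1e-8):
    """Shrink capsule lengths into [0, 1) while keeping their direction."""
    n2 = np.sum(v * v, axis=axis, keepdims=True)
    return (n2 / (1.0 + n2)) * v / np.sqrt(n2 + eps)

def combine_capsules(caps, W):
    """caps: (H, W, N1, D1) capsules at each pixel; W: (N2, N1) weights.
    Linearly combines the N1 capsules at each pixel location into N2 fused
    capsules, then applies the squashing function."""
    fused = np.einsum('mn,hwnd->hwmd', W, caps)
    return squash(fused)
```

Because the weight matrix W is shared across pixel locations and only mixes the N1 capsules at one location, the parameter count stays small, as argued in the text.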

2) PROPOSED ROUTING PROCESS
In the original CapsNet [10], the primary capsules are reshaped from the feature maps generated by the last convolution operation, and then the primary capsules are directly used to construct the classification capsules. The routing principle between the primary capsules and the classification capsules is to use parts to construct the whole. Because each primary capsule with a fixed dimension is reshaped from the channel-wise features at one pixel location, each primary capsule only represents a small part of the detected object. Predicting the whole from large parts of the detected object works better than predicting it from small parts. Capsules that represent large parts of the detected object can be constructed by applying the routing algorithm to the primary capsules; that is, we can construct intermediate capsules representing large parts of the detected object by adding capsule layers. However, it has recently been shown that directly stacking more fully-connected capsule layers results in poor performance in the middle layers [37], so we cannot use the intermediate capsules alone to construct the final classification capsules. Instead, we propose to use the intermediate capsules, which represent large parts of the detected object, and the primary capsules, which represent small parts, to construct the final classification capsules together. Although large parts of the detected object predict the whole better, small parts represent the whole in more detail, so capsules representing the small parts and the large parts of the detected object can complement each other.
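As a reference for the routing steps this paragraph relies on, here is a minimal sketch of the dynamic routing algorithm of Sabour et al. [10] for one group of prediction vectors; the learned transformation matrices that produce the predictions u_hat are omitted, and the function names are ours.

```python
import numpy as np

def squash(v, axis=-1, eps=1e-8):
    """Shrink capsule lengths into [0, 1) while keeping their direction."""
    n2 = np.sum(v * v, axis=axis, keepdims=True)
    return (n2 / (1.0 + n2)) * v / np.sqrt(n2 + eps)

def dynamic_routing(u_hat, n_iter=3):
    """u_hat: (n_lower, n_upper, D) prediction vectors from lower capsules.
    Returns the (n_upper, D) higher-level capsules."""
    b = np.zeros(u_hat.shape[:2])                        # routing logits
    for _ in range(n_iter):
        e = np.exp(b - b.max(axis=1, keepdims=True))
        c = e / e.sum(axis=1, keepdims=True)             # coupling coefficients
        s = np.einsum('ij,ijd->jd', c, u_hat)            # weighted sum over lower capsules
        v = squash(s)                                    # candidate upper capsules
        b = b + np.einsum('ijd,jd->ij', u_hat, v)        # agreement update
    return v
```

In RS-CapsNet this routine would be applied twice: once per partial feature map to obtain intermediate capsules, and once from the pooled intermediate and primary capsules to the classification capsules.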
To construct the intermediate capsules that represent large parts of the detected object, we first slice the feature maps produced by the last convolution operation into many small partial feature maps, and then use these partial feature maps to construct the capsules that represent large parts of the detected object. To slice the feature maps, we propose vertical and horizontal sliding windows, as shown in Fig. 5(b). There are two reasons for choosing the vertical and horizontal sliding windows. Firstly, for objects with a horizontal, vertical, or symmetrical structure, this method is more conducive to preserving the integrity of the object. Secondly, we want to obtain more partial feature maps: compared with the traditional sliding windows shown in Fig. 5(a), vertical and horizontal sliding windows provide more partial feature maps. If we used the partial feature maps directly to construct the capsules representing large parts of the detected object, more capsules and more computation would be involved. Thus, we first use a 1 × 1 convolution to reduce the number of channels of the original feature maps and the partial feature maps. Then, the linear combination between capsules is used to fuse the features of the different capsules at the same pixel location, so as to reduce the number of capsules. Note that for the capsules reshaped from the original feature maps, we perform the linear combination but do not reduce their number.
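A minimal sketch of the vertical and horizontal slicing is given below. The non-overlapping stride and the strip size k are our assumptions for illustration: with strips of size k = 2 on a 6 × 6 map, the method yields six partial feature maps, one plausible reading of fm_1, ..., fm_6 in Fig. 6.

```python
import numpy as np

def vh_sliding_windows(FM, k):
    """FM: (H, W, C) feature maps. Slice into vertical strips of width k
    and horizontal strips of height k (stride k, assumed non-overlapping)."""
    H, W, _ = FM.shape
    vertical = [FM[:, j:j + k, :] for j in range(0, W - k + 1, k)]
    horizontal = [FM[i:i + k, :, :] for i in range(0, H - k + 1, k)]
    return vertical + horizontal
```

Each strip spans a full row or column of the feature map, so the capsules built from it cover a large, structurally coherent part of the detected object.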
As shown in Fig. 6, for the feature maps FM, we first use the vertical and horizontal sliding windows to construct the partial feature maps fm_i (1 ≤ i ≤ 6), then use a 1 × 1 convolution to reduce the number of channels of all partial feature maps fm_i (1 ≤ i ≤ 6) and of the original feature maps FM to half the original number; the resulting feature maps are denoted fm′_i (1 ≤ i ≤ 6) and FM′, respectively. For the feature maps fm′_i (1 ≤ i ≤ 6), we first reshape them into capsules, then use the linear combination between capsules to fuse features, enhancing the ability of the capsules to represent the detected object and reducing the number of capsules by half. For the feature maps FM′, we reshape them into primary capsules and apply the linear combination between capsules to them, but we do not reduce the number of primary capsules, because we consider that the primary capsules better represent the details of the detected object.
For the different groups of capsules obtained from the different partial feature maps fm_i (1 ≤ i ≤ 6), we perform the dynamic routing algorithm to construct the intermediate capsules that represent large parts of the detected object. As shown in Fig. 6, each partial feature map fm_i (1 ≤ i ≤ 6) constructs N_3 capsules, where the dimension of each capsule is D_2. Finally, we use all the intermediate capsules constructed from the partial feature maps fm_i (1 ≤ i ≤ 6) and the primary capsules obtained from the feature maps FM to construct the classification capsules together.

C. LOSS FUNCTION
Because the length of a capsule represents the probability that the detected object exists, we use the margin loss [10] as the loss function of RS-CapsNet. The margin loss ensures that the true-class capsule is long and the false-class capsules are short. It also allows multiple classes to exist.
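This margin loss can be sketched as follows, using the paper's values m+ = 0.9, m− = 0.1, and α = 0.5 as defaults; `lengths` holds the classification-capsule lengths and `T` the one-hot label (the variable names are ours).

```python
import numpy as np

def margin_loss(lengths, T, m_pos=0.9, m_neg=0.1, alpha=0.5):
    """lengths: (K,) classification-capsule lengths; T: (K,) one-hot labels."""
    present = T * np.maximum(0.0, m_pos - lengths) ** 2          # true class too short
    absent = alpha * (1.0 - T) * np.maximum(0.0, lengths - m_neg) ** 2  # false classes too long
    return float(np.sum(present + absent))
```

A true-class capsule longer than m+ and false-class capsules shorter than m− incur zero loss, which is exactly the margin behavior described above.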
In RS-CapsNet, the loss function is defined as:

L_k = T_k max(0, m+ − ||v_k||)² + α (1 − T_k) max(0, ||v_k|| − m−)²

where ||v_k|| is the length of the classification capsule for class k. Here T_k = 1 if the true class is k and zero otherwise. m+ and m− are the lower boundary for the true-class capsule and the upper boundary for the false-class capsules, respectively. The α factor down-weights the loss for the false-class capsules and stops the initial learning from shrinking the lengths of all the capsules [10]. We use m+ = 0.9, m− = 0.1, and α = 0.5.

The Baseline-CapsNet consists of two convolutional layers and a capsule layer, similar to the original CapsNet [10], except that it does not have the reconstruction sub-network. The first convolutional layer has 256 convolutional kernels with a size of 9 × 9, a stride of 1, and ReLU activation. The second convolutional layer is a convolutional capsule layer, which consists of 32 capsules with 8 dimensions at each pixel location. Each primary capsule is obtained by a convolution operation with eight 9 × 9 convolutional kernels and a stride of 2 on the output of the first convolutional layer [10].

The final classification capsule layer has ten 16-dimension capsules, each representing a class.
RS-CNN stands for the proposed network without the routing process, in which we replace the routing process and the margin loss with global average pooling and a cross-entropy loss. Table 1 shows the comparison of network performance after applying the different methods to the original CapsNet [10] without the reconstruction sub-network. ''Convs'' denotes the model that replaces the two convolutional layers with 9 × 9 kernels by four convolutional layers with 3 × 3 kernels. ''Convs-Res2Net'' denotes the model that adds a Res2Net block after each convolutional layer of the ''Convs'' model. ''Convs-Res2Net-SE'' denotes the model that adds an SE block after each Res2Net block of the ''Convs-Res2Net'' model. RS-CapsNet denotes the model obtained by applying to the ''Convs-Res2Net-SE'' model the linear combination between capsules and the method of using intermediate and primary capsules together to construct the classification capsules. The linear combination between capsules is applied to the capsules at the same pixel location, after which we apply a squashing function to the resulting capsules. The squashing function not only shrinks the length of the capsules into [0, 1) while keeping their direction constant, but also provides more nonlinearity for the entire network. Table 2 compares the network performance with and without the squashing function applied to the capsules constructed by the linear combination. From the results, the linear combination between capsules followed by a squashing function yields better network performance.

2) SELECTION OF HYPERPARAMETERS
When we use the partial feature maps to construct the intermediate capsules that represent large parts of the detected object, two hyperparameters are involved: the number N_3 of intermediate capsules produced by each partial feature map, and the dimension D_2 of each intermediate capsule.
The hyperparameter N_3 should not be too big. Different from the capsules that are directly reshaped from the convolutional feature maps, the intermediate capsules are produced by the routing operation, so they represent the detected object better and carry less redundant information. At the same time, N_3 should not be too small: the classification capsules are calculated as a weighted sum of all the prediction capsules of the lower-level capsules, and if N_3 is too small, the effect of the intermediate capsules produced by the partial feature maps will not be obvious. We test the performance of RS-CapsNet with N_3 = 5, 8, 10, 12, and 16, and the results are shown in Table 3. The performance of RS-CapsNet with N_3 = 10 is comparable to that with N_3 = 16 and better than the others; however, RS-CapsNet with N_3 = 16 has more capsules and more parameters, so we use N_3 = 10. For the hyperparameter D_2, we test D_2 = 8 and D_2 = 12; the results are shown in Table 4. The performance of RS-CapsNet with D_2 = 8 is better than that with D_2 = 12. Although the dimension of the primary capsules is also 8, they are directly reshaped from the feature maps rather than produced by the routing operation, so the primary capsules carry much redundant information, whereas capsules produced by the routing operation represent the detected object better with little redundant information. Moreover, when D_2 is 12, the routing operation involves more training parameters. Thus, we use D_2 = 8. Fig. 7 and Fig. 8 show the effects of different N_3 and D_2 values, respectively, on the loss during RS-CapsNet training. In both figures, before the first decay of the learning rate, the loss curves are almost the same.
However, later in training, the effects of the different N_3 and D_2 values on the loss become obvious. After the first decay of the learning rate, the loss of RS-CapsNet with N_3 = 10 or D_2 = 8 decreases faster than that of the other configurations, and at convergence its loss is smaller. Fig. 9 shows the loss curves of RS-CapsNet and Baseline-CapsNet during training. Because RS-CapsNet uses convolutional layers that extract multi-scale features and the improved routing process, its initial training loss is bigger than that of Baseline-CapsNet. However, RS-CapsNet converges faster, and the number of epochs it requires to converge is smaller than that of Baseline-CapsNet. When both networks converge, the loss of RS-CapsNet is smaller than that of Baseline-CapsNet.

CIFAR10 is a relatively complex dataset: each image has not only rich features but also background and a lot of noise. Table 5 compares the test accuracy, the number of training parameters, and the training time per epoch. The performance of RS-CapsNet is better than that of all the baselines on the CIFAR10 dataset. With a single model, there is a 0.41% improvement over Sabour's CapsNet [10]; with ensemble learning, the improvement is 1.88%.
For the networks trained on CIFAR10, RS-CapsNet has only 5.01 million parameters, while Sabour's CapsNet has 14.36 million, a 65.11% reduction in the number of parameters. Moreover, RS-CapsNet achieves 89.81% on CIFAR10 with a single network, whereas Sabour's CapsNet only achieves 89.40% with an ensemble of 7 models.
The SVHN dataset is not as complex as CIFAR10, but its images are RGB images, and it can be seen as similar in flavor to MNIST (e.g., the images are small cropped digits). Table 6 shows the comparison results. We achieve 96.50% on SVHN with a single network, a 0.8% improvement over Sabour's CapsNet, and a 1.38% improvement with an ensemble of 7 models. FashionMNIST is a relatively simple dataset, because all the images are normalized single-channel images without backgrounds. Since the images have no background, we do not use the SE block on the FashionMNIST dataset. Table 7 compares the performance of RS-CapsNet with some baselines. From the results, the performance of RS-CapsNet is comparable to that of the baselines on this dataset without backgrounds. In addition to the three datasets commonly used to test Capsule Network algorithms, CIFAR10, SVHN, and FashionMNIST, we also perform experiments on the CIFAR100 dataset. Compared with CIFAR10, which has 10 classes and 60000 images, CIFAR100, with 100 classes and 60000 images, is more complex. Table 8 shows the performance comparison between RS-CapsNet and all the baselines. Compared with Baseline-CapsNet, RS-CapsNet achieves a great improvement, while having fewer parameters and needing less time per training epoch.

4) ROBUSTNESS TO AFFINE TRANSFORMATION
To test the robustness of RS-CapsNet to affine transformations, we first train RS-CapsNet on a padded MNIST dataset, in which each original 28 × 28 image is simply centered in a 40 × 40 image by adding 6 background pixels on each side. We then test the trained RS-CapsNet on the AffNIST dataset. Table 9 compares the robustness of the different networks. The experimental results show that RS-CapsNet is more robust to affine transformations than RS-CNN and all other baseline Capsule Networks. On one hand, this means that RS-CapsNet provides translation equivariance like the original CapsNet [10]; on the other hand, it shows that RS-CapsNet provides better translation equivariance than the other baseline Capsule Networks.
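The padding step described above can be sketched as follows (a minimal NumPy version; loading MNIST and AffNIST themselves is omitted):

```python
import numpy as np

def center_pad(img, out_size=40):
    """Center a 28x28 MNIST digit in an out_size x out_size canvas by
    surrounding it with background (zero) pixels: 6 on each side for 40."""
    h, w = img.shape
    canvas = np.zeros((out_size, out_size), dtype=img.dtype)
    top = (out_size - h) // 2    # 6 for a 28 -> 40 padding
    left = (out_size - w) // 2
    canvas[top:top + h, left:left + w] = img
    return canvas

digit = np.ones((28, 28), dtype=np.uint8)  # stand-in for a real MNIST digit
padded = center_pad(digit)
print(padded.shape)  # (40, 40)
```

Training on these centered 40 × 40 digits and testing on AffNIST (which applies small random affine transformations to the same digits) isolates how well the learned representation generalizes to transformations never seen during training.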

5) DISCUSSIONS
RS-CNN is a network architecture that uses the convolutional layers of RS-CapsNet, global average pooling, a classification layer, and the cross-entropy loss function. RS-CapsNet performs comparably to RS-CNN on the CIFAR10 dataset and better on the other datasets; in particular, RS-CapsNet provides better translation equivariance.

Baseline-CapsNet is similar to the original CapsNet [10], except that it has no reconstruction sub-network, so its architecture is very simple. Compared with Baseline-CapsNet, RS-CapsNet uses convolutional layers that extract multi-scale features and an improved routing process. The whole architecture is therefore somewhat more complicated than that of Baseline-CapsNet, and an epoch takes longer to train. Nevertheless, RS-CapsNet has fewer parameters, needs fewer training epochs, and achieves better classification performance.

Cv-CapsNet++, MS-CapsNet, and MLCN all use a multi-lane-like architecture in which different lanes are responsible for different dimensions of the classification capsules, so these three networks take fewer epochs to converge. However, their training time per epoch on CIFAR10 is 1.64, 2.89, and 2.40 times that of RS-CapsNet, respectively, and on FashionMNIST it is 1.66, 4.00, and 2.22 times that of RS-CapsNet, respectively. The number of parameters of RS-CapsNet is 55.26% and 64.84% lower than that of MS-CapsNet and MLCN, respectively. Thus, compared with RS-CapsNet, all three networks need more time to train for an epoch, and the last two have more parameters. In addition, HitNet and DeeperCaps have 43.64% and 13.76% more parameters than RS-CapsNet, respectively. Except for Cv-CapsNet++ on FashionMNIST, RS-CapsNet outperforms all the other Capsule Networks.
All the experimental results show that RS-CapsNet performs well. On datasets in which the images have backgrounds, our model outperforms all the baseline Capsule Networks; on datasets in which the images have no backgrounds, its performance is comparable to that of the baselines. Our model not only achieves better performance with fewer training parameters and less training time per epoch, but also provides better translation equivariance.

V. CONCLUSION
In this paper, we propose an improved Capsule Network called RS-CapsNet, which can extract multi-scale features and highlight useful features. Meanwhile, small convolutional kernels are used to reduce the number of parameters, which is 65.11% smaller than that of the original CapsNet. Moreover, a method of linear combination between capsules is employed to fuse capsule features, enhancing the ability of the capsules to represent detected objects and reducing the number of capsules. In addition, RS-CapsNet constructs the classification capsules from both the intermediate capsules, which represent large parts of the detected object, and the primary capsules, which represent small parts, so that the different types of capsules can complement each other's feature information to better construct the classification capsules. The experimental results show that RS-CapsNet achieves better performance than Sabour's CapsNet and some other baselines on the CIFAR10, CIFAR100, SVHN, FashionMNIST, and AffNIST datasets. As future work, we plan to further improve the dynamic routing algorithm of RS-CapsNet.

KOJI KOTANI (Member, IEEE) received the B.S., M.S., and Ph.D. degrees in electronic engineering from Tohoku University, Japan, in 1988, 1990, and 1993, respectively. He is currently a Professor with the Department of Intelligent Mechatronics, Akita Prefectural University. He is engaged in the research and development of high-performance devices/circuits as well as intelligent electronic systems. He is a member of the Institute of Electronics, Information and Communication Engineers of Japan.
QIU CHEN (Member, IEEE) received the Ph.D. degree in electronic engineering from Tohoku University, Japan, in 2004. Since then, he has been an Assistant Professor and an Associate Professor with Tohoku University. He is currently a Professor with Kogakuin University. His research interests include pattern recognition, computer vision, information retrieval, and their applications. He serves on the editorial boards of several journals, as well as on committees for a number of international conferences.